HTTrack
Archives created using wget
If you have to use wget because HTTrack can't cope with the size of the site, you will have to edit files based on MIME type rather than extension, for example with this bash script:
#!/bin/bash
find . -type f -exec sh -c '
  for f; do
    file --mime-type -b "${f}" | grep -Eq "text/html" &&
      sed -i "s%https://www.example.org%https://archive.example.org%g" "${f}"
  done
' sh {} +
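After a bulk rewrite like the one above, it can be worth checking whether any file still refers to the old address. The following is a minimal sketch of such a check; the function name `list_remaining_refs` is hypothetical, and it assumes a grep that supports `-r` (GNU and BSD grep both do).

```shell
#!/bin/bash
# list_remaining_refs DIR URL  (hypothetical helper name)
# Prints the path of every file under DIR that still contains URL.
list_remaining_refs() {
    # -r: recurse into DIR, -l: print file names only,
    # -F: treat URL as a literal string rather than a regular expression
    grep -rlF -- "$2" "$1"
}

# Example call:
#   list_remaining_refs . 'https://www.example.org'
```

An empty result (and a non-zero exit status from grep) suggests that no references to the old address remain.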
Mirroring a MKDoc site
Command line for mirroring a MKDoc site as an archive:
httrack -o0 -K0 -%F "" http://www.example.org/
HTTrack adds Windows-format carriage returns (^M at the end of lines) to the top of each file; convert them to UNIX format:
find . -type f -name '*.html' -print0 | xargs -0 dos2unix
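Where dos2unix is not installed, the same conversion can be approximated with sed. This is a sketch only: the function name `strip_cr` is hypothetical, and it assumes GNU sed, both for `-i` and for the `\r` escape.

```shell
#!/bin/bash
# strip_cr DIR  (hypothetical helper name)
# Removes a trailing carriage return from every line of each
# *.html file under DIR. Assumes GNU sed.
strip_cr() {
    find "$1" -type f -name '*.html' -exec sed -i 's/\r$//' {} +
}

# Example call:
#   strip_cr .
```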
Then edit the results, as suggested here:
find . -name '*.html' -exec sed -i 's%<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8" /><!-- /Added by HTTrack -->%%g' {} \;
Then rewrite the hostname in the HTML files, and in the .meta.rdf and .headlines.rss files as well:
find . -name '*.html' -exec sed -i 's%www.example.org%archive.example.org%g' {} \;
find . -name '*.rdf' -exec sed -i 's%www.example.org%archive.example.org%g' {} \;
find . -name '*.rss' -exec sed -i 's%www.example.org%archive.example.org%g' {} \;
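The three hostname rewrites can also be folded into a single pass over the tree. The sketch below assumes GNU find and GNU sed; the function name `rewrite_host` is hypothetical, and the dots in the old hostname are escaped so that they only match literal dots.

```shell
#!/bin/bash
# rewrite_host DIR  (hypothetical helper name)
# Replaces www.example.org with archive.example.org in every
# *.html, *.rdf and *.rss file under DIR, in one find invocation.
rewrite_host() {
    find "$1" -type f \( -name '*.html' -o -name '*.rdf' -o -name '*.rss' \) \
        -exec sed -i 's%www\.example\.org%archive.example.org%g' {} +
}

# Example call:
#   rewrite_host .
```

Using `{} +` instead of `{} \;` passes many files to each sed process, which tends to be faster on large archives.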
And you might need:
find . -name '*.html' -exec sed -i 's% <link rel="stylesheet" href=".static/css/colours.html" type="text/css" title="Screen style sheet" />%%g' {} \;
find . -name '*.html' -exec sed -i 's% <meta http-equiv="Content-Location" content="index.html" />%%g' {} \;
find . -name '.sitemap.html' -exec sed -i 's% <meta http-equiv="Content-Location" content=".sitemap.html" />%%g' {} \;
And then, perhaps, as suggested here:
find . -name '*.html' -exec sed -i 's%index.html%%g' {} \;
And that might need to be followed with:
find . -name '.account.html' -exec sed -i 's%href=""%href="/"%g' {} \;
find . -name '.sitemap.html' -exec sed -i 's%href=""%href="/"%g' {} \;
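The blanket `s%index.html%%g` substitution earlier in this sequence removes every occurrence of the string, including ones in visible text or in longer file names. A narrower variant, sketched here under the assumption of GNU sed and with a hypothetical function name, only touches `href` values that end in `index.html`:

```shell
#!/bin/bash
# strip_index_in_hrefs DIR  (hypothetical helper name)
# Rewrites href="index.html" to href="" and href="dir/index.html"
# to href="dir/" in every *.html file under DIR. Other occurrences
# of the string index.html are left untouched.
strip_index_in_hrefs() {
    find "$1" -type f -name '*.html' \
        -exec sed -i 's%href="\([^"]*/\)\{0,1\}index\.html"%href="\1"%g' {} +
}

# Example call:
#   strip_index_in_hrefs .
```

Links in the same directory still end up as `href=""`, so the `href="/"` fix-up may be needed afterwards as well.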
And also:
find . -name '*.html' -exec sed -i 's%http://stats.webarch.net%https://stats.webarch.net%g' {} \;
Hiding Forms
CSS can be used to hide form elements, for example:
form input, form textarea, form p, form table {
  display: none
}
form:before {
  content: "The form that was here has been removed as this site has been archived"
}
WordPress HTML Import 2
If the resulting files are to be imported into another CMS, such as WordPress using a tool such as HTML Import 2, it is best to remove all sitemap, search, print and similar .html files before running the import, for example:
find ./ -name ".sitemap.html" | xargs rm
find ./ -name ".print.html" | xargs rm
find ./ -name ".account.html" | xargs rm
find ./ -name ".account-2.html" | xargs rm
find ./ -name ".headlines.rss" | xargs rm
find ./ -name ".meta.rdf" | xargs rm
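The same clean-up can be done in one traversal, without xargs. This sketch assumes a find that supports `-delete` (GNU and BSD find both do); the function name `remove_mkdoc_extras` is hypothetical.

```shell
#!/bin/bash
# remove_mkdoc_extras DIR  (hypothetical helper name)
# Deletes the sitemap, print, account, headlines and metadata
# files under DIR in a single find invocation.
remove_mkdoc_extras() {
    find "$1" -type f \( -name '.sitemap.html' -o -name '.print.html' \
        -o -name '.account.html' -o -name '.account-2.html' \
        -o -name '.headlines.rss' -o -name '.meta.rdf' \) -delete
}

# Example call:
#   remove_mkdoc_extras ./
```

Replacing `-delete` with `-print` first gives a dry run that lists what would be removed.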
You can timestamp the files based on the metadata using this script on each file:
#!/bin/bash
# Get the line containing, for example:
#   <meta name="DCTERMS.created" content="2014-09-13T11:40:29Z" scheme="W3CDTF" />
RAW_HTML=$(grep "DCTERMS.created" "${1}")
# Munge it into the right format for touch: -t [[CC]YY]MMDDhhmm[.SS]
# (RAW_HTML is left unquoted on purpose so that echo drops the leading whitespace)
DATE=$(echo ${RAW_HTML} | sed 's/<meta name="DCTERMS.created" content="//' | sed 's/" scheme="W3CDTF" \/>//' | sed 's/-//g' | sed 's/T//' | sed 's/://' | sed 's/:/./' | sed 's/Z$//')
touch -t "${DATE}" "${1}"
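One thing to be aware of with `touch -t` is that it interprets the value in the local time zone, whereas the trailing `Z` in the metadata means UTC. As an alternative sketch, GNU touch can be handed the W3CDTF string directly through `-d`. This assumes GNU coreutils; the function name `stamp_from_metadata` is hypothetical.

```shell
#!/bin/bash
# stamp_from_metadata FILE  (hypothetical helper name)
# Sets the modification time of FILE from its DCTERMS.created meta element.
stamp_from_metadata() {
    # Extract the content="..." value of the first DCTERMS.created element.
    created=$(sed -n 's/.*<meta name="DCTERMS\.created" content="\([^"]*\)".*/\1/p' "$1" | head -n 1)
    # Leave the file alone when no creation date is found.
    [ -n "$created" ] || return 0
    # GNU touch -d accepts the ISO 8601 string, including the trailing Z for UTC.
    touch -d "$created" "$1"
}

# Example call on a single file:
#   stamp_from_metadata index.html
```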
WordPress HTML Import 2 key settings to check:
- Default file: index.html
- File extensions to include: html
- Select content by: div class="mkdoc-content"
- Select title by: <a id="page_content">
- Set timestamps to: Last time the file was modified