HTTrack

From WebarchDocs

Archives created using wget

If you have to use wget because HTTrack can't cope with the size of the site, you will have to select the files to edit by MIME type rather than by extension, for example with this bash script:

#!/bin/bash

find . -type f -exec sh -c '
  for f; do 
    file --mime-type -b "${f}" | grep -Eq "text/html" && sed -i "s%https://www.example.org%https://archive.example.org%g" "${f}"
  done
' sh {} +
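Before running the substitution it can help to list which files file(1) actually classifies as HTML. The following is a minimal sketch rather than part of the original procedure; the function name list_html_by_mime is made up for illustration, and it only prints file names without changing anything:

```shell
#!/bin/bash

# list_html_by_mime DIR: print every file under DIR (default: current
# directory) that file(1) reports as text/html, whatever its extension.
# (Hypothetical helper name, not from the original page.)
list_html_by_mime() {
  find "${1:-.}" -type f -exec sh -c '
    for f; do
      if [ "$(file --mime-type -b "$f")" = "text/html" ]; then
        printf "%s\n" "$f"
      fi
    done
  ' sh {} +
}
```

Running list_html_by_mime . in the root of the mirror gives a preview of the files the sed command would rewrite.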

Mirroring a MKDoc site

Command line for mirroring a MKDoc site as an archive (-o0 turns off generated error pages, -K0 keeps links relative and -%F "" sets an empty footer):

httrack -o0 -K0 -%F "" http://www.example.org/

The top of each file will have Windows format carriage returns (^M at the end of lines), added by HTTrack. Convert them into UNIX format:

find . -type f -name '*.html' -print0 | xargs -0 dos2unix
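After the conversion, a quick check for leftover carriage returns can confirm that it worked. This is a sketch that assumes bash (for the $'\r' quoting) and a grep that supports -r and --include, such as GNU grep; the function name files_with_cr is hypothetical:

```shell
#!/bin/bash

# files_with_cr DIR: list the .html files under DIR (default: current
# directory) that still contain at least one carriage return.
# Prints nothing when every file is already in UNIX format.
files_with_cr() {
  grep -rl --include='*.html' $'\r' "${1:-.}"
}
```

An empty result from files_with_cr . means no .html file has Windows line endings left.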

Then edit the results to remove the meta tag that HTTrack adds to each page:

find . -name '*.html' -exec sed -i 's%<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8" /><!-- /Added by HTTrack -->%%g' {} \;

Then rewrite the hostname in the HTML files, and in the .meta.rdf and .headlines.rss files as well:

find . -name '*.html' -exec sed -i 's%www.example.org%archive.example.org%g' {} \;
find . -name '*.rdf' -exec sed -i 's%www.example.org%archive.example.org%g' {} \;
find . -name '*.rss' -exec sed -i 's%www.example.org%archive.example.org%g' {} \;
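The three commands above can also be folded into a single find pass. The following sketch assumes GNU sed (for -i without a suffix argument); the function name rewrite_host is made up for illustration, and the dots in the search pattern are escaped so that they only match literal dots:

```shell
#!/bin/bash

# rewrite_host DIR: replace www.example.org with archive.example.org in all
# .html, .rdf and .rss files under DIR (default: current directory).
# Other file types are left untouched.
rewrite_host() {
  find "${1:-.}" -type f \( -name '*.html' -o -name '*.rdf' -o -name '*.rss' \) \
    -exec sed -i 's%www\.example\.org%archive.example.org%g' {} +
}
```

Using {} + rather than {} \; passes many files to each sed invocation, which is noticeably faster on a large mirror.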

You might also need to remove the colours stylesheet link and the Content-Location meta tags:

find . -name '*.html' -exec sed -i 's%  <link rel="stylesheet" href=".static/css/colours.html" type="text/css" title="Screen style sheet" />%%g' {} \;
find . -name '*.html' -exec sed -i 's% <meta http-equiv="Content-Location" content="index.html" />%%g' {} \;
find . -name '.sitemap.html' -exec sed -i 's% <meta http-equiv="Content-Location" content=".sitemap.html" />%%g' {} \;

And then, perhaps, strip index.html from the links:

find . -name '*.html' -exec sed -i 's%index.html%%g' {} \;

And that might need to be followed with:

find . -name '.account.html' -exec sed -i 's%href=""%href="/"%g' {} \;
find . -name '.sitemap.html' -exec sed -i 's%href=""%href="/"%g' {} \;
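To see why the follow-up is needed, the two substitutions can be tried on a sample line without touching any files. This is only an illustration: the sample HTML and the function name strip_index are invented, and here both substitutions are applied to the same text, whereas the commands above apply the second one only to .account.html and .sitemap.html:

```shell
#!/bin/bash

# strip_index: apply both substitutions to text read from stdin.
# First index.html is removed from every link, which turns a link to the
# home page into href=""; the second expression repairs that to href="/".
strip_index() {
  sed -e 's%index.html%%g' -e 's%href=""%href="/"%g'
}

echo '<a href="about/index.html">About</a> <a href="index.html">Home</a>' | strip_index
# prints: <a href="about/">About</a> <a href="/">Home</a>
```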

And also switch the stats links to HTTPS:

find . -name '*.html' -exec sed -i 's%http://stats.webarch.net%https://stats.webarch.net%g' {} \;

Hiding Forms

CSS can be used to hide form elements, for example:

form input, form textarea, form p, form table { 
  display: none 
}

form:before {
  content: "The form that was here has been removed as this site has been archived" 
}

WordPress HTML Import 2

If the resulting files are to be imported into another CMS, such as WordPress using a tool such as HTML Import 2, it is best to remove all the sitemap, search, print and similar files before running the import, for example:

find ./ -name ".sitemap.html" -exec rm {} +
find ./ -name ".print.html" -exec rm {} +
find ./ -name ".account.html" -exec rm {} +
find ./ -name ".account-2.html" -exec rm {} +
find ./ -name ".headlines.rss" -exec rm {} +
find ./ -name ".meta.rdf" -exec rm {} +
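Since the removal cannot be undone, it may be worth listing the matching files first. The sketch below prints the files the commands above would delete without removing anything; the function name list_mkdoc_extras is hypothetical:

```shell
#!/bin/bash

# list_mkdoc_extras DIR: print, in sorted order, the MKDoc utility files
# under DIR (default: current directory) that are candidates for removal.
# Nothing is deleted.
list_mkdoc_extras() {
  find "${1:-.}" -type f \( -name '.sitemap.html' -o -name '.print.html' \
    -o -name '.account.html' -o -name '.account-2.html' \
    -o -name '.headlines.rss' -o -name '.meta.rdf' \) -print | LC_ALL=C sort
}
```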

You can timestamp the files based on the metadata using this script on each file:

#!/bin/bash

# Get the first line containing, for example:
#   <meta name="DCTERMS.created" content="2014-09-13T11:40:29Z" scheme="W3CDTF" />
RAW_HTML=$(grep -m 1 "DCTERMS.created" "${1}")

# Munge it into the right format for touch: -t [[CC]YY]MMDDHHMM[.SS]
DATE=$(echo "${RAW_HTML}" | sed -E 's/.*content="([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):([0-9]{2})Z".*/\1\2\3\4\5.\6/')

# The trailing Z means UTC, so have touch read the time as UTC
TZ=UTC touch -t "${DATE}" "${1}"
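To apply this to a whole mirror rather than one file at a time, the same idea can be wrapped in a loop. This is a sketch under a few assumptions: the function names touch_stamp and stamp_tree are invented, sed supports -E, the date is always in the YYYY-MM-DDThh:mm:ssZ form shown in the script above, and the trailing Z is treated as UTC. Files without a DCTERMS.created tag are skipped:

```shell
#!/bin/bash

# touch_stamp FILE: print the DCTERMS.created date of FILE in the format
# touch -t expects ([[CC]YY]MMDDHHMM[.SS]). Prints nothing if no tag matches.
touch_stamp() {
  sed -n -E 's/.*name="DCTERMS\.created" content="([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}):([0-9]{2}):([0-9]{2})Z".*/\1\2\3\4\5.\6/p' "$1" | head -n 1
}

# stamp_tree DIR: set the modification time of every .html file under DIR
# (default: current directory) that carries the tag.
stamp_tree() {
  find "${1:-.}" -type f -name '*.html' -print0 |
    while IFS= read -r -d '' f; do
      stamp=$(touch_stamp "$f")
      if [ -n "$stamp" ]; then
        TZ=UTC touch -t "$stamp" "$f"
      fi
    done
}
```

Calling stamp_tree . from the root of the mirror updates all pages in one go, and the NUL-delimited loop copes with file names that contain spaces.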

Key HTML Import 2 settings to check:

  • Default file: index.html
  • File extensions to include: html
  • Select content by: div class="mkdoc-content"
  • Select title by: <a id="page_content">
  • Set timestamps to: Last time the file was modified