doc/integration/warc.texi

@node WARCs
@cindex WARC
@pindex wget
@section Integration with Web pages

A simple HTML web page can be downloaded very easily, to be sent and
viewed offline later:

@example
$ wget http://www.example.com/page.html
@end example

But most web pages contain links to images, CSS and JavaScript files
required for complete rendering.
@url{https://www.gnu.org/software/wget/, GNU Wget} can parse such
documents and understand page dependencies. You can download
a whole page with its dependencies the following way:

@example
$ wget \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --restrict-file-names=ascii \
    --span-hosts \
    --random-wait \
    --execute robots=off \
    http://www.example.com/page.html
@end example

That will create the @file{www.example.com} directory with all files
necessary to view the @file{page.html} web page. You can pack that
directory into a single compressed tarball and send it to a remote node:

@example
$ tar cf - www.example.com | zstd |
    nncp-file - remote.node:www.example.com-page.tar.zst
@end example
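
On the receiving side the steps are simply reversed. The following is
only a sketch: the function name @code{unpack_page} is hypothetical, and
it assumes that the tarball has already been delivered and that
@command{zstd} and @command{tar} are installed there:

@example
# Decompress the Zstandard stream and unpack the tarball into
# the current directory.
unpack_page() (
    zstd -d < "$1" | tar xf -
)
@end example

Calling @code{unpack_page www.example.com-page.tar.zst} should recreate
the @file{www.example.com} directory, ready for offline viewing.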

But there are also multi-page articles, and whole interesting sites
that you may want to get in a single package. You can mirror a whole web
site by utilizing @command{wget}'s recursive feature:

@example
$ wget \
    --recursive \
    --timestamping \
    -l inf \
    --no-remove-listing \
    --no-parent [@dots{}] \
    http://www.example.com/
@end example

There is a standard for creating
@url{https://en.wikipedia.org/wiki/Web_ARChive, Web ARChives}:
@strong{WARC}. Fortunately again, @command{wget} supports it as an
output format:

@example
$ wget [--page-requisites] [--recursive] \
    --warc-file www.example.com-$(date '+%Y%m%d%H%M%S') \
    --no-warc-keep-log --no-warc-digests \
    [--no-warc-compression] [--warc-max-size=XXX] \
    [@dots{}] http://www.example.com/
@end example
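
The timestamp in the archive name keeps repeated snapshots of the same
site apart and sortable. As a small sketch (the helper name
@code{warc_prefix} is made up for illustration), note that in
@command{date} formats @code{%m} is the month and @code{%M} is the
minute, so case matters:

@example
# Print a WARC file name prefix: the site name followed by a
# year, month, day, hour, minute, second timestamp.
warc_prefix() (
    printf '%s-%s\n' "$1" "$(date '+%Y%m%d%H%M%S')"
)

warc_prefix www.example.com
@end example

The printed value can be passed to @option{--warc-file} through command
substitution.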

@pindex tofuproxy
That command will create the @file{www.example.com-XXX.warc} web archive.
It can also produce specialized segmented
@url{https://en.wikipedia.org/wiki/Gzip, gzip} and
@url{https://en.wikipedia.org/wiki/Zstandard, Zstandard}
indexing/searching-friendly compressed archives. I can recommend my own
@url{http://www.tofuproxy.stargrave.org/WARCs.html, tofuproxy} software
(also written in Go) to index, browse and extract those archives
conveniently.
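
For a quick look without any extra software, recall that a WARC file is
a sequence of records with plain-text headers, and that request and
response records name their URL in a @code{WARC-Target-URI} header. The
following is a rough sketch, not a replacement for a real indexer: the
function name @code{warc_urls} is hypothetical, and it assumes the
archive was written with @command{wget}'s default gzip compression:

@example
# List the distinct target URIs recorded in a gzip-compressed WARC.
# -a makes grep treat the partly binary archive as text.
warc_urls() (
    gzip -dc "$1" | grep -a '^WARC-Target-URI:' | sort -u
)
@end example

For an archive created with @option{--no-warc-compression}, the same
@command{grep} can be run on the file directly.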