@node WARCs
@section Integration with Web pages

A simple HTML web page can easily be downloaded for viewing it offline
or sending it to someone else:

@example
$ wget http://www.example.com/page.html
@end example

But most web pages also contain links to images, CSS and JavaScript
files that are required for complete rendering.
@url{https://www.gnu.org/software/wget/, GNU Wget} can parse such
documents and understand page dependencies. You can download the whole
page together with its dependencies the following way:

@example
$ wget \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --restrict-file-names=ascii \
    --span-hosts \
    --random-wait \
    --execute robots=off \
    http://www.example.com/page.html
@end example

This will create a @file{www.example.com} directory with all the files
necessary to view @file{page.html}. You can pack that directory into a
single compressed tarball and send it to the remote node:

@example
$ tar cf - www.example.com | zstd | nncp-file - remote.node:www.example.com-page.tar.zst
@end example

But there are multi-page articles and whole sites that are worth
getting in a single package. You can mirror an entire web site by
using @command{wget}'s recursive mode:

@example
$ wget \
    --recursive \
    --timestamping \
    -l inf \
    --no-remove-listing \
    --no-parent \
    [...]
    http://www.example.com/
@end example

There is a standard for creating
@url{https://en.wikipedia.org/wiki/Web_ARChive, Web ARChives}:
@strong{WARC}. Fortunately, @command{wget} supports it as an output
format:

@example
$ wget \
    --warc-file www.example_com-$(date '+%Y%m%d%H%M%S') \
    --no-warc-compression \
    --no-warc-keep-log \
    [...]
    http://www.example.com/
@end example

That command will create an uncompressed
@file{www.example_com-XXX.warc} web archive.
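A note on the timestamp embedded in the WARC file name above: with
@command{date}, @code{%m} is the month while @code{%M} is the minutes
(and likewise @code{%H} is hours), so mixing them up silently produces
misleading file names. A minimal sketch of the intended
@code{YYYYmmddHHMMSS} format, assuming only a POSIX @command{date}:

```shell
# Produce a 14-digit YYYYmmddHHMMSS timestamp for use in a WARC file
# name: year, month, day, hour, minute, second, in that order.
ts=$(date '+%Y%m%d%H%M%S')
echo "$ts"
```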
By default, WARCs are compressed with
@url{https://en.wikipedia.org/wiki/Gzip, gzip}, but in the example
above we disabled that in order to recompress the archive with the
stronger and faster @url{https://en.wikipedia.org/wiki/Zstd, zstd}
before sending it via @command{nncp-file}.

There is plenty of software that acts as an HTTP proxy for your
browser, allowing you to view such WARC files directly. Alternatively,
you can extract the files from the archive into an ordinary directory
hierarchy with the @url{https://pypi.python.org/pypi/Warcat, warcat}
utility:

@example
$ python3 -m warcat extract \
    www.example_com-XXX.warc \
    --output-dir www.example.com-XXX \
    --progress
@end example
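On the receiving node the compression step is simply reversed before
tools like @command{warcat} can be pointed at the archive. A minimal
sketch of that round trip, assuming @command{zstd} is installed; the
file name is illustrative, standing in for a real WARC:

```shell
# Round-trip sketch: compress as on the sending side, then decompress
# on the receiving side. "page.warc" stands in for a real WARC file.
printf 'hello\n' > page.warc     # stand-in payload
zstd -q --rm page.warc           # produces page.warc.zst, removes source
zstd -qd --rm page.warc.zst      # restores page.warc, removes .zst
cat page.warc                    # original content is back
```

The @option{--rm} flag deletes the input file only after successful
(de)compression, which keeps disk usage down for large archives.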