@node WARCs
@section Integration with Web pages

A simple HTML web page can easily be downloaded for viewing it offline
or sending it to someone else:

@example
$ wget http://www.example.com/page.html
@end example

But most web pages also contain links to images, CSS and JavaScript
files that are required for complete rendering.
@url{https://www.gnu.org/software/wget/, GNU Wget} can parse such
documents and understand page dependencies. You can download the whole
page together with its dependencies the following way:

@example
$ wget \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --restrict-file-names=ascii \
    --span-hosts \
    --random-wait \
    --execute robots=off \
    http://www.example.com/page.html
@end example

This will create a @file{www.example.com} directory with all the files
necessary to view @file{page.html}. You can pack that directory into a
single compressed tarball and send it to the remote node:

@example
$ tar cf - www.example.com | zstd | nncp-file - remote.node:www.example.com-page.tar.zst
@end example

But there are multi-page articles and whole sites that are worth
getting in a single package. You can mirror an entire web site by
using @command{wget}'s recursive mode:

@example
$ wget \
    --recursive \
    --timestamping \
    -l inf \
    --no-remove-listing \
    --no-parent \
    [...]
    http://www.example.com/
@end example

There is a standard for creating
@url{https://en.wikipedia.org/wiki/Web_ARChive, Web ARChives}:
@strong{WARC}. Fortunately, @command{wget} supports it as an output
format:

@example
$ wget \
    --warc-file www.example_com-$(date '+%Y%m%d%H%M%S') \
    --no-warc-compression \
    --no-warc-keep-log \
    [...]
    http://www.example.com/
@end example

That command will create an uncompressed
@file{www.example_com-XXX.warc} web archive.
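A note on the timestamp embedded in the WARC file name above: with
@command{date}, @code{%m} is the month while @code{%M} is the minutes
(and likewise @code{%H} is hours), so mixing them up silently produces
misleading file names. A minimal sketch of the intended
@code{YYYYmmddHHMMSS} format, assuming only a POSIX @command{date}:

```shell
# Produce a 14-digit YYYYmmddHHMMSS timestamp for use in a WARC file
# name: year, month, day, hour, minute, second, in that order.
ts=$(date '+%Y%m%d%H%M%S')
echo "$ts"
```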
By default, WARCs are compressed with
@url{https://en.wikipedia.org/wiki/Gzip, gzip}, but in the example
above we disabled that in order to recompress the archive with the
stronger and faster @url{https://en.wikipedia.org/wiki/Zstd, zstd}
before sending it via @command{nncp-file}.

There is plenty of software that acts as an HTTP proxy for your
browser, allowing you to view such WARC files directly. Alternatively,
you can extract the files from the archive into an ordinary directory
hierarchy with the @url{https://pypi.python.org/pypi/Warcat, warcat}
utility:

@example
$ python3 -m warcat extract \
    www.example_com-XXX.warc \
    --output-dir www.example.com-XXX \
    --progress
@end example
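On the receiving node the compression step is simply reversed before
tools like @command{warcat} can be pointed at the archive. A minimal
sketch of that round trip, assuming @command{zstd} is installed; the
file name is illustrative, standing in for a real WARC:

```shell
# Round-trip sketch: compress as on the sending side, then decompress
# on the receiving side. "page.warc" stands in for a real WARC file.
printf 'hello\n' > page.warc     # stand-in payload
zstd -q --rm page.warc           # produces page.warc.zst, removes source
zstd -qd --rm page.warc.zst      # restores page.warc, removes .zst
cat page.warc                    # original content is back
```

The @option{--rm} flag deletes the input file only after successful
(de)compression, which keeps disk usage down for large archives.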