@section Integration with Web pages

A simple HTML web page can be downloaded very easily, to be sent and
viewed offline later:

@example
$ wget http://www.example.com/page.html
@end example

But most web pages contain links to images, CSS and JavaScript files
that are required for complete rendering.
@url{https://www.gnu.org/software/wget/, GNU Wget} can parse such
documents and understand page dependencies. You can download the whole
page with its dependencies the following way:

@example
$ wget \
    --page-requisites \
    --convert-links \
    --restrict-file-names=ascii \
    --execute robots=off \
    http://www.example.com/page.html
@end example

That will create a @file{www.example.com} directory with all the files
necessary to view the @file{page.html} web page. You can pack that
directory into a single compressed tarball and send it to a remote node:

@example
$ tar cf - www.example.com | zstd |
    nncp-file - remote.node:www.example.com-page.tar.zst
@end example

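The packing and sending step above can be wrapped in a small script.
The following is only a minimal sketch in POSIX @command{sh}: the
function names @code{url_host}, @code{page_dest} and @code{send_page}
are hypothetical, and it assumes a URL without user information or an
explicit port, so that the host part matches the directory name created
by @command{wget}.

```shell
#!/bin/sh
# Hypothetical helper: print the host part of a URL, assumed here to be
# the directory name wget creates (e.g. www.example.com).
url_host() (
    printf '%s\n' "$1" |
        sed -e 's|^[A-Za-z][A-Za-z0-9+.-]*://||' -e 's|[/?#].*||'
)

# Hypothetical helper: build the nncp-file destination for a node and URL.
page_dest() (
    printf '%s:%s-page.tar.zst\n' "$1" "$(url_host "$2")"
)

# Pack the directory created by wget and hand it to nncp-file.
# Arguments: NNCP node name, URL that was downloaded.
# Must be run from the directory where wget was invoked.
send_page() (
    dir=$(url_host "$2")
    tar cf - "$dir" | zstd | nncp-file - "$(page_dest "$1" "$2")"
)
```

For example, @code{send_page remote.node http://www.example.com/page.html}
would run the same pipeline as shown above.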
But there are also multi-page articles and whole interesting sites that
you may want to get in a single package. You can mirror a whole web
site using @command{wget}'s recursive feature:

@example
$ wget \
    --recursive \
    --no-parent [@dots{}] \
    http://www.example.com/
@end example

There is a standard format for creating
@url{https://en.wikipedia.org/wiki/Web_ARChive, Web ARChives}:
@strong{WARC}. Fortunately, @command{wget} supports it as an output
format:

@example
$ wget \
    --warc-file www.example_com-$(date '+%Y%m%d%H%M%S') \
    --no-warc-compression \
    --no-warc-keep-log [@dots{}] \
    http://www.example.com/
@end example

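In @command{date} format strings @code{%m} is the month and @code{%M}
is the minute, so a sortable timestamp is written as
@code{%Y%m%d%H%M%S}. The following minimal sketch shows one way to
build the archive name and to queue the result for sending. The names
@code{warc_name} and @code{send_warc} are hypothetical, and the
recursion options you need still have to be added to the
@command{wget} call.

```shell
#!/bin/sh
# Hypothetical helper: append a sortable local timestamp to a base name,
# e.g. "www.example_com" becomes "www.example_com-YYYYmmddHHMMSS".
warc_name() (
    printf '%s-%s\n' "$1" "$(date '+%Y%m%d%H%M%S')"
)

# Hypothetical wrapper: fetch a URL into an uncompressed WARC, then
# compress it with zstd and pass it to nncp-file.
# Arguments: NNCP node name, WARC base name, URL.
send_warc() (
    name=$(warc_name "$2")
    wget --warc-file "$name" --no-warc-compression --no-warc-keep-log "$3"
    # wget may exit non-zero when only some links failed, so check for
    # a non-empty archive instead of relying on the exit status.
    [ -s "$name.warc" ] &&
        zstd < "$name.warc" | nncp-file - "$1:$name.warc.zst"
)
```

For example, @code{send_warc remote.node www.example_com http://www.example.com/}.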
That command will create an uncompressed @file{www.example_com-XXX.warc}
web archive. By default, WARCs are compressed using
@url{https://en.wikipedia.org/wiki/Gzip, gzip}, but in the example above
we have disabled that, in order to compress with the stronger and faster
@url{https://en.wikipedia.org/wiki/Zstd, zstd} before sending via
@command{nncp-file}.

There is plenty of software that acts as an HTTP proxy for your browser,
allowing you to view such WARC files. Alternatively, you can extract the
files from the archive using the
@url{https://pypi.python.org/pypi/Warcat, warcat} utility, producing a
usual directory hierarchy:

@example
$ python3 -m warcat extract \
    www.example_com-XXX.warc \
    --output-dir www.example.com-XXX
@end example
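
When several archives have been received, the extraction can be
repeated for each of them. The following is a minimal sketch, assuming
the archives are already decompressed @file{.warc} files. The function
names @code{warc_outdir} and @code{extract_warcs} are hypothetical, and
the output directory simply reuses the archive's base name unchanged.

```shell
#!/bin/sh
# Hypothetical helper: name the output directory after the archive,
# dropping the directory part and the .warc suffix.
warc_outdir() (
    basename "$1" .warc
)

# Extract every WARC given as an argument into its own directory,
# stopping at the first failure.
extract_warcs() (
    for warc in "$@"; do
        python3 -m warcat extract "$warc" \
            --output-dir "$(warc_outdir "$warc")" || exit 1
    done
)
```

For example, @code{extract_warcs *.warc} in the directory holding the
received archives.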