@section Integration with Web pages
A simple HTML web page can be downloaded very easily for sending and
viewing it offline afterwards:
$ wget http://www.example.com/page.html
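
If the goal is viewing on a remote node, that single file can be queued
for transfer right away; a minimal sketch with @command{nncp-file},
assuming the same @file{remote.node} neighbour as in the examples below:

@example
$ nncp-file page.html remote.node:page.html
@end example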
But most web pages contain links to images, CSS and JavaScript files
required for complete rendering.
@url{https://www.gnu.org/software/wget/, GNU Wget} supports parsing
those documents and understanding page dependencies. You can download
the whole page together with its dependencies the following way:
    --restrict-file-names=ascii \
    --execute robots=off \
    http://www.example.com/page.html
That will create a @file{www.example.com} directory with all files
necessary to view the @file{page.html} web page. You can create a
single compressed tarball of that directory and send it to the remote node:
$ tar cf - www.example.com | zstd |
    nncp-file - remote.node:www.example.com-page.tar.zst
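
On the receiving side, after the packet has been tossed, the tarball
can be unpacked the usual way (the path is just an example):

@example
$ zstd -d < www.example.com-page.tar.zst | tar xf -
@end example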
But there are multi-page articles and whole sites interesting enough to
get in a single package. You can mirror an entire web site by utilizing
@command{wget}'s recursive features:
    --no-parent [@dots{}] \
    http://www.example.com/
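
The resulting mirror can be packed and sent exactly like the single
page above, for example:

@example
$ tar cf - www.example.com | zstd |
    nncp-file - remote.node:www.example.com-mirror.tar.zst
@end example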
There is a standard for creating
@url{https://en.wikipedia.org/wiki/Web_ARChive, Web ARChives}:
@strong{WARC}. Fortunately again, @command{wget} supports it as an
output format.
$ wget [--page-requisites] [--recursive] \
    --warc-file www.example.com-$(date '+%Y%m%d%H%M%S') \
    --no-warc-keep-log --no-warc-digests \
    [--no-warc-compression] [--warc-max-size=XXX] \
    [@dots{}] http://www.example.com/
That command will create a @file{www.example.com-XXX.warc} web archive.
It can also produce segmented
@url{https://en.wikipedia.org/wiki/Gzip, gzip}- or
@url{https://en.wikipedia.org/wiki/Zstandard, Zstandard}-compressed
archives that are friendly to indexing and searching. I can recommend
my own @url{http://www.tofuproxy.stargrave.org/WARCs.html, tofuproxy}
software (also written in Go) to index, browse and extract those archives.
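
A WARC file is already a single self-contained artifact, so it can be
sent as-is, for example (the actual file name will contain the
timestamp instead of @file{XXX}):

@example
$ nncp-file www.example.com-XXX.warc remote.node:www.example.com-XXX.warc
@end example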