X-Git-Url: http://www.git.cypherpunks.ru/?a=blobdiff_plain;f=doc%2Fintegration%2Fwarc.texi;h=f449ed5556834fa534f11c30a2712bcd2e15453f;hb=d708b1fb5ba9fef9ba5c6add645a0c74a2c2b27b;hp=6e62f5c217d2db92674825d2978447b80175debf;hpb=9d09491dd928ed16e357795d8818dbc9153a1d49;p=nncp.git

diff --git a/doc/integration/warc.texi b/doc/integration/warc.texi
index 6e62f5c..f449ed5 100644
--- a/doc/integration/warc.texi
+++ b/doc/integration/warc.texi
@@ -1,4 +1,6 @@
 @node WARCs
+@cindex WARC
+@pindex wget
 @section Integration with Web pages
 
 Simple HTML web page can be downloaded very easily for sending and
@@ -45,8 +47,7 @@ $ wget \
     --timestamping \
     -l inf \
     --no-remove-listing \
-    --no-parent \
-    [...]
+    --no-parent [@dots{}] \
     http://www.example.com/
 @end example
 
@@ -56,29 +57,23 @@ There is a standard for creating
 output format.
 
 @example
-$ wget \
-    --warc-file www.example_com-$(date '+%Y%M%d%H%m%S') \
-    --no-warc-compression \
-    --no-warc-keep-log \
-    [...]
-    http://www.example.com/
+$ wget [--page-requisites] [--recursive] \
+    --warc-file www.example.com-$(date '+%Y%M%d%H%m%S') \
+    --no-warc-keep-log --no-warc-digests \
+    [--no-warc-compression] [--warc-max-size=XXX] \
+    [@dots{}] http://www.example.com/
 @end example
 
-That command will create uncompressed @file{www.example_com-XXX.warc}
-web archive. By default, WARCs are compressed using
-@url{https://en.wikipedia.org/wiki/Gzip, gzip}, but, in example above,
-we have disabled it to compress with stronger and faster
-@url{https://en.wikipedia.org/wiki/Zstd, zstd}, before sending via
-@command{nncp-file}.
-
-There are plenty of software acting like HTTP proxy for your browser,
-allowing to view that WARC files. However you can extract files from
-that archive using @url{https://pypi.python.org/pypi/Warcat, warcat}
-utility, producing usual directory hierarchy:
+@pindex crawl
+Or even more simpler @url{https://git.jordan.im/crawl/tree/README.md, crawl}
+utility written on Go too.
 
-@example
-$ python3 -m warcat extract \
-    www.example_com-XXX.warc \
-    --output-dir www.example.com-XXX \
-    --progress
-@end example
+@pindex tofuproxy
+That command will create @file{www.example.com-XXX.warc} web archive.
+It could produce specialized segmented
+@url{https://en.wikipedia.org/wiki/Gzip, gzip} and
+@url{https://en.wikipedia.org/wiki/Zstandard, Zstandard}
+indexing/searching-friendly compressed archives. I can advise my own
+@url{http://www.tofuproxy.stargrave.org/WARCs.html, tofuproxy} software
+(also written on Go) to index, browse and extract those archives
+conveniently.