X-Git-Url: http://www.git.cypherpunks.ru/?p=nncp.git;a=blobdiff_plain;f=doc%2Fintegration%2Fwarc.texi;h=4c5872906e7df18cb93b1e12216f84829946170d;hp=de7cc92ade11f0092bb2d02691d27d653218f3d3;hb=203dfe36da7adf2b3089e4fa4017a67409cbad70;hpb=5d9003aa63f733df951fcab8fbd69e60f20ecc38

diff --git a/doc/integration/warc.texi b/doc/integration/warc.texi
index de7cc92..4c58729 100644
--- a/doc/integration/warc.texi
+++ b/doc/integration/warc.texi
@@ -1,4 +1,6 @@
 @node WARCs
+@cindex WARC
+@pindex wget
 @section Integration with Web pages
 
 Simple HTML web page can be downloaded very easily for sending and
@@ -55,28 +57,19 @@ There is a standard for creating
 output format.
 
 @example
-$ wget \
-    --warc-file www.example_com-$(date '+%Y%M%d%H%m%S') \
-    --no-warc-compression \
-    --no-warc-keep-log [@dots{}] \
-    http://www.example.com/
+$ wget [--page-requisites] [--recursive] \
+    --warc-file www.example.com-$(date '+%Y%M%d%H%m%S') \
+    --no-warc-keep-log --no-warc-digests \
+    [--no-warc-compression] [--warc-max-size=XXX] \
+    [@dots{}] http://www.example.com/
 @end example
 
-That command will create uncompressed @file{www.example_com-XXX.warc}
-web archive. By default, WARCs are compressed using
-@url{https://en.wikipedia.org/wiki/Gzip, gzip}, but, in example above,
-we have disabled it to compress with stronger and faster
-@url{https://en.wikipedia.org/wiki/Zstd, zstd}, before sending via
-@command{nncp-file}.
-
-There are plenty of software acting like HTTP proxy for your browser,
-allowing to view that WARC files. However you can extract files from
-that archive using @url{https://pypi.python.org/pypi/Warcat, warcat}
-utility, producing usual directory hierarchy:
-
-@example
-$ python3 -m warcat extract \
-    www.example_com-XXX.warc \
-    --output-dir www.example.com-XXX \
-    --progress
-@end example
+@pindex tofuproxy
+That command will create @file{www.example.com-XXX.warc} web archive.
+It could produce specialized segmented
+@url{https://en.wikipedia.org/wiki/Gzip, gzip} and
+@url{https://en.wikipedia.org/wiki/Zstandard, Zstandard}
+indexing/searching-friendly compressed archives. I can advise my own
+@url{http://www.tofuproxy.stargrave.org/WARCs.html, tofuproxy} software
+(also written on Go) to index, browse and extract those archives
+conveniently.