

Integration with Web pages

A simple HTML web page can be downloaded very easily, to be sent and viewed offline later:

$ wget http://www.example.com/page.html

But most web pages contain links to images, CSS and JavaScript files that are required for complete rendering. GNU Wget can parse such documents and work out the page's dependencies. You can download the whole page with its dependencies as follows:

$ wget \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --restrict-file-names=ascii \
    --span-hosts \
    --random-wait \
    --execute robots=off \
    http://www.example.com/page.html

That will create a www.example.com directory with all the files necessary to view the page.html web page. You can pack that directory into a single compressed tarball and send it to a remote node:

$ tar cf - www.example.com | zstd |
    nncp-file - remote.node:www.example.com-page.tar.zst
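On the receiving side, the tarball can be unpacked by reversing the pipeline above. This is only a minimal sketch: it assumes that zstd and tar are installed on the remote node, and the unpack_page function name is hypothetical, not part of NNCP.

```shell
# Hypothetical helper: unpack a zstd-compressed tarball into the
# current directory. The first argument is the path of the received file.
unpack_page() {
    # zstd -d -c decompresses to stdout; tar xf - extracts from stdin.
    zstd -d -c "$1" | tar xf -
}

# Usage: unpack_page www.example.com-page.tar.zst
```

After unpacking, the www.example.com directory should appear again with page.html inside it.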

But there are also multi-page articles, and whole interesting sites that you may want to get in a single package. You can mirror a whole web site using wget’s recursive feature:

$ wget \
    --recursive \
    --timestamping \
    -l inf \
    --no-remove-listing \
    --no-parent […] \
    http://www.example.com/

There is a standard for creating Web ARChives: WARC. Fortunately, wget supports it as an output format too:

$ wget \
    --warc-file www.example_com-$(date '+%Y%m%d%H%M%S') \
    --no-warc-compression \
    --no-warc-keep-log […] \
    http://www.example.com/

That command will create an uncompressed www.example_com-XXX.warc web archive. By default, WARCs are compressed with gzip, but in the example above we have disabled that in order to compress the archive with the stronger and faster zstd before sending it via nncp-file.
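The naming and compression steps can be sketched as follows. This is an illustration under stated assumptions, not a fixed recipe: the send_warc helper name is hypothetical, remote.node stands for whatever node you send to, and the timestamp is assumed to be wanted in year, month, day, hour, minute, second order.

```shell
# Timestamp in year-month-day-hour-minute-second order.
# In date(1) formats, %m is the month and %M is the minute.
stamp=$(date '+%Y%m%d%H%M%S')
warc="www.example_com-$stamp.warc"

# Hypothetical helper: compress a WARC with zstd and queue it for a node.
# $1 is the WARC file name, $2 is the destination node.
send_warc() {
    zstd < "$1" | nncp-file - "$2:$1.zst"
}

echo "$warc"
# Usage, once wget has written the archive: send_warc "$warc" remote.node
```

Keeping the name in a variable means the same timestamp is used for both the WARC file and the packet sent to the node.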

There is plenty of software that acts as an HTTP proxy for your browser, allowing you to view those WARC files. Alternatively, you can extract the files from the archive using the warcat utility, producing the usual directory hierarchy:

$ python3 -m warcat extract \
    www.example_com-XXX.warc \
    --output-dir www.example.com-XXX \
    --progress
