

Integration with Web pages

A simple HTML web page can be downloaded very easily for sending and viewing it offline afterwards:

$ wget http://www.example.com/page.html

But most web pages contain links to images, CSS and JavaScript files required for complete rendering. GNU Wget supports parsing such documents and understanding page dependencies. You can download the whole page together with its dependencies the following way:

$ wget \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --restrict-file-names=ascii \
    --span-hosts \
    --random-wait \
    --execute robots=off \
    http://www.example.com/page.html

That will create a www.example.com directory with all the files necessary to view the page.html web page. You can create a single compressed tarball of that directory and send it to the remote node:

$ tar cf - www.example.com | zstd |
    nncp-file - remote.node:www.example.com-page.tar.zst
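On the remote node, after the packet has been tossed and the file has landed in the incoming directory (the exact path depends on your configuration), it can be unpacked with the same standard tools, for example:

$ zstd -d < www.example.com-page.tar.zst | tar xf -

and then www.example.com/page.html can be opened in any browser.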

But there are multi-page articles, and there are whole interesting sites you may want to get in a single package. You can mirror the whole web site by utilizing wget’s recursive feature:

$ wget \
    --recursive \
    --timestamping \
    -l inf \
    --no-remove-listing \
    --no-parent [...] \
    http://www.example.com/
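The resulting mirror directory can be packed and sent exactly like the single page above; for example, assuming the same remote.node destination and an arbitrary archive name:

$ tar cf - www.example.com | zstd |
    nncp-file - remote.node:www.example.com-mirror.tar.zst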

There is a standard for creating Web ARChives: WARC. Fortunately, wget supports it as an output format too.

$ wget [--page-requisites] [--recursive] \
    --warc-file www.example.com-$(date '+%Y%m%d%H%M%S') \
    --no-warc-keep-log --no-warc-digests \
    [--no-warc-compression] [--warc-max-size=XXX] \
    [...] http://www.example.com/

That command will create a www.example.com-XXX.warc web archive. It can produce specialized segmented gzip and Zstandard compressed archives that are friendly to indexing and searching. Or you can use the even simpler crawl utility, also written in Go. I can advise my own tofuproxy software (also written in Go) to conveniently index, browse and extract those archives.
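A single WARC file is also a convenient unit to send over NNCP; a minimal sketch, assuming the archive was written without WARC compression (with compression left enabled the file name ends in .warc.gz instead):

$ nncp-file www.example.com-XXX.warc remote.node: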