A simple HTML web page can be downloaded very easily, to be sent and viewed offline afterwards:
$ wget http://www.example.com/page.html
$ wget \
    --page-requisites \
    --convert-links \
    --adjust-extension \
    --restrict-file-names=ascii \
    --span-hosts \
    --random-wait \
    --execute robots=off \
    http://www.example.com/page.html
That will create a www.example.com directory with all the files necessary to view the page.html web page. You can create a single-file compressed tarball of that directory and send it to the remote node:
$ tar cf - www.example.com | zstd | nncp-file - remote.node:www.example.com-page.tar.zst
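On the receiving side the tarball can be unpacked with the same pair of tools; a minimal sketch, assuming the file arrived under the name it was sent with:

```shell
# Decompress the Zstandard stream and unpack the tarball
# (the filename matches the one given to nncp-file above).
zstd -dc www.example.com-page.tar.zst | tar xf -

# The page can then be opened locally in any browser, e.g.:
# firefox www.example.com/page.html
```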
But there are multi-paged articles, and there are whole interesting sites you want to get in a single package. You can mirror the whole web site by utilizing wget's recursive feature:
$ wget \
    --recursive \
    --timestamping \
    -l inf \
    --no-remove-listing \
    --no-parent […] \
    http://www.example.com/
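The resulting mirror directory can be packaged and shipped exactly like the single page above; a sketch (the archive name and the -19 compression level are my own choices, not prescribed anywhere):

```shell
# Pack the mirrored site into a single Zstandard-compressed tarball.
tar cf - www.example.com | zstd -19 > www.example.com-mirror.tar.zst

# It can then be queued for the remote node as before:
# nncp-file www.example.com-mirror.tar.zst remote.node:
```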
There is a standard for creating web archives:
WARC. Fortunately again,
wget supports it as an output format:
$ wget [--page-requisites] [--recursive] \
    --warc-file www.example.com-$(date '+%Y%m%d%H%M%S') \
    --no-warc-keep-log --no-warc-digests \
    [--no-warc-compression] [--warc-max-size=XXX] \
    […] http://www.example.com/
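The date substitution in the --warc-file name is meant to produce a compact YYYYmmddHHMMSS timestamp (note that in strftime format %m is the month and %M is the minutes, which are easy to swap by accident):

```shell
# Year, month, day, hour, minute, second packed into 14 digits,
# e.g. something like 20240131235959.
date '+%Y%m%d%H%M%S'
```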
That command will create a www.example.com-XXX.warc web archive. It can produce specialized segmented gzip- and Zstandard-compressed archives that are friendly to indexing and searching. Or use the even simpler crawl utility, also written in Go. I can advise my own tofuproxy software (written in Go too) for conveniently indexing, browsing and extracting those archives.