Archiving a Website with Wget

2022-02-05

Web pages and even entire websites disappear every day, never to be seen again. Sometimes the author decides to take them offline. Sometimes they pass away. Sometimes (especially with free services), the hosting provider simply purges everything to save a buck. The good news is that you can do something about it, at least on a personal level.

If you aren't too tech savvy, you can easily archive individual pages for free using the Wayback Machine. Or if you're prepared to pay a very reasonable price for the privilege, Pinboard will archive all of your bookmarks for you.

But what happens if you need to archive an entire website? Nobody wants to manually copy and paste individual URLs into an archiving service. If you're prepared to get your hands dirty, there's Wget.

Wget is a command line tool for fetching content from web servers. Using the right command line arguments, it is very effective at downloading a whole website. Wget comes installed on most Linux distributions. Windows users will need to find a Windows build online such as the (now quite old) GnuWin32 packages, or use Cygwin.
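
Before going any further, it may be worth confirming that Wget is actually available on your machine. The following is a minimal sketch, assuming a POSIX-style shell. The have_command function is a hypothetical helper written for this example, not something that ships with Wget.

```shell
# Hypothetical helper: succeeds if the named command can be found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

if have_command wget; then
    # Print only the first line of the version banner.
    wget --version | head -n 1
else
    echo "wget not found; install it with your package manager" >&2
fi
```

If the check fails on Linux, installing the wget package from your distribution's repositories is usually all that is needed.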

Archiving a Website

To archive a website with Wget, use:

wget -mpckE --user-agent="" -e robots=off --wait 1 www.foo.com

Explanation

-m (Mirror): Turns on mirror-friendly settings like infinite recursion depth, timestamps, etc.
-p (Page Requisites): Includes page dependencies such as images, style sheets, etc. in the download.
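-c (Continue): Resumes partially downloaded files, so an interrupted run can pick up where it left off.
-k (Convert Links): Converts the links in downloaded pages for offline viewing. Links to files that were downloaded become relative, and links to files that were not downloaded point to their original location on the web.
-E (Adjust Extension): Appends .html to the file names of downloaded web pages that lack it, so they open correctly offline.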
--user-agent="": Tells Wget not to send a User-Agent header, so it does not identify itself. This can be useful if a website is known to detect and block Wget. Alternatively, you can set the user agent to that of a web browser.
-e robots=off: Tells Wget to ignore robots.txt, if it exists. robots.txt is used to restrict which pages web crawlers such as Wget, GoogleBot, etc. will access.
--wait 1: Tells Wget to wait 1 second between downloads. This avoids putting an unreasonable load on the server.
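
If the short flags are hard to remember, the same command can be spelled out with Wget's long option names. The sketch below wraps it in a shell function. The name archive_site is an assumption made for this example, and www.foo.com is the placeholder site used above.

```shell
# Hypothetical wrapper: the archive command above, written with long options.
# $1 is the site to mirror.
archive_site() {
    wget \
        --mirror \
        --page-requisites \
        --continue \
        --convert-links \
        --adjust-extension \
        --user-agent="" \
        --execute robots=off \
        --wait=1 \
        "$1"
}

# Usage (uncomment to run against the placeholder site):
# archive_site www.foo.com
```

Keeping the options one per line also makes it easier to drop or tweak a single setting, such as increasing the wait time for a slower server.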

The smallest websites only take a few minutes to download. Others can take significantly longer. When Wget is done, you'll have a folder structure matching the website that you downloaded. Open it up, and then open the index.html page to see the finished result.
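
For a quick sanity check of the result before opening it in a browser, something like the following could be used. It is only a sketch: summarize_mirror is a hypothetical helper, and it assumes the site's entry point is an index.html at the top of the download folder, which will not be true for every website.

```shell
# Hypothetical helper: print a short summary of a mirrored site directory.
summarize_mirror() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # Count regular files anywhere under the mirror directory.
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "files: $count"
    # Report whether a top-level index.html exists to open in a browser.
    if [ -f "$dir/index.html" ]; then
        echo "entry point: $dir/index.html"
    else
        echo "entry point: not found at top level"
    fi
}

# Usage (www.foo.com is the folder Wget creates for the placeholder site):
# summarize_mirror www.foo.com
```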
