Creating a static archive of a Drupal site

Ryan Weal
Published 2017-04-24

Each year another DrupalCamp comes to pass and as event organizers we are left with one more site to maintain. After a while this builds up to a lot of sites that need continuous updates. What to do?

When a site is ready to become an archive it can be a good idea to convert it to a static site. Security updates are no longer necessary, but interactive features of the site disappear... which is usually a good thing in this scenario.

Creating a site mirror

Long before I used Drupal this was all possible with wget, and it continues to work today:

#!/bin/bash
# Usage: getsite example.com (mirrors into ./example.com, logs to download.log)
wget -o download.log -N -S --wait=1 --random-wait -x -r -p -l inf -E --convert-links --domains="$1" "$1"

I call this script "getsite". You use it by typing "getsite example.com".

This is a simple script that I place in the /usr/local/bin folder of the computer I will be using to create the site mirror. Remember to make it executable with chmod +x.

This script will probably take a while to run. You can run tail -f download.log in another terminal to watch the progress.

What does it do?

This is a simple web crawler that will follow all links on the page that you provided, but ONLY the links that are on the same domain.

It will try to fetch ALL the assets that come from this exact domain name you provide.

Once the download finishes, it rewrites the links in the downloaded pages so they point at the local copies using relative paths.

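I also have it set to crawl slowly so as not to scare any firewalls we may be traversing. Note that --random-wait only varies a delay that has been set with --wait, so the two options belong together.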

You can look up all of the command line options by typing man wget on your system.
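If the single-letter flags are hard to scan, here is a sketch of the same kind of command spelled out with wget's long option names. The function name getsite_long is hypothetical, and --wait=1 is an assumed value, included because --random-wait needs a base delay to vary.

```shell
#!/bin/bash
# Sketch: the mirror command with long option names.
# getsite_long is a hypothetical name; --wait=1 is an assumed delay.
#   --output-file        write the log to a file (-o)
#   --timestamping       skip files that have not changed (-N)
#   --server-response    record the HTTP headers in the log (-S)
#   --force-directories  recreate the site's folder structure (-x)
#   --page-requisites    fetch the CSS, JS and images each page needs (-p)
#   --adjust-extension   append .html to HTML pages that lack it (-E)
getsite_long() {
  local domain="$1"
  wget \
    --output-file=download.log \
    --timestamping \
    --server-response \
    --wait=1 \
    --random-wait \
    --force-directories \
    --recursive \
    --page-requisites \
    --level=inf \
    --adjust-extension \
    --convert-links \
    --domains="$domain" \
    "$domain"
}

# Run only when a domain is passed on the command line.
if [ "$#" -gt 0 ]; then
  getsite_long "$1"
fi
```

Older wget releases may know --adjust-extension as --html-extension, so check man wget if the option is rejected.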

After running the command you will have a folder with the name of the domain and all of the files for the site, in addition to a download.log file that you can use to audit the download.

The tree utility is a handy way to see all of the downloaded files at a glance.
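To make the audit more concrete, here is a minimal sketch of two helpers that summarise download.log. It assumes the usual wget log layout: each request starts with a line like "--2017-04-24 10:00:00--  http://example.com/about", and the -S option records the raw status line (for example "HTTP/1.1 200 OK") for every response. The function names are hypothetical.

```shell
#!/bin/bash
# List every URL wget requested, one per line, without duplicates.
crawled_urls() {
  grep -E '^--[0-9]{4}-[0-9]{2}-[0-9]{2} ' "$1" | awk '{ print $3 }' | sort -u
}

# Count responses per HTTP status code, printed as "<code> <count>".
status_summary() {
  grep -Eo 'HTTP/[0-9.]+ [0-9]{3}' "$1" \
    | awk '{ count[$2]++ } END { for (code in count) print code, count[code] }' \
    | sort
}
```

For example, "crawled_urls download.log > urls.txt" gives you a path list to review, and "status_summary download.log" shows whether any 404 or 500 responses are worth chasing before the source site goes away.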

Oh noes! All my paths have .html appended now!

Relax. Just like we can do clean URLs with index.php files we can specify some rules on our webserver to mask that ugly file extension.

In Nginx you can do this as follows:

location / {
  root   /var/www/html;
  index  index.html index.htm;
  try_files $uri $uri.html $uri/index.html $uri/ =404;
}

The "try_files" patterns will match what used to be our Drupal clean URLs.
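To see which URLs those patterns need to serve, it can help to map each mirrored file back to the clean path a visitor would request. The sketch below is one way to do that. The names clean_path and check_mirror are hypothetical, and the base URL passed to check_mirror is whatever host is serving your archive.

```shell
#!/bin/bash
# Map a file path inside the mirror to the clean URL path it should answer to.
clean_path() {
  local p="/${1#./}"
  case "$p" in
    */index.html) p="${p%index.html}" ;;   # directory index
    *.html)       p="${p%.html}" ;;        # extension appended by wget -E
  esac
  printf '%s\n' "$p"
}

# Request every clean path from a running server and print "<status> <path>".
# Usage: check_mirror http://localhost /var/www/html
check_mirror() {
  local base="$1" root="$2" file path
  ( cd "$root" && find . -type f -name '*.html' ) | while IFS= read -r file; do
    path="$(clean_path "$file")"
    printf '%s %s\n' "$(curl -s -o /dev/null -w '%{http_code}' "${base}${path}")" "$path"
  done
}
```

Anything that does not come back as 200 points at a rule that still needs adjusting. File names containing query strings or characters that need URL-encoding are not handled by this sketch.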

You may also want to add some kind of htpasswd-style restriction if your content is not intended to be available to the public.

It is as simple as that! Wget is a great utility for making site mirrors or legal archives.

Cleaning up loose ends

Your Drupal site is going to have some interactive components that will no longer work.

In particular:

  • User login form
  • Webforms
  • Commenting
  • Anything else using a form and/or a captcha (maybe disable captcha too)

It may be simpler to disable these before taking the snapshot. Alternatively, you can open the resulting HTML in a text editor and remove the form components after the fact.
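If you clean up after the fact, a quick search shows which downloaded pages still contain a form. This is a minimal sketch, assuming the mirror lives in a folder named after the domain; pages_with_forms is a hypothetical helper name.

```shell
#!/bin/bash
# Print the mirrored HTML files that still contain a <form> element.
# The match is case-insensitive and limited to files ending in .html.
pages_with_forms() {
  grep -rli --include='*.html' '<form' "$1" | sort
}
```

Running "pages_with_forms example.com" gives you a worklist of files to open in your editor.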

You may also want to enable or disable caching of different things depending on what results you get. By default you are probably going to see a lot of security tokens in the downloaded paths, so you may want to disable that... on the other hand, you may want to bundle your CSS to make fewer requests. Review your downloaded archive to see what will be best before you shut down your source site.

Other uses

My team has used variations of this script for a variety of other needs as well:

  • to estimate the size and scope of a migration project;
  • to get a complete list of paths we may want to alias or redirect after a migration;
  • to make an archive of a site for legal proceedings (e.g., gathering evidence of copyright infringement);
  • to migrate data from a static archive when source databases do not contain fully rendered content;
  • and finally: to "pepper" the caches of large sites by hitting each URL after a migration when the caches are all cold.

In that last example we use the --spider option so that wget does not keep the files; it simply requests each URL and then moves on.
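A minimal sketch of that cache-warming variant, assuming the same crawl scope as the mirror script. The function name warmcache and the log file name warm.log are hypothetical.

```shell
#!/bin/bash
# Crawl the whole site without keeping anything on disk, so that each URL
# is requested and the server-side caches get populated.
warmcache() {
  wget --spider --recursive --level=inf \
    --output-file=warm.log --domains="$1" "$1"
}

# Run only when a domain is passed on the command line.
if [ "$#" -gt 0 ]; then
  warmcache "$1"
fi
```

In spider mode wget may issue HEAD requests for some resources, so it is worth checking warm.log to confirm that the pages you care about were actually fetched.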

Wget is an extremely powerful tool for mirroring entire sites and provides us an easy way to archive old dynamically-rendered sites without much hassle, and zero ongoing maintenance.

To find out what other things you can do with wget just type man wget on your console and read all the options that are available.

Thanks for Reading!
Published by Kafei Interactive