How to make a static copy of a Joomla site with wget
Sometimes a previously active site gets abandoned, but it is still worth keeping online so its history is not lost. In those cases the common content management systems such as Joomla, Drupal and WordPress become a real pain: you have to keep the CMS installation secure, yet you do not want to do any extra work for a forgotten site. A good option is to make a complete mirror of the site, but serve plain HTML pages instead of the dynamic content generated by PHP.
First you can tweak the CMS settings to make the mirroring easier; you will see what I mean later on. The example here is an older Joomla site.
Wget the site
Wget is a nice command-line *nix utility for copying content from the internet.
wget -m -k -K -E http://your.domain.com
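Briefly, the switches do the following (see the wget man page for the details):
-m turns on mirror mode, i.e. recursive download with timestamping (equivalent to -r -N -l inf --no-remove-listing)
-k converts the links in the downloaded pages to point at the local copies
-K keeps the original of each converted page as a .orig backup
-E saves HTML pages with an .html extension (--adjust-extension, called --html-extension in older wget versions)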
For the full details of these and other switches you can try "man wget" or read Wikipedia or some other online resource. This combination is the best I found to make the job as easy as possible. You will end up with a directory your.domain.com containing the contents of the site.
Find missing files
Wget does not look inside CSS files to find design-related content, so you may need to copy some directories over from the Joomla server, for example images. An easy way to find missing images is to disconnect your network connection and then open index.html from the your.domain.com directory. Another way is to use the Web Developer Tools in Firefox: open the Network monitor and force-reload the same index.html. The monitor will show the loaded resources and the domain each one was loaded from.
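If a whole directory turns out to be missing and you still have shell access to the old server, you can simply copy it over. A sketch with rsync, using a hypothetical user and the common /var/www/html web root (adjust both to your setup):
rsync -av user@your.domain.com:/var/www/html/images/ your.domain.com/images/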
At this point you will also run into a problem: some content is loaded from the online server even though a local copy of the same resource exists. You can either tweak the HTML files to point to the local mirror, or, more simply, just drop the domain from the request.
I had to run sed twice to get the files sorted out, mostly because I was playing it safe and did the process in increments: first to get the images loading locally and then to get the template files (CSS and JS).
find . -name '*.html' -print -exec sed -i.bak 's%http://your.domain.com/images/%/images/%g' {} \;
find . -name '*.html' -print -exec sed -i.bak 's%http://your.domain.com/cache/template/gzip.php?%/cache/template/%g' {} \;
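The -i.bak switch makes sed keep a backup of every file it touches; once the pages look right you can clean those up:
find . -name '*.bak' -delete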
The gzip.php in this case was a bit more challenging, as it was used to load the CSS and JavaScript files, but wget treated gzip.php as just a single file.
To get the required CSS and JavaScript files I used wget again, in the local cache/template directory, once for every template file. Those were easy to spot as they all had gzip.php in them.
wget "http://your.domain.com/cache/template/gzip.php?template-c87b3b7b.css" mv gzip.php?template-c87b3b7b.css template-c87b3b7b.css
or
wget "http://your.domain.com/cache/template/gzip.php?template-c87b3b7b.css" -O template-c87b3b7b.css
The first version downloads the file and then renames it to get rid of the PHP reference (gzip.php in this case); the second combines both steps by giving wget an output filename. I could also have modified the links in the HTML files to get rid of the "?" by replacing it with "%3F", but that would have been a hack in my opinion.
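If the template pulls in many of these cached files, a small loop saves some typing. A sketch to run inside cache/template; the second file name here is just a made-up placeholder for whatever gzip.php?… names your template actually uses:
for f in template-c87b3b7b.css template-a1b2c3d4.js; do
  wget "http://your.domain.com/cache/template/gzip.php?$f" -O "$f"
done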
After this you should have a directory with a fully functioning static copy of your site. Then just put it online and try it out.
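Before uploading it anywhere, you can sanity-check the mirror locally, for example with Python's built-in web server (assuming Python 3 is installed):
cd your.domain.com
python3 -m http.server 8080
Then browse to http://localhost:8080/ with the network monitor open and watch for requests that still go to the old domain.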
More tweaks – remove unused JavaScript
JavaScript is used in some user interface elements, but there are usually JavaScript files that are no longer needed once the CMS is gone, such as those related to the admin interface or other content editing. Try to understand what the JS on your site does and remove the files that are not needed.
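For example, old Joomla templates typically load MooTools on every page even though a static mirror has no use for it. A blunt sketch that drops any line referencing it, assuming each script tag sits on a line of its own:
find . -name '*.html' -exec sed -i.bak '/mootools/d' {} \;
Check a few pages afterwards to make sure nothing you wanted to keep was on those lines.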
http://www.ualberta.ca/~stothard/downloads/misc/sed_commands.txt
http://stackoverflow.com/questions/471183/linux-command-line-global-search-and-replace
Comments
Thanks for the instructions, this is excellent. We just ran into a case where we need to do this (keep the content online but drop the obsolete CMS version).
Great post. It helped me get started playing with wget for mirroring my old Joomla…
I found, from the man page, that there is an option to download "missing files", which is "-p" or "--page-requisites".
So a command like:
wget -p -m -k -K -E http://your.domain.com
will get (almost) all css, js and images.
Here is a fast way to verify which files are still missing:
grep -lr "http:\/\/your.domain.com" your.domain.com/ | sort -u | xargs sed -ne '/http:\/\/your\.domain\.com/s/.*"http:\/\/\([^"]*\).*/http:\/\/\1/p' | sort -u
And you can get them using:
grep -lr "http:\/\/your.domain.com" your.domain.com/ | sort -u | xargs sed -ne '/http:\/\/your\.domain\.com/s/.*"http:\/\/\([^"]*\).*/http:\/\/\1/p' | sort -u | wget -x -nH --cut-dirs=2 -i -
You still have to use the developer tools to find missing files included by JavaScript or in some other (strange) way.
But I wanted my mirror to have the same links, and for that the file names can't include the ".html" extension.
So I found that the “-p” option (and “-k”) doesn’t work so well if you don’t use the “-E” option.
But using “-E” and “-p” is the best way to get “page-requisites”. So I did a first fetch with “-E”, deleted all “.html” files and then fetched all over again without “-E”.
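Spelled out, that two-pass sequence looks roughly like this:
wget -p -m -k -K -E http://your.domain.com
find your.domain.com -name '*.html' -delete
wget -p -m -k -K http://your.domain.com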
As "-k" doesn't work that well without "-E", I also had to make some other substitutions:
# Convert all remaining absolute URLs to relative ones
grep -lr "http:\/\/your.domain.com" your.domain.com/ | sort -u | xargs sed -i -e '/http:\/\/your\.domain\.com/s/http:\/\/your\.domain\.com\/\([^"]*\)/\1/g'
# Convert URLs containing ? to their URL-encoded equivalent (%3F)
grep -lr --exclude=*.{css,js} '=\s\{0,1\}"[^?"]*?[^"]*"' your.domain.com/ | sort -u | xargs sed -i -e '/\(=\s\{0,1\}"[^?"]*\)?\([^"]*"\)/s/\(=\s\{0,1\}"[^?"]*\)?\([^"]*"\)/\1%3F\2/g'
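After those passes, re-running the earlier check should come back empty if everything now points at the local copies:
grep -lr "http://your.domain.com" your.domain.com/ | sort -u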
Hope that helps.