Since this page is hosted on WordPress, I like to keep a backup of all the files and images on the site. The post content (HTML, etc.) is easy to back up using the WP admin panel and the Tools -> Export method. But the images are not exported along with the post content; instead, the export XML file contains links to them.
I’ve seen a couple of work-around ideas for this mentioned on various sites, but they don’t have the “simple plan” quality I normally like. Some even involve using a browser plugin to do the job – much more complicated than it needs to be, IMO. So in this case, the simple plan I’ve used is to apply the *nix tools grep and wget together to download the backup images.
One line could be used, but I’ve broken it into functional pieces so that it’s easy to see what’s happening, and so the intermediate output files can be inspected. To start with, the export function of WP is used to get the programmingmiscellany-export.xml file. That file is parsed by grep, and the result of the parsing goes into *filelist.xml. This operation is shown on the first line in the code window below:
grep "wp:attachment_url" programmingmiscellany-export.xml > programmingmiscellany-filelist.xml
grep -o 'https*://[^<]*' programmingmiscellany-filelist.xml > programmingmiscellany-parsed-filelist-for-wget.txt
wget -w 10 --input-file=programmingmiscellany-parsed-filelist-for-wget.txt
This works because many of the image links in the export file appear inside the wp:attachment_url element. Note that I haven’t verified that *all* files are referenced this way, so be vigilant that some files could be excluded. So far, though, it has done well by me.
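To see why the two grep passes work, here is a quick demo against a one-line sample of an attachment entry. The wp:attachment_url element name is real; the sample URL and filename are invented for illustration:

```shell
# Hypothetical one-line sample of an attachment entry from a WP export file
# (the element name is real; the URL is made up for illustration)
printf '<wp:attachment_url>https://example.files.wordpress.com/2020/01/photo.jpg</wp:attachment_url>\n' > sample-export.xml

# The same two grep passes as above, chained into a pipeline:
# the first pass keeps only attachment lines, the second extracts the bare URL
grep "wp:attachment_url" sample-export.xml | grep -o 'https*://[^<]*'
```

The pipeline prints just the bare URL, which is exactly the form wget’s --input-file expects.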
In the second line, I pull all of the image URLs out into a second file, *parsed-filelist-for-wget.txt.
In the third line, I use wget to download all the images for the site that are publicly displayed (and, I think, privately displayed ones as well, but I’m not sure). That is, not all of the media library graphics are necessarily included – I don’t know whether they are, or under what conditions. To get an idea of how it was doing, one day I compared the item count for the media library (the count listed in the admin panel) with the count of downloaded images in the directory on my computer, using:
ls | grep -c ""
The count in the admin panel (357) was two less than the count in the local directory (359), which at first I thought “made sense” because of the . and .. directories. But plain ls doesn’t list those; only ls -a does, so that explanation doesn’t hold. This was only one look at the count numbers, so it doesn’t say very much. One really needs to analyze the exported XML file more carefully than I have, in order to see exactly what will be downloaded. Since the result for me has been the successful download of most of the graphics for my sites, I’m happy.
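The ls behavior above can be checked in a throwaway directory (the directory and filenames here are made up for the demo):

```shell
# Make a scratch directory with three files in it
demo=$(mktemp -d)
cd "$demo"
touch one.jpg two.jpg three.png

# Plain ls omits . and .. so this counts only the real entries
ls | grep -c ""      # prints 3

# ls -a includes . and .. and so reports two more
ls -a | grep -c ""   # prints 5
```

So a directory-entry count that includes . and .. requires the -a flag; without it, the two-item discrepancy has to come from somewhere else.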
BUT – with another blog using different referencing schemes (public/private/etc.), the success rate may not be as good. The details should all be there in the export file for anyone less lazy than I am to find. Whoever does that – please come back and leave a comment when you find the answer!
One note: the robots.txt file at programmingmiscellany.files.wordpress.com stipulates that robots should wait at least about 4 seconds between requests. I’ve made the delay 10 seconds (the -w 10 above) to stay well within that.
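For reference, a crawl-delay directive in robots.txt typically looks like the following. This is a hypothetical excerpt in the standard format, not the actual contents of that file:

```
User-agent: *
Crawl-delay: 4
```

wget’s -w flag is the matching knob on the client side: it sets the number of seconds to pause between retrievals.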
Note also that I haven’t extensively tested this method, and it’s not guaranteed to work. As usual, this is not advice; it is not a suggestion for others to use, but rather a record of what I myself have used successfully to extract backup images into a local repository for WP. Caveat emptor! If the export file layout changes, for instance, the regular expressions may need to change as well.
One last consideration is the form of the wget command. The way I have it, all the images are put into one directory, which is not good form for re-importing the images back into WordPress. The wget command line can be modified to recreate the original directory tree, I believe, but I can’t remember ATM exactly what the syntax should be. That’s another detail left to the next article.
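A sketch of what that invocation might look like, based on wget’s documented --force-directories (-x) and --no-host-directories (-nH) options; I haven’t tested this variant myself:

```
# Untested sketch: recreate the remote directory tree locally.
#   -x  (--force-directories)    build the directory hierarchy from each URL's path
#   -nH (--no-host-directories)  omit the hostname component of that hierarchy
wget -x -nH -w 10 --input-file=programmingmiscellany-parsed-filelist-for-wget.txt
```

With those flags, an image at .../2020/01/photo.jpg should land in a local 2020/01/ subdirectory rather than the current flat directory.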