One of the greatest promises of cloud computing is resiliency. Store your data ‘in the cloud’ and access it from anywhere, enjoy high durability and speed. You know the marketing spiel already. A recent incident reminded me of the importance of backups. In fact, the importance of backups of backups. Sounds strange? Of course. This is the tale of a missing server image.
I usually try not to keep all my eggs in one basket, even if one basket is very warm, cosy and comfortable. However, sometimes it’s hard to avoid a comfy basket, even if it turns out to be the least reliable one. One of the nice features of Rackspace is being able to take a snapshot image of your server, even whilst it is running, and then being able to restore it. You can also use it to clone the server into another instance. I find this feature much friendlier and more useful than its equivalents on AWS or Linode. The Rackspace API also makes it easy to automate, and I found it easier to work with than, e.g. boto (the great Python library for AWS). So I had a couple of not-so-important servers running on Rackspace, and since those were infrequently used, and to really make the most of the per-hour billing, I decided to only build them when they’re actually needed, and then have a cron job later at night that deletes those servers. Of course, it won’t delete them before taking a snapshot and making sure the snapshot is stored safely, ready to be used the next time.
One sunny day, we were trying to build one of those servers, when it ‘Entered an Error state’ (in the Rackspace parlance). For no obvious reason, this sometimes happens when building a server. Unfortunately, the only way to release it from this state is by contacting Rackspace support. Their support is usually very prompt so it’s not a huge concern, and indeed after a few minutes they ‘released’ it from the state, and I could try to rebuild the server. Alas, it entered the error state again, so another support ticket was required. This time the investigation took a little while longer. The bad news was that whilst the server image was there, the underlying files with the actual content (stored on Cloud Files) had been mysteriously deleted. Ouch.
Of course we didn’t delete the files, and we don’t use Cloud Files directly. So how could this have happened?
After a good few hours/days of investigation, Rackspace provided a not-really-decipherable API log of calls that allegedly showed the file being deleted. This seemed strange, so I checked my own logs to see what my script had reported. The script I am using is very simple, but it does make the necessary checks before doing anything stupid. Specifically, the Rackspace API returns a status for certain operations. For example, when you build a new server image, the status changes from SAVING to (hopefully) ACTIVE. It’s this ACTIVE state that tells you that the image is stored safely and securely and that your image is now sleeping comfortably, protected by the cloud angels. For some reason however, despite my log records clearly showing that the image was being prepared, saved and then ACTIVE, it ended up vaporizing without a trace.
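A check of this kind might look roughly like the sketch below. It is not the actual script: the callables create_image, get_status and delete_server are hypothetical stand-ins for whatever API client is in use, and the timeout and polling interval are arbitrary. Only the SAVING and ACTIVE status names come from the description above.

```python
import time
from typing import Callable


class SnapshotError(Exception):
    """Raised when an image fails to reach ACTIVE."""


def wait_until_active(
    get_status: Callable[[], str],
    timeout: float = 3600.0,   # assumed value, tune to taste
    interval: float = 30.0,    # assumed value, tune to taste
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll the image status until it is ACTIVE, or raise SnapshotError."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status == "ACTIVE":
            return
        if status != "SAVING":
            # Anything other than SAVING or ACTIVE is treated as a failure.
            raise SnapshotError(f"unexpected image status: {status}")
        if clock() >= deadline:
            raise SnapshotError("timed out waiting for image to become ACTIVE")
        sleep(interval)


def snapshot_then_delete(
    create_image: Callable[[], str],
    get_status: Callable[[str], str],
    delete_server: Callable[[], None],
    **wait_kwargs,
) -> str:
    """Take a snapshot and delete the server only once the image is ACTIVE."""
    image_id = create_image()
    wait_until_active(lambda: get_status(image_id), **wait_kwargs)
    # Reached only if the API reported ACTIVE; otherwise an exception was raised.
    delete_server()
    return image_id
```

Note that a check like this can only be as trustworthy as the status the API reports, which is exactly where this story went wrong.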
Do you keep backups?
I do. I hope you do too. But what happens when your backup gets deleted? Does Rackspace keep backups of their cloud files? Well, it’s so cloudy, it probably doesn’t need a backup. If a file is deleted, it’s just gone. There was no way to recover my missing image.
And what about the fanatic support?
I always wondered about it. I can only compare it to other cloud providers I use. I can’t say anything about AWS support, because it probably doesn’t exist for small-fish customers like me. However, I never really needed their support. Everything has worked without a problem so far. Linode support is not marketed as fanatic, drastic or fantastic, but it’s really outstanding. I can’t think of better support in any company I’ve dealt with, and I know how hard it is in the hosting world, where it’s hard to please the customer if the network is down or something. I rarely need the Linode support either, because things are working fine, but when I do, they are great. With Rackspace, I can’t say the same. First of all, of those three companies, Rackspace is the one whose support I need to contact most frequently. Things are really not that smooth. I don’t know what makes them fanatic, but so far they haven’t even given me as much as an apology. At some point their support guy said something like “You have a wonderful day!”. I guess he meant well, but it somehow felt ironic, given both the somewhat commanding tone and the unfortunate circumstances of losing a server image.
I doubt it. It’s been going on for a couple of days now and I am still waiting for a response from Rackspace. Luckily the server really wasn’t that critical, and we can probably build a different one. But it’s still not something that I feel comfortable with. When an API returns a response, you expect it to be reliable. The lost productivity and the time spent chasing this issue, looking through the logs and communicating with Rackspace didn’t really make me a happy customer.