For anyone following my previous blog posts, this is a bit of a departure for me. Typically, I get down in the weeds and show lots of code. This post, on the other hand, is more of a philosophical rant. At least you can't say I didn't warn you!
Yesterday I was made aware of this bulletin from HPE, which warns that certain models of HPE SSDs have a firmware bug that will cause the drives to fail suddenly and deterministically at precisely 32,768 hours of operation. You may recognize that number as 2^15, one past the largest value a signed 16-bit (short) integer can hold, which points to an integer overflow. From a practical standpoint, drives that were put into service together reach that hour count together, so you are likely to have multiple drives fail nearly simultaneously. This means that your RAID may not save you.
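For the curious, here is a minimal sketch of the arithmetic behind that number. It assumes the power-on-hours counter is stored as a signed 16-bit integer, which is my guess from the 32,768 figure and not something the bulletin spells out. The `next_hour` helper is purely hypothetical and says nothing about how the actual firmware is written.

```python
import ctypes

INT16_MAX = 2**15 - 1  # 32,767: largest value a signed 16-bit integer can hold


def next_hour(counter: int) -> int:
    """Increment an hour counter as if it were stored in a signed 16-bit integer.

    Hypothetical helper for illustration only; ctypes.c_int16 truncates to
    16 bits, which mimics the wraparound.
    """
    return ctypes.c_int16(counter + 1).value


# One more hour past the maximum wraps around to a large negative number.
wrapped = next_hour(INT16_MAX)
print(INT16_MAX, "->", wrapped)  # 32767 -> -32768

# How long does a drive run before it reaches 32,768 power-on hours?
# (Using 365-day years for simplicity.)
years, remainder = divmod(32_768, 24 * 365)
days, hours = divmod(remainder, 24)
print(f"{years} years, {days} days, {hours} hours")  # 3 years, 270 days, 8 hours
```

In other words, a set of drives racked on the same day would all hit the limit a little under four years later, within the same hour.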
Further, they say "HPE was notified by a Solid State Drive (SSD) manufacturer of a firmware defect affecting certain SAS SSD models". To me, that makes it sound like someone else manufactures these drives on behalf of HPE. Which raises the question: does that manufacturer supply these drives to any other vendors and, if so, to whom?
So why am I all hot and bothered by this bulletin, and what's with the title of the blog? Well, if you are running PostgreSQL, presumably your data is sitting on some physical disks somewhere. Remember, "the cloud is just someone else's computer", right? Now that you are aware of the firmware issue you should be asking some questions of your own:
- Where exactly is my data -- my own data center, the cloud, or some combination?
- What kind of drives are being used to host my data? Is it even possible to find that information out? Are they affected? Is there a plan in place to update the firmware?
- Am I doing regular, continuous backups?
- Are those backups "Schrödinger Backups", i.e. backups whose state is unknown until I try to restore them, or do I regularly test whether they can be restored successfully?
You may not be able to answer all of those questions in the best possible way, in which case you are hoping that everything will be fine and you will not be affected. Or if you are affected, you are hoping that you have good backups, and when you need them they can be quickly and seamlessly restored and you will incur minimal downtime.
And to that, I say "hope is not a strategy." I worked with someone years ago who would constantly repeat that refrain, and it has stuck in my head.
So what should you do? First of all, call me old-fashioned, but it would be good to know a bit about how and where your data is actually stored.
But most importantly, make sure you have a good backup strategy in place and that it is well tested. Where PostgreSQL is concerned, that means getting intimate with pgBackRest, or at least having someone who is watching your back and doing that for you!
Do you know where your data is?