“ Hardware will fail all the time—because it does! But that doesn’t mean durability has to suffer.”
Google Cloud
Amazon uses a 45-foot-long ruggedized shipping container called Snowmobile to transfer up to 100 PB of data when a customer chooses to migrate from on-premises infrastructure to the cloud. Where on-premises platforms struggle, cloud thrives. Cloud computing enables on-demand scaling by letting users select custom environments for their processing needs, so they can focus more on their data applications and less on the underlying infrastructure. But moving to the cloud is not straightforward. Even for those already on the cloud, understanding workload characteristics well enough to optimise the processing environment is still a challenge.
Then there are challenges like vendor lock-in: the cost of switching to a different vendor gets so high that the customer chooses to stay put. Vendor lock-in is only part of the problem. On-premises and cloud setups each have their own storage and computing challenges. Cloud computing also gets a bad rap when it comes to security, and customers who deal with highly confidential projects usually avoid the cloud. To address this, cloud providers go the extra mile to make sure there is no loss of data on their end.
This is where eleven 9s of durability come in. Eleven 9s, or 99.999999999% annual durability, is the high standard that the top cloud service providers uphold.
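To put the figure in perspective, here is a small back-of-the-envelope sketch. It assumes that the durability figure can be read as an independent per-object probability of surviving one year, which is a simplification, and the bucket size of 10 million objects is a hypothetical example rather than a number from any provider's SLA.

```python
from fractions import Fraction

# Eleven 9s of annual durability means a per-object annual loss
# probability of 1 - 0.99999999999 = 1e-11 (assumed independent per object).
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year for a given object count."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


if __name__ == "__main__":
    objects = 10_000_000  # hypothetical bucket size
    print(float(expected_annual_losses(objects)))  # 0.0001
    print(years_per_lost_object(objects))          # 10000
```

Under these assumptions, storing 10 million objects would mean losing a single object about once every 10,000 years on average.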
How To Deliver Eleven 9s?
AWS, Azure and GCP make up the Big Three in the cloud business. These companies have left no stone unturned to deliver the eleven 9s, and sometimes go even further. Let’s take a look at a few of the best practices these companies have put in place.
Using multiple storage locations
When it comes to durability, engineers typically emphasise network, server, and storage hardware failures. But how good are these measures against physical failures resulting from a natural calamity? According to Google, software is ultimately the best way to protect against hardware failures. This also eliminates the need to buy expensive hardware.
Data is usually broken into a number of ‘data chunks’, which are then stored on different servers with different power sources. Several copies of the metadata help reconstruct the data in case of failures.
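The chunk-and-reconstruct idea can be illustrated with a deliberately simplified sketch. It assumes a single XOR parity chunk, which can repair only one lost chunk; this is a stand-in chosen for brevity, not the encoding any provider actually uses. All function names and the `metadata` dictionary are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def split_into_chunks(data: bytes, chunk_count: int) -> List[bytes]:
    """Split data into chunk_count equal-sized chunks, zero-padding the tail."""
    size = -(-len(data) // chunk_count)  # ceiling division
    padded = data.ljust(size * chunk_count, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(chunk_count)]


def xor_chunks(chunks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def encode(data: bytes, chunk_count: int) -> List[bytes]:
    """Return the data chunks followed by one parity chunk."""
    chunks = split_into_chunks(data, chunk_count)
    return chunks + [xor_chunks(chunks)]


def recover(stored: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing piece (marked None) from the survivors."""
    missing = [i for i, piece in enumerate(stored) if piece is None]
    if len(missing) > 1:
        raise ValueError("a single XOR parity chunk can repair only one loss")
    repaired = list(stored)
    if missing:
        survivors = [piece for piece in stored if piece is not None]
        repaired[missing[0]] = xor_chunks(survivors)
    return repaired


if __name__ == "__main__":
    data = b"eleven nines of durability"
    metadata = {"length": len(data), "chunk_count": 4}  # hypothetical metadata
    pieces = encode(data, metadata["chunk_count"])

    pieces[2] = None  # simulate losing the server that held chunk 2
    repaired = recover(pieces)

    restored = b"".join(repaired[:metadata["chunk_count"]])[:metadata["length"]]
    print(restored == data)
```

In this toy model each piece would live on a different server, and the metadata (original length and chunk count) is what lets the reader reassemble the object after a repair.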
To keep data accessible to customers, GCP offers location types, such as multi-regions, with significantly higher availability SLAs. These can transparently serve the “data chunks” from more than one location if a region is temporarily inaccessible.
Using Checksums
Data corruption in transit is another major pain point. It can happen when data is transferred across networks or when objects are uploaded to or downloaded from Cloud Storage. To catch such in-transit errors, cloud companies use checksums.
“Cloud Storage is designed to be always checksum-protected, without exception,” said Google. A checksum is a value calculated from the data that can be recomputed later to confirm the data has not changed. Cloud service providers store data redundantly across multiple availability zones. Once data is stored, it is regularly verified against its checksums to guard data at rest from certain types of data errors. “If there is a checksum mismatch, data is automatically repaired using the redundancy present in our encodings,” said the GCP team. These encodings provide sufficient redundancy to support a target of more than 11 nines of durability against a hardware failure.
To achieve end-to-end protection, Google recommends that users provide checksums when uploading data to Cloud Storage and validate those checksums on the client when downloading.
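The upload-then-validate pattern can be sketched as follows. This is a minimal illustration under stated assumptions: `InMemoryBucket` is a hypothetical stand-in for an object store, not a real client library, and a base64-encoded MD5 digest is used only because it is available in Python's standard library.

```python
import base64
import hashlib


class ChecksumMismatch(Exception):
    """Raised when the received bytes do not match the expected checksum."""


def md5_base64(data: bytes) -> str:
    """Base64-encoded MD5 digest of data."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


class InMemoryBucket:
    """Hypothetical object store that checks a client-supplied checksum."""

    def __init__(self):
        self._objects = {}  # name -> (data, checksum)

    def upload(self, name: str, data: bytes, expected_md5: str) -> None:
        # Reject the upload if the bytes were altered in transit.
        if md5_base64(data) != expected_md5:
            raise ChecksumMismatch(f"upload of {name!r} failed verification")
        self._objects[name] = (data, expected_md5)

    def download(self, name: str):
        return self._objects[name]


def upload_with_checksum(bucket: InMemoryBucket, name: str, data: bytes) -> str:
    """Compute the checksum on the client and send it with the data."""
    checksum = md5_base64(data)
    bucket.upload(name, data, checksum)
    return checksum


def download_and_verify(bucket: InMemoryBucket, name: str) -> bytes:
    """Download an object and validate it against its stored checksum."""
    data, stored_checksum = bucket.download(name)
    if md5_base64(data) != stored_checksum:
        raise ChecksumMismatch(f"download of {name!r} failed verification")
    return data


if __name__ == "__main__":
    bucket = InMemoryBucket()
    upload_with_checksum(bucket, "report.csv", b"id,value\n1,42\n")
    print(download_and_verify(bucket, "report.csv"))
```

Because the checksum is computed on the client before the data leaves it, and recomputed on the client after the data returns, corruption anywhere in between shows up as a mismatch.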
Managing bugs
There is no bigger or more immediate threat than a software bug, because bugs lead to durability loss. Companies update their software whenever they come across bugs. To avoid too many corrective updates, the team at GCP tries to catch bugs upfront. Software changes undergo lengthy integration tests before they are released to different locations. “These software rollouts are monitored closely with plans in place for quick rollbacks, if necessary,” explained the team. GCP also offers object versioning, which can restore objects after an accidental deletion.
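The idea behind object versioning can be shown with a toy model. The `VersionedBucket` class and its methods below are hypothetical and exist only for illustration; they are not the Cloud Storage API. The sketch assumes that a delete hides the live object but keeps its earlier versions.

```python
class VersionedBucket:
    """Hypothetical in-memory model of a bucket with object versioning."""

    def __init__(self):
        self._versions = {}  # name -> list of versions, oldest first
        self._live = set()   # names that currently have a live version

    def put(self, name: str, data: bytes) -> None:
        """Store a new version and make it the live one."""
        self._versions.setdefault(name, []).append(data)
        self._live.add(name)

    def get(self, name: str) -> bytes:
        """Return the live version, or raise KeyError if the object is deleted."""
        if name not in self._live:
            raise KeyError(name)
        return self._versions[name][-1]

    def delete(self, name: str) -> None:
        """Hide the object but keep its version history."""
        self._live.discard(name)

    def restore(self, name: str, version_index: int = -1) -> bytes:
        """Copy an older version back as the new live version."""
        data = self._versions[name][version_index]
        self.put(name, data)
        return data


if __name__ == "__main__":
    bucket = VersionedBucket()
    bucket.put("notes.txt", b"first draft")
    bucket.put("notes.txt", b"second draft")
    bucket.delete("notes.txt")      # accidental deletion
    bucket.restore("notes.txt")     # bring back the latest version
    print(bucket.get("notes.txt"))
```

The design choice worth noting is that deletion and data destruction are separate steps, which is what makes an accidental delete recoverable.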
While every cloud provider has its own tricks to satisfy customers’ needs, simple things like backing up data are still mandatory. A second copy of the data in a physically isolated region also helps. All of these practices, along with a few others, achieve the holy grail of 99.999999999% durability when deployed together.