Rome was not built in a day, and neither are data centres. Picking the real estate, installing power and cooling plants, and racking servers take months, not to mention how expensive the whole ordeal is. The timeline stretches further whenever an organisation decides to scale up. So, for companies that do not want to burden themselves with the woes of building a data centre, on-prem is not an option. That said, the cloud is not without its own hassles; migration alone is a tricky, tedious process for many organisations. When it comes to open-source software solutions, both on-premise and cloud platforms have their fair share of challenges.
Companies like Google, with their diverse big data solutions, have been trying to address challenges on both ends. In the next section, we take a look at pressing issues concerning big data applications, according to Google Cloud.
On-Prem & Cloud Challenges
Companies might find themselves locked into a certain cloud provider. Vendor lock-in is a situation where the cost of switching to a different vendor is so high that the customer sticks with the original vendor. It becomes an issue in cloud computing because databases are very difficult to move once they are set up, especially during a migration to a different type of environment, which may require reformatting the data. Vendor lock-in, however, is only part of the problem; there are quite a few challenges exclusive to both on-premise and cloud storage and computing.
Configuration & Constraint Management
Although application developers can take advantage of on-prem storage by exploiting the underlying physical environment, it still comes with a few challenges. Making changes to the hardware configuration can be disruptive, as most open-source software depends on standardisation.
Constraint management, meanwhile, is about finding the right way to allocate resources such as power and floor space at the data centre for maximum utilisation.
Migrating data over a network is expensive and time-consuming. To avoid the cost and effort of relocating data and applications, users sometimes resort to physically shipping the hardware by road. Amazon's Snowmobile, for example, is a 45-foot ruggedised shipping container, pulled by a semi-trailer truck, that offers an exabyte-scale data transfer service with transfers of up to 100PB per vehicle.
Where on-premise platforms struggle, the cloud thrives. Cloud computing enables on-demand scaling by letting data developers select custom environments for their processing needs, allowing them to focus more on their data applications and less on the underlying infrastructure.
As workloads evolve over time, managing service level objectives (SLOs), the performance promised by the service provider, becomes harder. Spikes in data should be handled independently, without breaking the data pipeline. Although the cloud eliminates the need for data centre logistics planning, says Google, the complex task of cluster configuration continues to be a challenge. For cloud users, tuning processing environments to match workload characteristics is still difficult.
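To make the spike-handling problem concrete, here is a minimal sketch of the kind of scale-up decision an autoscaler makes when a load spike produces a backlog of pending work. It is loosely modelled on how cluster autoscalers react to pending resource demand; all function names, parameters and numbers are hypothetical, not any provider's actual algorithm.

```python
import math

def workers_to_add(pending_memory_mb: int, memory_per_worker_mb: int,
                   scale_up_factor: float = 0.5,
                   current_workers: int = 2, max_workers: int = 50) -> int:
    """Decide how many workers to add for the current backlog.

    A fraction (scale_up_factor) of the pending memory demand is
    translated into whole workers, capped so the cluster never grows
    beyond max_workers. All values here are illustrative.
    """
    if pending_memory_mb <= 0:
        return 0  # no backlog, no scale-up
    desired = math.ceil(
        pending_memory_mb * scale_up_factor / memory_per_worker_mb
    )
    # Never exceed the configured ceiling for the cluster.
    return min(desired, max_workers - current_workers)

# A spike of 64 GB of pending work, with 8 GB workers and a 0.5
# scale-up factor, asks for 4 extra workers.
print(workers_to_add(64_000, 8_000))  # → 4
```

The scale-up factor is the knob that trades responsiveness for cost: a value near 1.0 absorbs a spike in one step, while a small value grows the cluster gradually.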
Ushering In A Serverless Future
Despite the innovations that Google and other top cloud providers have engineered over the years, the challenges persist, and Google knows it. Google Cloud's BigQuery and Dataproc are designed to empower OSS platforms while also offering a doorway to a serverless future. "Serverless is not new to Google. We have been developing our serverless capabilities for years and even launched BigQuery, the first serverless data warehouse," said Susheel Kaushik, Product Manager at Google Cloud.
GCP’s Dataproc, for instance, complements OSS platforms like Apache Spark, Hadoop and Presto. Companies like Facebook, which deal with petabytes of data, rely on platforms like Presto. Twitter, too, was leveraging Presto until it decided to migrate to Google Cloud. With Dataproc, users can manage, analyse and take full advantage of data and the OSS systems already in use.
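As a sketch of what "OSS on Dataproc" looks like in practice, the function below builds the kind of cluster specification one might pass to the Dataproc API (for example via the google-cloud-dataproc client library) to provision a cluster with the Presto optional component enabled. The project ID, cluster name and machine types are placeholders.

```python
def presto_cluster_config(project_id: str, cluster_name: str) -> dict:
    """Build an illustrative Dataproc cluster spec with Presto enabled.

    Field names follow the Dataproc v1 Cluster resource; the sizes and
    machine types are arbitrary examples.
    """
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {
                "num_instances": 1,
                "machine_type_uri": "n1-standard-4",
            },
            "worker_config": {
                "num_instances": 2,
                "machine_type_uri": "n1-standard-4",
            },
            # Presto ships as an optional component on Dataproc images,
            # so no manual installation is needed on the nodes.
            "software_config": {"optional_components": ["PRESTO"]},
        },
    }

cfg = presto_cluster_config("my-project", "presto-cluster")
print(cfg["config"]["software_config"]["optional_components"])  # → ['PRESTO']
```

Submitting this spec in a create-cluster request would leave Dataproc to handle provisioning, while the Presto queries themselves stay unchanged.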
Apache is no stranger to the changing times. It has a serverless offering of its own called OpenWhisk. Apache OpenWhisk is an open-source, distributed serverless platform that executes functions in response to events at any scale. OpenWhisk manages the infrastructure, servers and scaling using Docker containers, so you can focus on building amazing and efficient applications. With the advantages of data analytics becoming obvious, we can expect rapid growth in serverless offerings.
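An OpenWhisk action is just a function the platform invokes with the event's parameters. The sketch below is a minimal Python action: OpenWhisk calls `main()` with a dict of parameters and expects a JSON-serialisable dict back, while the platform itself worries about the servers and scaling.

```python
# hello.py — a minimal OpenWhisk action written in Python.
# OpenWhisk invokes main() with the event's parameters as a dict
# and returns the resulting dict to the caller as JSON.

def main(params):
    # Parameters arrive from the triggering event or the CLI.
    name = params.get("name", "stranger")
    return {"greeting": f"Hello, {name}!"}
```

With the `wsk` CLI, such an action is deployed with `wsk action create hello hello.py` and invoked with `wsk action invoke hello --param name World --result`; no cluster or server configuration is involved.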
In the serverless world, customers can focus on their workloads instead of infrastructure. The configuration is automatic. “It’s time for OSS to have its turn. This [serverless] next phase of big data OSS will help our customers accelerate time to market, automate optimizations for latency and cost, and reduce investments in the application development cycle so that they can focus more on building and less on maintaining,” promises Google.