In February 2020, Gartner released its Magic Quadrant for Data Science and Machine Learning Platforms. It was a pleasant surprise to see Databricks amongst the leaders. Interestingly, it moved from the Visionaries quadrant to the Leaders quadrant within a year.
It is a well-deserved placement, since Databricks is steadily growing into a major analytics vendor. One might wonder what drives this growth, given that giants like Google and Microsoft are in the Visionaries quadrant while the grand old IBM is still a Challenger. The primary reason Databricks is in the Leaders quadrant is its Unified Analytics platform. This brings us to our first and foremost point:
1. Unified Analytics platform
If I were asked for a single reason to choose Databricks over anything else, this would be it: it is a unified analytics platform. A team that builds a state-of-the-art analytics system will consist of Data Engineers, Data Analysts, Data Scientists and Machine Learning Engineers. The Data Engineers can build cutting-edge data pipelines by implementing data architectures like Lambda Architecture and Delta Architecture.
Furthermore, Data Analysts can use the built-in visuals or connect to Databricks from tools like Power BI to analyze the data, while Data Scientists can build ML models. Lastly, Machine Learning Engineers can use tools like MLflow to manage the end-to-end ML lifecycle.
This makes Databricks a one-stop solution for the entire analytics team, as opposed to giant vendors like Microsoft, where multiple services need to be combined to build an end-to-end analytics system. Stitching services together leads to high coupling and, in turn, low cohesion, which drives up the cost of integration and maintenance. I admit that tools like Azure Synapse Analytics show a similar promise. However, Synapse is still in its nascent stages. So what makes Databricks such a versatile platform? The answer is simplified Apache Spark!
2. Apache Spark simplified
I clearly remember the days when installing Spark was a nightmare. Spinning up a Spark cluster on cloud services like Azure HDInsight wasn't easy either. With Databricks, however, creating and using a Spark cluster is a matter of a few clicks. Furthermore, hosting on AWS and Azure has made it easily accessible. A key advantage of Databricks is autoscaling: clusters are scaled automatically based on the compute requirements. This reduces operational and maintenance costs to a great extent.
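As a rough illustration of what autoscaling looks like outside the UI, here is a minimal sketch of a cluster specification with an autoscale range. It assumes the field names of the Databricks Clusters API as I understand them (`cluster_name`, `spark_version`, `node_type_id`, `autoscale` with `min_workers` and `max_workers`). The helper name, the cluster name and the placeholder values are hypothetical, and the range check is only a local sanity check, not a documented API rule.

```python
import json


def autoscaling_cluster_spec(name, min_workers, max_workers):
    """Build a cluster spec that asks for an autoscaling worker range.

    Hypothetical helper for illustration; it only builds a dictionary and
    does not call any Databricks service.
    """
    # Local sanity check (an assumption of this sketch, not an API rule).
    if not 0 < min_workers <= max_workers:
        raise ValueError("expected 0 < min_workers <= max_workers")
    return {
        "cluster_name": name,
        # Placeholders: fill in a runtime version and node type that are
        # valid for your workspace and cloud provider.
        "spark_version": "<runtime-version>",
        "node_type_id": "<node-type>",
        # An autoscale range is given instead of a fixed worker count.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


if __name__ == "__main__":
    print(json.dumps(autoscaling_cluster_spec("etl-cluster", 2, 8), indent=2))
```

The point of the sketch is the shape of the request: instead of a fixed cluster size, you state a lower and an upper bound and let the platform pick a size in between.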
3. Multilanguage and Multiple platform support
Since Databricks is based on Spark, all the benefits of Apache Spark, the modern-day in-memory distributed computing platform, come included. For instance, Spark's multi-language support is available by default. Hence, as of now, four programming languages, namely Python, Scala, R and SQL, form the core of the platform. A key advantage that Databricks offers on top of this is language interoperability. This is particularly handy when traditional ETL developers move to Big Data environments like Databricks. For instance, a developer might read data from a data store into a PySpark data frame, use the full power of Spark SQL on it, and get the result of the SQL query back as a PySpark data frame to write to a data store. This helps us get the best of both the traditional and the big data worlds.
Moreover, there are many Data Engineers and Data Scientists who are comfortable with a particular toolset. For example, Informatica, a popular ETL tool, has thousands of developers. These developers are candidate data engineers. To ease their transition to the big data world while letting them retain their skill set, Databricks has partnered with Informatica for data ingestion into Delta Lake.
Similarly, MATLAB is a popular tool for creating models. However, it has its own language and environment, making it difficult for its users to migrate to Spark. Hence, Databricks has come up with a MATLAB integration, bringing out the best of the two tools. Although this is in preview, it holds a lot of promise.
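The round trip between Python and SQL described above can be sketched as follows. This is a minimal illustration, not a prescribed pattern: the function name, the paths, the `orders` view and its columns are all hypothetical. It assumes `spark` is a SparkSession, which Databricks notebooks provide as a predefined variable, and it assumes Parquet as the storage format.

```python
# Hypothetical query: the "orders" view and its columns are placeholders.
TOTALS_QUERY = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id
"""


def summarize_orders(spark, source_path, target_path):
    """Read with the DataFrame API, transform with Spark SQL, write back."""
    # 1. Read the source data into a PySpark DataFrame.
    orders_df = spark.read.parquet(source_path)

    # 2. Register the DataFrame as a temporary view so SQL can refer to it.
    orders_df.createOrReplaceTempView("orders")

    # 3. spark.sql returns a DataFrame, so the SQL result is immediately
    #    available to Python code again.
    totals_df = spark.sql(TOTALS_QUERY)

    # 4. Write the result back to a data store.
    totals_df.write.mode("overwrite").parquet(target_path)
    return totals_df
```

In a notebook this would be called as `summarize_orders(spark, "<source-path>", "<target-path>")`. A developer coming from an ETL background can keep the transformation logic in SQL and use Python only for reading and writing.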
4. Rich Notebooks and Dashboards
The icing on the cake is the rich UI experience of Databricks. The use of notebooks has risen sharply in the Data Science and Data Engineering community. Databricks gives us the same notebook experience along with rich visualizations embedded in it. This gives it extra appeal for Data Scientists, since they can skip some of the code needed to create visuals for data analysis.
With all the above advantages, is it a surprise that Databricks has made it to the coveted position?