How Apache Spark Became A Dominant Force In Analytics

Launched in 2009, Apache Spark has become a dominant big data platform. Its users range from banks, telecommunications and gaming companies to giants like Apple, Facebook, IBM, and Microsoft. Out of the box, Spark can run in a standalone cluster mode that simply requires the Apache Spark framework and a JVM on each machine in the cluster.


Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing.

Spark vs Hadoop

When it comes to big data, Hadoop has been around for quite some time. Spark’s arrival, and the ease with which it integrates with pre-existing frameworks, has made it a serious contender in recent times.

Spark can be found in most Hadoop distributions these days. Its speed and user-friendly nature have made Spark a go-to framework for processing big data, eclipsing MapReduce, the engine that brought Hadoop to prominence.

Spark’s in-memory data engine can perform tasks up to one hundred times faster than MapReduce in certain situations, particularly when compared with multi-stage jobs that require the writing of state back out to disk between stages. Even Apache Spark jobs where the data cannot be completely contained within memory tend to be around 10 times faster than MapReduce.
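To make the in-memory point concrete, here is a minimal sketch of reusing an intermediate dataset across two actions instead of recomputing it or writing it out between stages. The object and method names (`CacheSketch`, `sumAndCountEven`) are hypothetical, and local mode (`local[*]`) is an assumption so the example can run without a cluster. `StorageLevel.MEMORY_AND_DISK` lets partitions that do not fit in memory spill to disk.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  // Returns (sum of the squares of 1..n, how many of those squares are even).
  def sumAndCountEven(spark: SparkSession, n: Int): (Long, Long) = {
    val squares = spark.sparkContext
      .parallelize(1 to n)
      .map(i => i.toLong * i)
      // Keep the computed squares around; spill to disk if memory is short.
      .persist(StorageLevel.MEMORY_AND_DISK)
    try {
      val total = squares.reduce(_ + _)              // first action computes and stores the squares
      val evens = squares.filter(_ % 2 == 0).count() // second action reuses the stored partitions
      (total, evens)
    } finally squares.unpersist()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for experimentation only.
    val spark = SparkSession.builder()
      .appName("cache-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(sumAndCountEven(spark, 1000000))
    finally spark.stop()
  }
}
```

Without the `persist` call, the second action would rerun the `map` step from scratch.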

The Apache Spark API is user-friendly, and much of the complexity that comes with a typical distributed processing engine is hidden behind simple method calls.

What would have taken around 50 lines in MapReduce can be done in only a few lines of Spark.

Here’s an example showing the compactness of Spark:

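// Read the input as a distributed collection of lines
val textFile = sparkSession.sparkContext.textFile("hdfs:///tmp/words")

// Split each line into words, pair each word with 1, then sum the 1s per word
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Write the (word, count) pairs back to HDFS
counts.saveAsTextFile("hdfs:///tmp/words_agg")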
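The word count above assumes an existing `sparkSession` and an HDFS cluster. As a rough sketch for trying the same chain of calls on a laptop, the version below feeds in-memory lines through `parallelize` and collects the result to the driver. The names `WordCountSketch` and `countWords`, and the use of local mode, are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  // Counts words with the same flatMap / map / reduceByKey chain,
  // but on in-memory lines instead of an HDFS file.
  def countWords(spark: SparkSession, lines: Seq[String]): Map[String, Int] =
    spark.sparkContext
      .parallelize(lines)
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect() // bring the (word, count) pairs back to the driver
      .toMap

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so no cluster or HDFS is required.
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .master("local[*]")
      .getOrCreate()
    try {
      val counts = countWords(spark, Seq("to be or not to be", "to do"))
      counts.toSeq.sortBy(_._1).foreach { case (word, n) => println(s"$word\t$n") }
    } finally spark.stop()
  }
}
```

Collecting to the driver is only sensible for small results; for large outputs, `saveAsTextFile` as in the original example is the safer choice.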

By providing bindings to popular languages for data analysis like Python and R, as well as the more enterprise-friendly Java and Scala, Apache Spark allows application developers and data scientists to harness its scalability and speed in an accessible manner.

Moreover, Spark is vendor-neutral: businesses are free to build Spark-based analytics infrastructure without being tied to a particular Hadoop vendor.

Key Features That Put Spark On The Map

  • Apache Spark is built on the concept of the Resilient Distributed Dataset (RDD), a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. The concept of RDD enables traditional map and reduce functionality, but also provides built-in support for joining data sets, filtering, sampling, and aggregation.
  • Spark SQL is focused on the processing of structured data, using a data frame approach borrowed from R and Python (in Pandas). Spark SQL provides a standard interface for reading from and writing to other data stores including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box.
  • Apache Spark also bundles libraries for applying machine learning and graph analysis techniques to data at scale. Spark MLlib includes a framework for creating machine learning pipelines, allowing for easy implementation of feature extraction, selection, and transformation on any structured dataset.
  • Structured Streaming (added in Spark 2.x) is a higher-level API and an easier abstraction for writing streaming applications. In the case of Structured Streaming, the higher-level API essentially allows developers to create infinite streaming data frames and datasets.
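As an illustration of the Spark SQL reader and writer interface described in the list above, the sketch below writes a small data frame to Parquet, reads it back, and aggregates it with a SQL query. The `Sale` record, the `sales` view name, the temporary directory, and the `SparkSqlSketch` object are all hypothetical and chosen only for this example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  // Hypothetical record type used only for this example.
  final case class Sale(region: String, amount: Double)

  // Writes the sales to Parquet, reads them back, and sums amounts per region.
  def totalsByRegion(spark: SparkSession, sales: Seq[Sale]): Map[String, Double] = {
    import spark.implicits._
    // Assumption: a fresh temporary directory stands in for a real data store.
    val path = Files.createTempDirectory("sales-sketch").resolve("sales.parquet").toString
    sales.toDF().write.parquet(path)

    val df = spark.read.parquet(path)
    df.createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
      .collect()
      .map(row => row.getString(0) -> row.getDouble(1))
      .toMap
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for experimentation only.
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()
    try {
      val totals = totalsByRegion(spark, Seq(Sale("north", 10.0), Sale("south", 5.0), Sale("north", 2.5)))
      totals.foreach { case (region, total) => println(s"$region\t$total") }
    } finally spark.stop()
  }
}
```

Swapping `parquet` for `json` or `orc` on the reader and writer follows the same pattern.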

Spark provides a framework for advanced analytics, with tools for accelerated queries, a graph processing engine, and streaming analytics.

The built-in libraries help data scientists with data preparation and interpretation. Spark has moved beyond an SQL-only mindset by working with other languages, paving the way for quicker analysis.

Future Of Spark

With the existing pipeline structure of MLlib, users will be able to construct classifiers in just a few lines of code, as well as apply custom TensorFlow graphs or Keras models to incoming data.
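For a sense of what an MLlib pipeline looks like today, here is a minimal sketch of a text classifier built from a tokenizer, a hashing feature extractor, and logistic regression. The training rows, column names, and hyperparameters are made up for illustration, and `PipelineSketch` is a hypothetical name; the TensorFlow and Keras integration mentioned above is not shown.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  // Fits a three-stage pipeline on a data frame with "text" and "label" columns.
  def train(training: DataFrame): PipelineModel = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(1000)
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for experimentation only.
    val spark = SparkSession.builder()
      .appName("pipeline-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    try {
      // Made-up training data: label 1.0 marks lines that mention "spark".
      val training = Seq(
        ("a b c d e spark", 1.0),
        ("b d", 0.0),
        ("spark f g h", 1.0),
        ("hadoop mapreduce", 0.0)
      ).toDF("text", "label")

      val model = train(training)
      val test = Seq("spark i j k", "l m n").toDF("text")
      model.transform(test).select("text", "prediction").show()
    } finally spark.stop()
  }
}
```

The fitted `PipelineModel` applies every stage in order, so new data only needs the raw `text` column.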

Structured Streaming, meanwhile, is the future of streaming applications on the platform, so if you’re building a new streaming application, you should use it. The Spark team is planning to bring continuous streaming without micro-batching, to enable low-latency responses.

Spark has a faithful community of developers, and new features are added frequently, making it one of the most versatile platforms for data processing.
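A small sketch of what a Structured Streaming application can look like: the same data frame transformation is applied to an unbounded stream, here generated by Spark's built-in `rate` test source and printed by the `console` sink. The `StreamingSketch` and `evenValues` names, the rate of five rows per second, and the ten-second run time are assumptions for illustration; this uses the default micro-batch execution, not the continuous mode mentioned above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object StreamingSketch {
  // Keeps rows whose "value" column is even; works on batch and streaming data frames alike.
  def evenValues(df: DataFrame): DataFrame =
    df.filter(col("value") % 2 === 0)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for experimentation only.
    val spark = SparkSession.builder()
      .appName("streaming-sketch")
      .master("local[*]")
      .getOrCreate()
    try {
      // The "rate" source emits rows with "timestamp" and "value" columns.
      val stream = spark.readStream
        .format("rate")
        .option("rowsPerSecond", 5L)
        .load()

      val query = evenValues(stream).writeStream
        .format("console")
        .outputMode("append")
        .start()

      query.awaitTermination(10000) // let the query run for about ten seconds
      query.stop()
    } finally spark.stop()
  }
}
```

Because `evenValues` takes an ordinary `DataFrame`, the same function can be exercised against static data before it is pointed at a live stream.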

Ram Sagar
I have a master's degree in Robotics and I write about machine learning advancements.
