Launched in 2009, Apache Spark has become the dominant big data platform. Spark's diverse portfolio ranges from assisting banks, telecommunications, and gaming companies to serving giants like Apple, Facebook, IBM, and Microsoft. Out of the box, Spark can run in a standalone cluster mode that simply requires the Apache Spark framework and a JVM on each machine in the cluster.
Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing.
Spark vs Hadoop
When it comes to big data, Hadoop has been around for quite some time. The advent of Spark, and its ability to integrate with pre-existing frameworks, has made it a strong contender in recent years.
Spark can be found in most Hadoop distributions these days. Its speed and user-friendly nature have made Spark a go-to framework for processing big data, eclipsing MapReduce, the engine that originally brought Hadoop to prominence.
Spark’s in-memory data engine can perform tasks up to one hundred times faster than MapReduce in certain situations, particularly when compared with multi-stage jobs that require the writing of state back out to disk between stages. Even Apache Spark jobs where the data cannot be completely contained within memory tend to be around 10 times faster than MapReduce.
The Apache Spark API is user-friendly, and much of the complexity that comes with a typical distributed processing engine is hidden behind simple method calls.
What would have taken around 50 lines in MapReduce could be performed with only a few lines with Spark.
Here’s an example showing the compactness of Spark:
val textFile = sparkSession.sparkContext.textFile("hdfs:///tmp/words")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///tmp/words_agg")
By providing bindings to popular languages for data analysis like Python and R, as well as the more enterprise-friendly Java and Scala, Apache Spark allows application developers and data scientists to harness its scalability and speed in an accessible manner.
Moreover, Spark is vendor-neutral, i.e., businesses are free to build Spark-based analytics infrastructure without tying themselves to a particular Hadoop vendor.
Key Features That Put Spark On The Map
- Apache Spark is built on the concept of the Resilient Distributed Dataset (RDD), a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. The concept of RDD enables traditional map and reduce functionality, but also provides built-in support for joining data sets, filtering, sampling, and aggregation.
- Spark SQL is focused on the processing of structured data, using a data frame approach borrowed from R and Python (Pandas). Spark SQL provides a standard interface for reading from and writing to other data stores including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box.
- Apache Spark also bundles libraries for applying machine learning and graph analysis techniques to data at scale. Spark MLlib includes a framework for creating machine learning pipelines, allowing for easy implementation of feature extraction, selection, and transformation on any structured dataset.
- Structured Streaming (added in Spark 2.x) is a higher-level API and easier abstraction for writing applications. With Structured Streaming, the higher-level API essentially allows developers to create infinite streaming data frames and datasets.
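The Spark SQL data frame approach described above can be sketched as follows. This is a minimal illustration, not a production recipe: the JSON path, view name, and column names are assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSqlSketch")
  .getOrCreate()

// Read structured data; the JSON path is an assumed example location.
val people = spark.read.json("hdfs:///tmp/people.json")

// Register the DataFrame as a temporary view and query it with plain SQL.
people.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

// Write the result out as Parquet, one of the formats supported out of the box.
adults.write.parquet("hdfs:///tmp/adults.parquet")
```

The same pipeline could just as easily read from Hive or JDBC and write ORC; only the `format` of the reader and writer changes.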
Spark provides a framework for advanced analytics, with tools for accelerated queries, a graph processing engine, and streaming analytics.
The built-in libraries help data scientists with data preparation and interpretation. Spark has shed the SQL-only mindset through its support for multiple languages, paving the way for quicker analysis.
Future Of Spark
With the existing pipeline structure of MLlib, users can construct classifiers in just a few lines of code, as well as apply custom TensorFlow graphs or Keras models to incoming data.
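A pipeline of the kind mentioned above can be sketched in a few lines. This assumes a DataFrame named `training` with "text" and "label" columns, which is a hypothetical input for the example:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Feature extraction: split text into words, then hash words into feature vectors.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// A simple classifier as the final pipeline stage.
val lr = new LogisticRegression().setMaxIter(10)

// Chain the stages and fit the whole pipeline to a labelled DataFrame.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training) // `training` is an assumed DataFrame
```

Because the fitted model is itself a single object, the same feature-extraction steps are applied automatically at prediction time.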
Structured Streaming is the future of streaming applications on the platform, so if you're building a new streaming application, you should use Structured Streaming. The Spark team is also planning to bring continuous streaming without micro-batching, to enable low-latency responses.
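To see how Structured Streaming treats a stream as an infinite data frame, here is a streaming variant of the earlier word count, as a sketch; the socket source, host, and port are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StreamingWordCount").getOrCreate()
import spark.implicits._

// Treat a socket source as an unbounded DataFrame of text lines.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same word-count logic as the batch example, applied to a streaming frame.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Continuously print updated counts; Spark executes this as incremental micro-batches.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```

Note how the transformation code is nearly identical to the batch version; only the source and sink change, which is the central design idea of Structured Streaming.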
Spark has a faithful community of developers, and new features are added frequently, making it one of the most versatile platforms for data processing.