Earlier this year Yahoo open sourced a new project called TensorFlowOnSpark, a pairing of Spark and TensorFlow that makes the deep learning framework more attractive to developers, especially those creating models that need to run on large computing clusters. This integration of big data and machine learning brings support for the TensorFlow deep learning library to Spark clusters.
Researchers are enthusiastic about the newly available tool, which has eased their work and helps them get analytics results faster. For the uninitiated, this article covers the differences and similarities between TensorFlow and Spark, and why their pairing gives data scientists a reason to celebrate.
Spark vs. TensorFlow = Big Data vs. Machine Learning Framework?
Apache Spark, or Spark as it is popularly known, is an open source cluster computing framework that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Written primarily in Scala, the Spark codebase was originally developed at the University of California, Berkeley's AMPLab and was later donated to the Apache Software Foundation.
This cluster computing framework speeds up computation through in-memory processing, and its large ecosystem makes integration easy. A Spark cluster can be used for tasks such as machine learning and graph computation by parallelizing them. Keeping data in memory reduces the overhead of disk reads and writes.
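The idea of parallelizing a task across a cluster can be sketched in plain Python. This is only an illustration, not Spark's API: the helper names (`split_into_partitions`, `sum_of_squares`, `parallel_sum_of_squares`) are hypothetical, and a thread pool stands in for the cluster's worker nodes. The data is split into partitions, each partition is processed independently in memory, and the partial results are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous chunks of roughly equal size."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions each take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def sum_of_squares(partition):
    """Work done independently on one partition (the 'map' step)."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, num_partitions=4):
    """Process partitions concurrently, then combine the partial results."""
    partitions = split_into_partitions(data, num_partitions)
    # Threads stand in for cluster workers (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(sum_of_squares, partitions))
    # The 'reduce' step: merge one small result per partition.
    return sum(partial_sums)


total = parallel_sum_of_squares(list(range(1, 101)))
print(total)
```

In a real cluster the partitions would live in the memory of different machines, so only the small partial results travel over the network.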
This fast, general engine for large-scale data processing offers high speed and ease of use. It combines SQL, streaming and complex analytics, and it runs in many environments, including Hadoop, Mesos and the cloud.
Spark, essentially a big data framework, has made it possible for many companies that generate huge amounts of user data to process it efficiently and offer recommendations at scale.
TensorFlow, on the other hand, is a software library developed by Google for numerical computation and neural networks. It expresses a computation as a data flow graph, in which nodes denote operations and edges denote the multidimensional data arrays (tensors) that flow between them. Essentially a machine learning framework, it helps people create deep learning models without needing the full skill set of a machine learning specialist.
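A data flow graph of this kind can be sketched in a few lines of plain Python. This is a toy illustration, not TensorFlow's API: the `Node` class and the `constant`, `add`, `multiply` and `evaluate` helpers are hypothetical names, and flat Python lists stand in for tensors. Building the graph only records the operations; nothing is computed until the graph is evaluated.

```python
class Node:
    """One operation in the graph; its inputs are the incoming edges."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = list(inputs)
        self.value = value  # only used by constant nodes


def constant(value):
    return Node("const", value=value)


def add(a, b):
    return Node("add", [a, b])


def multiply(a, b):
    return Node("mul", [a, b])


def evaluate(node):
    """Recursively compute a node by first evaluating its input edges."""
    if node.op == "const":
        return node.value
    left, right = (evaluate(n) for n in node.inputs)
    if node.op == "add":
        return [x + y for x, y in zip(left, right)]
    if node.op == "mul":
        return [x * y for x, y in zip(left, right)]
    raise ValueError(f"unknown operation: {node.op}")


# Build the graph for (a + b) * c; no arithmetic happens yet.
a = constant([1.0, 2.0])
b = constant([3.0, 4.0])
c = constant([2.0, 2.0])
graph = multiply(add(a, b), c)

# Running the graph pushes the arrays along the edges.
result = evaluate(graph)
print(result)
```

Separating graph construction from execution is what lets a framework inspect the graph, optimize it, or place different nodes on different devices before any data is processed.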
An API from Google for machine learning and deep learning computation, TensorFlow also comes with TensorBoard, a tool that gives a graphical representation of the computation flow. The API helps users write complex neural network designs and tune them according to activation values.
In summary, it could be said that Apache Spark is a data processing framework, whereas TensorFlow is used for custom deep learning and neural network design. So if a user wants to apply deep learning algorithms, TensorFlow is the answer, and for data processing, it is Spark.
Can Spark improve deep learning pipelines with TensorFlow?
While the two have existed separately as widely used tools, bringing deep learning and big data together makes it easier to deploy TensorFlow on existing clusters, such as those already running Spark.
As experts see it, Spark and machine learning go hand in hand. Deep learning in particular depends on large amounts of compute, and this is where Spark and TensorFlow have an opportunity to join forces. Companies like Yahoo had earlier explored tools such as SparkNet and TensorFrames, but the much desired success came with TensorFlowOnSpark, which lets developers quickly modify their existing programs and get the desired results.
TensorFlowOnSpark (TFoS) is designed to run on existing Spark and Hadoop clusters and to work with Spark libraries such as Spark SQL and MLlib, Spark's machine learning library, so that developers can create models without getting lost in the details.
One benefit of combining the two is that, as a clustered machine learning framework, TensorFlowOnSpark runs faster and can use remote direct memory access (RDMA). The original TensorFlow project doesn't support RDMA as a core feature.
TensorFlowOnSpark solves the problem of deploying deep learning on big data clusters in a distributed form. It is not a completely new deep learning framework but an upgrade to existing ones, which required developing multiple programs to deploy intelligence on big data clusters. By combining TensorFlow and Spark, it cuts out the unwanted system complexity and end-to-end learning latency that came with those multiple programs.
Over the years, TensorFlow has progressed exceptionally well and has been picked by the likes of IBM as the deep learning system for its custom machine learning hardware. It faces competition from the likes of MXNet, a deep learning system backed by Amazon that is speculated to scale better than its rivals across multiple nodes, so it will be interesting to see how TensorFlowOnSpark compares. As of now, it has been seen to run smoothly on big clusters and is convenient to work with.