Over years, Hadoop has become synonymous to Big Data. Talk about big data in any conversation and Hadoop is sure to pop-up. But like any evolving technology, Big Data encompasses a wide variety of enablers, Hadoop being just one of those, though the most popular one.
Here we list down 10 alternatives to Hadoop that have evolved as a formidable competitor in Big Data space.
Also read, 10 Most sought after Big Data Platforms
1. Apache Spark
Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
2. Apache Storm
Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter. It uses custom created “spouts” and “bolts” to define information sources and manipulations to allow batch, distributed processing of streaming data. The initial release was on 17 September 2011.
A Storm application is designed as a “topology” in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.
Ceph, a free-software storage platform, implements object storage on a single distributed computer cluster, and provides interfaces for object-, block- and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalable to the exabyte level, and freely available.
Ceph replicates data and makes it fault-tolerant, using commodity hardware and requiring no specific hardware support. As a result of its design, the system is both self-healing and self-managing, aiming to minimize administration time and other costs.
On April 21, 2016, the Ceph development team released “Jewel”, the first Ceph release in which CephFS is considered stable. The CephFS repair and disaster recovery tools are feature-complete (snapshots, multiple active metadata servers and some other functionality is disabled by default).
DataTorrent RTS is an enterprise product built around Apache Apex, a Hadoop-native unified stream and batch processing platform. DataTorrent RTS combines Apache Apex engine with a set of enterprise-grade management, monitoring, development, and visualization tools.
DataTorrent RTS platform enables creation and management of real-time big data applications in a way that is
- highly scalable and performant – millions of events per second per node with linear scalability
- fault tolerant – automatic recovery with no data or state loss
- Hadoop native – installs in seconds and works with all existing Hadoop distributions
- easily developed – write and re-use generic Java code
- easily integrated – customizable connectors to file, database, and messaging systems
- easily operable – full suite of management, monitoring, development, and visualization tools
Disco is powerful and easy to use, thanks to Python. Disco distributes and replicates your data, and schedules your jobs efficiently. Disco even includes the tools you need to index billions of data points and query them in real-time.
Disco was born in Nokia Research Center in 2008 to solve real challenges in handling massive amounts of data. Disco has been actively developed since then by Nokia and many other companies who use it for a variety of purposes, such as log analysis, probabilistic modelling, data mining, and full-text indexing.
BigQuery is Google’s fully managed, petabyte scale, low cost enterprise data warehouse for analytics. BigQuery is serverless. There is no infrastructure to manage and you don’t need a database administrator, so you can focus on analyzing data to find meaningful insights using familiar SQL. BigQuery is a powerful Big Data analytics platform used by all types of organizations, from startups to Fortune 500 companies.
HPCC (High-Performance Computing Cluster), also known as DAS (Data Analytics Supercomputer), is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.
Hydra is a distributed data processing and storage system which ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).
You can run hydra from the command line to slice and dice that Apache access log you have sitting around (or that gargantuan csv file). Or if terabytes per day is your cup of tea run a Hydra Cluster that supports your job with resource sharing, job management, distributed backups, data partitioning, and efficient bulk file transfer.
Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing. Data and Code were meant to be unified. Containerizing them together unlocks Reproducibility and Collaboration for your team.
Running your code in a container and accessing the data through Pachyderm’s version control system (PFS) guarantees that the analysis is Reproducible. And because it’s just a container, you can use any language or libraries you want.
Reproducibility is the requirement for true Collaboration. By enabling Reproducibility with containers, Pachyderm allows each team member to develop data analysis locally and then seamlessly push the same code into a production cluster.
Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.
Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow “free” solution that requires excessive hardware.
Provide your comments below
Bhasker is a Data Science evangelist and practitioner with proven record of thought leadership and incubating analytics practices for various organizations. With over 16 years of experience in the area of Business Analytics, he is well recognized as an expert within the industry. Earlier, Bhasker worked as Vice President at Goldman Sachs. He is B.Tech from Indian Institute of Technology, Varanasi and MBA from Indian Institute of Management, Lucknow.