
10 Hadoop Alternatives that you should consider for Big Data


Over the years, Hadoop has become synonymous with Big Data. Mention big data in any conversation and Hadoop is sure to pop up. But like any evolving technology, Big Data encompasses a wide variety of enablers, Hadoop being just one of them, though the most popular.

Here we list 10 alternatives to Hadoop that have evolved into formidable competitors in the Big Data space.

Also read: 10 Most sought after Big Data Platforms

 

1. Apache Spark


Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
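To see how little ceremony this takes, here is a minimal PySpark word-count sketch; Spark parallelizes the flatMap/map/reduceByKey steps across the cluster's partitions, handling data distribution and fault tolerance implicitly (this assumes a local pyspark installation):

```python
from pyspark.sql import SparkSession

# A minimal sketch; assumes pyspark is installed (pip install pyspark).
spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.sparkContext.parallelize([
    "hadoop is not the only big data tool",
    "spark processes data in memory",
])

# Each transformation runs in parallel across partitions; Spark handles
# data distribution and recovery from worker failures implicitly.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
spark.stop()
```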

Read Hadoop vs Spark: Which is the best data analytics engine?


2. Apache Storm

Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and his team at BackType, the project was open-sourced after being acquired by Twitter. It uses custom-created “spouts” and “bolts” to define information sources and manipulations to allow batch, distributed processing of streaming data. The initial release was on 17 September 2011.

A Storm application is designed as a “topology” in the shape of a directed acyclic graph (DAG), with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level, the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.
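To make the spout/bolt model concrete, here is a minimal word-count topology sketch in Python using the third-party streamparse library (Storm's native API is Java; the class names and the exact streamparse API shown here are assumptions for illustration):

```python
from streamparse import Spout, Bolt, Topology

class SentenceSpout(Spout):
    outputs = ["sentence"]  # the field name of this spout's stream

    def next_tuple(self):
        # A spout injects data into the topology; here, a fixed sentence.
        self.emit(["the cow jumped over the moon"])

class SplitBolt(Bolt):
    outputs = ["word"]

    def process(self, tup):
        # A bolt transforms tuples; here, splitting sentences into words.
        for word in tup.values[0].split():
            self.emit([word])

class WordTopology(Topology):
    # The DAG's edges: the bolt subscribes to the spout's stream.
    sentence_spout = SentenceSpout.spec()
    split_bolt = SplitBolt.spec(inputs=[sentence_spout])
```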


3. Ceph


Ceph, a free-software storage platform, implements object storage on a single distributed computer cluster and provides interfaces for object-, block- and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalability to the exabyte level, and free availability.

Ceph replicates data and makes it fault-tolerant, using commodity hardware and requiring no specific hardware support. As a result of its design, the system is both self-healing and self-managing, aiming to minimize administration time and other costs.
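For the object-storage interface described above, a minimal sketch using Ceph's librados Python binding might look like this (the config path and pool name are assumptions for illustration):

```python
import rados

# Connect using the cluster's configuration file (path is an assumption).
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()

ioctx = cluster.open_ioctx("mypool")         # I/O context bound to a pool
ioctx.write_full("greeting", b"hello ceph")  # store an object by name
print(ioctx.read("greeting"))                # read the object back

ioctx.close()
cluster.shutdown()
```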

On April 21, 2016, the Ceph development team released “Jewel”, the first Ceph release in which CephFS is considered stable. The CephFS repair and disaster-recovery tools are feature-complete, though some functionality (snapshots, multiple active metadata servers) is disabled by default.


 

4. DataTorrent RTS

DataTorrent RTS is an enterprise product built around Apache Apex, a Hadoop-native unified stream and batch processing platform. DataTorrent RTS combines the Apache Apex engine with a suite of enterprise-grade management, monitoring, development, and visualization tools.

The DataTorrent RTS platform enables the creation and management of real-time big data applications that are:

  • highly scalable and performant – millions of events per second per node with linear scalability
  • fault tolerant – automatic recovery with no data or state loss
  • Hadoop native – installs in seconds and works with all existing Hadoop distributions
  • easily developed – write and re-use generic Java code
  • easily integrated – customizable connectors to file, database, and messaging systems
  • easily operable – full suite of management, monitoring, development, and visualization tools

 

5. Disco

Disco is a lightweight, open-source framework for distributed computing based on the MapReduce paradigm.

Disco is powerful and easy to use, thanks to Python. Disco distributes and replicates your data, and schedules your jobs efficiently. Disco even includes the tools you need to index billions of data points and query them in real time.
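A minimal Disco word-count job, adapted from the project's tutorial, looks like this (it assumes a running Disco master; the input URL is the one used in Disco's own documentation):

```python
from disco.core import Job, result_iterator

def map(line, params):
    # Emit (word, 1) for every word in an input line.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    # Group the sorted pairs by word and sum the counts.
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map, reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```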

Disco was born at Nokia Research Center in 2008 to solve real challenges in handling massive amounts of data. It has been actively developed since then by Nokia and many other companies, which use it for a variety of purposes such as log analysis, probabilistic modelling, data mining, and full-text indexing.


 

6. Google BigQuery

BigQuery is Google’s fully managed, petabyte-scale, low-cost enterprise data warehouse for analytics. BigQuery is serverless: there is no infrastructure to manage and you don’t need a database administrator, so you can focus on analyzing data to find meaningful insights using familiar SQL. BigQuery is a powerful Big Data analytics platform used by all types of organizations, from startups to Fortune 500 companies.
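For example, querying one of BigQuery's public datasets with the google-cloud-bigquery Python client takes only a few lines (this assumes Google Cloud credentials and a billing project are already configured):

```python
from google.cloud import bigquery

# Assumes credentials and a default project are set up in the environment.
client = bigquery.Client()

query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""
# Runs the query as a managed job and waits for the results.
for row in client.query(query).result():
    print(row.word, row.total)
```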


 

7. High-Performance Computing Cluster (HPCC)

HPCC (High-Performance Computing Cluster), also known as DAS (Data Analytics Supercomputer), is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.


 

8. Hydra 

Hydra is a distributed data processing and storage system which ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).

You can run Hydra from the command line to slice and dice that Apache access log you have sitting around (or that gargantuan CSV file). Or, if terabytes per day is your cup of tea, run a Hydra cluster that supports your job with resource sharing, job management, distributed backups, data partitioning, and efficient bulk file transfer.


 

9. Pachyderm

Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing. Data and code were meant to be unified; containerizing them together unlocks reproducibility and collaboration for your team.

Running your code in a container and accessing the data through Pachyderm’s version-control system (PFS) guarantees that the analysis is reproducible. And because it’s just a container, you can use any language or libraries you want.

Reproducibility is a prerequisite for true collaboration. By enabling reproducibility with containers, Pachyderm allows each team member to develop data analysis locally and then seamlessly push the same code into a production cluster.
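As a rough sketch, versioned data access might look like the following with the python_pachyderm client; the method names follow the project's README but may differ between client versions, so treat them as assumptions:

```python
# A rough sketch with the (assumed) python_pachyderm client API.
import python_pachyderm

client = python_pachyderm.Client()  # connects to a local pachd by default

client.create_repo("logs")          # a versioned data repository in PFS

# Each commit is an immutable snapshot of the repo's contents.
with client.commit("logs", "master") as commit:
    client.put_file_bytes(commit, "/day1.txt", b"error: disk full\n")

# Reading at a specific branch/commit reproduces exactly the data a
# pipeline saw, which is what makes downstream analysis reproducible.
data = b"".join(client.get_file(("logs", "master"), "/day1.txt"))
print(data)
```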


 

10. Presto

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Presto allows querying data where it lives, including Hive, Cassandra, relational databases, or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.
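For instance, using the presto-python-client DBAPI module, a single query can join a Hive table against a MySQL table; the host, catalogs, schemas, and table names below are hypothetical:

```python
import prestodb  # pip install presto-python-client

# Connection details and catalog/schema/table names are assumptions.
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# One federated query spanning two connectors: Hive and MySQL.
cur.execute("""
    SELECT u.country, COUNT(*) AS views
    FROM hive.web.page_views v
    JOIN mysql.crm.users u ON v.user_id = u.id
    GROUP BY u.country
""")
for row in cur.fetchall():
    print(row)
```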

Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between fast analytics on an expensive commercial solution and a slow “free” solution that requires excessive hardware.


Bhasker Gupta

Bhasker is a techie turned media entrepreneur. Bhasker started AIM in 2012, out of a desire to speak about emerging technologies and their commercial, social and cultural impact. Earlier, Bhasker worked as Vice President at Goldman Sachs. He is a B.Tech from the Indian Institute of Technology, Varanasi and an MBA from the Indian Institute of Management, Lucknow.