8 Scala Libraries For Data Science In 2021

- Apache Spark MLlib & ML - DeepLearning4J - BigDL - H2O Sparkling Water - Conjecture - Akka - Spray - Slick

Scala is a programming language that combines object-oriented and functional programming. It runs on the Java Virtual Machine (JVM) and interoperates with Java. Many developers prefer Scala over Java because the same programmes can often be written in significantly fewer lines.

Scala’s rich feature set supports expressive code and offers efficient performance.
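To make the mix of the two styles concrete, here is a minimal sketch that uses only the Scala standard library. The `Shape` hierarchy and the `ShapeStats` helper are hypothetical names invented for this illustration.

```scala
// Object-oriented side: a small class hierarchy with a shared interface.
sealed trait Shape { def area: Double }
final case class Circle(radius: Double) extends Shape {
  def area: Double = math.Pi * radius * radius
}
final case class Rect(width: Double, height: Double) extends Shape {
  def area: Double = width * height
}

object ShapeStats {
  // Functional side: a higher-order function over an immutable collection.
  def totalArea(shapes: Seq[Shape]): Double = shapes.map(_.area).sum

  // Pattern matching takes the case classes apart without casts or getters.
  def describe(shape: Shape): String = shape match {
    case Circle(r)  => s"circle of radius $r"
    case Rect(w, h) => s"rectangle $w x $h"
  }

  def main(args: Array[String]): Unit = {
    val shapes = Seq(Rect(2.0, 3.0), Rect(1.0, 1.0), Circle(1.0))
    shapes.foreach(s => println(f"${describe(s)}: area ${s.area}%.2f"))
    println(f"total area: ${totalArea(shapes)}%.2f")
  }
}
```

An equivalent Java version would typically need explicit constructors, accessors and equality methods for each class, which is where much of the line-count difference comes from.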

We have listed eight Scala libraries for data scientists to use: 

Apache Spark MLlib & ML 

Built on Apache Spark, MLlib is a scalable machine learning library. It consists of algorithms and utilities, including regression, classification, collaborative filtering, clustering and dimensionality reduction, as well as underlying optimisation primitives.

MLlib provides APIs for Scala, Java, Python and R, and consists of two packages:

  • MLlib: The original RDD-based package, comprising the machine learning algorithms and utilities listed above. It also covers basic statistics, including correlations, hypothesis testing and random data generation, along with linear algebra and data-handling utilities.

MLlib fits into Spark’s APIs and interoperates with NumPy in Python and with R libraries. MLlib’s goal is to make practical machine learning scalable and easy.

  • ML: The more recent package, introduced in Spark 1.2, provides high-level APIs that help users create practical machine learning pipelines. It operates on DataFrames and Datasets.
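As a rough illustration of the pipeline idea in the ML package, the sketch below chains a tokenizer, a feature hasher and a logistic regression into one `Pipeline`. It assumes the `spark-sql` and `spark-mllib` artifacts are on the classpath and runs Spark in local mode; the object name, column names and toy sentences are made up for the example.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  // Three stages: split text into words, hash words into a feature vector, classify.
  def buildPipeline(): Pipeline = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(1000)
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PipelineSketch")
      .master("local[*]") // assumption: local mode, no cluster required
      .getOrCreate()

    // Toy labelled documents, invented for illustration.
    val training = spark.createDataFrame(Seq(
      (0L, "spark makes pipelines easy", 1.0),
      (1L, "hadoop mapreduce job", 0.0),
      (2L, "spark mllib logistic regression", 1.0),
      (3L, "plain old batch job", 0.0)
    )).toDF("id", "text", "label")

    // fit() runs every stage in order and returns a reusable PipelineModel.
    val model = buildPipeline().fit(training)

    val test = spark.createDataFrame(Seq(
      (4L, "spark job"),
      (5L, "mapreduce")
    )).toDF("id", "text")
    model.transform(test).select("id", "text", "prediction").show()

    spark.stop()
  }
}
```

The fitted model applies the same tokenising and hashing steps to new data, so training and scoring cannot drift apart.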


DeepLearning4J

DeepLearning4J, or DL4J, is an open-source, distributed deep learning library for Java and Scala. An Eclipse Foundation project, DL4J takes advantage of distributed computing frameworks, namely Apache Spark and Hadoop, to accelerate training.

The libraries are maintained by the Konduit team. While DL4J is written in Java, it is compatible with other JVM languages such as Scala, Clojure and Kotlin. Its underlying computations are written in C, C++ and CUDA.

DL4J allows developers to compose deep neural nets from various shallow nets, each of which forms a ‘layer’. This lets them combine variational autoencoders, sequence-to-sequence autoencoders, convolutional nets or recurrent nets as required.
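The layer-by-layer composition can be sketched from Scala as follows. This is a minimal example that assumes a 1.0.0-beta-era DL4J (`deeplearning4j-core` plus an ND4J backend) on the classpath; it stacks plain dense layers rather than the autoencoder or convolutional layers mentioned above, and the layer sizes are arbitrary.

```scala
import org.deeplearning4j.nn.conf.{MultiLayerConfiguration, NeuralNetConfiguration}
import org.deeplearning4j.nn.conf.layers.{DenseLayer, OutputLayer}
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.nd4j.linalg.activations.Activation
import org.nd4j.linalg.learning.config.Adam
import org.nd4j.linalg.lossfunctions.LossFunctions.LossFunction

object Dl4jSketch {
  // Each .layer(...) call appends one layer to the stack.
  def buildConf(): MultiLayerConfiguration =
    new NeuralNetConfiguration.Builder()
      .seed(42)
      .updater(new Adam(1e-3))
      .list()
      .layer(new DenseLayer.Builder().nIn(4).nOut(16).activation(Activation.RELU).build())
      .layer(new DenseLayer.Builder().nIn(16).nOut(16).activation(Activation.RELU).build())
      .layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD)
        .nIn(16).nOut(3).activation(Activation.SOFTMAX).build())
      .build()

  def main(args: Array[String]): Unit = {
    val net = new MultiLayerNetwork(buildConf())
    net.init() // allocates and initialises the parameters
    println(s"layers: ${net.getnLayers}, parameters: ${net.numParams()}")
  }
}
```

Swapping a layer type changes only the corresponding builder call; the surrounding configuration stays the same.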


BigDL

Intel’s BigDL is a distributed deep learning library for Apache Spark. With it, developers can write deep learning applications as standard Spark programmes that run directly on top of Spark or Hadoop clusters. It provides deep learning support that includes numeric computing and high-level neural networks. Developers can also load pre-trained Torch or Caffe models into Spark programmes using BigDL.

BigDL scales out to large data analytics workloads by leveraging Apache Spark, together with its own implementations of synchronous SGD and all-reduce communication on Spark.
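To show what synchronous SGD with an all-reduce step means, here is a single-process sketch in plain Scala. It does not use BigDL's API: the "workers" are just data shards in one JVM, the all-reduce is simulated by averaging the per-shard gradients, and the one-parameter model y ≈ w · x and all names are invented for the illustration.

```scala
object SyncSgdSketch {
  type Sample = (Double, Double) // (x, y)

  // Gradient of the mean squared error for y ≈ w * x on one worker's shard.
  def localGradient(w: Double, shard: Seq[Sample]): Double =
    shard.map { case (x, y) => 2.0 * (w * x - y) * x }.sum / shard.size

  // One synchronous step: every worker sees the same w, the gradients are
  // combined, and a single shared update is applied.
  def step(w: Double, shards: Seq[Seq[Sample]], lr: Double): Double = {
    val grads = shards.map(localGradient(w, _)) // would run in parallel on a cluster
    val avgGrad = grads.sum / grads.size        // stands in for the all-reduce
    w - lr * avgGrad
  }

  // Repeats the synchronous step, starting from w = 0.
  def train(shards: Seq[Seq[Sample]], lr: Double, steps: Int): Double =
    (1 to steps).foldLeft(0.0)((w, _) => step(w, shards, lr))

  def main(args: Array[String]): Unit = {
    val data = (1 to 8).map(i => (i.toDouble, 3.0 * i)) // true slope is 3
    val shards = data.grouped(2).toSeq                   // four simulated workers
    println(f"learned w = ${train(shards, lr = 0.01, steps = 200)}%.4f")
  }
}
```

Because every worker waits for the averaged gradient before the next step, all replicas hold identical weights at all times, which is the defining property of the synchronous variant.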

H2O Sparkling Water

Sparkling Water integrates H2O’s fast, scalable machine learning engine with Spark. It provides utilities to publish Spark data structures (RDDs, DataFrames and Datasets) as H2O frames, and vice versa. It also provides a DSL for using Spark data structures as input to H2O’s algorithms, basic building blocks for creating machine learning applications, and a Python interface that enables use of Sparkling Water directly from PySpark.


Conjecture

Created by Etsy, Conjecture is a framework for building machine learning models in Hadoop using the Scalding DSL. It aims to enable the development of statistical models as viable components in a wide range of product settings. Conjecture’s applications include classification and categorisation, recommender systems, ranking, filtering and regression.

Conjecture focuses on flexibility and can handle a wide range of inputs. Its integration with Hadoop and the Scalding DSL lets it handle very large amounts of data.


Akka

Written in Scala, the Akka toolkit is used to build concurrent, distributed and resilient message-driven applications in both Scala and Java.

Akka is built around the actor model, in which an ‘actor’ is similar to an object in an object-oriented model. The key difference is that the actor model is specifically designed for concurrency: actors keep their state private and interact only by exchanging asynchronous messages. Akka creates a layer between the actors and the underlying system.
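A minimal sketch of an actor, using Akka's classic actor API and assuming `akka-actor` 2.6.x on the classpath. The `Counter` actor and its two messages are hypothetical, chosen only to show private state being changed through messages.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

object CounterProtocol {
  case object Increment
  case object GetCount
}

class Counter extends Actor {
  import CounterProtocol._
  // Private state: only touched while this actor processes a message,
  // so no locks are needed.
  private var count = 0

  def receive: Receive = {
    case Increment => count += 1
    case GetCount  => sender() ! count
  }
}

object ActorSketch {
  import CounterProtocol._

  // Sends `increments` fire-and-forget messages, then asks for the total.
  def runCounter(increments: Int): Int = {
    val system = ActorSystem("sketch")
    try {
      val counter = system.actorOf(Props[Counter](), "counter")
      (1 to increments).foreach(_ => counter ! Increment)
      implicit val timeout: Timeout = Timeout(3.seconds)
      Await.result((counter ? GetCount).mapTo[Int], 3.seconds)
    } finally {
      system.terminate()
    }
  }

  def main(args: Array[String]): Unit =
    println(s"count = ${runCounter(5)}")
}
```

Callers never see the `count` field directly; the only way to read or change it is to send the actor a message, which is what makes the model safe under concurrency.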

Recently, the Akka team released Akka 2.6.15, which closed 21 issues. Akka adheres to the Reactive Manifesto: it is event-driven, scalable, resilient and responsive.


Spray

Spray is a suite of lightweight Scala libraries that provides client- and server-side REST/HTTP support on top of Akka.

Ever since its inception, rather than building application cores, Spray has been focused on providing tools for building integration layers. It offers asynchronous, non-blocking, actor-based, high-performance request processing. Its internal Scala DSL provides a way of defining web service behaviour, along with efficient and convenient testing capabilities.
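The routing DSL can be sketched roughly as below. This assumes Spray 1.3.x (`spray-routing` and `spray-can`) with a matching Akka 2.3.x and an older Scala version that Spray supports; the `ping` path, the port and the object name are arbitrary choices for the example.

```scala
import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

object SpraySketch extends App with SimpleRoutingApp {
  // The HTTP layer runs on top of an Akka actor system.
  implicit val system: ActorSystem = ActorSystem("spray-sketch")

  // Nested directives describe the service: match the path, then the method,
  // then complete the request with a response body.
  startServer(interface = "localhost", port = 8080) {
    path("ping") {
      get {
        complete("pong")
      }
    }
  }
}
```

With the server running, a GET request to `http://localhost:8080/ping` should be answered by the `complete` directive. Note that Spray is no longer actively developed and has been succeeded by Akka HTTP, which uses a very similar routing DSL.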


Slick

Scala Language-Integrated Connection Kit, or Slick, is a modern database query and access library for Scala. It allows developers to work with stored data almost as if they were using Scala collections, while retaining full control over when database access happens and which data is transferred.

Slick also features an extensible query compiler that can generate code for different backends. It creates and executes queries for databases such as H2, MySQL and PostgreSQL.
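The collection-like query style can be sketched as follows, assuming Slick 3.x and the H2 driver on the classpath. The `COFFEES` table, its rows and the in-memory database name are invented for the example.

```scala
import slick.jdbc.H2Profile.api._
import scala.concurrent.Await
import scala.concurrent.duration._

// Table mapping: each row is a (name, price) tuple.
class Coffees(tag: Tag) extends Table[(String, Double)](tag, "COFFEES") {
  def name  = column[String]("NAME", O.PrimaryKey)
  def price = column[Double]("PRICE")
  def *     = (name, price)
}

object SlickSketch {
  val coffees = TableQuery[Coffees]

  // Creates a throwaway in-memory database, loads three rows and returns
  // the names of the coffees cheaper than maxPrice, sorted by name.
  def cheapCoffees(maxPrice: Double): Seq[String] = {
    val db = Database.forURL(
      "jdbc:h2:mem:sketch",
      driver = "org.h2.Driver",
      keepAliveConnection = true // keeps the in-memory database alive until close()
    )
    try {
      val setup = DBIO.seq(
        coffees.schema.create,
        coffees ++= Seq(("Colombian", 7.99), ("Espresso", 9.99), ("French Roast", 8.99))
      )
      // Reads like a collection operation, but is compiled to a SQL query.
      val query = coffees.filter(_.price < maxPrice).sortBy(_.name).map(_.name)
      // Nothing touches the database until db.run is called.
      Await.result(db.run(setup.andThen(query.result)), 10.seconds)
    } finally db.close()
  }

  def main(args: Array[String]): Unit =
    println(cheapCoffees(9.0))
}
```

Building the query and running it are separate steps, which is how the developer keeps control over when database access happens. Targeting MySQL or PostgreSQL would mainly mean importing a different profile and changing the connection URL.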

Debolina Biswas
After diving deep into the Indian startup ecosystem, Debolina is now a Technology Journalist. When not writing, she is found reading or playing with paint brushes and palette knives. She can be reached at debolina.biswas@analyticsindiamag.com
