Collecting and processing data to make decisions is a crucial part of running a business. This often means managing big data with the help of data lakes, which make it easier to handle large quantities of data, allowing companies to retain more information, including unstructured and raw data. Data lakes also let companies run analytics and machine learning algorithms over large datasets to better discover patterns.
AWS Lake Formation, a fully managed Amazon service that facilitates building, securing and managing data lakes, recently became generally available, meaning developers can now use it in production. It comes as a breath of fresh air for those with large datasets, helping them deal with data more efficiently.
AWS Lake Formation was announced in November last year at the AWS re:Invent conference in Las Vegas. It automates a number of steps typically involved in creating a data lake, such as collecting, cleaning, deduplicating and cataloguing data, and making that data available for analytics in provisioned and configured storage. It also lets users bring data into a data lake from a range of sources.
What Would This Mean For Organisations?
While the concept of data lakes has been around for a long time, setting one up to store vast amounts of raw data in its native formats has never been easy. With the launch of AWS Lake Formation, Amazon aims to let developers create a secure data lake without much hassle.
The usual approach involves cumbersome steps such as configuring storage, moving data, adding metadata, cleaning the data, setting up the right access policies and more. This is a lot of work, and it can take companies several months to set up a data lake.
Lake Formation simplifies this by handling these complications with just a few clicks. It sets up the right tags and cleans up and deduplicates the data automatically. It also gives admins a single place to define security, governance and auditing policies that apply across multiple analytics engines, helping secure that data. Engineers can then analyse the data using their choice of AWS analytics and machine learning services, such as Amazon Redshift, Amazon Athena, Amazon QuickSight, Amazon SageMaker and more.
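As a rough illustration of what those "few clicks" correspond to under the hood, the sketch below (Python with boto3; the bucket name and prefix are hypothetical) builds the request that registers an S3 location with Lake Formation so the service can manage storage access on the data lake's behalf. The request shape matches the Lake Formation `RegisterResource` API.

```python
# Sketch: registering an S3 location with Lake Formation so that the
# service can manage storage access on the data lake's behalf.
# The bucket name and prefix below are hypothetical.

def build_register_request(bucket: str, prefix: str) -> dict:
    """Build the argument dict for lakeformation.register_resource()."""
    return {
        "ResourceArn": f"arn:aws:s3:::{bucket}/{prefix}",
        # Let Lake Formation use its service-linked role to read/write S3.
        "UseServiceLinkedRole": True,
    }

request = build_register_request("example-data-lake", "raw")

# With AWS credentials configured, this would be sent as:
#   import boto3
#   boto3.client("lakeformation").register_resource(**request)
```

Registering the location is the step that moves access control out of per-bucket IAM policies and into Lake Formation's own permission model.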
In a nutshell:
- It reduces the heavy lifting and automates the manual, time-consuming steps, like provisioning and configuring storage, crawling the data to extract schema and metadata tags
- It automatically optimises the partitioning of data, transforming data into formats like Apache Parquet and ORC that are ideal for analytics
- It cleans and deduplicates data using machine learning to improve data consistency and quality
- It provides a single, centralised place to set up and manage data access policies
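To make the last point concrete, the sketch below (Python with boto3; the account ID, role, database and table names are all hypothetical) builds the request an admin might use to grant a principal SELECT access to a catalogue table through Lake Formation's central permission model, rather than configuring each analytics service separately. The request shape matches the Lake Formation `GrantPermissions` API.

```python
# Sketch: granting table-level access through Lake Formation's
# centralised permission model instead of per-service policies.
# Account ID, role ARN, database and table names are hypothetical.

def build_grant_request(account_id: str, role_arn: str,
                        database: str, table: str,
                        permissions: list) -> dict:
    """Build the argument dict for lakeformation.grant_permissions()."""
    return {
        "CatalogId": account_id,
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "Table": {
                "CatalogId": account_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": permissions,        # e.g. ["SELECT"]
        "PermissionsWithGrantOption": [],  # principal cannot re-grant
    }

request = build_grant_request(
    account_id="123456789012",
    role_arn="arn:aws:iam::123456789012:role/analyst",
    database="sales_lake",
    table="orders",
    permissions=["SELECT"],
)

# With AWS credentials configured, this would be sent as:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**request)
```

Because the grant lives in one place, the same policy is enforced whether the table is queried from Athena, Redshift Spectrum or another integrated engine.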
Some of the clients already using it include Panasonic Avionics Corporation, Accenture, Quantiphi, Life360, Amgen and more. Beyond the ability to manage security settings for all the different applications in their environments, it gives them enhanced control over secure access to their data.
Amazon’s Move Is Timely
Reports suggest that the global data lakes market is anticipated to reach $12.01 billion by 2024. Some of the key players in the space are Microsoft, Google, IBM and more. While Microsoft has its own fully managed solution in Azure Data Lake, Google has a suite of data lake processing and analytics tools in Cloud Datalab, Dataproc and Dataflow.
The data lakes market is growing rapidly, and the key players are competing hard to gain a stronger foothold in it. With Lake Formation, Amazon aims to take on its competitors directly. Thanks to this ease of use, AWS claims to now host more data lakes than anyone else, a number that is growing fast every day. It has allowed developers to make the most of their data, learning and innovating with it rather than wrestling it into functioning data lakes.