A data lake is a centralised repository that holds large volumes of data in a flat architecture, as opposed to the hierarchical storage of a data warehouse. A data lake can store structured data from relational databases, semi-structured data, unstructured data and binary data, and can be set up on premises or in the cloud. Below, we look at how Google’s BigLake stacks up against other popular data lakes.
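To make the "flat architecture" idea concrete, here is a minimal Python sketch of a flat object store, in which every object lives in one namespace under its full key and "folders" are only a naming convention. The `FlatStore` class and the sample keys are hypothetical and exist purely for illustration; real object stores expose far richer APIs.

```python
from typing import Dict, List


class FlatStore:
    """Toy flat object store: one namespace, objects addressed by full key."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # Any format is accepted as raw bytes; no schema or directory is needed up front.
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list(self, prefix: str = "") -> List[str]:
        # "Folders" are simulated by filtering on a key prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatStore()
store.put("sales/2022/orders.csv", b"id,amount\n1,9.99\n")   # structured
store.put("sales/2022/events.json", b'{"event": "click"}')   # semi-structured
store.put("media/logo.png", b"\x89PNG")                      # binary

# List everything stored under the "sales/" prefix.
print(store.list("sales/"))
```

Because the keys are plain strings, structured, semi-structured and binary objects can sit side by side without a predefined hierarchy or schema.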
Google BigLake
BigLake is a storage engine that unifies data warehouses and data lakes. It eliminates the need to duplicate or move data, reducing cost and inefficiency. BigLake provides fine-grained access controls and performance acceleration across BigQuery and multicloud data lakes on AWS and Azure. It also makes data uniformly accessible across Google Cloud and open source engines.
“BigLake extends a decade of innovations with BigQuery to data lakes on multicloud storage, with open formats to ensure a unified, flexible, and cost-effective lakehouse architecture,” said the team.
Top features of BigLake:
- In BigLake, users can keep a single copy of data and enforce consistent access controls across most analytics engines.
- Allows users to achieve unified governance and management at scale through seamless integration with Dataplex.
- Users can extend BigQuery to multicloud data lakes and open formats such as Parquet and ORC with fine-grained security controls without setting up new infrastructure.
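The "single copy, consistent access controls" idea can be sketched conceptually in Python. This is not BigLake's API: the `GovernedTable` class, the engine functions, the role names and the sample row are all hypothetical. The sketch only shows why placing one policy layer in front of one copy of the data gives every engine the same governed view.

```python
from typing import Dict, List, Set

Row = Dict[str, object]


class GovernedTable:
    """Toy table: one copy of the rows plus column-level read rules per role."""

    def __init__(self, rows: List[Row], column_policy: Dict[str, Set[str]]) -> None:
        self._rows = rows                    # the single stored copy
        self._column_policy = column_policy  # role -> columns it may read

    def read(self, role: str) -> List[Row]:
        # Every engine reads through this method, so the policy is enforced in one place.
        allowed = self._column_policy.get(role, set())
        return [{c: v for c, v in row.items() if c in allowed} for row in self._rows]


table = GovernedTable(
    rows=[{"user": "a", "email": "a@example.com", "spend": 120}],
    column_policy={
        "analyst": {"user", "spend"},
        "admin": {"user", "email", "spend"},
    },
)


def sql_engine(t: GovernedTable, role: str) -> List[Row]:
    # Stand-in for a SQL engine (hypothetical).
    return t.read(role)


def spark_like_engine(t: GovernedTable, role: str) -> List[Row]:
    # Stand-in for an open source processing engine (hypothetical).
    return t.read(role)


# Two different engines get the same governed view for the same role.
print(sql_engine(table, "analyst"))
print(spark_like_engine(table, "analyst"))
```

In this toy model, changing the policy in one place changes what every engine sees, which is the property that avoids copying data into each engine and securing every copy separately.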
Azure Data Lake Storage
Azure Data Lake is packed with capabilities designed to help developers, data scientists and analysts store data of any size, shape and speed, and do all types of processing and analytics across platforms and languages. Azure Data Lake removes the complexities of ingesting and storing all data, and expedites batch, streaming and interactive analytics.
Top features of Azure Data Lake:
- Provides limitless scale and data durability with automatic geo-replication
- Handles demanding workloads with consistent performance
- Highly secure with flexible mechanisms for protection across data access, encryption and network-level control
- Cost optimisation through independent scaling of storage and compute
- Single storage platform for ingestion, processing and visualisation that supports the most common analytics frameworks
AWS Lake Formation
AWS Lake Formation is one of the easiest ways to set up data storage for analytics and ML services. AWS claims to provide “the most secure, scalable, comprehensive, and cost-effective portfolio of services” for customers to build their data lake in the cloud. AWS counts Netflix, Zillow, NASDAQ, Yelp, iRobot, and FINRA among its customers, and offers the scale, agility, and flexibility companies need to combine data and analytics approaches.
Top features of AWS Lake Formation:
- Define and manage security, governance, and auditing policies to meet industry and geography-specific regulations.
- Access to your data wherever it lives, along with custom-labelling provisions
- The audit log helps identify data access history across various services.
- Integration with other analytics-based services
- Users can automatically store copies of data across a minimum of three Availability Zones (AZs). The zones are separated by several miles to provide fault tolerance, but by no more than a hundred, to keep latency low.
Databricks’ Delta Lake
Delta Lake is an open format storage layer that delivers reliability, security and performance for both streaming and batch operations. Delta Lake is cost-effective and highly scalable and provides a single storage space for structured, semi-structured and unstructured data.
Top features of Delta Lake:
- High-quality, reliable data with a single source of truth for all data, including real-time streams
- Open and secure data sharing
- Good performance with Apache Spark under the hood
- Open and agile
- Automated and trusted data engineering
- Security and governance at scale
Snowflake
Snowflake is a cloud data warehouse company providing a fully managed service that scales well under concurrent workloads. Its data warehouse runs on Amazon Web Services, Microsoft Azure and Google Cloud. The cross-cloud platform gives users self-service access to governed data for various workloads, without resource contention or concurrency issues.
Top features of Snowflake Data Lake:
- One platform for all data, combining structured, semi-structured, and unstructured data of any format
- Fast, reliable processing and querying
- Secure collaboration