Introduced back in May, Google Cloud’s Analytics Hub “efficiently and securely exchanges data analytics assets across organisations.” The feature allows users to address data reliability and cost challenges by curating a library of internal and external assets.
As noted by Google, the service lets users:
- Drive innovation with unique datasets from Google, commercial data providers, or your partners
- Exchange data, ML models, or other analytics assets to increase the ROI of data initiatives
- Easily publish or subscribe to shared datasets in an open, secure, and privacy-safe environment
Data Sharing and its Challenges
The data-sharing culture is a rising trend in the industry, and rightfully so. The Gartner Chief Data Officer Survey found that data and analytics leaders who share data externally generate three times more measurable economic benefit than those who do not. However, the data-sharing culture is still developing, and for newcomers it can be overwhelming, with many worried about unreliable data and security threats. Gartner has emphasised the importance of trusting the quality of collected data: reliable data means collecting data from reliable sources, using and sharing data that matches the requirements of the business, and resharing data appropriately.
Another issue with traditional data sharing is its reliance on batch data pipelines, which are expensive to run, produce late-arriving data, and break whenever the source data changes; they also offer no built-in way to monetise data.
Google’s answer
Google Cloud’s Analytics Hub is Google’s answer to these challenges. The service helps clients unlock the value of data sharing, uncover new insights, and increase business value.
The service builds a rich data ecosystem through publishing and subscribing to analytics-ready datasets. Publishers can control and monitor how their data is being used, and Google pitches the service as a self-service way to access valuable and trusted data assets, marketing it as “an easy way to monetise your data assets without the overhead of building and managing the infrastructure.”
Foundational Architecture
The Analytics Hub service is built on Google’s petabyte-scale, serverless cloud data warehouse, BigQuery. BigQuery has supported cross-organisational sharing for more than a decade and was chosen for Analytics Hub because its architecture separates compute from storage.
This makes it possible for data publishers to share data with many subscribers without making multiple copies. Additionally, since there are no servers to deploy or manage with BigQuery, data consumers get immediate value from shared data. BigQuery also provides streaming ingestion, so data can be published and consumed in real time. Other BigQuery capabilities include built-in machine learning, geospatial and natural language features, and native business intelligence support with Looker, Google Sheets, and Data Studio.
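As a rough illustration of that streaming path, the sketch below uses the google-cloud-bigquery Python client to stream rows into a table that could back a shared dataset; the project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery

# Create a BigQuery client (uses Application Default Credentials).
client = bigquery.Client(project="my-project")  # hypothetical project ID

# Fully qualified table that subscribers would see refreshed in near real time.
table_id = "my-project.shared_metrics.page_views"  # hypothetical table

# Stream a few rows; they become queryable within seconds, so consumers of
# a shared dataset see fresh data without waiting on batch loads.
rows = [
    {"page": "/home", "views": 120, "event_date": "2021-10-01"},
    {"page": "/docs", "views": 45, "event_date": "2021-10-01"},
]

errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Encountered errors while inserting rows:", errors)
else:
    print("Rows streamed successfully.")
```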
In April alone, Google’s usage metrics for BigQuery showed more than 3,000 organisations sharing over 200 petabytes of data in a single week.
How it works
Analytics Hub is built around two constructs: shared datasets and exchanges.
A data publisher first creates shared datasets containing the views of data they plan to deliver to subscribers. They then create exchanges, which are private to their users and allow publishers to organise and secure shared datasets. The final step is to publish shared datasets into an exchange, making them available to subscribers.
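A rough sketch of this publishing flow is below, using the Analytics Hub Python client (google-cloud-bigquery-analyticshub). The API was in preview when Analytics Hub launched, so the method and field names here are assumptions that may differ from the released client; all IDs are hypothetical.

```python
from google.cloud import bigquery_analyticshub_v1 as analyticshub

# NOTE: illustrative sketch only; exact method and field names are assumptions.
client = analyticshub.AnalyticsHubServiceClient()

parent = "projects/my-project/locations/us"  # hypothetical project/location

# Step 1: create a private exchange to organise and secure shared datasets.
exchange = client.create_data_exchange(
    parent=parent,
    data_exchange_id="partner_exchange",  # hypothetical ID
    data_exchange=analyticshub.DataExchange(display_name="Partner exchange"),
)

# Step 2: publish a shared dataset into the exchange as a listing,
# making it discoverable by subscribers.
listing = client.create_listing(
    parent=exchange.name,
    listing_id="sales_metrics",  # hypothetical ID
    listing=analyticshub.Listing(
        display_name="Sales metrics",
        bigquery_dataset=analyticshub.Listing.BigQueryDatasetSource(
            dataset="projects/my-project/datasets/shared_sales"  # hypothetical
        ),
    ),
)
print("Published listing:", listing.name)
```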
Data subscribers can search the datasets across all exchanges available to them and subscribe to the relevant ones. Subscribing creates a linked dataset in their own project, which they can join with their own data. Any additions made by data providers become available to subscribers immediately.
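On the subscriber side, a linked dataset behaves like any other BigQuery dataset, so it can be joined against the subscriber’s own tables with standard SQL. Here is a minimal sketch using the google-cloud-bigquery Python client; all project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client(project="subscriber-project")  # hypothetical project

# "linked_sales" stands in for a linked dataset created by subscribing to a
# listing; it is a read-only pointer to the publisher's shared data, not a copy.
sql = """
    SELECT o.order_id, o.amount, s.region
    FROM `subscriber-project.internal.orders` AS o
    JOIN `subscriber-project.linked_sales.regions` AS s
      ON o.region_id = s.region_id
"""

for row in client.query(sql).result():
    print(row.order_id, row.amount, row.region)
```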
To ensure security, publishers can track subscribers, disable subscriptions, and see aggregated usage information for the shared data. Four types of datasets are available: public, Google, commercial, and internal datasets.
Similar Types of Data Repositories
Data Lake
Popularised alongside Hadoop, a data lake is a central repository that stores data at any scale or structure. This makes it easy to move raw data into a central repository at low cost while reducing the time to load data on the front end. The downside of data lakes, however, is that the data may not be curated or searchable, so external tools are required to analyse or operationalise it.
Data Warehouses
A data warehouse is a repository for persistent, primarily structured data built over time from multiple upstream data sources. Data warehouses are intended solely for queries and analyses and often contain large amounts of historical data. Data warehouses from Oracle, IBM, and Teradata tend to be IT-centric and are managed by one or more database administrators; as a result, users usually do not interact with the warehouse directly.
Data Virtualisation
Data virtualisation creates virtual views of data stored in existing databases: the physical data does not move, but external virtual layers can be added on top of it. Essentially, it allows an application to retrieve and manipulate data without needing to know the technical details of where or how that data is stored.
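BigQuery’s federated queries are one concrete example of this pattern: the EXTERNAL_QUERY function pushes a SQL statement down to an external Cloud SQL database and returns the result, without the physical data ever being copied into BigQuery. A minimal sketch, assuming a pre-configured connection; the connection ID and table names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# EXTERNAL_QUERY runs the inner statement directly on a Cloud SQL instance
# through a pre-configured connection; the source data never moves.
sql = """
    SELECT customer_id, plan
    FROM EXTERNAL_QUERY(
        'my-project.us.crm_connection',  -- hypothetical connection ID
        'SELECT customer_id, plan FROM customers'
    )
"""

for row in client.query(sql).result():
    print(row.customer_id, row.plan)
```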