
How Google Built A Search Engine For Finding Datasets


Over the years, the word Google has become synonymous with the Internet.

What Google does best is make life easy by returning the most relevant search results. It uses state-of-the-art ranking algorithms (usually built in-house) that weigh both ‘what was’ and ‘what could have been’.

For data scientists in particular, Google has long been a source of information, from the latest stock prices to historical data spanning hundreds of years. Beyond information, Google has contributed to the AI community by open-sourcing its tools and frameworks and by providing inexpensive cloud processing power through TPUs, among other things.

Now it has built a customised search engine just for finding datasets.

Google’s Dataset Search engine is an attempt to establish an open ecosystem of millions of datasets.

The objective is to surface the most appropriate datasets with as few queries as possible.

Key Challenges

It is difficult to list all the dataset repositories even within a single domain, say, medical or trading.

One primary challenge for the search engine is targeting the right datasets. To decide what counts as a dataset, the developers at Google started from a simple assumption: whenever a data owner publishes something and calls it a dataset, it IS a dataset, whether it is a table, a file, a collection of images or a binary blob.

With this, the first hurdle of recognising what qualifies as a dataset is addressed.
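In practice, data owners make this declaration through explicit, machine-readable metadata; Dataset Search relies on open standards such as schema.org’s Dataset markup. Below is a minimal sketch of what such a description might look like, expressed here as a Python dict serialised to JSON-LD; every field value (title, URLs, licence) is a made-up illustration, not taken from Google’s documentation.

```python
# Illustrative schema.org/Dataset description, built as a Python dict and
# serialised to JSON-LD. On a real page this would sit inside a
# <script type="application/ld+json"> tag; all values below are hypothetical.
import json

dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Daily rainfall measurements",                 # hypothetical title
    "description": "Rainfall readings collected by a weather station.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Weather Agency"},
    "temporalCoverage": "2010-01-01/2020-12-31",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "CSV",
        "contentUrl": "https://example.com/rainfall.csv",  # placeholder URL
    },
}

print(json.dumps(dataset_metadata, indent=2))
```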

Next comes metadata. Search queries often refer to metadata such as titles, dates and other attributes of a dataset rather than the data itself. The quality of this metadata varies widely, and a lot can go wrong given the scale at which the engine operates.

The format in which metadata is published often differs from the format in which users search for it; dates, for example, can be written in many ways.

Two searches can be nearly identical, yet the success of one can hinge on something as trivial as a space.
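As a toy illustration of this kind of reconciliation (a sketch of the general idea, not Google’s actual normalisation code), the snippet below maps a few common date spellings onto a single ISO 8601 form:

```python
# Illustrative sketch: normalising dates written in different formats to one
# canonical ISO 8601 form, the kind of cleanup a metadata backend has to do.
from datetime import datetime

CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %b %Y"]

def normalise_date(raw: str) -> str | None:
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

for raw in ["2018-09-05", "05/09/2018", "September 5, 2018", "5 Sep 2018"]:
    print(raw, "->", normalise_date(raw))
```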

Many publishers attach metadata to each dataset in a search-results listing but not to its profile page. This can result in the crawler picking up a large number of copies of metadata for the same dataset.

Copies of metadata describing the same dataset in different repositories are treated as replicas. Grouping these replicas into a cluster gives users more options.
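One way to picture this grouping is as clustering on a normalised key. The sketch below assumes, purely for illustration, that replicas share a title and download URL; that is a simplification of whatever signals Google actually uses.

```python
# Illustrative sketch: grouping metadata records that describe the same
# dataset into clusters, keyed on a normalised (title, download URL) pair.
# Google's actual reconciliation logic is more sophisticated.
from collections import defaultdict

records = [
    {"repo": "repo-a", "title": "Daily Rainfall ", "url": "https://example.com/rain.csv"},
    {"repo": "repo-b", "title": "daily rainfall",  "url": "https://example.com/rain.csv"},
    {"repo": "repo-c", "title": "Air Quality",     "url": "https://example.com/air.csv"},
]

def cluster_key(record: dict) -> tuple:
    """Normalise the fields that identify a dataset across repositories."""
    return (record["title"].strip().lower(), record["url"].rstrip("/"))

clusters: dict[tuple, list] = defaultdict(list)
for record in records:
    clusters[cluster_key(record)].append(record)

for key, members in clusters.items():
    print(key, "->", [m["repo"] for m in members])
```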

Stale links also surface during searches. To steer users away from them, the team at Google deletes, on average, 3% of the datasets from its index.

Source: Google

Google’s crawler (user agents) collects the metadata from the Web; the Dataset Search backend normalizes and reconciles the metadata; the reconciled metadata is then indexed, and results are ranked for user queries.
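A conceptual sketch of that flow, with toy stand-ins for each stage (none of this is Google’s code), might look like:

```python
# Conceptual sketch of the pipeline described above: crawl, normalise,
# reconcile, then rank against a query. Every stage is a toy stand-in.

def crawl(pages: dict[str, dict]) -> list[dict]:
    """Pretend-crawl: pull the metadata block out of each 'page'."""
    return [dict(meta, source=url) for url, meta in pages.items()]

def normalise(records: list[dict]) -> list[dict]:
    """Canonicalise fields; here, just lower-cased and stripped titles."""
    return [dict(r, title=r["title"].strip().lower()) for r in records]

def reconcile(records: list[dict]) -> dict[str, list[dict]]:
    """Group replica descriptions of the same dataset by normalised title."""
    clusters: dict[str, list[dict]] = {}
    for r in records:
        clusters.setdefault(r["title"], []).append(r)
    return clusters

def rank(clusters: dict[str, list[dict]], query: str) -> list[str]:
    """Score clusters by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(title.split())), title) for title in clusters]
    return [title for score, title in sorted(scored, reverse=True) if score > 0]

pages = {
    "https://repo-a.example/rain": {"title": "Daily Rainfall"},
    "https://repo-b.example/rain": {"title": "daily rainfall "},
    "https://repo-c.example/air":  {"title": "Air Quality Index"},
}
print(rank(reconcile(normalise(crawl(pages))), "rainfall data"))
```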

To improve the ranking of datasets, knowing which results users find relevant for a given query is of great significance. The team also plans to improve coverage by encouraging the growth of explicit metadata and by using existing metadata to train methods that extract metadata where it is missing.

Future Direction

Google’s open ecosystem of datasets looks promising for building a reliable community of data providers and publishers. Since data science is interdisciplinary, a statistician or a journalist need not feel left out when accessing data from other domains.

Metadata quality remains a challenge, one the team aims to improve gradually by linking existing metadata to resources such as academic publications.

Know more about the search engine here.

PS: The story was written using a keyboard.

Ram Sagar

I have a master's degree in Robotics and I write about machine learning advancements.