Google Unveils Gecko, a Versatile Text Embedding Model Distilled from Large Language Models

Gecko is trained on an LLM-generated synthetic dataset FRet that contains LLM-ranked positives and negatives.

Google has announced Gecko, a compact and versatile text embedding model powered by the vast world knowledge of large language models (LLMs). 

Text embedding models represent natural language as dense vectors, positioning semantically similar text near each other within the embedding space. In simpler terms, text embedding models act as translators for computers: they take text and convert it into numbers in a way the computer can understand.

The numerical representations, also known as embeddings, capture semantic information about the words or sentences in the text. By allowing computers to process natural language, these embeddings are used to carry out a wide range of downstream tasks including document retrieval, sentence similarity, classification, and clustering. 
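To make "semantically similar text near each other" concrete, here is a minimal sketch of how closeness in embedding space is typically measured with cosine similarity. The 4-dimensional vectors are hand-picked toy values, not the output of any real model (Gecko's embeddings have 768 dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings"; values are illustrative only.
cat = np.array([0.9, 0.8, 0.1, 0.0])
kitten = np.array([0.85, 0.75, 0.2, 0.05])
invoice = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(cat, kitten))   # high: semantically related words
print(cosine_similarity(cat, invoice))  # low: unrelated words
```

Downstream tasks such as retrieval or clustering then reduce to comparing these numbers: the pair with the higher score is treated as more semantically related.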

Instead of building separate embedding models for each downstream task, there has been a push for creating a single model that can support many tasks. However, such general-purpose text embedding models require large amounts of training data to comprehensively cover desired domains and skills. This is where LLMs can be leveraged, as done by Google in this research. 

“LLMs contain vast knowledge across various domains and are known to be exceptional few-shot learners,” the researchers note. Google’s approach leverages insights from knowledge distillation to create Gecko, a two-step LLM-powered embedding model. 

“Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM.” 

In practice, the team started with a large corpus of unlabeled passages and used a few-shot prompted LLM to generate a relevant task and query for each passage. They then embedded the concatenated task and query using a pretrained embedding model to obtain nearest neighbor passages, reranked those passages with an LLM, and selected positive and hard negative passages based on the LLM scores. This approach helped Gecko achieve strong retrieval performance. 
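The steps above can be sketched as a toy pipeline. Everything here is an illustrative stand-in, not the paper's actual components: word overlap substitutes for both the pretrained embedder and the LLM, and all function names are hypothetical. Only the overall flow (generate task and query, retrieve neighbors, rerank, relabel) mirrors the described process:

```python
def llm_generate_task_and_query(passage):
    # Step 1 stand-in: a few-shot prompted LLM would produce a task
    # description and a query for each unlabeled passage.
    return "question answering", f"what is said about {passage.split()[0]}"

def embed(text):
    # Stand-in for a pretrained embedding model: a bag of words.
    return set(text.lower().split())

def similarity(a, b):
    # Jaccard overlap as a toy proxy for vector similarity.
    return len(a & b) / max(len(a | b), 1)

def llm_score(query, passage):
    # Stand-in for LLM relevance grading of a query-passage pair.
    return similarity(embed(query), embed(passage))

def build_fret_example(passage, corpus, k=3):
    task, query = llm_generate_task_and_query(passage)
    # Step 2a: embed the concatenated task and query, then retrieve
    # nearest neighbor passages from the corpus.
    q_vec = embed(task + " " + query)
    neighbors = sorted(corpus, key=lambda p: similarity(q_vec, embed(p)),
                       reverse=True)[:k]
    # Step 2b: rerank the candidates with the LLM; the top-scored passage
    # becomes the positive (it need not be the original passage), and a
    # retrieved-but-low-scored one becomes the hard negative.
    ranked = sorted(neighbors, key=lambda p: llm_score(query, p), reverse=True)
    return {"task": task, "query": query,
            "positive": ranked[0], "negative": ranked[-1]}

corpus = [
    "gecko distills llm knowledge into compact embeddings",
    "bananas are rich in potassium",
    "llm rankers can relabel retrieved passages",
]
example = build_fret_example(corpus[0], corpus)
print(example["query"], "->", example["positive"])
```

The interesting design choice is in step 2b: because the LLM rescores the retrieved candidates, the labeled positive for a query can differ from the passage that originally generated it, which is part of what makes the resulting FRet dataset higher quality than naive pairing.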

The research showed that training Gecko solely on FRet, an LLM-generated synthetic dataset containing LLM-ranked positives and negatives, leads to significant improvements, setting a strong baseline as a zero-shot embedding model on the Massive Text Embedding Benchmark (MTEB).

“By combining this LLM-generated and LLM-ranked data with human-annotated data, our model, Gecko-1B with 768-dimensional embeddings, achieves the best performance on the popular MTEB benchmark among the models with compatible embedding dimensions and model sizes. It achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings”, the paper notes. 


Sukriti Gupta

Having done her undergrad in engineering and masters in journalism, Sukriti likes combining her technical know-how and storytelling to simplify seemingly complicated tech topics in a way everyone can understand.