Google has announced Gecko, a compact and versatile text embedding model powered by the vast world knowledge of large language models (LLMs).
Text embedding models represent natural language as dense vectors, positioning semantically similar text near each other within the embedding space. Or in simple terms, text embedding models are like translators for computers. They take text and convert it into numbers in a way the computer can understand.
The numerical representations, also known as embeddings, capture semantic information about the words or sentences in the text. By allowing computers to process natural language, these embeddings are used to carry out a wide range of downstream tasks including document retrieval, sentence similarity, classification, and clustering.
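To make this concrete, here is a minimal sketch of how embeddings enable similarity comparison. The vectors below are invented toy examples, not the output of any real model (production embeddings typically have hundreds of dimensions), and `cosine_similarity` is a standard measure of how close two vectors point in the same direction:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: the first two sentences are semantically close,
# so a good embedding model places their vectors near each other.
embeddings = {
    "A dog is playing in the park.":  [0.81, 0.52, 0.10, 0.05],
    "A puppy runs across the grass.": [0.78, 0.55, 0.12, 0.08],
    "Quarterly revenue rose by 4%.":  [0.05, 0.10, 0.70, 0.71],
}

sents = list(embeddings)
sim_close = cosine_similarity(embeddings[sents[0]], embeddings[sents[1]])
sim_far = cosine_similarity(embeddings[sents[0]], embeddings[sents[2]])
print(sim_close)  # close to 1.0: similar meaning
print(sim_far)    # much lower: unrelated topics
```

Downstream tasks such as retrieval or clustering reduce to exactly this kind of vector comparison at scale.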
Instead of building separate embedding models for each downstream task, there has been a push for creating a single model that can support many tasks. However, such general-purpose text embedding models require large amounts of training data to comprehensively cover desired domains and skills. This is where LLMs can be leveraged, as done by Google in this research.
“LLMs contain vast knowledge across various domains and are known to be exceptional few-shot learners”. Google’s approach leverages insights from knowledge distillation to create Gecko, a two-step LLM-powered embedding model.
“Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each query, and relabeling the positive and hard negative passages using the same LLM.”
So basically, starting with a large corpus of unlabeled passages, the team used a few-shot prompted LLM to generate a relevant task and query for each passage. They then embedded the concatenated task and query using a pretrained embedding model to obtain nearest neighbor passages, used an LLM to rerank the passages, and obtained positive and negative passages based on the LLM scores. This approach helped Gecko achieve strong retrieval performance.
The research showed that training Gecko on FRet, an LLM-generated synthetic dataset containing LLM-ranked positives and negatives, alone leads to significant improvement, setting a strong baseline as a zero-shot embedding model on the Massive Text Embedding Benchmark (MTEB).
“By combining this LLM-generated and LLM-ranked data with human-annotated data, our model, Gecko-1B with 768-dimensional embeddings, achieves the best performance on the popular MTEB benchmark among the models with compatible embedding dimensions and model sizes. It achieves an average score of 66.31, competing with 7x larger models and 5x higher dimensional embeddings”, the research states.