The amount of available labelled data is a barrier to producing a high-performing model in many ML applications. Developments in the past two years have shown that this data limitation can be overcome by using LLMs (Large Language Models) such as OpenAI's GPT-3, which achieve good results from only a handful of examples. However, while LLMs ease the shortage of labelled data, they introduce a new problem: the difficulty of accessing, and the cost of running, such large models.
To counter this, a group of researchers has developed a new approach called SetFit for creating highly accurate text-classification models with limited labelled data. The joint research, led by Intel Labs, UKP Lab, and Hugging Face, shows SetFit outperforming GPT-3 in 7 out of 11 tasks while being 1600x smaller.
(Image source: Phil Schmid)
According to the blog, SetFit has several features that set it apart from other few-shot learning methods. First, it uses no prompts or verbalisers: current techniques for few-shot fine-tuning require handcrafted prompts, whereas SetFit dispenses with them altogether by generating embeddings directly from text examples. Second, it does not need a large-scale model like GPT-3 to achieve high accuracy. Finally, it offers multilingual support, as SetFit can be used with any Sentence Transformer on the Hugging Face Hub.
(Image source: Phil Schmid)
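To make the prompt-free workflow concrete, here is a minimal sketch in the style of the setfit library's published quickstart: a pretrained Sentence Transformer is fine-tuned with a contrastive loss on a handful of labelled examples, then a classification head is fit on the resulting embeddings. The checkpoint, the sst2 dataset, and the hyperparameters below are illustrative assumptions, not details from the article.

```python
# Minimal SetFit sketch (assumes `pip install setfit`); the checkpoint,
# dataset, and hyperparameters are illustrative choices, not from the article.
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Load a binary sentiment dataset from the Hugging Face Hub
dataset = load_dataset("sst2")

# Simulate the few-shot regime: roughly 8 examples per class
train_dataset = dataset["train"].shuffle(seed=42).select(range(16))
eval_dataset = dataset["validation"]

# Any Sentence Transformer on the Hub can serve as the body; no prompts are needed
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,  # contrastive fine-tuning of the embeddings
    metric="accuracy",
    batch_size=16,
    num_iterations=20,  # number of text pairs generated for contrastive learning
    column_mapping={"sentence": "text", "label": "label"},
)

trainer.train()
print(trainer.evaluate())
```

Because the heavy lifting is done by a compact Sentence Transformer rather than a billions-of-parameters LLM, the whole loop fits comfortably on a single modest GPU.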
Using the new approach, the team has generated a high-performing text-classification model with just 8 samples per class, or only 32 labelled samples in total. “This is huge! SetFit will help so many companies to get started with text-classification and transformers, without the need to label a lot of data and compute power. Compared to LLM training, the SetFit classifier takes less than 1 hour on a small GPU (NVIDIA T4) to train or less than $1 so to speak,” reads the blog.
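The “8 samples per class, 32 labelled samples in total” setup implies a four-class dataset. The sketch below shows how such a split can be drawn with the setfit library's sample_dataset helper; ag_news (4 classes) is an assumed stand-in for the team's data, not the dataset named in the article.

```python
# Hypothetical reconstruction of the "8 per class = 32 samples" few-shot split;
# ag_news (4 classes) is an assumed stand-in for the team's dataset.
from datasets import load_dataset
from setfit import sample_dataset

dataset = load_dataset("ag_news")

# Stratified sampling: exactly 8 labelled examples from each of the 4 classes
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
print(len(train_dataset))  # 32

# After training (as in the earlier sketch), inference is a single call:
# preds = model.predict(["Wall St. rallies as oil prices ease", ...])
```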