
Apple Introduces Cost-Effective Language Models for Limited Domain Use

The paper outlines four key variables: pre-training budget, specialisation budget, inference budget, and in-domain training set size.



In a recent paper titled “Specialized Language Models with Cheap Inference from Limited Domain Data,” Apple addresses the challenges of applying large language models to tasks with constraints on both inference budgets and in-domain training sets. 


Large language models have proven to be versatile tools but face difficulties in scenarios where both large inference budgets and substantial in-domain training sets are lacking. The research formalises these constraints and explores various approaches from the machine learning literature to address the challenges posed by limited resources.

The study reveals that, when constrained by inference cost, alternatives to the conventional practice of training very large vanilla transformer models are more effective. 

Specifically, hyper-networks and mixtures of experts deliver superior perplexity when the pre-training budget is large, while small models trained on importance-sampled datasets emerge as attractive options when the specialisation budget is large.
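For context, perplexity is the language-modelling metric behind these comparisons: the exponential of the average per-token negative log-likelihood, so lower values mean the model predicts held-out text better. A minimal illustration:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token); lower is better."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```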

Apple’s latest research in language models offers specific recommendations based on the size of the specialisation budget. The findings suggest that for scenarios with a large specialisation budget, opting for small models pretrained with importance sampling is the most effective approach. This involves pretraining over a generic corpus that has been resampled using importance sampling techniques.
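The paper does not ship reference code, but the importance-sampling idea can be sketched as follows: weight each document in the generic corpus by how much more likely it is under a small in-domain proxy model than under a generic one, then resample the corpus with those weights. The function and the two scoring models below are illustrative assumptions, not Apple’s implementation.

```python
import numpy as np

def importance_resample(generic_docs, in_domain_logprob, generic_logprob,
                        sample_size, seed=0):
    """Resample a generic corpus so its distribution leans toward the target domain.

    Each document is weighted by the likelihood ratio p_domain(x) / p_generic(x),
    estimated here with per-document log-probabilities from two small proxy
    language models (an assumption for illustration; the paper's exact scoring
    may differ).
    """
    log_w = np.array([in_domain_logprob(d) - generic_logprob(d) for d in generic_docs])
    weights = np.exp(log_w - log_w.max())        # shift in log space for stability
    weights /= weights.sum()

    rng = np.random.default_rng(seed)
    idx = rng.choice(len(generic_docs), size=sample_size, replace=True, p=weights)
    return [generic_docs[i] for i in idx]
```

The resampled corpus is still generic data, just skewed toward the target domain; this is the intuition behind why a small model pretrained on it can compete under a tight inference budget.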

When dealing with a smaller specialisation budget, however, the research advises investing in the generic pretraining of hyper-networks and mixtures of experts. These asymmetric models carry a substantial parameter count during pretraining but can be instantiated as much smaller models for specialisation.
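As a rough illustration of that asymmetry, the PyTorch-style sketch below shows a mixture-of-experts feed-forward block: all experts are trained during generic pretraining, but a single expert can later be extracted as a much smaller specialised module. The class and its specialise helper are hypothetical, not code from the paper; hyper-networks follow a similar pattern, with a large network generating the weights of a small specialised model.

```python
import torch
import torch.nn as nn

class MixtureOfExpertsFFN(nn.Module):
    """Feed-forward block with several experts; generic pretraining uses them all."""

    def __init__(self, d_model=512, d_hidden=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # scores one expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # Hard top-1 routing: each token is processed by its best-scoring expert.
        expert_idx = self.router(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

    def specialise(self, expert_id):
        """Keep only one expert for a target domain: a far smaller inference model."""
        return self.experts[expert_id]
```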

The research also finds that distillation, a commonly used model-compression technique, is not competitive across the various cost trade-offs considered in the study.
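The distillation baseline in question is the standard teacher-student recipe, in which a small student is trained to match the soft predictions of a large teacher. A minimal sketch of the usual objective follows; the temperature and mixing weight are illustrative defaults, not values from the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard knowledge distillation: blend cross-entropy on the true labels
    with a KL term that matches the teacher's temperature-softened predictions."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```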


Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.