In a recent paper titled “Specialized Language Models with Cheap Inference from Limited Domain Data,” Apple addresses the challenges of applying large language models to tasks with constraints on both inference budgets and in-domain training sets.
The paper frames the problem around four key variables: the pre-training budget, the specialization budget, the inference budget, and the size of the in-domain training set.
Large language models have proven to be versatile tools but face difficulties in scenarios where both large inference budgets and substantial in-domain training sets are lacking. The research formalizes these constraints and explores approaches from the machine learning literature to address the challenges posed by limited resources.
The study reveals that, when constrained by inference cost, alternatives to the conventional practice of training very large vanilla transformer models are more effective.
Specifically, hyper-networks and mixtures of experts achieve superior perplexity for large pre-training budgets. On the other hand, small models trained on importance-sampled datasets emerge as attractive options for scenarios with large specialization budgets.
Apple’s latest research in language models offers specific recommendations based on the size of the specialization budget. The findings suggest that when the specialization budget is large, pretraining small models with importance sampling is the most effective approach: the model is pretrained on a generic corpus that has been resampled, via importance sampling, toward the target domain.
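To make the resampling idea concrete, here is a toy sketch, not the paper's actual method: each generic document is weighted by the ratio of its likelihood under a domain model to its likelihood under a generic model (here, simple smoothed unigram models), and the highest-weight documents are kept for pretraining.

```python
import math
from collections import Counter

def unigram_logprob(text, counts, total, vocab_size):
    """Log-probability of text under an add-one-smoothed unigram model."""
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size))
        for w in text.split()
    )

def importance_resample(generic_corpus, domain_sample, k):
    """Keep the k generic documents with the highest importance weights.

    Scores each document by log p_domain(x) - log p_generic(x), a toy
    stand-in for the importance-sampling criteria evaluated in the paper.
    """
    vocab = {w for doc in generic_corpus + domain_sample for w in doc.split()}
    dom_counts = Counter(w for doc in domain_sample for w in doc.split())
    gen_counts = Counter(w for doc in generic_corpus for w in doc.split())
    dom_total = sum(dom_counts.values())
    gen_total = sum(gen_counts.values())
    scored = [
        (unigram_logprob(doc, dom_counts, dom_total, len(vocab))
         - unigram_logprob(doc, gen_counts, gen_total, len(vocab)), doc)
        for doc in generic_corpus
    ]
    scored.sort(reverse=True)
    return [doc for _, doc in scored[:k]]

generic = [
    "the patient received a dose of medication",
    "the stock market fell sharply today",
    "football season starts in autumn",
]
domain = ["patient dose clinical trial medication"]
print(importance_resample(generic, domain, 1))
# Selects the document closest to the medical domain sample.
```

In practice the scoring models would be classifiers or n-gram models trained on the small in-domain set, and sampling would be probabilistic rather than top-k, but the principle is the same: reshape the generic corpus before pretraining the small model.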
On the other hand, when dealing with a smaller specialization budget, the research advises investing in the generic pretraining of hyper-networks and mixtures of experts. These asymmetric models carry a large parameter count during pretraining but can be instantiated as smaller models for specialization.
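The asymmetry can be illustrated with a minimal hyper-network sketch, assuming toy dimensions rather than anything from the paper: a large projection, trained during pretraining, maps a domain descriptor to the full weight matrix of a small model, and only that generated small model runs at specialization and inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen for illustration only.
DOMAIN_DIM = 4            # size of the domain descriptor fed to the hyper-network
IN_DIM, OUT_DIM = 8, 3    # shape of the small specialized model's weight matrix

# The hyper-network's parameters exist only at pretraining time:
# one large projection from domain descriptor to a flattened weight matrix.
hyper_W = rng.normal(scale=0.1, size=(DOMAIN_DIM, IN_DIM * OUT_DIM))

def instantiate(domain_vec):
    """Generate the small model's weights from a domain descriptor."""
    flat = domain_vec @ hyper_W           # shape (IN_DIM * OUT_DIM,)
    return flat.reshape(IN_DIM, OUT_DIM)  # small model computes x @ W

# At inference, only the instantiated small model is used:
domain_vec = rng.normal(size=DOMAIN_DIM)
small_W = instantiate(domain_vec)
x = rng.normal(size=IN_DIM)
logits = x @ small_W
```

Note the parameter asymmetry: the hyper-network holds `DOMAIN_DIM * IN_DIM * OUT_DIM` weights, while the instantiated model holds only `IN_DIM * OUT_DIM`, which is what makes the inference budget cheap even though pretraining was large.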
The research concludes that distillation, a commonly used compression technique, is not competitive across the various cost trade-offs considered in the study.