
Microsoft’s Phi-3 Outperforms Meta’s Llama 3 and Fits Perfectly on an iPhone

Microsoft shows who is the boss of tiny open source models.


“One of the things that makes Phi-2 better than Meta’s Llama 2 7B and other models is that its 2.7 billion parameter size is very well suited for fitting on a phone,” said Harkirat Behl, one of the creators of the model, who has now built Phi-3, Microsoft’s latest open source model.

Phi-3-Mini is a 3.8 billion parameter language model trained on an extensive dataset of 3.3 trillion tokens. Despite its compact size, Phi-3-Mini reports performance that rivals recent models such as Mixtral 8x7B and GPT-3.5, and even surpasses Meta’s recently launched Llama 3 8B on the MMLU benchmark.

Despite these capabilities, Phi-3-Mini can run locally on a phone. Its small size allows it to be quantised to 4 bits, occupying approximately 1.8GB of memory. Microsoft tested the quantised model on an iPhone 14 with an A16 Bionic chip, running natively on the device and fully offline at more than 12 tokens per second.
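For readers who want a feel for what 4-bit quantisation looks like in practice, here is a minimal sketch using the Hugging Face transformers library with bitsandbytes. This is not Microsoft’s on-device pipeline (which runs natively on the A16 Bionic); it assumes the publicly listed Phi-3-Mini checkpoint ID, and the exact quantisation scheme Microsoft benchmarked on the iPhone may differ.

```python
# Minimal sketch: load Phi-3-Mini with 4-bit weights via bitsandbytes.
# This approximates the memory savings described above; it is NOT the
# native iPhone deployment Microsoft benchmarked.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed public checkpoint ID

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights, roughly the ~1.8GB cited
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,  # Phi-3 shipped with custom model code at release
)

prompt = "Explain 4-bit quantisation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```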

Microsoft has also introduced the Phi-3-Small and Phi-3-Medium models, both significantly more capable than Phi-3-Mini. The 7 billion parameter Phi-3-Small achieves an MMLU score of 75.3, outperforming Meta’s recently launched Llama 3 8B Instruct, which scores 66.

With a Grain of Salt

“To best benefit the open source community, Phi-3-Mini is built upon a similar block structure as Llama-2,” reads the technical report by Microsoft. But the model is currently limited to English, a drawback for developers working in other languages, including Indic AI developers.

The innovation behind Phi-3-Mini lies in its training dataset, an expanded version of the one used for its predecessor, Phi-2. This dataset comprises heavily filtered web and synthetic data. The model has also been optimised for robustness, safety, and chat format.

Given how well small open source models are performing, it wouldn’t be surprising if a model soon outperforms OpenAI’s GPT-4. Interestingly, Meta is also training a model with around 400 billion parameters, which may be able to outperform the closed models once it is launched.

“BUT – as with all (tiny) models, benchmarks tell us less than vibes,” said Matt Shumer on X. In the discussion that followed, users highlighted issues with the model’s benchmarks. “According to what I’ve read, Phi-2 was much worse than its benchmark numbers suggested. This model follows the same training strategy,” read a comment.

Since the model is built by Microsoft and trained on synthetic data, it may well be using GPT-4 output for training. “I don’t think it’s impossible for a small model to be very good. I see their ‘synthetic data’ as essentially a way of distilling GPT-4 into smaller models,” said the same user.

Furthermore, the Phi-3 models are trained on at most 4.8 trillion tokens (3.3 trillion for Phi-3-Mini), significantly fewer than the 15 trillion tokens Llama 3 was trained on. Regardless, Phi-3-Mini can run on a phone, something the Llama series of models is still some way from, given their size.

Moreover, Phi models aren’t specifically tuned for chat or instruction following, which makes them perform slightly worse than Llama models when incorporated into real-world scenarios.

On the other hand, Behl had told AIM that scaling laws do not necessarily hold. “You don’t need a specific size or number of parameters for a model to get good at coding,” said Behl, adding that you do not need large models to instil intelligence. “All you need is a small amount of high quality data, aka textbook quality data.”

Phi-3 continues this approach.

What Dent Will It Make?

Since the model is built for on-device and edge use cases, it is ideal for the ongoing shift towards AI devices. Apple, too, is experimenting with AI on the edge, and Phi-3 might give Microsoft an edge (pun intended) over it.

Moreover, with such small models outperforming larger ones, this might also affect OpenAI’s release of GPT-5, as enterprises are increasingly adopting open source models. Who knows, the company might decide to open source one of its upcoming models, though that seems highly unlikely for now.

Microsoft has also kept in mind the need for LLMs to stay up to date with current information, and has thus made Phi-3 well suited for retrieval-augmented generation (RAG) use cases as well.
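As a rough illustration of the RAG pattern (a hypothetical sketch, not Microsoft’s reference setup): retrieved passages are prepended to the prompt so the model answers from current information rather than its training data. The `retrieve` function below is a toy stand-in for a real search backend.

```python
# Hypothetical minimal RAG loop around a small local model such as Phi-3-Mini.
# retrieve() is a stand-in for any search backend (vector DB, keyword index).

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Toy keyword-overlap retriever; a real system would use embeddings.
    q_words = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    # Prepend retrieved passages so the model grounds its answer in them.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

corpus = [
    "Phi-3-Mini is a 3.8 billion parameter model released by Microsoft.",
    "Llama 3 8B was trained on 15 trillion tokens.",
]
query = "How many parameters does Phi-3-Mini have?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)  # feed this prompt to the quantised model from the earlier sketch
```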

Microsoft believes that training models on synthetic data keeps model sizes small while still instilling substantial capability, a departure from how GPT-3 was trained. “Textbooks are written by experts in the field, unlike the internet where anybody can write and post, which is how GPT-3 is trained,” said Behl.

Mohit Pandey