Large language models trained on ChatGPT output have, until now, sat in a legal grey area. Databricks seems to have found a way around this with Dolly 2.0, the successor to the large language model with ChatGPT-like human interactivity that the company released just two weeks ago. What sets Dolly 2.0 apart from other ‘open source’ models is that it is available for commercial use, with no need to pay for API access or share data with third parties.
According to the company’s official statement, Dolly 2.0 is the world’s first open-source, instruction-following LLM fine-tuned on a transparent and openly available dataset. The model, based on EleutherAI’s Pythia family, boasts an impressive 12 billion parameters and has been fine-tuned exclusively on an open-source corpus, databricks-dolly-15k.
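For readers who want to try the model, here is a minimal sketch of loading it through the Hugging Face Transformers library. It assumes the checkpoint is published on the Hub under the identifier databricks/dolly-v2-12b and that the repository ships a custom text-generation pipeline (hence trust_remote_code); a GPU with enough memory for a 12-billion-parameter model is also assumed.

```python
import torch
from transformers import pipeline

# Load Dolly 2.0 from the Hugging Face Hub.
# "databricks/dolly-v2-12b" is assumed to be the Hub identifier;
# trust_remote_code=True lets the repo's own generation pipeline run.
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,   # half-precision to reduce GPU memory use
    trust_remote_code=True,
    device_map="auto",            # spread weights across available devices
)

# Instruction-following usage: pass a natural-language prompt.
print(generate_text("Explain what an open-source LLM is in one sentence."))
```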
Databricks’ employees generated this dataset, and its licensing terms allow it to be used, modified, and extended for any purpose, including academic or commercial applications. There has been a wave of LLM releases that are considered open source by many definitions but are bound by licences that restrict commercial use. The trailblazer was Meta’s LLaMA, followed by Stanford’s Alpaca, Koala, and Vicuna.
The Stanford project’s dataset of 52k questions and answers was generated from ChatGPT’s outputs. But under OpenAI’s terms of use, you can’t use output from its services to develop models that compete with OpenAI. Databricks appears to have sidestepped this restriction with Dolly 2.0.
According to Ali Ghodsi, CEO of Databricks, Dolly 2.0 is set to create a “snowball” effect in the AI community, inspiring others to contribute and collaborate on alternative models. The restriction on commercial use, he explained, was a big obstacle to overcome.