“The main difference I see with GPT-3 is that, unlike other AI applications that are narrow, its performance is general and human-like.” - Sahar Mor
OpenAI’s GPT-3 is one of the largest language models, with 175 billion parameters. It can perform a range of tasks, including text summarisation, search, coding, and essay writing. GPT-3 was the most discussed topic in the ML community in the second half of 2020, and for good reason. The language model drew even more attention after the launch of the OpenAI API. Developers across the world launched their own startups based on GPT-3, and a few even managed to raise funds. Sahar Mor was one of the lucky ones to get access to this API.
Sahar doesn’t have a university degree. Instead, he cut his teeth at Unit 8200, the elite Israeli intelligence unit. He has over a decade of experience in engineering and product management, both focused on products with AI at their core. He also co-founded Homie, a leading platform in Israel for people searching for long-term rental apartments, which uses language processing to index Facebook posts according to their content.
Analytics India Magazine got in touch with Sahar to learn more about his GPT-3 adventures, the implications of large language models, and more.
AIM: Tell us about your AI journey
Sahar: I relocated to Berlin in 2016 to join Zeitgold as a founding Product Manager. Zeitgold is a B2B AI accounting software company that has raised more than $60 million to date. There, I led the building of Zeitgold’s AI products and scaled its internal human-in-the-loop platform to support hundreds of business customers, automating ~75% of human operations. After Zeitgold, I joined Levity.ai, a no-code AutoML platform providing models for image, document, and text tasks, as a founding PM/engineer.
Last summer, I was one of the first engineers within the AI community to get access to OpenAI’s GPT-3 model. I used this technology to build AirPaper, an automated document extraction API. It launched last September, with OpenAI’s CTO retweeting about it, and AirPaper’s waiting list has since grown to more than 100, including startups, accounting firms, and insurance companies. In the last few years, I have been active in the AI community, writing and giving talks about the latest advancements in the field and exploring ways to turn recent breakthroughs in AI research into production-ready AI products.
AIM: Tell us about your GPT-3 based application and how you got access to OpenAI’s API?
Sahar: AirPaper is a robust document intelligence API. Send any document, either a PDF or an image, and get structured data.
To get access, I emailed OpenAI’s CTO with a short background about myself and the app I had in mind. OpenAI’s process for approving apps led me to write about its scalability and shortcomings, along with potential ways to mitigate them. To get API access, one needs to apply via this form. Waiting times can be very long: some developers who applied in late June are still waiting for a response.
Once you’ve built an app that is ready for production, you’ll be required to fill out another form. It might take up to 7 business days for OpenAI’s team to review a request. After your app has been approved, you’re good to go.
Every GPT-3 powered app starts within OpenAI’s Playground, where you quickly iterate and validate whether your problem can be solved with GPT-3. This tinkering is key to developing the intuition needed for crafting successful prompts. During this process, I realised there was an opportunity for OpenAI to automate and optimize this part, which they did several months later with their instruct-model series.
Once I had the right prompt template in mind, I integrated it into my code. This meant preprocessing every document, turning its OCR output into a GPT-3 digestible prompt, and querying the API. After further testing and parameter optimization (e.g. reducing temperature), I deployed the app.
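The steps Sahar describes (clean the OCR text, fill a prompt template, query with a low temperature, read back structured data) can be sketched roughly as follows. This is an illustrative sketch, not AirPaper’s actual code: the field names, the template wording, the character limit, and the helper names are all assumptions, and the request is only assembled as a plain dictionary rather than sent to any real API client.

```python
import re

# Hypothetical field list; the interview does not say which fields AirPaper extracts.
FIELDS = ["vendor", "invoice_number", "total_amount"]

# Hypothetical prompt template; the real one was tuned by hand in the Playground.
PROMPT_TEMPLATE = (
    "Extract the following fields from the document.\n"
    "Fields: {fields}\n\n"
    "Document:\n{document}\n\n"
    "Extracted fields:\n"
)


def clean_ocr_text(raw, max_chars=3000):
    """Collapse whitespace on each line, drop empty lines, and truncate."""
    lines = [re.sub(r"\s+", " ", line).strip() for line in raw.splitlines()]
    text = "\n".join(line for line in lines if line)
    return text[:max_chars]


def build_prompt(ocr_text, fields=FIELDS):
    """Turn raw OCR output into a prompt the model can complete."""
    return PROMPT_TEMPLATE.format(
        fields=", ".join(fields),
        document=clean_ocr_text(ocr_text),
    )


def build_request(prompt, temperature=0.0):
    """Assemble completion parameters (assumed names); low temperature
    keeps the extraction as repeatable as possible."""
    return {
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": 128,
        "stop": ["\n\n"],
    }


def parse_completion(completion, fields=FIELDS):
    """Parse 'key: value' lines from the model output, keeping known fields only."""
    result = {}
    for line in completion.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if sep and key in fields:
            result[key] = value.strip()
    return result
```

In such a setup, the dictionary from `build_request` would be passed to whichever completion client is in use, and the returned text handed to `parse_completion`. Restricting the parser to known fields is one simple way to discard anything extra the model generates.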
AIM: What are the usual challenges you face while training large language models?
Sahar: Lack of data that is relevant for the task at hand. As an example, there are many great open-source datasets for reviews and tweets, but none for document processing. That’s the reason there are still so many commercial companies building document intelligence APIs such as Google Document AI, AWS Textract, Instabase, etc.
I’m building DocumNet, which is an ImageNet equivalent but for documents. I believe that, given enough data, document understanding can be commoditized in the same way computer vision has during the last decade.
Inference and training costs are another challenge. Training and hosting your own language model can be quite costly. You can avoid these costs by using language model inference APIs such as those from OpenAI and Hugging Face, but those come at a premium.
That said, I see these two costs dropping significantly during the next three years thanks to more efficient methods for training language models, a decreasing need for abundant data, and cloud providers reducing their prices.
AIM: OpenAI has released DALL·E and CLIP recently. Do you think fusion models (vision + language) are the future of AI research?
Sahar: Definitely. These days, most AI applications in production are vertical, and by now it’s commonly understood that narrow AI is not nearly equivalent to human intelligence, even when performing one specific task. For example, the SOTA deep learning model for early-stage detection of cancer (vision) is limited in its performance when it’s not combined with the patient’s charts (text) from her electronic health records.
I’ve seen this issue with many AI companies I’ve consulted for over the years, and also during my time at Zeitgold. For example, when a human operations agent was extracting an invoice amount, he implicitly took into account the handwritten correction next to the original amount and therefore extracted the handwritten one. He was only able to draw this conclusion because he took both the image and the textual input into consideration.
The main reason multimodal systems aren’t common in AI research is their tendency to pick up on biases in datasets. This can be solved with more data, which is becoming increasingly available. Multimodal applications are not only relevant in the context of vision + language. In recent years, Facebook has released several papers outlining novel approaches to automatic speech recognition (ASR) that combine audio and text.
AIM: Should GPT-3 be regulated in the future?
Sahar: Yes, but it’s tricky. The question of “should GPT-3 be regulated?” is a broader one involving other applications of AI. The main difference I see with GPT-3 is that, unlike other AI applications that are narrow, its performance is general and human-like. This is the same concern we have with technologies such as deep fakes. Nevertheless, the fact that OpenAI is regulating itself shows it acknowledges the harmful potential of its technology. And if that’s the case, can we trust a commercial company to self-regulate in the absence of an educated regulator? What happens once such a company faces a trade-off between ethics and revenues?
The bottom-line answer is yes, yet the main challenge is to understand whether, and in what ways, regulation is an effective tool for ensuring safe AI adoption. (I personally believe industry and research will beat the regulator by setting their own standards and recommendations, which is, again, dangerous.)
Human intelligence works in a multimodal manner: we use all of our senses when making decisions such as “what is in this picture?” or “is this a toxic comment?”. Furthermore, to make these decisions, we draw on other elements such as our past experiences, which are the de facto equivalent of transfer learning in ML.
Not incorporating the two confines whatever ML model you’re building to its (missing) data. If the saying “your model is only as good as the data it was trained on” is a popular one, then how about “your model is only as good as the completeness of the data it was trained on”?
Recommended Readings By Sahar Mor:
- Facebook AI’s Multitask & Multimodal Unified Transformer: A Step Toward General-Purpose Intelligent Agents
- Generative Pretraining from Pixels
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
- 12-in-1: Multi-Task Vision and Language Representation Learning
- Multi-skilled AI