Data is Gold, Twitter the Goldmine to Train AI Models

Every tweet posted on the platform becomes the property of the social media giant and can be used by others who have access to its API

Published on May 23, 2023

by Lokesh Choudhary

Meta-owned Instagram has ambitious plans in the pipeline to enter the microblogging space and challenge Elon Musk’s Twitter app. According to a Bloomberg report, Instagram is currently developing a Twitter-like microblogging application set to make its debut before the end of June. Internally referred to as the P92 or Barcelona, this new platform aims to combine the best features of Instagram and Twitter.

Instagram’s new app is being developed at a time when Jack Dorsey is already working on his decentralised social network Bluesky, which is similar to Twitter but open source. Mastodon, another decentralised, Twitter-like social network, is a comparable example.

As per recent reports, Instagram’s forthcoming app will serve as a text-based platform for engaging in conversations. Users will have the ability to communicate directly with their audience and peers.

The app will offer various creative tools for users to craft their messages, including the option to incorporate links, photos, and videos.

Legal Disputes

Instagram’s foray into creating an alternative to Twitter coincides with an ongoing legal dispute between Twitter and Microsoft. Twitter has filed a lawsuit against Microsoft, accusing the company of unauthorised use of Twitter data for training purposes, which it deems ‘illegal’.

The conflict emerged when Microsoft refused to pay for access to Twitter’s API, which had recently introduced new payment tiers. Previously, developers could use the Twitter API for free, but Twitter’s CEO, Musk, announced the end of this free access in order to increase the company’s earnings.

Taken together, the lawsuit, Twitter’s decision to monetise its API, and Meta’s introduction of a Twitter-like application hint at a larger context. Twitter has traditionally been a unique platform, fostering text-heavy content and enabling individuals to express their opinions freely, which leads to a more authentic and human experience than on other platforms where insincerity is prevalent.

This distinctive nature of Twitter provides invaluable data for researchers aiming to enhance the human-like responses of language models like GPT. While the specific motives behind Elon Musk’s lawsuit against Microsoft remain unknown, the possibility of Microsoft utilising Twitter data to train OpenAI’s GPT models cannot be dismissed.

Microsoft has ventured into training bots on Twitter data before. One notable example is Tay, a Twitter bot introduced in 2016 and positioned by the company as an experiment in “conversational understanding”.

Microsoft stated that Tay would become smarter the more users interacted with it, learning to engage people through casual and playful conversation. Unfortunately, this endeavour turned sour for the software giant.

Users began inundating the bot with misogynistic, racist, and Donald Trump-inspired remarks. As a result, Tay, essentially an internet-connected robot parrot, started echoing these sentiments back to users.

Training Data in Conflict with Users

Twitter’s privacy policy, which most users tend to ignore, clearly states that by publicly posting content, the users are directing the platform to disclose that information as broadly as possible, including through its APIs, and directing those accessing the information through its APIs to do the same.

This essentially means that every tweet posted on the platform becomes the property of the social media giant and can be used by others who have access to its API.

However, since Twitter has put its API behind a paywall, many companies are coming up with their own Twitter-like platforms, claiming to provide users with a richer experience. Users may or may not have wanted that in the first place.

They use social media platforms to express their own opinions, not to provide a cache of data for AI companies to build LLMs from.

European Union Draft AI Bill

While it is currently permissible for platforms to utilise users’ opinions and posts as data for their models as long as they include it in their privacy policy, it is crucial to ensure that users are sufficiently informed about this.

In the past, websites commonly employed cookies to enhance the browsing experience for visitors, yet they did not typically present the consent pop-ups we often encounter nowadays. Similarly, legislation such as the EU draft AI Bill advocates for companies to disclose the datasets on which their models are trained, which could provide invaluable insight in such scenarios.

To address this matter effectively, it would be beneficial to introduce a notification, such as a pop-up message, on social media platforms explicitly informing users that their opinions are being utilised to train AI models.

PS: The story was written using a keyboard.
