University of Vermont (UVM) researchers recently unveiled a new tool called the Storywrangler to visualise the use of billions of words, hashtags and emoji posted on Twitter. Check out the code on GitHub.
In a research paper, “Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter,” researchers from the University of Vermont, in collaboration with Charles River Analytics and MassMutual Data Science, detailed the workings of a tool that curated over 150 billion tweets containing 1 trillion 1-grams from 2008 to 2021.
The researchers highlighted the tool’s potential with use cases covering social amplification, the sociotechnical dynamics of famous individuals, box office success, and social unrest.
How does it work?
The team has broken down tweets into 1-, 2-, and 3-grams across 100+ languages, generating daily frequencies for words, hashtags, numerals, handles, symbols, and emojis. A 1-gram, or unigram, is a single word. Similarly, a 2-gram, or bigram, is a two-word sequence, a 3-gram, or trigram, is a three-word sequence, and so on.
For example, in the visuals below from the tool’s online viewer, three global events from 2020 are highlighted: the death of Iranian general “Qasem Soleimani”; the beginning of the “Covid-19” pandemic; and the “Black Lives Matter” protests following the murder of “George Floyd”.
(Source: arXiv)
“We make the dataset available through an interactive time series viewer and as downloadable time series and daily distributions,” said the UVM researchers.
Though Storywrangler leverages Twitter data, the team’s method of tracking dynamic changes in n-grams can be extended to any evolving corpus.
Thayer Alshaabi, a researcher from UVM, said, “It is like a telescope to look, in real time, at all this data that people share on social media. We hope people will use it themselves, in the same way you might look up at the stars and ask your own questions.”
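To make the idea of daily n-gram frequencies concrete, here is a minimal Python sketch. It is not the Storywrangler code: the function names, the sample posts, and the naive whitespace tokenizer are all assumptions made for illustration, and the real tool handles hashtags, emojis, and 100+ languages far more carefully.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def daily_ngram_counts(posts, n):
    """Count n-grams per day from (date_string, text) pairs.

    Tokenization here is a plain lowercase whitespace split,
    which is an illustrative simplification only.
    """
    counts = defaultdict(Counter)
    for day, text in posts:
        tokens = text.lower().split()
        counts[day].update(ngrams(tokens, n))
    return counts


# Hypothetical posts, invented for this example.
posts = [
    ("2020-01-03", "breaking news tonight"),
    ("2020-01-03", "breaking news again"),
    ("2020-01-04", "quiet day"),
]

bigrams = daily_ngram_counts(posts, 2)
print(bigrams["2020-01-03"][("breaking", "news")])  # prints 2
```

Because the counting only needs dated text, the same approach would apply to any corpus that changes over time, not just tweets.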
Why Twitter?
Powered by UVM’s supercomputer at the Vermont Advanced Computing Core, Storywrangler provides a powerful lens for viewing and analysing the rise and fall of words, ideas, and tweets each day. “It is important because it shows major discourses as they are happening,” said Jane L. Adams. “It is quantifying collective attention.”
“Though Twitter does not represent the whole of humanity, it is used by a very large and diverse group of people, which means that it encodes popularity and spreading,” noted the researchers.
Interestingly, the researchers showed the tool could be used to predict political and financial turmoil. The team examined the percent change in the use of the words ‘rebellion’ and ‘crackdown’ in various regions of the world and found that the rise and fall of these terms was significantly associated with changes in a well-established index of geopolitical risk for those locations.
Christopher M. Danforth, a professor in UVM’s computer science department, said Storywrangler offers a data-driven way to index what regular people are talking about in everyday conversations, not just what authors or reporters have chosen.
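As a rough illustration of what a percent change in word usage means, the sketch below compares a word’s relative frequency across two periods. The function names and the counts are hypothetical, and this shows only the arithmetic; it does not reproduce the statistical analysis the researchers used to relate these changes to geopolitical risk.

```python
def relative_frequency(count, total):
    """Share of all counted words in a period that a given word accounts for."""
    return count / total if total else 0.0


def percent_change(old, new):
    """Percent change from old to new; undefined when old is zero."""
    if old == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (new - old) / old * 100.0


# Hypothetical counts for one word in two consecutive periods.
previous = relative_frequency(50, 1_000_000)
current = relative_frequency(75, 1_000_000)

change = percent_change(previous, current)
print(f"{change:.1f}%")  # an increase of about 50 percent
```

Using relative frequency rather than raw counts keeps the comparison fair when the overall volume of posts differs between periods.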
Storywrangler aims to enable research in computational social science, data journalism, natural language processing, and the digital humanities.
UVM’s Danforth said a hashtag is being invented every second. “We did not know to look for that yesterday, but it will show up in the data and become part of the story.”
Wrapping up
With support from the National Science Foundation, the UVM team is currently using Twitter to demonstrate how chatter on distributed social media can act as a kind of global sensor system, recording what happened, how people reacted, and what’s next.
In theory, other social media streams, including Reddit, 4chan and Weibo, can also be used to feed Storywrangler.