Amazon Releases Dataset To Detect Counterfactual Phrases For Products

Amazon has publicly released a new dataset to help train machine learning models to recognize counterfactual statements.

Product retrieval systems, like the one in the Amazon Store, often use the text of product reviews to improve the results of queries. But such systems can be misled by counterfactual statements, which describe events that did not or cannot take place.

Counterfactual statements in reviews are rare, but they can lead to frustrating experiences for customers. For instance, a search for “red shirt” may pull up a product whose reviews make clear that it is not available in red.

The newly released dataset is meant to ease such complications by giving machine learning models annotated examples of counterfactual statements to learn from.

At the time this project was started, there were no large-scale datasets that covered counterfactual statements in product reviews in multiple languages. Amazon decided to annotate sentences selected from product reviews in three languages: English, German, and Japanese. Sentences that express counterfactuals are rare in natural-language texts, making up only 1-2% of sentences according to one study. Therefore, simply annotating a randomly selected set of sentences would yield a highly imbalanced dataset with a sparse training signal.
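To make the imbalance concrete, the short sketch below works out how many counterfactual sentences a random sample would be expected to contain at the 1-2% rate quoted above. The sample size of 10,000 is a hypothetical figure chosen for illustration, not a number from Amazon.

```python
def expected_positives(num_sentences: int, rate_percent: int) -> float:
    """Expected number of counterfactual sentences in a random sample."""
    return num_sentences * rate_percent / 100


# Hypothetical sample size, chosen only for illustration.
sample_size = 10_000

low = expected_positives(sample_size, 1)   # at a 1% counterfactual rate
high = expected_positives(sample_size, 2)  # at a 2% counterfactual rate

print(f"Expected counterfactual sentences: {low:.0f} to {high:.0f} of {sample_size}")
```

Under these assumptions, roughly 98-99% of the annotation effort would go to negative examples, which is why sentences were selected deliberately rather than sampled at random.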

Counterfactual statements can be broken into two parts: a statement about the event (“if it were available in red”), also referred to as the antecedent, and the consequence of the event (“I would have bought this shirt”), referred to as the consequent.

Image Source: Amazon

To identify counterfactual statements, Amazon specified certain relationships between antecedent and consequent in the presence of certain clue words. With the help of professional linguists for all the languages under consideration, the researchers compiled a set of such specifications covering conjunctive normal sentences, conjunctive converse sentences, modal propositional sentences, and sentences with clue words such as “wished” and “hoped”.

However, not all sentences that contain counterfactual clues express counterfactuals. So, professional linguists also reviewed the selected sentences to determine whether they truly expressed counterfactuals.
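The clue-based selection step can be pictured with a minimal sketch like the one below. It is not Amazon's actual specification: only the clue words “wished” and “hoped” come from the article, while the “if ... would have” pattern, the function name, and the sample sentences are assumptions made for illustration.

```python
import re

# Clue words named in the article; a real list would be far longer.
CLUE_WORDS = ("wished", "hoped")

# Hypothetical pattern for an "if ... would have" construction, in either order.
CLUE_PATTERN = re.compile(
    r"\bif\b.*\bwould have\b|\bwould have\b.*\bif\b", re.IGNORECASE
)


def has_counterfactual_clue(sentence: str) -> bool:
    """Return True if the sentence contains a clue word or matches the pattern."""
    lowered = sentence.lower()
    if any(re.search(rf"\b{re.escape(word)}\b", lowered) for word in CLUE_WORDS):
        return True
    return CLUE_PATTERN.search(sentence) is not None


# Hypothetical review sentences.
reviews = [
    "I would have bought this shirt if it were available in red.",
    "I wished for a red shirt and that is exactly what arrived.",
    "The shirt fits well and the color is great.",
]

# Candidates are sentences with a clue; they still need human review.
candidates = [s for s in reviews if has_counterfactual_clue(s)]
print(candidates)
```

The second sentence is selected because it contains “wished”, even though it describes something that actually happened. Cases like this are why a clue match only produces candidates and the final label comes from a linguist.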

Selecting sentences based on precompiled clue word lists could, however, bias the data. Hence, the team also selected sentences that do not contain clue words but are highly similar to sentences that do. As a measure of similarity, Amazon used the proximity of sentence embeddings (vector representations of the sentences) computed by a pretrained BERT model.
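The idea of picking clue-free sentences by embedding proximity can be sketched as follows. The article does not say which proximity measure was used, so cosine similarity is an assumption here, and the three-dimensional vectors, the sentence labels, and the 0.9 threshold are hypothetical stand-ins for real BERT embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_similar(no_clue_vecs, clue_vecs, threshold):
    """Keep clue-free sentences whose closest clue sentence is similar enough."""
    selected = []
    for name, vec in no_clue_vecs.items():
        best = max(cosine_similarity(vec, c) for c in clue_vecs.values())
        if best >= threshold:
            selected.append(name)
    return selected


# Toy vectors standing in for BERT sentence embeddings (hypothetical values).
clue_vecs = {"clue_sentence": [0.9, 0.1, 0.0]}
no_clue_vecs = {
    "near_sentence": [0.8, 0.2, 0.1],  # points in almost the same direction
    "far_sentence": [0.0, 0.1, 0.9],   # points in a very different direction
}

selected = select_similar(no_clue_vecs, clue_vecs, threshold=0.9)
print(selected)
```

In practice the vectors would come from a pretrained encoder and have hundreds of dimensions, but the selection logic stays the same.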

Counterfactual detection can be modelled as a binary classification task: given a sentence, classify it as positive if it expresses a counterfactual statement and negative otherwise. The research team experimented with different methods for representing sentences, such as bag-of-words representations, static word-embedding-based representations, and contextualized word-embedding-based representations. Different classification algorithms were also evaluated, ranging from logistic regression and support vector machines to multilayer perceptrons. The team found that a cross-lingual language model (XLM) based on the RoBERTa model and fine-tuned on the counterfactually annotated sentences performed best overall.
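To show what the binary classification setup looks like at its simplest, here is a minimal sketch of a bag-of-words logistic regression, one of the baseline families mentioned above. It is not the fine-tuned XLM model that performed best, and it is not Amazon's code. The training sentences, function names, and hyperparameters are all hypothetical, and the bias term is omitted to keep the sketch short.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return sentence.lower().split()


def score(weights, bag):
    """Weighted sum of word counts; unseen words contribute nothing."""
    return sum(weights.get(word, 0.0) * count for word, count in bag.items())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(sentences, labels, epochs=200, lr=0.5):
    """Batch gradient ascent on the logistic log-likelihood, without a bias term."""
    bags = [Counter(tokenize(s)) for s in sentences]
    weights = {}
    for _ in range(epochs):
        grad = defaultdict(float)
        for bag, label in zip(bags, labels):
            error = label - sigmoid(score(weights, bag))
            for word, count in bag.items():
                grad[word] += error * count
        for word, g in grad.items():
            weights[word] = weights.get(word, 0.0) + lr * g / len(bags)
    return weights


def classify(weights, sentence):
    """Return 1 for counterfactual, 0 otherwise."""
    return 1 if sigmoid(score(weights, Counter(tokenize(sentence)))) > 0.5 else 0


# Hypothetical training data: 1 = counterfactual, 0 = not counterfactual.
train_sentences = [
    "i would have bought it if it came in red",
    "if only it were cheaper i would have kept it",
    "the shirt is red and fits well",
    "great price and fast shipping",
]
train_labels = [1, 1, 0, 0]

weights = train(train_sentences, train_labels)
print(classify(weights, "i would have ordered two if it were longer"))  # → 1
print(classify(weights, "the price is great and shipping was fast"))    # → 0
```

A model like this only learns which surface words co-occur with each label, so it would treat any sentence containing “would” or “if” as counterfactual. That limitation is consistent with the team's finding that contextualized representations performed better.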

To study the relationship between the new dataset and existing datasets, the team trained a counterfactual detection model on Amazon’s dataset and evaluated it on the public dataset for a counterfactual-detection competition, which contains counterfactual statements from news articles. Models trained on Amazon’s dataset performed poorly on the competition dataset, indicating that the counterfactual statements in product reviews, the focus of Amazon’s dataset, are significantly different from those in news articles.

As a simple baseline, Amazon first trained a model on English training data and then applied it to German and Japanese test data, translated into English via a machine translation system. However, this simple baseline resulted in poor performance, indicating that counterfactuals are highly language-specific, so more principled approaches will be needed for their cross-lingual transfer.

The team is still investigating filtering by other types of linguistic constructions besides counterfactuals, as well as extending the detection models to other languages.


Victor Dey
Victor is an aspiring Data Scientist & is a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.
