At MLDS 2020, Vinodhini Ranganathan, Data Scientist at Cisco, highlighted the challenges posed by fake reviews and presented an analytics framework for identifying and removing them from web platforms.
“In God we trust, all others bring data,” W. Edwards Deming said long ago. In today's online e-commerce ecosystem, though, that kind of trust in data is hard to sustain. The true value lies in authentic data, and finding it is difficult when platforms are filled with fake reviews.
“Fake reviews are becoming an increasing problem. Online platforms like Amazon, TripAdvisor, Zomato or any other review platform where you have a lot of volume of reviews coming, and that’s where we deal with this problem of fake reviews,” said Vinodhini Ranganathan of Cisco.
The impact of reviews is vast: 68% of millennials always go through reviews before they make a purchase.
According to Vinodhini, consumer reviews are trusted significantly more than any other type of review, which makes this area worthy of more research. Negative reviews can sink sales, she said, while positive reviews boost them.
“There’s a lot of sentiment that goes around when there is a negative review. Let’s say the Maggi controversy when reports came out saying it had high levels of lead in it, leading to a huge dip in sales,” Vinodhini said.
A single negative review can cost a business 30 customers. After reading three negative reviews, 59% of consumers will not buy. Four or more negative reviews about a company or product might drive away 70% of potential customers. That is why it is important to identify which reviews are authentic and which are fake, particularly for businesses that rely heavily on online reviews, such as restaurants, hotels, medical/healthcare providers, clothing stores and grocery stores.
Negative reviews are not the only ones with an impact. For every additional star a business gets, revenue increases by approximately 5-9%. Consumers are likely to spend 31% more on a business with excellent reviews, and 72% of consumers say that positive reviews make them trust a local business more.
Fake reviews reach web platforms in several ways: sock puppeting, where a single person posts several reviews; crowd turfing, where online sellers hire people to post reviews; and review brushing, where orders are created via identity theft and sent to alternate delivery addresses, which then allows fake reviews to be posted.
While analytics and ML systems can potentially be used to identify fake reviews, according to Vinodhini, human learning is also key here. “We as individuals have a lot of challenges as it is not like any other task where we can [use] machine learning. Usually, you have a forecasting problem, and you can identify it because the human learning component is needed to a great extent. But, in the case of identifying fake reviews, it is humans who are going to annotate these reviews as fake or not fake, based on the different features which have been identified within data sets of reviews,” she said.
Feature Extraction Is The Challenge
Vinodhini said that feature extraction is one of the major challenges in fake review detection, because reviews can only be annotated and labelled once all of the feature engineering has been done. Automation comes after that. “While we have so many algorithms for that, the initial challenge is to annotate datasets we have as part of the feature engineering extraction process,” she said.
There are different types of features here, such as reviewer-centric, review-centric and network-centric features. Reviewer-centric features include the number of reviews, number of helpful votes, time interval between reviews, percentage of positive and negative reviews, ratio of verified purchases, verified stay flag, rating deviation, review length, etc.
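Several of the reviewer-centric features listed above can be computed directly from a table of reviews. The following is a minimal sketch, not the speaker's implementation: the `Review` record, its field names, the 4-star/2-star cut-offs for positive and negative, and the definition of rating deviation as the mean absolute gap from a product's average rating are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Review:  # hypothetical record layout, assumed for this sketch
    reviewer_id: str
    product_id: str
    rating: int  # 1 to 5 stars
    text: str
    posted_at: datetime
    verified: bool
    helpful_votes: int


def reviewer_features(reviews):
    """Aggregate per-reviewer features from a list of Review records."""
    # Average rating per product, needed for the rating-deviation feature.
    ratings_by_product = defaultdict(list)
    for r in reviews:
        ratings_by_product[r.product_id].append(r.rating)
    product_avg = {p: mean(v) for p, v in ratings_by_product.items()}

    by_reviewer = defaultdict(list)
    for r in reviews:
        by_reviewer[r.reviewer_id].append(r)

    features = {}
    for reviewer, rs in by_reviewer.items():
        rs.sort(key=lambda r: r.posted_at)
        # Hours between consecutive reviews by the same reviewer.
        gaps = [(b.posted_at - a.posted_at).total_seconds() / 3600
                for a, b in zip(rs, rs[1:])]
        features[reviewer] = {
            "num_reviews": len(rs),
            "helpful_votes": sum(r.helpful_votes for r in rs),
            "mean_gap_hours": mean(gaps) if gaps else None,
            "pct_positive": sum(r.rating >= 4 for r in rs) / len(rs),  # assumed cut-off
            "pct_negative": sum(r.rating <= 2 for r in rs) / len(rs),  # assumed cut-off
            "verified_ratio": sum(r.verified for r in rs) / len(rs),
            "rating_deviation": mean(abs(r.rating - product_avg[r.product_id]) for r in rs),
            "mean_review_length": mean(len(r.text.split()) for r in rs),
        }
    return features


# Toy data, invented for the example.
sample = [
    Review("u1", "p1", 5, "great great product", datetime(2020, 1, 1, 10), False, 0),
    Review("u1", "p2", 5, "great great product", datetime(2020, 1, 1, 12), False, 0),
    Review("u2", "p1", 3, "works fine but the battery is weak", datetime(2020, 1, 5, 9), True, 4),
]
feats = reviewer_features(sample)
```

Features like these would feed the human annotation step described in the talk; on their own they do not decide whether a reviewer is fake.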
Review-centric features include near-duplicate reviews posted from different IDs, the same reviewer posting reviews for different products with the same content, and the same reviews being spammed at different intervals. “Using NLP, experts can find the percentage of nouns, pronouns or adjectives as part of speech tagging. If you have a lot of pronouns or adjectives in your reviews, then it is more likely to be a fake review. Then there are things like lexical validity, lexical diversity, content diversity, syntactical diversity, usage of pictures/links, emotiveness, sentiment score, product information matching etc,” said Vinodhini.
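Two of the review-centric signals mentioned above, near-duplicate content and lexical diversity, can be illustrated with the standard library alone. This is a rough sketch under stated assumptions: Jaccard similarity over three-word shingles stands in for whatever duplicate detection a production system would use, lexical diversity is taken to be the type-token ratio, and the 0.6 threshold is arbitrary. Part-of-speech proportions would need a tagger from an NLP library and are not shown.

```python
import re
from itertools import combinations


def tokens(text):
    """Lowercased word tokens; punctuation is ignored only for comparison."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    toks = tokens(text)
    if len(toks) < k:
        return {tuple(toks)} if toks else set()
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(reviews, threshold=0.6):
    """reviews maps review id to text; returns (id, id, similarity) pairs."""
    sh = {rid: shingles(text) for rid, text in reviews.items()}
    pairs = []
    for a, b in combinations(sorted(sh), 2):
        score = jaccard(sh[a], sh[b])
        if score >= threshold:  # threshold is an assumed value
            pairs.append((a, b, round(score, 2)))
    return pairs


def lexical_diversity(text):
    """Type-token ratio: distinct words divided by total words."""
    toks = tokens(text)
    return len(set(toks)) / len(toks) if toks else 0.0


# Toy reviews, invented for the example.
texts = {
    "r1": "Amazing phone, best purchase I have ever made!",
    "r2": "Amazing phone, best purchase I have ever made!!!",
    "r3": "Battery drains quickly but the camera is decent.",
}
duplicate_pairs = near_duplicates(texts)
```

Comparing every pair of reviews is quadratic in the number of reviews, so a platform with a large review volume would need an approximate method rather than this exhaustive loop.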
Finally, there are network-centric features, including IP address, GPS information, timestamp, traffic patterns, sender IP neighbourhood density and device information. “Whenever there is a spam network, they are usually closely-knit during a given period of time coming from the same IP neighbourhoods,” Vinodhini said.
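One possible reading of "IP neighbourhood density" is the number of distinct reviewers posting from the same subnet within a short time window. The sketch below takes that reading as an assumption; the /24 prefix, the one-hour window, the three-reviewer threshold and the function names are illustrative choices, not details from the talk.

```python
import ipaddress
from collections import defaultdict
from datetime import datetime, timedelta


def subnet_of(ip, prefix=24):
    """Network that contains the address, e.g. 203.0.113.0/24."""
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)


def dense_neighbourhoods(events, window=timedelta(hours=1), min_reviewers=3):
    """events: iterable of (reviewer_id, ip, timestamp) tuples.

    Returns {subnet: peak number of distinct reviewers seen in any window}
    for subnets whose peak reaches min_reviewers.
    """
    by_subnet = defaultdict(list)
    for reviewer, ip, ts in events:
        by_subnet[subnet_of(ip)].append((ts, reviewer))

    flagged = {}
    for net, items in by_subnet.items():
        items.sort()
        start, peak = 0, 0
        for end in range(len(items)):
            # Slide the window start forward until it spans at most `window`.
            while items[end][0] - items[start][0] > window:
                start += 1
            peak = max(peak, len({r for _, r in items[start:end + 1]}))
        if peak >= min_reviewers:
            flagged[str(net)] = peak
    return flagged


# Toy events using documentation-only IP ranges, invented for the example.
t0 = datetime(2020, 1, 30, 10, 0)
events = [
    ("u1", "203.0.113.5", t0),
    ("u2", "203.0.113.17", t0 + timedelta(minutes=10)),
    ("u3", "203.0.113.200", t0 + timedelta(minutes=25)),
    ("u4", "198.51.100.7", t0 + timedelta(minutes=5)),
    ("u1", "203.0.113.5", t0 + timedelta(hours=5)),
]
flagged = dense_neighbourhoods(events)
```

A shared subnet is only a weak signal by itself, since offices and mobile carriers put many genuine users behind neighbouring addresses; it is one feature among the many listed above.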
What Is The Proposed Framework For Fake Review Detection?
Sentiment analysis models should include a fake/non-fake classification component, where the possible classes are fake positive, fake negative, genuine positive and genuine negative. There is a lot of experimentation and research that can be done when this is combined with a traditional sentiment analysis model. Vinodhini proposes the following framework:
“If you have unlabelled data, we are not going to clean the data how we do for sentiment analysis where we go remove numbers, punctuations, we should be looking at the data as is. The best and least normalisation we can do is converting everything into lower case and correct spell errors because every information in the review is very important. This will be data preprocessing. Then, it is followed by feature engineering, where all the list of features we spoke about and annotate the reviews datasets. After we have the labelled data, we are going to put it through classification algorithms for automation. This is similar to any sort of classification approach we follow. But, there will be a lot of skewed imbalanced data, so we can use a lot of sampling techniques to make the classes balanced, and then we can go and evaluate the model,” stated Vinodhini.
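Two steps of the framework quoted above lend themselves to a short illustration: the deliberately light preprocessing and the rebalancing of skewed classes. The sketch below lowercases text while leaving numbers and punctuation intact, and balances classes by random oversampling of the minority class. Random oversampling is only one of the "sampling techniques" the quote alludes to and is chosen here as an assumption; spelling correction is omitted, and the function names are hypothetical.

```python
import random
from collections import Counter


def minimal_normalise(text):
    """Lowercase only; numbers and punctuation are kept as potential signal."""
    return text.lower()


def random_oversample(rows, label_of, seed=0):
    """Duplicate randomly chosen minority rows until every class has as many
    rows as the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy annotated data, invented for the example: (text, label) pairs.
labelled = [
    ("BEST product EVER!!! 10/10", "fake"),
    ("Decent value, arrived on time.", "genuine"),
    ("Stopped working after two weeks.", "genuine"),
    ("Camera is fine, battery is average.", "genuine"),
    ("Good fit, colour slightly off.", "genuine"),
]
prepared = [(minimal_normalise(text), label) for text, label in labelled]
balanced = random_oversample(prepared, label_of=lambda row: row[1])
class_counts = Counter(label for _, label in balanced)
```

Oversampling would normally be applied to the training split only, so that duplicated rows do not leak into the data used to evaluate the model.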
According to Vinodhini Ranganathan of Cisco, fake or non-fake should be the first task in the pipeline for review text analysis whenever a sentiment analysis model is created.
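Putting fake detection first and sentiment second yields the four classes named earlier. The sketch below shows only that ordering; the two classifiers are passed in as plain functions, and the toy rules used in the example are placeholders invented for illustration, not real detection logic.

```python
from enum import Enum
from typing import Callable


class ReviewClass(Enum):
    FAKE_POSITIVE = "fake positive"
    FAKE_NEGATIVE = "fake negative"
    GENUINE_POSITIVE = "genuine positive"
    GENUINE_NEGATIVE = "genuine negative"


def classify_review(
    text: str,
    is_fake: Callable[[str], bool],
    is_positive: Callable[[str], bool],
) -> ReviewClass:
    """Run the fake/non-fake step first, then the sentiment step."""
    fake = is_fake(text)
    positive = is_positive(text)
    if fake:
        return ReviewClass.FAKE_POSITIVE if positive else ReviewClass.FAKE_NEGATIVE
    return ReviewClass.GENUINE_POSITIVE if positive else ReviewClass.GENUINE_NEGATIVE


# Placeholder rules standing in for trained models (assumptions, not real logic).
def toy_is_fake(text: str) -> bool:
    return text.count("!") >= 3


def toy_is_positive(text: str) -> bool:
    return "great" in text.lower()


labels = [
    classify_review(t, toy_is_fake, toy_is_positive)
    for t in ["Great!!! Buy now!!!", "Great battery life.", "Broke in a week."]
]
```

With the classes kept separate, a downstream sentiment summary can be computed over the genuine reviews only.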
“We are talking about explainable AI, industry 4.0, we should also be talking about responsible AI. All of us are directly and indirectly responsible for the content that is getting posted online. As ML practitioners we also have to come up with ways to identify all of these because there is not much process, we have BERT, we have neural networks, and we should be also looking at this problem research and find out models which identify fake reviews,” said Vinodhini Ranganathan, Data Scientist at Cisco.