Benford’s Law: A Cloak-and-Dagger Tool for Data Scientists

Benford’s law, also known as the Newcomb-Benford law, is an observation about the frequency distribution of the leading digits of unconstrained numeric data in the real world.

The intuition behind the law dates to 1881, when the American astronomer Simon Newcomb noticed a pattern in books of logarithm tables: the earlier pages, which hold numbers starting with small digits like 1 and 2, were far more worn and marked than the later ones. Newcomb did not pursue the observation much further, and it was only around 50 years later that the physicist Frank Benford revisited the phenomenon, testing it against many datasets such as populations and the lengths of rivers. Let’s see how to apply Benford’s law and what some of its possible applications are.

Figure 1: The probability that a number’s leading digit is d under Benford’s Law: P(d) = log10(1 + 1/d)

The formula above gives the likelihood of each leading digit under Benford’s Law. The probabilities are as follows:

d    P(d)
1    30.1%
2    17.6%
3    12.5%
4     9.7%
5     7.9%
6     6.7%
7     5.8%
8     5.1%
9     4.6%
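
As a quick sanity check, the table above can be reproduced in a few lines of Python (a minimal sketch using only the standard library; the function name is my own):

```python
import math

def benford_probability(d: int) -> float:
    """Probability that the leading digit of a Benford-distributed number is d."""
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(f"{d}    {benford_probability(d):.1%}")
```

Note that the nine probabilities sum exactly to 1, since the product (2/1)(3/2)...(10/9) telescopes to 10.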

If the leading digits of a dataset do not follow the above probability distribution, then either the dataset is too small or someone may have manipulated it. Even a small amount of manipulation in real data can flag fraudulent activity, because it distorts the expected distribution. The law applies best to data produced by unconstrained processes that span several orders of magnitude; not every set of numbers qualifies. Telephone numbers, for example, do not follow it.

The law can serve as a first-pass filter on any unconstrained dataset to check that the data has not been tampered with. Below are some of the applications of Benford’s Law.
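
One way to run such a first-pass check (a minimal sketch; the helper names are my own) is to extract each value’s first significant digit and compare the observed frequencies with the expected ones:

```python
import math
from collections import Counter

def leading_digit(x: float) -> int:
    """Return the first significant digit of a nonzero number."""
    # Stripping '0' and '.' handles values like 0.00456 -> "456" -> 4.
    return int(str(abs(x)).lstrip("0.")[0])

def digit_frequencies(values):
    """Observed relative frequency of each leading digit 1-9."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Powers of 2 are a classic example of a Benford-distributed sequence.
observed = digit_frequencies([2 ** n for n in range(200)])
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
deviation = sum(abs(observed[d] - expected[d]) for d in range(1, 10))
```

A large total deviation suggests the data is either too small a sample or worth a closer look.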

  • Financial Data:

The financial world relies heavily on Benford’s law to identify fraud. It can be applied to loan data, stock prices, tax returns, and so on. Most such datasets will follow the probability distribution; if one does not, either someone has manipulated the data or the dataset may simply be too small.

  • Election Data:

You could take the number of votes a party received in different cities and compare the leading digits to the expected distribution. This can be a useful check for signs of manipulation, such as vote buying or voter coercion.

  • Image Forensics:

At a time when tutorials for creating deepfakes are openly available on the internet, it becomes difficult to rely on visual evidence when proving a crime. Benford’s law can act as a rough first filter for authenticating an image. For example, take a photo with your phone and apply Benford’s law to statistics derived from it (forensic work typically uses the first digits of JPEG DCT coefficients rather than raw pixel intensities, since intensities span too narrow a range): an original image will roughly follow the expected distribution, while a filtered and re-saved image tends to violate it. The same process can be used to spot fake videos. This filter makes it difficult for amateur forgers to fool the law.

  • Twitter Bot Identification:

A researcher in the US wanted to explore uses of Benford’s Law, so she looked at the number of friends each Twitter account has, and the number of friends those friends have on their accounts. Scraping these counts, she found that most accounts followed Benford’s law, but some did not. On closer inspection, those accounts turned out to be bots, and by continuing her research she exposed an entire network of bots on Twitter. Such bots can be used to manipulate elections or send fraudulent messages to people.
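
To turn any of the checks above into a concrete flag, a standard approach is a chi-square goodness-of-fit test against the Benford probabilities. Below is a sketch using only the standard library (the function names are my own; 15.507 is the 5% critical value of the chi-square distribution with 8 degrees of freedom):

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# 5% critical value of the chi-square distribution with 8 degrees of freedom.
CRITICAL_5PCT = 15.507

def chi_square_statistic(values):
    """Chi-square statistic of the observed leading digits vs. Benford's law."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())

def looks_manipulated(values) -> bool:
    """Flag a dataset whose leading digits deviate significantly from Benford."""
    return chi_square_statistic(values) > CRITICAL_5PCT
```

For example, the powers of 2 pass this test comfortably, while a dataset whose leading digits are spread uniformly over 1-9 is flagged.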

Conclusion: 

The law is simple yet powerful, with the ability to flag potential fraud within seconds. It has many applications, and researchers are actively exploring new ones, but one question remains: why does so much real-world data follow Benford’s law? I’ll leave that for you to explore.

Rithwik Chhugani
I am a final year Data Science student with good experience in working with startups across India and Australia in the Machine Learning and AI space. I am always in search of tasks that challenge me to broaden my vision and enhance the level of experience. Looking for a full-time position after my graduation in April 2021. Hit me up if you have an opportunity for me.
