As organisations move from traditional business models to emerging technologies, LinkedIn, the popular professional social networking platform, has also adopted machine learning to serve professionals in more sophisticated ways.
LinkedIn, a business and employment-oriented platform, hosts more than 660 million members in over 200 countries and territories. According to a source, it has 303 million active users per month, and on average two people create an account on the platform every second. The platform also sees 172,800 new users every day (consistent with two sign-ups per second) and around 62 million unique users every year.
With data generated at this scale, the platform's developers have been working to make its machine learning models robust, so that they produce more accurate decisions and results for users. The ML research team at LinkedIn built a domain-specific language (DSL) and a Jupyter notebook to integrate the selected features and tune the parameters. In one of our articles, we discussed how LinkedIn’s recommendation system is generating the perfect job match for its users.
Using machine learning to detect and remove inappropriate content has become common among social media giants such as Facebook, LinkedIn and Twitter. To automatically detect and remediate behaviour that violates its terms of service, and to maintain a safe and trusted community, the professional platform already uses techniques such as automated fake account detection and abuse detection systems. According to reports, between January and June 2019 the platform took action against 21.6 million fake accounts.
The platform has been working continually to find and remove profiles that contain inappropriate content, and the ML team at LinkedIn recently detailed a machine learning approach for handling such content.
Initial ML Approach
The initial approach was to identify and maintain a set of words and phrases known as a blocklist. When an account contained any of these inappropriate words or phrases, it was marked fraudulent and removed from LinkedIn. However, this approach has a few drawbacks:
- Scalability: the process is fundamentally manual, and care must be taken each time a word or phrase is evaluated.
- Context: many words have both appropriate and inappropriate uses.
- Maintainability: tracking performance phrase by phrase takes significant time and engineering effort.
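To make the context drawback concrete, here is a minimal sketch of blocklist matching in Python. The blocklist entries, the function name `matches_blocklist` and the sample profile texts are hypothetical and chosen only for illustration; they are not LinkedIn's actual list or code.

```python
import re

# Hypothetical blocklist entries, used only for illustration.
BLOCKLIST = {"escort", "free money"}


def matches_blocklist(profile_text, blocklist=BLOCKLIST):
    """Return the sorted blocklist phrases found in the text.

    Matching is case-insensitive and restricted to whole words.
    """
    text = profile_text.lower()
    hits = []
    for phrase in blocklist:
        # \b anchors stop a phrase from matching inside a longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            hits.append(phrase)
    return sorted(hits)


# A legitimate profile is flagged because the word is matched without context.
flagged = matches_blocklist("Security escort for VIP convoys")
# A profile with no listed phrase passes.
clean = matches_blocklist("Software engineer, distributed systems")
```

In this sketch the first profile is flagged even though the word is used in an appropriate sense, which is exactly the kind of false positive a pure word list cannot avoid.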
The New Approach
To mitigate these challenges and improve overall performance, the ML team at LinkedIn decided to change its approach. The new approach is a text classifier, a machine learning model trained on public member profile content. To train this classifier, the researchers built a training set of accounts labelled as either “inappropriate” or “appropriate.” The “inappropriate” labels come from accounts that have been removed from the platform due to inappropriate content.
For the model, the team used a deep learning architecture, the convolutional neural network (CNN). CNNs were chosen because they perform particularly well on image and text classification tasks. They are useful for data with spatial properties, meaning that information is carried by the fact that two feature values are adjacent to each other.
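The adjacency idea can be sketched with a single convolutional filter in plain Python. This is a toy illustration, not LinkedIn's model: the two-dimensional embeddings, the tokens `x`, `y` and `pad`, the filter weights and the function name `conv1d_max` are all assumptions made up for the example.

```python
# Toy 2-dimensional token embeddings (hypothetical values).
EMBEDDINGS = {"x": [1.0, 0.0], "y": [0.0, 1.0], "pad": [0.0, 0.0]}

# A width-2 filter that responds to "x" followed directly by "y".
KERNEL = [[1.0, 0.0], [0.0, 1.0]]
BIAS = -1.0


def conv1d_max(tokens, kernel=KERNEL, bias=BIAS):
    """Slide the filter over adjacent token vectors, apply ReLU,
    and return the maximum activation (max-over-time pooling)."""
    vectors = [EMBEDDINGS[t] for t in tokens]
    width = len(kernel)
    activations = []
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        score = bias + sum(
            w * v
            for k_vec, t_vec in zip(kernel, window)
            for w, v in zip(k_vec, t_vec)
        )
        activations.append(max(0.0, score))  # ReLU
    return max(activations) if activations else 0.0


adjacent = conv1d_max(["x", "y", "pad"])   # the two tokens are neighbours
separated = conv1d_max(["x", "pad", "y"])  # same tokens, no longer adjacent
```

The filter fires only when the two tokens sit next to each other, which is the sense in which a CNN exploits information carried by adjacent feature values. A real text classifier would learn many such filters and their weights from the labelled data.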
While building the model, the team had difficulty assembling a training set that contained enough information. The training labels were taken from accounts that had previously been restricted for various reasons, so new models trained on these labels have an inherent bias towards re-learning the patterns of the existing systems.
The team addressed this issue by identifying several problematic words that were producing high levels of false positives and sampling appropriate accounts containing those words from the member base. These accounts were then manually labelled and added to the training set.
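The sampling step might look roughly like the sketch below. The record layout (`id`, `text`), the function name `sample_for_review`, the example word and the sample profiles are hypothetical. The manual labelling itself is done by human reviewers and is not modelled here.

```python
import random


def sample_for_review(member_base, problem_words, per_word, seed=0):
    """Pick up to `per_word` accounts containing each problematic word,
    so that reviewers can label them before they join the training set."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    picked = {}
    for word in problem_words:
        candidates = [
            m for m in member_base if word in m["text"].lower().split()
        ]
        for member in rng.sample(candidates, min(per_word, len(candidates))):
            picked[member["id"]] = member  # de-duplicate across words
    return sorted(picked.values(), key=lambda m: m["id"])


# Hypothetical member records.
member_base = [
    {"id": 1, "text": "Medical escort for patient transport"},
    {"id": 2, "text": "Backend developer"},
    {"id": 3, "text": "Security escort and convoy planning"},
    {"id": 4, "text": "Data analyst"},
]

review_queue = sample_for_review(member_base, ["escort"], per_word=2)
review_ids = [m["id"] for m in review_queue]
```

Adding manually labelled "appropriate" examples that contain the problematic words gives the classifier evidence that the words alone do not imply inappropriate content, which counteracts the bias inherited from the earlier systems.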