Organisations are adopting emerging machine learning technologies at breakneck speed to reduce human labour and handle data more efficiently. LinkedIn, the largest professional networking and employment-oriented platform, has for years leveraged AI/ML to optimise processes across the site. As of March 2021, LinkedIn boasts over 740 million users in more than 200 countries and territories across the globe.
LinkedIn’s Anti-Abuse AI Team works to create, deploy and maintain AI models to detect and prevent abuse on the platform. Platforms like LinkedIn are prone to abuses like the creation of fake accounts, member profile scraping, automated spam, and account takeovers.
Challenges
The team had to overcome three challenges:
- Attackers quickly adapt and evolve to evade anti-abuse defences, so LinkedIn's adversarial-behaviour tools need constant updating.
- The defences must also span several heterogeneous parts of the website, each of which needs protection from attackers.
- The models must maximise signal, since standard engineered features do not fully exploit the signal available in member activity patterns.
To overcome these challenges, the team created a deep learning model that operates directly on raw sequences of member activity. The model exploits signal hidden in the data to prevent adversarial attacks.
Logged-in Accounts
The model was used to detect logged-in accounts scraping member profile data. Scraping is not always destructive: search engines are authorised to scrape in order to collect and index information across the internet. When done without permission, however, it is a nefarious practice.
Unauthorised scrapers automate logged-in LinkedIn accounts, meaning they scrape information that is only viewable when logged into a member account. The model looks for signals of bot-like activity and classifies sequences of user behaviour as automated or not. The team also leverages outlier detection to flag non-human activity.
Activity Sequence Modelling Technique
Activity sequence modelling is built on a standardised dataset encapsulating the sequence of member requests on LinkedIn, which broadly reflects member activity patterns. "As a member visits LinkedIn, the member's web browser makes many requests to LinkedIn's servers; every request includes a path identifying the part of the site the member's browser intends to access," as explained by LinkedIn's blog post. The sequence can be thought of as a "sentence" describing the member's LinkedIn activity.
Source: LinkedIn
An illustration of LinkedIn’s arrangement of member requests in a sequence including information about the type of request, the order of requests, and the timing between requests.
The first step standardises request paths: each specific request path is translated into a standardised token indicating the type of request. For instance, the path linkedin.com/in/jamesverbus/ represents a profile view.
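As a rough sketch of this standardisation step, the rules below map families of request paths to type tokens. The specific patterns and token names are illustrative assumptions, not LinkedIn's actual rules.

```python
import re

# Hypothetical standardisation rules: each regex maps a family of
# request paths to one token naming the type of request.
STANDARDISATION_RULES = [
    (re.compile(r"^/in/[^/]+/?$"), "profile_view"),
    (re.compile(r"^/feed/?$"), "feed_view"),
    (re.compile(r"^/search/results/.*$"), "search"),
    (re.compile(r"^/messaging/.*$"), "messaging"),
]

def standardise(path: str) -> str:
    """Translate a specific request path into a standardised token."""
    for pattern, token in STANDARDISATION_RULES:
        if pattern.match(path):
            return token
    return "other"

print(standardise("/in/jamesverbus/"))  # profile_view
```

Under this scheme, every member's browsing session becomes a sequence of a small, fixed vocabulary of tokens rather than millions of distinct URLs.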
An automated process then maps the standardised request paths to integers based on the frequency of each request path, capturing how common that specific type of request is for a given user. In visualisations, the requests in an activity sequence are colour coded by this frequency, making it easier for the human eye to identify abusive activity.
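One plausible reading of this mapping is a per-member frequency ranking, where the member's most common request type becomes 1, the next most common 2, and so on. The sketch below assumes that interpretation; LinkedIn's exact mapping is not public.

```python
from collections import Counter

def to_integer_sequence(tokens):
    """Map each standardised token to an integer rank based on how
    often that token appears in the member's own activity:
    1 = most common request type, 2 = next most common, and so on.
    (An assumed per-member ranking, not LinkedIn's exact scheme.)"""
    ranks = {tok: i + 1
             for i, (tok, _) in enumerate(Counter(tokens).most_common())}
    return [ranks[tok] for tok in tokens]

tokens = ["feed_view", "profile_view", "profile_view",
          "profile_view", "search", "profile_view"]
seq = to_integer_sequence(tokens)
print(seq)  # [2, 1, 1, 1, 3, 1]
```

A long run of identical low integers (the same request type repeated over and over) is exactly the homogeneous pattern that stands out in the colour-coded comparison below.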
Source: LinkedIn
Comparison of 200 requests made by a non-abusive member and an abusive member. The colours represent how frequently a specific request recurs.
NLP techniques help classify the sequences: member requests and user actions are replaced with tokens to build the sequence, which is then classified as abusive or not abusive. After processing the request path sequence data, the team leverages a supervised long short-term memory (LSTM) model to produce abuse scores.
The model also draws on the sequence of time differences between consecutive requests. LinkedIn's policies state: "If we receive an abnormally high number of page requests or detect patterns that indicate the use of an automated tool, we may suspend or restrict that account."
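To make the architecture concrete, here is a minimal NumPy sketch of an LSTM forward pass over (token, time-delta) pairs that squashes the final hidden state into an abuse score. Every name, shape, and parameter here is illustrative; LinkedIn's production model is trained, far larger, and not public.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: the stacked weights W, U, b compute the input,
    forget, candidate, and output gates from input x and state (h, c)."""
    z = W @ x + U @ h + b
    n = h.size
    i = sigmoid(z[:n])           # input gate
    f = sigmoid(z[n:2 * n])      # forget gate
    g = np.tanh(z[2 * n:3 * n])  # candidate cell update
    o = sigmoid(z[3 * n:])       # output gate
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def score_sequence(token_ids, deltas, emb, params, w_out, b_out):
    """Run the LSTM over (token, time-delta) pairs and map the final
    hidden state to an abuse score in (0, 1)."""
    n = params[2].size // 4
    h, c = np.zeros(n), np.zeros(n)
    for tok, dt in zip(token_ids, deltas):
        x = np.concatenate([emb[tok], [dt]])  # token embedding + timing
        h, c = lstm_step(x, h, c, *params)
    return sigmoid(w_out @ h + b_out)

# Tiny demo with random (untrained) parameters.
rng = np.random.default_rng(0)
n, d = 4, 3
emb = rng.normal(size=(10, d))
W = rng.normal(size=(4 * n, d + 1))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
w_out = rng.normal(size=n)
score = score_sequence([1, 2, 2, 3], [0.5, 0.2, 1.0, 0.1],
                       emb, (W, U, b), w_out, 0.0)
print(float(score))
```

In a trained model, the supervised loss would push this score towards 1 for sequences labelled abusive and 0 otherwise; note how the timing between requests enters the input alongside the request-type embedding.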
The last step before model training is generating training labels appropriate to the type of abuse to be detected. An unsupervised outlier-detection model based on LinkedIn's isolation forest library generates the labels used to train the model.
Isolation Forest Library
Isolation forests are an effective unsupervised outlier detection technique because outliers are "few and different" and thus easier to isolate in leaf nodes, requiring fewer random splits. The algorithm randomly generates binary tree structures that non-parametrically capture the multi-dimensional feature distribution of the training dataset; outlier data points therefore have a shorter expected path length from the root node to a leaf node. As a result, isolation forests are among the top-performing unsupervised outlier detection algorithms.
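The "few and different" intuition can be seen in a small example. LinkedIn's own library is Spark-based; as a stand-in, the sketch below uses scikit-learn's IsolationForest, which implements the same algorithm, on synthetic data with a handful of injected outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data: 500 "typical" activity feature vectors
# plus 5 extreme points that are few and different.
rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([normal, outliers])

# Each tree isolates points with random splits; outliers end up at
# shorter average path lengths and are labelled -1 by fit_predict.
forest = IsolationForest(n_estimators=100, contamination=0.02,
                         random_state=0)
labels = forest.fit_predict(X)  # -1 = outlier, 1 = inlier

print(labels[-5:])  # the injected outliers are flagged: [-1 -1 -1 -1 -1]
```

Labels produced this way can then serve as training targets for a supervised model, which is the role the isolation forest plays in LinkedIn's pipeline.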
Source: LinkedIn
Example of an isolation tree
The activity sequence modelling technique helps tackle anti-abuse problems by detecting abusive behaviour, deterring adversarial attackers, and providing a modelling approach that is generalisable and scalable across attack surfaces.