How Microsoft Is Using Machine Learning To Secure Its Software Development Cycle

Microsoft recently built a machine learning classification system to help secure the software development lifecycle. The system classifies bugs as security or non-security and as critical or non-critical, with a level of accuracy close to that of security experts.

Microsoft's software developers deal with a large volume of issues and vulnerabilities. More than 45,000 developers generate nearly 30,000 bugs per month, which are stored across more than 100 Azure DevOps and GitHub repositories. The company is looking to mitigate the vulnerabilities among them.

Since 2001, the tech giant has collected 13 million work items and bugs. According to sources, Microsoft spends an estimated $150,000 per issue overall to mitigate bugs and vulnerabilities.

However, according to the developers, with more than 45,000 people already working on the problem, adding more staff to label and prioritise bugs is not feasible.

To build the machine learning model, the tech giant trained it on the 13 million work items and bugs it had collected over two decades. The team stated, “We used that data to develop a process and machine learning model that correctly distinguishes between security and non-security bugs 99% of the time, and accurately identifies the critical, high priority security bugs 97% of the time.”

Behind The Classification System

Because large volumes of semi-curated data are well suited to machine learning tasks, the data science and security teams at the tech giant came together to build a supervised machine learning system.

In supervised learning, a machine learning model learns how to classify data from pre-labelled examples. The ML developers at the tech giant fed the model a large number of bugs, some labelled as security and others not labelled as security.
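To make the idea of learning from pre-labelled bugs concrete, here is a minimal sketch of a supervised text classifier in plain Python. It uses a naive Bayes model with add-one smoothing over bug titles. This is an assumption chosen for illustration: the article does not say which algorithm or features Microsoft used, and the function names and training titles below are made up.

```python
import math
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase a bug title and split it into word tokens."""
    return title.lower().split()


def train_naive_bayes(labelled_titles):
    """Count words per label from (title, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of titles
    vocabulary = set()
    for title, label in labelled_titles:
        label_counts[label] += 1
        for word in tokenize(title):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return word_counts, label_counts, vocabulary


def predict(model, title):
    """Return the label with the highest log-probability for a title."""
    word_counts, label_counts, vocabulary = model
    total_titles = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_titles in label_counts.items():
        score = math.log(n_titles / total_titles)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(title):
            # Add-one smoothing so unseen words do not zero out a label.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy training data, for illustration only.
TRAINING = [
    ("sql injection in login form", "security"),
    ("buffer overflow when parsing header", "security"),
    ("cross site scripting in comment field", "security"),
    ("button misaligned on settings page", "non-security"),
    ("typo in welcome message", "non-security"),
    ("page loads slowly on settings page", "non-security"),
]

model = train_naive_bayes(TRAINING)
print(predict(model, "overflow in header parsing"))  # security
print(predict(model, "typo on settings page"))       # non-security
```

A real system would train on far more data and would likely use a library implementation, but the shape is the same: labelled examples go in, and a function that labels new bugs comes out.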

To make the machine learning classification system perform like a security expert, security experts first approved the training data before it was fed to the model.

To build a machine learning model that yields maximum accuracy, the developers followed a five-step process:

  1. Data collection: The developers identified all the data types and sources and evaluated their quality
  2. Data curation and approval: Security experts reviewed the data and confirmed that the labels were correct
  3. Modelling and evaluation: A data modelling technique was selected, the model was trained, and its performance was evaluated
  4. Evaluation of model in production: Security experts evaluated the model in production by monitoring the average number of bugs and manually reviewing a random sample of bugs
  5. Automated re-training: The developers set up automated re-training so that the bug modelling system keeps pace with the ever-evolving products at Microsoft
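Steps 4 and 5 can be sketched as a small review loop: draw a random sample of predictions, have an expert label them, and flag the model for re-training if agreement drops. The article does not describe what triggers re-training at Microsoft, so the agreement threshold and all function names here are hypothetical.

```python
import random


def sample_for_review(predictions, sample_size, seed=None):
    """Pick a random subset of (bug_id, predicted_label) pairs for manual review."""
    rng = random.Random(seed)
    size = min(sample_size, len(predictions))
    return rng.sample(predictions, size)


def agreement_rate(reviewed):
    """Fraction of (predicted_label, expert_label) pairs that match."""
    if not reviewed:
        raise ValueError("no reviewed bugs to score")
    matches = sum(1 for predicted, expert in reviewed if predicted == expert)
    return matches / len(reviewed)


def needs_retraining(reviewed, min_agreement=0.95):
    """Assumed rule: re-train when expert agreement falls below a threshold."""
    return agreement_rate(reviewed) < min_agreement


# Hypothetical production predictions: (bug_id, predicted_label).
predictions = [(i, "security" if i % 10 == 0 else "non-security") for i in range(100)]
to_review = sample_for_review(predictions, sample_size=5, seed=0)

# Hypothetical expert verdicts for four reviewed bugs: (predicted, expert).
reviewed = [
    ("security", "security"),
    ("non-security", "non-security"),
    ("non-security", "security"),
    ("non-security", "non-security"),
]
print(len(to_review), agreement_rate(reviewed), needs_retraining(reviewed))
```

Keeping the review sample random matters here: if experts only looked at bugs the model flagged, missed security bugs would never show up in the agreement figure.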

How It Works

The ML developers used statistical sampling to give security experts a manageable amount of data to review. To classify bugs accurately, they used a two-step machine learning model.
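As an illustration of how statistical sampling keeps the review workload manageable, the sketch below uses the standard sample-size formula for estimating a proportion, with a finite population correction. The article does not say which sampling method Microsoft used, so the formula choice, the defaults (95% confidence, 5% margin of error), and the function name are assumptions.

```python
import math


def review_sample_size(population, margin=0.05, z=1.96, proportion=0.5):
    """Sample size for estimating a proportion, with finite population correction.

    population: number of bugs in the batch to be audited
    margin:     acceptable margin of error (0.05 means plus or minus 5 points)
    z:          z-score for the confidence level (1.96 is roughly 95%)
    proportion: expected label rate; 0.5 is the most conservative choice
    """
    if population <= 0:
        raise ValueError("population must be positive")
    n0 = (z ** 2) * proportion * (1 - proportion) / (margin ** 2)
    corrected = n0 / (1 + (n0 - 1) / population)
    return math.ceil(corrected)


# A month of roughly 30,000 bugs, the volume quoted earlier in the article.
print(review_sample_size(30_000))
```

Under these assumptions, a batch of 30,000 bugs needs a review sample of about 380, which is a far more realistic task for a security expert than reading every item.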

In the first step, the machine learning model learns to classify bugs as security or non-security. In the second step, it applies severity labels such as critical, important, and low-impact to the security bugs.
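The two-step operation can be sketched as a simple cascade in which the severity model only runs on bugs that the first model marks as security. The keyword-based functions below are placeholders standing in for trained models, and all names and term lists are hypothetical.

```python
def classify_bug(title, is_security, severity_of):
    """Two-step classification: security check first, severity second."""
    if not is_security(title):
        return ("non-security", None)
    return ("security", severity_of(title))


# Placeholder rules standing in for trained models; the term lists are made up.
SECURITY_TERMS = {"injection", "overflow", "xss", "credential"}
CRITICAL_TERMS = {"remote", "unauthenticated"}


def keyword_is_security(title):
    """Stand-in for step one: does the title mention a security term?"""
    return any(word in SECURITY_TERMS for word in title.lower().split())


def keyword_severity(title):
    """Stand-in for step two: pick a severity label for a security bug."""
    words = set(title.lower().split())
    return "critical" if words & CRITICAL_TERMS else "important"


print(classify_bug("remote buffer overflow in parser", keyword_is_security, keyword_severity))
print(classify_bug("typo in menu", keyword_is_security, keyword_severity))
```

Splitting the work this way means the severity model only ever sees security bugs, so it can be trained and evaluated on that narrower set.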

Wrapping Up

Using this machine learning classification system, the developers can now correctly identify which work items are security bugs 99% of the time. The model also achieves a 97% accuracy rate when labelling critical and non-critical security bugs.

The developers stated, “This level of accuracy gives us the confidence that we are catching more security vulnerabilities before they are exploited.” They added, “In the coming months, we will open-source our methodology to GitHub.” 

Ambika Choudhury

A Technical Journalist who loves writing about Machine Learning and Artificial Intelligence. A lover of music, writing and learning something out of the box.