Machine learning has become a familiar term in the programming community. The range of developments it offers can seem almost magical, with huge potential for the future. But all of this depends on a good dataset: clean data lets us train machine learning and deep learning models that reach higher accuracy. Cleaning and processing a dataset involves several steps, and one of the most important is outlier removal. Outliers are extreme values that deviate from the other observations in the data; they can indicate variability in measurement, an experimental error, or a novelty. Put differently, an outlier is an observation that diverges from the overall pattern of a data sample.

Outliers come in two kinds: univariate and multivariate. Univariate outliers are found by looking at the distribution of values of a single feature. Multivariate outliers are found in the n-dimensional space of n features. An outlier is, in general, the odd one out in a series of data: it can be unusually, and sometimes extremely, different from most of the data points in the sample, either as a very large observation or as a very small one.

Because of their extreme nature, outliers can heavily bias the summary statistics of the data and thereby affect downstream statistical and ML models. There is no single, rigid mathematical method for determining an outlier. It depends on the dataset or population at hand, so detection is to some extent subjective. Through continued sampling in a given data field, the characteristics of an outlier can be established to make detection easier. Several model-based detection methods assume that the data are drawn from a normal distribution and flag as outliers the observations that are deemed unlikely based on the mean and standard deviation. Many data analysts are tempted to delete outliers during data processing. This is sometimes the wrong choice.
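As a minimal sketch of the mean-and-standard-deviation idea, the snippet below flags values whose z-score exceeds a threshold. The function name, the threshold of 3 and the toy readings are assumptions chosen for illustration, and the approach only makes sense when the data are roughly normally distributed.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs(v - mean) / std > threshold]


# Toy data (made up): ten readings near 10 and one extreme value
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 25.0]
print(zscore_outliers(readings))
```

Note that the extreme value itself inflates the mean and the standard deviation, which is one reason such rules can miss outliers in small samples.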

In machine learning models, just as in conventional analytical models, one needs to resist the urge to simply delete such anomalies in order to improve a model's accuracy. Rather than reacting reflexively, tread with caution when handling the outliers in a dataset. Outliers can also be helpful indicators. In some data analytics applications, such as credit card fraud detection, outlier analysis is important because it is the exception rather than the rule that interests the analyst. In supervised models, outliers can mislead the training process, resulting in longer training times or less precise models. Machine learning models such as linear and logistic regression are easily influenced by outliers in the training data. Some models even increase the weights of misclassified points at every training iteration.

## What is DORO?

Many machine learning tasks require models to perform well under distributional shift, where the training and testing data distributions differ. One type of distributional shift that has recently attracted great research interest is subpopulation shift, where the testing distribution is a subpopulation of the training distribution. Distributionally robust optimization (DRO) is commonly used in this setting, but in practice it can perform poorly and unstably. One directly identified cause of this is the sensitivity of DRO to outliers in the datasets.

To resolve this issue, the framework of DORO (Distributional and Outlier Robust Optimization) has been introduced. At the core of this approach is a refined risk function that prevents DRO from overfitting to potential outliers. Experiments on large modern datasets show that DORO improves the performance and stability of DRO.

Image Source: http://proceedings.mlr.press/v139/zhai21a/zhai21a.pdf

DORO is an outlier-robust refinement of DRO that takes inspiration from robust statistics. Intuitively, its refined risk function adaptively filters out, during training, a small fraction of the data with high risk, which is potentially caused by outliers; this prevents DRO from overfitting to them. A machine learning task with subpopulation shift requires a model that performs well on the data distribution of each subpopulation. Large-scale experiments show empirically that DORO improves the performance and stability of DRO. The effect of hyperparameters on DRO and DORO can also be observed and analyzed.
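To make the filtering idea concrete, here is a simplified plain-Python illustration on a list of per-sample losses. The function names, the toy losses and the normalization are assumptions made for this sketch; it is not the authors' implementation, and it differs in detail from the PyTorch version used later in this article.

```python
def cvar(losses, alpha):
    """Average of the worst alpha-fraction of the per-sample losses (CVaR-style DRO)."""
    ranked = sorted(losses, reverse=True)
    n = max(1, int(alpha * len(ranked)))
    return sum(ranked[:n]) / n


def cvar_doro(losses, alpha, eps):
    """Same as cvar, but the eps-fraction with the highest losses is discarded first."""
    ranked = sorted(losses, reverse=True)
    n_drop = int(eps * len(ranked))  # suspected outliers
    kept = ranked[n_drop:]
    n = max(1, int(alpha * len(kept)))
    return sum(kept[:n]) / n


# Toy batch (made up): nine ordinary losses and one extreme loss
losses = [0.2, 0.3, 0.1, 0.4, 0.25, 0.35, 0.15, 0.3, 0.2, 9.0]
plain = cvar(losses, alpha=0.5)
robust = cvar_doro(losses, alpha=0.5, eps=0.1)
print(plain, robust)
```

In this toy batch the single extreme loss dominates the plain CVaR value, while the trimmed variant reflects only the ordinary samples.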

The basic idea of DRO is to construct an uncertainty set U containing all possible test distributions and to minimize the expected risk over the worst distribution in the set, i.e. an upper bound on the risk under any test distribution in U.
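Written as a formula, the objective described above takes the usual min-max form. The notation here is an assumption introduced for illustration: $\theta$ denotes the model parameters, $\ell$ the loss, $Q$ a candidate test distribution, and $\mathcal{U}$ the uncertainty set called U in the text.

```latex
\min_{\theta} \; \sup_{Q \in \mathcal{U}} \; \mathbb{E}_{Z \sim Q}\big[\ell(\theta; Z)\big]
```

By comparison, ERM minimizes $\mathbb{E}_{Z \sim P}[\ell(\theta; Z)]$ under the training distribution $P$ alone, with no supremum over alternative distributions.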

## Getting Started with the Codes

In this article, we will run a demo experiment with DORO to address the subpopulation shift problem, where the data contain several, possibly overlapping, domains. The goal is to maximize the model's minimum performance over all domains. In particular, we focus on the domain-oblivious setting, where the domains (group labels) each sample belongs to are unknown during training. We will also demonstrate that DORO enhances the robustness of DRO to outliers.
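The "minimum performance over all domains" can be sketched as a worst-group accuracy. The helper below is a hypothetical illustration with made-up predictions; the domain names match the ones used later in the article, and each sample belongs to two of them, which is what "overlapping" means here.

```python
def worst_group_accuracy(correct, groups):
    """Return the name of the worst domain and the accuracy of every domain.

    correct: one bool per sample (was the prediction right?)
    groups:  dict mapping a domain name to a membership mask (one bool per sample)
    """
    accs = {}
    for name, members in groups.items():
        hits = [c for c, m in zip(correct, members) if m]
        accs[name] = sum(hits) / len(hits)
    worst = min(accs, key=accs.get)
    return worst, accs


# Six made-up samples; each belongs to one race domain and one sex domain
correct = [True, True, False, True, False, True]
groups = {
    "White":  [True, True, True, False, False, False],
    "Others": [False, False, False, True, True, True],
    "Female": [True, False, True, False, True, False],
    "Male":   [False, True, False, True, False, True],
}
worst, accs = worst_group_accuracy(correct, groups)
print(worst, accs[worst])
```

The overall accuracy on these six samples is 4/6, yet one domain does much worse than that, which is the gap the worst-case metric is meant to expose.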

We will be focusing on demonstrating the following during our implementation:

- DRO methods have poor and unstable performance on the original dataset.
- The performance of DRO becomes better and more stable if the outliers are removed.
- DORO improves the performance of DRO on the original dataset.

The following implementation is based on a demo by the creators of DORO, available from their official site.

##### Importing the Dependencies

The first step is to import all the required dependencies:

```python
import os
import argparse
import math
from copy import deepcopy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.io as sio
import scipy.optimize as sopt
from sklearn.model_selection import train_test_split

import torch
import torch.nn
import torch.nn.functional as F
import torch.backends.cudnn as cudnn
from torch import Tensor
from torch import optim
from torch.nn.modules.module import Module
from torch.utils.data import DataLoader
from torch.utils.data.dataset import Dataset
```

Here we use `scipy.optimize`, which provides functions for minimizing and maximizing objective functions, possibly subject to constraints. It includes solvers for nonlinear problems, with support for both local and global optimization algorithms. It can also be used for linear programming, constrained and nonlinear least-squares, root finding, and curve fitting.
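As a small, hedged illustration of the kind of call this article relies on, the snippet below uses `scipy.optimize.brent` to minimize a one-dimensional function. The quadratic objective is a made-up stand-in; in the algorithms defined later, the same routine searches for the threshold `eta` of the chi-square risk.

```python
import scipy.optimize as sopt


def objective(eta):
    # Made-up convex function with its minimum at eta = 2
    return (eta - 2.0) ** 2 + 1.0


# brack gives two starting points from which SciPy searches for a bracketing interval
best_eta = sopt.brent(objective, brack=(0.0, 10.0))
print(best_eta)
```

Brent's method needs only function evaluations, no gradients, which suits a setting where the objective is computed from a batch of losses.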

##### Setting the Model Parameters

Next, we define the routine that runs one epoch over a data loader, together with the ERM baseline, which simply averages the per-sample losses.

```python
# Run one epoch over `loader`; the model is updated only when `update` is True
def run_epoch(alg, model: Module, loader: DataLoader, optimizer: optim.Optimizer,
              criterion, device: str, alpha=0.0, eps=0.0, update=True):
    train_loss = 0
    n = 0
    model.train()
    for _, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        batch_size = len(inputs)
        n += batch_size
        outputs = model(inputs)
        loss_vector = criterion(outputs, targets)  # per-sample losses
        loss = alg(loss_vector, alpha, eps)  # aggregated by ERM, DRO or DORO
        if update:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        train_loss += loss.item() * batch_size
    return train_loss / n


# ERM baseline: the plain mean of the loss vector
def erm(loss_vector, alpha, eps):
    loss = loss_vector.mean()
    return loss
```

##### Setting the Algorithms

Let us now define the DRO and DORO algorithms with which we will test our model on the subpopulation shift problem.

```python
# DRO algorithms
def cvar(loss_vector, alpha, eps):
    # Average the alpha-fraction of samples with the highest loss
    batch_size = len(loss_vector)
    n = int(alpha * batch_size)
    rk = torch.argsort(loss_vector, descending=True)
    loss = loss_vector[rk[:n]].mean()
    return loss


def chisq(loss_vector, alpha, eps):
    max_l = 10.
    C = math.sqrt(1 + (1 / alpha - 1) ** 2)
    foo = lambda eta: C * math.sqrt((F.relu(loss_vector - eta) ** 2).mean().item()) + eta
    opt_eta = sopt.brent(foo, brack=(0, max_l))  # best threshold eta for this batch
    loss = C * torch.sqrt((F.relu(loss_vector - opt_eta) ** 2).mean()) + opt_eta
    return loss


# DORO algorithms: discard the eps-fraction with the highest loss, then apply DRO
def cvar_doro(loss_vector, alpha, eps):
    gamma = eps + alpha * (1 - eps)
    batch_size = len(loss_vector)
    n1 = int(gamma * batch_size)
    n2 = int(eps * batch_size)
    rk = torch.argsort(loss_vector, descending=True)
    loss = loss_vector[rk[n2:n1]].sum() / alpha / (batch_size - n2)
    return loss


def chisq_doro(loss_vector, alpha, eps):
    max_l = 10.
    batch_size = len(loss_vector)
    C = math.sqrt(1 + (1 / alpha - 1) ** 2)
    n = int(eps * batch_size)
    rk = torch.argsort(loss_vector, descending=True)
    l0 = loss_vector[rk[n:]]  # losses kept after trimming
    foo = lambda eta: C * math.sqrt((F.relu(l0 - eta) ** 2).mean().item()) + eta
    opt_eta = sopt.brent(foo, brack=(0, max_l))
    loss = C * torch.sqrt((F.relu(l0 - opt_eta) ** 2).mean()) + opt_eta
    return loss  # this return was missing, so the function returned None
```

##### Preparing our Dataset

We now prepare the dataset. We use the COMPAS dataset, a recidivism prediction dataset with 5049 training samples. We select race and sex as the protected features and define four domains on this dataset: White, Others, Male and Female. A two-layer feed-forward neural network with ReLU activations is used as the classification model.

```python
# Download the dataset (notebook shell command)
!wget https://raw.githubusercontent.com/RuntianZ/cloud/master/compas/compas-scores-two-years.csv


# Two-layer feed-forward ReLU neural network
def build_model(input_dim: int) -> Module:
    model = torch.nn.Sequential(torch.nn.Linear(input_dim, 10, bias=True),
                                torch.nn.ReLU(inplace=True),
                                torch.nn.Linear(10, 1, bias=True))
    return model


# Preprocess the dataset
def preprocess_compas(df: pd.DataFrame):
    columns = ['juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count',
               'age', 'c_charge_degree', 'sex', 'race', 'is_recid']
    target_variable = 'is_recid'

    df = df[['id'] + columns].drop_duplicates()
    df = df[columns]

    race_dict = {'African-American': 1, 'Caucasian': 0}
    df['race'] = df.apply(lambda x: race_dict[x['race']] if x['race'] in race_dict.keys() else 2,
                          axis=1).astype('category')

    sex_map = {'Female': 0, 'Male': 1}
    df['sex'] = df['sex'].map(sex_map)

    c_charge_degree_map = {'F': 0, 'M': 1}
    df['c_charge_degree'] = df['c_charge_degree'].map(c_charge_degree_map)

    X = df.drop([target_variable], axis=1)
    y = df[target_variable]
    return X, y


class MyDataset(Dataset):
    def __init__(self, X, y):
        super(MyDataset, self).__init__()
        self.X = X
        self.y = y

    def __getitem__(self, item):
        return self.X[item], self.y[item]

    def __len__(self):
        return len(self.X)


# Binary cross-entropy on logits, with selectable reduction
class MyLoss(object):
    def __init__(self, reduction='mean'):
        self.reduction = reduction

    def __call__(self, outputs: Tensor, targets: Tensor) -> Tensor:
        outputs = outputs.view(-1)
        loss = -targets * F.logsigmoid(outputs) - (1 - targets) * F.logsigmoid(-outputs)
        if self.reduction == 'mean':
            loss = loss.mean()
        elif self.reduction == 'sum':
            loss = loss.sum()
        return loss


# Load the dataset into a dataframe
df = pd.read_csv('compas-scores-two-years.csv')
X, y = preprocess_compas(df)
input_dim = len(X.columns)
X, y = X.to_numpy().astype('float32'), y.to_numpy()
X[:, 4] /= 10
X[X[:, 7] > 0, 7] = 1  # Race: White (0) and Others (1)

domain_fn = [
    lambda x: x[:, 7] == 0,  # White
    lambda x: x[:, 7] == 1,  # Others
    lambda x: x[:, 6] == 0,  # Female
    lambda x: x[:, 6] == 1,  # Male
]

# Split the dataset: train-test = 70-30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42, shuffle=True)
trainset = MyDataset(X_train, y_train)
testset = MyDataset(X_test, y_test)
trainloader = DataLoader(trainset, batch_size=128, shuffle=True)
testtrainloader = DataLoader(trainset, batch_size=1024, shuffle=False)
testloader = DataLoader(testset, batch_size=1024, shuffle=False)
```

##### Setting Test and Train Parameters

Next, we define the test routine, which reports the average and per-group accuracy and loss of the model, and the training routine:

```python
# Evaluate the model; if trim_num is set, return the indices of the samples to keep instead
def test(model: Module, loader: DataLoader, criterion, device: str,
         domain_fn, trim_num=None):
    model.eval()
    total_correct = 0
    total_loss = 0
    total_num = 0
    num_domains = len(domain_fn)
    # np.int and np.float were removed in NumPy 1.24; the built-in types are equivalent
    group_correct = np.zeros((num_domains,), dtype=int)
    group_loss = np.zeros((num_domains,), dtype=float)
    group_num = np.zeros((num_domains,), dtype=int)
    l_rec = []

    with torch.no_grad():
        for _, (inputs, targets) in enumerate(loader):
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs).view(-1)
            c = ((outputs > 0) & (targets == 1)) | ((outputs < 0) & (targets == 0))
            correct = c.sum().item()
            l = criterion(outputs, targets).view(-1)
            if trim_num is not None:
                l_rec.append(l.detach().cpu().numpy())
            loss = l.sum().item()
            total_correct += correct
            total_loss += loss
            total_num += len(inputs)

            for i in range(num_domains):
                g = domain_fn[i](inputs)
                group_correct[i] += c[g].sum().item()
                group_loss[i] += l[g].sum().item()
                group_num[i] += g.sum().item()

    if trim_num is not None:
        # Indices of all samples except the trim_num with the highest loss
        l_vec = np.concatenate(l_rec)
        l = np.argsort(l_vec)[:-trim_num]
        return l

    return total_correct / total_num, total_loss / total_num, \
        group_correct / group_num, group_loss / group_num


# Train a fresh model with the given risk function and record per-epoch metrics
def train(alg, trainloader, testloader, epochs=40, alpha=0.0, eps=0.0, seed=None):
    # Fix seed
    if seed is not None:
        torch.manual_seed(seed)
        np.random.seed(seed)
        # torch.set_deterministic was replaced by torch.use_deterministic_algorithms
        torch.use_deterministic_algorithms(True)

    model = build_model(input_dim)
    criterion = MyLoss(reduction='none')
    optimizer = optim.ASGD(model.parameters(), lr=0.01)

    train_loss = []
    test_loss = []
    avg_acc = []
    avg_loss = []
    group_acc = []
    group_loss = []

    for epoch in range(epochs):
        train_l = run_epoch(alg, model, trainloader, optimizer, criterion,
                            'cpu', alpha, eps)
        test_l = run_epoch(alg, model, testloader, optimizer, criterion,
                           'cpu', alpha, eps, update=False)
        a, b, c, d = test(model, testloader, criterion, 'cpu', domain_fn)
        train_loss.append(train_l)
        test_loss.append(test_l)
        avg_acc.append(a)
        avg_loss.append(b)
        group_acc.append(c)
        group_loss.append(d)

    results = {
        'train_loss': np.array(train_loss),
        'test_loss': np.array(test_loss),
        'avg_acc': np.array(avg_acc),
        'avg_loss': np.array(avg_loss),
        'group_acc': np.array(group_acc),
        'group_loss': np.array(group_loss),
    }
    return model, results
```

First, we run ERM and DRO on the original dataset.

```python
# Train with ERM and the two DRO variants on the original dataset
_, erm_results = train(erm, trainloader, testloader, seed=2021)
_, cvar_results = train(cvar, trainloader, testloader, alpha=0.5, seed=2021)
_, chisq_results = train(chisq, trainloader, testloader, alpha=0.5, seed=2021)
```

##### Plotting The Visualization

Let’s plot the average and worst-case accuracies of the models produced by the algorithms.

```python
# Plot three result curves on one figure
def plot_result(results, pos, title, xlabel, ylabel):
    plt.rcParams.update({'font.size': 24})
    plt.figure(figsize=(8, 6), dpi=80)
    plt.plot(results[0][0], marker='.', markersize=10, label=results[0][1], linewidth=3.0)
    plt.plot(results[1][0], marker='*', markersize=10, label=results[1][1], linewidth=3.0)
    plt.plot(results[2][0], marker='^', markersize=10, label=results[2][1], linewidth=3.0)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.legend()
    plt.gcf().subplots_adjust(left=0.2, bottom=0.2)
    plt.show()


# Figure a
plot_result([(erm_results['avg_acc'], 'ERM'),
             (cvar_results['avg_acc'], 'CVaR'),
             (chisq_results['avg_acc'], r'$\chi^2$-DRO')],
            (-0.02, -0.02), 'Average Accuracy (Original)', 'Epochs', 'Accuracy')
```

```python
# Figure b
plot_result([(erm_results['group_acc'].min(1), 'ERM'),
             (cvar_results['group_acc'].min(1), 'CVaR'),
             (chisq_results['group_acc'].min(1), r'$\chi^2$-DRO')],
            (-0.02, -0.02), 'Worst-case Accuracy (Original)', 'Epochs', 'Accuracy')
```

The plots clearly show that DRO has lower performance than ERM and is very unstable on the original dataset.

To demonstrate that the issue stems from the sensitivity of DRO to outliers, we first remove the outliers from the original dataset, then run the algorithms on the clean dataset and check whether the performance improves.

```python
# Remove outliers
trainset_clean = deepcopy(trainset)
torch.manual_seed(38)
np.random.seed(38)
for t in range(5):
    trainloader_clean = DataLoader(trainset_clean, batch_size=128, shuffle=True)
    testtrainloader_clean = DataLoader(trainset_clean, batch_size=1024, shuffle=False)
    model, _ = train(erm, trainloader_clean, testloader)
    criterion = MyLoss(reduction='none')
    # Keep every sample except the 200 with the highest loss
    r = test(model, testtrainloader_clean, criterion, 'cpu', domain_fn, 200)
    X_train = X_train[r]
    y_train = y_train[r]
    trainset_clean = MyDataset(X_train, y_train)

# Train on the clean dataset
trainloader_clean = DataLoader(trainset_clean, batch_size=128, shuffle=True)
testtrainloader_clean = DataLoader(trainset_clean, batch_size=1024, shuffle=False)
_, erm_clean_results = train(erm, trainloader_clean, testloader, seed=2021)
_, cvar_clean_results = train(cvar, trainloader_clean, testloader, alpha=0.5, seed=2021)
_, chisq_clean_results = train(chisq, trainloader_clean, testloader, alpha=0.5, seed=2021)
```
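The loop above runs five rounds, and in each round it trains an ERM model and drops the 200 training samples with the highest loss. The trimming step itself can be sketched in plain Python as follows; the helper name and the toy losses are assumptions for illustration, and the sort mirrors the `np.argsort(l_vec)[:-trim_num]` call in `test`.

```python
def trim_highest_losses(losses, trim_num):
    """Return the indices of the samples kept after dropping the trim_num largest losses."""
    # Indices sorted by ascending loss, like np.argsort
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return order[:-trim_num] if trim_num > 0 else order


# Toy per-sample losses (made up); indices 1 and 4 have the highest loss
losses = [0.3, 2.5, 0.1, 0.4, 3.0, 0.2]
keep = trim_highest_losses(losses, trim_num=2)
print(keep)
```

Note that the kept indices come back in order of increasing loss, not in their original order, so the cleaned training set is reordered as well as shortened.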

Plotting the results now:

```python
plot_result([(erm_clean_results['avg_acc'], 'ERM'),
             (cvar_clean_results['avg_acc'], 'CVaR'),
             (chisq_clean_results['avg_acc'], r'$\chi^2$-DRO')],
            (0.52, -0.02), 'Average Accuracy (Outliers removed)', 'Epochs', 'Accuracy')
```

##### Introducing DORO

Finally, let’s introduce the DORO algorithm to enhance the robustness of DRO to outliers and observe the difference in results.

```python
# Apply DORO on the original dataset
_, cvar_doro_results = train(cvar_doro, trainloader, testloader,
                             alpha=0.5, eps=0.2, seed=2021)
_, chisq_doro_results = train(chisq_doro, trainloader, testloader,
                              alpha=0.5, eps=0.2, seed=2021)

# Plot the average accuracy
plot_result([(erm_results['avg_acc'], 'ERM'),
             (cvar_doro_results['avg_acc'], 'CVaR-DORO'),
             (chisq_doro_results['avg_acc'], r'$\chi^2$-DORO')],
            (0.38, -0.02), 'Average Accuracy (Original)', 'Epochs', 'Accuracy')
```

```python
# Plot the worst-case accuracy
plot_result([(erm_results['group_acc'].min(1), 'ERM'),
             (cvar_doro_results['group_acc'].min(1), 'CVaR-DORO'),
             (chisq_doro_results['group_acc'].min(1), r'$\chi^2$-DORO')],
            (0.38, -0.02), 'Worst-case Accuracy (Original)', 'Epochs', 'Accuracy')
```

We can see that DORO improves the performance and stability of DRO on the original dataset!

## End Notes

In this article, we looked at the importance of outlier removal and how outliers affect the performance of models trained on a dataset. We also explored the DORO algorithm, which can be used to tackle the subpopulation shift problem during model training. The implementation above is also available as a Colab notebook.

Happy Learning!