What Is WILDS DataSet By Stanford – A Complete Guide


Jayita Bhattacharyya

WILDS is a benchmark of in-the-wild distribution shifts spanning a variety of datasets and applications, including wildlife monitoring, tumour identification and poverty mapping. So far, seven datasets have been incorporated, with more planned. WILDS builds on data recently collected by domain experts. It provides evaluation metrics along with train/test splits that represent real-world distribution shifts. In these datasets, the training and test data differ across cameras, time periods, countries, demographics, molecular scaffolds, etc., which causes a significant performance drop in baseline models. It is maintained by researchers at Stanford, along with collaborators from Berkeley, Cornell, Caltech and Microsoft Research.

FMoW – Building and land classification across different regions and years

Machine learning techniques applied to satellite imagery and other remotely sensed data can enable global-scale monitoring of sustainability and economic challenges, particularly in data-poor regions. In such areas, gathering data is expensive. WILDS tries to bridge this data gap, which can improve research and decision-making for policy and humanitarian efforts such as estimating population density, tracking deforestation, predicting crop yield and mapping poverty. Because human activity and environmental processes continually change the natural environment, ML models must be robust to distribution shifts over time.

Dataset design: The input x is a satellite image, and the target label y is one of 62 building or land-use categories. The domain d represents the year the image was taken and its geographical region. WILDS poses a domain generalization problem across time and a subpopulation performance problem across regions.
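Subpopulation performance across regions can be made concrete by computing accuracy separately for each region and reporting the worst one. The following is a minimal sketch of that idea in plain Python, not the WILDS implementation; the function names and the toy labels and regions are assumptions made for illustration.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for parallel lists of labels, predictions and group ids."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

def worst_group_accuracy(y_true, y_pred, groups):
    """Return (group, accuracy) for the group with the lowest accuracy."""
    per_group = accuracy_by_group(y_true, y_pred, groups)
    worst = min(per_group, key=per_group.get)
    return worst, per_group[worst]

# Toy example: land-use class ids and the region each image comes from (made up)
y_true = [3, 3, 7, 7, 1, 1]
y_pred = [3, 3, 7, 2, 0, 5]
regions = ["Europe", "Europe", "Asia", "Asia", "Africa", "Africa"]

per_region = accuracy_by_group(y_true, y_pred, regions)
worst_region, worst_acc = worst_group_accuracy(y_true, y_pred, regions)
print(per_region)
print(worst_region, worst_acc)
```

In this toy data the average accuracy is 0.5, yet one region is classified perfectly and another not at all, which is the kind of gap an average hides.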

PovertyMap – Poverty mapping across different countries 

Accurate poverty estimates are necessary for directing policy decisions in developing countries. However, ground-truth poverty measurements are lacking for most developing countries, since gathering this information is difficult. Some countries have never conducted a survey, or have gaps of over a decade between surveys. The lack of labels in certain countries creates a natural need for models that generalize to unseen countries. In addition to the shift across countries, the benchmark considers how model performance differs between rural and urban subpopulations. Improving performance on the rural subpopulation will help developing countries, such as those in Africa.

Dataset design: The input x is a satellite image, and the output label y is a real-valued asset wealth index. The domain d is the country in which the image was taken. WILDS poses both a domain generalization problem across country borders and a subpopulation performance problem across urban and rural areas.
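Domain generalization across countries means that the test countries never appear in training. A minimal sketch of such a split is shown below, assuming each example is a dictionary carrying its domain; the function `split_by_domain`, the field names and the placeholder country names are hypothetical and are not part of the WILDS package.

```python
import random

def split_by_domain(examples, domain_key, test_fraction=0.25, seed=0):
    """Hold out whole domains so that test domains never appear in training."""
    domains = sorted({ex[domain_key] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(domains)
    n_test = max(1, round(len(domains) * test_fraction))
    test_domains = set(domains[:n_test])
    train = [ex for ex in examples if ex[domain_key] not in test_domains]
    test = [ex for ex in examples if ex[domain_key] in test_domains]
    return train, test, test_domains

# Toy examples: placeholder countries, an urban/rural flag and a made-up wealth index
examples = [
    {"country": "A", "urban": True, "wealth_index": 0.8},
    {"country": "A", "urban": False, "wealth_index": -0.3},
    {"country": "B", "urban": True, "wealth_index": 0.5},
    {"country": "B", "urban": False, "wealth_index": -0.9},
    {"country": "C", "urban": True, "wealth_index": 1.1},
    {"country": "C", "urban": False, "wealth_index": -0.1},
    {"country": "D", "urban": True, "wealth_index": 0.2},
    {"country": "D", "urban": False, "wealth_index": -0.6},
]

train, test, test_countries = split_by_domain(examples, "country")
train_countries = {ex["country"] for ex in train}
print("held-out countries:", test_countries)
```

Splitting by whole country, rather than by individual image, is what makes the evaluation a test of generalization to unseen countries.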

iWildCam – Species classification across different camera traps

The 2020 Living Planet Report claimed that animal populations have declined by 68% on average since 1970. Understanding the link between climate change and the loss of wildlife biodiversity has therefore become a serious concern. One of the primary methods for monitoring wildlife is to place heat- or motion-activated cameras in the wild. These cameras collect data much faster than anyone can process it, so ecologists have taken up computer vision solutions. Static cameras like these capture signals that are correlated in space and time. The correlation causes overfitting and thereby poor generalization to new sensor deployments, which limits the scalability of computer vision solutions.

Dataset design: This is a multi-class species classification task. The input x is a photo captured by a camera trap, the output label y is one of 186 animal species classes, and the domain d identifies which camera trap took the photo.
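With many species classes, some of them rare, plain accuracy can look good even when rare species are never recognized. One common remedy is a macro-averaged F1 score, which weights every class equally. The sketch below assumes macro F1 as the metric (the text above does not name one) and uses made-up species labels; it is an illustration rather than the package's own evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: a model that always predicts the common species
y_true = ["deer"] * 8 + ["leopard"] * 2
y_pred = ["deer"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f}, macro F1={macro:.2f}")
```

Here accuracy is 0.80 while macro F1 is about 0.44, because the rare class contributes an F1 of zero.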

Camelyon17 – Tumor identification across different hospitals


Models for medical applications are generally trained on a small set of data acquired from a few hospitals, partly because of patient privacy concerns, but are then deployed to other hospitals as well. Model accuracy can degrade at hospitals not included in the training set because of variations in data collection and processing. For tissue slides studied under a microscope, for example, differences can arise from slide staining, the patient population or image acquisition. WILDS studies this distribution shift by building a patch-based variant of the Camelyon17 dataset.

Dataset design: This is a binary classification task. The input x is a histopathological image patch, the output label y is a binary indicator of whether the patch contains any tumour tissue, and the domain d identifies the hospital the patch came from.

OGB-MolPCBA – Molecular property prediction across different scaffolds

Drug discovery is a time-consuming procedure. The entire process takes many years, during which many experiments are conducted to find a potent molecule. For computer-aided search, an accurate and generalizable molecular property predictor is useful for screening a large collection of small molecules to detect the structures most likely to bind to a drug target. Accurate machine learning solutions can greatly reduce redundant experiments and hence help accelerate the drug discovery process. The biggest challenge is predicting molecular properties across the wide variety of molecules screened from large chemical databases. It is thus crucial for models to generalize to molecules that are structurally different from those seen in training.

Dataset design: This is a multi-task classification problem. The input x is a graph representation of a molecule, the target label y is a binary vector of length 128, with one entry per type of biological activity, and the domain d is the scaffold group that the molecule belongs to.

Amazon – Sentiment classification across different users

As with medical data, text models are trained on collected data and then deployed as all-purpose models across a wide range of users. Such models can show performance disparities between users. These performance gaps have motivated the need for good performance across a wide range of users. Large disparities are also an indication of model unfairness and of a failure to learn the actual task, which leads to bias. WILDS studies inter-individual performance disparities through a sentiment classification task on the Amazon-wilds dataset, where the goal is to train models that perform well across reviewers.

Dataset design: This is a multi-class sentiment classification task. The input x is the text of a review, the target label y is the corresponding star rating from 1 to 5, and the domain d identifies the user who wrote the review.
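One way to measure performance "in terms of reviewers" is to compute accuracy per reviewer and then look at a low percentile of those accuracies instead of the overall average. The sketch below is a plain-Python illustration under that assumption; the choice of the 10th percentile, the nearest-rank method, the function names and the toy data are all assumptions made for the example.

```python
import math
from collections import defaultdict

def per_user_accuracy(y_true, y_pred, users):
    """Return {user: accuracy} for parallel lists of ratings, predictions and user ids."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, u in zip(y_true, y_pred, users):
        total[u] += 1
        correct[u] += int(t == p)
    return {u: correct[u] / total[u] for u in total}

def percentile_nearest_rank(values, q):
    """q-th percentile (0 < q <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

# Toy example: star ratings (1 to 5) from five hypothetical reviewers
users = ["u1", "u1", "u2", "u2", "u3", "u3", "u4", "u4", "u5", "u5"]
y_true = [5, 4, 1, 2, 3, 5, 4, 4, 2, 1]
y_pred = [5, 4, 1, 2, 3, 4, 4, 4, 5, 5]

per_user = per_user_accuracy(y_true, y_pred, users)
mean_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
p10 = percentile_nearest_rank(per_user.values(), 10)
print(f"average accuracy={mean_acc:.2f}, 10th percentile per-user accuracy={p10:.2f}")
```

The average accuracy is 0.70, but the low percentile exposes that at least one reviewer gets no correct predictions at all.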

CivilComments – Toxicity classification across demographic identities

Automatic moderation of user-generated text, such as detecting whether a comment is toxic, is an important task given the huge volume of text written daily on the Internet. Earlier work has documented biases in automatic moderation tools; for example, comment classifiers have been shown to associate toxicity with the mere mention of certain demographic groups. WILDS uses a modified version of the CivilComments dataset, a large collection of comments on online articles taken from the Civil Comments platform and annotated for toxicity and demographic mentions by multiple crowd workers.

Dataset design: This is a binary classification task of predicting whether or not a comment is toxic. The input x is a comment comprising one or more sentences, and the target label y indicates whether it is marked toxic. The domain annotation d is a multi-dimensional binary vector denoting whether the comment mentions each of eight demographic identities: LGBTQ, male, female, Christian, Muslim, other religions, Black, and White.
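The multi-dimensional domain annotation d can be pictured as a multi-hot vector with one slot per identity. The sketch below encodes a set of mentioned identities into such a vector and lists the (identity, label) groups an example falls into; the helper names and the ordering of the identities are assumptions for illustration, not the dataset's actual storage format.

```python
IDENTITIES = ["LGBTQ", "male", "female", "Christian", "Muslim",
              "other religions", "Black", "White"]

def encode_identities(mentioned):
    """Turn a set of mentioned identities into an 8-dimensional binary vector."""
    unknown = set(mentioned) - set(IDENTITIES)
    if unknown:
        raise ValueError(f"unknown identities: {sorted(unknown)}")
    return [int(name in mentioned) for name in IDENTITIES]

def groups_for_example(d, y):
    """An example belongs to one (identity, label) group per identity it mentions."""
    return [(name, y) for name, flag in zip(IDENTITIES, d) if flag]

d = encode_identities({"female", "Muslim"})
print(d)                         # [0, 0, 1, 0, 1, 0, 0, 0]
print(groups_for_example(d, 1))  # [('female', 1), ('Muslim', 1)]
```

Because a comment can mention several identities at once, the groups overlap, and a comment that mentions none of them has an all-zero vector.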


WILDS has an open-source Python package which provides a standardized interface for all datasets. 


pip install wilds


git clone

Additional dependencies

pip install torch-scatter -f${TORCH}+${CUDA}.html

pip install torch-sparse -f${TORCH}+${CUDA}.html

pip install torch-cluster -f${TORCH}+${CUDA}.html

pip install torch-spline-conv -f${TORCH}+${CUDA}.html

pip install torch-geometric

pip install transformers

Default models

python --dataset civilcomments --algorithm groupDRO --root_dir data --download

Data loading

from wilds.datasets.iwildcam_dataset import IWildCamDataset
from wilds.common.data_loaders import get_train_loader
import torchvision.transforms as transforms

# Loading the full dataset, downloading it if necessary
dataset = IWildCamDataset(download=True)

# Getting the training set
train_data = dataset.get_subset('train', transform=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()]))

# Preparing the data loader
train_loader = get_train_loader('standard', train_data, batch_size=16)

Training code snippet

import os
from tqdm import tqdm

# The helper functions save, log_results, detach_and_clone and collate_list,
# as well as the keys of the per-split dataset dictionaries, follow the WILDS
# example scripts and are assumed to be available here.

def train(algorithm, datasets, general_logger, config, epoch_offset, best_val_metric):
    for epoch in range(epoch_offset, config.n_epochs):
        general_logger.write('\nEpoch [%d]:\n' % epoch)

        # First run training
        run_epoch(algorithm, datasets['train'], general_logger, epoch, config, train=True)

        # Then run val
        val_results = run_epoch(algorithm, datasets['val'], general_logger, epoch, config, train=False)
        curr_val_metric = val_results[config.val_metric]
        general_logger.write(f'Validation {config.val_metric}: {curr_val_metric:.3f}\n')

        # Then run everything else
        if config.evaluate_all_splits:
            additional_splits = [split for split in datasets.keys() if split not in ['train', 'val']]
        else:
            additional_splits = config.eval_splits
        for split in additional_splits:
            run_epoch(algorithm, datasets[split], general_logger, epoch, config, train=False)

        # Check whether this epoch gives the best validation metric so far
        if best_val_metric is None:
            is_best = True
        else:
            if config.val_metric_decreasing:
                is_best = curr_val_metric < best_val_metric
            else:
                is_best = curr_val_metric > best_val_metric
        if is_best:
            best_val_metric = curr_val_metric

        # Save checkpoints
        if config.save_step is not None and (epoch + 1) % config.save_step == 0:
            save(algorithm, epoch, best_val_metric, os.path.join(config.log_dir, '%d_model.pth' % epoch))
        if config.save_last:
            save(algorithm, epoch, best_val_metric, os.path.join(config.log_dir, 'last_model.pth'))
        if config.save_best and is_best:
            save(algorithm, epoch, best_val_metric, os.path.join(config.log_dir, 'best_model.pth'))
            general_logger.write(f'Best model saved at epoch {epoch}\n')


def run_epoch(algorithm, dataset, general_logger, epoch, config, train):
    if dataset['verbose']:
        general_logger.write(f"\n{dataset['name']}:\n")

    # Switch the algorithm between training and evaluation mode
    if train:
        algorithm.train()
    else:
        algorithm.eval()

    # Not preallocating memory is slower
    # but makes it easier to handle different types of data loaders
    # (which might not return exactly the same number of examples per epoch)
    epoch_y_true = []
    epoch_y_pred = []
    epoch_metadata = []

    # Using enumerate(iterator) can sometimes leak memory in some environments,
    # so batch_idx is incremented manually instead
    batch_idx = 0
    iterator = tqdm(dataset['loader']) if config.progress_bar else dataset['loader']
    for batch in iterator:
        if train:
            batch_results = algorithm.update(batch)
        else:
            batch_results = algorithm.evaluate(batch)

        # These tensors are already detached, but we need to clone them again
        # Otherwise they don't get garbage collected properly in some versions
        # The subsequent detach is just for safety
        # (they should already be detached in batch_results)
        epoch_y_true.append(detach_and_clone(batch_results['y_true']))
        epoch_y_pred.append(detach_and_clone(batch_results['y_pred']))
        epoch_metadata.append(detach_and_clone(batch_results['metadata']))

        if train and (batch_idx + 1) % config.log_every == 0:
            log_results(algorithm, dataset, general_logger, epoch, batch_idx)
        batch_idx += 1

    # Evaluate on everything collected during the epoch
    results, results_str = dataset['dataset'].eval(
        collate_list(epoch_y_pred),
        collate_list(epoch_y_true),
        collate_list(epoch_metadata))

    if config.scheduler_metric_split == dataset['split']:
        algorithm.step_schedulers(
            is_epoch=True,
            metrics=results,
            log_access=(not train))

    # log after updating the scheduler in case it needs to access the internal logs
    log_results(algorithm, dataset, general_logger, epoch, batch_idx)

    results['epoch'] = epoch
    dataset['eval_logger'].log(results)
    if dataset['verbose']:
        general_logger.write('Epoch eval:\n')
        general_logger.write(results_str)
    return results


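import torch
from wilds.common.data_loaders import get_eval_loader

# Getting the test set
test_data = dataset.get_subset('test', transform=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()]))

# Preparing the data loader
test_loader = get_eval_loader('standard', test_data, batch_size=16)

# Getting predictions for the full test set
# (model is assumed to be a trained model defined elsewhere)
all_y_pred, all_y_true, all_metadata = [], [], []
with torch.no_grad():
    for x, y_true, metadata in test_loader:
        y_pred = model(x)
        all_y_pred.append(y_pred)
        all_y_true.append(y_true)
        all_metadata.append(metadata)
all_y_pred = torch.cat(all_y_pred)
all_y_true = torch.cat(all_y_true)
all_metadata = torch.cat(all_metadata)

# Evaluation
dataset.eval(all_y_pred, all_y_true, all_metadata)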

End Notes

WILDS aims to provide a general benchmark covering diverse data, from images to text. It is under active development, and we can expect more benchmark datasets in the future, helping produce high-quality trained models that can address complex problems.


Copyright Analytics India Magazine Pvt Ltd
