Most Popular Datasets For Neural Textual Entailment With Implementation In PyTorch And TensorFlow

Textual entailment is a natural language processing task that determines whether one sentence can be inferred from another. A pair of sentences is assigned to one of three categories: entailment, contradiction or neutral. Positive entailment occurs when the first sentence (the premise) can be used to show that the second sentence (the hypothesis) is true; for example, the premise "A dog is running in the park" entails the hypothesis "An animal is outdoors". Negative entailment, or contradiction, occurs when the premise can be used to show that the hypothesis is false. Finally, if the two sentences have no such relationship, the pair is labelled neutral.

Textual entailment is valuable in several applications. For example, question-answering systems use it to verify a candidate answer against stored data. It can also be used to filter out sentences that add no new information.

This article gives a detailed overview of the most popular datasets used for textual entailment and shows how to load each of them with TensorFlow and PyTorch.

SNLI

SNLI (the Stanford Natural Language Inference corpus) contains 570,000 human-written English sentence pairs that are manually labelled for balanced classification. It was released in 2015 by Samuel R. Bowman, Gabor Angeli, Christopher Potts and Christopher D. Manning of Stanford University. The data was collected using Amazon Mechanical Turk, and each pair is labelled as entailment, contradiction or neutral.

Loading the dataset using TensorFlow

Load the libraries required for this project.

import csv
import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
url = 'https://nlp.stanford.edu/projects/snli/snli_1.0.zip'
class Snli(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('1.1.0')
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'premise':
                tfds.features.Text(),
            'hypothesis':
                tfds.features.Text(),
            'label':
                tfds.features.ClassLabel(
                    names=['entailment', 'neutral', 'contradiction']),
        }),
        supervised_keys=None,
        homepage='https://nlp.stanford.edu/projects/snli/',
    )
  def _split_generators(self, dl_manager):
    dl_directory = dl_manager.download_and_extract(url)
    data_directory = os.path.join(dl_directory, 'snli_1.0')
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={
                'filepath': os.path.join(data_directory, 'snli_1.0_train.txt')
            }),
    ]
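
The builder above declares the dataset metadata and the train split, but it still needs a _generate_examples method to parse the downloaded file before it can be used. Below is a minimal sketch, assuming snli_1.0_train.txt is a tab-separated file with gold_label, sentence1 and sentence2 columns (as in the official SNLI release).

  # To be added inside the Snli class defined above.
  def _generate_examples(self, filepath):
    """Yields (key, example) pairs parsed from a tab-separated SNLI file."""
    with tf.io.gfile.GFile(filepath) as f:
      reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
      for idx, row in enumerate(reader):
        # Pairs without annotator consensus are labelled '-' and skipped.
        if row['gold_label'] == '-':
          continue
        yield idx, {
            'premise': row['sentence1'],
            'hypothesis': row['sentence2'],
            'label': row['gold_label'],
        }

Once the class is complete, the dataset can be prepared and read like any other TFDS builder:

builder = Snli()
builder.download_and_prepare()
train_ds = builder.as_dataset(split='train')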

Loading the dataset using PyTorch

The classes below use torchtext's legacy data API. Each dataset class passes its download link through the urls attribute, downloads and extracts the archive, and splits the data into train, validation and test sets; a short usage sketch follows the parameter descriptions further below.

from torchtext import data

# Field that converts a binary parse into shift/reduce transitions:
# '(' tokens are dropped, ')' becomes 'reduce' and every other token 'shift'.
class ShiftReduceField(data.Field):
    def __init__(self):

        super(ShiftReduceField, self).__init__(preprocessing=lambda parse: [
            'reduce' if t == ')' else 'shift' for t in parse if t != '('])

        self.build_vocab([['reduce'], ['shift']])

# Field that strips the parentheses from a binary parse, optionally reverses
# the token order, and records sequence lengths.
class ParsedTextField(data.Field):
    def __init__(self, eos_token='<pad>', lower=False, reverse=False):
        if reverse:
            super(ParsedTextField, self).__init__(
                eos_token=eos_token, lower=lower,
                preprocessing=lambda parse: [t for t in parse if t not in ('(', ')')],
                postprocessing=lambda parse, _: [list(reversed(p)) for p in parse],
                include_lengths=True)
        else:
            super(ParsedTextField, self).__init__(
                eos_token=eos_token, lower=lower,
                preprocessing=lambda parse: [t for t in parse if t not in ('(', ')')],
                include_lengths=True)
            
# Base class for NLI corpora stored as JSON-lines files with premise,
# hypothesis and label fields.
class NLIDataset(data.TabularDataset):

    urls = []
    dirname = ''
    name = 'nli'

    @staticmethod
    def sort_key(ex):
        return data.interleave_keys(
            len(ex.premise), len(ex.hypothesis))

    @classmethod
    def splits(cls, text_field, label_field, parse_field=None,
               extra_fields={}, root='.data', train='train.jsonl',
               validation='val.jsonl', test='test.jsonl'):
        path = cls.download(root)

        if parse_field is None:
            fields = {'sentence1': ('premise', text_field),
                      'sentence2': ('hypothesis', text_field),
                      'gold_label': ('label', label_field)}
        else:
            fields = {'sentence1_binary_parse': [('premise', text_field),
                                                 ('premise_transitions', parse_field)],
                      'sentence2_binary_parse': [('hypothesis', text_field),
                                                 ('hypothesis_transitions', parse_field)],
                      'gold_label': ('label', label_field)}

        for key in extra_fields:
            if key not in fields.keys():
                fields[key] = extra_fields[key]

        return super(NLIDataset, cls).splits(
            path, root, train, validation, test,
            format='json', fields=fields,
            filter_pred=lambda ex: ex.label != '-')

    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data',
              vectors=None, trees=False, **kwargs):
        if trees:
            TEXT = ParsedTextField()
            TRANSITIONS = ShiftReduceField()
        else:
            TEXT = data.Field(tokenize='spacy')
            TRANSITIONS = None
        LABEL = data.Field(sequential=False)

        train, val, test = cls.splits(
            TEXT, LABEL, TRANSITIONS, root=root, **kwargs)

        TEXT.build_vocab(train, vectors=vectors)
        LABEL.build_vocab(train)

        return data.BucketIterator.splits(
            (train, val, test), batch_size=batch_size, device=device)

class SNLI(NLIDataset):
    urls = ['http://nlp.stanford.edu/projects/snli/snli_1.0.zip']
    dirname = 'snli_1.0'
    name = 'snli'
    @classmethod
    def splits(cls, text_field, label_field, parse_field=None, root='.data',
               train='snli_1.0_train.jsonl', validation='snli_1.0_dev.jsonl',
               test='snli_1.0_test.jsonl'):
        return super(SNLI, cls).splits(text_field, label_field, parse_field=parse_field,
                                       root=root, train=train, validation=validation,
                                       test=test)

Parameters Specification

text_field: The field that will be used for premise and hypothesis data.

label_field: The field that will be used for label data.

parse_field: The field that will be used for shift-reduce parser transitions, or None to skip them.

root: The root directory under which the dataset archive is downloaded and extracted. Default: '.data'.

train: The filename of the training set. Default: 'train.jsonl'.

validation: The filename of the validation set. Default: 'val.jsonl'.

test: The filename of the test set. Default: 'test.jsonl'.
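
Putting the pieces together, the SNLI splits can be loaded and batched as follows. This is a minimal sketch that assumes the legacy torchtext.data API used above and a spaCy English model installed for tokenization; the field and variable names are illustrative.

# Fields for the sentence pairs and the labels.
TEXT = data.Field(tokenize='spacy', lower=True)
LABEL = data.Field(sequential=False)

# Download SNLI (if needed) and build the train/validation/test splits.
train, val, test = SNLI.splits(TEXT, LABEL)

# Build vocabularies from the training split only.
TEXT.build_vocab(train)
LABEL.build_vocab(train)

# BucketIterator groups examples of similar length into batches.
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=32)

batch = next(iter(train_iter))
print(batch.premise.shape, batch.hypothesis.shape, batch.label.shape)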

State of the Art

The current state of the art on the SNLI dataset is CA-MTL. The model achieves an accuracy of 92.1%.

Multi-NLI

The Multi-Genre Natural Language Inference (MultiNLI) corpus contains 433,000 sentence pairs annotated with textual entailment information. It was developed by Adina Williams, Nikita Nangia and Samuel R. Bowman. The corpus is modelled on the SNLI corpus, but differs in that it covers a range of genres of spoken and written text and supports a distinctive cross-genre generalization evaluation.

Loading the dataset using TensorFlow

import os
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
class MultiNLI(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version("1.1.0")
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            "premise":
                tfds.features.Text(),
            "hypothesis":
                tfds.features.Text(),
            "label":
                tfds.features.ClassLabel(
                    names=["entailment", "neutral", "contradiction"]),
        }),
        supervised_keys=None,
        homepage="https://www.nyu.edu/projects/bowman/multinli/",
    )
  def _split_generators(self, dl_manager):
    downloaded_dir = dl_manager.download_and_extract(
        "https://cims.nyu.edu/~sbowman/multinli/multinli_1.0.zip")
    multinli_path = os.path.join(downloaded_dir, "multinli_1.0")
    train_path = os.path.join(multinli_path, "multinli_1.0_train.txt")
    matched_validation_path = os.path.join(multinli_path,
                                           "multinli_1.0_dev_matched.txt")
    mismatched_validation_path = os.path.join(
        multinli_path, "multinli_1.0_dev_mismatched.txt")
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"filepath": train_path}),
    ]
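
The matched and mismatched development paths are computed above but never registered as splits. The return list in _split_generators can be extended to expose them, and a _generate_examples method identical to the SNLI sketch shown earlier completes the builder, since multinli_1.0_train.txt uses the same tab-separated gold_label, sentence1 and sentence2 columns. A minimal sketch follows; the split names validation_matched and validation_mismatched mirror the TFDS catalog convention and are an assumption here.

    # Inside MultiNLI._split_generators, in place of the return above:
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TRAIN,
            gen_kwargs={"filepath": train_path}),
        tfds.core.SplitGenerator(
            name="validation_matched",  # assumed split name
            gen_kwargs={"filepath": matched_validation_path}),
        tfds.core.SplitGenerator(
            name="validation_mismatched",  # assumed split name
            gen_kwargs={"filepath": mismatched_validation_path}),
    ]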

Loading the dataset using PyTorch

class MultiNLI(NLIDataset):
    urls = ['http://www.nyu.edu/projects/bowman/multinli/multinli_1.0.zip']
    dirname = 'multinli_1.0'
    name = 'multinli'
    @classmethod
    def splits(cls, text_field, label_field, parse_field=None, genre_field=None,
               root='.data',
               train='multinli_1.0_train.jsonl',
               validation='multinli_1.0_dev_matched.jsonl',
               test='multinli_1.0_dev_mismatched.jsonl'):
        extra_fields = {}
        if genre_field is not None:
            extra_fields["genre"] = ("genre", genre_field)
        return super(MultiNLI, cls).splits(text_field, label_field,
                                           parse_field=parse_field,
                                           extra_fields=extra_fields,
                                           root=root, train=train,
                                           validation=validation, test=test)
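
Because every MultiNLI pair also carries a genre annotation, the optional genre_field parameter can be used to load it alongside the premise, hypothesis and label. A minimal usage sketch, again assuming the legacy torchtext.data API; the variable names are illustrative.

TEXT = data.Field(tokenize='spacy', lower=True)
LABEL = data.Field(sequential=False)
GENRE = data.Field(sequential=False)

# The matched dev set serves as validation and the mismatched dev set as
# test, following the defaults of MultiNLI.splits above.
train, val, test = MultiNLI.splits(TEXT, LABEL, genre_field=GENRE)

TEXT.build_vocab(train)
LABEL.build_vocab(train)
GENRE.build_vocab(train)

print(GENRE.vocab.itos)  # the genres observed in the training split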

State of the Art

The current state of the art on the MultiNLI dataset is T5-11B. The model achieves an accuracy of 92%.

XNLI

The Cross-lingual Natural Language Inference (XNLI) corpus contains 5,000 test and 2,500 development pairs for the MultiNLI corpus. It was developed by researchers including Adina Williams and Samuel R. Bowman. The pairs are annotated with textual entailment labels and translated into 14 languages in addition to English: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu.

Loading the dataset using TensorFlow

import collections
import csv
import os
import six
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds
url = 'https://cims.nyu.edu/~sbowman/xnli/XNLI-1.0.zip'
languages = ('ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw', 'th',
              'tr', 'ur', 'vi', 'zh')
class Xnli(tfds.core.GeneratorBasedBuilder):
  VERSION = tfds.core.Version('1.1.0')
  def _info(self):
    return tfds.core.DatasetInfo(
        builder=self,
        features=tfds.features.FeaturesDict({
            'premise':
                tfds.features.Translation(
                    languages=languages,),
            'hypothesis':
                tfds.features.TranslationVariableLanguages(
                    languages=languages,),
            'label':
                tfds.features.ClassLabel(
                    names=['entailment', 'neutral', 'contradiction']),
        }),
        supervised_keys=None,
        homepage='https://www.nyu.edu/projects/bowman/xnli/',
    )
  def _split_generators(self, dl_manager):
    dl_directory = dl_manager.download_and_extract(url)
    data_directory = os.path.join(dl_directory, 'XNLI-1.0')
    return [
        tfds.core.SplitGenerator(
            name=tfds.Split.TEST,
            gen_kwargs={'filepath': os.path.join(data_directory, 'xnli.test.tsv')}),
    ]
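
If a custom builder is not required, note that XNLI (like SNLI and MultiNLI) already ships in the TFDS catalog, so it can be loaded in a single call. A minimal sketch, assuming the catalog names 'snli', 'multi_nli' and 'xnli'; XNLI provides only validation and test splits.

import tensorflow_datasets as tfds

# Download, prepare and load the ready-made XNLI test split.
xnli_test = tfds.load('xnli', split='test')

for example in xnli_test.take(1):
    # 'premise' is a per-language dictionary, 'hypothesis' holds parallel
    # language/translation lists, and 'label' is the class index.
    print(example['label'])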

Loading the dataset using PyTorch

class XNLI(NLIDataset):
    urls = ['http://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip']
    dirname = 'XNLI-1.0'
    name = 'xnli'
    @classmethod
    def splits(cls, text_field, label_field, genre_field=None, language_field=None,
               root='.data',
               validation='xnli.dev.jsonl',
               test='xnli.test.jsonl'):
        extra_fields = {}
        if genre_field is not None:
            extra_fields["genre"] = ("genre", genre_field)
        if language_field is not None:
            extra_fields["language"] = ("language", language_field)
        return super(XNLI, cls).splits(text_field, label_field,
                                       extra_fields=extra_fields,
                                       root=root, train=None,
                                       validation=validation, test=test)
    @classmethod
    def iters(cls, *args, **kwargs):
        raise NotImplementedError('XNLI dataset does not support iters')
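
Since XNLI ships only development and test data, and iters is deliberately disabled above, splits is the way to load it. A minimal usage sketch, assuming the legacy torchtext.data API; the field names are illustrative.

TEXT = data.Field(tokenize='spacy', lower=True)
LABEL = data.Field(sequential=False)
LANGUAGE = data.Field(sequential=False)

# There is no training split, so XNLI.splits returns only the validation
# and test datasets.
val, test = XNLI.splits(TEXT, LABEL, language_field=LANGUAGE)

# Vocabularies would normally be built from MultiNLI training data; they
# are built from the XNLI dev set here purely for illustration.
TEXT.build_vocab(val)
LABEL.build_vocab(val)
LANGUAGE.build_vocab(val)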

State of the Art

The current state of the art on the XNLI dataset is RoBERTa-wwm-ext-large. The model achieves an accuracy of 81.2%.

Conclusion

In this article, we discussed some of the most popular datasets used for textual entailment and showed how to load each corpus with PyTorch and TensorFlow. Textual entailment is a powerful vehicle for reasoning: nearly all questions about meaning in language can be reduced to questions of entailment and contradiction in context. This makes textual entailment an ideal testing ground for theories of semantic representation.

