Textual entailment is a natural language processing task that determines whether one sentence can be inferred from another. A pair of sentences is assigned to one of three categories: positive, negative or neutral. Positive entailment occurs when the first sentence can be used to prove that the second sentence is true. Negative entailment, or contradiction, occurs when the first sentence can be used to disprove the second sentence. Finally, if the two sentences have no such relationship, they are considered to have a neutral entailment. For example, given the premise "A man is playing a guitar on stage", the hypothesis "A man is performing music" is entailment, "The man is asleep" is a contradiction, and "The man is famous" is neutral.
Textual entailment is valuable in several applications. For example, question-answering systems use it to verify a candidate answer against stored data. It can also be used to filter out sentences that carry no new information.
This article gives a detailed overview of the most popular textual entailment datasets and shows how to load them using TensorFlow and PyTorch.
SNLI
SNLI contains 570,000 human-written English sentence pairs manually labeled for balanced classification. It was released in 2015 by Samuel R. Bowman, Gabor Angeli, Christopher Potts and Christopher D. Manning of Stanford University. The data was collected via Amazon Mechanical Turk, and each pair is labeled with one of three categories: entailment, contradiction or neutral.
Loading the dataset using TensorFlow
First, import the required libraries and define a TFDS dataset builder for SNLI.
import csv
import os

import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds

url = 'https://nlp.stanford.edu/projects/snli/snli_1.0.zip'


class Snli(tfds.core.GeneratorBasedBuilder):
    """SNLI as a TFDS dataset builder."""

    VERSION = tfds.core.Version('1.1.0')

    def _info(self):
        # Declare the features of the dataset and the label names.
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                'premise': tfds.features.Text(),
                'hypothesis': tfds.features.Text(),
                'label': tfds.features.ClassLabel(
                    names=['entailment', 'neutral', 'contradiction']),
            }),
            supervised_keys=None,
            homepage='https://nlp.stanford.edu/projects/snli/',
        )

    def _split_generators(self, dl_manager):
        # Download and extract the archive, then point the example
        # generator at the tab-separated training file.
        dl_directory = dl_manager.download_and_extract(url)
        data_directory = os.path.join(dl_directory, 'snli_1.0')
        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TRAIN,
                gen_kwargs={
                    'filepath': os.path.join(data_directory,
                                             'snli_1.0_train.txt')
                }),
        ]

    def _generate_examples(self, filepath):
        # Added so the builder is complete: GeneratorBasedBuilder requires
        # _generate_examples. Parse the TSV file and skip pairs with no
        # gold label ('-').
        with tf.io.gfile.GFile(filepath) as f:
            reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
            for idx, row in enumerate(reader):
                if row['gold_label'] == '-':
                    continue
                yield idx, {
                    'premise': row['sentence1'],
                    'hypothesis': row['sentence2'],
                    'label': row['gold_label'],
                }
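If you just need the data rather than a custom builder, SNLI is also registered in the TFDS catalog under the name 'snli'. A minimal loading sketch:

import tensorflow_datasets as tfds

# Load the catalog version of SNLI and inspect one example.
ds = tfds.load('snli', split='train', shuffle_files=True)
for example in ds.take(1):
    print(example['premise'], example['hypothesis'], example['label'])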
Loading the dataset using Pytorch
The helper classes below download the dataset from its URL and split it into train, validation and test sets.
from torchtext import data


class ShiftReduceField(data.Field):
    """Maps a binary parse ('(' / ')') to shift/reduce transitions."""

    def __init__(self):
        super(ShiftReduceField, self).__init__(preprocessing=lambda parse: [
            'reduce' if t == ')' else 'shift' for t in parse if t != '('])
        self.build_vocab([['reduce'], ['shift']])


class ParsedTextField(data.Field):
    """Text field that strips parse brackets; optionally reverses tokens."""

    def __init__(self, eos_token='<pad>', lower=False, reverse=False):
        if reverse:
            super(ParsedTextField, self).__init__(
                eos_token=eos_token, lower=lower,
                preprocessing=lambda parse: [
                    t for t in parse if t not in ('(', ')')],
                postprocessing=lambda parse, _: [
                    list(reversed(p)) for p in parse],
                include_lengths=True)
        else:
            super(ParsedTextField, self).__init__(
                eos_token=eos_token, lower=lower,
                preprocessing=lambda parse: [
                    t for t in parse if t not in ('(', ')')],
                include_lengths=True)


class NLIDataset(data.TabularDataset):
    # torchtext's download machinery expects the class attributes
    # 'urls', 'dirname' and 'name'.
    urls = []
    dirname = ''
    name = 'nli'

    @staticmethod
    def sort_key(ex):
        # Batch together examples with similar premise/hypothesis lengths.
        return data.interleave_keys(
            len(ex.premise), len(ex.hypothesis))

    @classmethod
    def splits(cls, text_field, label_field, parse_field=None,
               extra_fields={}, root='.data', train='train.jsonl',
               validation='val.jsonl', test='test.jsonl'):
        path = cls.download(root)

        if parse_field is None:
            fields = {'sentence1': ('premise', text_field),
                      'sentence2': ('hypothesis', text_field),
                      'gold_label': ('label', label_field)}
        else:
            fields = {'sentence1_binary_parse': [
                          ('premise', text_field),
                          ('premise_transitions', parse_field)],
                      'sentence2_binary_parse': [
                          ('hypothesis', text_field),
                          ('hypothesis_transitions', parse_field)],
                      'gold_label': ('label', label_field)}

        for key in extra_fields:
            if key not in fields.keys():
                fields[key] = extra_fields[key]

        # Drop pairs with no gold label ('-').
        return super(NLIDataset, cls).splits(
            path, root, train, validation, test,
            format='json', fields=fields,
            filter_pred=lambda ex: ex.label != '-')

    @classmethod
    def iters(cls, batch_size=32, device=0, root='.data',
              vectors=None, trees=False, **kwargs):
        if trees:
            TEXT = ParsedTextField()
            TRANSITIONS = ShiftReduceField()
        else:
            TEXT = data.Field(tokenize='spacy')
            TRANSITIONS = None
        LABEL = data.Field(sequential=False)

        train, val, test = cls.splits(
            TEXT, LABEL, TRANSITIONS, root=root, **kwargs)

        TEXT.build_vocab(train, vectors=vectors)
        LABEL.build_vocab(train)

        return data.BucketIterator.splits(
            (train, val, test), batch_size=batch_size, device=device)
class SNLI(NLIDataset):
    urls = ['http://nlp.stanford.edu/projects/snli/snli_1.0.zip']
    dirname = 'snli_1.0'
    name = 'snli'

    @classmethod
    def splits(cls, text_field, label_field, parse_field=None, root='.data',
               train='snli_1.0_train.jsonl',
               validation='snli_1.0_dev.jsonl',
               test='snli_1.0_test.jsonl'):
        return super(SNLI, cls).splits(text_field, label_field,
                                       parse_field=parse_field, root=root,
                                       train=train, validation=validation,
                                       test=test)
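Assuming the NLIDataset helpers above are in scope, a minimal sketch of building vocabularies and bucketed iterators over SNLI might look like this:

from torchtext import data

# Define fields, download and split SNLI, then build vocabularies.
TEXT = data.Field(tokenize='spacy', lower=True)
LABEL = data.Field(sequential=False)
train, val, test = SNLI.splits(TEXT, LABEL)
TEXT.build_vocab(train)
LABEL.build_vocab(train)
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=32)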
Parameter Specification
text_field: The field used for the premise and hypothesis text.
label_field: The field used for the label data.
parse_field: The field used for shift-reduce parser transitions.
root: The root directory where the dataset archive is downloaded and extracted. Default: '.data'.
train: Training set. Default: 'train.jsonl'.
test: Testing set. Default: 'test.jsonl'.
validation: Validation set. Default: 'val.jsonl'.
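To illustrate the parse_field parameter, the hypothetical sketch below (reusing the classes defined earlier) requests shift-reduce transitions alongside the text fields:

from torchtext import data

# With parse_field set, each example also carries premise_transitions
# and hypothesis_transitions attributes built from the binary parses.
TEXT = ParsedTextField()
LABEL = data.Field(sequential=False)
TRANSITIONS = ShiftReduceField()
train, val, test = SNLI.splits(TEXT, LABEL, parse_field=TRANSITIONS)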
State of the Art
The current state of the art on the SNLI dataset is CA-MTL, with an accuracy of 92.1%.
Multi-NLI
The Multi-Genre Natural Language Inference (MultiNLI) corpus contains 433,000 sentence pairs annotated with textual entailment information. It was developed by Adina Williams, Nikita Nangia and Samuel R. Bowman. The corpus is modeled on SNLI but differs in that it covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.
Loading the dataset using TensorFlow
import csv
import os

import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds


class MultiNLI(tfds.core.GeneratorBasedBuilder):
    """MultiNLI as a TFDS dataset builder."""

    VERSION = tfds.core.Version("1.1.0")

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                "premise": tfds.features.Text(),
                "hypothesis": tfds.features.Text(),
                "label": tfds.features.ClassLabel(
                    names=["entailment", "neutral", "contradiction"]),
            }),
            supervised_keys=None,
            homepage="https://www.nyu.edu/projects/bowman/multinli/",
        )

    def _split_generators(self, dl_manager):
        downloaded_dir = dl_manager.download_and_extract(
            "https://cims.nyu.edu/~sbowman/multinli/multinli_1.0.zip")
        multinli_path = os.path.join(downloaded_dir, "multinli_1.0")
        train_path = os.path.join(multinli_path, "multinli_1.0_train.txt")
        matched_validation_path = os.path.join(
            multinli_path, "multinli_1.0_dev_matched.txt")
        mismatched_validation_path = os.path.join(
            multinli_path, "multinli_1.0_dev_mismatched.txt")
        # Expose the two dev files as matched/mismatched validation splits.
        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TRAIN,
                gen_kwargs={"filepath": train_path}),
            tfds.core.SplitGenerator(
                name="validation_matched",
                gen_kwargs={"filepath": matched_validation_path}),
            tfds.core.SplitGenerator(
                name="validation_mismatched",
                gen_kwargs={"filepath": mismatched_validation_path}),
        ]

    def _generate_examples(self, filepath):
        # Added so the builder is complete: the MultiNLI files share
        # SNLI's TSV layout, so the parsing is the same.
        with tf.io.gfile.GFile(filepath) as f:
            reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
            for idx, row in enumerate(reader):
                if row["gold_label"] == "-":
                    continue
                yield idx, {
                    "premise": row["sentence1"],
                    "hypothesis": row["sentence2"],
                    "label": row["gold_label"],
                }
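As with SNLI, the corpus is also registered in the TFDS catalog, under the name 'multi_nli'. A minimal loading sketch:

import tensorflow_datasets as tfds

# Load the matched validation split of the catalog MultiNLI dataset.
ds = tfds.load('multi_nli', split='validation_matched')
for example in ds.take(1):
    print(example['premise'], example['hypothesis'], example['label'])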
Loading the dataset using Pytorch
class MultiNLI(NLIDataset):
    urls = ['http://www.nyu.edu/projects/bowman/multinli/multinli_1.0.zip']
    dirname = 'multinli_1.0'
    name = 'multinli'

    @classmethod
    def splits(cls, text_field, label_field, parse_field=None,
               genre_field=None, root='.data',
               train='multinli_1.0_train.jsonl',
               validation='multinli_1.0_dev_matched.jsonl',
               test='multinli_1.0_dev_mismatched.jsonl'):
        # Optionally expose the genre annotation as an extra field.
        extra_fields = {}
        if genre_field is not None:
            extra_fields['genre'] = ('genre', genre_field)
        return super(MultiNLI, cls).splits(text_field, label_field,
                                           parse_field=parse_field,
                                           extra_fields=extra_fields,
                                           root=root, train=train,
                                           validation=validation, test=test)
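A minimal sketch of the extra genre field in action, assuming the classes above are in scope:

from torchtext import data

# Request the genre annotation alongside premise/hypothesis/label.
TEXT = data.Field(tokenize='spacy')
LABEL = data.Field(sequential=False)
GENRE = data.Field(sequential=False)
train, val, test = MultiNLI.splits(TEXT, LABEL, genre_field=GENRE)
TEXT.build_vocab(train)
LABEL.build_vocab(train)
GENRE.build_vocab(train)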
State of the Art
The current state of the art on the MultiNLI dataset is T5-11B, with an accuracy of 92%.
XNLI
The Cross-lingual Natural Language Inference (XNLI) corpus contains 5,000 test and 2,500 development pairs for the MultiNLI corpus. It was developed by Alexis Conneau, Adina Williams, Samuel R. Bowman and colleagues. The pairs are annotated with textual entailment labels and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu.
Loading the dataset using TensorFlow
import collections
import csv
import os

import six
import tensorflow.compat.v2 as tf
import tensorflow_datasets.public_api as tfds

url = 'https://cims.nyu.edu/~sbowman/xnli/XNLI-1.0.zip'
languages = ('ar', 'bg', 'de', 'el', 'en', 'es', 'fr', 'hi', 'ru', 'sw',
             'th', 'tr', 'ur', 'vi', 'zh')


class Xnli(tfds.core.GeneratorBasedBuilder):
    """XNLI as a TFDS dataset builder."""

    VERSION = tfds.core.Version('1.1.0')

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                # The premise is stored as a fixed set of translations,
                # the hypothesis as a variable-language translation.
                'premise': tfds.features.Translation(
                    languages=languages),
                'hypothesis': tfds.features.TranslationVariableLanguages(
                    languages=languages),
                'label': tfds.features.ClassLabel(
                    names=['entailment', 'neutral', 'contradiction']),
            }),
            supervised_keys=None,
            homepage='https://www.nyu.edu/projects/bowman/xnli/',
        )

    def _split_generators(self, dl_manager):
        dl_directory = dl_manager.download_and_extract(url)
        data_directory = os.path.join(dl_directory, 'XNLI-1.0')
        return [
            tfds.core.SplitGenerator(
                name=tfds.Split.TEST,
                gen_kwargs={
                    'filepath': os.path.join(data_directory, 'xnli.test.tsv')
                }),
        ]

    def _generate_examples(self, filepath):
        # Added so the builder is complete: each premise/hypothesis pair
        # appears once per language, so group rows by pairID before
        # emitting one multilingual example per pair.
        rows_per_pair_id = collections.defaultdict(list)
        with tf.io.gfile.GFile(filepath) as f:
            reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
            for row in reader:
                rows_per_pair_id[row['pairID']].append(row)
        for idx, rows in enumerate(six.itervalues(rows_per_pair_id)):
            yield idx, {
                'premise': {row['language']: row['sentence1']
                            for row in rows},
                'hypothesis': {row['language']: row['sentence2']
                               for row in rows},
                'label': rows[0]['gold_label'],
            }
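The catalog name is 'xnli'. Because the premise feature is a Translation dict, each example exposes one string per language. A minimal loading sketch:

import tensorflow_datasets as tfds

# Load the XNLI test split and read the English premise of one example.
ds = tfds.load('xnli', split='test')
for example in ds.take(1):
    print(example['premise']['en'], example['label'])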
Loading the dataset using Pytorch
class XNLI(NLIDataset):
    urls = ['http://www.nyu.edu/projects/bowman/xnli/XNLI-1.0.zip']
    dirname = 'XNLI-1.0'
    name = 'xnli'

    @classmethod
    def splits(cls, text_field, label_field, genre_field=None,
               language_field=None, root='.data',
               validation='xnli.dev.jsonl', test='xnli.test.jsonl'):
        # Optionally expose the genre and language annotations.
        extra_fields = {}
        if genre_field is not None:
            extra_fields['genre'] = ('genre', genre_field)
        if language_field is not None:
            extra_fields['language'] = ('language', language_field)
        return super(XNLI, cls).splits(text_field, label_field,
                                       extra_fields=extra_fields, root=root,
                                       train=None, validation=validation,
                                       test=test)

    @classmethod
    def iters(cls, *args, **kwargs):
        # XNLI ships no training split, so the convenience iterator
        # interface is intentionally unsupported.
        raise NotImplementedError('XNLI dataset does not support iters')
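Because train is None, splits returns only the validation and test datasets. A minimal sketch, assuming the classes above are in scope:

from torchtext import data

# XNLI has no training split, so only dev and test come back.
TEXT = data.Field()
LABEL = data.Field(sequential=False)
LANGUAGE = data.Field(sequential=False)
val, test = XNLI.splits(TEXT, LABEL, language_field=LANGUAGE)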
State of the Art
The current state of the art on the XNLI dataset is RoBERTa-wwm-ext-large, with an accuracy of 81.2%.
Conclusion
In this article, we discussed some of the most popular datasets used in textual entailment and showed how to load each corpus with PyTorch and TensorFlow. Entailment and contradiction are powerful vehicles for reasoning: essentially all questions about meaning in language can be reduced to questions of entailment and contradiction in context. This suggests that textual entailment is an ideal testing ground for theories of semantic representation.