
Datasets for Language Modelling in NLP using TensorFlow and PyTorch

Ankit Das

In recent times, language modelling has gained momentum in the field of Natural Language Processing, so it is worth thinking about new models and strategies for faster and better training of language models. Because of the complexity of language, however, the dataset itself poses problems. As the size of a dataset grows, the average number of times each word appears in it also grows, and models that perform well on small datasets may not perform well on larger ones.

Here, we will discuss some of the most popular datasets for word-level language modelling, and then load each of them with the TensorFlow and PyTorch libraries.

Dataset Statistics

In comparison to the Penn Treebank dataset, the WikiText datasets are larger. WikiText-2 aims to be of a similar size to the Penn Treebank, while WikiText-103 contains the full set of verified Good and Featured articles extracted from Wikipedia. Roughly:

  • Penn Treebank – 929k training tokens, vocabulary of 10,000
  • WikiText-2 – about 2 million training tokens, vocabulary of 33,278
  • WikiText-103 – about 103 million training tokens, vocabulary of 267,735

WikiText-103

The WikiText-103 dataset, created by Salesforce, contains more than 100 million tokens retrieved from the set of verified Good and Featured articles on Wikipedia. It comprises 28,475 such articles and, with 103 million tokens, preserves long-range dependencies across full articles. Its vocabulary size is 267,735 after replacing every token that appears no more than twice with the <unk> token. This makes experimenting with word-level LMs on this dataset expensive: for an embedding size of 400, the embedding layer alone has 267,735 × 400 ≈ 107M parameters.
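The embedding-size arithmetic above is easy to check directly. A quick sketch (vocabulary size and embedding dimension taken from the text):

```python
# Parameter count of the embedding layer for WikiText-103:
# one row of `embedding_dim` weights per vocabulary entry.
vocab_size = 267735      # WikiText-103 vocabulary after <unk> replacement
embedding_dim = 400      # embedding size used in the text

params = vocab_size * embedding_dim
print(f"{params:,} parameters (~{params / 1e6:.0f}M)")
# 107,094,000 parameters (~107M)
```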



Loading the WikiText-103 dataset using TensorFlow

Download the dataset from the link given below.

!wget --quiet https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
!unzip wikitext-103-raw-v1.zip

Pass the path of the extracted file to the function below. TensorFlow will read the data, drop article headers, and split the remaining lines into sentences.

import tensorflow as tf

def wiki103(path):
  data2 = tf.data.TextLineDataset(path)
  # Drop article section headers such as " = Heading = ".
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data2 = data2.filter(content_filter)
  # Split each remaining line into sentences on ' . '.
  data2 = data2.map(lambda x: tf.strings.split(x, ' . '))
  data2 = data2.unbatch()
  return data2

train = wiki103('/content/wikitext-103-raw/wiki.train.raw')
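The `content_filter` above drops Wikipedia section headings such as ` = Valkyria Chronicles = ` so that only body text remains. TensorFlow's regex engine accepts POSIX classes like `[[:space:]]`; the equivalent check in plain Python `re` (our own translation using `\s`, shown only to illustrate what the filter matches):

```python
import re

# Python-re translation of the TF filter pattern: lines that look like
# " = Heading = " (or " = = Subheading = = ") are treated as headers.
header = re.compile(r'(\s=)+.+(\s=)+\s*')

lines = [
    ' = Valkyria Chronicles III = ',
    ' = = Gameplay = = ',
    'The game takes place during the Second Europan War .',
]
body = [l for l in lines if not header.fullmatch(l)]
print(body)
# ['The game takes place during the Second Europan War .']
```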

Loading the WikiText-103 dataset using PyTorch

from torchtext import data
import io

class LanguageModelingDataset(data.Dataset):
    """Defines a dataset for word-level language modelling."""

    def __init__(self, path, text_field, newline_eos=True,
                 encoding='utf-8', **kwargs):
        fields = [('text', text_field)]
        text = []
        with io.open(path, encoding=encoding) as f:
            for line in f:
                text += text_field.preprocess(line)
                if newline_eos:
                    text.append(u'<eos>')

        examples = [data.Example.fromlist([text], fields)]
        super(LanguageModelingDataset, self).__init__(
            examples, fields, **kwargs)

class WikiText103(LanguageModelingDataset):
    urls = ['https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip']
    name = 'wikitext-103'
    dirname = 'wikitext-103'

    @classmethod
    def splits(cls, text_field, root='.data', train='wiki.train.tokens',
               validation='wiki.valid.tokens', test='wiki.test.tokens',
               **kwargs):
        return super(WikiText103, cls).splits(
            root=root, train=train, validation=validation, test=test,
            text_field=text_field, **kwargs)

    @classmethod
    def iters(cls, batch_size=32, bptt_len=35, device=0, root='.data',
              vectors=None, **kwargs):
        text2 = data.Field()
        train, val, test = cls.splits(text2, root=root, **kwargs)
        text2.build_vocab(train, vectors=vectors)
        return data.BPTTIterator.splits(
            (train, val, test), batch_size=batch_size, bptt_len=bptt_len,
            device=device)
  • root – directory where the dataset’s zip archive will be stored.
  • batch_size – number of training examples passed in one iteration.
  • bptt_len – length of the sequence used for backpropagation through time.
  • device – use -1 for the CPU and None for the currently active GPU device.
  • text_field – field that will be used for the text data.
  • train – training set.
  • validation – validation set.
  • test – test set.
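The `bptt_len` parameter above controls how the flat token stream is cut into (input, target) chunks, where the target is the input shifted by one token. A minimal pure-Python sketch of that slicing (the token list and function name are illustrative, not torchtext internals):

```python
def bptt_chunks(tokens, bptt_len):
    """Yield (input, target) pairs; target is the input shifted by one token."""
    for i in range(0, len(tokens) - 1, bptt_len):
        chunk = tokens[i:i + bptt_len]
        target = tokens[i + 1:i + 1 + bptt_len]
        yield chunk, target

stream = ['the', 'cat', 'sat', 'on', 'the', 'mat', '<eos>']
for x, y in bptt_chunks(stream, bptt_len=3):
    print(x, '->', y)
# ['the', 'cat', 'sat'] -> ['cat', 'sat', 'on']
# ['on', 'the', 'mat'] -> ['the', 'mat', '<eos>']
```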

Testing and Validation Perplexity

State of the Art on WikiText-103

The present state of the art on the WikiText-103 dataset is Megatron-LM, with a test perplexity of 10.81. Lower perplexity indicates better performance.
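Perplexity is the exponential of the model's average negative log-likelihood per token, which is why lower is better and why it is a plain number rather than a percentage. A small worked example (the per-token probabilities are made up for illustration):

```python
import math

# Model-assigned probability of each ground-truth token in a tiny sequence.
token_probs = [0.2, 0.5, 0.1, 0.4]

# Average negative log-likelihood per token (cross-entropy in nats),
# then exponentiate to get perplexity.
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(nll)
print(round(perplexity, 2))
# 3.98
```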

WikiText-2

WikiText-2 is a 2-million-token variant of WikiText-103 with a vocabulary size of 33,278. As a small version of the WikiText-103 dataset, it is well suited for quickly testing a language model.


Loading the WikiText-2 dataset using TensorFlow

!wget --quiet https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
!unzip wikitext-2-raw-v1.zip
def wiki1(path):
  data = tf.data.TextLineDataset(path)
  # Drop article section headers such as " = Heading = ".
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  # Split each remaining line into sentences on ' . '.
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

train = wiki1('/content/wikitext-2-raw/wiki.train.raw')

Loading the WikiText-2 dataset using PyTorch

class WikiText2(LanguageModelingDataset):
    urls = ['https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip']
    name = 'wikitext-2'
    dirname = 'wikitext-2'

    @classmethod
    def splits(cls, text_field, root='.data', train='wiki.train.tokens',
               validation='wiki.valid.tokens', test='wiki.test.tokens',
               **kwargs):
        return super(WikiText2, cls).splits(
            root=root, train=train, validation=validation, test=test,
            text_field=text_field, **kwargs)

    @classmethod
    def iters(cls, batch_size=32, bptt_len=35, device=0, root='.data',
              vectors=None, **kwargs):
        text1 = data.Field()
        train, val, test = cls.splits(text1, root=root, **kwargs)
        text1.build_vocab(train, vectors=vectors)
        return data.BPTTIterator.splits(
            (train, val, test), batch_size=batch_size, bptt_len=bptt_len,
            device=device)

Testing and Validation Perplexity

State of the Art on WikiText-2

The present state of the art on the WikiText-2 dataset is GPT-2, with a test perplexity of 18.34.

Penn Treebank

The Penn Treebank dataset contains the Penn Treebank portion of the Wall Street Journal corpus, preprocessed by Mikolov. It comprises 929k tokens for training, 73k for validation, and 82k for testing. The words in the dataset are lower-cased, numbers are substituted with N, and most punctuation is eliminated. Out-of-vocabulary (OOV) words are replaced with the <unk> token, and the vocabulary is limited to the 10k most frequent words. The dataset contains sentences rather than paragraphs, so its context is limited.
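The preprocessing described above (lower-casing, mapping numbers to N, replacing OOV words with <unk>) can be sketched in a few lines of plain Python; the vocabulary here is a toy stand-in for the real 10k-word list, and the function name is our own:

```python
import re

vocab = {'the', 'market', 'fell', 'points', 'N', '<unk>'}

def ptb_normalize(sentence, vocab):
    """Lower-case, map numbers to N, and replace OOV words with <unk>."""
    tokens = []
    for word in sentence.lower().split():
        if re.fullmatch(r'[\d.,]+', word):
            word = 'N'                      # numbers become N
        if word not in vocab:
            word = '<unk>'                  # out-of-vocabulary words
        tokens.append(word)
    return tokens

print(ptb_normalize('The market fell 120.5 points yesterday', vocab))
# ['the', 'market', 'fell', 'N', 'points', '<unk>']
```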

Loading the Penn Treebank dataset using TensorFlow

import tensorflow as tf

def pentree(path):
  data3 = tf.data.TextLineDataset(path)
  # Drop article headers.
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data3 = data3.filter(content_filter)
  data3 = data3.map(lambda x: tf.strings.split(x, ' . '))
  data3 = data3.unbatch()
  return data3

# TextLineDataset expects a local file, so download the split first.
path = tf.keras.utils.get_file(
    'ptb.train.txt',
    'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt')
train = pentree(path)

Loading the Penn Treebank dataset using PyTorch

class PennTreebank(LanguageModelingDataset):
    urls = ['https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt',
            'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt',
            'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.test.txt']
    name = 'penn-treebank'
    dirname = 'penn-treebank'

    @classmethod
    def splits(cls, text_field, root='.data', train='ptb.train.txt',
               validation='ptb.valid.txt', test='ptb.test.txt',
               **kwargs):
        return super(PennTreebank, cls).splits(
            root=root, train=train, validation=validation, test=test,
            text_field=text_field, **kwargs)

    @classmethod
    def iters(cls, batch_size=32, bptt_len=35, device=0, root='.data',
              vectors=None, **kwargs):
        text3 = data.Field()
        train, val, test = cls.splits(text3, root=root, **kwargs)
        text3.build_vocab(train, vectors=vectors)
        return data.BPTTIterator.splits(
            (train, val, test), batch_size=batch_size, bptt_len=bptt_len,
            device=device)

Testing Perplexity of Penn Treebank

State of the Art on Penn TreeBank

The present state of the art on the Penn Treebank dataset is GPT-3, with a test perplexity of 20.5.

Conclusion

In this article, we have covered some of the most popular datasets for word-level language modelling. Penn Treebank is the smallest and WikiText-103 the largest of the three. Since Penn Treebank is small, it is easier and faster to train a model on it, but it is advisable to check in detail how a model performs across datasets of different sizes.

Copyright Analytics India Magazine Pvt Ltd
