
Datasets for Language Modelling in NLP using TensorFlow and PyTorch

In recent times, language modelling has gained momentum in the field of Natural Language Processing, so it is essential to think of new models and strategies for faster and better training of language models. Nonetheless, because of the complexity of language, we have to deal with some problems in the datasets themselves. As the size of a dataset increases, the average number of times a word appears in that dataset also increases, and models that perform admirably on small datasets may not perform well on larger ones.

Here, we will discuss some of the most popular datasets for word-level language modelling. Further, we will implement loaders for these datasets with the help of the TensorFlow and PyTorch libraries.

Dataset Statistics

In comparison to the Penn Treebank dataset, the WikiText datasets are larger. WikiText-2 aims to be of a similar size to the Penn Treebank, while WikiText-103 contains the full set of verified Good and Featured articles extracted from Wikipedia.

WikiText-103

The WikiText-103 dataset, created by Salesforce, contains more than 100 million tokens retrieved from the set of verified Good and Featured articles on Wikipedia. It comprises 28,475 such articles and, with 103 million tokens of full-article text, preserves long-term dependencies. Its vocabulary size is 267,735 after replacing every token that appears no more than twice with the <unk> token. This large vocabulary makes it expensive to experiment with word-level language models on this dataset: for an embedding size of 400, the embedding layer alone has 267K x 400 ≈ 107M parameters.

Loading the WikiText-103 dataset using TensorFlow

Download and unzip the dataset using the commands given below.

!wget --quiet https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
!unzip wikitext-103-raw-v1.zip

Pass the path of the extracted training file to the function below; the TensorFlow tf.data API reads the data and stores it in a convenient format.

import tensorflow as tf

def wiki103(path):
  data2 = tf.data.TextLineDataset(path)
  # Drop article headers, i.e. lines of the form " = Article Title = ".
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data2 = data2.filter(content_filter)
  # Split each remaining line into sentence-like pieces on ' . ' and flatten them.
  data2 = data2.map(lambda x: tf.strings.split(x, ' . '))
  data2 = data2.unbatch()
  return data2

train = wiki103('/content/wikitext-103-raw/wiki.train.raw')
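
As a quick sanity check, we can take a few elements from the resulting dataset and print them (a minimal sketch; the exact sentences will depend on the raw file):

# Print the first three sentence-like chunks produced by the wiki103() pipeline above.
for sentence in train.take(3):
  print(sentence.numpy().decode('utf-8'))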

Loading the WikiText-103 dataset using PyTorch

import io
from torchtext import data  # in torchtext 0.9-0.11 this legacy API lives under torchtext.legacy.data


class LanguageModelingDataset(data.Dataset):
    """Concatenates an entire corpus into a single example for language modelling."""

    def __init__(self, path, text_field, newline_eos=True,
                 encoding='utf-8', **kwargs):
        fields = [('text', text_field)]
        text = []
        with io.open(path, encoding=encoding) as f:
            for line in f:
                text += text_field.preprocess(line)
                if newline_eos:
                    # Mark line breaks with an explicit end-of-sentence token.
                    text.append(u'<eos>')

        examples = [data.Example.fromlist([text], fields)]
        super(LanguageModelingDataset, self).__init__(
            examples, fields, **kwargs)


class WikiText103(LanguageModelingDataset):
    urls = ['https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip']
    name = 'wikitext-103'
    dirname = 'wikitext-103'

    @classmethod
    def splits(cls, text_field, root='.data', train='wiki.train.tokens',
               validation='wiki.valid.tokens', test='wiki.test.tokens',
               **kwargs):
        return super(WikiText103, cls).splits(
            root=root, train=train, validation=validation, test=test,
            text_field=text_field, **kwargs)

    @classmethod
    def iters(cls, batch_size=32, bptt_len=35, device=0, root='.data',
              vectors=None, **kwargs):
        TEXT = data.Field()
        train, val, test = cls.splits(TEXT, root=root, **kwargs)
        TEXT.build_vocab(train, vectors=vectors)
        return data.BPTTIterator.splits(
            (train, val, test), batch_size=batch_size, bptt_len=bptt_len,
            device=device)
The splits() and iters() helpers accept the following arguments (see the usage sketch below):
  • root – the directory where the dataset’s zip archive will be stored.
  • batch_size – the number of training examples passed in one iteration.
  • bptt_len – the length of the sequence used for backpropagation through time.
  • device – use -1 for the CPU and None for the currently active GPU device.
  • text_field – the field that will be used for the text data.
  • train – the training split file name.
  • validation – the validation split file name.
  • test – the test split file name.
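
Below is a minimal usage sketch, assuming a legacy torchtext release (roughly 0.8 or earlier, or torchtext.legacy in 0.9-0.11) where data.Field and data.BPTTIterator are available; the first call downloads and tokenises the full 103M-token corpus, so it takes a while:

# Build BPTT iterators over WikiText-103 using the class defined above.
train_iter, val_iter, test_iter = WikiText103.iters(batch_size=32, bptt_len=35, device=-1)  # -1 selects the CPU
batch = next(iter(train_iter))
print(batch.text.shape, batch.target.shape)  # both are (bptt_len, batch_size) tensors of token ids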

State of the Art on WikiText-103

The present state of the art on the WikiText-103 dataset is Megatron-LM, which achieved a test perplexity of 10.81. Lower perplexity indicates a better-performing model.
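
For context, perplexity is the exponential of the average per-token cross-entropy loss, so lower values mean the model assigns higher probability to the held-out text. A quick hypothetical illustration (the loss value below is made up so that it lands near the score quoted above):

import math

avg_cross_entropy = 2.38                  # hypothetical average loss in nats per token
perplexity = math.exp(avg_cross_entropy)
print(round(perplexity, 2))               # ≈ 10.8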

WikiText-2

WikiText-2 is a 2M-token variant of WikiText-103 with a vocabulary size of 33,278. It is a small version of the WikiText-103 dataset and is well suited for quickly testing a language model.

Loading the WikiText-2 dataset using TensorFlow

!wget --quiet https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip
!unzip wikitext-2-raw-v1.zip
def wiki2(path):
  data = tf.data.TextLineDataset(path)
  # Drop article headers, i.e. lines of the form " = Article Title = ".
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  # Split each remaining line into sentence-like pieces on ' . ' and flatten them.
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

train = wiki2('/content/wikitext-2-raw/wiki.train.raw')
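
Once the lines are loaded, a word-level vocabulary can be built directly from the tf.data pipeline. The snippet below is a minimal sketch assuming TensorFlow 2.6 or later, where tf.keras.layers.TextVectorization is available; the max_tokens cap is an arbitrary illustrative choice that echoes the WikiText-2 vocabulary size:

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=33278,   # illustrative cap only
    standardize=None)   # keep the text as-is rather than lower-casing it
vectorizer.adapt(train.batch(1024))
print(vectorizer.get_vocabulary()[:10])   # the most frequent tokens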

Loading the WikiText-2 dataset using PyTorch

class WikiText2(LanguageModelingDataset):
    urls = ['https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip']
    name = 'wikitext-2'
    dirname = 'wikitext-2'

    @classmethod
    def splits(cls, text_field, root='.data', train='wiki.train.tokens',
               validation='wiki.valid.tokens', test='wiki.test.tokens',
               **kwargs):
        return super(WikiText2, cls).splits(
            root=root, train=train, validation=validation, test=test,
            text_field=text_field, **kwargs)

    @classmethod
    def iters(cls, batch_size=32, bptt_len=35, device=0, root='.data',
              vectors=None, **kwargs):
        TEXT = data.Field()
        train, val, test = cls.splits(TEXT, root=root, **kwargs)
        TEXT.build_vocab(train, vectors=vectors)
        return data.BPTTIterator.splits(
            (train, val, test), batch_size=batch_size, bptt_len=bptt_len,
            device=device)
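
To sanity-check the vocabulary size quoted above, a rough sketch along these lines can be used (same legacy torchtext assumption as before):

# Build the vocabulary from the WikiText-2 training split and inspect its size.
TEXT = data.Field()
train_split, valid_split, test_split = WikiText2.splits(TEXT)
TEXT.build_vocab(train_split)
print(len(TEXT.vocab))   # roughly 33,278 plus torchtext's special tokens such as <pad>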

State of the Art on WikiText-2

The present state of the art on the WikiText-2 dataset is GPT-2, which achieved a test perplexity of 18.34.

Penn Treebank

The Penn Treebank dataset contains the Penn Treebank portion of the Wall Street Journal corpus, preprocessed by Mikolov et al. It comprises 929k tokens for training, 73k for validation, and 82k for testing. The words in the dataset are lower-cased, numbers are substituted with N, and most punctuation is eliminated. Out-of-vocabulary (OOV) words are replaced with the <unk> token, and the vocabulary is restricted to the 10k most frequent words. Since the dataset contains individual sentences rather than paragraphs, the available context is limited.
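
The snippet below is a rough, illustrative sketch of that style of preprocessing (not the exact Mikolov scripts) using a toy vocabulary:

import re

def ptb_style_preprocess(sentence, vocab):
    # Roughly mimic PTB preprocessing: lower-case, map numbers to 'N',
    # drop most punctuation and replace out-of-vocabulary words with '<unk>'.
    sentence = sentence.lower()
    sentence = re.sub(r'\d+(\.\d+)?', 'N', sentence)
    sentence = re.sub(r"[^a-zN\s']", ' ', sentence)
    tokens = sentence.split()
    return ' '.join(tok if tok in vocab else '<unk>' for tok in tokens)

# Toy vocabulary for illustration only; the real PTB vocabulary is the 10k most frequent words.
vocab = {'the', 'company', 'reported', 'profit', 'of', 'N', 'million'}
print(ptb_style_preprocess('The company reported a profit of $25.4 million!', vocab))
# -> the company reported <unk> profit of N million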

Loading the Penn Treebank dataset using TensorFlow

import tensorflow as tf

def pentree(path):
  data3 = tf.data.TextLineDataset(path)
  # Drop article headers (kept for symmetry with the WikiText loaders; PTB has no such headers).
  def content_filter(source):
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data3 = data3.filter(content_filter)
  data3 = data3.map(lambda x: tf.strings.split(x, ' . '))
  data3 = data3.unbatch()
  return data3

# tf.data.TextLineDataset cannot read https:// URLs directly, so download the file first.
ptb_train_path = tf.keras.utils.get_file(
    'ptb.train.txt',
    'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt')
train = pentree(ptb_train_path)

Loading the Penn Treebank dataset using PyTorch

class PennTreebank(LanguageModelingDataset):
    urls = ['https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.train.txt',
            'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.valid.txt',
            'https://raw.githubusercontent.com/wojzaremba/lstm/master/data/ptb.test.txt']
    name = 'penn-treebank'
    dirname = ''  # the .txt files are downloaded directly into root/penn-treebank

    @classmethod
    def splits(cls, text_field, root='.data', train='ptb.train.txt',
               validation='ptb.valid.txt', test='ptb.test.txt',
               **kwargs):
        return super(PennTreebank, cls).splits(
            root=root, train=train, validation=validation, test=test,
            text_field=text_field, **kwargs)

    @classmethod
    def iters(cls, batch_size=32, bptt_len=35, device=0, root='.data',
              vectors=None, **kwargs):
        TEXT = data.Field()
        train, val, test = cls.splits(TEXT, root=root, **kwargs)
        TEXT.build_vocab(train, vectors=vectors)
        return data.BPTTIterator.splits(
            (train, val, test), batch_size=batch_size, bptt_len=bptt_len,
            device=device)

State of the Art on Penn Treebank

The present state of the art on the Penn Treebank dataset is GPT-3, which achieved a test perplexity of 20.5.

Conclusion

In this article, we have covered some of the most popular datasets for word-level language modelling. Penn Treebank is the smallest and WikiText-103 is the largest of the three. Because the Penn Treebank is small, it is easier and faster to train models on it. It is therefore advisable to check in detail how a model performs across datasets of different sizes.

Ankit Das
A data analyst with expertise in statistical analysis, data visualization ready to serve the industry using various analytical platforms. I look forward to having in-depth knowledge of machine learning and data science. Outside work, you can find me as a fun-loving person with hobbies such as sports and music.
