
Most Popular Datasets For Neural Sequence Tagging with the Implementation in TensorFlow and PyTorch

In artificial intelligence, sequence tagging is a type of pattern recognition task in which a categorical tag is algorithmically assigned to each member of a sequence of observed values. It covers several sequence labelling tasks: part-of-speech (POS) tagging, named entity recognition (NER), and chunking.


POS tagging assigns a part-of-speech label to each word in a sentence; named entity recognition identifies named entities such as person or organisation names; and chunking identifies syntactic constituents within a sentence, such as noun or verb phrases.

For example:

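Consider the sentence "John lives in New York City" (an illustrative sentence, not taken from any of the datasets below). The three tasks would produce roughly the following output:

POS tagging:   John/NNP  lives/VBZ  in/IN  New/NNP  York/NNP  City/NNP
NER:           [John]PERSON  lives in  [New York City]LOCATION
Chunking:      [John]NP  [lives]VP  [in]PP  [New York City]NP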

Here, we will cover the details of the datasets used for sequence tagging. Further, we will show how to load these datasets using the TensorFlow and PyTorch (torchtext) libraries.

CoNLL 2000

CoNLL 2000 was introduced in 2000 by Tjong Kim Sang and Buchholz as a shared task on text chunking. The data comes from the same sections of the Wall Street Journal (WSJ) corpus that are commonly used for noun phrase chunking: 211,727 tokens for the training data and 47,377 tokens for the test data. After CoNLL 2000, the organisers introduced three more CoNLL datasets.
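Each line of the CoNLL 2000 files contains a token, its POS tag, and its chunk tag separated by single spaces, with a blank line between sentences; the beginning of the training file looks roughly like this:

Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP
is VBZ B-VP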

State of the Art on CoNLL 2000

The current state of the art on the CoNLL 2000 dataset is SWEM-CRF, which achieved an F1 score of 90.34.

CoNLL 2002

The CoNLL 2002 dataset was used for the shared task on language-independent NER. The data consists of newswire text covering two languages: Spanish and Dutch. It was developed by Tjong Kim Sang.

State of the Art on CoNLL 2002

The current state of the art on the CoNLL 2002 dataset is ACE + document-context, which achieved an F1 score of 95.5.

CoNLL 2003

CoNLL 2003 was developed by Tjong Kim Sang and De Meulder. It is similar to CoNLL 2002, but covers the English and German languages.

State of the Art on CoNLL 2003

The current state of the art on the CoNLL 2003 dataset is LUKE, which achieved an F1 score of 94.3.

CoNLL 2012

The CoNLL 2012 dataset was created for a shared task on multilingual unrestricted coreference resolution. It is larger than the earlier CoNLL NER datasets.

State of the Art on CoNLL 2012

The current state of the art on the CoNLL 2012 dataset is CorefQA + SpanBERT-large, which achieved an F1 score of 83.1.

Loading the dataset using TorchText

from torchtext import data
import random

class SequenceTaggingDataset(data.Dataset):
    """Dataset for sequence tagging read from column-formatted files:
    one token per line, columns separated by a delimiter, and a blank
    line between sentences."""

    @staticmethod
    def sort_key(example):
        for attr in dir(example):
            if not callable(getattr(example, attr)) and \
                    not attr.startswith("__"):
                return len(getattr(example, attr))
        return 0

    def __init__(self, path, fields, encoding="utf-8", separator="\t", **kwargs):
        examples1 = []
        newcolumns = []
        with open(path, encoding=encoding) as input_file:
            for line in input_file:
                line = line.strip()
                if line == "":
                    # A blank line marks the end of a sentence.
                    if newcolumns:
                        examples1.append(data.Example.fromlist(newcolumns, fields))
                    newcolumns = []
                else:
                    # Each non-blank line holds one token and its tags.
                    for i, column in enumerate(line.split(separator)):
                        if len(newcolumns) < i + 1:
                            newcolumns.append([])
                        newcolumns[i].append(column)
            if newcolumns:
                examples1.append(data.Example.fromlist(newcolumns, fields))
        super(SequenceTaggingDataset, self).__init__(examples1, fields, **kwargs)

class CoNLL2000Chunking(SequenceTaggingDataset):
    # urls, dirname and name are used by torchtext's download machinery.
    urls = ['https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz',
            'https://www.clips.uantwerpen.be/conll2000/chunking/test.txt.gz']
    dirname = ''
    name = 'conll2000'

    @classmethod
    def splits(cls, fields, root=".data", train="train.txt",
               test="test.txt", validation_frac=0.1, **kwargs):
        train, test = super(CoNLL2000Chunking, cls).splits(
            fields=fields, root=root, train=train,
            test=test, separator=' ', **kwargs)
        # Hold out part of the training set as a deterministic validation split.
        sort_key = train.sort_key
        random.seed(0)
        train, val = train.split(1 - validation_frac, random_state=random.getstate())
        random.seed()
        train.sort_key = sort_key
        val.sort_key = sort_key
        return train, val, test
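A minimal usage sketch, assuming the legacy torchtext Field/Dataset API (torchtext 0.8, or the torchtext.legacy namespace in later releases): define one Field per column of the CoNLL 2000 files and let splits() download the data and build the train/validation/test sets.

from torchtext import data

# One Field per column: word, POS tag, chunk tag.
WORD = data.Field(lower=True)
POS = data.Field()
CHUNK = data.Field()

train, val, test = CoNLL2000Chunking.splits(
    fields=[('word', WORD), ('pos', POS), ('chunk', CHUNK)])

WORD.build_vocab(train)
POS.build_vocab(train)
CHUNK.build_vocab(train)
print(len(train), len(val), len(test))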

Loading the dataset using TensorFlow

import tensorflow as tf

def conll(url):
  # Download and cache the gzipped file, then read it line by line.
  path = tf.keras.utils.get_file('conll2000_train.txt.gz', origin=url)
  data = tf.data.TextLineDataset(path, compression_type='GZIP')
  def content_filter(source):
    # Drop header-style lines of the form " = Section = ".
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  # Split each line on the literal separator ' . ' and flatten the result.
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

train = conll('https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz')
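Each element of the resulting dataset is a raw text fragment produced by the split above; a quick sanity check (assuming eager execution, the default in TensorFlow 2) is to print a few elements:

for fragment in train.take(3):
  print(fragment.numpy())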

UDPOS

UDPOS is a parsed text corpus that annotates syntactic and semantic sentence structure. The dataset follows the original format of the Universal Dependencies English Treebank. Universal Dependencies is an open community effort with more than 300 contributors producing more than 150 treebanks in 90 languages. The Universal Dependencies treebank was developed by Marat M.

Loading the dataset using TorchText

class UDPOS(SequenceTaggingDataset):
    # urls, dirname and name are used by torchtext's download machinery.
    urls = ['https://bitbucket.org/sivareddyg/public/downloads/en-ud-v2.zip']
    dirname = 'en-ud-v2'
    name = 'udpos'

    @classmethod
    def splits(cls, fields, root=".data", train="en-ud-tag.v2.train.txt",
               validation="en-ud-tag.v2.dev.txt",
               test="en-ud-tag.v2.test.txt", **kwargs):
        return super(UDPOS, cls).splits(
            fields=fields, root=root, train=train, validation=validation,
            test=test, **kwargs)
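A similar sketch for UDPOS, again assuming the legacy torchtext API: the en-ud-tag.v2 files carry three columns per token (the word, its universal POS tag, and its Penn Treebank POS tag), so we define one Field for each.

from torchtext import data

TEXT = data.Field(lower=True)
UD_TAG = data.Field()
PTB_TAG = data.Field()

train, val, test = UDPOS.splits(
    fields=(('text', TEXT), ('udtag', UD_TAG), ('ptbtag', PTB_TAG)))

TEXT.build_vocab(train)
UD_TAG.build_vocab(train)
PTB_TAG.build_vocab(train)
print(vars(train.examples[0]).keys())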

Loading the dataset using TensorFlow

import tensorflow as tf

def udpos(path):
  # Read the extracted plain-text treebank file line by line.
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    # Drop header-style lines of the form " = Section = ".
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

# tf.data.TextLineDataset reads local text files, so the archive at
# https://bitbucket.org/sivareddyg/public/downloads/en-ud-v2.zip has to be
# downloaded and unzipped first; the path below assumes its contents were
# extracted into the working directory.
train = udpos('en-ud-v2/en-ud-tag.v2.train.txt')

State of the Art on UDPOS

The current state of the art on the Universal Dependencies dataset is UDPipe 2.0 + mBERT + FLAIR, which achieved an LAS score of 84.60.

Tiger Corpus

The Tiger Corpus is a large collection of German newspaper text. It was developed in 2002 by Brants and colleagues. The dataset carries several distinct kinds of annotation; its part-of-speech annotations are commonly used to set up a German POS tagging task. A common split uses the first 40,472 sentences in the original order for training, the next 5,000 for validation, and the remaining 5,000 for testing.
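A tiny sketch of that split, assuming sentences is a Python list holding the TIGER sentences in their original order (the variable name is only for illustration):

# Reproduce the 40,472 / 5,000 / 5,000 train/validation/test split.
train_sents = sentences[:40472]
val_sents = sentences[40472:45472]
test_sents = sentences[45472:]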

Loading the dataset using TorchText

class TigerCorpus(SequenceTaggingDataset):
    # Note: this URL points to the paper describing the corpus; the TIGER
    # corpus data itself has to be obtained from the IMS website separately.
    urls = ['https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/Brants-CLAUS98.ps.gz']
    dirname = ''
    name = 'tigercorpus'

    @classmethod
    def splits(cls, fields, root=".data", train="train.txt",
               test="test.txt", validation_frac=0.1, **kwargs):
        train, test = super(TigerCorpus, cls).splits(
            fields=fields, root=root, train=train,
            test=test, separator=' ', **kwargs)
        # Hold out part of the training set as a deterministic validation split.
        random.seed(0)
        train, val = train.split(1 - validation_frac, random_state=random.getstate())
        random.seed()
        return train, val, test

Loading the dataset using TensorFlow

import tensorflow as tf

def tigcor(path):
  # Read a local plain-text copy of the corpus line by line.
  data = tf.data.TextLineDataset(path)
  def content_filter(source):
    # Drop header-style lines of the form " = Section = ".
    return tf.logical_not(tf.strings.regex_full_match(
        source,
        '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))
  data = data.filter(content_filter)
  data = data.map(lambda x: tf.strings.split(x, ' . '))
  data = data.unbatch()
  return data

# The TIGER corpus data has to be obtained from the IMS website and exported to
# plain text before loading (the gzipped file linked in the class above is the
# accompanying paper, not the corpus itself); the path below is a placeholder.
train = tigcor('tiger_corpus.txt')

Conclusion

In this article, we discussed some of the datasets used for sequence tagging and showed how to load these corpora using different Python libraries. As sequence tagging is a crucial part of NLP, it is worth studying how models perform on each of these datasets.

