Most Popular Datasets For Neural Sequence Tagging with the Implementation in TensorFlow and PyTorch

In artificial intelligence, sequence tagging is a type of pattern recognition task that involves the algorithmic assignment of a categorical tag to each member of a sequence of observed values. It covers several sequence labelling tasks: part-of-speech (POS) tagging, named entity recognition (NER), and chunking.

POS tagging assigns a part-of-speech label to each word in a sentence; named entity recognition identifies named entities, such as person or organisation names; chunking aims to identify syntactic constituents within a sentence, such as noun or verb phrases.

For example, in the sentence "John works at Google in London", a POS tagger labels each word with its grammatical category, a NER system marks "John" as a person, "Google" as an organisation and "London" as a location, and a chunker groups the words into noun, verb and prepositional phrases, as illustrated below.
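The snippet below shows these three taggings side by side. The POS tags follow the Penn Treebank tag set and the entity and chunk tags use the BIO scheme; the labels are written by hand for illustration and are not drawn from any of the datasets discussed below.

sentence   = ["John", "works", "at", "Google", "in", "London", "."]
pos_tags   = ["NNP", "VBZ", "IN", "NNP", "IN", "NNP", "."]           # part of speech
ner_tags   = ["B-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]         # named entities (BIO)
chunk_tags = ["B-NP", "B-VP", "B-PP", "B-NP", "B-PP", "B-NP", "O"]   # syntactic chunks (BIO)

for token, pos, ner, chunk in zip(sentence, pos_tags, ner_tags, chunk_tags):
    print(f"{token}\t{pos}\t{ner}\t{chunk}")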

Here, we will cover the details of the datasets used in sequence tagging. Further, we will load these datasets using the TensorFlow and PyTorch (TorchText) libraries.




CoNLL 2000

CoNLL 2000 was introduced in 2000 by the researchers Tjong Kim Sang and Buchholz as a shared task on text chunking. The data consists of the same sections of the Wall Street Journal corpus (WSJ) as the widely used data for noun phrase chunking: 211,727 tokens are used for training and 47,377 tokens for testing. After the CoNLL 2000 dataset, researchers introduced three more CoNLL datasets.
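Each line of the CoNLL 2000 files holds a token, its POS tag and its chunk tag separated by single spaces, with a blank line between sentences. Below is a minimal plain-Python sketch of parsing that format, assuming train.txt has already been downloaded and gunzipped locally.

def read_conll2000(path):
    # Each line: "token POS-tag chunk-tag"; a blank line ends a sentence
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                current = []
            else:
                token, pos, chunk = line.split(" ")
                current.append((token, pos, chunk))
    if current:
        sentences.append(current)
    return sentences

# sentences = read_conll2000("train.txt")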

State of the Art on CoNLL 2000

The current state of the art on the CoNLL 2000 dataset is SWEM-CRF. The model gave an F1 score of 90.34.

CoNLL 2002

The CoNLL 2002 dataset was used for the shared task on language-independent NER. The data consists of newswire text covering two languages: Spanish and Dutch. It was developed by the researcher Tjong Kim Sang.

State of the Art on CoNLL 2002

The current state of the art on the CoNLL 2002 dataset is ACE + document-context. The model gave an F1 score of 95.5.

CoNLL 2003

CoNLL 2003 was developed by Tjong Kim Sang and De Meulder. It is similar to CoNLL 2002, but the dataset covers the English and German languages.

State of the Art on CoNLL 2003

The current state of the art on the CoNLL 2003 dataset is LUKE. The model gave an F1 score of 94.3.

CoNLL 2012

The CoNLL 2012 dataset was created for a shared task on multilingual unrestricted coreference resolution. It is larger than the earlier CoNLL NER datasets.

State of the Art on CoNLL 2012

The current state of the art on the CoNLL 2012 dataset is CorefQA + SpanBERT-large. The model gave an F1 score of 83.1.

Loading the dataset using TorchText

from torchtext import data
import random


class SequenceTaggingDataset(data.Dataset):
    """Defines a dataset for sequence tagging: every example holds one column
    per field (tokens, tags, ...) for a single sentence."""

    @staticmethod
    def sort_key(example):
        # Sort examples by the length of their first non-callable attribute
        for attr in dir(example):
            if not callable(getattr(example, attr)) and \
                    not attr.startswith("__"):
                return len(getattr(example, attr))
        return 0

    def __init__(self, path, fields, encoding="utf-8", separator="\t", **kwargs):
        examples = []
        columns = []
        with open(path, encoding=encoding) as input_file:
            for line in input_file:
                line = line.strip()
                if line == "":
                    # A blank line marks the end of a sentence
                    if columns:
                        examples.append(data.Example.fromlist(columns, fields))
                    columns = []
                else:
                    # Accumulate the columns of the current sentence
                    for i, column in enumerate(line.split(separator)):
                        if len(columns) < i + 1:
                            columns.append([])
                        columns[i].append(column)
            if columns:
                examples.append(data.Example.fromlist(columns, fields))
        super(SequenceTaggingDataset, self).__init__(examples, fields, **kwargs)


class CoNLL2000Chunking(SequenceTaggingDataset):
    # CoNLL 2000 chunking shared-task files: space-separated columns
    urls = ['https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz',
            'https://www.clips.uantwerpen.be/conll2000/chunking/test.txt.gz']
    dirname = ''
    name = 'conll2000'

    @classmethod
    def splits(cls, fields, root=".data", train="train.txt",
               test="test.txt", validation_frac=0.1, **kwargs):
        train, test = super(CoNLL2000Chunking, cls).splits(
            fields=fields, root=root, train=train,
            test=test, separator=' ', **kwargs)
        # CoNLL 2000 ships no validation file, so carve one out of the training
        # set deterministically (fixed seed) and keep the sort key, which
        # Dataset.split does not carry over
        sort_key = train.sort_key
        random.seed(0)
        train, val = train.split(1 - validation_frac,
                                 random_state=random.getstate())
        random.seed()
        train.sort_key = sort_key
        val.sort_key = sort_key
        return train, val, test
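A brief usage sketch, continuing from the definitions above (the field names word, pos and chunk are illustrative choices, not fixed by the dataset): each CoNLL 2000 line carries three space-separated columns, so three fields are passed to splits.

# Illustrative field names for the three columns of the CoNLL 2000 files
WORD = data.Field(lower=True)
POS = data.Field()
CHUNK = data.Field()

train_set, val_set, test_set = CoNLL2000Chunking.splits(
    fields=[('word', WORD), ('pos', POS), ('chunk', CHUNK)])

# Build vocabularies from the training split only
WORD.build_vocab(train_set)
POS.build_vocab(train_set)
CHUNK.build_vocab(train_set)

print(len(train_set), len(val_set), len(test_set))
print(vars(train_set[0]))  # {'word': [...], 'pos': [...], 'chunk': [...]}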

Loading the dataset using Tensorflow

import tensorflow as tf

def conll2000(path):
  # Each line of the CoNLL 2000 files is "token POS-tag chunk-tag";
  # blank lines separate sentences
  data = tf.data.TextLineDataset(path, compression_type='GZIP')
  data = data.filter(lambda line: tf.strings.length(line) > 0)
  data = data.map(lambda line: tf.strings.split(line, ' '))
  return data

# TextLineDataset reads local (or GCS/HDFS) files, so download the archive first
train_path = tf.keras.utils.get_file(
    'conll2000_train.txt.gz',
    'https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz')
train = conll2000(train_path)
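A quick sanity check of the pipeline above iterates over a few parsed lines; each element is a length-3 string tensor holding the token, its POS tag and its chunk tag.

# Peek at the first few parsed lines
for columns in train.take(3):
  print(columns.numpy())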

UDPOS

UDPOS is a parsed text corpus that annotates syntactic and semantic sentence structure. The dataset follows the original format of the Universal Dependencies English Treebank. Universal Dependencies is an open community effort with more than 300 contributors producing more than 150 treebanks in 90 languages. The Universal Dependencies treebank was developed by Marat M.
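The files read by the loaders below are tab-separated: each line holds a token, its universal POS tag and its Penn Treebank tag, with a blank line between sentences. The sample below is hand-written for illustration only.

# Hand-written sample in the three-column, tab-separated layout
sample_lines = [
    "The\tDET\tDT",
    "dog\tNOUN\tNN",
    "barked\tVERB\tVBD",
    ".\tPUNCT\t.",
    "",  # blank line ends the sentence
]
print("\n".join(sample_lines))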

Loading the dataset using TorchText

class UDPOS(SequenceTaggingDataset):
    # Universal Dependencies English Treebank: tab-separated, three columns per line
    urls = ['https://bitbucket.org/sivareddyg/public/downloads/en-ud-v2.zip']
    dirname = 'en-ud-v2'
    name = 'udpos'

    @classmethod
    def splits(cls, fields, root=".data", train="en-ud-tag.v2.train.txt",
               validation="en-ud-tag.v2.dev.txt",
               test="en-ud-tag.v2.test.txt", **kwargs):
        return super(UDPOS, cls).splits(
            fields=fields, root=root, train=train, validation=validation,
            test=test, **kwargs)
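A brief usage sketch, again with illustrative field names; the three fields line up with the three tab-separated columns of the treebank files.

TEXT = data.Field(lower=True)
UD_TAG = data.Field()
PTB_TAG = data.Field()

train_set, val_set, test_set = UDPOS.splits(
    fields=[('text', TEXT), ('udtag', UD_TAG), ('ptbtag', PTB_TAG)])

TEXT.build_vocab(train_set)
UD_TAG.build_vocab(train_set)
PTB_TAG.build_vocab(train_set)
print(vars(train_set[0]))  # {'text': [...], 'udtag': [...], 'ptbtag': [...]}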

Loading the dataset using Tensorflow

import tensorflow as tf

def udpos(path):
  # Each line is "token<TAB>universal-tag<TAB>PTB-tag"; blank lines separate sentences
  data = tf.data.TextLineDataset(path)
  data = data.filter(lambda line: tf.strings.length(line) > 0)
  data = data.map(lambda line: tf.strings.split(line, '\t'))
  return data

# TextLineDataset reads plain-text files, so en-ud-v2.zip has to be downloaded
# and extracted locally before loading
train = udpos('en-ud-v2/en-ud-tag.v2.train.txt')

State of the Art on UDPOS

The current state of the art on the Universal Dependencies dataset is UDPipe 2.0 + mBERT + FLAIR. The model gave a LAS score of 84.60.

Tiger Corpus

The TIGER Corpus is a large collection of German newspaper texts. It was introduced in 2002 by Brants and colleagues. The dataset carries several distinct kinds of annotation; the part-of-speech annotation is used to set up a German POS tagging task. The first 40,472 sentences in the original order are used for training, the next 5,000 for validation, and the remaining 5,000 for testing, as sketched below.
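A minimal sketch of that split, assuming the TIGER sentences have already been parsed into a Python list in their original corpus order (the function below is illustrative, not part of any library):

def split_tiger(sentences):
    # 'sentences' is assumed to hold the corpus in its original order
    train_sents = sentences[:40472]       # first 40,472 sentences
    val_sents = sentences[40472:45472]    # next 5,000 sentences
    test_sents = sentences[45472:]        # remaining 5,000 sentences
    return train_sents, val_sents, test_sents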

Loading the dataset using Torchtext

class TigerCorpus(SequenceTaggingDataset):
    urls = ['https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/Brants-CLAUS98.ps.gz']
    dirname = ''
    name = 'tigercorpus'

    @classmethod
    def splits(cls, fields, root=".data", train="train.txt",
               test="test.txt", validation_frac=0.1, **kwargs):
        train, test = super(TigerCorpus, cls).splits(
            fields=fields, root=root, train=train,
            test=test, separator=' ', **kwargs)
        # Carve a validation split out of the training set, as with CoNLL 2000
        sort_key = train.sort_key
        random.seed(0)
        train, val = train.split(1 - validation_frac,
                                 random_state=random.getstate())
        random.seed()
        train.sort_key = sort_key
        val.sort_key = sort_key
        return train, val, test

Loading the dataset using Tensorflow

import tensorflow as tf

def tigcor(path):
  # One token and its annotation columns per line; blank lines separate sentences
  data = tf.data.TextLineDataset(path)
  data = data.filter(lambda line: tf.strings.length(line) > 0)
  data = data.map(lambda line: tf.strings.split(line, '\t'))
  return data

# The TIGER release has to be obtained, extracted and exported to a local
# column-format text file first; the path below is a placeholder
train = tigcor('tiger_train.txt')

Conclusion

In this article, we discussed some of the datasets used for sequence tagging, and we showed how to load these corpora using different Python libraries. As sequence tagging is a crucial part of NLP, these datasets provide a common ground for studying and comparing the performance of models.
