In Artificial Intelligence, Sequence Tagging is a pattern recognition task in which a categorical tag is assigned to each member of a sequence of observed values. It covers several sequence labeling tasks: part-of-speech (POS) tagging, Named Entity Recognition (NER), and chunking.
POS tagging assigns a grammatical tag to each word in a sentence; Named Entity Recognition identifies named entities, such as person or organization names; chunking identifies syntactic constituents within a sentence, such as noun or verb phrases.
For example, in the sentence "John lives in New York", a POS tagger labels each word with its part of speech, a NER tagger marks "John" as a person and "New York" as a location, and a chunker groups "New York" into a noun phrase.
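The sketch below (plain Python, with illustrative tags only) shows how such a sentence is typically represented as token/tag pairs for each of the three tasks.

# One sentence, three sequence-tagging views of it (illustrative tags only).
tokens = ["John", "lives", "in", "New", "York"]

# POS tagging: one grammatical tag per token (Penn Treebank-style tags assumed).
pos_tags = ["NNP", "VBZ", "IN", "NNP", "NNP"]

# NER with BIO tags: B- starts an entity, I- continues it, O is outside any entity.
ner_tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]

# Chunking with BIO tags over phrases (NP = noun phrase, VP = verb phrase, PP = prepositional phrase).
chunk_tags = ["B-NP", "B-VP", "B-PP", "B-NP", "I-NP"]

# A sequence tagger learns the mapping from tokens to one of these tag sequences.
for token, pos, ner, chunk in zip(tokens, pos_tags, ner_tags, chunk_tags):
    print(f"{token:6} {pos:4} {ner:6} {chunk}")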
In this article, we will cover the details of the datasets used for sequence tagging. Further, we will show how to load these datasets using the TensorFlow and PyTorch (TorchText) libraries.

CoNLL 2000
CoNLL 2000 was introduced by Tjong Kim Sang and Buchholz (2000) as a shared task on text chunking. The data comprises the same sections of the Wall Street Journal corpus (WSJ) as the widely used data for noun phrase chunking: 211,727 tokens for training and 47,377 tokens for testing. After the CoNLL 2000 dataset, three more CoNLL datasets were introduced.
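The raw CoNLL 2000 files contain one token per line together with its POS tag and chunk tag, with blank lines separating sentences. The following is a minimal, self-contained sketch of reading that format; the sample lines are illustrative.

# Each non-empty line: "token POS chunk-tag"; blank lines separate sentences.
sample = """Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP
. . O
"""

def read_conll2000(lines):
    """Group (token, pos, chunk) triples into sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:                      # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
        else:
            token, pos, chunk = line.split(" ")
            current.append((token, pos, chunk))
    if current:
        sentences.append(current)
    return sentences

print(read_conll2000(sample.splitlines()))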
State of the Art on CoNLL 2000
The current state of the art on the CoNLL 2000 dataset is SWEM-CRF. The model gave an F1 score of 90.34.
CoNLL 2002
The CoNLL 2002 dataset was used for the shared task on language-independent NER. The data consists of newswire text covering two languages: Spanish and Dutch. It was developed by Tjong Kim Sang.
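The NER annotation follows the IOB convention, where a B-XXX tag opens an entity of type XXX, I-XXX continues it, and O marks tokens outside any entity. As a minimal, library-free sketch (the sentence and tags below are illustrative), entity spans can be recovered from such a tag sequence like this:

def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from a BIO-tagged token sequence."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity starts here
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_tokens:
            current_tokens.append(token)         # continue the open entity
        else:                                    # "O" closes any open entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Wolff", ",", "currently", "a", "journalist", "in", "Argentina"]
tags   = ["B-PER", "O", "O", "O", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))   # [('PER', 'Wolff'), ('LOC', 'Argentina')]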
State of the Art on CoNLL 2002
The current state of the art on the CoNLL 2002 dataset is ACE + document-context. The model gave an F1 score of 95.5.
CoNLL 2003
CoNLL 2003 was developed by Tjong Kim Sang and De Meulder. It is similar to CoNLL 2002, but the dataset covers English and German.
State of the Art on CoNLL 2003
The current state of the art on the CoNLL 2003 dataset is LUKE. The model gave an F1 score of 94.3.
CoNLL 2012
The CoNLL 2012 dataset was created for a shared task on multilingual unrestricted coreference resolution. It is larger than the earlier CoNLL NER-based datasets.
State of the Art on CoNLL 2012
The current state of the art on the CoNLL 2012 dataset is CorefQA + SpanBERT-large. The model gave an F1 score of 83.1.
Loading the dataset using TorchText
from torchtext import data
import random


class SequenceTaggingDataset(data.Dataset):
    """Dataset read from column-formatted files: one token and its tags per
    line, with blank lines separating sentences."""

    @staticmethod
    def sort_key(example):
        # Sort examples by the length of their first (non-dunder) attribute.
        for attr in dir(example):
            if not callable(getattr(example, attr)) and \
                    not attr.startswith("__"):
                return len(getattr(example, attr))
        return 0

    def __init__(self, path, fields, encoding="utf-8", separator="\t", **kwargs):
        examples1 = []
        newcolumns = []
        with open(path, encoding=encoding) as input_file:
            for line in input_file:
                line = line.strip()
                if line == "":
                    # A blank line ends the current sentence.
                    if newcolumns:
                        examples1.append(data.Example.fromlist(newcolumns, fields))
                    newcolumns = []
                else:
                    # Accumulate each column (token, tag, ...) of the line.
                    for i, column in enumerate(line.split(separator)):
                        if len(newcolumns) < i + 1:
                            newcolumns.append([])
                        newcolumns[i].append(column)
            if newcolumns:
                examples1.append(data.Example.fromlist(newcolumns, fields))
        super(SequenceTaggingDataset, self).__init__(examples1, fields, **kwargs)
class CoNLL2000Chunking(SequenceTaggingDataset):
    # Gzipped train/test files hosted by the CoNLL 2000 shared-task organisers.
    urls = ['https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz',
            'https://www.clips.uantwerpen.be/conll2000/chunking/test.txt.gz']
    dirname = ''
    name = 'conll2000'

    @classmethod
    def splits(cls, fields, root=".data", train="train.txt",
               test="test.txt", validation_frac=0.1, **kwargs):
        train, test = super(CoNLL2000Chunking, cls).splits(
            fields=fields, root=root, train=train,
            test=test, separator=' ', **kwargs)
        # CoNLL 2000 has no official validation split, so carve one out of the
        # training data; keep the sort_key, which split() does not preserve.
        sort_key = train.sort_key
        random.seed(0)
        train, val = train.split(1 - validation_frac,
                                 random_state=random.getstate())
        train.sort_key = sort_key
        val.sort_key = sort_key
        return train, val, test
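With the dataset class in place, a typical way to use it is with torchtext's legacy Field and BucketIterator API. The following is a minimal usage sketch; the field names, min_freq value, and batch size are illustrative choices, not part of the dataset itself.

import torch
from torchtext import data

# One Field per column in the CoNLL 2000 files: token, POS tag, chunk tag.
TEXT = data.Field(lower=True)
POS = data.Field(unk_token=None)
CHUNK = data.Field(unk_token=None)

fields = [("text", TEXT), ("pos", POS), ("chunk", CHUNK)]

# Download (if needed), parse and split the corpus.
train_data, valid_data, test_data = CoNLL2000Chunking.splits(fields=fields)

# Build vocabularies from the training split only.
TEXT.build_vocab(train_data, min_freq=2)
POS.build_vocab(train_data)
CHUNK.build_vocab(train_data)

# Batch sentences of similar length together.
train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=64,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"))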
Loading the dataset using Tensorflow
import tensorflow as tf

def conll(path):
    # TextLineDataset reads local files, so download the gzipped file first
    # and then read it line by line.
    local_path = tf.keras.utils.get_file("conll2000_train.txt.gz", origin=path)
    dataset = tf.data.TextLineDataset(local_path, compression_type="GZIP")

    def content_filter(source):
        # Drop heading-style lines such as " = Section = ".
        return tf.logical_not(tf.strings.regex_full_match(
            source, '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))

    dataset = dataset.filter(content_filter)
    dataset = dataset.map(lambda x: tf.strings.split(x, ' . '))
    dataset = dataset.unbatch()
    return dataset

train = conll('https://www.clips.uantwerpen.be/conll2000/chunking/train.txt.gz')
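To sanity-check the resulting pipeline, a few elements can be pulled from it and printed (a minimal sketch, assuming the train dataset defined above):

# Inspect the first few elements produced by the tf.data pipeline.
for element in train.take(3):
    print(element.numpy())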
UDPOS
UDPOS is a parsed text corpus annotated with syntactic and semantic sentence structure. The dataset follows the original format of the Universal Dependencies English Treebank. Universal Dependencies is an open community effort with more than 300 contributors producing more than 150 treebanks in over 90 languages. The Universal Dependencies treebank was developed by Marat M.
Loading the dataset using TorchText
class UDPOS(SequenceTaggingDataset):
    # Universal Dependencies English treebank, pre-tokenised for POS tagging.
    urls = ['https://bitbucket.org/sivareddyg/public/downloads/en-ud-v2.zip']
    dirname = 'en-ud-v2'
    name = 'udpos'

    @classmethod
    def splits(cls, fields, root=".data", train="en-ud-tag.v2.train.txt",
               validation="en-ud-tag.v2.dev.txt",
               test="en-ud-tag.v2.test.txt", **kwargs):
        return super(UDPOS, cls).splits(
            fields=fields, root=root, train=train,
            validation=validation, test=test, **kwargs)
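As a usage sketch (assuming the three-column layout of the en-ud-tag.v2 files: word form, Universal Dependencies tag, and Penn Treebank tag), the splits can be loaded with the legacy torchtext Field API:

from torchtext import data

# One Field per column: word form, UD POS tag, PTB POS tag.
TEXT = data.Field(lower=True)
UD_TAGS = data.Field(unk_token=None)
PTB_TAGS = data.Field(unk_token=None)

fields = [("text", TEXT), ("udtags", UD_TAGS), ("ptbtags", PTB_TAGS)]

train_data, valid_data, test_data = UDPOS.splits(fields=fields)

TEXT.build_vocab(train_data, min_freq=2)
UD_TAGS.build_vocab(train_data)
PTB_TAGS.build_vocab(train_data)

print(f"{len(train_data)} training sentences, "
      f"{len(valid_data)} validation sentences, "
      f"{len(test_data)} test sentences")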
Loading the dataset using Tensorflow
import os
import tensorflow as tf

def udpos(path):
    # The UDPOS download is a zip archive, so fetch and extract it first; the
    # archive is assumed to unpack into an en-ud-v2/ directory (as in the
    # TorchText loader above).
    archive = tf.keras.utils.get_file("en-ud-v2.zip", origin=path, extract=True)
    data_file = os.path.join(os.path.dirname(archive),
                             "en-ud-v2", "en-ud-tag.v2.train.txt")
    dataset = tf.data.TextLineDataset(data_file)

    def content_filter(source):
        # Drop heading-style lines such as " = Section = ".
        return tf.logical_not(tf.strings.regex_full_match(
            source, '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))

    dataset = dataset.filter(content_filter)
    dataset = dataset.map(lambda x: tf.strings.split(x, ' . '))
    dataset = dataset.unbatch()
    return dataset

train = udpos('https://bitbucket.org/sivareddyg/public/downloads/en-ud-v2.zip')
State of the Art on UDPOS
The current state of the art on the Universal Dependencies dataset is UDPipe 2.0 + mBERT + FLAIR. The model gave an LAS score of 84.60.
Tiger Corpus
Tiger Corpus is a large collection of German newspaper text. It was developed in 2002 by Brants et al. The dataset has several distinct kinds of annotation; the part-of-speech annotation is used to set up a German POS tagging task. The first 40,472 sentences in the original order are used for training, the next 5,000 for validation, and the remaining 5,000 for testing.
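The split described above is purely positional; as a minimal sketch, assuming the corpus has already been read into a list of sentences in their original order, it can be reproduced as follows:

def tiger_splits(sentences):
    """Split TIGER sentences by position: the first 40,472 for training, the
    next 5,000 for validation, and the remaining 5,000 for testing."""
    train = sentences[:40472]
    valid = sentences[40472:40472 + 5000]
    test = sentences[40472 + 5000:40472 + 10000]
    return train, valid, test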
Loading the dataset using Torchtext
class TigerCorpus(SequenceTaggingDataset):
    # NOTE: this URL points to the TIGER paper (a PostScript file); the corpus
    # itself is distributed under a licence and must be obtained separately.
    urls = ['https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/Brants-CLAUS98.ps.gz']
    dirname = ''
    name = 'tigercorpus'

    @classmethod
    def splits(cls, fields, root=".data", train="train.txt",
               test="test.txt", validation_frac=0.1, **kwargs):
        train, test = super(TigerCorpus, cls).splits(
            fields=fields, root=root, train=train,
            test=test, separator=' ', **kwargs)
        # Carve a validation set out of the training data, preserving sort_key.
        sort_key = train.sort_key
        random.seed(0)
        train, val = train.split(1 - validation_frac,
                                 random_state=random.getstate())
        train.sort_key = sort_key
        val.sort_key = sort_key
        return train, val, test
Loading the dataset using Tensorflow
import tensorflow as tf

def tigcor(path):
    # `path` should point to a plain-text, CoNLL-style export of the TIGER corpus.
    dataset = tf.data.TextLineDataset(path)

    def content_filter(source):
        # Drop heading-style lines such as " = Section = ".
        return tf.logical_not(tf.strings.regex_full_match(
            source, '([[:space:]][=])+.+([[:space:]][=])+[[:space:]]*'))

    dataset = dataset.filter(content_filter)
    dataset = dataset.map(lambda x: tf.strings.split(x, ' . '))
    dataset = dataset.unbatch()
    return dataset

# NOTE: the URL below points to the TIGER paper, not the corpus; replace it
# with the path to a locally obtained TIGER text file before running.
train = tigcor('https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/Brants-CLAUS98.ps.gz')
Conclusion
In this article, we have discussed some of the datasets used for sequence tagging. Further, we showed how to load these corpora using different Python libraries. Since sequence tagging is a crucial part of NLP, these datasets can be used to study and compare the performance of models.