Explained: CUAD, The Dataset For Legal NLP

NLP is still largely unexplored when it comes to complicated language such as legal contracts. Recently, researchers at Berkeley and the Nueva School took a stab at legal NLP with their latest work.

The researchers have released CUAD, the Contract Understanding Atticus Dataset, a legal contract dataset with expert annotations from lawyers. With a corpus of more than 13,000 labels across 510 commercial legal contracts, CUAD breaks new ground in legal NLP. The dataset was manually labelled under the supervision of experienced attorneys. The work spans different file formats, including PDF, TXT, CSV and Excel, and covers a wide variety of legal clauses. The cost of the dataset is estimated at over $2 million.

Interestingly, the beta version of CUAD was released in October 2020 as AOK v.1 with 200 contracts.



The Research

The lack of large labelled datasets has been the kryptonite for AI in many domains.

“While large pretrained Transformers have recently surpassed humans on tasks such as SQuAD 2.0 and SuperGLUE, many real-world document analysis tasks still do not make use of machine learning whatsoever,” the researchers stated. Whether these large models can be repurposed for highly specialised domains remains the million-dollar question. 


The researchers believe the answer lies in large specialised datasets. The catch is that such datasets may require thousands of annotations, which makes them costly to build. For specialised domains, where annotators need expert training, datasets tend to be even more expensive.

CUAD addresses these challenges, serving as a benchmark for the broader NLP community. CUAD was built with the help of law student annotators who received 70-100 hours of contract review training before labelling data. The students attended training sessions to learn how to label each of the 41 categories. The training included video instructions, live workshops with experienced lawyers, detailed written instructions, and quizzes.

They also studied over 100 pages of rules and annotation standards created for CUAD. Three additional annotators then verified each annotation to ensure the quality and consistency of the labels.

The code accompanying CUAD uses the Hugging Face Transformers library and was tested with Python 3.8, PyTorch 1.7, and Transformers 4.3/4.4. The researchers also evaluated nine pretrained language models on CUAD v1.
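To make the setup more concrete, here is a minimal sketch of how a contract clause could be framed as an extractive question-answering example. It assumes a SQuAD 2.0-style record layout, with one question per label category and character offsets for each answer. The contract snippet, the question wording, and the `make_record` helper are hypothetical and are not taken from the dataset or its code.

```python
# Hypothetical contract snippet, written for illustration only.
CONTRACT = (
    "This Distribution Agreement is made on January 5, 2019 "
    "between Acme Corp. and Beta LLC."
)


def make_record(context, category, answer_texts):
    """Build one SQuAD 2.0-style question record for a label category.

    The field names follow the SQuAD 2.0 convention; the question
    template is an assumption made for this sketch.
    """
    answers = []
    for text in answer_texts:
        start = context.find(text)  # character offset of the clause
        if start == -1:
            raise ValueError(f"answer not found in context: {text!r}")
        answers.append({"text": text, "answer_start": start})
    return {
        "question": f'Highlight the parts (if any) related to "{category}".',
        "answers": answers,
        # A category with no matching clause becomes an unanswerable question.
        "is_impossible": not answers,
    }


records = [
    make_record(CONTRACT, "Agreement Date", ["January 5, 2019"]),
    make_record(CONTRACT, "Parties", ["Acme Corp.", "Beta LLC"]),
    make_record(CONTRACT, "Governing Law", []),  # clause absent in this snippet
]

for record in records:
    print(record["question"], "->", [a["text"] for a in record["answers"]])
```

Storing character offsets rather than only the text lets a model be trained to predict the start and end of each span, and a category that does not appear in a contract simply yields an empty answer list.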

Why Is It Important?

Studies suggest law firms spend approximately 50% of their time reviewing contracts. The work is also costly because it requires specialised training to understand and interpret contracts, and that expense falls not only on lawyers but on their customers too.

Moreover, some companies and individuals sign contracts without even reading them, which leaves room for predatory behaviour that harms consumers.

Automating contract review by training on extensive, high-quality data can be a game-changer for the legal community. CUAD could not only help reduce costs but also provide the training data NLP models need to help overcome these challenges.

Key Highlights 

The researchers experimented with several state-of-the-art Transformer models on CUAD. The results showed overall improvements in precision (up to 80%), recall, and general performance.
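For readers unfamiliar with these metrics, the sketch below shows one way precision and recall could be computed for clause extraction. It assumes that a predicted span counts as correct when its word-level Jaccard overlap with an annotated span is at least 0.5. That threshold, the helper names, and the sample spans are assumptions made for illustration and may differ from the researchers' exact evaluation protocol.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two text spans."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def precision_recall(predicted, gold, threshold=0.5):
    """Precision: share of predictions that match some gold span.
    Recall: share of gold spans matched by some prediction."""
    matched_pred = sum(
        any(jaccard(p, g) >= threshold for g in gold) for p in predicted
    )
    matched_gold = sum(
        any(jaccard(p, g) >= threshold for p in predicted) for g in gold
    )
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotated spans and model predictions.
gold = ["January 5, 2019", "Acme Corp.", "Beta LLC"]
predicted = [
    "January 5, 2019",            # exact match
    "Acme Corp. (the Supplier)",  # partial match, Jaccard = 0.5
    "the State of Delaware",      # no matching annotation
    "Distribution Agreement",     # no matching annotation
]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the four predictions match an annotation and two of the three annotations are found, which illustrates the trade-off: a model that highlights more text tends to raise recall at the expense of precision.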

The researchers expect CUAD to help lawyers in the following use cases:

  • For disclosure schedules, AI tools coupled with simple code can identify the document name, agreement date and parties. This can save hours of attorney time and enable speedy delivery of high-quality work.
  • It can help determine which contracts belong to a divested business and automate the review of those contracts to accurately identify the parties, the names of signing entities and the divested projects.
  • It can help deal with clauses that rarely appear in legal contracts. CUAD v1 contains a large number of these rare clauses, which can be used to supplement proprietary training datasets.
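As an illustration of the first use case, the "simple code" that turns extracted labels into a disclosure schedule might look like the sketch below. The file names, the extracted values, and the `build_schedule` helper are all hypothetical; the sketch assumes a model has already produced one (contract, category, text) triple per extracted span.

```python
from collections import defaultdict

# Hypothetical model output: (contract file, label category, extracted text).
extractions = [
    ("supply_agreement.pdf", "Document Name", "Supply Agreement"),
    ("supply_agreement.pdf", "Agreement Date", "January 5, 2019"),
    ("supply_agreement.pdf", "Parties", "Acme Corp."),
    ("supply_agreement.pdf", "Parties", "Beta LLC"),
    ("license.pdf", "Document Name", "Software License Agreement"),
    ("license.pdf", "Parties", "Gamma Inc."),
]

# The three fields named in the use case above.
FIELDS = ("Document Name", "Agreement Date", "Parties")


def build_schedule(extractions):
    """Group extracted spans into one schedule row per contract."""
    rows = defaultdict(lambda: {field: [] for field in FIELDS})
    for contract, category, text in extractions:
        if category in FIELDS:
            rows[contract][category].append(text)
    # Join multiple spans (e.g. several parties) into a single cell.
    return {
        contract: {field: "; ".join(values) for field, values in fields.items()}
        for contract, fields in sorted(rows.items())
    }


schedule = build_schedule(extractions)
for contract, row in schedule.items():
    print(contract, row)
```

A field the model did not find, such as the agreement date for `license.pdf` here, is left as an empty cell so that an attorney can see at a glance what still needs manual review.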

The researchers said they hope the next release will double the size of CUAD v1 and focus on data for clauses with lower performance scores.

CUAD v1 can be downloaded here


Srishti Deoras
Srishti currently works as Associate Editor at Analytics India Magazine. When not covering the analytics news, editing and writing articles, she could be found reading or capturing thoughts into pictures.
