Explained: CUAD, The Dataset For Legal NLP


NLP is still largely unexplored when it comes to complicated language such as legal contracts. Researchers at UC Berkeley and the Nueva School have recently taken a stab at legal NLP with their latest work.

The researchers have released CUAD, the Contract Understanding Atticus Dataset, a legal contract dataset with expert annotations from lawyers. With a corpus of more than 13,000 labels across 510 commercial legal contracts, CUAD is exploring new pastures in legal NLP. The dataset was manually labelled under the supervision of experienced attorneys. They worked on contracts in different file formats such as PDF, TXT, CSV and Excel, with varying legal clauses. The extensive dataset is estimated to cost over $2 million.


Interestingly, the beta version of CUAD was released in October 2020 as AOK v.1 with 200 contracts.

The Research

The lack of large labelled datasets has been the kryptonite for AI in many domains.

“While large pretrained Transformers have recently surpassed humans on tasks such as SQuAD 2.0 and SuperGLUE, many real-world document analysis tasks still do not make use of machine learning whatsoever,” the researchers stated. Whether these large models can be repurposed for highly specialised domains remains the million-dollar question. 

Researchers believe the answer lies in large specialised datasets. The rub is that large datasets may require thousands of annotations and are expensive to build. For specialised domains, where annotators need expert training, datasets tend to be even more expensive.

CUAD addresses these challenges, serving as a benchmark for the broader NLP community. CUAD was built with the help of expert law student annotators who received 70-100 hours of contract review training before labelling data. The training, which taught the students how to label each of the 41 categories, included video instructions, live workshops with experienced lawyers, detailed instructions, and quizzes.

The students also studied over 100 pages of rules and annotation standards created for CUAD. Three additional annotators then verified each annotation to ensure the quality and consistency of the labels.

The code accompanying CUAD uses the Hugging Face Transformers library and was tested with Python 3.8, PyTorch 1.7, and Transformers 4.3/4.4. The researchers also tested CUAD v1 against nine sophisticated pretrained language models.

Why Is It Important?

Studies suggest law firms spend approximately 50% of their time reviewing contracts. Because understanding and interpreting contracts requires specialised training, contract review is an expensive affair not only for lawyers but for their customers too.

Moreover, some companies and individuals sign contracts without even reading them, which leaves room for predatory behaviour that harms consumers.

Automating contract review by training models on extensive, high-quality data can be a game-changer for the legal community. CUAD can help not only to reduce costs but also to train NLP models that address these challenges.

Key Highlights 

Researchers experimented with several state-of-the-art Transformer models on CUAD. The results showed overall improvements in precision (up to 80%), recall, and overall performance.
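To make the two metrics concrete, here is a minimal sketch of how precision and recall could be computed for extracted clause spans. It is an illustration, not the researchers' evaluation script: the character-offset span representation, the function names, the sample spans, and the 0.5 overlap threshold are all assumptions made for this example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def jaccard(a: Span, b: Span) -> float:
    """Length of the overlap of two spans divided by the length of their union."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predicted: List[Span], gold: List[Span],
                     threshold: float = 0.5) -> Tuple[float, float]:
    """Treat a predicted span as correct if it overlaps some gold span enough.

    The threshold is an assumed value chosen for illustration.
    """
    # Predictions that match at least one expert-labelled span.
    correct = sum(any(jaccard(p, g) >= threshold for g in gold) for p in predicted)
    # Expert-labelled spans that at least one prediction recovered.
    found = sum(any(jaccard(p, g) >= threshold for p in predicted) for g in gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = found / len(gold) if gold else 0.0
    return precision, recall


# Made-up spans for one contract: three labelled clauses, four predictions.
gold = [(0, 10), (50, 80), (200, 220)]
predicted = [(0, 9), (55, 80), (120, 130), (300, 310)]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, two of the four predictions match a labelled clause and two of the three labelled clauses are recovered. High precision means a lawyer wastes less time on false alarms, while high recall means fewer important clauses are missed.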

Researchers are expecting CUAD to help lawyers in the following use cases: 

  • When preparing a disclosure schedule, AI tools coupled with simple code can identify the document name, agreement date and parties. This can save hours of attorney time and enable speedy delivery of high-quality work. 
  • It can help determine which contracts relate to a divested business and automate the identification of parties, names of signing entities and divested projects. 
  • It can help deal with clauses that appear only rarely in legal contracts. CUAD v1 contains a large number of these rare clauses, which can be used to supplement proprietary training datasets.
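As a rough illustration of the "simple code" mentioned in the first use case, the sketch below turns already-extracted fields into a CSV disclosure schedule. The sample contracts, the `build_schedule` helper, and the "NEEDS REVIEW" marker are hypothetical; the sketch assumes a model has already produced the document name, agreement date and parties for each contract.

```python
import csv
import io
from typing import Dict, List

# Hypothetical model output: one dict of extracted fields per contract.
extracted: List[Dict[str, str]] = [
    {"Document Name": "Supply Agreement",
     "Agreement Date": "1 March 2019",
     "Parties": "Acme Corp; Widget LLC"},
    # The agreement date was not found in this contract.
    {"Document Name": "License Agreement",
     "Parties": "Acme Corp; Beta Inc"},
]

COLUMNS = ["Document Name", "Agreement Date", "Parties"]


def build_schedule(rows: List[Dict[str, str]]) -> str:
    """Render extracted fields as CSV, flagging missing fields for a lawyer."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=COLUMNS,
        restval="NEEDS REVIEW",   # written whenever a field is missing
        extrasaction="ignore",    # drop any fields outside the schedule
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


schedule = build_schedule(extracted)
print(schedule)
```

Flagging missing fields rather than leaving them blank keeps a human reviewer in the loop for exactly the contracts where the extraction came up short.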

The researchers said they hope the next release will double the size of CUAD v1 and focus on data for clauses with lower performance scores.

CUAD v1 can be downloaded here

Srishti Deoras
Srishti currently works as Associate Editor at Analytics India Magazine. When not covering the analytics news, editing and writing articles, she could be found reading or capturing thoughts into pictures.
