NLP remains largely unexplored when it comes to complicated language such as legal contracts. Researchers at Berkeley and the Nueva School have recently taken a stab at legal NLP with their latest work.
The researchers have released CUAD, or the Contract Understanding Atticus Dataset, a legal contract dataset with expert annotations from lawyers. With a corpus of more than 13,000 labels in 510 commercial legal contracts, CUAD is exploring new pastures in legal NLP. The dataset was manually labelled under the supervision of experienced attorneys. The annotators worked on contracts in different file formats, such as PDF, txt, CSV and Excel files, with varying legal clauses. The dataset is estimated to have cost over $2 million to build.
Interestingly, the beta version of CUAD was released in October 2020 as AOK v.1 with 200 contracts.
The lack of large labelled datasets has been the kryptonite for AI in many domains.
“While large pretrained Transformers have recently surpassed humans on tasks such as SQuAD 2.0 and SuperGLUE, many real-world document analysis tasks still do not make use of machine learning whatsoever,” the researchers stated. Whether these large models can be repurposed for highly specialised domains remains the million-dollar question.
Researchers believe the answer lies in large specialised datasets. But the rub is, large datasets may require thousands of annotations and are cost-intensive. For specialised domains, datasets tend to be even more expensive.
CUAD addresses these challenges, serving as a benchmark for the broader NLP community. CUAD was built with the help of expert law student annotators who received 70-100 hours of contract review training before labelling data. The training taught the students how to label each of the 41 categories and included video instructions, live workshops with experienced lawyers, detailed instructions, and quizzes.
The annotators also studied over 100 pages of rules and annotation standards created for CUAD. Three additional annotators then verified each annotation to ensure the quality and consistency of the labels.
The code accompanying CUAD uses the HuggingFace Transformers library and was tested with Python 3.8, PyTorch 1.7, and Transformers 4.3/4.4. The researchers also evaluated nine sophisticated pretrained language models on CUAD v1.
Why Is It Important?
Studies suggest law firms spend approximately 50% of their time reviewing contracts. Contract review is also costly because understanding and interpreting contracts requires specialised training, and the expense falls not only on lawyers but on customers too.
Moreover, companies and individuals often sign contracts without even reading them, which can enable predatory behaviour that harms consumers.
Automating contract review by training on extensive, high-quality data can be a game-changer for the law community. CUAD can help not only to reduce costs but also to train NLP models efficiently enough to overcome these challenges.
The researchers experimented with several state-of-the-art Transformer models on CUAD. The results showed an overall improvement in precision (up to 80%), recall, and overall performance.
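To make the two metrics concrete, here is a minimal sketch of how precision and recall could be computed for clause extraction. It assumes a deliberately simple exact-match rule between predicted and annotated spans, which is not necessarily the matching rule the researchers used. The contract IDs, character offsets and the `precision_recall` helper are hypothetical and for illustration only.

```python
from typing import Set, Tuple

# A span is (contract_id, category, start_char, end_char).
# All values below are made up for illustration; they are not CUAD data.
Span = Tuple[str, str, int, int]


def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall over labelled spans (an assumed metric)."""
    true_positives = len(predicted & gold)
    # Precision: share of predicted spans that match an annotation.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of annotated spans that the model found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


gold = {
    ("contract_001", "Agreement Date", 120, 135),
    ("contract_001", "Parties", 40, 95),
    ("contract_002", "Document Name", 0, 18),
    ("contract_002", "Parties", 22, 80),
}
predicted = {
    ("contract_001", "Agreement Date", 120, 135),
    ("contract_001", "Parties", 40, 95),
    ("contract_002", "Document Name", 0, 25),      # wrong end offset, counts as a miss
    ("contract_002", "Parties", 22, 80),
    ("contract_002", "Agreement Date", 500, 512),  # spurious prediction
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```

In this toy case, three of five predictions are correct (precision 0.60) and three of four annotated spans are found (recall 0.75). For contract review, recall is often the number to watch, since a missed clause may never reach an attorney.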
The researchers expect CUAD to help lawyers in the following use cases:
- For disclosure schedules, AI tools coupled with simple code can identify the document name, agreement date and parties. This can save hours of attorney time and enable speedy delivery of high-quality work.
- It can help determine which contracts belong to a divested business and automatically identify the parties, names of signing entities and divested projects in them.
- It can help deal with clauses that appear only rarely in legal contracts. CUAD v1 contains a large number of these rare clauses, which can be used to supplement proprietary training datasets.
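The "simple code" in the first use case could look something like the sketch below. It assumes a model has already produced (file, category, text) extractions, and it groups them into one schedule row per contract. The file names, party names and the `build_schedule` helper are hypothetical and not part of CUAD.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (contract file, label category, extracted text): hypothetical model output.
Extraction = Tuple[str, str, str]

# The three fields the disclosure-schedule use case mentions.
SCHEDULE_FIELDS = ("Document Name", "Agreement Date", "Parties")


def build_schedule(extractions: Iterable[Extraction]) -> List[Dict[str, str]]:
    """Group extracted labels by contract into one schedule row each."""
    by_contract: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for contract, category, text in extractions:
        if category in SCHEDULE_FIELDS:
            by_contract[contract][category].append(text)

    rows = []
    for contract in sorted(by_contract):
        row = {"File": contract}
        for field in SCHEDULE_FIELDS:
            # Join multiple values (e.g. several parties); a blank cell
            # flags a field that still needs attorney review.
            row[field] = "; ".join(by_contract[contract].get(field, []))
        rows.append(row)
    return rows


sample = [
    ("supply_agreement.pdf", "Document Name", "Supply Agreement"),
    ("supply_agreement.pdf", "Parties", "Acme Corp"),
    ("supply_agreement.pdf", "Parties", "Example Industries LLC"),
    ("supply_agreement.pdf", "Agreement Date", "1 March 2019"),
    ("license.pdf", "Document Name", "Software License Agreement"),
    ("license.pdf", "Parties", "Acme Corp"),
]

schedule = build_schedule(sample)
for row in schedule:
    print(row)
```

Leaving missing fields blank rather than guessing keeps the attorney in the loop: the tool drafts the schedule and a human fills in or corrects the gaps.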
The researchers said they hope the next release will double the size of CUAD v1 and focus on data for clauses with lower performance scores.
CUAD v1 can be downloaded here.