
A New Language Model Trained on the Dark Web Emerges

DarkBERT finds its use cases in dark web research and in identifying cybersecurity threats, such as ransomware leak site detection.


Researchers from South Korea recently released DarkBERT, a dark web domain-specific language model based on the RoBERTa architecture. 
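For context, RoBERTa is a masked language model, so a RoBERTa-based model such as DarkBERT is, at its core, trained to predict tokens hidden from text. Below is a quick, hypothetical illustration of that masked-token objective using the generic roberta-base checkpoint from Hugging Face (DarkBERT itself is not assumed to be available here; the example sentence is invented):

```python
from transformers import pipeline

# Generic RoBERTa checkpoint, used only to illustrate the masked-token objective
# that a dark-web-adapted model would be further trained on.
fill_mask = pipeline("fill-mask", model="roberta-base")

for pred in fill_mask("The stolen files were posted on a dark web leak <mask>."):
    print(pred["token_str"], round(pred["score"], 3))
```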

The new model is said to show promising applicability for future research in the dark web domain and in the cybersecurity industry. It also outperformed existing language models in evaluations on dark web domain tasks and datasets.

But how did they do it? To allow DarkBERT to adapt well to the language used on the dark web, the researchers pre-trained the model on a large-scale dark web corpus collected by crawling the Tor network. They also polished the pre-training corpus through data filtering and deduplication, along with data pre-processing to address potential ethical concerns around sensitive information in dark web texts.
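The article does not include the authors' code, but the steps described map to a familiar pipeline. Here is a minimal, hypothetical Python sketch of that kind of corpus polishing, deduplication by content hash, filtering of very short pages, and masking of identifier-like strings (email addresses here, purely as an assumed stand-in for "sensitive information"), followed by continued masked-language-model pre-training from a RoBERTa checkpoint with the Hugging Face transformers library. Paths, thresholds, and hyperparameters are illustrative, not the authors' settings.

```python
import hashlib
import re

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# --- Corpus polishing (assumed steps mirroring the filtering/dedup/masking described) ---
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # stand-in for sensitive-info masking

def clean_pages(raw_pages, min_chars=500):
    seen = set()
    cleaned = []
    for text in raw_pages:
        text = text.strip()
        if len(text) < min_chars:          # filter near-empty or low-value pages
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                 # exact-duplicate removal by content hash
            continue
        seen.add(digest)
        cleaned.append(EMAIL_RE.sub("<EMAIL>", text))  # mask identifier-like strings
    return cleaned

# --- Continued masked-LM pre-training from a RoBERTa checkpoint (illustrative settings) ---
def pretrain(corpus_texts, output_dir="darkweb-roberta"):
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = RobertaForMaskedLM.from_pretrained("roberta-base")

    dataset = Dataset.from_dict({"text": corpus_texts}).map(
        lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
        remove_columns=["text"],
    )
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=8,
        num_train_epochs=1,
        logging_steps=100,
    )
    Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```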

Figure: The DarkBERT pre-training process and the various use-case scenarios used for evaluation. (Source: arXiv)


The same group of researchers last year worked on ‘Shedding New Light on the Language of the Dark Web,’ where they introduced CoDA, a dark web text corpus collected from various onion services and divided into topical categories. Another notable study, ‘The Language of Legal and Illegal Activity on the Darknet,’ by Israeli researchers, identified several distinguishing factors between legal and illegal texts using a variety of approaches: predictive (text classification), application-based (named entity Wikification), and one based on raw statistics.
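The article does not reproduce the Israeli researchers' method, but the "predictive (text classification)" approach it mentions can be sketched in a few lines. Below is a hypothetical baseline that separates two classes of documents with TF-IDF features and logistic regression; the labels and example texts are invented placeholders, not data from either study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder documents standing in for 'legal' vs 'illegal' onion-service text.
texts = [
    "forum discussion about privacy tools and encrypted email",
    "marketplace listing offering stolen credit card dumps",
    "blog post on operating a mirror of a news site",
    "vendor page advertising counterfeit identity documents",
]
labels = ["legal", "illegal", "legal", "illegal"]

# TF-IDF over unigrams and bigrams feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["post selling leaked account credentials"]))
```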

All of this research work and more has inspired the researchers to develop DarkBERT. 

What next? 

In the coming months, the researchers say they plan to improve the performance of dark web domain-specific pre-trained language models using newer architectures, and to crawl additional data to allow the construction of multilingual language models.

Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words. He also holds a keen interest in photography, filmmaking, and the gaming industry.