Mining the Shadows: A Hybrid NLP Framework for Dark Web Cybercrime Investigation


Article Information

Title: Mining the Shadows: A Hybrid NLP Framework for Dark Web Cybercrime Investigation

Authors: Bilal Khan, Ans Riaz, Kausar Parveen

Journal: International Journal for Electronic Crime Investigation

HEC Recognition History: Category Y, from 2024-10-01 to 2025-12-31

Publisher: Lahore Garrison University, Lahore

Country: Pakistan

Year: 2025

Volume: 9

Issue: 1

Language: en

DOI: 10.54692/ijeci.2025.0901/246

Keywords: BERT, Natural Language Processing, Cybercrime, Dark Web, RoBERTa, digital forensics teams, malicious software, named entity recognition, IRB-aligned

Abstract

The Dark Web is a central hub of cybercrime, where threat actors discuss campaigns, trade illegal materials, and sell malware. Traditional auditing of such environments is inefficient and does not scale: it is limited by sheer volume, linguistic diversity, and intentional content obfuscation. This article proposes a hybrid Natural Language Processing (NLP) system for automated cybercrime investigation on Dark Web forums. Building on earlier research, the system employs transformer-based models such as BERT and RoBERTa alongside standard preprocessing steps. Custom components handle named-entity recognition (NER), topic modeling, sentiment and intent classification, and threat-keyword extraction. Stylometric profiling based on lexical and behavioral features supports author tracking across aliases. Experimental analyses show high precision in identifying entities, clustering cybercriminal dialogue, and categorizing intent, exceeding baseline models on precision and recall. The system is further distinguished by a rule-aware ethical scraping protocol and an IRB-friendly data-processing layer. By converting raw, noisy forum text into structured threat intelligence, the framework enables scalable, real-time surveillance of cybercriminal ecosystems and provides actionable intelligence to cybersecurity researchers, digital forensics experts, commercial law-enforcement agencies, and other downstream consumers of threat data.
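The abstract does not say which stylometric features or similarity measure the authors use for tracking authors across aliases. As a rough illustration of the general idea only, the sketch below assumes character-trigram frequency profiles compared by cosine similarity. The function names (`char_ngrams`, `cosine`, `alias_similarity`) and the trigram choice are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Build a character n-gram frequency profile (assumed lexical feature)."""
    # Lowercase and collapse whitespace so formatting noise does not dominate.
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def alias_similarity(posts_a, posts_b, n=3):
    """Compare the pooled posts of two aliases; higher means more similar style."""
    profile_a = char_ngrams(" ".join(posts_a), n)
    profile_b = char_ngrams(" ".join(posts_b), n)
    return cosine(profile_a, profile_b)


if __name__ == "__main__":
    # Toy, made-up posts purely for demonstration.
    alias_1 = ["selling fresh logs, pm me 4 price", "fresh logs avail, pm 4 price"]
    alias_2 = ["fresh logs 4 sale, pm me 4 details"]
    alias_3 = ["Good evening. I am looking for a reliable hosting provider."]
    print(alias_similarity(alias_1, alias_2))
    print(alias_similarity(alias_1, alias_3))
```

A score near 1 suggests similar writing habits, and a score near 0 suggests little overlap. A real system would likely combine such lexical scores with behavioral signals (for example, posting times) and would need a threshold calibrated on labeled data.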

