Underground forums contain discussions and advertisements on a wide range of topics, including general chatter, hacking tutorials, and sales of items on marketplaces. While off-the-shelf natural language processing (NLP) techniques can be applied in this domain, the underlying models are often trained on standard corpora such as news articles and Wikipedia.
It isn’t clear how well these models perform on the noisy text found on underground forums, which contains an evolving domain-specific lexicon, misspellings, slang, jargon, and acronyms. I explored this problem with colleagues from the Cambridge Cybercrime Centre and the Computer Laboratory, developing a tool for detecting bursty trending topics using a Bayesian log-odds approach. The approach uses a prior distribution to detect changes in the vocabulary used on forums, filtering out consistently used jargon and slang. The paper has been accepted to the 2020 Workshop on Noisy User-Generated Text (W-NUT, ACL) and the preprint is available online.
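As a rough illustration of the idea (not the paper’s exact implementation), one standard formulation of a Bayesian log-odds measure is the weighted log-odds ratio with an informative Dirichlet prior, in the style of Monroe et al. (2008). The minimal sketch below assumes the background corpus itself serves as the prior, so terms used consistently over time are damped while bursty terms score highly; the function name, smoothing constant, and variable names are all illustrative.

```python
import math
from collections import Counter

def trending_terms(current_tokens, background_tokens):
    """Weighted log-odds with an informative Dirichlet prior (Monroe et al. style).

    Using the background corpus as the prior damps established jargon and
    slang, while genuinely bursty terms in the current window get high
    z-scores.
    """
    cur, bg = Counter(current_tokens), Counter(background_tokens)
    vocab = set(cur) | set(bg)
    alpha = {w: bg[w] + 0.01 for w in vocab}  # informative prior, lightly smoothed
    a0 = sum(alpha.values())
    n_cur, n_bg = sum(cur.values()), sum(bg.values())

    z = {}
    for w in vocab:
        # Difference in log-odds of w between the two corpora, regularised by the prior
        delta = (math.log((cur[w] + alpha[w]) / (n_cur + a0 - cur[w] - alpha[w]))
                 - math.log((bg[w] + alpha[w]) / (n_bg + a0 - bg[w] - alpha[w])))
        var = 1.0 / (cur[w] + alpha[w]) + 1.0 / (bg[w] + alpha[w])
        z[w] = delta / math.sqrt(var)  # z-score of the log-odds difference
    return z

# Terms with the highest z-scores are candidate trending terms, e.g.
# scores = trending_terms(this_week_tokens, previous_year_tokens)
```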
Other commonly used approaches to identifying known and emerging trends range from simple keyword detection using a dictionary of known terms, to statistical methods such as TF-IDF term weighting and topic modelling with Latent Dirichlet Allocation (LDA). In addition, the NLP landscape has been changing over the last decade [1], with a shift towards deep learning and neural models such as word2vec and BERT.
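To make the contrast between these statistical techniques concrete, the sketch below applies both to a handful of invented forum posts using scikit-learn; the example posts and parameter choices are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented example posts; real forum data is far noisier
posts = [
    "selling fresh fullz and cvv, escrow accepted",
    "tutorial: setting up a crypter for your RAT",
    "anyone vouch for this escrow service?",
]

# TF-IDF: upweight terms frequent in a post but rare across the corpus
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(posts)

# LDA: model each post as a mixture of latent topics over raw term counts
counts_vec = CountVectorizer()
counts = counts_vec.fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Inspect the top terms in each inferred topic
terms = counts_vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"topic {k}: {top}")
```

Both methods depend on the vocabulary seen at training time, which is exactly where the evolving lexicon of underground forums causes trouble.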
In this Three Paper Thursday, we look at how past papers have used different NLP approaches to analyse posts on underground forums, from statistical techniques to word embeddings: identifying and defining new terms, generating relevant warnings even when the jargon is unknown, and identifying similar threads even when relevant keywords are not known.
[1] Gregory Goth. 2016. Deep or shallow, NLP is breaking out. Commun. ACM 59, 3 (March 2016), 13–16. DOI: https://doi.org/10.1145/2874915