Linguists team up with computer scientists to spot trends on cybercrime forums


Cambridge University boffins apply natural language processing to sort out the slang on HackForums

Computer scientists and linguists from Cambridge University have teamed up to apply natural language processing (NLP) techniques to pick out trends in discussions on underground cybercrime forums.

These underground forums and chatrooms typically feature a great deal of general discussion, as well as attempts to sell illicit software and other items, or offer hacking tutorials. Posts are often full of domain-specific lexicon, misspellings, slang, jargon, and acronyms.

Standard NLP approaches are tuned for more organized, edited and collated content such as news articles and Wikipedia entries. Conventional approaches fall apart once faced with talk of ‘fullz’, ‘warez’, ‘rats’, ‘sploits’, and other terms that pepper English-language cybercrime forums.

A team of researchers led by Jack Hughes of the University of Cambridge’s Computer Laboratory and linguist Seth Aycock, also of the University of Cambridge, has nonetheless developed a technique to pick out trends from years of posts to an English-language underground hacking forum, specifically the popular HackForums site.

Noisy data

The statistical approach developed by the team, based on a technique called ‘weighted log-odds ratio’, was said to achieve better results than ‘term frequency-inverse document frequency’ (TF-IDF), another method widely used in NLP.
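To give a feel for what a weighted log-odds score measures, here is a minimal Python sketch. It uses one common formulation, in which the log-odds of a term in a target set of posts is compared with its log-odds in a background set, smoothed by a Dirichlet prior and divided by an estimate of its standard deviation. This is an illustration under stated assumptions, not the researchers' implementation: the function name `weighted_log_odds`, the `prior_scale` parameter, and the choice to build the prior from the combined counts are all hypothetical.

```python
import math
from collections import Counter


def weighted_log_odds(target_tokens, background_tokens, prior_scale=1.0):
    """Score each term by how over-represented it is in the target tokens.

    Returns {term: z-score}. Positive scores lean towards the target set,
    negative scores towards the background set.
    Assumes the two sets together contain at least two distinct terms.
    """
    target = Counter(target_tokens)
    background = Counter(background_tokens)

    # Assumption: the prior is the combined count of each term, scaled.
    vocab = set(target) | set(background)
    prior = {t: prior_scale * (target[t] + background[t]) for t in vocab}

    n_target = sum(target.values())
    n_background = sum(background.values())
    prior_total = sum(prior.values())

    scores = {}
    for term, alpha in prior.items():
        y_t = target[term]
        y_b = background[term]
        # Smoothed log-odds of the term within each set.
        lo_t = math.log((y_t + alpha) / (n_target + prior_total - y_t - alpha))
        lo_b = math.log((y_b + alpha) / (n_background + prior_total - y_b - alpha))
        # Approximate variance of the difference; rare terms get damped.
        variance = 1.0 / (y_t + alpha) + 1.0 / (y_b + alpha)
        scores[term] = (lo_t - lo_b) / math.sqrt(variance)
    return scores


# Toy data, invented for illustration only.
may_posts = "wannacry ransomware wannacry bitcoin the the".split()
earlier_posts = "the the the the rat bitcoin crypter the".split()
scores = weighted_log_odds(may_posts, earlier_posts)
trending = sorted(scores, key=scores.get, reverse=True)
```

Dividing by the standard deviation is what makes the score "weighted": a term seen only once or twice is pulled towards zero, while a term that is common everywhere (such as "the" in the toy data) ends up with a negative or near-zero score rather than topping the list.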

The researchers tested their technique by looking at HackForums posts referencing the spread of the WannaCry ransomware in 2017, and a second set of posts contained in a subforum called ‘Monetizing Techniques’.

The Bayesian statistical approach taken by the researchers, and the NLP techniques they applied, are informed by earlier research into making sense of “noisy text data”.

“Detecting trending topics on noisy social media data is not a new problem for information retrieval and NLP,” the University of Cambridge team explains.

“However, we believe our application of an existing statistical method onto a longitudinal dataset provides a novel lightweight approach to detecting trending terms, which returns terms of more relevance than TF-IDF, and remains computationally less expensive than topic modelling such as LDA.”
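For comparison, a plain TF-IDF baseline can be sketched by treating each time window of posts as one "document". The snippet below is an illustrative assumption about how such a baseline might look, not the setup used in the paper; the function name `tfidf_by_window` and the month labels are hypothetical.

```python
import math
from collections import Counter


def tfidf_by_window(windows):
    """Compute TF-IDF per time window.

    windows: {label: list of tokens}, one entry per time slice.
    Returns {label: {term: score}} using tf = count / window length
    and idf = log(number of windows / windows containing the term).
    """
    counts = {label: Counter(tokens) for label, tokens in windows.items()}
    n_windows = len(counts)

    # Number of windows in which each term appears at least once.
    doc_freq = Counter()
    for window_counts in counts.values():
        doc_freq.update(window_counts.keys())

    result = {}
    for label, window_counts in counts.items():
        total = sum(window_counts.values())
        result[label] = {
            term: (n / total) * math.log(n_windows / doc_freq[term])
            for term, n in window_counts.items()
        }
    return result


# Toy data, invented for illustration only.
windows = {
    "2017-04": ["the", "rat", "crypter", "the"],
    "2017-05": ["the", "wannacry", "wannacry", "ransomware"],
}
tfidf = tfidf_by_window(windows)
```

One property of this unsmoothed form is that any term present in every window gets a score of exactly zero, however sharply its frequency changes from one window to the next.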

Cybercrime canary

The technique may have practical applications in “identifying what may be of interest to security researchers” more quickly and efficiently, according to the researchers.

However, the team acknowledges that, because much cybercrime discussion takes place on Russian-language forums, more work is needed to see whether the technique lends itself to wider application.

“Many cyber-crime forums are not English-speaking, which can add complexity into analysis,” the researchers said.

A paper (PDF) on the research has been accepted at the 2020 Workshop on Noisy User-Generated Text.

The Daily Swig approached the researchers with additional questions. No word back as yet but we’ll update this story as and when more information comes to hand.
