Linguists team up with computer scientists to spot trends on cybercrime forums

Cambridge University boffins apply natural language processing to sort out the slang on HackForums

Computer scientists and linguists from Cambridge University have teamed up to apply natural language processing (NLP) techniques to pick out trends in discussions on underground cybercrime forums.

These underground forums and chatrooms typically feature a great deal of general discussion, as well as attempts to sell illicit software and other items, or offer hacking tutorials. Posts are often full of domain-specific lexicon, misspellings, slang, jargon, and acronyms.

Standard NLP approaches are tuned for more organized, edited and collated content such as news articles and Wikipedia entries. Conventional approaches fall apart once faced with talk of ‘fullz’, ‘warez’, ‘rats’, ‘sploits’, and other terms that pepper English-language cybercrime forums.

A team of researchers led by Jack Hughes of the University of Cambridge’s Computer Laboratory and linguist Seth Aycock, also of the University of Cambridge, were nevertheless able to develop a technique to pick out trends from years of posts on an English-language underground hacking forum, specifically the popular HackForums site.

Noisy data

The statistical approach developed by the team, based on a technique called ‘weighted log-odds ratio’, was said to achieve better results than ‘term frequency-inverse document frequency’ (TF-IDF), another NLP-based method.
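To give a feel for the idea, here is a minimal Python sketch of one common formulation of a weighted log-odds ratio: the log-odds of a term in a target set of posts versus a background set, smoothed with a uniform Dirichlet prior and scaled by its estimated standard deviation. This is an illustration only. The function name `weighted_log_odds`, the `alpha` smoothing value, and the toy posts are all assumptions made here, and the researchers' exact variant may differ.

```python
import math
from collections import Counter


def weighted_log_odds(target_tokens, background_tokens, alpha=0.5):
    """Score terms by how over-represented they are in target vs background.

    Illustrative sketch: log-odds ratio with a uniform Dirichlet prior
    (alpha pseudo-counts per term), divided by its approximate standard
    deviation. Assumes the combined vocabulary has at least two terms.
    """
    target = Counter(target_tokens)
    background = Counter(background_tokens)
    vocab = set(target) | set(background)

    n_target = sum(target.values())
    n_background = sum(background.values())
    alpha_total = alpha * len(vocab)  # total pseudo-count mass

    scores = {}
    for term in vocab:
        t = target[term] + alpha      # smoothed count in target posts
        b = background[term] + alpha  # smoothed count in background posts
        # Difference of smoothed log-odds between the two collections
        delta = (math.log(t / (n_target + alpha_total - t))
                 - math.log(b / (n_background + alpha_total - b)))
        # Approximate variance of delta; rare terms get a larger variance
        variance = 1.0 / t + 1.0 / b
        scores[term] = delta / math.sqrt(variance)
    return scores


# Toy example (invented posts, not data from the study)
recent = "selling fullz fullz cheap fullz".split()
older = "selling cheap accounts selling cheap accounts tutorial".split()
ranked = sorted(weighted_log_odds(recent, older).items(),
                key=lambda item: item[1], reverse=True)
print(ranked[0][0])  # the term most over-represented in the recent posts
```

Dividing by the standard deviation is what makes the score ‘weighted’: a term seen only once or twice gets a large variance and so a damped score, which helps with the misspellings and one-off slang typical of forum posts.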

The researchers tested their technique by looking at HackForums posts referencing the spread of the WannaCry ransomware in 2017, and a second set of posts contained in a subforum called ‘Monetizing Techniques’.

Cybercrime canary

Applying the technique may have practical applications in “identifying what may be of interest to security researchers” more quickly and efficiently, according to the researchers.

However, the team cautions that because much cybercrime discussion takes place on Russian-language forums, more work is needed to see whether the technique lends itself to wider application.

“Many cyber-crime forums are not English-speaking, which can add complexity into analysis,” the team acknowledged.

A paper (PDF) on the research has been accepted at the 2020 Workshop on Noisy User-Generated Text.

The Daily Swig approached the researchers with additional questions. No word back as yet but we’ll update this story as and when more information comes to hand.