Large Language Models (LLMs) have gained massive popularity over the past few months, especially since the emergence of AI chatbots like ChatGPT. Generative AI models of this kind study an existing body of data, learn its patterns, and use them to produce new and unique content such as text, images, and audio. While most of these tools have been aimed at general-purpose content generation, researchers have now developed a first-of-its-kind LLM built to assess and combat cybersecurity threats. Interestingly, this model has been trained exclusively on information drawn from the dark web.
DarkBERT is an encoder model that adopts the RoBERTa architecture, which in turn relies on transformers. Instead of training it on the ordinary web, researchers trained this LLM on a vast dataset of dark web pages, assimilating information from sources such as hacker forums, scam websites, and other criminal corners of the internet. In a paper titled 'DarkBERT: A Language Model for the Dark Side of the Internet', published on arxiv.org and yet to be peer-reviewed, its creators say that DarkBERT could revolutionize the fight against cybercrime by finding and analyzing the elusive domains of the internet that remain hidden from search engines.
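For readers curious what "an encoder model that adopts the RoBERTa architecture" means in practice, the minimal sketch below queries such a model for masked-token prediction, the training objective RoBERTa uses. DarkBERT's own checkpoint is not assumed to be available here; the generic roberta-base model from the Hugging Face transformers library stands in purely as an illustration.

```python
# A minimal sketch of probing a RoBERTa-style encoder with masked language
# modelling. "roberta-base" is an illustrative stand-in, not DarkBERT itself.
from transformers import pipeline

# fill-mask is the natural probe for an encoder trained to predict
# masked-out tokens from their surrounding context.
fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa's mask token is "<mask>"; the model ranks candidate fillers.
for prediction in fill_mask("The stolen credentials were posted on a <mask> forum."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```

A domain-specific model like DarkBERT would rank candidates for such a sentence very differently from a model trained on ordinary web text, which is the intuition behind training on dark web pages in the first place.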
While the dark web is usually concealed and inaccessible to the general public, the researchers used the Tor network to access and collect data from its pages. The raw data then underwent several processes, including deduplication, category balancing, and pre-processing, to create a refined dark web corpus, which was then used to train the RoBERTa-based model, yielding DarkBERT over a period of 15 days.
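The paper names deduplication and pre-processing among the corpus-cleaning steps; the sketch below shows one plausible way such steps could look. The specific rules here (collapsing whitespace, masking long hexadecimal identifiers, hashing pages to drop exact duplicates) are illustrative assumptions, not the authors' actual pipeline.

```python
# A hedged sketch of corpus clean-up before pretraining: light
# pre-processing followed by exact-duplicate removal. All rules and
# function names are hypothetical, for illustration only.
import hashlib
import re

def preprocess(page_text: str) -> str:
    """Collapse whitespace and mask long hex strings (e.g. onion-style IDs),
    a stand-in for the identifier masking such pipelines typically apply."""
    text = re.sub(r"\s+", " ", page_text).strip()
    return re.sub(r"\b[0-9a-f]{16,}\b", "[ID]", text)

def deduplicate(pages: list[str]) -> list[str]:
    """Drop exact duplicates by hashing the cleaned text of each page."""
    seen: set[str] = set()
    unique: list[str] = []
    for page in pages:
        cleaned = preprocess(page)
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique

# The two pages below differ only in whitespace, so one copy is dropped
# and the long hex identifier in the survivor is masked.
corpus = deduplicate(
    ["Login  to  market abc123def456abc1", "Login to market abc123def456abc1"]
)
print(corpus)  # ['Login to market [ID]']
```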
Since it is trained on a dataset of dark web pages, DarkBERT has the potential for a wide range of cybersecurity applications.