This blog is jointly authored by Apurv Singh Gautam, Sr. Threat Research Analyst at Cyble, and Sean O’Connor, co-author of the SANS FOR589™: Cybercrime Intelligence™ course.
In the rapidly evolving cybercrime landscape, staying ahead of malicious actors requires a proactive approach to gathering and analyzing data. One of the most powerful tools in the arsenal of cybercrime intelligence analysts is web scraping.
What is Web Scraping?
Web scraping is the process of automatically extracting data from websites and web pages, enabling organizations to collect vast amounts of information quickly and efficiently.
Web scraping plays a crucial role in cybercrime intelligence by enabling analysts to keep a close watch on dark web forums, marketplaces, and other online chat platforms where cybercriminals gather and exchange information. By systematically collecting data from these sources, analysts can unearth valuable insights into emerging threats, vulnerabilities, and malicious actors' tactics, techniques, and procedures (TTPs). These insights can guide decision-making, bolster incident response capabilities, and fortify overall cybersecurity.
In this blog post, we will explore the various aspects of scraping operations on the cybercrime underground. We will delve into the different scraping methods, discuss the key use cases in cybercrime investigations, and examine the challenges of anti-scraping mechanisms and techniques to bypass them. Importantly, we will provide insights into the strategic decision-making process behind scraping operations and highlight the importance of data storage and analysis using tools like Elasticsearch and Kibana.
By the end of this post, readers will have a comprehensive understanding of how scraping operations can be leveraged to enhance cybercrime intelligence efforts. This knowledge will equip you, as a cybercrime intelligence analyst, with the best practices and considerations for implementing effective scraping workflows (see Figure 1). So, let's dive in and uncover the power of scraping in the fight against cybercrime.
Scraping Toolsets
Scraping operations involve leveraging different software tools, libraries, and frameworks to extract data programmatically, enabling efficient and scalable information gathering (see Figure 2).
- Python Libraries and Frameworks: The most popular tools for these scraping tasks include Python libraries and frameworks. BeautifulSoup is a Python library that simplifies parsing HTML and XML documents with an intuitive interface for navigating and searching parsed data, making it a user-friendly choice for web scraping. Another key library, Requests, facilitates HTTP requests and is often used with BeautifulSoup to fetch web pages and extract data, providing a clean API for handling requests, cookies, and authentication. Scrapy, a more robust Python framework, offers a comprehensive ecosystem for building and managing scrapers, with features like request handling, data extraction, multithreading, and pipeline management. Another application-specific library is Telethon, which is used to scrape Telegram messages.
- JavaScript Scraping: For scraping JavaScript-based sources, Puppeteer is a powerful Node.js library that enables programmatic control of headless Chrome or Chromium browsers. It is ideal for scraping dynamic web pages and handling complex scenarios like user interactions and JavaScript rendering. Browser automation tools such as Selenium and Playwright are also widely used. Selenium supports multiple programming languages and simulates user interactions, making it effective for scraping JavaScript-heavy websites. Playwright, developed by Microsoft, provides a unified API for automating interactions across different browsers, offering a modern alternative to Selenium.
- Proxies: Proxies play a crucial role in scraping operations by enhancing privacy and bypassing restrictions. Privoxy, a non-caching web proxy with advanced filtering capabilities, supports routing traffic through networks like Tor. Proxychains forces TCP connections from arbitrary applications through Tor, SOCKS4/5, or HTTP(S) proxies. Tor itself is a widely used anonymity network for routing scraping traffic and reaching dark web sources (a minimal sketch combining these tools with Requests and BeautifulSoup follows this list).
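To make these building blocks concrete, here is a minimal sketch, assuming a local Tor SOCKS proxy listening on port 9050 and a purely hypothetical onion URL and CSS selector: it fetches a page through the proxy with Requests and parses thread titles with BeautifulSoup.

```python
import requests
from bs4 import BeautifulSoup

# Route traffic through a local Tor SOCKS5 proxy (assumes Tor is listening on port 9050).
# The "socks5h" scheme resolves DNS through the proxy, which is required for .onion hosts.
proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Hypothetical target URL used only for illustration.
url = "http://exampleforumaddress.onion/latest"

response = requests.get(url, headers=headers, proxies=proxies, timeout=60)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# The CSS selector below is a placeholder; a real forum requires inspecting its HTML first.
for title in soup.select("a.thread-title"):
    print(title.get_text(strip=True))
```

Note that SOCKS support in Requests requires the optional PySocks dependency (installed via requests[socks]).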
Cybercrime Intelligence Use Cases for Scraping
Scraping operations play a crucial role in gathering valuable intelligence for combating cybercrime. By leveraging automated scraping techniques, cybercrime investigators and analysts can collect and analyze vast amounts of data from cybercrime sources. Some of the key use cases of scraping operations on the cybercrime underground include monitoring cybercrime forums and marketplaces, detecting data leaks and breaches, tracking and profiling threat actors, and investigating cybercrime networks and infrastructure (see Figure 3).
- Monitoring Cybercrime Forums and Marketplaces: The cybercrime underground is a hotbed for cybercriminal activities, with numerous forums, marketplaces, and chat platforms for sharing technical knowledge, selling stolen data, and coordinating attacks. Scraping these forums and marketplaces provides valuable insights into the latest trends, tools, and techniques cybercriminals use. By monitoring conversations and tracking the sale of illicit goods, analysts can stay ahead of emerging threats and proactively defend against potential attacks.
- Detecting Data Leaks and Breaches: The scraped data from these forums and marketplaces can be searched for specific keywords or patterns, such as company names, product names, or sensitive data, to identify potential exposures quickly (a simple keyword-watchlist sketch appears at the end of this section). Early detection of data leaks allows organizations to take swift action, notify affected parties, and mitigate the breach's impact.
- Tracking and Profiling Threat Actors: Cybercriminals often leave digital footprints across multiple online platforms, including forums, social media, and code repositories. By scraping these sources, analysts can gather information about specific threat actors, their aliases, and their activities. This data can be used to build comprehensive profiles of cybercriminals, understand their motivations, and track their movements across the web. Profiling threat actors helps in attribution efforts and enables targeted investigations and takedowns.
- Investigating Cybercrime Networks and Infrastructure: Cybercriminals often rely on a complex network of infrastructure, including command and control (C2) servers, proxy servers, and bulletproof hosting providers. Analyzing indicators of compromise (IOCs) from scraped data and enriching them with domain registration records, IP address ranges, malware analysis, and server configurations can help uncover these infrastructure components and map out the cybercrime infrastructure. Analysts can also identify key players, disrupt operations, and gather evidence for legal proceedings.
By leveraging automated scraping techniques, cybercrime intelligence analysts can gather valuable data, identify trends and patterns, and make informed decisions to prevent, detect, and respond to cyber threats.
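As a simple illustration of the data-leak detection use case above, the following sketch scans scraped posts for watchlist terms such as company names and domains. The record structure and watchlist are hypothetical; in practice, this logic typically runs inside the ingestion pipeline or as a query against the stored data.

```python
import re

# Hypothetical watchlist of organization names, domains, and keywords to monitor.
WATCHLIST = ["examplecorp", "examplecorp.com", "vpn credentials", "database dump"]

# Compile one case-insensitive pattern so each post is scanned in a single pass.
pattern = re.compile("|".join(re.escape(term) for term in WATCHLIST), re.IGNORECASE)

def find_exposures(scraped_posts):
    """Return posts whose title or body mentions any watchlist term.

    `scraped_posts` is assumed to be a list of dicts with 'url', 'title', and 'body' keys.
    """
    hits = []
    for post in scraped_posts:
        text = f"{post.get('title', '')} {post.get('body', '')}"
        matches = sorted({m.group(0).lower() for m in pattern.finditer(text)})
        if matches:
            hits.append({"url": post.get("url"), "matched_terms": matches})
    return hits

# Example usage with a fabricated post.
sample = [{"url": "http://forum.example/post/1",
           "title": "Selling ExampleCorp database dump",
           "body": "Full customer table, contact via Tox."}]
print(find_exposures(sample))
```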
Anti-Scraping Mechanisms
Website owners and administrators implement anti-scraping mechanisms to protect their content and prevent unauthorized data collection. These mechanisms pose significant challenges for cybercrime intelligence professionals who rely on scraping to gather valuable data. Some of the common anti-scraping techniques employed by websites include:
- Access Control and Authentication: CAPTCHAs and human verification challenges, such as distorted text or image recognition tasks, are among the most widely used anti-scraping measures. These challenges, designed to differentiate human users from automated bots, significantly impede data collection efforts.
- User Agent Detection and Blocking: Websites can analyze the user agent string and headers the client sends to identify and block requests originating from scraping scripts. The website may deny access or serve alternate content if the user agent string does not indicate a legitimate human-like browser request.
- IP Address Tracking and Blocking: Websites can monitor the rate and pattern of requests from specific IP addresses. If an IP address is detected making an unusually high number of requests within a short time frame, the website may flag it as a potential scraper and block or throttle its access.
- DDoS Protection and Rate Limiting: Distributed denial of service (DDoS) protection measures, such as introducing delays in content delivery, can effectively limit the rate at which scrapers can access and extract data. These delays are designed to distinguish between legitimate human users and automated scripts. Website administrators use services like Cloudflare and Imperva to provide DDoS protection and rate limiting.
- Dynamic Content Rendering and JavaScript Obstacles: Some websites heavily rely on JavaScript to render content dynamically, making it challenging for traditional scraping tools to extract data. The desired information may not be present in the initial HTML response and requires executing JavaScript code to load and display the content.
- Cookie Change and Account Blocking: Website administrators may monitor and block user accounts that exhibit suspicious scraping behavior. Moreover, periodically changing cookies is another strategy website administrators employ to deter scraping. Scrapers that rely on static cookies may encounter difficulties when the cookies are frequently updated.
- Fingerprinting and Browser Profiling: Advanced anti-scraping systems employ fingerprinting techniques to identify and block automated scraping tools. These techniques analyze various client attributes, such as browser version, installed plugins, screen resolution, and system fonts, to create a unique fingerprint. These fingerprinting mechanisms can detect whether a request comes from a genuine browser.
Anti-Scraping Countermeasures
To successfully navigate the challenges posed by anti-scraping mechanisms, cybercrime intelligence analysts must employ a range of countermeasures and best practices. These techniques aim to mitigate the risk of detection and ensure the continuity of scraping operations. Some of the strategies to bypass anti-scraping measures and conduct effective scraping for cybercrime intelligence purposes include:
- Handling CAPTCHAs and Solving Challenges: Dealing with CAPTCHAs is one of the most significant hurdles in scraping operations. While some CAPTCHAs can be solved using automated optical character recognition (OCR) techniques, more sophisticated challenges often require human intervention. CAPTCHA-solving/bypass services, such as Anti-Captcha or 2Captcha, can be integrated into scraping workflows to outsource CAPTCHA-solving to human workers, allowing scrapers to proceed without manual intervention.
- Rotating User Agents and Headers: One of the most basic countermeasures is to rotate user agent strings and headers to mimic legitimate browser requests. By using a pool of diverse user agent strings and regularly rotating them between requests, scrapers can avoid detection based on a single, consistent user agent. Customizing headers such as "Referer" and "Accept-Language" can also help make requests appear more human-like.
- Utilizing Proxy Servers and IP Rotation: Employing proxy servers and implementing IP rotation strategies can help circumvent IP-based blocking and rate limiting. By routing requests through a network of proxy servers, scrapers can distribute their traffic across multiple IP addresses, making it harder for websites to detect and block individual scraping sessions. Proxy and VPN services offer various proxy types, such as residential, data center, and mobile, each with its own advantages and use cases.
- Mimicking Human Behavior and Introducing Random Delays: To avoid triggering rate-limiting mechanisms and appearing as a bot, scrapers should introduce random delays between requests and mimic human browsing behavior. This can involve adding random pauses, varying the time intervals between requests, and simulating human-like actions such as scrolling, clicking, and mouse movements. Tools like Playwright and Selenium can be used to mimic human behavior and introduce delays (a combined sketch covering user-agent rotation, proxy routing, and random delays follows this list).
- Handling Dynamic Content and JavaScript Rendering: For websites that heavily rely on JavaScript to render content dynamically, scraping requires the use of headless browsers or tools like Puppeteer or Selenium. These tools can simulate user interactions, execute JavaScript code, and retrieve fully rendered HTML content. By leveraging these technologies, scrapers can extract data from websites that would otherwise be challenging to scrape using traditional methods (see the headless-browser sketch after this list).
- Continuous Changing of Cookies and Rotating Accounts: Websites use cookies to track users' sessions and behavior. A scraper can avoid being recognized as a single continuous user by continuously changing cookies. This can be done manually by clearing cookies at regular intervals or automatically by scripting the scraper to change or reset cookies periodically. This technique helps evade tracking mechanisms that rely on cookie data to identify and block scrapers. Similarly, if the website requires user accounts for access, rotating between different accounts can prevent any single account from being flagged for excessive use or unusual activity. This practice helps distribute the load and masks the automated nature of the interactions.
- Bypassing Fingerprinting and Browser Profiling: Analysts need to be aware of browser fingerprinting methods and take steps to mimic a genuine browser environment. This can involve using headless browsers with configurations that resemble popular browser setups and periodically rotating browser profiles to avoid detection. Services like Am I Unique or Cover Your Tracks can be used to check how identifiable a browser fingerprint is.
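To ground several of the countermeasures above, the following is a minimal sketch combining user-agent and header rotation, proxy rotation, and random delays in a Requests-based scraper. The user agent pool, proxy addresses, and URLs are placeholders; a real operation would add error handling, retries, and session management.

```python
import random
import time

import requests

# Placeholder pools; in practice these come from configuration or a proxy provider.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]
PROXIES = [
    "socks5h://127.0.0.1:9050",    # e.g., a local Tor SOCKS proxy
    "http://proxy1.example:8080",  # hypothetical upstream proxies
    "http://proxy2.example:8080",
]

def fetch(url):
    """Fetch a URL with a randomized user agent, headers, proxy, and delay."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://www.google.com/",
    }
    proxy = random.choice(PROXIES)
    # Random pause between requests to avoid tripping rate-limiting mechanisms.
    time.sleep(random.uniform(3, 10))
    return requests.get(url, headers=headers,
                        proxies={"http": proxy, "https": proxy},
                        timeout=60)

for page in ["https://target.example/page/1", "https://target.example/page/2"]:
    response = fetch(page)
    print(page, response.status_code)
```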
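For JavaScript-heavy sources, a headless browser is often unavoidable. The sketch below assumes Playwright for Python is installed (pip install playwright, then playwright install chromium); it renders a hypothetical page, simulates some human-like scrolling, and hands the fully rendered HTML to BeautifulSoup for parsing.

```python
import random
import time

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

URL = "https://js-heavy-forum.example/threads"  # hypothetical target

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # A realistic user agent and viewport make the headless session look more like a normal browser.
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        viewport={"width": 1366, "height": 768},
    )
    page = context.new_page()
    page.goto(URL, wait_until="networkidle")

    # Simulate human-like scrolling with small random pauses.
    for _ in range(3):
        page.mouse.wheel(0, random.randint(400, 900))
        time.sleep(random.uniform(1, 3))

    html = page.content()  # fully rendered DOM, including JavaScript-loaded content
    browser.close()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text(strip=True) if soup.title else "No title found")
```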
Anti-scraping mechanisms are constantly evolving, and websites may implement new measures to deter scrapers. For cybercrime intelligence professionals, adaptability is key: you must continuously monitor your scraping operations, detect any changes or disruptions, and adapt your techniques accordingly. Staying up-to-date with the latest scraping technologies, tools, and best practices is essential to maintain effective scraping efforts (see Table 1).
| Anti-Scraping Mechanism | Intent | Countermeasure | Impact on Scraping Operations |
| --- | --- | --- | --- |
| CAPTCHAs and human verification challenges | Differentiate human users from automated bots | Utilize CAPTCHA-solving services or incorporate human intervention | Increases the complexity and time required for scraping |
| User agent detection and blocking | Identify and block requests from known scraping tools or libraries | Rotate user agent strings and customize headers to mimic legitimate browser requests | Requires additional effort to maintain a pool of diverse user agent strings |
| IP address tracking and rate limiting | Detect and block requests from IP addresses making excessive requests | Employ proxy servers and implement IP rotation strategies to distribute requests across multiple IP addresses | Increases the cost and complexity of scraping infrastructure |
| Dynamic content rendering and JavaScript obstacles | Prevent scraping of content loaded dynamically through JavaScript | Use headless browsers or tools like Puppeteer or Selenium to render and extract dynamically loaded content | Increases the complexity and computational resources required for scraping |
| Fingerprinting and browser profiling | Identify and block automated tools based on unique browser characteristics | Use headless browsers with configurations that closely resemble genuine browser environments and rotate browser profiles | Requires continuous monitoring and adaptation to avoid detection |
| Browser automation detection | Identify and block requests from automated browser tools like Selenium or Puppeteer | Minimize the use of automation-specific code and utilize techniques like WebDriver spoofing or undetected Chrome variants | Requires staying updated with the latest detection methods and countermeasures |
| CAPTCHA and challenge-response evolution | Prevent automated solving of CAPTCHAs using advanced challenge-response mechanisms | Continuously monitor and integrate the latest CAPTCHA-solving techniques and services | Increases the complexity and cost of handling CAPTCHAs in scraping operations |
Table 1: Anti-Scraping Mechanisms and Countermeasures
Scraping Decision Making
Effective scraping operations require careful planning and strategic decision-making. Several factors must be considered to ensure that scraping efforts are targeted, efficient, and aligned with the overall objectives of the investigation. The key aspects of decision-making when planning and executing scraping tasks for cybercrime intelligence purposes include:
- Identifying Relevant Data Sources: Analysts must carefully select the websites, forums, marketplaces, and other online platforms likely to contain relevant information. Factors to consider include the data source's reputation and credibility, the volume and quality of the available data, and the information's relevance to the specific case at hand.
- Evaluating Technical Feasibility and Resource Requirements: A scraping operation's technical feasibility and resource requirements must be carefully evaluated. This involves assessing the complexity of the target websites, the presence of anti-scraping mechanisms, the need for specialized tools or expertise, and the choice between self-built and paid tools. Analysts should consider factors such as the scalability of the scraping infrastructure, the storage and processing capacity for the collected data, and the availability of skilled personnel to design and execute the scraping tasks.
- Designing Resilient and Adaptable Scraping Workflows: Cybercrime intelligence professionals should design scraping workflows resilient to changes in website structures, anti-scraping measures, and data availability. This involves implementing robust error handling, monitoring mechanisms, and failover strategies to ensure the continuity of scraping operations.
Storage and Analysis
Efficient storage, processing, and analysis of the data collected through scraping operations are crucial for deriving actionable insights in cybercrime investigations. While any database can store the data, the Elastic (ELK) stack, comprising Elasticsearch, Logstash, and Kibana, provides a robust and scalable solution for managing and exploiting scraped data.
Elasticsearch: Elasticsearch is a distributed, open-source search and analytics engine that forms the core of the ELK stack. It provides a scalable and efficient platform for storing, searching, and analyzing large volumes of structured and unstructured data.
Logstash: Logstash is a data processing pipeline that integrates with Elasticsearch to ingest, transform, and load data from various sources.
Kibana: Kibana is a powerful data visualization and exploration tool that complements Elasticsearch and Logstash in the ELK stack. It provides an intuitive web interface for querying, visualizing, and dashboarding the data stored in Elasticsearch.
To leverage the Elastic stack for scraped data storage and analysis, cybercrime intelligence analysts need to integrate their scraping operations with it. This involves configuring the scraping tools or scripts to output the collected data in a format compatible with Logstash's input plugins, such as JSON or CSV. Once Logstash ingests the scraped data, it can be processed, enriched, and transformed using Logstash's pipeline configuration. The transformed data is then indexed in Elasticsearch, where it becomes available for searching, querying, and visualization through Kibana. Kibana's Discover and Visualization tools (see Figures 4 and 5) can then be used to analyze and visualize the data using tables and charts.
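As an alternative to shipping data through Logstash, scrapers can also write directly to Elasticsearch. The sketch below is a minimal example assuming the official elasticsearch Python client (8.x) and an unauthenticated local single-node cluster; it bulk-indexes hypothetical scraped forum posts into a daily index that Kibana can then visualize.

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch, helpers

# Assumes a local, unsecured Elasticsearch instance; real deployments need TLS and API-key auth.
es = Elasticsearch("http://localhost:9200")

# Hypothetical scraped records; in practice these come from the scraper's output.
scraped_posts = [
    {
        "forum": "exampleforum",
        "author": "desorden",
        "title": "Selling corporate database",
        "body": "Full dump, contact via Tox.",
        "url": "http://exampleforum.onion/thread/123",
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    },
]

# Bulk-index the records into a daily index so Kibana can filter and visualize by time.
actions = (
    {"_index": f"forum-posts-{datetime.now(timezone.utc):%Y.%m.%d}", "_source": post}
    for post in scraped_posts
)
success, errors = helpers.bulk(es, actions, raise_on_error=False)
print(f"Indexed {success} documents, {len(errors)} errors")
```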
The Elastic stack's combination of search and analytics capabilities, enrichment features, and intuitive visualization and exploration tools enables efficient and effective extraction of insights from the collected data.
Case Study
CHAOTIC SPIDER, also known as "Desorden" and previously operating as "ChaosCC," is a financially motivated cybercriminal entity active since September 2021. The group specializes in data theft and extortion, focusing its efforts on enterprises in Southeast Asia, particularly in Thailand since 2022. Desorden employs SQL injection attacks to compromise web-facing servers, exfiltrating data without resorting to the encryption or destruction tactics commonly seen in double-extortion schemes. The group has targeted prominent organizations such as Acer, Ranhill Utilities, and various Thai firms, selling the stolen data on cybercriminal forums like RaidForums and BreachForums (see Figure 6). The group's last known activity was recorded in October 2023; throughout its operations, it favored secrecy, communicating through secure channels such as Tox and private messaging. Desorden’s operational focus highlights the urgent need for regional businesses to implement robust cybersecurity defenses, including multi-factor authentication, enhanced monitoring, and employee security awareness training, to mitigate potential threats.
Profiling Desorden's activities offers critical insights into their methods and objectives. Manual profiling through cybercrime forums provides a detailed analysis of their data sales and targeted industries but is time-intensive and subject to challenges such as post deletions and forum volatility (see Figure 7).
Automated profiling, on the other hand, leverages tools like Elasticsearch and Kibana to index and query forum data efficiently, offering enhanced visibility and historical insights into the group's operations across multiple forums. For example, Kibana allows analysts to quickly identify posts authored by Desorden, uncovering a range of high-profile victims across Southeast Asia (see Figure 8).
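As a rough illustration of the automated approach, the query below assumes the same elasticsearch Python client and the hypothetical forum-posts indices sketched earlier; it pulls recent posts attributed to the "desorden" handle across all scraped forums, roughly mirroring what the Kibana view in Figure 8 shows interactively.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes the same local cluster as the earlier sketch

# Match posts authored by the actor across every forum-posts-* index.
response = es.search(
    index="forum-posts-*",
    query={"match": {"author": "desorden"}},
    sort=[{"scraped_at": {"order": "desc"}}],
    size=20,
)

for hit in response["hits"]["hits"]:
    doc = hit["_source"]
    print(doc["scraped_at"], doc["forum"], doc["title"])
```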
While automated systems provide scalability and operational security, they benefit from being complemented by human intelligence (HUMINT) to add context and nuance. This blended approach ensures a comprehensive understanding of threat actor behaviors, enabling organizations to stay ahead of evolving cyber threats posed by entities like CHAOTIC SPIDER.
This case study demonstrates the practical application of scraping operations in investigating data leaks on the cybercrime underground, focusing on identifying and analyzing leaked credentials and sensitive information related to a specific organization.
Utilizing Large Language Models (LLMs) for Scraping Operations
The advancements in LLMs have opened up new possibilities for enhancing cybercrime intelligence, including scraping operations and data analysis. Some applications of LLMs in the context of scraping and analyzing data for cybercrime investigations include:
- Automated Scraping Script Generation: Script generation and improvement are widely used features of LLMs. By providing the LLM with a description of the target website or data source and the desired data fields to extract, LLMs can generate a customized scraping script in a specific programming language. This can save significant time and effort in the development process, especially when dealing with multiple data sources or frequently changing website structures. LLMs can also refine an already-built script for scaling purposes, making the process more efficient and productive.
- Text Summarization and Insight Generation: LLMs can generate summaries and extract key insights from large volumes of scraped text data. By fine-tuning an LLM on domain-specific cybercrime-related content, the model can be trained to identify and highlight the most relevant information from scraped forum discussions, chat logs, or marketplace listings. For example, an LLM can generate concise summaries of lengthy threads discussing new attack techniques, summarize key points from posts related to a specific cybercrime campaign, or identify emerging trends and patterns across multiple data sources, keeping analysts informed and up-to-date with timely intelligence (a minimal summarization sketch follows this list).
- Multilingual Analysis and Translation: Cybercrime activities often span across different countries and languages. LLMs with multilingual capabilities can be leveraged to analyze scraped data in various languages and provide automated translations. For example, an LLM can be used to detect the language of scraped text automatically, translate the content into a common language (e.g., English) for analysis, or identify key entities and relationships across multiple languages. This can significantly expand the scope and effectiveness of cybercrime investigations by incorporating data from diverse linguistic backgrounds.
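As one possible implementation of the summarization use case above, the sketch below assumes access to the OpenAI Python SDK and an API key; any model or self-hosted LLM with a comparable chat interface could be substituted, and the model name and prompt are illustrative choices only.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def summarize_thread(posts, model="gpt-4o-mini"):
    """Summarize a scraped forum thread into analyst-ready notes.

    `posts` is assumed to be a list of strings (individual forum posts);
    the model name is an illustrative choice, not a recommendation.
    """
    thread_text = "\n---\n".join(posts)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a cybercrime intelligence assistant. Summarize the thread, "
                        "listing the actors, tools, TTPs, and victims mentioned."},
            {"role": "user", "content": thread_text},
        ],
    )
    return response.choices[0].message.content

# Example usage with fabricated posts.
print(summarize_thread([
    "Selling access to a retail company's VPN, price negotiable.",
    "Escrow accepted, contact on Tox. Proof of access available.",
]))
```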
It's important to note that while LLMs offer promising applications in scraping operations, they also have limitations and potential biases. LLMs are trained on vast amounts of data and may sometimes generate irrelevant or incorrect information. Analysts therefore play a crucial role in carefully reviewing and validating LLM outputs to ensure accuracy and relevance; human review remains an integral part of the process.
Final Thoughts on Scraping the Underground
Web scraping is a powerful and essential capability for cybercrime intelligence professionals, offering unmatched efficiency in gathering and analyzing data from the cybercrime underground. By leveraging tools like Python libraries, browser automation frameworks, and proxies, analysts can monitor forums, marketplaces, and communication platforms where cybercriminals operate. However, the work doesn’t stop there—understanding and countering anti-scraping mechanisms, integrating robust analysis tools like the ELK stack, and strategically managing operations are critical to success.
With the right tools and techniques, web scraping allows analysts to detect emerging threats, track malicious actors, and provide actionable insights to bolster organizational defenses.
Interested in mastering web scraping techniques and integrating them into your threat intelligence operations? The SANS FOR589: Cybercrime Intelligence course dives deep into these strategies, offering hands-on training and insights into navigating the ever-evolving cybercrime landscape. Register today or request a live demo to see how the FOR589 course can transform your approach to cyber intelligence.
Let’s take your scraping operation skills to the next level!