Is Your Data Being Scraped? A Real Case Study
Disclaimer: All material in this post has been used with permission. Certain details have been omitted for security and client confidentiality.
About the Client
Our client is a small business that hosts a web application. Much like Amazon, the application lets users access some publicly available data (e.g., products and their prices) without authentication. The client typically receives hundreds of thousands of page views per day and sometimes handles over one million.
The Initial Lead
We were first tipped off that there might be an issue when we saw our client's public data appearing on other websites with which they had no business relationship. We suspected that these sites were scraping data from the client's site without permission. If so, it was costing our client traffic and revenue.
Verify the Scraping
To further understand what was happening, we set up comprehensive web request logging using SumoLogic, a tool chosen after evaluating several providers based on pricing and features. By analyzing these logs, we identified a high volume of requests from unauthorized sources, including various servers and scripts not associated with legitimate users.
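As an illustration of the kind of analysis we ran, here is a minimal Python sketch that surfaces the heaviest requesters in an exported access log. The CSV field names ("client_ip", "user_agent") and file name are hypothetical and depend on how your logging tool exports data; our actual analysis ran inside SumoLogic's query interface.

```python
from collections import Counter
import csv

# Minimal sketch: group exported access-log rows by client IP and user agent
# to surface unusually high request volumes. Field names are hypothetical and
# depend on how your logs are exported.
def top_requesters(log_path, limit=20):
    counts = Counter()
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[(row["client_ip"], row["user_agent"])] += 1
    return counts.most_common(limit)

if __name__ == "__main__":
    for (ip, agent), hits in top_requesters("access_log_export.csv"):
        print(f"{hits:>8}  {ip}  {agent}")
```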
Implementing a Solution
Our initial step was to deploy AWS Web Application Firewall (WAF) to block requests originating from other app servers. We enabled one of AWS's managed WAF rule groups, which is preconfigured to automatically block illegitimate traffic. While this was effective to some extent, it inadvertently affected legitimate users on VPNs and didn't catch all the bots.
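For readers who want to see roughly what this looks like, below is a hedged sketch of enabling an AWS managed rule group on a WAFv2 web ACL with boto3. The ACL name, scope, priorities, and metric names are illustrative, and the Bot Control rule group shown here stands in for whichever managed rule group fits your workload; this is not our exact configuration.

```python
import boto3

# Minimal sketch (not our exact configuration): create a WAFv2 web ACL that
# enables an AWS managed rule group. "AWSManagedRulesBotControlRuleSet" is
# AWS's Bot Control managed rule group; names, scope, and priority are illustrative.
wafv2 = boto3.client("wafv2", region_name="us-east-1")

response = wafv2.create_web_acl(
    Name="example-bot-blocking-acl",        # hypothetical name
    Scope="REGIONAL",                        # use "CLOUDFRONT" for CloudFront distributions
    DefaultAction={"Allow": {}},             # allow by default, let rules block
    Rules=[
        {
            "Name": "aws-bot-control",
            "Priority": 0,
            "Statement": {
                "ManagedRuleGroupStatement": {
                    "VendorName": "AWS",
                    "Name": "AWSManagedRulesBotControlRuleSet",
                }
            },
            "OverrideAction": {"None": {}},  # keep the rule group's own actions
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "botControl",
            },
        }
    ],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "exampleBotBlockingAcl",
    },
)
print(response["Summary"]["ARN"])
```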
We then explored various bot management platforms such as Cloudflare Bot Management, PerimeterX, and WhiteOps (now called Human). However, all these solutions were beyond the client’s budget, which was capped at $10,000/yr. Consequently, we had to develop an in-house solution.
Blocking Unauthorized Bots
To separate legitimate bot traffic from malicious bots, we implemented two main strategies:
- Reverse DNS Lookup: For requests claiming to be from well-known bots like Googlebot, we performed reverse DNS lookups to verify their authenticity (see the first sketch after this list). This method effectively blocked many illegitimate requests from Azure servers pretending to be Bingbot.
- Cookie Tracking: For requests claiming to be from legitimate browsers, we assigned cookies to users and monitored their behavior, flagging users whose behavior differed significantly from the average user. For instance, a user would be flagged for further analysis if they accessed the site from three different countries in one day (see the second sketch after this list).
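Here is a minimal sketch of the reverse DNS verification (forward-confirmed reverse DNS). The crawler domain suffixes are the ones Google and Microsoft publish for Googlebot and Bingbot; the function and variable names are our own illustration, not our production code.

```python
import socket

# Minimal sketch of forward-confirmed reverse DNS: resolve the client IP to a
# hostname, check that the hostname belongs to the crawler's published domains,
# then resolve that hostname back and confirm it maps to the original IP.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def is_genuine_crawler(ip, claimed_bot):
    suffixes = CRAWLER_DOMAINS.get(claimed_bot, ())
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)               # reverse lookup
    except socket.herror:
        return False
    if not suffixes or not hostname.endswith(suffixes):
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)   # forward confirmation
    except socket.gaierror:
        return False
    return ip in forward_ips

# Example: a request whose user agent claims to be Bingbot
print(is_genuine_crawler("157.55.39.1", "bingbot"))
```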
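And a minimal sketch of the cookie-based behavioral flagging. It assumes an upstream step has already geolocated each request's IP to a country (e.g., with a GeoIP database); the threshold and data structures are illustrative only.

```python
from collections import defaultdict

# Minimal sketch of the behavioral flagging idea, not our production system:
# each request carries a tracking cookie, and a user is flagged for review if
# that cookie is seen from three or more countries in a single day.
countries_seen = defaultdict(set)   # (cookie_id, date) -> set of countries
COUNTRY_THRESHOLD = 3

def record_request(cookie_id, date, country):
    """Record a request and return True if the user should be flagged."""
    seen = countries_seen[(cookie_id, date)]
    seen.add(country)
    return len(seen) >= COUNTRY_THRESHOLD

# Example: the same cookie appears from three countries on one day
for country in ("US", "DE", "BR"):
    flagged = record_request("cookie-abc123", "2024-01-01", country)
print(flagged)  # True
```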
User Experience Considerations
To avoid disrupting legitimate users, we implemented a secondary verification process. Users flagged as potential bots had to complete a JavaScript challenge. Successfully passing this challenge confirmed their human status and restored their access.
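As a rough illustration of that flow (not our production code), here is a sketch using Flask: a flagged visitor is served a page whose inline script posts a server-issued token back, and only clients that actually execute JavaScript complete the round trip and regain access. The routes, token handling, and page markup are all hypothetical.

```python
import secrets
from flask import Flask, request, session, abort

# Hypothetical sketch of a JavaScript challenge: clients that never execute
# JavaScript never complete the round trip and remain flagged.
app = Flask(__name__)
app.secret_key = secrets.token_hex(32)

CHALLENGE_PAGE = """
<script>
  // Executed only by clients running a real JavaScript engine.
  fetch("/verify-challenge", {
      method: "POST",
      headers: {"Content-Type": "application/json"},
      body: JSON.stringify({token: "%s"})
  }).then(() => location.reload());
</script>
<p>Checking your browser...</p>
"""

@app.route("/challenge")
def challenge():
    token = secrets.token_urlsafe(16)
    session["challenge_token"] = token
    return CHALLENGE_PAGE % token

@app.route("/verify-challenge", methods=["POST"])
def verify_challenge():
    payload = request.get_json(silent=True) or {}
    if payload.get("token") == session.get("challenge_token"):
        session["verified_human"] = True     # restore normal access
        return "", 204
    abort(403)
```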
I’d love to discuss all the technical details of our automated bot-blocking system, but unfortunately I have to keep it brief. Many aspects of the system are confidential so that a malicious actor cannot gain inside knowledge of how it works.
Results
Our multi-layered approach significantly reduced bot traffic and prevented third-party websites from accessing the client’s data. One interesting detail: roughly 60% of traffic that initially appeared to be Bingbot (based on its user agent) was not the legitimate Bingbot. We were able to block this illegitimate traffic without affecting authentic Bingbot requests.
Ongoing Success
The solution has been in place for years and continues to protect the client’s valuable information. While we used additional methods to enhance security, these primary strategies effectively achieved our goal within the client’s budget.
Conclusion
If your business is struggling with web scraping but cannot afford high-priced bot management solutions, consider reaching out to us. We can tailor a cost-effective strategy to safeguard your data and maintain your competitive edge.