
How We Downloaded, Parsed, and Analyzed 14,000+ PDFs with AI


How we combined lightweight infrastructure, parallel downloads, and targeted parsing to efficiently extract complex tables

By Ryan Hughes and Dilan Bhat

How We Downloaded 14,000 PDFs and Extracted Complex Tables

A recent project involved downloading a large number of research PDFs — over 14,000 in total — and then extracting specific tables from each document. While it was a sizable undertaking, it was quite straightforward for our team, thanks to our prior experience with bulk file handling and data extraction. Below is a breakdown of how we approached both the file collection and table parsing tasks.

Part 1: Collecting 14,000 PDFs

Automated vs. Manual Downloading

Automated Downloads

For larger repositories or websites hosting hundreds of PDFs, we used custom scripts (built with Node.js and TypeScript) to locate and download the files in parallel. By relying on basic HTTP requests, we could retrieve multiple files at once without putting undue strain on either our own system or the source sites.
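
For illustration, a minimal version of that kind of parallel downloader might look like the sketch below (Node 18+, so the built-in fetch API is available). The URL list, output directory, and concurrency limit are placeholders rather than our production values.

```typescript
// Minimal concurrency-limited PDF downloader (Node 18+, built-in fetch).
// URLs, output directory, and concurrency limit are placeholders.
import { mkdir, writeFile } from "node:fs/promises";
import { basename } from "node:path";

const CONCURRENCY = 8; // modest parallelism to avoid straining source sites

async function downloadPdf(url: string): Promise<void> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  const fileName = basename(new URL(url).pathname) || "download.pdf";
  await writeFile(`./pdfs/${fileName}`, Buffer.from(await res.arrayBuffer()));
}

async function downloadAll(urls: string[]): Promise<void> {
  await mkdir("./pdfs", { recursive: true });
  const queue = [...urls];
  // A fixed pool of workers pulls URLs off a shared queue.
  const workers = Array.from({ length: CONCURRENCY }, async () => {
    for (let url = queue.shift(); url !== undefined; url = queue.shift()) {
      try {
        await downloadPdf(url);
      } catch (err) {
        console.error(`Failed to download ${url}:`, err); // logged for retry later
      }
    }
  });
  await Promise.all(workers);
}
```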

Manual Downloads

Many source sites contained just a handful of PDFs or were very complex, making an automated approach more time-consuming than necessary. Instead, we hired operations staff located overseas to manually download these smaller sets. This approach was cost-effective and quick for low-volume sources.

Right-Sized Infrastructure

Lightweight Setup

We kept track of each PDF using a single PostgreSQL instance (roughly $50/month) that stored the status of every file. This simple setup — coupled with an Amazon S3 bucket to store the actual PDFs — kept things both inexpensive and easy to manage. We didn’t need any advanced infrastructure or a fleet of servers; a single, modest database plus cloud storage was enough to handle all 14,000 documents.
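
As a rough sketch of what that tracking can look like in practice, a single table is enough to record where each file lives and how far along it is. The table name, columns, and status values below are illustrative, not our actual schema.

```typescript
// Rough sketch of per-file status tracking in a single Postgres instance.
// Table name, columns, and status values are illustrative.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function initSchema(): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS pdf_files (
      id          SERIAL PRIMARY KEY,
      source_url  TEXT UNIQUE NOT NULL,
      s3_key      TEXT,                             -- location of the PDF in the S3 bucket
      status      TEXT NOT NULL DEFAULT 'pending',  -- pending | downloaded | parsed | failed
      updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
    )
  `);
}

async function markDownloaded(sourceUrl: string, s3Key: string): Promise<void> {
  await pool.query(
    `UPDATE pdf_files
        SET status = 'downloaded', s3_key = $2, updated_at = now()
      WHERE source_url = $1`,
    [sourceUrl, s3Key]
  );
}
```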

Scalability in Mind

If a future project involves hundreds of thousands of PDFs, we can easily scale by adding more compute resources and a queue (e.g., AWS SQS) to orchestrate downloads.
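
If we did go that route, enqueueing download jobs might look roughly like this sketch using the AWS SDK for JavaScript v3; the region, queue URL, and message shape are placeholders, not part of the current setup.

```typescript
// Hypothetical sketch: enqueueing download jobs on AWS SQS (SDK for JavaScript v3).
// Region, queue URL, and message shape are placeholders.
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "us-east-1" }); // placeholder region

async function enqueueDownload(pdfUrl: string): Promise<void> {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.DOWNLOAD_QUEUE_URL, // placeholder queue
      MessageBody: JSON.stringify({ pdfUrl }),
    })
  );
}
```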

Part 2: Extracting Complex Tables

Collecting thousands of PDFs was just the beginning. Each paper contained one or more tables that we needed to locate, parse into a consistent format, and upload as structured data to our Postgres database.

Locating and Identifying Tables

  • Page-Level Search: We scanned each PDF for pages that might contain tables, looking for specific numeric patterns or keywords (a simplified sketch of this scan appears below).
  • Handling Images with OCR: Some tables appeared only as scanned images, so we used Optical Character Recognition to extract text.
  • Unpredictable Formatting: Every publisher, and sometimes every paper, had a different way of structuring tables — some had missing rows, others had extra rows, and many lacked standard column separators.

“A single word in the table title could mean completely different parsing rules.”
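
To give a sense of the page-level scan, here is a simplified sketch that assumes the page text has already been extracted (from the PDF text layer or from OCR). The keywords and the numeric-density threshold are illustrative, not our production rules.

```typescript
// Simplified page-level scan for pages that probably contain a table.
// Assumes page text has already been extracted (PDF text layer or OCR output).
// Keywords and the numeric-density threshold are illustrative.
const TABLE_KEYWORDS = /\btable\s+\d+\b|supplementary\s+table/i;
const NUMERIC_CELL = /\b\d+(\.\d+)?%?\b/g;

function looksLikeTablePage(pageText: string): boolean {
  if (TABLE_KEYWORDS.test(pageText)) return true;
  // Fall back to numeric density: table pages tend to be full of numbers.
  const numericMatches = pageText.match(NUMERIC_CELL) ?? [];
  const words = pageText.split(/\s+/).filter(Boolean);
  return words.length > 0 && numericMatches.length / words.length > 0.2;
}

function findCandidatePages(pages: string[]): number[] {
  return pages
    .map((text, index) => ({ text, index }))
    .filter(({ text }) => looksLikeTablePage(text))
    .map(({ index }) => index + 1); // 1-based page numbers
}
```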

Parsing and Standardizing

Once we identified the relevant tables (sometimes in supplementary sections rather than the main text), we applied custom logic to transform them into a uniform structure:

  1. Extract Raw Text: From both native PDF text and OCR outputs.
  2. Column & Row Detection: We used domain-specific insights, regex patterns, and numeric indicators to isolate columns and rows (see the sketch after this list).
  3. Data Cleaning: We filtered out irrelevant rows, accounted for missing data, and fixed inconsistencies. Some tables turned out to be incomplete or irrelevant, which we had to skip.
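
As a stripped-down illustration of steps 2 and 3, the sketch below splits raw table text into rows and columns and drops rows with no usable data. The real rules varied by publisher, and the delimiters and filters here are assumptions.

```typescript
// Stripped-down row/column detection for raw table text.
// Real parsing rules varied by publisher; delimiters and filters are illustrative.
interface ParsedRow {
  label: string;
  values: number[];
}

function parseTable(rawText: string): ParsedRow[] {
  return rawText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      // Columns are often separated only by runs of 2+ spaces or tabs.
      const cells = line.split(/\s{2,}|\t+/);
      const values = cells
        .slice(1)
        .map((cell) => cell.replace(/[%,]/g, "").trim())
        .filter((cell) => cell.length > 0)
        .map(Number)
        .filter((n) => Number.isFinite(n));
      return { label: cells[0], values };
    })
    // Drop rows with no usable numeric data (headers, footnotes, stray lines).
    .filter((row) => row.values.length > 0);
}
```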

Aggregating and Visualizing

After processing, we uploaded all extracted data to our Postgres database. This allowed us to:

  • Filter and Format: Narrow the data to exactly what was needed.
  • Visualize with Grafana: Create dashboards to show averages, minima/maxima, and other key statistics. (We considered a few alternative business intelligence tools as well, but chose Grafana for this project because it was cost-effective, powerful, and easy enough to use. A sample aggregation query appears after this list.)
  • Drill Down to Source: Provide links back to original PDFs (stored in a secure bucket) for anyone who needed to verify or explore the raw data.
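
For example, the aggregation behind a typical dashboard panel boils down to a simple query against the extracted data. The table and column names below are made up for illustration.

```typescript
// Example of the kind of aggregation behind a dashboard panel, run against Postgres.
// Table and column names (extracted_values, metric, value) are illustrative.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function summarizeMetric(metric: string) {
  const { rows } = await pool.query(
    `SELECT metric,
            AVG(value) AS average,
            MIN(value) AS minimum,
            MAX(value) AS maximum,
            COUNT(*)   AS samples
       FROM extracted_values
      WHERE metric = $1
      GROUP BY metric`,
    [metric]
  );
  return rows[0];
}
```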

Key Takeaways

Overseas Talent Can Be Very Cost-Effective

For the smaller, more complex sites, it was more cost-effective to hire overseas talent to download the PDFs manually than to have an engineer script each one. This may seem like an odd approach to engineers who default to building an automated solution, but it worked great and saved the client time and money.

Domain Knowledge Matters

Understanding the research context (including commonly used metrics and table formats) helped us correct mistakes early without repeatedly consulting the client.

Err on the Side of Over-Collecting

Missing important data is often worse than collecting too much. We pulled as many potentially relevant tables as possible, then applied robust filtering later.

Optimize Early

Setting up parallel downloads and creating flexible parsing rules at the start saved time in the long run, making updates or reruns much easier.

Final Thoughts

Downloading 14,000 PDFs is no small task, but the bigger challenge often lies in extracting and standardizing crucial information buried inside. By combining automated and manual techniques for file collection, then applying tailored parsing and OCR solutions, we successfully transformed a jumble of PDF tables into a structured dataset suitable for in-depth analysis and visualization.

For anyone facing a similarly complex data-gathering project, the key takeaways are clear: invest in the right balance of tools and manual oversight, ensure you have the capacity to handle multiple formats and inconsistencies, and be prepared to refine your processes as new obstacles arise. This approach will save significant time and help you derive the insights you need.



Written by Fan Pier Labs

Helping startups with Web Development, AWS Infrastructure, and Machine Learning since 2019.
