When “Just Unzip This” Turned Into a Full-Scale Engineering Project
How a half-terabyte of encrypted, nested archives turned into a 27TB, 44-million-file engineering challenge
By Ryan Hughes @ Fan Pier Labs
What started as a simple request to help a client extract some files soon turned into an unexpectedly complex data decompression project, but nothing we can't handle.
The Call: “Can You Swing By?”
A client asked us to drop by their office to help one of their team members uncompress some data from an external hard drive. It sounded straightforward — closer to IT support than software development. No big deal. We packed our laptops and headed over.
When we arrived, we plugged the external HDD into their office desktop… and it immediately threw an error: the drive was corrupted. No access. Great.
Layer One: Encrypted Filesystem
Turns out the drive wasn't corrupted at all; it was encrypted with VeraCrypt, which we didn't realize at first. We installed the software, entered the password, and finally mounted the disk. What we found wasn't one big archive but an unorganized pile of nested, multi-format archive files: .zip, .rar, .tar.gz, .7z, and combinations inside combinations. It was archive inception.
We still thought we could knock it out with a quick script.
Layer Two: Exponential Growth
There was about 500GB of compressed data — but because the archives were nested and not flat, we had no way of accurately estimating the final uncompressed size. A quick test unzipping 1/25th of the data showed a 40x expansion. That put the total size at 20TB+. Way too large for the machine we were on.
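The back-of-envelope estimate, spelled out with the rounded figures above:

```python
# Back-of-envelope estimate from the test unzip (numbers rounded).
compressed_gb = 500        # total compressed data on the drive
expansion_ratio = 40       # expanded size / compressed size, measured on a 1/25th sample
estimated_tb = compressed_gb * expansion_ratio / 1000
print(f"~{estimated_tb:.0f} TB uncompressed")   # ~20 TB
```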
The client bought two 24TB external HDDs. We upgraded their desktop. Now it was software time.
Layer Three: Building a Robust Decompression Engine
We wrote a recursive decompression script in Python. It traversed folders depth-first, keeping a work list of paths and unpacking archives layer by layer. We intentionally avoided external state (no database), relying instead on the file system: when an archive was unpacked, we replaced it with a folder and renamed the original, so the process could resume where it left off.
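Here's a stripped-down sketch of the idea using only Python's standard library. The real script handled more formats (.7z and .rar via external tools) and far more edge cases, and the drive letter is purely illustrative:

```python
# Minimal sketch of the filesystem-as-state approach. Extracted content lands in a
# sibling folder and the finished archive is renamed, so a restarted run can tell
# completed work from pending work without any external database.
import shutil
from pathlib import Path

def is_archive(path: Path) -> bool:
    # .7z and .rar were handled by shelling out to an external tool (omitted here)
    return path.name.lower().endswith((".zip", ".tar", ".tar.gz", ".tgz"))

def unpack_tree(root: Path) -> None:
    stack = [root]                       # depth-first work list of folders to scan
    while stack:
        folder = stack.pop()
        for entry in list(folder.iterdir()):
            if entry.is_dir():
                stack.append(entry)      # descend into subfolders
            elif is_archive(entry):
                dest = entry.with_suffix("")            # data.zip -> data/
                dest.mkdir(exist_ok=True)
                shutil.unpack_archive(str(entry), str(dest))
                entry.rename(entry.with_name(entry.name + ".done"))
                stack.append(dest)       # the new folder may contain more archives

if __name__ == "__main__":
    unpack_tree(Path("E:/client-drive"))  # illustrative mount point
```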
It worked. Until it didn’t.
Layer Four: Windows Pain
We hit major problems:
- Files with non-ASCII names broke some tools.
- Path lengths exceeded Windows’ 260-character MAX_PATH limit.
- Some archives were corrupt or password-protected, crashing the whole run.
- Decompression on their Windows desktop computer was simply too slow.
We also tried to speed things up by multi-processing across multiple CPU cores on the Windows machine. But the desktop was limited — both in raw CPU performance and number of cores. Even with our optimizations, the uncompression remained slow. It was also painful to debug. Just listing all the files on the external HDDs could take over an hour. Development cycles were brutal — run a job, wait 2 days, find a bug, restart. To top it off, the only way to check status was to physically go to their office in Boston.
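The desktop-era parallelism looked roughly like this: a process pool pulls archives off a list, and a failure in one archive is reported instead of killing the whole run. The pool size, paths, and format handling here are illustrative:

```python
# Rough sketch of parallel extraction on the desktop. One process per archive;
# failures are isolated and logged for a later pass rather than crashing the batch.
import shutil
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def safe_extract(archive: Path) -> tuple[str, str]:
    try:
        dest = archive.with_suffix("")
        dest.mkdir(exist_ok=True)
        shutil.unpack_archive(str(archive), str(dest))
        archive.rename(archive.with_name(archive.name + ".done"))
        return str(archive), "ok"
    except Exception as exc:       # corrupt, password-protected, path too long, ...
        return str(archive), f"skipped ({exc})"

def run_batch(archives: list[Path], workers: int = 4) -> None:
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(safe_extract, a) for a in archives]
        for future in as_completed(futures):
            print(*future.result())   # failed archives get logged for a later pass

if __name__ == "__main__":            # guard required for multiprocessing on Windows
    run_batch(sorted(Path("E:/client-drive").rglob("*.zip")))
```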
We needed a faster, more flexible solution.
Layer Five: Moving to AWS
We spun up an S3 bucket and uploaded the archive files, then migrated our decompression system to Amazon EC2 and Amazon SQS (a sketch of the worker loop follows this list). This let us:
- Parallelize the work across multiple compute nodes.
- Use bigger and faster machines than the in-office Windows desktop — more cores, more RAM, faster disks.
- Track decompression jobs with SQS.
- Move corrupted/failed jobs to a dead letter queue.
- Avoid path length issues (S3 object keys can be up to 1,024 bytes long).
- Boost throughput: Local HDDs on the Windows machine were giving us 30–40 Mbps read/write speeds. On AWS, we were seeing over 900 Mbps between EC2 and S3 — more than 20× faster.
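In spirit, each EC2 worker ran a loop like the one below. The queue URL, bucket name, key layout, and the use of shutil are placeholders, and seed messages for the top-level archives are enqueued separately; the production workers handled more formats and failure modes:

```python
# Minimal sketch of one EC2 worker: each SQS message carries the S3 key of an
# archive to unpack. Names (queue URL, bucket, prefixes) are illustrative.
import json
import shutil
import tempfile
from pathlib import Path

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/unpack-jobs"  # placeholder
BUCKET = "client-archive-data"                                              # placeholder

def process(key: str) -> None:
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / Path(key).name
        s3.download_file(BUCKET, key, str(local))
        out_dir = Path(tmp) / "out"
        shutil.unpack_archive(str(local), str(out_dir))   # .7z/.rar via external tool
        for f in out_dir.rglob("*"):
            if not f.is_file():
                continue
            dest_key = f"extracted/{key}/{f.relative_to(out_dir)}"
            s3.upload_file(str(f), BUCKET, dest_key)
            if f.name.lower().endswith((".zip", ".tar", ".tar.gz", ".tgz")):
                # nested archive: enqueue it as a new job
                sqs.send_message(QueueUrl=QUEUE_URL,
                                 MessageBody=json.dumps({"key": dest_key}))

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        key = json.loads(msg["Body"])["key"]
        try:
            process(key)
        except Exception:
            continue   # leave the message; repeated failures go to the dead letter queue
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```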
Now we could process the entire dataset in less than 12 hours instead of 7 days — and without leaving our desks.
Layer Six: What Comes Next
Now that the data lives in Amazon S3 — and the uncompression is complete — we’ve got options. A lot of them.
With roughly 27 terabytes of output and over 44 million files, the challenge shifts from decompression to accessibility and analysis. Here’s what we could build next:
- Vectorize the data for RAG pipelines: We could extract text from the files, generate embeddings, and build a retrieval-augmented generation (RAG) pipeline. This would let the client “talk to their data” using natural language, powered by OpenAI or open-source LLMs.
- Semantic search with Elasticsearch or OpenSearch: If the goal is less generative and more traditional search, we can index the content or metadata and enable fast, accurate search across millions of files.
- Filename search: Even if content indexing is too heavy, we could still index just the file names into Elasticsearch for lightweight search and filtering (see the sketch after this list).
- Batch processing and tagging: With all data in S3, we can build tagging jobs (e.g. “mark all PDFs over 50MB,” or “label files with sensitive keywords”) using Lambda, Step Functions, or containerized workers.
- Audit trail and analytics: Want to know how many files came from a specific nested archive? Or what file formats are most common? We can scan and summarize the full dataset efficiently now that it’s flat and accessible.
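To make the lightest-weight of these concrete, here's roughly what filename-only indexing could look like, assuming an Elasticsearch endpoint is reachable and using placeholder bucket and index names:

```python
# Rough sketch of the filename-only option: stream S3 keys into an Elasticsearch
# index for fast name/extension/size search. Endpoint, bucket, and index are placeholders.
import boto3
from elasticsearch import Elasticsearch, helpers

s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")   # placeholder endpoint
BUCKET = "client-archive-data"                # placeholder bucket
INDEX = "filenames"

def actions():
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="extracted/"):
        for obj in page.get("Contents", []):
            yield {
                "_index": INDEX,
                "_source": {
                    "key": obj["Key"],
                    "name": obj["Key"].rsplit("/", 1)[-1],
                    "size": obj["Size"],
                    "last_modified": obj["LastModified"].isoformat(),
                },
            }

helpers.bulk(es, actions())   # 44M small docs: run in batches, ideally from inside AWS
```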
The key takeaway: once the data is in AWS, it’s not just uncompressed — it’s unlocked. Whatever the next layer looks like, we can build it.
What We Learned
This was a great example of a project that looked trivial on the surface but turned into a full-fledged engineering problem:
- Filesystem constraints matter.
- Recursion + parallelism + state tracking = necessary complexity.
- AWS makes the hard parts (scaling, durability, visibility) easier.
- The right architecture saves weeks of dev time.
We helped the client fully uncompress and recover their data. And if they ever want to search or index it, their entire dataset is already in S3 — ready for scalable processing.
Need help with high-volume data pipelines or infrastructure challenges?
We’ve seen — and untangled — it all. Let’s talk.
