

Scaling a Real-Time Gaming Platform: How We Helped Our Client Support Thousands of Concurrent Players

6 min read · Jan 16, 2025


By Ryan Hughes

When we first engaged with this startup — a real-time online game platform — they were already in trouble. Their platform allowed players to compete in a board-game-like environment (think multiplayer chess, but with more players and added complexity). However, as soon as they got a surge of users trying to play simultaneously, the backend would buckle under the load, often crashing entirely.

In this post, we’ll walk you through the core challenges our client faced, why their existing setup was failing, and the steps we took to help them achieve both immediate scalability gains and a sustainable, long-term infrastructure plan.

The Challenge

Single-Threaded Backend

The entire game logic lived in a container running as a single AWS ECS Fargate task. The application was built with single-threaded Python/Django. Simply adding threads wasn't trivial because the game state was stored in Python memory; with multiple threads, we would have needed a mechanism to coordinate state updates between them. Moving the application to more powerful (and more expensive) EC2 instances wouldn't solve the problem either, because larger instances generally don't have faster cores, just significantly more of them.
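
To make the constraint concrete, here is a simplified, hypothetical sketch of the pattern (the names are ours, not the client's): the game state is a plain module-level dictionary, visible only to the one process that owns it.

```python
# Hypothetical illustration of the original pattern: all game state lives
# in the memory of one Python process.
GAMES: dict[str, dict] = {}  # only this process can see or update it

def apply_move(game_id: str, player_id: str, square: str) -> dict:
    """Mutate in-process state: no locking, no persistence."""
    game = GAMES.setdefault(game_id, {"board": {}, "turn": None})
    game["board"][square] = player_id
    game["turn"] = player_id
    return game  # gone forever if this process crashes
```

Any second thread or process either can't see this dictionary at all or has to coordinate every update to it, which is exactly why "just add threads" wasn't a quick fix.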

All Game State in Memory

The application stored each game’s state (board positions, moves, etc.) in the same memory space as the Python process. This meant that if that single instance crashed, every active game was lost. Considering that players often had real money on the line, this was obviously unacceptable.

Overloaded WebSocket Connections

Communication between clients and the backend happened entirely via WebSockets. Each player established a persistent connection to the single Python backend. As user counts grew, so did the number of active WebSocket connections — saturating the EC2 instance, which had to juggle the game logic and the management of thousands of open connections.

Monolith

All of the subsystems, including authentication and profile management, ran in the same codebase. When the server went down under load, the entire application went down with it, including the ability to log in, and we had no way to show users a proper error message explaining the situation. The result was a complete, catastrophic outage. Apart from the game logic, which used WebSockets, every other function of the app followed stateless REST principles.

Short-Term vs. Long-Term Needs

The client needed two things:

  • Immediate scaling relief to prevent the server from crashing and, ideally, handle 2x, 3x, or 4x as many concurrent users.
  • A long-term plan that could potentially handle 10x or 20x the current user load and still offer a reliable, resilient experience.

Potential Changes

After diagnosing the performance bottlenecks, we identified five key interventions that could help us address the scalability and reliability issues.

Break up Monolith

We discussed splitting the monolithic application across multiple servers. The functionality outside the core game logic followed stateless REST principles, so we could deploy replica servers to handle everything except the game itself and scale that tier horizontally by simply adding more instances. Additionally, if the core game logic crashed under load, the rest of the app would keep working.

Install Datadog

We needed a real-time monitoring and analytics platform to pinpoint where the Python process was spending most of its CPU time. Datadog would give us the visibility and actionable insights necessary to tackle the biggest bottlenecks first.

Use Redis Channels

We considered implementing Redis Channels (or a similar approach) to enable multiple Python processes or threads to run on the same EC2 instance. Redis would keep track of which process was handling which user and coordinate the broadcast of messages. However, while this might alleviate some load by leveraging multiple cores, it wouldn’t fundamentally help us scale beyond a single EC2 instance.
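
For reference, wiring Django Channels to a Redis-backed channel layer is mostly a settings change. A minimal sketch, assuming the channels and channels_redis packages and a reachable Redis instance:

```python
# settings.py (sketch): a Redis-backed channel layer lets several
# Django/Channels processes exchange messages and share group membership.
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            # Assumed local Redis endpoint; in practice this would point
            # at ElastiCache or another managed Redis cluster.
            "hosts": [("127.0.0.1", 6379)],
        },
    },
}
```

This spreads work across cores on one machine, but the game state itself would still need to be shared somewhere, which is why it didn't solve the bigger scaling question for us.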

Offload WebSocket Handling to a Gateway

We discussed setting up a gateway (either AWS API Gateway or a custom-built one in Node.js) to handle all the WebSocket connections. This would relieve the single Django process of maintaining thousands of persistent connections and allow us to scale that layer independently of the core game logic.
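
As a rough illustration of what this looks like on the backend side (the callback URL and connection IDs below are placeholders, not the client's real values), the game process stops holding sockets and instead pushes messages back out through API Gateway's management API:

```python
import json

import boto3

# Placeholder for the API Gateway WebSocket callback endpoint
# (https://{api-id}.execute-api.{region}.amazonaws.com/{stage}).
CALLBACK_URL = "https://example-api-id.execute-api.us-east-1.amazonaws.com/prod"

apigw = boto3.client("apigatewaymanagementapi", endpoint_url=CALLBACK_URL)

def push_to_player(connection_id: str, payload: dict) -> None:
    """Send a game update to one player via API Gateway rather than a
    socket held open by the game process itself."""
    apigw.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps(payload).encode("utf-8"),
    )
```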

Migrate off Django

We also discussed moving from Django to a more speed-optimized framework, such as FastAPI or Flask. FastAPI in particular stood out because it’s known for high performance, modern features, and strong support for asynchronous requests.
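
To make the contrast concrete, here is a minimal FastAPI endpoint (the route and model are illustrative, not the client's actual API). Async handlers let a single process interleave many I/O-bound requests:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Move(BaseModel):
    player_id: str
    square: str

@app.post("/games/{game_id}/moves")
async def submit_move(game_id: str, move: Move) -> dict:
    # An async handler lets the event loop keep serving other requests
    # while this one waits on the database or Redis.
    return {"game_id": game_id, "accepted": True}
```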

Our Decision: Balancing Short-Term Wins and Future Scalability

We wanted to ensure that any changes we made would solve the immediate load issues and also position us for significant growth down the line. After discussing each solution, we landed on the following plan:

Deployed Replicas

The application was already containerized, built through a standard GitHub Actions pipeline, and managed with Terraform. To create replicas, we only needed to use Terraform to stand up a secondary ECS cluster running the same container. We then modified the layer-7 (application) load balancer fronting the application to route all game-related traffic, identified by its path, to the game ECS cluster and everything else to the replica cluster, which we called the administrative cluster. We deployed several identical tasks into the administrative cluster for extra redundancy and capacity, which reduced the scope of the problem to just the game logic.

Because the administrative and game clusters ran the same version of the app, the deployment process did not get any more complicated: both clusters referenced the same version number as an argument to the deploy step, so nothing about the team's development workflow had to change to accommodate the additional cluster.
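
The actual routing lived in Terraform, but for illustration the listener rule is roughly equivalent to the following boto3 sketch (the ARNs and paths are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs; the real values come from the Terraform-managed resources.
LISTENER_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/example/abc/def"
GAME_TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/game/abc"

# Send game traffic (matched by path) to the game cluster; everything else
# falls through to the administrative cluster via the listener's default action.
elbv2.create_rule(
    ListenerArn=LISTENER_ARN,
    Priority=10,
    Conditions=[{"Field": "path-pattern", "Values": ["/game/*", "/ws/*"]}],
    Actions=[{"Type": "forward", "TargetGroupArn": GAME_TARGET_GROUP_ARN}],
)
```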

Install Datadog

This was an easy win, giving us immediate visibility into system bottlenecks and helping us make data-driven decisions. Specifically, we installed Datadog's Application Performance Monitoring (APM) and its Continuous Profiler. APM let us trace each request from the moment it entered the backend all the way to the database, showing which pieces of code were triggered along the way, while the profiler showed exactly where the CPU was spending its time.
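
For teams setting this up themselves, enabling both pieces from Python is a few lines with the ddtrace library; a sketch (the same effect can be had with the ddtrace-run wrapper and environment variables):

```python
# Sketch: enable Datadog APM auto-instrumentation and the continuous
# profiler at application startup. Roughly equivalent to running under
# `ddtrace-run` with DD_PROFILING_ENABLED=true.
from ddtrace import patch_all
from ddtrace.profiling import Profiler

patch_all()          # auto-instrument supported libraries (Django, redis, ...)
profiler = Profiler()
profiler.start()     # continuous CPU / wall-time profiling

# ...then start the WSGI/ASGI application as usual.
```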

Migrate Off Django

Although it involved refactoring work, moving to a more performance-oriented framework provided a strong foundation for both near-term optimizations and future growth. We considered both Flask and FastAPI, and chose FastAPI because it is heavily optimized for performance and asynchronous workloads.

Offload WebSocket Handling to a Gateway

This took longer, but offloading connections promised major short-term and long-term performance boosts.

Exclude Redis Channels

We decided not to use Redis Channels because it wouldn’t address our overarching goal of scaling beyond a single server — we’d have to unwind all this work to scale to multiple servers.

After our work, our client was able to scale to handle significantly more users and, thanks to Datadog, had the insights they needed to see what was happening in their infrastructure.

Lessons

Divide and Conquer

By segmenting the game-logic problem away from the rest of the application's functions, we could focus quickly on just the game logic. This let us deliver a quick win to the customer and build confidence while we worked on a more sophisticated solution.

Monitoring is Key

Tools like Datadog can quickly highlight critical bottlenecks and guide your optimization efforts. Without monitoring tools in place, engineers can spend days poking around to determine where the issues lie.

Externalize Critical State

Storing important data in a single instance's memory inevitably creates a single point of failure and becomes a significant burden when scaling.
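
As a hedged sketch of what externalizing state can look like with redis-py (the key naming is illustrative), any process can read or write a game, and the state survives a crash or redeploy:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)  # assumed Redis endpoint

def save_game(game_id: str, state: dict) -> None:
    # The state now lives outside the process, so a crash or redeploy
    # no longer wipes active games.
    r.set(f"game:{game_id}", json.dumps(state))

def load_game(game_id: str) -> dict | None:
    raw = r.get(f"game:{game_id}")
    return json.loads(raw) if raw is not None else None
```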

Don’t Underestimate WebSocket Overhead

Maintaining thousands of persistent connections is no small feat. Consider dedicated gateways or services that specialize in this; AWS API Gateway, for example, can handle hundreds of thousands to millions of connections, far more than a single server could.

Plan for Both Today and Tomorrow

Balancing quick fixes with forward-looking architecture ensures you can handle immediate surges while setting yourself up for long-term success.

Conclusion

By deploying replicas, installing Datadog, setting up AWS API Gateway for WebSocket management, and migrating off Django, we helped this gaming startup stabilize their platform in the short term while preparing for robust growth in the long term.

If your team is hitting similar scalability hurdles, whether it's a single-threaded bottleneck or a need to handle thousands of concurrent WebSocket connections, reach out and let us know. We can work with you to identify the most urgent issues and build a strategy that tackles them while positioning you for the future. With the right tools and a well-thought-out plan, it's entirely possible to transform a fragile, single-instance setup into a fault-tolerant, highly scalable platform.

Written by Fan Pier Labs

Helping startups with Web Development, AWS Infrastructure, and Machine Learning since 2019.
