Scaling a Real-Time Gaming Platform: How We Helped Our Client Support Thousands of Concurrent Players
By Ryan Hughes
When we first engaged with this startup — a real-time online game platform — they were already in trouble. Their platform allowed players to compete in a board-game-like environment (think multiplayer chess, but with more players and added complexity). However, as soon as they got a surge of users trying to play simultaneously, the backend would buckle under the load, often crashing entirely.
In this post, we’ll walk you through the core challenges our client faced, why their existing setup was failing, and the steps we took to help them achieve both immediate scalability gains and a sustainable, long-term infrastructure plan.
The Challenge
Single-Threaded Backend
The entire game logic lived in a container running as a single AWS ECS Fargate task. The application was built on single-threaded Python/Django. It wasn't trivial to simply use more threads, because the game state was stored in Python memory; with multiple threads, we would have needed a mechanism to coordinate state updates between them. And simply moving the application to more powerful (and more expensive) EC2 instances wouldn't solve the problem, because larger servers generally don't include faster cores; they just include significantly more of them, which a single-threaded process can't use.
All Game State in Memory
The application stored each game's state (board positions, moves, etc.) in the Python process's own memory. This meant that if that single instance crashed, every active game was lost. Considering that players often had real money on the line, this was obviously unacceptable.
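To make the fragility concrete, here is a simplified sketch of the pattern, not the client's actual code. All the names are illustrative; the point is that the state is a module-level structure that only one thread can safely touch and that disappears with the process:

```python
# Simplified sketch of the original pattern (names are illustrative).
# All state lives in process memory: a crash loses every active game,
# and adding threads would require locking around every mutation.
games: dict[str, dict] = {}

def apply_move(game_id: str, player: str, move: str) -> None:
    game = games.setdefault(game_id, {"moves": [], "players": set()})
    game["players"].add(player)
    game["moves"].append((player, move))  # never persisted anywhere
```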
Overloaded WebSocket Connections
Communication between clients and the backend happened entirely over WebSockets. Each player established a persistent connection to the single Python backend. As user counts grew, so did the number of active WebSocket connections, saturating the single instance, which had to juggle the game logic alongside the management of thousands of open connections.
Monolith
All of the subsystems, including authentication and profile management, ran in the same codebase. This meant that when the server went down from overload, the entire application went down with it, including the ability to log in. It also took away our ability to show users proper error messages explaining the situation, turning an outage into a complete, catastrophic collapse. Apart from the game logic, which used WebSockets, every other function of the app followed proper stateless REST principles.
Short-Term vs. Long-Term Needs
The client needed two things:
- Immediate scaling relief to prevent the server from crashing and, ideally, handle 2x, 3x, or 4x as many concurrent users.
- A long-term plan that could potentially handle 10x or 20x the current user load and still offer a reliable, resilient experience.
Potential Changes
After diagnosing the performance bottlenecks, we identified five key interventions that could help us address the scalability and reliability issues.
Break up Monolith
We discussed splitting the monolithic application into multiple services. The functionality outside the core game logic already followed stateless REST principles, meaning we could deploy replica servers to handle everything except the game itself. Because those pieces were stateless, we could scale them horizontally simply by adding more replicas. Additionally, if the core game logic crashed under load, the other functions of the app would keep operating.
Install Datadog
We needed a real-time monitoring and analytics platform to pinpoint where the Python process was spending most of its CPU time. Datadog would give us the visibility and actionable insights necessary to tackle the biggest bottlenecks first.
Use Redis Channels
We considered implementing Redis Channels (or a similar approach) to enable multiple Python processes or threads to run on the same EC2 instance. Redis would keep track of which process was handling which user and coordinate the broadcast of messages. However, while this might alleviate some load by leveraging multiple cores, it wouldn’t fundamentally help us scale beyond a single EC2 instance.
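The core idea is a shared pub/sub layer: any process can publish a move, and Redis fans it out to whichever process holds that player's connection. A minimal sketch of the mechanism, assuming the redis-py client; the channel scheme and function names are ours, not the client's code:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def broadcast_move(game_id: str, move_json: str) -> None:
    # Any process can publish; Redis fans the message out to every
    # process subscribed to this game's channel.
    r.publish(f"game:{game_id}", move_json)

def relay_moves(game_id: str) -> None:
    # Each worker subscribes to the games whose players it serves and
    # forwards incoming moves down its local WebSocket connections.
    pubsub = r.pubsub()
    pubsub.subscribe(f"game:{game_id}")
    for message in pubsub.listen():
        if message["type"] == "message":
            print("would forward to local sockets:", message["data"])
```

Even with this coordination in place, every process still shares one machine's CPU and network limits, which is why the option didn't solve the deeper scaling problem.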
Offload WebSocket Handling to a Gateway
We discussed setting up a gateway (either API Gateway or a custom-built one using Node.js) to handle all the WebSocket connections. This would relieve the single Django process of having to maintain thousands of persistent connections, allowing us to scale that layer independently of the core game logic.
Migrate off Django
We also discussed moving from Django to a more speed-optimized framework, such as FastAPI or Flask. FastAPI in particular stood out because it’s known for high performance, modern features, and strong support for asynchronous requests.
Our Decision: Balancing Short-Term Wins and Future Scalability
We wanted to ensure that any changes we made would solve the immediate load issues and also position us for significant growth down the line. After discussing each solution, we landed on the following plan:
Deployed Replicas
The application was containerized, built through a standard CI pipeline on GitHub Actions, and managed with Terraform. Creating replicas therefore just meant using Terraform to stand up a secondary ECS cluster running tasks with the same container image. We then modified the OSI layer 7 load balancer fronting the application to send all game-related traffic to the game ECS cluster, and all other requests to the replica cluster, which we called the administrative cluster. We differentiated between the two types of requests by path. Finally, we deployed many identical tasks into the administrative cluster to add redundancy and handle more traffic. This reduced the scope of the problem to just the game logic.

Because the administrative and game clusters ran the same version of the app, the deployment process didn't get any more complicated. Both clusters referenced the same version number as an argument to the deploy process, so nothing about the team's development process had to change to accommodate the additional cluster.
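The routing split itself lived in the load balancer's listener rules rather than in application code, but the decision it encodes is simple enough to sketch. The path prefixes here are hypothetical, not the client's actual routes:

```python
# Illustrative only: the real split was a listener rule on the load
# balancer, not application code. Path prefixes are hypothetical.
GAME_PATH_PREFIXES = ("/ws/", "/game/")

def target_cluster(path: str) -> str:
    # Game traffic goes to the game cluster; everything else fans out
    # across the replicated administrative cluster.
    return "game" if path.startswith(GAME_PATH_PREFIXES) else "administrative"
```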
Install Datadog
This was an easy win, giving us immediate visibility into system bottlenecks and helping us make data-driven decisions. Specifically, we installed Datadog's Application Performance Monitoring (APM) service and Datadog's Profiling system. APM let us trace requests from the point they entered the backend all the way to the database, showing which pieces were triggered by each request. Datadog Profiling showed us exactly where the CPU was spending its time.
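For Python services, Datadog's ddtrace library instruments common web frameworks automatically when the app launches under its runner; custom spans can then mark game-specific hot paths. A sketch of the idea, where the function and span names are our illustration rather than the client's code:

```python
# Launch with `ddtrace-run python app.py` (and DD_PROFILING_ENABLED=true
# for the profiler); framework-level spans are created automatically.
from ddtrace import tracer

@tracer.wrap("game.apply_move")
def apply_move(game_id: str, move: dict) -> None:
    # Appears as its own span in APM traces, so hot game-logic paths
    # stand out from framework overhead.
    ...
```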
Migrate Off Django
Although it involved refactoring work, moving to a more performance-oriented framework provided a strong foundation for both near-term optimizations and future growth. We considered both Flask and FastAPI, and went with FastAPI because it is heavily optimized for performance and asynchronous workloads.
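One concrete benefit for this workload: FastAPI runs each WebSocket connection as an asyncio task, so a single process can multiplex many sockets without the thread-coordination problem described earlier. A minimal sketch, with illustrative routes and messages:

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/game/{game_id}")
async def game_socket(websocket: WebSocket, game_id: str) -> None:
    await websocket.accept()
    try:
        while True:
            # Each connection is an asyncio task; awaiting here yields
            # the event loop to every other open socket.
            move = await websocket.receive_text()
            await websocket.send_text(f"received move in {game_id}: {move}")
    except WebSocketDisconnect:
        pass  # player dropped; clean-up would go here
```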
Offload WebSocket Handling to a Gateway
This took longer to implement, but offloading connections promised major performance gains in both the short and long term.
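With API Gateway terminating the sockets, the backend no longer holds connections at all; it pushes messages back out through the gateway's connection-management API. A sketch assuming boto3, where the endpoint URL and helper name are made up for illustration:

```python
import json

import boto3

# API Gateway exposes a per-stage callback URL; this one is made up.
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.us-east-1.amazonaws.com/prod",
)

def push_to_player(connection_id: str, payload: dict) -> None:
    # The gateway owns the persistent connection; the backend just
    # posts to it by ID and stays free of socket bookkeeping.
    apigw.post_to_connection(
        ConnectionId=connection_id,
        Data=json.dumps(payload).encode(),
    )
```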
Exclude Redis Channels
We decided not to use Redis Channels because it wouldn’t address our overarching goal of scaling beyond a single server — we’d have to unwind all this work to scale to multiple servers.
After our work, our client was able to scale to handle significantly more users and, thanks to Datadog, had the insights they needed to see what was happening in their infrastructure.
Lessons
Divide and Conquer
By finding a way to segment the game-logic problem away from the rest of the application's functions, we were able to focus quickly on just the game logic. This allowed us to deliver a quick win to the customer and build confidence while we built a more sophisticated solution.
Monitoring is Key
Tools like Datadog can quickly highlight critical bottlenecks and guide your optimization efforts. Without monitoring tools in place, engineers can spend days poking around to determine where the issues lie.
Externalize Critical State
Storing important data in a single instance's memory inevitably creates a single point of failure and becomes a significant burden when scaling.
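One common way to externalize that state, sketched here as an approach rather than what this client ultimately shipped, is to write each game's state to a shared store such as Redis, so a replacement process can pick up where a crashed one left off:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)

def save_state(game_id: str, state: dict) -> None:
    # Persisting after every move means a crashed process can be
    # replaced without losing active games.
    r.set(f"game:{game_id}:state", json.dumps(state))

def load_state(game_id: str) -> dict | None:
    raw = r.get(f"game:{game_id}:state")
    return json.loads(raw) if raw is not None else None
```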
Don’t Underestimate WebSocket Overhead
Maintaining thousands of persistent connections is no small feat; consider dedicated gateways or services that specialize in this. API Gateway, for example, can maintain hundreds of thousands, even millions, of concurrent connections, far more than a single server could handle.
Plan for Both Today and Tomorrow
Balancing quick fixes with forward-looking architecture ensures you can handle immediate surges while setting yourself up for long-term success.
Conclusion
By deploying replicas, installing Datadog, setting up AWS API Gateway for WebSocket management, and migrating off Django, we helped this gaming startup stabilize their platform in the short term while preparing for robust growth in the long term.
If your team is hitting similar scalability hurdles, whether it's a single-threaded bottleneck or a need to handle thousands of concurrent WebSocket connections, reach out and let us know. We can work with you to identify the most urgent issues and build a strategy that tackles them while positioning you for the future. With the right tools and a well-thought-out plan, it's entirely possible to transform a fragile, single-instance setup into a fault-tolerant, highly scalable platform.