After dealing with a TLS certificate expiration, Epic Games decided to make the experience a teaching moment for others. We’ll cover some of the key takeaways they shared and how you can prevent the same thing from happening to your business.
This server is unavailable.
These four words deliver feelings of dread and aggro to gamers as effectively as a punch to the gut. It means no battlegrounds, raids, or hours of exciting weeknight gameplay with friends. Or, worse, you might have to spend your free time with family instead — and what teenager wants that? Gross.
Seriously, though, widespread service disruptions can have a huge impact. The global online gaming market is big business: as of 2020, it was worth $167 billion and is anticipated to reach $287.1 billion by 2026, according to recent data from ResearchAndMarkets.com. And online gaming service outages don’t just affect kids and teens. Data from Limelight’s State of Online Gaming 2020 report shows that many gamers fall within the 26-35 age category (30.2%), followed by 36-45 year olds (28.3%) and gamers who are 60+ (26.8%).
Some of the worst types of downtime for businesses are those that are entirely avoidable… you know, like SSL/TLS certificate expirations.
Unfortunately, Epic Games (EG) fans (players of games like Fortnite, Houseparty and Rocket League) discovered what happens when a company allows even just one of its SSL/TLS certificates to expire. But unlike many companies in their position, Epic Games didn’t try to hide or downplay their mistake. Instead, they decided to be a boss and openly talked about the April 6 incident in an online article. Their goal? To help other companies learn from their mistakes.
Kudos, Epic Games. We respect that. And in honor of your uncommon transparency, we’re going to go over the highlights of your report and cover what companies can do differently to avoid ending up in the same position.
Let’s hash it out.
An Epic Play-by-Play: Breaking Down What Occurred
Certificate expirations suck no matter how you look at it. For businesses, they make a bad impression and leave you non-compliant. Users lose access to the services or products they paid for. Digital certificates are your organization’s digital identity as well as a way to secure your services, websites, and data from unauthorized access. And when even “only” one certificate expires, it creates a slew of problems that no organization wants to deal with.
In Epic Games’ situation, one of the internal TLS certificates they were using to encrypt their backend services for internal management tools and cross-service API calls expired. Of course, it’s important to note that it just takes one certificate to create a big mess. But in this case, thankfully, EG quickly narrowed down the issue to an expired certificate and got people from across their various teams to work together to resolve the issue.
But just how did everything go down? Epic Games was kind enough to provide a detailed timeline of events as they occurred on Tuesday, April 6 in their article.
Rather than go over every detail of that timeline in depth, we’re going to give you the highlights.
- They discovered that an internal wildcard SSL/TLS certificate expired. This certificate, which touched many internal backend services across their IT ecosystem, led to widespread service outages for users and employees alike. This immediately led EG’s IT team to go into incident management mode to deal with the issue.
- 25 minutes later, they started the certificate reissuance process. Thankfully, it didn’t take long for them to discover an expired certificate was the culprit behind the service outages. They quickly started the certificate reissuance process, which allowed them to start the recovery of select services. But the situation doesn’t end there…
- Their internal teams discovered other issues with connected services over the next few hours. A series of events and issues led them to identify other things that were amiss within their IT ecosystem that affected their launcher client and online store. Some of these issues included missing assets and invalid content. Luckily, EG says they were able to attain full recovery of all their affected services and systems by 5:35 p.m. UTC.
Epic Games reports that the whole situation lasted a little more than 5.5 hours from start to finish. But it seems like the online gaming giant took the hit to the chin like a champ and responded quickly to resolve the issues. They also decided to use it as an opportunity to spread the word about the importance of implementing effective certificate management. (We’ll speak more to that momentarily…)
Area of Effect: Who and What Were Impacted By the Certificate Expiration
Epic Games is a company with a large and growing customer base. Their Epic Games Store 2020 Year in Review report shares that their EGS community has 31.3 million daily active users (DAUs), which is a 192% increase over the previous year. They also report having more than 160 million Epic Games Store PC users who spent more than $700 million in 2020. So, you can see that we’re not talking about a small market here.
Because Epic Games used the affected wildcard certificate across hundreds of different production services, it means that the impact of its expiration was widespread across their ecosystem. This affected both their customers who were trying to use their products and their employees who were attempting to resolve and manage the downtime-related issues.
The biggest impacts were felt by their identity and authentication systems. As you can imagine, this resulted in:
- User login and purchase failures across multiple products and systems. This means anyone trying to log in during the hours of the outage couldn’t do so. They also couldn’t purchase items in the Epic Games Launcher client.
- Live service and gameplay disconnections and website failures. For users already in the middle of gaming, this boot from live gameplay resulted in extra frustrations because they couldn’t reconnect. EG’s product and marketing websites also were experiencing a lot of 403 errors due to an unrelated container update that had been made the day before.
- EG employees’ hands being temporarily tied due to internal tooling issues. The people who get it the worst in downtime situations are the customer service employees.
There Were Some Unexpected Positives That Came Out of the Situation…
An issue that started with an expired internal certificate quickly morphed into something much bigger. It served as an opportunity for EG to identify other unrelated issues that existed within their systems that they otherwise may have not discovered until cybercriminals exploited them.
One example is the “unexpected behaviors” that they discovered in the Epic Games Launcher client that resulted in unusual call patterns. It turns out, clients were using linear retry logic rather than a truncated exponential backoff. The former keeps retrying at a steady rate for as long as requests fail; the latter waits progressively longer between attempts, up to a fixed maximum, to prevent excessive connection attempts that increase traffic loads.
As a result, every time a user’s client sent a failed connection request, it would continuously send additional requests until it would receive a successful response. This glitch caused millions of launcher clients globally to send repeated requests continuously, which overloaded their systems. The result? “We were effectively DDoSed by our own clients.” This incident also enabled EG to discover issues in their web application firewall (WAF) ruleset. Fortunately, they were able to reduce the traffic and are now aware of their need for a standard process to deal with similar issues in the future.
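To make the difference concrete, here is a minimal Python sketch of truncated exponential backoff with random jitter. It is not Epic Games’ launcher code, and their article doesn’t give the real retry parameters, so the function names, the 1-second base, the 60-second cap, and the attempt limit are all assumptions made up for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Longest wait allowed before retry number `attempt` (0-based).

    The bound doubles with each failure and is truncated at `cap`.
    The base and cap values are illustrative assumptions.
    """
    return min(cap, base * (2 ** attempt))


def jittered_delay(attempt, base=1.0, cap=60.0):
    """Pick a random wait between 0 and the truncated bound ("full jitter").

    Randomizing the wait keeps millions of clients from retrying in lockstep.
    """
    return random.uniform(0, backoff_delay(attempt, base, cap))


def call_with_backoff(operation, max_attempts=6, base=1.0, cap=60.0,
                      sleep=time.sleep):
    """Run `operation`, retrying on ConnectionError with growing waits.

    Gives up and re-raises after `max_attempts` failures instead of
    retrying forever.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(jittered_delay(attempt, base, cap))
```

With these assumed defaults, the upper bound on the wait grows 1, 2, 4, 8, 16, 32 seconds and then stays at 60, so a struggling backend sees fewer and fewer requests from each client rather than a steady stream.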
A second unrelated issue they discovered affected the traffic on their Epic Games Store website. Instances were trying to fetch an asset ID that didn’t seem to exist, resulting in a bunch of 403 errors. After discovering the cause of the issue, they quickly fixed it and restored valid traffic.
The good news is that this certificate expiration set off a chain of events that forced Epic Games to take a hard look at their internal processes and tools. For example, they may not have realized the issue with their retry logic without their system first becoming overloaded with client traffic. This allowed them to see where they went wrong and implement changes, as well as share their insights to help others avoid following in their footsteps. So, while certificate mismanagement isn’t good, at least there was a relatively happy ending in this particular situation.
This brings us to our next point: how can you help your own company avoid dealing with the ramifications of an expired website security certificate?
No one wants their business or services to experience an outage due to certificate mismanagement. This is why it’s integral for businesses — particularly those with hundreds or thousands of X.509 digital certificates — to have clear visibility of everything that touches their networks and IT systems. And this is where effective certificate management best practices and tools come into play.
A good certificate manager is one that enables you to discover all of the digital certificates that exist within your IT ecosystem. This means you’ll know where every certificate is across all endpoints and which systems each certificate is tied to or secures.
Sure, you can manually track your certificates using spreadsheets and calendar reminders, but this gets hairy at scale. Keyfactor and the Ponemon Institute report that organizations use an average of 88,750 keys and digital certificates. And if that number isn’t enough to surprise you, then consider that 74% of their 603 IT security and infosec survey participants think their organizations don’t actually know how many certificates or keys they have, let alone when they expire.
That’s not only embarrassing — it’s downright terrifying. And considering that SSL/TLS certificates have a one-year certificate validity period now, it means that certificates expire more quickly and require more stringent management.
Without effective certificate management, you may wind up having expired or revoked certificates on your network that you don’t know about. And each one is a vulnerability that cybercriminals can exploit. (Remember the Equifax data breach from a few years ago? An expired digital certificate on a traffic inspection device helped that one go undetected for months.) And this is when you go from having “just” a temporary service outage to potentially a full-blown data breach situation.
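For a sense of what even a bare-bones expiration check involves, here is a hedged Python sketch that uses only the standard library. It is nowhere near a full certificate manager (it discovers nothing on its own), and the function names, the 30-day warning threshold, and the inventory layout are assumptions made up for this example.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days left before a certificate's notAfter date.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jun 15 12:00:00 2030 GMT". Negative means expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


def fetch_not_after(host, port=443, timeout=5.0):
    """Read the notAfter field from the certificate a live server presents.

    The default context verifies the certificate, so an already expired
    one fails the handshake with ssl.SSLCertVerificationError.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiring_soon(inventory, threshold_days=30, now=None):
    """Return (days_left, name) pairs at or under the threshold, soonest first.

    `inventory` maps a certificate name to its notAfter string; both the
    layout and the 30-day default are illustrative assumptions.
    """
    report = []
    for name, not_after in inventory.items():
        days = days_until_expiry(not_after, now)
        if days <= threshold_days:
            report.append((days, name))
    return sorted(report)
```

You could feed `expiring_soon` from `fetch_not_after` for hosts you already know about, but that is exactly the catch: a script like this only covers the certificates someone remembered to list, which is why discovery matters so much at scale.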
Having the Right Tools Isn’t Enough: You Need to Know How to Use Them Effectively
You can be properly geared but still not get the Certificate Management Boss achievement. That’s because although having the right certificate manager is great, it’s just as important — if not more so — that you know how to use that tool effectively. This is true both from a general cybersecurity standpoint as well as a risk mitigation perspective.
It’s kind of like intimately knowing your character’s specs and attack/healing rotations in games. While wearing one set of armor and using a specific healing or attack rotation may be great for keeping your group alive in dungeons, it doesn’t mean that those same tools are effective when playing a tank or healer in raids. This is why you need to have not only the right gear (a certificate manager) but also must know the right…
FTW: Gaming Company Uses Certificate Expiration to Deliver Teachable Moment