99.99% Uptime Goal for 2024

In late Q3 or perhaps early Q4, my company introduced a new uptime goal of “Five Nines”, which written in numbers is 99.999%. This means the company would tolerate only 5.26 minutes of downtime over the whole year. Considering we have over 300 locations across the country to manage, it’s what I would call a stretch goal. As it stands, I am not even sure we have a way to measure that metric, although it is refreshing to have something to work towards as a team.

Because I love what I do for work, I’d like to practice my engineering craft and implement an uptime/availability goal here on my own web server. Since I am a one-man operation with limited availability, I do not think Five Nines is feasible. Instead, I will opt for 99.99%.

Availability is generally calculated based on how long a service was unavailable over some period. Assuming no planned downtime, Table 1-1 indicates how much downtime is permitted to reach a given availability level.

Availability (%)	Year	Quarter	Month	Week	Day	Hour
99.95	4.38 hours	1.08 hours	21.6 minutes	5.04 minutes	43.2 seconds	1.8 seconds
99.99	52.6 minutes	12.96 minutes	4.32 minutes	60.5 seconds	8.64 seconds	0.36 seconds
99.999	5.26 minutes	1.3 minutes	25.9 seconds	6.05 seconds	0.87 seconds	0.04 seconds
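
For anyone who wants to check the arithmetic, here is a minimal Python sketch of how these numbers fall out. It assumes no planned downtime and the period lengths the table appears to use (a 365-day year, a 90-day quarter, and a 30-day month). The function names are my own for illustration; `remaining_budget_seconds` is an extra helper showing how recorded downtime could be subtracted from the budget.

```python
# Allowed downtime for a given availability target, assuming no planned
# downtime. Period lengths are assumptions chosen to match the table:
# 365-day year, 90-day quarter, 30-day month.
PERIOD_SECONDS = {
    "year": 365 * 24 * 3600,
    "quarter": 90 * 24 * 3600,
    "month": 30 * 24 * 3600,
    "week": 7 * 24 * 3600,
    "day": 24 * 3600,
    "hour": 3600,
}


def allowed_downtime_seconds(availability_percent: float, period: str) -> float:
    """Return the downtime budget, in seconds, for one period."""
    unavailable_fraction = 1 - availability_percent / 100
    return PERIOD_SECONDS[period] * unavailable_fraction


def remaining_budget_seconds(
    availability_percent: float, period: str, downtime_so_far: float
) -> float:
    """Budget left after recorded downtime; negative means the goal is blown."""
    return allowed_downtime_seconds(availability_percent, period) - downtime_so_far


if __name__ == "__main__":
    for target in (99.95, 99.99, 99.999):
        minutes_per_year = allowed_downtime_seconds(target, "year") / 60
        print(f"{target}%: {minutes_per_year:.2f} minutes per year")
```

For my 99.99% target this works out to roughly 52.56 minutes per year, which the table rounds to 52.6. A single one-minute outage would use up about a quarter of the 4.32-minute monthly budget.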

Architecture and Design

On December 25th, 2023, I rebuilt this web server to follow one of my favorite principles: keep it simple. Moving from WordPress to serving plain HTML5/CSS3 greatly reduces the complexity of the system. It removes the backend database requirement and also reduces resource overhead. I had learned that WordPress (being PHP-based) builds each page on request for every visitor. This means the server does more CPU-intensive work to serve the same content. It also means that a large spike in otherwise normal traffic has a higher potential to DoS the backend server.

To further reduce the likelihood of a DoS event, I’ll need a Content Delivery Network. In my not-so-long-ago WordPress days, I would have used QUIC.Cloud. They have a plugin with fantastic integration with WordPress and the backend OpenLiteSpeed webserver. However, I’ve learned that QUIC.Cloud struggles to effectively cache plain HTML. I could go with Cloudflare and their free tier, but it does not have the SLO/SLA that I am looking for. I’m also not willing to spend $30/month for their CDN. Instead, I have opted to use Bunny.net. They have about 40 more PoPs (Points of Presence) than QUIC.Cloud, which should reduce latency to my website in some parts of the world. The big selling point was their documentation and the ability to integrate painlessly with basic HTML5.

Another important note is the location of my webserver. While it is generally okay to host a website from home, I will not. Instead, this webserver lives in the Akamai data center (formerly Linode) in Fremont, California. This removes my need to worry about redundant cooling, electricity, and networking. It also allows me to scale my server both vertically and horizontally as my needs change. Inside this Fremont data center, I am also performing rolling backups. This way, even if the server is broken beyond repair for whatever reason, I can roll back and restore to a known good state in about 10 minutes.

Observability

Now that we are comfortable with our hardware and operational software stack, we need to properly monitor these underlying services. Thinking it through, I’ll need decent visibility into:

Resource utilization.
Error rates.
SSL certificate status.
Webserver latency.
CDN latency.

New Relic will be my primary monitoring solution. Using their locally installed agent, I can monitor resource utilization, error rates, and backend latency. New Relic can also alert me by email and through a personal PagerDuty account in the event of a full-scale outage. The Linode Cloud Manager will act as my secondary alerting method for unusually high resource utilization over an extended period of time.

Uptime Kuma will be used to monitor SSL certificate status, webserver latency, and CDN latency. It will also alert me to outages via my personal PagerDuty account. This Uptime Kuma instance runs in Oracle Cloud Infrastructure, in a region separate from both myself and the webserver. By leveraging this third geographical point of monitoring, I can collect and analyze data to find latency problems and areas for improvement that would otherwise have been hidden from me.
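
To make the SSL certificate check a little more concrete, here is a rough sketch of what such a probe boils down to, using only the Python standard library. This is not how the monitoring tool implements it; the function names and the 14-day renewal threshold are my own assumptions for illustration.

```python
import socket
import ssl

# Hypothetical threshold: warn when fewer than this many days remain.
RENEWAL_WARNING_DAYS = 14


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between `now` (epoch seconds) and a certificate's notAfter value.

    `not_after` uses the format returned by getpeercert(), for example
    "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400


def needs_renewal(not_after: str, now: float) -> bool:
    """True when the certificate expires within the warning window."""
    return days_until_expiry(not_after, now) < RENEWAL_WARNING_DAYS


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A probe would call something like `needs_renewal(fetch_not_after("example.com"), time.time())` on a schedule (with `example.com` standing in for the real hostname) and raise an alert when it returns `True`. Keeping the date arithmetic separate from the network call makes the logic easy to test without touching a live server.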

Conclusion

2024 is here, and for the first time in its existence this website/project has a practical goal. Hope to see you later in the year!

Site Reliability Engineering: How Google runs Production Systems – O’Reilly Media 2016
