In this lab you will learn about:
Oh, and one more thing: thanks for coming to SLOConf. We are so grateful that you are taking the time to connect with the community! Please feel free to share your lab experience with others! Finally, if you enjoyed the contents of this lab, be sure to drop a line to our friends at Isos Technology.
Let's get started!
If you're new to SLOs and how they relate to SRE, please watch this short introductory video on why SLOs are important and exciting.
In this longer video you'll learn, at a logical level, the method behind the math and computer science that drive SLOs. You'll also learn why Nobl9 built the Nobl9 platform, and we'll discuss error budgets and how to use them.
How do we tie reliability objectives to customer satisfaction? How would it change how we work if we stopped focusing on how reliable we can make individual systems? These systems, combined, make up our product, and we need to get smart about how all of them affect customer happiness.
Reliability is a function of customer happiness. SLAs, by contrast, measure total dissatisfaction at a contractual level: the A stands for Agreement, as in a legal contract, with money on the line and significant negative strategic (reputation) and tactical (wasted work) consequences. What happens when we also use reliability objectives tied to customer happiness? We call these Service Level Objectives (or SLOs).
Using an SLO and an error budget is conceptually straightforward. But the math involved, and getting everyone on the same page, can be challenging.
The Nobl9 platform was created to help teams do this harder work. Nobl9 ingests the SLIs you choose from your various data sources, lets you define your SLOs in a wizard or via SLOs as code, and then calculates your error budget for you. The SLOs are displayed in a clean, organized view that makes it easy for stakeholders of any technical aptitude to see the health and trends of their SLOs.
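To give you a taste of the SLOs-as-code option, here is a rough sketch of an SLO manifest. The field names approximate Nobl9's n9/v1alpha YAML schema and may differ from the current product, and the names are hypothetical; the query and thresholds match the SLO you will build later in this lab.

apiVersion: n9/v1alpha
kind: SLO
metadata:
  name: yahoo-page-load          # hypothetical name
  project: sloconf-lab
spec:
  service: my-lab-service        # hypothetical service name
  indicator:
    metricSource:
      name: newrelic             # the data source the SLI comes from
  budgetingMethod: Occurrences   # count good vs. total data points
  timeWindows:
    - unit: Hour
      count: 1
      isRolling: true            # rolling, not calendar-aligned
  objectives:
    - displayName: page-load-ok
      op: lte                    # a point is "good" when duration <= 6000 ms
      value: 6000
      target: 0.99               # 99% of points must be good
      rawMetric:
        query:
          newRelic:
            nrql: SELECT average(duration) FROM SyntheticCheck WHERE monitorName = 'yahoo' TIMESERIES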
Below are some of the many benefits of using error budgets (we won't have time to cover all of these in the lab today):
We've prepared the lab environment with example services and SLOs to get you started.
This lab is a shared environment. Please do not edit the example SLOs or those of other users. Please limit your changes to the SLOs in the services that you will create.
https://app.nobl9.com
Your lab account credentials should have been sent to you when you completed the form above. Don't forget to check your spam folder if you can't find them.
If you still can't find them, we can send them to you in the SLOconf Slack workspace. Ask for help in the #sloconf-labs-nobl9 channel!
We suggest you open a new browser window so you can easily glance back and forth between this lab and the app.
Upon successfully logging in you should be greeted by the SLO grid view, which lays out SLOs by their respective services. Feel free to look through them; clicking on an SLO name or Reliability Burn Down chart will open a details page.
The SLOs in this lab exist solely for the purposes of this lab and should not be taken as evidence of poor service by the providers. Also, the time windows have been set to 1-hour rolling windows; in the real world these are usually 28 or 30 days and have larger error budgets.
You can use the time picker to expand the chart time windows.
By clicking on the SLO name on the Reliability Burn Down chart you can open the SLO Details page. Here you can find details about the SLO, see the SLI metrics and how they compare to the threshold, an expanded Reliability Burn Down chart, and an Error Budget Burn Rate chart.
Your screen should look like this:
(This oath will make sense after you do the next bit; it's pretty fun.)
A one-hour rolling window with a target of 99% of the pages completely loading all resources in less than 6,000 ms.
Step One: Select a Service
Go to step two by clicking on the Step 2 heading. (This is how you navigate all the steps; you can jump back and forth if you missed something or need to change something.)
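As an aside, the service you select here can itself be defined as code. A minimal, illustrative sketch, with the same caveat as before that field names approximate the Nobl9 YAML schema and the names are hypothetical:

apiVersion: n9/v1alpha
kind: Service
metadata:
  name: my-lab-service           # hypothetical name
  project: sloconf-lab
spec:
  description: Synthetic page-load checks for the SLOConf lab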
Step Two: Select a Data Source and Metric
sloconf-lab
data source, a two-panel element will appear.
SELECT average(duration) FROM SyntheticCheck WHERE monitorName = 'yahoo' TIMESERIES
In this case it's New Relic data, but it could come from a growing list of data sources.
Step Three: Define the Time Window
Step Four: Define Error Budget Calculation and Objectives
Step Five: Name Your SLO and Assign the Alert Policy
#sloconf-alerts
You can find yours by looking for your SLO name. Click Apply in the lower right to save your SLO.
Using a step-by-step process similar to the one you saw when you created your first SLO, you can create an Alert. Let's go ahead and look at the interface.
Your configuration screen should look like this.
The first step, Define Alert Condition, is really the main configuration, and it asks for surprisingly little data. Take note of this simplicity: it's another place where the efficient flexibility of Nobl9 really shines.
An alert policy can be tied to multiple SLOs. These alerts are proactive webhooks that make your team(s) aware, or trigger automation, before the error budget is exhausted. You can use pre-configured webhook destinations or send the webhook anywhere that accepts one.
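To make that concrete, here is a rough sketch of an alert policy expressed as code. As with the earlier SLO sketch, the field names approximate Nobl9's YAML schema and may differ from the current product; the burn-rate threshold and the names are just examples.

apiVersion: n9/v1alpha
kind: AlertPolicy
metadata:
  name: fast-burn                  # hypothetical name
  project: sloconf-lab
spec:
  severity: High
  conditions:
    - measurement: averageBurnRate # budget burning 2x faster than sustainable
      value: 2
      lastsFor: 5m                 # only alert if the burn persists
  alertMethods:
    - metadata:
        name: slack-sloconf-alerts # hypothetical pre-configured webhook destination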
From an SLO-based SRE perspective, you really only need alerts when your error budget is threatened.
With these alerts, you are notified when customers are actually being impacted. What if your incident response team were only alerted on trends and spikes directly related to customer happiness? This can reduce pager fatigue, and it can also serve as a reference to determine whether a regular monitoring alert is actually impacting users.
Nobl9 dynamically calculates the error budget as it receives data. If you are using a rolling time window, as is often the case, the error budget is replenished when reliability rises; on a calendar time window you don't get it back until a new period starts. Your customers can be at varying levels of happiness, from ecstatic beyond their wildest dreams, to pretty satisfied, to very dissatisfied.
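To make the arithmetic concrete: with a 99% target, the error budget is the remaining 1%. Over this lab's 1-hour rolling window that is 0.01 × 60 minutes = 36 seconds of allowed unreliability (or 1% of the data points, if you count occurrences); over a more realistic 30-day window it is 0.01 × 30 × 24 hours = 7.2 hours.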
The error budget is how we can more easily see the problems that are affecting customer satisfaction. Without an SLO we don't have a target level of reliability, and without a measurable level of unreliability we can't make changes knowing where we stand with our level of service. We can find a balance between the floor of dissatisfaction and the ceiling of 100% reliability. If we know how to find that balance, we can also balance how many new features and how many reliability improvements we make in the near term and the long term.
We can improve the product by adding new features and improve reliability at the same time.
We completed our earlier task by creating an SLO and sending an alert to Slack. Alerting is great for helping you be proactive and keeping everyone in the loop, but what about the past? That's where reports come in: they help you look at long periods of time, enabling you to make more informed decisions down the road.
Since we keep the SLO data for over a year, these reports can be pulled for extended periods of time. We also give you the ability to export these reports to an S3 bucket, GCS, or Snowflake, with more on the way.
Usually teams like to see the math play out. Agreeing on your first SLOs and alert policy configuration starts you on the journey of SLO-based SRE. The real, and very effective, work is beginning a more pragmatic dialog between all of the stakeholders.
Often teams start by attacking the systems that were causing slow-burn or very spiky reliability issues, because they can now see which parts of those systems are affecting customer satisfaction.
The first few weeks can be pretty giddy, but they're also the beginning of improving your SLO and alerting configurations.
Then you really get the whole picture: model the objectives to map to a tiered customer satisfaction scale. When customers are just barely satisfied, you can work on planned improvements to get them to the next level up. What's always fun is when the model becomes accurate enough that you realize your SLOs are providing better incident response signals than you've ever had before.
In the next module, we've got some ways you can modify the SLO you created earlier.
Go ahead and play around with the interface.
If you want to tweak what you did in the lab exercises, we have some suggestions in the next module.
This is the end of the guided lab!
Congratulations!
These are some other ideas to get you thinking about how you can play with the SLO configurations connected to our sample SLI monitoring data sources.
For the Objective you created in this Lab here are some alternate configurations to play with:
Try adding more objectives, each with a different target percentage, threshold value, and experience name (see the sketch after this list).
Notice that modifying an Objective's value will reset its history.
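In SLOs-as-code terms, extra objectives are just additional entries in the objectives list, each with its own threshold, target, and display name. A sketch, with hypothetical values and the same schema caveats as before:

objectives:
  - displayName: good            # most checks load reasonably fast
    op: lte
    value: 6000
    target: 0.99
  - displayName: excellent       # a stricter, "delightful" experience
    op: lte
    value: 4000
    target: 0.95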
Below are some additional queries that you can try in new SLOs:
SELECT average(duration) FROM SyntheticCheck WHERE monitorName = 'yahoo' AND locationLabel='San Francisco, CA, USA' TIMESERIES
SELECT average(duration) FROM SyntheticCheck WHERE monitorName = 'yahoo' AND locationLabel='Washington, DC, USA' TIMESERIES
Our configuration will only send alerts to the #sloconf-alerts channel.
If you create a new SLO (and set it to use your new alert policy), it won't have an error budget calculation for a while, so the alert won't fire right away. But check back and you should see it firing.
SRE - Site Reliability Engineering, the practice of designing and maintaining a system from a reliability perspective. Usually measured in terms of service availability (responsiveness, latency, throughput, etc.) to ensure compliance with reliability targets.
SERVICE LEVEL OBJECTIVE (SLO) - An SLO is used to prioritize and decide how much effort to invest in activities that improve customer happiness. SLOs are not used only in SRE, but they are central to it.
CUSTOMER HAPPINESS - A tiered measure of how happy or unhappy a customer is with the reliability of a product (not the systems that deliver the product).
ERROR BUDGET - How much Customer Happiness we all agree, specifically and under what conditions, we are willing to sacrifice so we can continuously improve our entire product.