What Does SLA Mean?

A service level agreement (SLA) defines the limits within which an operation is expected to perform. To say that one is operating within the agreed service level means that the average and the typical variation around that average (i.e., the standard deviation) are within agreed ranges.

Some common uses of SLA are:

  • Mean-time-to-failure (e.g., expected lifetime of a disk drive)
  • Average time to close service tickets (e.g., help desk)
  • Transaction response time (i.e., performance monitoring)
  • Up-time (e.g., the time that an application is available)

How do you know when your system is operating within the terms of an SLA?

Let’s take an example.  Suppose that the SLA for a transaction is an average response time of 1.5 seconds, with an allowance of one standard deviation.  The standard deviation measures how far, on average, the observed data lie from their average.  If the data follow a normal distribution (the bell curve), the average and the standard deviation together also give the probability of observing a value in a given range.  The probabilities that a value lies within 1, 2, 3, and 4 standard deviations of the average are:

1=68.27%

2=95.45%

3=99.73%

4=99.99%

This means, for example, that 99.73% of observed values are expected to be within three standard deviations of the average.
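These percentages can be reproduced numerically.  The sketch below assumes the data are normally distributed and uses the error function from Python's standard library; the function name coverage_within is only illustrative.  Values are rounded to two decimals, so they can differ from printed table values in the last digit.

```python
import math


def coverage_within(k: float) -> float:
    """Probability that a normally distributed value lies within
    k standard deviations of its average (assumes normality)."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3, 4):
    print(f"{k} standard deviation(s): {coverage_within(k) * 100:.2f}%")
```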

If our service level is one standard deviation, for example, that means we would expect about 68% of transactions to be within one standard deviation of the average.  Since we are talking about response time, we are not concerned with transactions that are faster than the lower limit.  We only want to make sure that they do not exceed the upper limit of the average plus one standard deviation.

Let’s suppose our average transaction response time is 1.5 seconds and our agreed standard deviation is 1 second.  Then we are saying that the response time of a typical transaction will not exceed 1.5 (average) + 1 (standard deviation)=2.5 seconds.

Suppose now our performance has slowed to the point that we have these observed values:

Average response time=3.267 seconds

Standard deviation=2.715 seconds
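Figures like these would normally be computed from measured response times.  A minimal sketch, assuming the measurements are available as a list of seconds; the sample values below are made up for illustration and are not the data behind the numbers above.

```python
import statistics


def summarize(samples):
    """Return (average, sample standard deviation) of response times."""
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical response times in seconds, for illustration only.
response_times = [1.2, 0.9, 4.8, 2.1, 7.5, 3.0]

average, std_dev = summarize(response_times)
print(f"Average response time={average:.3f} seconds")
print(f"Standard deviation={std_dev:.3f} seconds")
```

Note that statistics.stdev computes the sample standard deviation; statistics.pstdev would be the choice if the list were the complete population of transactions.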

What is our SLA compliance?  To answer that, we ask what percentage of transactions fall outside the SLA; the rest are compliant.

To compare two sets of data, they have to be on the same scale.  In statistics, data with a given average and standard deviation are often modeled as a bell curve (the normal distribution).  To see how we are operating with regard to the SLA, we have to express the SLA limits and our current performance on the same normalized scale.  These normalized values are called z values: z=(x-average)/standard deviation.
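The normalization itself is a one-line calculation.  A minimal sketch, with z_score as an illustrative helper name; the usage line normalizes the SLA average of 1.5 seconds against the current average and standard deviation.

```python
def z_score(x: float, average: float, std_dev: float) -> float:
    """Express x as a number of standard deviations from the average."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (x - average) / std_dev


# SLA average (1.5 s) measured against current performance.
print(round(z_score(1.5, 3.267, 2.715), 2))  # -0.65
```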

Take a look at the bell curve below.

[Figure: SLA-Graphic, a bell curve of current performance plotted against normalized z values]

This is a plot of the performance we see now in terms of normalized values z.  Our current performance has an average of 3.267 seconds and a standard deviation of 2.715 seconds.  Our SLA says that we will have an average of 1.5 seconds and a standard deviation of 1 second, so our agreement is that transaction response time will not exceed x=1.5+1=2.5 seconds.  To normalize a value x against current performance, calculate z=(x-3.267)/2.715.  The SLA threshold of 2.5 seconds is z=(2.5-3.267)/2.715=-0.28, and the SLA average of 1.5 seconds is z=(1.5-3.267)/2.715=-0.65.  Our current average sits at z=0, so the SLA threshold is now below our average response time.

So what percentage of our transactions are slower than the threshold, that is, above z=-0.28?  A standard normal table tells us that the probability that z lies between -0.28 and 0 is 11.0%.  The probability that z is above 0 is 50%.  The probability that a transaction exceeds the threshold is then 11.0%+50%=61.0%.  Assuming response times are roughly normally distributed, about 61% of observations are outside our SLA and only about 39% are within it.

Your contract with your client might assess a penalty based upon the failure to meet the SLA.  In this case about 61% of our transactions are operating outside agreed thresholds.  Depending on how long this situation persists, the surcharge could be large or small.
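The compliance calculation can also be done without a table.  The sketch below assumes response times are normally distributed and treats 2.5 seconds (the SLA average plus one standard deviation) as the limit; share_over_threshold is an illustrative name, and NormalDist comes from Python's standard statistics module (Python 3.8 or later).

```python
from statistics import NormalDist


def share_over_threshold(threshold: float, average: float, std_dev: float) -> float:
    """Fraction of transactions expected to be slower than the threshold,
    assuming normally distributed response times."""
    return 1.0 - NormalDist(mu=average, sigma=std_dev).cdf(threshold)


# Current performance from the example, measured against the 2.5 s limit.
share = share_over_threshold(2.5, 3.267, 2.715)
print(f"Outside the SLA: {share:.1%}")
print(f"Within the SLA:  {1 - share:.1%}")
```

Real response times are never negative and are often skewed, so the normal model is only an approximation; counting the measured transactions that exceed the limit gives a direct answer when the raw data are available.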