Performance Monitoring Basic Statistics and Google Sheets

In order to have a better understanding about performance monitoring, one should have some understanding of statistics. (This is increasingly true as that is the technology that drives analytics.) The most basic statistics are the average, standard deviation, and the variance. Here we explain those in the simplest possible manner (so those who are not math buffs will not be intimidated).

Google Sheets versus Microsoft Excel

Indulge me for a moment and let me give you information about the relative merits of Excel versus Google Sheet. You might even find something funny here. I am something of an expert on this topic: last year I ghost wrote an entire book on Microsoft Excel advanced function.

I have Microsoft Office and use Microsoft Excel and Word practically every day. But I also use Google Docs. The Google Docs spreadsheet application is called Google Sheets. It is basically the same as Microsoft Excel, but there are some slight differences in formula syntax, and of course it is web-based.

Here I am going to explain some basic statistical concepts using Sheets. Most of the time people write about such things, they use Excel. I feel one should use both.

Last year I bought a copy of Microsoft Office for my new computer. But Office does not include Microsoft Access anymore. I believe the reason for this is people have learned how to use Excel Pivot tables, which they would easier to understand than Access. But someone like you, who works in IT, would find Access a better tool for many types of analysis, including performance monitoring. I am going to illustrate some ways to do that in upcoming posts. Access lets you create SQL queries and write Visual Basic functions to analyze large sets of data. It is a much better tool that Excel or Sheets for that.

Anyway, the sort-of amusing part of this is I signed up for Office 365 in order to get Access. Now I pay about $8 per month for a subscription to what it seems like I already had since I could already use the Microsoft SkyDrive and edit documents online without the Office 365 subscription. In other words, in 3 years I would have spent $288 for Microsoft Access. I should have just bought the software and saved money.

Plus there are some problems. I am seeing that there are sync problems with the Microsoft SkyDrive, which they now call OneDrive. This happens when I use VPN. It also happens when I use Excel on the Office 365 site (in other words the web version).

I have never had even one sync problem using Google Drive and Microsoft Office. But with Microsoft OneDrive I have come very close to losing documents entirely. It seems you can recover manually even what Word cannot recover if you spend enough time looking into your appdata folder on the C:-drive for temp files whose name does not necessarily end in “.doc”. One should not have to spend so much time looking for lost documents. I imagine other people who are not computer-savvy have lost work.

Now, onto statics for performance monitoring.

Average
The means and average are the same thing: this is the sum of the data points divided by the number of data points. Its the simplest statistics of all and one which practically everyone knows, since yout if they follow baseball.

For example, if you have the data {1,2,3,4,5} then the average is:

(1+2+3+4+5)/5 = 3

In Google Docs (Google Sheets) you can write this formula exactly like this:

=average(1,2,3,4,5)

Or its more common to refer to cell references as in:

=average(a1:a5)

Intuitively, you can thing of the average as being the number “in the middle,” which in this case it is. As in “he is of average height.” (Except the thing in the middle is correctly called the “median.” See below.)

In this simple example above, you can see that we have an odd number of data elements and the difference between adjacent elements is 1. So obviously the number in the middle is going to be the average, in this case 3. Keep that in mind when we discuss variance below. I made this example that way to make the next idea easier to understand.

Variance
Take a breath: the “variance” is the average of the squared differences from the mean. Wow. What does that mean?

If the variance in a data set is large then the that means numbers are widely dispersed, sort of. Read on.

Recall our simple example from above with this data set: {1,2,3,4,5}

The mean (average) is 3. So the differences from the mean for each number is:

3-1=2
3-2=1
3-3=0
3-4=-1
3-5=-2

If we just sum all those up, they total 0, since these differences all cancel each other other. So that tells us nothing. So we square each number to that the we are always dealing with positive numbers. So we have:

3-1=2^2=4
3-2=1^2=1
3-3=0^2=0
3-4=-1^2=1
3-5=-2^2=4

The average of that is (4+1+0+1+4)/4=2, which is the variance.

But those numbers are the squares of the distance from the mean. So they do not give a good indication of how widely dispersed our data is. So we take the square root sqrt(2)=1.414 to unsquare the numbers and put them back to what they were. The resulting number is called the standard deviation.

Here is where you need to be careful when you are calculating these values in Google Docs, because the formulas that you think would be correct, are not the correct ones to use.

In Google Sheets, the variance is:

=VARP(1,2,3,4,5)=2

and not

=VAR(1,2,3,4,5)=2.5

And the standard deviation is:

=STDEVP(1,2,3,4,5)=1.414

and not

=STDEV(1,2,3,4,5)=1.58

Note the letter “P” in the formulas, meaning “population.” The population means all the data points. Google defines STDDEV and VAR as working over a “sample” and not a “population.” “Sample” means all of the data points except 1 of them. That is the way that statisticians deal with those numbers but it is not applicable here. (I really cannot explain why. It has something to do with using statistics in the traditional sense, meaning making some estimate based on a sample of data. But here we have all of the data, albeit it is a sample of some set of data, as it performance metrics over time. Confusing, I know.)

Median

The median is the point at which 1/2 of the data points are higher than that value and 1/2 of the data points are less than that value. It just means the number in the middle. So for {1,2,3,4,5} the median is 3, which is the number in the middle. The median and average would not normally be the same, but the example we are using is was contrived to illustrate these ideas.

What does the media mean? If the average is, say, 100, and the median is 4, then you know that the number above the median are much higher that 4. In that case the variance and standard deviation would be large. This is the whole point here: to take the mean, median, and standard deviation and be able to make some statement about what you are looking at.

For example, {1,2,4,193,300} is an set of numbers with mean 4 and average 100. As you can image the standard deviation is large: 124.

Confidence

The idea of the standard deviation leads to all kind of related concepts like the cumulative distribution meaning the probability that a particular data point is below or equal to that data point.

One of the simplest interpretations of the standard deviation is the confidence level.

Here I am going to take this example from APM Digest http://apmdigest.com/how-to-have-confidence-in-a-small-sample

and give you another way to interpret that. So read both articles to have the widest possible understanding.

Suppose we have the following 5 response times. I have made them all almost the same exact value so that you can more easily understand what is “confidence.”

response times
4.9
5
5
5
5

4.98 average
5 median
0.04 std deviation
2 variance
0.01206563945 CONFIDENCE(alpha, standard_deviation, pop_size)

In this example the average (mean) mean is 4.98 and the standard deviation is very small 0.04. Since all the measurements are about the same; that small standard deviation helps make the point that the standard deviation and the average say a lot about your data set.

Here lets add one more statistic: confidence. We can say with 95% confidence and all of our performance times are going to be within 0.012 +- of 4.98 because:

=CONFIDENCE(5%, standard_deviation, number of data points)

where 5% is 100%-95%.

The careful reader will notice that we did not need to use the average in this calculation. That is because it is used in the calculation of the standard deviation.

Further Reading

Here are some other blog posts on this topic of analytics and statistics:

http://www.correlsense.com/blog/digging-deeper-analytics/

http://www.correlsense.com/blog/application-analytics/