Using various measures of deviation from a central tendency, calculate the scale of data.
Identify the applicability of each deviation measure to answer business questions.
Formulate and explain an interpretation of deviation measure results to answer business questions.
Deviation, distance, variation, scale, residual, and even error are very similar concepts. They form the backbone of statistics because they measure how far away actual data is from our model, our supposition, our hypothesis, our belief of where we would like to think the data is. And of course there are lots of ways to compute distance and deviation.
We just computed several locations, trends, averages, percentiles, even quartiles with a box! Each of these is our peculiar mental image of where we think the data ought to be, at least most of the time. The rest of the time the data is wherever it happens to be when we stumble on it. To get to an understanding of scale, distance, and deviation we had to form a belief, calculate an average, build a trend. Now we move on to a more complete story of any particular data stream.
Rev up the deuce
This is where we will start, again. This time we are going to mix it up a bit. We are back at our car auction in New Jersey. We have heard enough about price and now we want to know why price might vary. One explanation from our far-reaching understanding of economic choice is that we might be buying car quality. One measure of quality is how many miles the previous owner drove the car. We read odometers to get that data. We now have two sets (vectors) of data. Each observation is a pair of measurements for each car: price and miles.
| i | price | miles |
|---|-------|-------|
| 1 | 12500 | 43521 |
| 2 | 13350 | 31002 |
| 3 | 14600 | 18868 |
| 4 | 15750 | 12339 |
| 5 | 17500 |  9997 |
We can set up three ways of thinking about our car data:
Price (\(Y_i\)) variations on their own: \(Y_i=\bar{Y}+e_i\)
Miles (\(X_i\)) variations on their own: \(X_i=\bar{X}+e_i\)
Price dependency on miles: \(Y_i=b_0+b_1X_i+e_i\)
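To make this setup concrete, here is a minimal sketch in Python (ours, not part of the original materials) that stores the five data pairs and computes the two sample means that anchor models 1 and 2:

```python
# The five auction cars from the table above.
price = [12500, 13350, 14600, 15750, 17500]   # Y_i
miles = [43521, 31002, 18868, 12339, 9997]    # X_i

# Arithmetic means: the "models" for price and miles on their own.
y_bar = sum(price) / len(price)
x_bar = sum(miles) / len(miles)

print(y_bar, x_bar)  # 14740.0 23145.4
```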
To put all of these models together on one graph we can plot each of the 5 points on a scatter plot like so:
Let’s look at the Y-axis, price, first and draw a horizontal line to depict the arithmetic average of price, \(\bar{Y}\). Then draw error bars from each data point to this line, all in the Y-direction.
This looks a lot like the error bar plot we generated before, where now miles seems to act like the observation index. Try to draw a similar plot of the X-axis error bars around the arithmetic mean of miles.
We just had to rotate our thinking by 90 degrees clockwise, that’s all. Hover over the vertical average line to see the mean of miles as \(\bar{X}\).
Now let’s layer one graph on the other. This will get a little dense, but it will illustrate the interaction of price and miles well enough.
We now have price and miles in the cross-hairs and begin to see not just the individual variations of price and miles about their respective means, but can also begin to visualize the co-variation of price and miles.
To illustrate model 3, price dependency on miles: \(Y_i=b_0+b_1X_i+e_i\), let’s draw the best line we can through the scatter, one that will minimize squared deviations of price about our odometer-inspired model of price. Let’s leave that calculation till later. Willingly we will suspend any disbelief about that calculation. Let’s instead believe the numbers to be true. Let’s also keep the cross-hairs. Here we go.
Hover over the dark blue error bars that connect data pairs of miles and price to a new average, a downward-sloping straight (linear, that is) line. If we were told that the average slope is -0.1314, what would we think the average Y-intercept is?
Our new average price is an estimate of the average of \(Y\) conditional on miles, \(X\). In terms of the averages \(\bar{Y}\) and \(\bar{X}\), it is
\[
\bar{Y} = b_0 + b_1 \bar{X}
\] We know that \(\bar{Y}=14740\), \(\bar{X}=23145.4\), and have just been informed that the slope parameter \(b_1=-0.1314\). So we plug (that is substitute) these numbers into the formula to get this.
\[
14740=b_0-0.1314(23145.4)
\] We now solve for the Y-intercept \(b_0\) to find this result.
\[
b_0=14740+0.1314(23145.4)=17782.4396
\] Phew! Not so bad, just one equation in one unknown.
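The same one-equation solution as a quick code sketch (note the small gap from the text's 17782.4396, which comes from rounding the slope to four decimals):

```python
# Rearranging Y_bar = b0 + b1 * X_bar gives b0 = Y_bar - b1 * X_bar.
b1 = -0.1314
b0 = 14740 - b1 * 23145.4
print(b0)  # 17781.31, about a dollar off the text's unrounded-slope answer
```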
We now have jumped into hyper-space (at least 2-space) to expand our consciousness from univariate to bivariate relationships.
Our next stop is to use these and other versions of deviations to get a handle on the scale, range, variation, deviation, and yes, error in the data, at least with respect to our beliefs about the data. We just expanded our beliefs from univariate to multivariate. When we get to multivariate we will look at how variations relate to one another, in our case bivariate variations in price relating to variations in miles and vice-versa.
Blinded by the light
What’s a sample? (again)
Let’s recall what a sample is. It is a random draw from a larger group of data called a population. The word random derives from the Frankish (a Germanic language from a while back) word rant, much like our word rant, and by about 1880 it came to mean simply indiscriminate.
Our sample of auction prices is whatever we could get our hands on at the time of the sampling, thus a sort of random draw from the population of all auctioned cars. This begins to allow us to look at the error bars we generated as random deviations from the mean.
Let’s also recall that this drawing of the random sample is the first analytical step after identifying a business question and a population from which to sample.
So we have a sample of two univariate data series. We have also thought it reasonable to relate the two series together. Let’s now find some measures of scale, deviation, and variation for each of the univariate series.
1. Standard deviation
If \(X\) is a sample (subset) indexed by \(i = 1 \dots N\) with \(N\) elements from the population, then we already know that
\[
\bar{X} = \frac{\Sigma_{i=1}^N X_i}{N}
\]
Our model of deviations from the mean produced error terms \(e_i=X_i-\bar{X}\), which we assigned to the miles univariate data series. If we were to add up these deviations like this
\[
\Sigma_{i=1}^{5}(X_i-\bar{X})
\]
what would we get?
ZERO.
Right, this aggregation of deviations will always give us zero, by definition.
Again, like we did to find the best average, let’s calculate the sum of squared errors, the \(SSE\). Because we sampled the data, the \(SSE\) will have to be divided by \(N-1=5-1=4\), called, for the moment, the number of degrees of freedom, to get an average. This will produce an unbiased measure, another topic for another time!
\[
s^2 = \frac{\Sigma_{i=1}^N (X_i - \bar{X})^2}{N-1}
\] The squared measure \(s^2\) is officially called the variance. Its square root
\[
s = \sqrt{s^2}
\]
is the standard deviation. The notion of standard is that of an average. We already know what a deviation is.
Let’s build a table of four columns to calculate this measure of scale, deviation, error, and variation.
| i | miles | deviation | deviation squared |
|---|-------|-----------|-------------------|
| 1 | 43521 |  20376 | 415165075 |
| 2 | 31002 |   7857 |  61726164 |
| 3 | 18868 |  -4277 |  18296151 |
| 4 | 12339 | -10806 | 116778281 |
| 5 |  9997 | -13148 | 172880423 |
Some really big numbers emerge. That is typical and we often try to scale these down when performing computations, again a topic for later.
Let’s sum up the deviations again (just to prove they add up to zero) and the squared deviations (\(SSE\))
\[
var(miles)=s^2=\frac{784846093.2}{5 - 1}=196211523.3
\] and then
\[
s = \sqrt{196211523.3} = 14007.5524
\]
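Here is a short Python sketch of the same calculation; the standard library's statistics module uses the same \(N-1\) convention:

```python
import statistics

miles = [43521, 31002, 18868, 12339, 9997]
x_bar = sum(miles) / len(miles)                 # 23145.4

sse = sum((x - x_bar) ** 2 for x in miles)      # 784846093.2
var = sse / (len(miles) - 1)                    # divide by N - 1 = 4
sd = var ** 0.5

print(var, statistics.variance(miles))          # both about 196211523.3
print(sd, statistics.stdev(miles))              # both about 14007.55
```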
2. Interquartile Range: IQR is \(Q_3\) net of \(Q_1\) (\(P_{75} - P_{25}\)) and gives us a robust view like that in Tukey’s (1977) box plot. What is the IQR of miles and price?
We should have gotten
\[
IQR_{miles}=31002-12339=18663
\] and
\[
IQR_{price}=15750-13350=2400
\]
A wider, broader scale than the standard deviations? Yes. This measure is robust even for highly skewed distributions with thick tails.
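A sketch of the IQR computation, assuming numpy's default linear interpolation of percentiles (which, with five points, lands exactly on our quartiles):

```python
import numpy as np

miles = np.array([43521, 31002, 18868, 12339, 9997])
price = np.array([12500, 13350, 14600, 15750, 17500])

# With five points, the default interpolation puts Q1 and Q3
# exactly on the 2nd and 4th sorted observations.
q1_m, q3_m = np.percentile(miles, [25, 75])
q1_p, q3_p = np.percentile(price, [25, 75])

print(q3_m - q1_m)  # 18663.0
print(q3_p - q1_p)  # 2400.0
```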
3. Mean Absolute Deviation: MAD is robust to outliers.
\[
MAD = \frac{\Sigma_{i=1}^N |X_i - \mu|}{N}
\] Let’s compute this statistic for price and miles.
This time we should have gotten for \(X_i= miles_i\):
\[
MAD_{miles} = \frac{56464.4}{5} = 11292.88
\] and, for price, \(MAD_{price} = \frac{7540}{5} = 1508\).
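And a sketch of MAD as a small helper function (ours), which reproduces both numbers:

```python
# Mean absolute deviation: the average unsigned distance from the mean.
def mad(data):
    mu = sum(data) / len(data)
    return sum(abs(x - mu) for x in data) / len(data)

print(mad([43521, 31002, 18868, 12339, 9997]))   # 11292.88 for miles
print(mad([12500, 13350, 14600, 15750, 17500]))  # 1508.0 for price
```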
4. Correlation: this measures the degree of relationship between two variables. The measure ranges from a low of -1 to a high of +1. A -1 is a perfectly correlated inverse relationship between two variables. A +1 measures a perfectly positive relationship between two variables. A 0 indicates that no relationship seems to exist. This is not cause and effect (what we otherwise call an antecedent-consequent relationship); it is just two variables happening to bump together, or not, in the street one day in the Bronx.
Three steps to a correlation.
Calculate the covariance between two variables, \(X_i\) and \(Y_i\).
\[
cov(X, Y) = s_{xy} = \frac{\Sigma_{i=1}^N(X_i-\bar{X})(Y_i-\bar{Y})}{N-1}
\] The numerator sums up the pairwise ups and downs of how \(X\) varies with \(Y\). This number may net out to positive, negative, or just very close to zero. We lose at least one degree of freedom because we have to use the arithmetic mean to calculate deviations, just like we did for the sample standard deviation.
Calculate the standard deviations for each of the two variables, \(X_i\) and \(Y_i\). First, the variance, here illustrated for \(X\), miles in our example.
\[
var(X) = s_{x}^2 = \frac{\Sigma_{i=1}^N(X_i-\bar{X})^2}{N-1} = \frac{784846093.2}{5-1} = 196211523.3
\]\[
s_X = \sqrt{var(X)} = 14007.5524
\] and for \(Y\) as price we have
\[
s_Y = 1975.2848
\]
Calculate the ratio of covariance to the product of the standard deviations. This step transforms the unwieldy, and hard to interpret, covariance from the \(-\infty...+\infty\) range to the \(-1...+1\) range. For samples we call the correlation \(r_{XY}\):
\[
r_{XY} = \frac{s_{XY}}{s_X s_Y}
\]
We will show later (yes, with calculus – free of charge) that the slope parameter \(b_1\) is just the ratio of the covariance of miles (\(X\)) and price (\(Y\)) to the variance of miles (\(X\)).
Negatively sloped, reflecting the negative correlation: when variations in miles are positive, variations in prices are negative, on average in this sample.
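All three steps in one Python sketch (ours; np.cov and ddof=1 both use the \(N-1\) divisor the text describes):

```python
import numpy as np

miles = np.array([43521, 31002, 18868, 12339, 9997])
price = np.array([12500, 13350, 14600, 15750, 17500])

# Step 1: sample covariance (np.cov divides by N - 1 by default).
cov_xy = np.cov(miles, price)[0, 1]      # about -25,791,807.5

# Step 2: sample standard deviations.
s_x = miles.std(ddof=1)                  # 14007.5524
s_y = price.std(ddof=1)                  # 1975.2848

# Step 3: rescale the covariance into the -1..+1 range.
r_xy = cov_xy / (s_x * s_y)
print(r_xy)                              # about -0.932

# The slope is the covariance over the variance of miles, as promised.
print(cov_xy / s_x**2)                   # about -0.1314
```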
Suppose we find a car at the auction with an odometer reading of 15010 miles. What price would we expect based on our model?
Sure. We found that if we knew \(b_1=-0.1314\), then we could calculate at the cross-hairs of average \(X\) and \(Y\) that \(b_0=17782.4396\). We substitute these into our model to get
\[
\hat{Y} = 17782.4396 - 0.1314(15010) = 15810.13
\]
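As a quick check in code (our sketch of the same plug-in):

```python
# Plugging the hypothetical 15010-mile odometer reading into the line.
b0, b1 = 17782.4396, -0.1314
print(b0 + b1 * 15010)  # 15810.1256, about USD 15,810
```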
Would we be willing to pay that amount? We might want to look at other factors such as the age and condition of the car, among many other features of our auction in New Jersey.
The tails have it!
To round out our discussion about the shape of data, we ask two more questions:
What direction do the tails of the distribution tend to go, to the right or the left?
How thick are the tails?
The first question gets at the asymmetry in lots of data. An example is this data on losses from trading common stock in solar power companies. We gather several days of stock prices, then compute returns as percentage changes in the daily prices. A loss occurs whenever the returns are negative. Suppose that the latest price per share is USD 32.13 and we have 100,000 shares.
Let’s eyeball answers to the questions.
What direction does the distribution tend to?
This data looks like the skewness is to the right: there is a preponderance of observations in the body, to the left of the 100,000 mark. The distribution therefore looks like it is right- or positively-skewed.
How thick tailed is the distribution?
The losses seem to be somewhat frequent in the tails. It looks like the distribution might be thick tailed. In financial terms this means that the volatility (standard deviation, IQR) is itself volatile.
The answers to these questions can also come from two more aggregations, both based on deviations and variation of data points from their means. Skewness needs a direction, so a positive metric will mean deviations can be found in the positive tail on average, while a negative metric will indicate a net average long tail in the opposite direction. Here is a metric that does this:
\[
skew = \frac{\Sigma_{i=1}^N (X_i - \bar{X})^3 / (N-1)}{s^3}
\]
Just like correlation is made dimensionless by scaling the covariance with the product of standard deviations, so the skewness measure looks at the cubed deviations per cubed standard deviation.
What direction is the price data?
Using the formula we construct a table like so.
| i | \(Y\) = price | \(Y-\bar{Y}\) | \((Y-\bar{Y})^3\) |
|---|-------|-------|---------------|
| 1 | 12500 | -2240 | -11239424000 |
| 2 | 13350 | -1390 |  -2685619000 |
| 3 | 14600 |  -140 |     -2744000 |
| 4 | 15750 |  1010 |   1030301000 |
| 5 | 17500 |  2760 |  21024576000 |
We see lots of negative cubed deviations, yet the sum nets out positive. Let’s add them up to get 8127090000. Divide this by \(N-1=4\) and we get 2031772500. Dividing by the cube of the standard deviation, \(1975.2848^3\), we see our result: a skewness of about 0.26.
We are sure to check out the units of measurement here: after scaling by \(s^3\) the metric is dimensionless. The skew is positive; the large deviations in price sit above the mean. What about miles?
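A sketch of the skewness formula as a function (ours), following the text's \(N-1\) convention; it also answers the miles question:

```python
# Skewness: cubed deviations averaged over N - 1, then scaled by
# the cubed sample standard deviation.
def skew(data):
    n = len(data)
    mu = sum(data) / n
    s = (sum((x - mu) ** 2 for x in data) / (n - 1)) ** 0.5
    return sum((x - mu) ** 3 for x in data) / (n - 1) / s**3

print(skew([12500, 13350, 14600, 15750, 17500]))  # about 0.26 (price)
print(skew([43521, 31002, 18868, 12339, 9997]))   # about 0.48 (miles)
```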
By the by, the solar loss distribution does indeed compute a positive skewness. But we could see that with our own eyes as well.
So now, how thick is the tail? Here we can use a variant of the skew, just by squaring the variance term. Kurtosis describes the thickness of the tails. Just as skewness tells us whether the deviations are on average above or below the mean, so kurtosis tells us how volatile the standard deviation is:
\[
kurtosis = \frac{\Sigma_{i=1}^N (X_i - \bar{X})^4 / (N-1)}{s^4}
\]
Try this formula out for kurtosis on the price variable.
This time the fourth-power deviations are all positive. Let’s add them up to get 87978138100000. Divide this by \(N-1=4\) and we get 21994534525000. Dividing by the fourth power of the standard deviation, \(1975.2848^4 = 15223653062500\), we see our result: a kurtosis of about 1.44.
The kurtosis is very small, indeed a very thin tail. This indicates a very stable volatility in prices. We will see later with the normal distribution that the normal kurtosis is 3.00. Compared to the normal distribution, the price distribution is very thin tailed. An implication might be that it is a rare occurrence to see a high price in this sample. We must remember that there are only 5 data points in the first place!
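And the kurtosis formula as a function, same recipe one power higher (our sketch):

```python
# Kurtosis: fourth-power deviations averaged over N - 1, scaled by
# the squared sample variance (s^4).
def kurtosis(data):
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / (n - 1)
    return sum((x - mu) ** 4 for x in data) / (n - 1) / var**2

print(kurtosis([12500, 13350, 14600, 15750, 17500]))  # about 1.44
```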
The solar loss distribution has a kurtosis that is only slightly greater than 3.
How can we use all of this?
At least five uses come to mind. All of these build on a foundation of where (location, tendency) a distribution tends to land.
Empirical Rule: if we think the distribution is symmetrical (“normal”) then the proportion of observations will be

| distance | proportion (%) |
|----------|----------------|
| \(\mu \pm 1 \sigma\) | 68.0 |
| \(\mu \pm 2 \sigma\) | 95.0 |
| \(\mu \pm 3 \sigma\) | 99.7 |
Chebychev: regardless of the shape of a distribution, at least 75% of all values lie within \(\pm 2 \sigma\) of the mean; in general, for \(\mu \pm k \sigma\), \(k\) standard deviations, the proportion of observations is at least
\[
1 - \frac{1}{k^2}
\]
Outlier analysis: using IQR as a quantile-inspired deviation we can build fences, as Tukey (1977) once recommended (see the code sketch after this list):
We add 1.5 times IQR to \(Q_3\) and see if any data exceed that fence, and
We subtract 1.5 times IQR from \(Q_1\) and see if any data fall short of that fence.
Coefficient of Variation: the ratio of the standard deviation to the mean expressed in percentage and is denoted CV
\[
CV = \frac{\sigma}{\mu} \times 100
\]
Comparisons using the z-score: converts a value into standard deviation units; the number of standard deviations a value lies above or below the mean of the distribution
\[
z = \frac{x - \mu}{\sigma}
\]
Higher moments. The third and fourth moments are present in the skewness (cubed) and kurtosis (fourth power). Very nice math with a meaningful punch. Skewness will help us understand if deviations are above (positive skewness) or below (negative skewness) the average. Kurtosis will tell us if the standard deviation varies more or less than the normal distribution average of 3.00 (to be dug into later, of course).
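Here is the promised sketch pulling the Tukey fences, coefficient of variation, z-score, and Chebychev bound together for the miles data (plain Python; the helper names are ours, and we use the "inclusive" quantile method so the quartiles match the text's):

```python
import statistics

miles = [43521, 31002, 18868, 12339, 9997]

def tukey_outliers(data, k=1.5):
    """Flag points beyond Q1 - k*IQR or Q3 + k*IQR (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return [x for x in data if x < q1 - k * iqr or x > q3 + k * iqr]

def cv(data):
    """Coefficient of variation: the sd as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

def z_score(x, data):
    """Standard deviation units above (+) or below (-) the mean."""
    return (x - statistics.mean(data)) / statistics.stdev(data)

def chebychev(k):
    """Lower bound on the proportion within k standard deviations."""
    return 1 - 1 / k**2

print(tukey_outliers(miles))  # [] -- no fence-breakers among our five cars
print(cv(miles))              # about 60.5 (percent)
print(z_score(43521, miles))  # about 1.45
print(chebychev(2))           # 0.75
```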
Problems, always problems
Shown below are the top nine leading retailers in the United States in a recent year according to Statista.com. Compute the range, IQR, MAD, standard deviation, and CV; invoke the empirical rule and Chebychev’s Theorem; and find the z-score for each. Treat this as a sample (then what’s the population?).
Answer
\[
MAD = \frac{\Sigma_{i=1}^N |x_i - \mu|}{N}
\]\[
\mu = \frac{\Sigma_{i=1}^N x_i}{N}=\frac{917.99}{9} = 102.00
\]\[
MAD = \frac{485.3044}{9} = 53.9227
\] - Standard Deviation: first the square of the standard deviation is the variance
\[
\sigma^2 = \frac{\Sigma_{i=1}^N (x_i - \mu)^2}{N}
\] next, the standard deviation is
\[
\sigma = \sqrt{\sigma^2}
\] A slightly easier way to compute \(\sigma^2\) is
\[
\sigma^2 = \frac{\Sigma_{i=1}^N x_i^2}{N} - \mu^2
\]
Empirical Rule: if we think the distribution is symmetrical (“normal”) then the proportion of observations will be as in the table below, and
Chebychev: regardless of the shape of a distribution, at least 75% of all values lie within \(\pm 2 \sigma\) of the mean; for \(\mu \pm k \sigma\), \(k\) standard deviations, the proportion of observations is at least
\[
1 - \frac{1}{k^2}
\]
| distance | proportion (%) | k | distance (data) | Chebychev |
|----------|----------------|---|-----------------|-----------|
| \(\mu \pm 1 \sigma\) | 68.0 | 1 | \(101 \pm 1 \times 92\) | |
| \(\mu \pm 2 \sigma\) | 95.0 | 2 | \(101 \pm 2 \times 92\) | 0.75 |
| \(\mu \pm 3 \sigma\) | 99.7 | 3 | \(101 \pm 3 \times 92\) | 0.89 |
Coefficient of Variation: the ratio of the standard deviation to the mean expressed in percentage and is denoted CV
z-score: converts a value into standard deviation units; the number of standard deviations a value lies above or below the mean of the distribution. To compute any z-score we first need the sample standard deviation:
\[
s^2 = \frac{\Sigma_{i=1}^N (x_i - \bar{x})^2}{N-1} = \frac{67529.0961}{9 -1} = 8441.137
\] and then
\[
s = \sqrt{s^2} = 91.8757
\]
Shown below are the per diem business travel expenses in 11 international cities selected from a study conducted for Business Travel News’ 2015 Corporate Travel Index, an annual study showing the daily cost of business travel in cities across the globe. The per diem rates include hotel, car, and food expenses. Use this list to calculate the z-scores for Lagos, Riyadh, and Bangkok. Treat the list as a sample.
| city | expense |
|------|---------|
| London | 576 |
| Mexico City | 240 |
| Tokyo | 484 |
| Bangalore | 199 |
| Bangkok | 234 |
| Riyadh | 483 |
| Lagos | 506 |
| Cape Town | 230 |
| Zurich | 508 |
| Paris | 483 |
| Guatemala City | 213 |
Answer
Range: the distance between the max and min
\[
Range = |max(x) - min(x)| = 576 - 199 = 377
\]
IQR: \(Q_3\) net of \(Q_1\) (\(P_{75} - P_{25}\))
\[
IQR = Q_3 - Q_1 = 495 - 232 = 263
\]
Mean Absolute Deviation: MAD is robust
\[
MAD = \frac{\Sigma_{i=1}^N |x_i - \mu|}{N}
\]\[
\mu = \frac{\Sigma_{i=1}^N x_i}{N}=\frac{4156}{11} = 377.8182
\]\[
MAD = \frac{1546.1818}{11} = 140.562
\] - Standard Deviation: first the square of the standard deviation is the variance
\[
\sigma^2 = \frac{\Sigma_{i=1}^N (x_i - \mu)^2}{N}
\] next, the standard deviation is
\[
\sigma = \sqrt{\sigma^2}
\] A slightly easier way to compute \(\sigma^2\) is
\[
\sigma^2 = \frac{\Sigma_{i=1}^N x_i^2}{N} - \mu^2
\]
Empirical Rule: if we think the distribution is symmetrical (“normal”) then the proportion of observations will be as in the table below, and
Chebychev: regardless of the shape of a distribution, at least 75% of all values lie within \(\pm 2 \sigma\) of the mean; for \(\mu \pm k \sigma\), \(k\) standard deviations, the proportion of observations is at least
\[
1 - \frac{1}{k^2}
\]
| distance | proportion | k | distance (data) | Chebychev |
|----------|------------|---|-----------------|-----------|
| \(\mu \pm 1 \sigma\) | 0.680 | 1 | \(378 \pm 1 \times 151\) | |
| \(\mu \pm 2 \sigma\) | 0.950 | 2 | \(378 \pm 2 \times 151\) | 0.75 |
| \(\mu \pm 3 \sigma\) | 0.997 | 3 | \(378 \pm 3 \times 151\) | 0.89 |
Coefficient of Variation: the ratio of the standard deviation to the mean expressed in percentage and denoted CV.
z-score: converts a value into standard deviation units; the number of standard deviations a value lies above or below the mean of the distribution. For the first data point, London, using \(\bar{x} = 377.8182\) and \(s = 150.5734\),
\[
z = \frac{576 - 377.8182}{150.5734} = 1.3162
\] and for the three requested cities we get \(z_{Lagos} = 0.8513\), \(z_{Riyadh} = 0.6985\), and \(z_{Bangkok} = -0.9551\).
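A final sketch to check the z-scores (the dictionary layout is ours; the data are from the table above):

```python
import statistics

# Per diem sample; statistics.stdev uses the N - 1 divisor.
expense = {
    "London": 576, "Mexico City": 240, "Tokyo": 484, "Bangalore": 199,
    "Bangkok": 234, "Riyadh": 483, "Lagos": 506, "Cape Town": 230,
    "Zurich": 508, "Paris": 483, "Guatemala City": 213,
}

mu = statistics.mean(expense.values())   # 377.8182
s = statistics.stdev(expense.values())   # about 150.57

for city in ("Lagos", "Riyadh", "Bangkok"):
    print(city, round((expense[city] - mu) / s, 4))
# Lagos 0.8513, Riyadh 0.6985, Bangkok -0.9551
```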