Using various measures of deviation from a central tendency, calculate the scale of data.
Identify the applicability of each deviation measure to answer business questions.
Formulate and explain an interpretation of deviation measure results to answer business questions.
Deviation, distance, variation, scale, residual, and even error are very similar concepts. They form the backbone of statistics because they measure how far away actual data is from our model, our supposition, our hypothesis, our belief of where we would like to think the data is. And of course there are lots of ways to compute distance and deviation.
We just computed several locations, trends, averages, percentiles, even quartiles with a box! Each of these is our peculiar mental image of where we think the data ought to be, at least most of the time. The rest of the time the data is wherever it happens to be when we stumble on it. To get to an understanding of scale, distance, and deviation we had to form a belief, calculate an average, build a trend. Now we move on to a more complete story of any particular data stream.
Rev up the deuce
This is where we will start, again. This time we are going to mix it up a bit. We are back at our car auction in New Jersey. We have heard enough about price and now we want to know why price might vary. One explanation from our far-reaching understanding of economic choice is that we might be buying car quality. One measure of quality is how many miles the previous owner drove the car. We read odometers to get that data. We now have two sets (vectors) of data. Each observation is a pair of measurements for each car: price and miles.
| i | price | miles |
|---|-------|-------|
| 1 | 12500 | 43521 |
| 2 | 13350 | 31002 |
| 3 | 14600 | 18868 |
| 4 | 15750 | 12339 |
| 5 | 17500 |  9997 |
We can set up three ways of thinking about our car data:
Price (\(Y_i\)) variations on their own: \(Y_i=\bar{Y}+e_i\)
Miles (\(X_i\)) variations on their own: \(X_i=\bar{X}+e_i\)
Price dependency on miles: \(Y_i=b_0+b_1X_i+e_i\)
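To make this setup concrete, here is a minimal sketch in Python (ours, not part of the original materials) that stores the five data pairs and computes the two sample means that anchor models 1 and 2:

```python
# The five auction cars from the table above.
price = [12500, 13350, 14600, 15750, 17500]   # Y_i
miles = [43521, 31002, 18868, 12339, 9997]    # X_i

# Arithmetic means: the "models" for price and miles on their own.
y_bar = sum(price) / len(price)
x_bar = sum(miles) / len(miles)

print(y_bar, x_bar)  # 14740.0 23145.4
```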
To put all of these models together on one graph we can plot each of the 5 points on a scatter plot like so:
Let’s look at the Y-axis, price, first and draw a horizontal line to depict the arithmetic average of price, \(\bar{Y}\). Then draw error bars from each data point to this line, all in the Y-direction.
This looks a lot like the error bar plot we generated before, where now miles seems to act like the observation index. Try to draw a similar plot of the X-axis error bars around the arithmetic mean of miles.
We just had to rotate our thinking by 90 degrees clockwise, that’s all. Hover over the vertical average line to see the mean of miles as \(\bar{X}\).
Now let’s layer one graph on the other. This will get a little dense, but it will illustrate the interaction of price and miles well enough.
We now have price and miles in the cross-hairs and begin to see not just the individual variations of price and miles about their respective means, but can also begin to visualize the co-variation of price and miles.
To illustrate model 3, price dependency on miles: \(Y_i=b_0+b_1X_i+e_i\), let’s draw the best line we can through the scatter, one that will minimize squared deviations of price about our odometer-inspired model of price. Let’s leave that calculation till later. Willingly we will suspend any disbelief about that calculation. Let’s instead believe the numbers to be true. Let’s also keep the cross-hairs. Here we go.
Hover over the dark blue error bars that connect data pairs of miles and price to a new average, a downward-sloping straight (linear, that is) line. If we were told that the average slope is -0.1314, what would we think the average Y-intercept is?
Our new average price is an estimate of the average of \(Y\) conditional on miles, \(X\). In terms of the averages \(\bar{Y}\) and \(\bar{X}\), it is
\[
\bar{Y} = b_0 + b_1 \bar{X}
\] We know that \(\bar{Y}=14740\), \(\bar{X}=23145.4\), and have just been informed that the slope parameter \(b_1=-0.1314\). So we plug (that is substitute) these numbers into the formula to get this.
\[
14740=b_0-0.1314(23145.4)
\] We now solve for the Y-intercept \(b_0\) to find this result.
\[
b_0=14740+0.1314(23145.4)=17782.4396
\] Phew! Not so bad, just one equation in one unknown.
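The same one-equation solution as a quick code sketch (note the small gap from the text's 17782.4396, which comes from rounding the slope to four decimals):

```python
# Rearranging Y_bar = b0 + b1 * X_bar gives b0 = Y_bar - b1 * X_bar.
b1 = -0.1314
b0 = 14740 - b1 * 23145.4
print(b0)  # 17781.31, about a dollar off the text's unrounded-slope answer
```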
We now have jumped into hyper-space (at least 2-space) to expand our consciousness from univariate to bivariate relationships.
Our next stop is to use these and other versions of deviations to get a handle on the scale, range, variation, deviation, and yes, error in the data, at least with respect to our beliefs about the data. We just expanded our beliefs from univariate to multivariate. When we get to multivariate we will look at how variations relate to one another, in our case bivariate variations in price relating to variations in miles and vice-versa.
Blinded by the light
What’s a sample? (again)
Let’s recall what a sample is. It is a random draw from a larger group of data called a population. The word random derives from the Frankish (a Germanic language from a while back) word rant, much like our word rant, and by about 1880 it came to mean simply indiscriminate.
Our sample of auction prices is whatever we could get our hands on at the time of the sampling, thus a sort of random draw from the population of all auctioned cars. This begins to allow us to look at the error bars we generated as random deviations from the mean.
Let’s also recall that this drawing of the random sample is the first analytical step after identifying a business question and a population from which to sample.
So we have a sample of two univariate data series. We have also thought it reasonable to relate the two series together. Let’s now find some measures of scale, deviation, and variation for each of the univariate series.
1. Standard deviation
If \(X\) is a sample (subset) indexed by \(i = 1 \dots N\) with \(N\) elements from the population, then we already know that
\[
\bar{X} = \frac{\Sigma_{i=1}^N X_i}{N}
\]
Our model of deviations from the mean produced error terms \(e_i=X_i-\bar{X}\), which we assigned to the miles univariate data series. If we were to add up these deviations like this
\[
\Sigma_{i=1}^{5}(X_i-\bar{X})
\]
what would we get?
ZERO.
Right, this aggregation of deviations will always give us zero, by definition.
Again, like we did to find the best average, let’s calculate the sum of squared errors, the \(SSE\). Because we sampled the data, the \(SSE\) will have to be divided by \(N-1=5-1=4\), called, for the moment, the number of degrees of freedom, to get an average. This will produce an unbiased measure, another topic for another time!
\[
s^2 = \frac{\Sigma_{i=1}^N (X_i - \bar{X})^2}{N-1}
\] The squared measure \(s^2\) is officially called the variance. Its square root
\[
s = \sqrt{s^2}
\]
is the standard deviation. The notion of standard is that of an average. We already know what a deviation is.
Let’s build a table of four columns to calculate this measure of scale, deviation, error, and variation.
| i | miles | deviation | deviation squared |
|---|-------|-----------|-------------------|
| 1 | 43521 |  20376 | 415165075 |
| 2 | 31002 |   7857 |  61726164 |
| 3 | 18868 |  -4277 |  18296151 |
| 4 | 12339 | -10806 | 116778281 |
| 5 |  9997 | -13148 | 172880423 |
Some really big numbers emerge. That is typical and we often try to scale these down when performing computations, again a topic for later.
Let’s sum up the deviations again (just to prove they add up to zero) and the squared deviations (\(SSE\))
\[
var(miles)=s^2=\frac{784846093.2}{5 - 1}=196211523.3
\] and then
\[
s = \sqrt{196211523.3} = 14007.5524
\]
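Here is a short Python sketch of the same calculation; the standard library's statistics module uses the same \(N-1\) convention:

```python
import statistics

miles = [43521, 31002, 18868, 12339, 9997]
x_bar = sum(miles) / len(miles)                 # 23145.4

sse = sum((x - x_bar) ** 2 for x in miles)      # 784846093.2
var = sse / (len(miles) - 1)                    # divide by N - 1 = 4
sd = var ** 0.5

print(var, statistics.variance(miles))          # both about 196211523.3
print(sd, statistics.stdev(miles))              # both about 14007.55
```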
2. Interquartile Range: IQR is \(Q_3\) net of \(Q_1\) (\(P_{75} - P_{25}\)) and gives us a robust view like that in Tukey’s (1977) box plot. What is the IQR of miles and price?
We should have gotten
\[
IQR_{miles}=31002-12339=18663
\] and
\[
IQR_{price}=15750-13350=2400
\]
A wider, broader scale than the standard deviations? Yes. This measure is robust even for highly skewed distributions with thick tails.
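A sketch of the IQR computation, assuming numpy's default linear interpolation of percentiles (which, with five points, lands exactly on our quartiles):

```python
import numpy as np

miles = np.array([43521, 31002, 18868, 12339, 9997])
price = np.array([12500, 13350, 14600, 15750, 17500])

# With five points, the default interpolation puts Q1 and Q3
# exactly on the 2nd and 4th sorted observations.
q1_m, q3_m = np.percentile(miles, [25, 75])
q1_p, q3_p = np.percentile(price, [25, 75])

print(q3_m - q1_m)  # 18663.0
print(q3_p - q1_p)  # 2400.0
```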
3. Mean Absolute Deviation: MAD is robust to outliers.
\[
MAD = \frac{\Sigma_{i=1}^N |X_i - \mu|}{N}
\] Let’s compute this statistic for price and miles.
This time we should have gotten for \(X_i= miles_i\):
\[
MAD_{miles} = \frac{56464.4}{5} = 11292.88
\] and, for price, \(MAD_{price} = \frac{7540}{5} = 1508\).
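And a sketch of MAD as a small helper function (ours), which reproduces both numbers:

```python
# Mean absolute deviation: the average unsigned distance from the mean.
def mad(data):
    mu = sum(data) / len(data)
    return sum(abs(x - mu) for x in data) / len(data)

print(mad([43521, 31002, 18868, 12339, 9997]))   # 11292.88 for miles
print(mad([12500, 13350, 14600, 15750, 17500]))  # 1508.0 for price
```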
4. Correlation: this measures the degree of relationship between two variables. The measure ranges from a low of -1 to a high of +1. A -1 is a perfectly correlated inverse relationship between two variables. A +1 measures a perfectly positive relationship between two variables. A 0 indicates that no relationship seems to exist. This is not cause and effect (what we otherwise call an antecedent-consequent relationship); it is just two variables happening to bump together, or not, in the street one day in the Bronx.
Three steps to a correlation.
Calculate the covariance between two variables, \(X_i\) and \(Y_i\).
\[
cov(X, Y) = s_{xy} = \frac{\Sigma_{i=1}^N(X_i-\bar{X})(Y_i-\bar{Y})}{N-1}
\] The numerator sums up the pairwise ups and downs of how \(X\) varies with \(Y\). This number may net out to positive, negative, or just very close to zero. We lose at least one degree of freedom because we have to use the arithmetic mean to calculate deviations, just like we did for the sample standard deviation.
Calculate the standard deviations for each of the two variables, \(X_i\) and \(Y_i\). First, the variance, here illustrated for \(X\), miles in our example.
\[
var(X) = s_{x}^2 = \frac{\Sigma_{i=1}^N(X_i-\bar{X})^2}{N-1} = \frac{784846093.2}{5-1} = 196211523.3
\]\[
s_X = \sqrt{var(X)} = 14007.5524
\] and for \(Y\) as price we have
\[
s_Y = 1975.2848
\]
Calculate the ratio of covariance to the product of the standard deviations. This step transforms the unwieldy, and hard to interpret, covariance from the \(-\infty...+\infty\) range to the \(-1...+1\) range. For samples we call the correlation \(r_{XY}\):
\[
r_{XY} = \frac{s_{XY}}{s_X s_Y}
\]
We will show later (yes, with calculus – free of charge) that the slope parameter \(b_1\) is just the ratio of the covariance of miles (\(X\)) and price (\(Y\)) to the variance of miles (\(X\)).
Negatively sloped, reflecting the negative correlation: when variations in miles are positive, variations in prices are negative, on average in this sample.
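All three steps in one Python sketch (ours; np.cov and ddof=1 both use the \(N-1\) divisor the text describes):

```python
import numpy as np

miles = np.array([43521, 31002, 18868, 12339, 9997])
price = np.array([12500, 13350, 14600, 15750, 17500])

# Step 1: sample covariance (np.cov divides by N - 1 by default).
cov_xy = np.cov(miles, price)[0, 1]      # about -25,791,807.5

# Step 2: sample standard deviations.
s_x = miles.std(ddof=1)                  # 14007.5524
s_y = price.std(ddof=1)                  # 1975.2848

# Step 3: rescale the covariance into the -1..+1 range.
r_xy = cov_xy / (s_x * s_y)
print(r_xy)                              # about -0.932

# The slope is the covariance over the variance of miles, as promised.
print(cov_xy / s_x**2)                   # about -0.1314
```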
Suppose we find a car at the auction with an odometer reading of 15010 miles. What price would we expect based on our model?
Sure. We found that if we knew \(b_1=-0.1314\), then we could calculate at the cross-hairs of average \(X\) and \(Y\) that \(b_0=17782.4396\). We substitute these into our model to get
\[
\hat{Y} = 17782.4396 - 0.1314(15010) = 15810.13
\]
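As a quick check in code (our sketch of the same plug-in):

```python
# Plugging the hypothetical 15010-mile odometer reading into the line.
b0, b1 = 17782.4396, -0.1314
print(b0 + b1 * 15010)  # 15810.1256, about USD 15,810
```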
Would we be willing to pay that amount? We might want to look at other factors such as the age and condition of the car, among many other features of our auction in New Jersey.
The tails have it!
To round out our discussion about the shape of data, we ask two more questions:
What direction do the tails of the distribution tend to go, to the right or the left?
How thick are the tails?
The first question gets at the asymmetry in lots of data. An example is this data on losses from trading common stock in solar power companies. We gather several days of stock prices, then compute returns as percentage changes in the daily prices. A loss occurs whenever the returns are negative. Suppose that the latest price per share is USD 32.13 and we have 100,000 shares.
Let’s eyeball answers to the questions.
What direction does the distribution tend to?
This data looks like the skewness is to the right: there is a preponderance of observations in the body, to the left of the 100,000 mark. The distribution therefore looks like it is right- or positively-skewed.
How thick tailed is the distribution?
The losses seem to be somewhat frequent in the tails. It looks like the distribution might be thick tailed. In financial terms this means that the volatility (standard deviation, IQR) is itself volatile.
The answers to these questions can also come from two more aggregations, both based on deviations and variation of data points from their means. Skewness needs a direction, so a positive metric will mean deviations can be found in the positive tail on average, while a negative metric will indicate a net average long tail in the opposite direction. Here is a metric that does this:
\[
skew = \frac{\Sigma_{i=1}^N (X_i - \bar{X})^3 / (N-1)}{s^3}
\]
Just like correlation is made dimensionless by scaling the covariance with the product of standard deviations, so the skewness measure looks at the cubed deviations per cubed standard deviation.
What direction is the price data?
Using the formula we construct a table like so.
| i | \(Y\) = price | \(Y-\bar{Y}\) | \((Y-\bar{Y})^3\) |
|---|-------|-------|---------------|
| 1 | 12500 | -2240 | -11239424000 |
| 2 | 13350 | -1390 |  -2685619000 |
| 3 | 14600 |  -140 |     -2744000 |
| 4 | 15750 |  1010 |   1030301000 |
| 5 | 17500 |  2760 |  21024576000 |
We see lots of negative cubed deviations, yet the sum nets out positive. Let’s add them up to get 8127090000. Divide this by \(N-1=4\) and we get 2031772500. Dividing by the cube of the standard deviation, \(1975.2848^3\), we see our result: a skewness of about 0.26.
We are sure to check out the units of measurement here: after scaling by \(s^3\) the metric is dimensionless. The skew is positive; the large deviations in price sit above the mean. What about miles?
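A sketch of the skewness formula as a function (ours), following the text's \(N-1\) convention; it also answers the miles question:

```python
# Skewness: cubed deviations averaged over N - 1, then scaled by
# the cubed sample standard deviation.
def skew(data):
    n = len(data)
    mu = sum(data) / n
    s = (sum((x - mu) ** 2 for x in data) / (n - 1)) ** 0.5
    return sum((x - mu) ** 3 for x in data) / (n - 1) / s**3

print(skew([12500, 13350, 14600, 15750, 17500]))  # about 0.26 (price)
print(skew([43521, 31002, 18868, 12339, 9997]))   # about 0.48 (miles)
```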
By the by, the solar loss distribution does indeed compute a positive skewness. But we could see that with our own eyes as well.
So now, how thick is the tail? Here we can use a variant of the skew, just by squaring the variance term. Kurtosis describes the thickness of the tails. Just as skewness tells us whether the deviations are on average above or below the mean, so kurtosis tells us how volatile the standard deviation is:
\[
kurtosis = \frac{\Sigma_{i=1}^N (X_i - \bar{X})^4 / (N-1)}{s^4}
\]
Try this formula out for kurtosis on the price variable.
This time the fourth-power deviations are all positive. Let’s add them up to get 87978138100000. Divide this by \(N-1=4\) and we get 21994534525000. Dividing by the fourth power of the standard deviation, \(1975.2848^4 = 15223653062500\), we see our result: a kurtosis of about 1.44.
The kurtosis is very small, indeed a very thin tail. This indicates a very stable volatility in prices. We will see later with the normal distribution that the normal kurtosis is 3.00. Compared to the normal distribution, the price distribution is very thin tailed. An implication might be that it is a rare occurrence to see a high price in this sample. We must remember that there are only 5 data points in the first place!
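And the kurtosis formula as a function, same recipe one power higher (our sketch):

```python
# Kurtosis: fourth-power deviations averaged over N - 1, scaled by
# the squared sample variance (s^4).
def kurtosis(data):
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / (n - 1)
    return sum((x - mu) ** 4 for x in data) / (n - 1) / var**2

print(kurtosis([12500, 13350, 14600, 15750, 17500]))  # about 1.44
```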
The solar loss distribution has a kurtosis that is only slightly greater than 3.
How can we use all of this?
At least five uses come to mind. All of these build on a foundation of where (location, tendency) a distribution tends to land.
Empirical Rule: if we think the distribution is symmetrical (“normal”) then the proportion of observations will be

| distance | proportion (%) |
|----------|----------------|
| \(\mu \pm 1 \sigma\) | 68.0 |
| \(\mu \pm 2 \sigma\) | 95.0 |
| \(\mu \pm 3 \sigma\) | 99.7 |
Chebychev: regardless of the shape of a distribution, at least 75% of all values lie within \(\pm 2 \sigma\) of the mean; in general, for \(\mu \pm k \sigma\), \(k\) standard deviations, the proportion of observations is at least
\[
1 - \frac{1}{k^2}
\]
Outlier analysis: using IQR as a quantile-inspired deviation we can build fences, as Tukey (1977) once recommended (see the code sketch after this list):
We add 1.5 times IQR to \(Q_3\) and see if any data exceed that fence, and
We subtract 1.5 times IQR from \(Q_1\) and see if any data fall short of that fence.
Coefficient of Variation: the ratio of the standard deviation to the mean expressed in percentage and is denoted CV
\[
CV = \frac{\sigma}{\mu} \times 100
\]
Comparisons using the z-score: converts a value into standard deviation units; the number of standard deviations a value lies above or below the mean of the distribution
\[
z = \frac{x - \mu}{\sigma}
\]
Higher moments. The third and fourth moments are present in the skewness (cubed) and kurtosis (fourth power). Very nice math with a meaningful punch. Skewness will help us understand if deviations are above (positive skewness) or below (negative skewness) the average. Kurtosis will tell us if the standard deviation varies more or less than the normal distribution average of 3.00 (to be dug into later, of course).
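Here is the promised sketch pulling the Tukey fences, coefficient of variation, z-score, and Chebychev bound together for the miles data (plain Python; the helper names are ours, and we use the "inclusive" quantile method so the quartiles match the text's):

```python
import statistics

miles = [43521, 31002, 18868, 12339, 9997]

def tukey_outliers(data, k=1.5):
    """Flag points beyond Q1 - k*IQR or Q3 + k*IQR (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return [x for x in data if x < q1 - k * iqr or x > q3 + k * iqr]

def cv(data):
    """Coefficient of variation: the sd as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100

def z_score(x, data):
    """Standard deviation units above (+) or below (-) the mean."""
    return (x - statistics.mean(data)) / statistics.stdev(data)

def chebychev(k):
    """Lower bound on the proportion within k standard deviations."""
    return 1 - 1 / k**2

print(tukey_outliers(miles))  # [] -- no fence-breakers among our five cars
print(cv(miles))              # about 60.5 (percent)
print(z_score(43521, miles))  # about 1.45
print(chebychev(2))           # 0.75
```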
Problems, always problems
Shown below are the top nine leading retailers in the United States in a recent year according to Statista.com. Compute the range, IQR, MAD, standard deviation, and CV; invoke the empirical rule and Chebychev’s Theorem; and find the z-score for each. Treat this as a sample (then what’s the population?).
Answer
\[
MAD = \frac{\Sigma_{i=1}^N |x_i - \mu|}{N}
\]\[
\mu = \frac{\Sigma_{i=1}^N x_i}{N}=\frac{917.99}{9} = 102.00
\]\[
MAD = \frac{485.3044}{9} = 53.9227
\] - Standard Deviation: first the square of the standard deviation is the variance
\[
\sigma^2 = \frac{\Sigma_{i=1}^N (x_i - \mu)^2}{N}
\] next, the standard deviation is
\[
\sigma = \sqrt{\sigma^2}
\] A slightly easier way to compute \(\sigma^2\) is
\[
\sigma^2 = \frac{\Sigma_{i=1}^N x_i^2}{N} - \mu^2
\]
Empirical Rule: if we think the distribution is symmetrical (“normal”) then the proportion of observations will be as in the table below, and
Chebychev: regardless of the shape of a distribution, at least 75% of all values lie within \(\pm 2 \sigma\) of the mean; for \(\mu \pm k \sigma\), \(k\) standard deviations, the proportion of observations is at least
\[
1 - \frac{1}{k^2}
\]
| distance | proportion (%) | k | distance (data) | Chebychev |
|----------|----------------|---|-----------------|-----------|
| \(\mu \pm 1 \sigma\) | 68.0 | 1 | \(101 \pm 1 \times 92\) | |
| \(\mu \pm 2 \sigma\) | 95.0 | 2 | \(101 \pm 2 \times 92\) | 0.75 |
| \(\mu \pm 3 \sigma\) | 99.7 | 3 | \(101 \pm 3 \times 92\) | 0.89 |
Coefficient of Variation: the ratio of the standard deviation to the mean expressed in percentage and is denoted CV
z-score: converts a value into standard deviation units; the number of standard deviations a value lies above or below the mean of the distribution. To compute any z-score we first need the sample standard deviation:
\[
s^2 = \frac{\Sigma_{i=1}^N (x_i - \bar{x})^2}{N-1} = \frac{67529.0961}{9 -1} = 8441.137
\] and then
\[
s = \sqrt{s^2} = 91.8757
\]
Shown below are the per diem business travel expenses in 11 international cities selected from a study conducted for Business Travel News’ 2015 Corporate Travel Index, an annual study showing the daily cost of business travel in cities across the globe. The per diem rates include hotel, car, and food expenses. Use this list to calculate the z-scores for Lagos, Riyadh, and Bangkok. Treat the list as a sample.
| city | expense |
|------|---------|
| London | 576 |
| Mexico City | 240 |
| Tokyo | 484 |
| Bangalore | 199 |
| Bangkok | 234 |
| Riyadh | 483 |
| Lagos | 506 |
| Cape Town | 230 |
| Zurich | 508 |
| Paris | 483 |
| Guatemala City | 213 |
Answer
Range: the distance between the max and min
\[
Range = |max(x) - min(x)| = 576 - 199 = 377
\]
IQR: \(Q_3\) net of \(Q_1\) (\(P_{75} - P_{25}\))
\[
IQR = Q_3 - Q_1 = 495 - 232 = 263
\]
Mean Absolute Deviation: MAD is robust
\[
MAD = \frac{\Sigma_{i=1}^N |x_i - \mu|}{N}
\]\[
\mu = \frac{\Sigma_{i=1}^N x_i}{N}=\frac{4156}{11} = 377.8182
\]\[
MAD = \frac{1546.1818}{11} = 140.562
\] - Standard Deviation: first the square of the standard deviation is the variance
\[
\sigma^2 = \frac{\Sigma_{i=1}^N (x_i - \mu)^2}{N}
\] next, the standard deviation is
\[
\sigma = \sqrt{\sigma^2}
\] A slightly easier way to compute \(\sigma^2\) is
\[
\sigma^2 = \frac{\Sigma_{i=1}^N x_i^2}{N} - \mu^2
\]
Empirical Rule: if we think the distribution is symmetrical (“normal”) then the proportion of observations will be as in the table below, and
Chebychev: regardless of the shape of a distribution, at least 75% of all values lie within \(\pm 2 \sigma\) of the mean; for \(\mu \pm k \sigma\), \(k\) standard deviations, the proportion of observations is at least
\[
1 - \frac{1}{k^2}
\]
| distance | proportion | k | distance (data) | Chebychev |
|----------|------------|---|-----------------|-----------|
| \(\mu \pm 1 \sigma\) | 0.680 | 1 | \(378 \pm 1 \times 151\) | |
| \(\mu \pm 2 \sigma\) | 0.950 | 2 | \(378 \pm 2 \times 151\) | 0.75 |
| \(\mu \pm 3 \sigma\) | 0.997 | 3 | \(378 \pm 3 \times 151\) | 0.89 |
Coefficient of Variation: the ratio of the standard deviation to the mean expressed in percentage and denoted CV.
z-score: converts a value into standard deviation units; the number of standard deviations a value lies above or below the mean of the distribution. For the first data point, London, using \(\bar{x} = 377.8182\) and \(s = 150.5734\),
\[
z = \frac{576 - 377.8182}{150.5734} = 1.3162
\] and for the three requested cities we get \(z_{Lagos} = 0.8513\), \(z_{Riyadh} = 0.6985\), and \(z_{Bangkok} = -0.9551\).
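A final sketch to check the z-scores (the dictionary layout is ours; the data are from the table above):

```python
import statistics

# Per diem sample; statistics.stdev uses the N - 1 divisor.
expense = {
    "London": 576, "Mexico City": 240, "Tokyo": 484, "Bangalore": 199,
    "Bangkok": 234, "Riyadh": 483, "Lagos": 506, "Cape Town": 230,
    "Zurich": 508, "Paris": 483, "Guatemala City": 213,
}

mu = statistics.mean(expense.values())   # 377.8182
s = statistics.stdev(expense.values())   # about 150.57

for city in ("Lagos", "Riyadh", "Bangkok"):
    print(city, round((expense[city] - mu) / s, 4))
# Lagos 0.8513, Riyadh 0.6985, Bangkok -0.9551
```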