Learning outcomes

In this unit you will learn to:

  1. Build a model for which you can derive a simple, but optimized, estimator of a measure of central tendency

  2. Using a frequency distribution approach, calculate additional measures of the aggregate position of data

  3. Compare and contrast various measures of tendency

By tendency we mean how data elements might aggregate, accumulate, even congregate around or near a particular data point. Elementary examples of this measure are the arithmetic mean and the median. More sophisticated measures of position include quantiles, with quartiles as a special case, and the frequency-weighted average of grouped data.

Position measures help us gain insight into trends, beliefs, and upper and lower limits of decision drivers. Because they are aggregates, they necessarily abstract from the individual data points themselves. The measures do help us understand the systematic movement of a stream of data. But they also indicate how far that movement is from any given piece of data. This distance is something we will exploit now, and even more so in the next installment.

The best we have

All of statistics is born of two simple ideas:

  • There either is or is not a systematic pattern, an aggregation, a trend in the data

  • Individual data points do not systematically deviate from this pattern or trend.

Let’s consider this sample of 5 observations of car prices at a recent auction in New Jersey:

price
12500
13350
14600
15750
17500

Suppose that as you cross the GW Bridge from the Bronx into New Jersey, you hear an advertisement on the radio proclaiming that, at the very auction you are going to, the average car price is $14,000.

Let’s let \(Y_i\) be the series of \(i=1\dots5\) prices and \(a=14000\) be the advertiser’s estimate of the average, trend, or belief in what the car price is. Here is a table of how prices deviate from the advertiser’s announcement.

price deviation
12500 -1500
13350 -650
14600 600
15750 1750
17500 3500

Of course there may be many such beliefs about the average price and thus many different possible deviations. Our job is to find the average that is best in a very particular sense. Here is our model.

\[ Y_i = m + e_i \]

We think (really, we suppose or hypothesize) that each and every price, \(Y_i\), is composed of a systematic constant price \(m\) plus an unsystematic error or deviation from the average, called here \(e_i\). This might be because we do not know enough about auctions, used cars, or whatever, to posit a systematic factor that might influence the car price \(Y_i\).

Each of the hanging error bars from the blue points to the red line represents a deviation of price from a supposed average \(a=\) $14,000. Is this the best average we can come up with? Let’s systematically think through this.

There are three usual suspects for getting at a best average in this situation. We can try to find the average \(m\) that minimizes

  1. The sum of deviations or errors: \(\Sigma_{i=1}^5(Y_i-m)\)

  2. The sum of squared deviations or errors (SSE): \(\Sigma_{i=1}^5(Y_i - m)^2\)

  3. The sum of absolute deviations: \(\Sigma_{i=1}^5|Y_i - m|\)

There are many criteria we could choose. Which one(s) might work best?
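Before choosing, it helps to see the three criteria side by side at the advertised value. Here is a minimal sketch (in Python, my choice of language, using the five prices above and \(a=14000\)) that computes each criterion. Notice that the raw sum of deviations can be driven as negative as we like simply by picking an enormous \(m\), which already hints that criteria 2 and 3 are the workable ones.

# Three candidate criteria for a "best" average m, evaluated at the advertised a = 14000
prices = [12500, 13350, 14600, 15750, 17500]
a = 14000

sum_dev = sum(y - a for y in prices)           # sum of deviations
sse     = sum((y - a) ** 2 for y in prices)    # sum of squared deviations (SSE)
sad     = sum(abs(y - a) for y in prices)      # sum of absolute deviations (SAD)

print(sum_dev, sse, sad)                       # 3700 18345000 8000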



Square those errors

This plot depicts the sum of squared deviations for a grid of potential values of what the data points deviate from, \(m\). Use of such a criterion allows us a clear and, in this case, unique calculation of the best linear estimator of the mean.

Hover over the graph and brush over the area around the red dot to zoom in. What do we see?
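If the interactive graph is not handy, here is a minimal sketch (Python; the $10 grid spacing and search range are arbitrary choices of mine) of the brute-force calculation the plot depicts: evaluate \(SSE\) on a grid of candidate values of \(m\) and keep the smallest.

# Brute force: evaluate SSE over a grid of candidate values of m
prices = [12500, 13350, 14600, 15750, 17500]

def sse(m, data):
    # sum of squared deviations of the data about a candidate center m
    return sum((y - m) ** 2 for y in data)

grid = range(12000, 18001, 10)                 # candidate m values, $10 apart
best_m = min(grid, key=lambda m: sse(m, prices))
print(best_m)                                  # 14740

The grid minimizer lands on 14740, which previews the calculus result below.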



A bit of calculus confirms the brute-force choice of the arithmetic mean that minimizes the sum of squared deviations about the mean.

First, the sum of squared errors (deviations) of the \(Y_i\) data points about a mean of \(m\) is

\[ SSE = \Sigma_{i=1}^5 (Y_i - m)^2 \]

Second, we take the first derivative of \(SSE\) with respect to \(m\), holding all else (e.g., the sum of the \(Y_i\)) constant, and set the derivative equal to zero for the first order condition for an optimum.

\[ \frac{d\,\,SSE}{dm} = -2\left(\Sigma_{i=1}^5 (Y_i - m)\right) = 0 \] Here we used the chain and power rules of differentiation.

Third, we solve for \(m\) to find

\[ m = \frac{\Sigma_{i=1}^N Y_i}{N} = \frac{73700}{5} = 14740 \]

Close enough for us? This is none other than the arithmetic mean. We will perform a very similar procedure to get the sample estimates of the y-intercept \(b_0\) and slope \(b_1\) of the relationship

\[ Y_i = b_0 + b_1 X_i + e_i \]

where the \(X_i\) data points try to explain movements in the \(Y_i\) data points.

Absolutely!

Even more interesting is the idea that we can find a middling measure that minimizes the sum of absolute deviations of data around this metric (too many \(m\)s!).

\[ SAD = \Sigma_{i=1}^5 |Y_i - m| \]

Yes, it is SAD, the sum of absolute deviations. This is our foray into rank-order statistics, quite different in nature from the arithmetic mean of \(SSE\) fame. We get down to basic counting when we try to find the \(m\) that minimizes SAD. To illustrate this, suppose our data are all positive (ratio data, in fact). If \(m=5\), then the function

\[ f(Y;m) = |Y-m| \] has this appearance, the so-called check function.

Intuitively, half the graph seems to be to the left of \(m=5\), and the other half to the right. Let’s look at the first derivative of the check function with respect to \(Y\), the slope we can read right off the graph. Notice that the (eyeballed) rise over run, i.e., slope, before \(m=5\) is -1, and after it is +1. At \(m=5\) there is no slope that is even meaningful.

We have two cases to consider. First \(Y\) can be less than or equal to \(m\) so that \(Y-m \leq 0\). In this case

\[ \frac{d\,|Y-m|}{dY} = -1 \]

This corresponds exactly to the negatively sloped line rolling into our supposed \(m=5\) in the plot.

Second, \(Y\) can be greater than or equal to \(m\) so that \(Y-m \geq 0\). In this case

\[ \frac{d\,|Y-m|}{dY} = +1 \]

also corresponding to the positively sloped portion of the graph.

Another graph is in order to imagine this derivative.

It’s all or nothing for the derivative, a classic step function. We use this fact in the following (near) finale in our search for \(m\). Back to \(SAD\).

We are looking for the \(m\) that minimizes \(SAD\):

\[ SAD = \Sigma_{i=1}^N |Y_i - m| = |Y_1-m| + \ldots + |Y_N-m| \]

If we take the derivative of \(SAD\) with respect to \(m\), each term \(|Y_i-m|\) contributes either a \(+1\) or a \(-1\) to the sum. Each term depends only on the difference \(Y_i - m\), so its slope with respect to \(m\) is just the mirror image of the slope with respect to \(Y\) we graphed above: \(+1\) when \(Y_i\) is at or below \(m\), and \(-1\) when \(Y_i\) is at or above \(m\). We just don’t know which case applies to each data point, so we need to consider both cases at once. We also don’t know offhand how many data points are to the left or the right of the value of \(m\) that minimizes \(SAD\)!

Let’s play a little roulette and let \(L\) be the number of (unknown) points to the left of \(m\) and \(R\) the number of points to the right. Then \(SAD\) looks like it is split into two terms, just like the two intervals leading up to and away from the red dot at the bottom of the check function.

\[ SAD = \Sigma_{i=1}^L |Y_i - m| + \Sigma_{i=1}^R |Y_i - m| = (|Y_1-m| + \ldots + |Y_L-m|) + (|Y_1-m| + \ldots + |Y_R-m|) \]

\[ \frac{d\,\,SAD}{dm} = \Sigma_{i=1}^L (+1) + \Sigma_{i=1}^R (-1) = (+1)L + (-1)R \]

When we set this result to zero for the first order condition for an optimum we get a possibly strange, but appropriate result. The tradeoff between left and right must offset one another exactly.

\[ (+1)L + (-1)R = 0 \] \[ L = R \] Whatever number of points lies to the left of \(m\) must also lie to the right. If the \(L\) points include \(m\) itself, then \(L/N\geq1/2\); likewise, if the \(R\) points include \(m\), then \(R/N\geq1/2\).

We have arrived at what a median is.
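As a numerical check (a sketch in Python; scanning every whole dollar in the data range is my arbitrary choice), we can confirm that \(SAD\) bottoms out at the middle observation of our five prices.

# Scan candidate centers and find the one that minimizes SAD
prices = [12500, 13350, 14600, 15750, 17500]

def sad(m, data):
    # sum of absolute deviations of the data about a candidate center m
    return sum(abs(y - m) for y in data)

grid = range(min(prices), max(prices) + 1)     # every whole dollar in the data range
best_m = min(grid, key=lambda m: sad(m, prices))
print(best_m, sad(best_m, prices))             # 14600 7400

The minimizer is $14,600, the third of the five ordered prices, with half of the remaining data on each side.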

Now we can come up with a precise statement of the middle of a data series, the notorious median. We let \(P()\) be the proportion of data points at or below a value \(m\) (\(Y \leq m\)) or at or above it (\(Y \geq m\)).

The median, \(m\), is the first data point in the ordered series that satisfies both

  • \(P(Y \leq m) \geq 1/2\) (from minimum data point) and

  • \(P(Y \geq m) \geq 1/2\) (from the maximum data point)

That definition will work for us whether each data point is equally likely (\(1/N\)) or comes from grouped data with symmetric or skewed relative frequency distributions.

Two cases arise:

  1. Even number of data points. If \(N=10\), both data point number 5 and data point number 6 satisfy the two conditions. Yes, the value \(m\) is taken halfway between data point number 5 and data point number 6.

  2. Odd number of data points. If \(N=9\) data points, the only way that can happen is if there are 5 points to the left of \(m\), including \(m\), and 5 points to the right, also including \(m\). Yes, the value \(m\) is data point number 5.

Thus the complexities of order statistics obliterate calm composure.

So what about our odd number of price data points?



The movement up from the minimum and down from the maximum price agreed on one data point. That will always happen for data sets with an odd number of data points. What if there is an even number of data points? Add a price of $18,000 and let’s see.
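Here is a hedged sketch (Python; the helper name median_by_counting is my own, not part of any standard library) that applies the counting definition directly and averages the two qualifying points when \(N\) is even.

# Median from the counting definition: the point(s) with at least half the
# data at or below and at least half at or above
def median_by_counting(data):
    y = sorted(data)
    n = len(y)
    qualifiers = [v for i, v in enumerate(y)
                  if (i + 1) / n >= 0.5 and (n - i) / n >= 0.5]
    # odd n: exactly one qualifier; even n: two qualifiers, so take their midpoint
    return sum(qualifiers) / len(qualifiers)

odd_prices  = [12500, 13350, 14600, 15750, 17500]
even_prices = odd_prices + [18000]
print(median_by_counting(odd_prices))          # 14600.0
print(median_by_counting(even_prices))         # 15175.0 = (14600 + 15750) / 2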


Quantiles anyone?

We can use the same approach to find the data point that corresponds to the 0.25 quantile (25th percentile), otherwise known as the first quartile \(Q1\). Instead of requiring 1/2 of the data on each side as with the median, we require 1/4 of the data to the left of (and including) the \(Q1\) we search for, which means \(1-1/4=3/4\) of the data lies to the right of (and including) the quartile.

Does this work? (It had better!)
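It does, at least under the counting definition. A minimal sketch (Python; my assumption is that we keep working with the six prices, the original five plus the hypothetical $18,000):

# First quartile Q1 by counting: at least 1/4 of the data at or below Q1
# and at least 3/4 at or above it
prices = sorted([12500, 13350, 14600, 15750, 17500, 18000])
n = len(prices)

q1 = next(v for i, v in enumerate(prices)
          if (i + 1) / n >= 0.25 and (n - i) / n >= 0.75)
print(q1)                                      # 13350

The first price with at least 1/4 of the data at or below it and at least 3/4 at or above it is $13,350.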



Now let’s try something really daunting. Suppose that the 6 prices occur with relative frequencies (in ascending order) of 0.1, 0.4, 0.2, 0.1, 0.1, and 0.1. What happens now? Will our approach still work? Let’s this time find the 0.40 quantile so that at least 40% of the data is at or below this point and at least 60% of the data is at or above this point.
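A sketch of the same counting logic with unequal weights (Python; I am pairing the given relative frequencies with the six prices in ascending order, which is how I read the setup): accumulate relative frequency from the bottom, track the share at or above each price, and report the first price that clears both thresholds.

# 0.40 quantile of grouped data: the first price where the cumulative relative
# frequency from below reaches 0.40 while the share at or above still reaches 0.60
prices  = [12500, 13350, 14600, 15750, 17500, 18000]   # ascending order
relfreq = [0.1, 0.4, 0.2, 0.1, 0.1, 0.1]               # sums to 1.0

p = 0.40
below = 0.0
for price, f in zip(prices, relfreq):
    below += f                       # P(Y <= price)
    above = 1.0 - (below - f)        # P(Y >= price)
    if below >= p and above >= 1 - p:
        print(price)                 # 13350
        break

With these weights the cumulative share jumps from 0.1 to 0.5 at the second price, so the 0.40 quantile lands on $13,350: half the weight sits at or below it and 0.9 of the weight sits at or above it.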



We have entered the world of [singular statistics](http://watanabe-www.math.dis.titech.ac.jp/users/swatanab/e-manga.html), where we are otherwise sucked into a vortex. An example of that vortex is the check function we just used. It is a statistical singularity like the physical black holes in the cosmos. It turns out that much of the statistics done in practice (right and well) uses the thinking and techniques behind our determination of the singular median.

Procedures

Let’s summarize some basic procedures for calculating position and tendency (central or otherwise):

  • Mean: arithmetic mean and weighted mean (or average as some like to call it)

  • Median: another quartile? (of course)

  • Mode: good for nominal data

  • Quantile: the percentile, and a special subset of quantiles, the quartiles, which include the median too

Mean

If \(Y_i\) is all of the data possible in the universe (population) indexed by \(i = 1 \dots N\) with \(N\) elements, then the arithmetic mean is the well-known (and just derived through calculus) formula:

\[ \mu = \frac{\Sigma_{i=1}^N Y_i}{N} \]

If \(Y_i\) is a sample (subset) indexed by \(i = 1 \dots N\) with \(N\) elements from the population, then (the same formula!)

\[ \bar{Y} = m = \frac{\Sigma_{i=1}^N Y_i}{N} \]

We use the bar over the \(Y\) to indicate a sample mean.

The arithmetic mean assumes that all the observations \(Y_i\) are equally important. Why?



Let \(f_i\) be the frequency (count) of each of the \(k\) distinct observed values (the data could be grouped into bins as well), with \(N = \Sigma_{i=1}^k f_i\) total observations. Then the weighted mean (or average) is

\[ m = \Sigma_{i=1}^k\left(\frac{f_i}{N}\right)Y_i \]

Here \(f_i/N\) is the relative frequency of observation \(Y_i\).

Aren’t the arithmetic mean and weighted mean really equivalent?
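They are, once the weights are read as relative frequencies. A minimal sketch (Python; the repeated-value sample is a made-up illustration, not data from this unit) computes the mean both ways.

from collections import Counter

# A made-up sample with repeated values
y = [2, 2, 3, 3, 3, 5, 7, 7]
n = len(y)

arithmetic_mean = sum(y) / n

# Weighted mean: relative frequency of each distinct value times the value
counts = Counter(y)                                     # {2: 2, 3: 3, 5: 1, 7: 2}
weighted_mean = sum((f / n) * value for value, f in counts.items())

print(arithmetic_mean, weighted_mean)                   # 4.0 4.0

When every observation appears once, each \(f_i/N\) collapses to \(1/N\) and the two formulas are literally the same.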




Median

The middle of the data. We can use the Percentile method below with \(P = 50\).

Mode

The most frequently occurring value in the data.

Percentile and quantile

How to compute?

  1. Organize the numbers into an ascending-order array.

  2. Calculate the percentile location \(i\)

\[ i = \frac{P}{100}N \] where

\(P\) = the percentile of interest

\(i\) = percentile location

\(N\) = number of elements in the data set

  3. Determine the \(P\)th percentile using either (a) or (b).

     a. If \(i\) is a whole number, the \(P\)th percentile is the average of the value at the \(i\)th location and the value at the \((i + 1)\)st location.

     b. If \(i\) is not a whole number, the \(P\)th percentile value is located at the whole number part of \(i + 1\).

For example, suppose you want to determine the 80th percentile of 1240 numbers.

  • \(P\) is 80 and \(N\) is 1240.
  1. Order the numbers from lowest to highest.

  2. Calculate the location of the 80th percentile.

\[ i = \frac{80}{100}(1240) = 992 \]

Because \(i = 992\) is a whole number, follow the directions in step 3a. The 80th percentile is the average of the 992nd number and the 993rd number.
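Here is a hedged sketch of the location method just described (Python; the function name percentile_location_method is mine). Note that library routines such as numpy.percentile interpolate by default, so their answers can differ slightly from this rule.

import math

def percentile_location_method(data, p):
    # P-th percentile using the location rule i = (P/100) * N
    y = sorted(data)
    n = len(y)
    i = (p / 100) * n
    if i == int(i):
        k = int(i)                   # whole number: average the i-th and (i+1)-st values
        return (y[k - 1] + y[k]) / 2
    k = math.floor(i + 1)            # otherwise: the whole-number part of i + 1
    return y[k - 1]

print(percentile_location_method(range(1, 1241), 80))   # 992.5

Running it on the whole numbers 1 through 1240 returns 992.5, the average of the 992nd and 993rd ordered values, as step 3a prescribes.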

Always problems

  1. Determine the arithmetic mean, median, and the mode for the following numbers.

2,4,8,4,6,2,7,8,4,3,8,9,4,3,5



  1. The 2018 list of the 15 largest banks in the world by assets is shown in the table below.
  • Compute the median and the mean assets from this group.

  • Which of these two measures do you think is most appropriate for summarizing these data and why?

  • What are the values of Q2 and Q3?

  • Determine the 63rd percentile for the data.

  • Determine the 29th percentile for the data.

  • Build an error bar graph around the 75th percentile (also the third quartile).

bank total.assets (USD billions)
Industrial & Commercial Bank of China (ICBC) 3452
China Construction Bank Corp.  2819
Agricultural Bank of China 2716
HSBC Holdings 2670
JPMorgan Chase & Co.  2600
Bank of China 2584
BNP Paribas 2527
Mitsubishi UFJ Financial Group 2337
Credit Agricole Group 2144
Barclays PLC 2114
Bank of America 2105
Deutsche Bank 2078
Citigroup, Inc.  1843
Japan Post Bank 1736
Wells Fargo 1687



Visualizing with Tukey’s box

In 1977 John Tukey introduced the box-and-whisker plot or, if you want to practice your French, the boîte à moustaches. Like a bullet graph, the box plot visualizes several aspects of data using a box. Here we imagine a vertical rectangle:

  • The 75th percentile is on the top of the box

  • The 25th percentile is on the bottom of the box

  • The 50th percentile is in the middle of the box somewhere

  • The minimum and maximum data points (or, in Tukey’s refinement, the most extreme points that are not outliers) sit at the ends of lines that extend from the top and bottom of the box, called whiskers (c’est-à-dire, moustaches); outliers, if any, are plotted individually beyond the whiskers
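A minimal sketch (Python with matplotlib and numpy, both my assumptions; I reuse the six auction prices rather than the bank data so the exercise below is left to you) that computes the box’s ingredients and lets the plotting library draw the whiskers. numpy’s default percentile interpolation will not exactly match the counting rule above, which is fine for a picture.

import matplotlib.pyplot as plt
import numpy as np

prices = [12500, 13350, 14600, 15750, 17500, 18000]

# The ingredients of the box: the 25th, 50th, and 75th percentiles
q1, q2, q3 = np.percentile(prices, [25, 50, 75])
print(q1, q2, q3)

# Tukey's plot: box from Q1 to Q3, a line at the median, whiskers out to the
# most extreme points within 1.5 * IQR, and any outliers drawn as individual dots
plt.boxplot(prices)
plt.ylabel("auction price ($)")
plt.title("Box-and-whisker plot of the car auction prices")
plt.show()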

Draw the box and whiskers plot for the bank asset data.



What have we gotten to so far?

We seem to have hit all of the learning outcomes:

  1. We did build a (very naive) model of a variable, car auction price, and recognized a systematic pattern (average) and an unsystematic residue (error or deviation from the average) in the data with the model (average).

  2. We did use a frequency distribution approach throughout. Where? We interpreted the arithmetic average as a weighted average where the weights (frequencies or counts) are all equal. We also built out percentiles, which amount to reading points off the ogive (cumulative relative frequency) curve.

  3. Did we compare and contrast? A bit. But we need to do more. The error bar chart is a step in the right direction. It indicates thresholds for clustering data at the very least. That will jump out in the next installment: deviations. The box plot is another device for viewing the relative position statistics. It will pop up again in the next installment as well.