The shape of data

Learning outcomes

In this unit we will learn to:

Arrange numeric data into groups (also known as bins, classes, intervals)
Count the number of data points in each group: the frequency
Transform the absolute frequency (count) into a relative frequency and cumulative relative frequency
Plot the relative and cumulative relative frequencies on the vertical axis against the midpoint of each bin on the horizontal axis
Use the plots to describe the shape of the data and the implications of shape for decisions

For the plots we will use a special bar graph called a histogram. We could also connect the midpoints with a line to produce a polygon graph. These graphs will help us answer two questions:

What is the range of impact of the variable?
How often do values and ranges of values of the variable occur?

How well did we do?

Imagine we run a distribution center for a major appliance manufacturer. Key indicators of our operational performance include the number of on-time in-full orders delivered, time from receipt of order to delivery, returns, and overall service level. Time, cost, and quality are the hallmarks of a well-run supply chain.

Here is a sample of 20 separate on-time in-full orders for the past month. Each observation is the number of items in the order.

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
orders	92	94	82	83	89	82	85	88	96	90	87	79	76	90	81	84	95	91	99	84

The data is a mix of various values with an index in the top row.

Our procedure is straightforward:

Arrange the data from lowest to highest values

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
sorted orders	76	79	81	82	82	83	84	84	85	87	88	89	90	90	91	92	94	95	96	99

The data has also been re-indexed as shown in the top row of the table.
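The sorting step can be sketched in a few lines of Python. This is only an illustration: the variable names `orders` and `sorted_orders` are our own, and the values are copied from the tables above.

```python
# Sample of 20 order sizes, copied from the unsorted table above.
orders = [92, 94, 82, 83, 89, 82, 85, 88, 96, 90,
          87, 79, 76, 90, 81, 84, 95, 91, 99, 84]

# sorted() returns a new list in ascending order and leaves `orders` unchanged.
sorted_orders = sorted(orders)

# Re-index from 1 so the output matches the top row of the sorted table.
for index, value in enumerate(sorted_orders, start=1):
    print(index, value)
```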

Choose the number of bins (groups, intervals). Let’s choose 5 bins. The number of bins is usually odd rather than even, and in practice it typically ranges from 3 to 9.
Calculate the bin width. We let \(w\) be the width, \(n\) the number of bins and \(x\) the sorted orders in the formula below.

\[ w = \frac{max(x)-min(x)}{n} = \frac{99 - 76}{5} = 4.6 \]

Already we are binning to describe our data. We have sorted it from the lowest value \(min(x)\) to the highest value \(max(x)\), and we are beginning to group it into \(n=5\) bins.
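As a quick check of the arithmetic, here is a minimal Python sketch of the bin-width formula. The names `n_bins` and `width` are our own stand-ins for \(n\) and \(w\).

```python
# Sorted order sizes from the table above.
sorted_orders = [76, 79, 81, 82, 82, 83, 84, 84, 85, 87,
                 88, 89, 90, 90, 91, 92, 94, 95, 96, 99]

n_bins = 5  # n in the formula

# w = (max(x) - min(x)) / n
width = (max(sorted_orders) - min(sorted_orders)) / n_bins

print(round(width, 1))  # 4.6
```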

In a vertical table arrange the bins from lowest interval to highest. Initially use 5 columns as in the table below. The intervals will each have a beginning and ending value such that groups of orders will lie in non-overlapping intervals of bin-width.

\[ begin \leq orders < end \]

We never want to double count an order, hence the strict \(<\) relation at the end of each interval. The first \(begin\) value is the minimum order \(76\). The bin-width is \(4.6\). The ending value of the first interval is then

\[ end = begin + width = 76 + 4.6 = 80.6 \]

Thus our first interval looks like this

\[ 76 \leq orders < 80.6 \]

The interval or class midpoint is the arithmetic average of the interval from beginning to end. So for the first group

\[ midpoint = \frac{begin+end}{2} = \frac{76 + 80.6}{2} = 78.3 \]
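The same begin, end, and midpoint calculation can be repeated for every bin. The sketch below is one way to do it in Python; the names are our own, and the values are rounded to one decimal place only for display.

```python
low, high, n_bins = 76, 99, 5
width = (high - low) / n_bins  # 4.6

bins = []
for k in range(n_bins):
    begin = low + k * width          # each bin starts where the previous one ended
    end = begin + width              # end = begin + width
    midpoint = (begin + end) / 2     # arithmetic average of begin and end
    bins.append((round(begin, 1), round(end, 1), round(midpoint, 1)))

for begin, end, midpoint in bins:
    print(begin, end, midpoint)
# The first line printed is: 76.0 80.6 78.3
```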

Because we have sorted our data, it is a simple exercise to count the number of orders in this first bin just by examining the sorted series.

How many are there?

The relationships in the last interval are very important to remember. From the very low to the high intervals the relationships are

\[ begin \leq orders < end \]

But to obey the rule that we must use all of our data (and never double count), the last, very high interval has this relationship instead.

\[ begin \leq orders \leq max(orders) \]

We must always remember to use the \(\leq\) relationship to include the \(max(x)\) of our data series.

Let’s build out the table with 5 columns: category, begin, end, midpoint, frequency. Try this on paper.

category	begin	end	midpoint	frequency
very low	76.0	80.6	78.3	2
low	80.6	85.2	82.9	7
medium	85.2	89.8	87.5	3
high	89.8	94.4	92.1	5
very high	94.4	99.0	96.7	3
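To check the frequency column, the counting rule can be sketched in Python. This is an illustrative sketch, not a prescribed method: the function name `count_in_bins` is our own. Every bin uses \(begin \leq orders < end\), except the last, which also accepts the maximum.

```python
def count_in_bins(values, n_bins):
    """Count values per bin; the last bin is closed on the right."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    # Compute the edges once so neighboring bins share exactly the same boundary.
    edges = [low + k * width for k in range(n_bins + 1)]
    counts = []
    for k in range(n_bins):
        begin, end = edges[k], edges[k + 1]
        is_last = k == n_bins - 1
        # begin <= v < end, relaxed to begin <= v <= max(values) in the last bin
        counts.append(sum(1 for v in values if begin <= v and (v < end or is_last)))
    return counts


sorted_orders = [76, 79, 81, 82, 82, 83, 84, 84, 85, 87,
                 88, 89, 90, 90, 91, 92, 94, 95, 96, 99]
frequencies = count_in_bins(sorted_orders, 5)
print(frequencies)  # [2, 7, 3, 5, 3]
```

Because each value satisfies the condition for exactly one bin, the counts add up to the sample size of 20.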

A few more steps and we will have a table of derived metrics to help answer our questions about the distribution center. In another column we calculate for each interval the relative frequency: the bin’s frequency count as a percentage of the total count (20) of the sample. The first interval holds 2 very low observations, so its relative frequency is 2/20 or 10% of the sample. We continue in the same way with the rest of the bins.

Next we calculate the cumulative relative frequency: the running sum of the relative frequencies of all classes up to and including the current class. If the relative frequency of the second class is 7/20 or 35%, then the cumulative relative frequency across both the first (10%) and the second (35%) intervals is 45%. This means that 45% of the sample falls in the very low and low classes of order size.

Let’s finish building the table with two more columns. After that we can plot our handiwork.

What do we notice about the cumulative relative frequency result in the last bin?
How much of the data is high or very high in orders?

category	begin	end	midpoint	frequency	relative (%)	cumulative (%)
very low	76.0	80.6	78.3	2	10	10
low	80.6	85.2	82.9	7	35	45
medium	85.2	89.8	87.5	3	15	60
high	89.8	94.4	92.1	5	25	85
very high	94.4	99.0	96.7	3	15	100
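The two derived columns follow directly from the frequency counts. Here is a minimal Python sketch, with variable names of our own choosing, that also answers the question about high and very high orders.

```python
from itertools import accumulate

categories = ["very low", "low", "medium", "high", "very high"]
frequencies = [2, 7, 3, 5, 3]   # frequency column of the table
total = sum(frequencies)        # 20 observations in the sample

# Relative frequency: each count as a percentage of the total.
relative = [100 * f / total for f in frequencies]

# Cumulative relative frequency: running sum of the relative frequencies.
cumulative = list(accumulate(relative))

for name, rel, cum in zip(categories, relative, cumulative):
    print(name, rel, cum)

# Share of the sample in the high and very high classes.
high_share = relative[3] + relative[4]
print(high_share)  # 40.0
```

The last cumulative value is 100 because every observation has been counted exactly once by the time we reach the final bin.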

Let’s plot our table. Draw a box. The left vertical side is the relative frequency axis in percent, and the right vertical side is a secondary axis for the cumulative relative frequency in percent; label both. The bottom horizontal side carries the midpoints of the 5 bins; label it with both the midpoints and the categories for visual clarity. Plot relative frequency against the midpoints as a bar chart and cumulative relative frequency against the midpoints as a line plot.

What do you get?
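For a rough preview of the shape without drawing anything, the table can be rendered as a text chart. The sketch below is only an approximation of the plot described above. It turns the bars on their side, uses one `#` per 5 percentage points (our own choice of scale), and lists the cumulative percentage beside each bar instead of drawing a line.

```python
categories = ["very low", "low", "medium", "high", "very high"]
midpoints = [78.3, 82.9, 87.5, 92.1, 96.7]
relative = [10, 35, 15, 25, 15]       # percent, from the table
cumulative = [10, 45, 60, 85, 100]    # percent, from the table

lines = []
for name, mid, rel, cum in zip(categories, midpoints, relative, cumulative):
    bar = "#" * (rel // 5)  # one '#' per 5 percentage points
    lines.append(f"{name:<10}{mid:>6.1f} | {bar:<8}{rel:>3}%  cum {cum:>3}%")

print("\n".join(lines))
```

The tallest bar sits in the low class, and the cumulative column climbs to 100% at the very high class.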

Practice, practice, practice …

If we do enough of these exercises together, we will (probably) understand what it is to describe data with empirical distributions and answer these two questions:

How much or many?
How often?

Suppose you help operate the project management office of a housing authority in the Bronx. Below is a listing of the variables in the database from which a sample for the Bronx is drawn. It is followed by a sorted sample of 20 values of the total number of housing units under the category of new construction for 2019.

Can you identify the four data types here?

## Observations: 4,103
## Variables: 41
## $ `Project ID`                         <dbl> 61875, 61875, 61875, 6187...
## $ `Project Name`                       <chr> "1199 HOUSING CORP.PLP.FY...
## $ `Project Start Date`                 <chr> "6/28/2019", "6/28/2019",...
## $ `Project Completion Date`            <chr> NA, NA, NA, NA, NA, "6/28...
## $ `Building ID`                        <dbl> 413, 804652, 804825, 8048...
## $ Number                               <chr> "2070", "420", "2090", "2...
## $ Street                               <chr> "1 AVENUE", "EAST 111 STR...
## $ Borough                              <chr> "Manhattan", "Manhattan",...
## $ Postcode                             <dbl> 10029, 10029, 10029, 1002...
## $ BBL                                  <dbl> 1017010001, 1017010001, 1...
## $ BIN                                  <dbl> 1083953, 1083956, 1083954...
## $ `Community Board`                    <chr> "MN-11", "MN-11", "MN-11"...
## $ `Council District`                   <dbl> 8, 8, 8, 8, 24, 17, 17, 9...
## $ `Census Tract`                       <dbl> 162, 162, 162, 162, 1267,...
## $ `NTA - Neighborhood Tabulation Area` <chr> "MN33", "MN33", "MN33", "...
## $ Latitude                             <dbl> 40.79037, 40.79228, 40.79...
## $ Longitude                            <dbl> -73.93951, -73.93653, -73...
## $ `Latitude (Internal)`                <dbl> 40.79088, 40.79088, 40.79...
## $ `Longitude (Internal)`               <dbl> -73.93768, -73.93768, -73...
## $ `Building Completion Date`           <chr> NA, NA, NA, NA, NA, "6/28...
## $ `Reporting Construction Type`        <chr> "Preservation", "Preserva...
## $ `Extended Affordability Only`        <chr> "No", "No", "No", "No", "...
## $ `Prevailing Wage Status`             <chr> "Non Prevailing Wage", "N...
## $ `Extremely Low Income Units`         <dbl> 74, 66, 68, 69, 75, 0, 0,...
## $ `Very Low Income Units`              <dbl> 352, 310, 327, 326, 0, 3,...
## $ `Low Income Units`                   <dbl> 0, 0, 0, 0, 124, 12, 5, 2...
## $ `Moderate Income Units`              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Middle Income Units`                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Other Income Units`                 <dbl> 0, 0, 1, 1, 1, 0, 1, 0, 0...
## $ `Studio Units`                       <dbl> 39, 39, 39, 38, 90, 0, 0,...
## $ `1-BR Units`                         <dbl> 155, 155, 155, 156, 49, 5...
## $ `2-BR Units`                         <dbl> 150, 130, 135, 135, 54, 1...
## $ `3-BR Units`                         <dbl> 64, 38, 48, 48, 7, 0, 0, ...
## $ `4-BR Units`                         <dbl> 18, 14, 19, 19, 0, 0, 0, ...
## $ `5-BR Units`                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `6-BR+ Units`                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Unknown-BR Units`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Counted Rental Units`               <dbl> 0, 0, 0, 0, 200, 15, 16, ...
## $ `Counted Homeownership Units`        <dbl> 426, 376, 396, 396, 0, 0,...
## $ `All Counted Units`                  <dbl> 426, 376, 396, 396, 200, ...
## $ `Total Units`                        <dbl> 426, 376, 396, 396, 200, ...

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
sorted total units	1	1	1	1	1	1	1	13	41	59	62	75	86	102	118	122	199	249	250	281

Build a frequency table with 5 bins and categories from very low to very high. Include the beginning, ending and midpoint of intervals, the frequency, relative frequency, and cumulative relative frequency of each class interval.

category	begin	end	midpoint	frequency	relative (%)	cumulative (%)
very low	1	57	29	9	45	45
low	57	113	85	5	25	70
medium	113	169	141	2	10	80
high	169	225	197	1	5	85
very high	225	281	253	3	15	100
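The whole procedure can be collected into one reusable function and applied to the housing sample. This is a sketch under our own naming (`frequency_table`, `units`), not a reference implementation. It follows the same rules as before: equal-width bins, half-open intervals, and a last bin that includes the maximum.

```python
def frequency_table(values, n_bins=5):
    """Return rows of (begin, end, midpoint, frequency, relative %, cumulative %)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    edges = [low + k * width for k in range(n_bins + 1)]
    total = len(values)
    rows, running = [], 0.0
    for k in range(n_bins):
        begin, end = edges[k], edges[k + 1]
        is_last = k == n_bins - 1
        # begin <= v < end, except the last bin, which also includes the maximum
        freq = sum(1 for v in values if begin <= v and (v < end or is_last))
        rel = 100 * freq / total
        running += rel
        rows.append((begin, end, (begin + end) / 2, freq, rel, running))
    return rows


# Sorted total units, copied from the sample above.
units = [1, 1, 1, 1, 1, 1, 1, 13, 41, 59,
         62, 75, 86, 102, 118, 122, 199, 249, 250, 281]

categories = ["very low", "low", "medium", "high", "very high"]
rows = frequency_table(units, n_bins=5)
for name, row in zip(categories, rows):
    print(name, *row)
```

Here the bin width is (281 - 1) / 5 = 56, and the frequencies come out as 9, 5, 2, 1, and 3. Nearly half of the sample sits in the very low class.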

Further visualize the table with a frequency histogram and cumulative frequency line plot.

Two dimensions?

Here is an example of a two-dimensional frequency distribution.

Try this

Apply the data distribution approach to …

TO BE CONTINUED!