In this unit we will learn to:
Arrange numeric data into groups (also known as bins, classes, intervals)
Count the number of data points in each group: the frequency
Transform the absolute frequency (count) into a relative frequency and cumulative relative frequency
Plot the relative and cumulative relative frequencies on the vertical axis against the midpoint of each bin on the horizontal axis
Use the plots to describe the shape of the data and the implications of shape for decisions
For the plots we will use a special bar graph called a histogram. We could also connect the midpoints with a line to produce a polygon graph. These graphs will help us answer two questions:
What is the range of impact of the variable?
How often do values and ranges of values of the variable occur?
Imagine we run a distribution center for a major appliance manufacturer. Key indicators of our operational performance include the number of on-time in-full orders delivered, time from receipt of order to delivery, returns, and overall service level. Time, cost, and quality are the hallmarks of a well-run supply chain.
Here is a sample of 20 separate on-time in-full orders for the past month. Each observation is the number of items in the order.
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| orders | 92 | 94 | 82 | 83 | 89 | 82 | 85 | 88 | 96 | 90 | 87 | 79 | 76 | 90 | 81 | 84 | 95 | 91 | 99 | 84 |
The data is a mix of various values with an index in the top row.
Our procedure is straightforward. First, sort the data from the lowest to the highest value:
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sorted orders | 76 | 79 | 81 | 82 | 82 | 83 | 84 | 84 | 85 | 87 | 88 | 89 | 90 | 90 | 91 | 92 | 94 | 95 | 96 | 99 |
The data has also been re-indexed as shown in the top row of the table.
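The sorting and re-indexing step can be sketched in a few lines of Python. This is only an illustration; the variable names `orders` and `sorted_orders` are our own labels for the two tables above.

```python
# Raw sample of 20 order sizes, copied from the orders table.
orders = [92, 94, 82, 83, 89, 82, 85, 88, 96, 90,
          87, 79, 76, 90, 81, 84, 95, 91, 99, 84]

# sorted() returns a new list in ascending order and leaves `orders` untouched.
sorted_orders = sorted(orders)

# Re-index from 1 so the positions match the top row of the sorted table.
for index, value in enumerate(sorted_orders, start=1):
    print(index, value)
```

The smallest order now sits at index 1 and the largest at index 20, which makes the minimum and maximum easy to read off.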
Choose the number of bins (groups, intervals). Let’s choose 5 bins. The number of bins is usually odd rather than even and, in practice, typically ranges from 3 to 9.
Calculate the bin width. We let \(w\) be the width, \(n\) the number of bins and \(x\) the sorted orders in the formula below.
\[ w = \frac{max(x)-min(x)}{n} = \frac{99 - 76}{5} = 4.6 \]
Already we are binning to describe our data. We have sorted it from the lowest value \(min(x)\) to the highest value \(max(x)\), and we are beginning to group the data into \(n=5\) bins.
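As a quick check of the bin-width arithmetic, here is a minimal Python sketch. The names `n_bins`, `data_range`, and `width` are illustrative choices that stand in for \(n\), \(max(x)-min(x)\), and \(w\).

```python
# Sorted order sizes from the table above.
sorted_orders = [76, 79, 81, 82, 82, 83, 84, 84, 85, 87,
                 88, 89, 90, 90, 91, 92, 94, 95, 96, 99]

n_bins = 5  # our chosen number of bins, n

# Range of the data: max(x) - min(x) = 99 - 76 = 23
data_range = max(sorted_orders) - min(sorted_orders)

# Bin width: w = range / n
width = data_range / n_bins
print(width)
```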
Each bin, except the last, holds the orders that satisfy
\[ begin \leq orders < end \]
We never want to double count an order, hence the strict \(<\) relation at the end of the interval. The first \(begin\) value is the minimum order, \(76\), and the bin width is \(4.6\). The ending value of the first interval is then
\[ end = begin + width = 76 + 4.6 = 80.6 \]
Thus our first interval looks like this
\[ 76 \leq orders < 80.6 \]
The interval or class midpoint is the arithmetic average of the beginning and end of the interval. So for the first group
\[ midpoint = \frac{begin+end}{2} = \frac{76 + 80.6}{2} = 78.3 \]
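The same begin, end, and midpoint calculation can be repeated for every bin. The sketch below assumes the five equal-width bins chosen earlier; the list name `bins` and the rounding to one decimal place (to hide floating-point noise) are our own choices.

```python
minimum, maximum, n_bins = 76, 99, 5
width = (maximum - minimum) / n_bins  # 4.6

bins = []
for k in range(n_bins):
    begin = minimum + k * width      # left edge of bin k
    end = begin + width              # right edge of bin k
    midpoint = (begin + end) / 2     # arithmetic average of the two edges
    # Round to one decimal place for display.
    bins.append((round(begin, 1), round(end, 1), round(midpoint, 1)))

for begin, end, midpoint in bins:
    print(begin, end, midpoint)
```

The first row reproduces the worked example (76, 80.6, 78.3), and the last right edge lands on the maximum, 99.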
Because we have sorted our data, it is a simple exercise to count the number of orders in this first bin just by examining the sorted series.
How many are there?
The relationships in the last interval are very important to remember. From the very low to the high intervals the relationships are
\[ begin \leq orders < end \]
But to obey the rule that we must use all of our data (while still never double counting), the last interval, the very high one, has these relationships.
\[ begin \leq orders \leq max(orders) \]
We must always remember to use the \(\leq\) relationship to include the \(max(x)\) of our data series.
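One way to check a hand count is to apply the two interval rules directly in code. The sketch below is an illustration under the assumptions of this example (five equal-width bins); the function name `count_frequencies` is hypothetical.

```python
def count_frequencies(data, n_bins):
    """Count data points per bin: begin <= x < end, except the last bin,
    which is closed on the right so that max(data) is included."""
    low, high = min(data), max(data)
    width = (high - low) / n_bins
    # Compute all edges once so neighbouring bins share exactly the same boundary.
    edges = [low + k * width for k in range(n_bins + 1)]
    edges[-1] = high  # guard against floating-point drift at the top edge
    counts = []
    for k in range(n_bins):
        begin, end = edges[k], edges[k + 1]
        if k == n_bins - 1:
            count = sum(1 for x in data if begin <= x <= end)  # last bin: <=
        else:
            count = sum(1 for x in data if begin <= x < end)   # other bins: <
        counts.append(count)
    return counts


sorted_orders = [76, 79, 81, 82, 82, 83, 84, 84, 85, 87,
                 88, 89, 90, 90, 91, 92, 94, 95, 96, 99]
frequencies = count_frequencies(sorted_orders, 5)
print(frequencies)
print(sum(frequencies))  # should equal the number of observations
```

If the frequencies do not add up to 20, either a data point was dropped (often the maximum) or one was counted twice.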
Let’s build out the table with 5 columns: category, begin, end, midpoint, frequency. Try this on paper.
Let’s finish building the table with two more columns. After that we can plot our handiwork.
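The two extra columns, relative frequency and cumulative relative frequency, follow directly from the counts. A minimal sketch, assuming the counts obtained by applying the interval rules above to the 20 sorted orders; the variable names are illustrative.

```python
from itertools import accumulate

# Counts per bin, from applying begin <= orders < end (and <= max for the
# last bin) to the sorted orders.
frequencies = [2, 7, 3, 5, 3]
total = sum(frequencies)  # 20 observations

# Relative frequency: share of all observations that fall in each bin.
relative = [count / total for count in frequencies]

# Cumulative relative frequency: running total of counts divided by the total.
# Accumulating the integer counts first avoids floating-point drift.
cumulative = [running / total for running in accumulate(frequencies)]

for rel, cum in zip(relative, cumulative):
    print(f"{rel:.2f}  {cum:.2f}")
```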
What do we notice about the cumulative relative frequency result in the last bin?
How much of the data is high or very high in orders?
What do you get?
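To see the shape of the orders data without plotting software, a rough text histogram can stand in for the bar graph. This is only a sketch: it assumes the midpoints and counts derived above for the five bins, and draws one `#` per order.

```python
midpoints = [78.3, 82.9, 87.5, 92.1, 96.7]   # bin midpoints (horizontal axis)
frequencies = [2, 7, 3, 5, 3]                # counts per bin
total = sum(frequencies)

lines = []
for midpoint, count in zip(midpoints, frequencies):
    relative = count / total                 # height of the bar
    bar = "#" * count                        # one mark per order in the bin
    lines.append(f"{midpoint:5.1f} | {bar} {relative:.2f}")

print("\n".join(lines))
```

Each row is one bin, labelled by its midpoint, with the relative frequency printed at the end of the bar.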
If we do enough of these exercises together, we will (probably) understand what it is to describe data with empirical distributions and answer these two questions:
How much or many?
How often?
Suppose you help operate the project management office of a housing authority in the Bronx. Below is a sample of the total number of housing units in the new construction category for 2019. First, here is a listing of the variables in the database from which the Bronx sample is drawn.
Can you identify the four data types here?
## Observations: 4,103
## Variables: 41
## $ `Project ID` <dbl> 61875, 61875, 61875, 6187...
## $ `Project Name` <chr> "1199 HOUSING CORP.PLP.FY...
## $ `Project Start Date` <chr> "6/28/2019", "6/28/2019",...
## $ `Project Completion Date` <chr> NA, NA, NA, NA, NA, "6/28...
## $ `Building ID` <dbl> 413, 804652, 804825, 8048...
## $ Number <chr> "2070", "420", "2090", "2...
## $ Street <chr> "1 AVENUE", "EAST 111 STR...
## $ Borough <chr> "Manhattan", "Manhattan",...
## $ Postcode <dbl> 10029, 10029, 10029, 1002...
## $ BBL <dbl> 1017010001, 1017010001, 1...
## $ BIN <dbl> 1083953, 1083956, 1083954...
## $ `Community Board` <chr> "MN-11", "MN-11", "MN-11"...
## $ `Council District` <dbl> 8, 8, 8, 8, 24, 17, 17, 9...
## $ `Census Tract` <dbl> 162, 162, 162, 162, 1267,...
## $ `NTA - Neighborhood Tabulation Area` <chr> "MN33", "MN33", "MN33", "...
## $ Latitude <dbl> 40.79037, 40.79228, 40.79...
## $ Longitude <dbl> -73.93951, -73.93653, -73...
## $ `Latitude (Internal)` <dbl> 40.79088, 40.79088, 40.79...
## $ `Longitude (Internal)` <dbl> -73.93768, -73.93768, -73...
## $ `Building Completion Date` <chr> NA, NA, NA, NA, NA, "6/28...
## $ `Reporting Construction Type` <chr> "Preservation", "Preserva...
## $ `Extended Affordability Only` <chr> "No", "No", "No", "No", "...
## $ `Prevailing Wage Status` <chr> "Non Prevailing Wage", "N...
## $ `Extremely Low Income Units` <dbl> 74, 66, 68, 69, 75, 0, 0,...
## $ `Very Low Income Units` <dbl> 352, 310, 327, 326, 0, 3,...
## $ `Low Income Units` <dbl> 0, 0, 0, 0, 124, 12, 5, 2...
## $ `Moderate Income Units` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Middle Income Units` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Other Income Units` <dbl> 0, 0, 1, 1, 1, 0, 1, 0, 0...
## $ `Studio Units` <dbl> 39, 39, 39, 38, 90, 0, 0,...
## $ `1-BR Units` <dbl> 155, 155, 155, 156, 49, 5...
## $ `2-BR Units` <dbl> 150, 130, 135, 135, 54, 1...
## $ `3-BR Units` <dbl> 64, 38, 48, 48, 7, 0, 0, ...
## $ `4-BR Units` <dbl> 18, 14, 19, 19, 0, 0, 0, ...
## $ `5-BR Units` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `6-BR+ Units` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Unknown-BR Units` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Counted Rental Units` <dbl> 0, 0, 0, 0, 200, 15, 16, ...
## $ `Counted Homeownership Units` <dbl> 426, 376, 396, 396, 0, 0,...
## $ `All Counted Units` <dbl> 426, 376, 396, 396, 200, ...
## $ `Total Units` <dbl> 426, 376, 396, 396, 200, ...
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sorted total units | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 13 | 41 | 59 | 62 | 75 | 86 | 102 | 118 | 122 | 199 | 249 | 250 | 281 |
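The same procedure used for the orders data can be applied to this sample. The sketch below assumes 5 bins again, which is our choice and not something the data dictates; the function name `frequency_table` and the dictionary keys are hypothetical.

```python
from itertools import accumulate


def frequency_table(data, n_bins=5):
    """Build rows of begin, end, midpoint, frequency, relative, cumulative."""
    low, high = min(data), max(data)
    width = (high - low) / n_bins
    edges = [low + k * width for k in range(n_bins + 1)]
    edges[-1] = high  # keep the top edge exactly at max(data)
    counts = []
    for k in range(n_bins):
        begin, end = edges[k], edges[k + 1]
        if k == n_bins - 1:
            counts.append(sum(1 for x in data if begin <= x <= end))  # include max
        else:
            counts.append(sum(1 for x in data if begin <= x < end))
    total = len(data)
    running = list(accumulate(counts))
    return [
        {
            "begin": edges[k],
            "end": edges[k + 1],
            "midpoint": (edges[k] + edges[k + 1]) / 2,
            "frequency": counts[k],
            "relative": counts[k] / total,
            "cumulative": running[k] / total,
        }
        for k in range(n_bins)
    ]


# Sorted total units from the table above.
total_units = [1, 1, 1, 1, 1, 1, 1, 13, 41, 59,
               62, 75, 86, 102, 118, 122, 199, 249, 250, 281]

table = frequency_table(total_units)
for row in table:
    print(row)
```

Compare the resulting frequencies with those of the orders data: most of the observations here fall in the lowest bin, so the shape is quite different.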
Here is an example of a two-dimensional frequency distribution.
Apply the data distribution approach to …
TO BE CONTINUED!