Learning outcomes

In this unit we will learn to:

  1. Arrange numeric data into groups (also known as bins, classes, intervals)

  2. Count the number of data points in each group: the frequency

  3. Transform the absolute frequency (count) into a relative frequency and cumulative relative frequency

  4. Plot the relative and cumulative relative frequencies (vertical axis) against the midpoint of each bin (horizontal axis)

  5. Use the plots to describe the shape of the data and the implications of shape for decisions

For the plots we will use a special bar graph called a histogram. We could also connect the midpoints with a line to produce a polygon graph. These graphs will help us answer two questions:

  1. What is the range of impact of the variable?

  2. How often do values and ranges of values of the variable occur?

How well did we do?

Imagine we run a distribution center for a major appliance manufacturer. Key indicators of our operational performance include the number of on-time in-full orders delivered, time from receipt of order to delivery, returns, and overall service level. Time, cost, and quality are the hallmarks of a well-run supply chain.

Here is a sample of 20 separate on-time in-full orders for the past month. Each observation is the number of items in the order.

index   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
orders 92 94 82 83 89 82 85 88 96 90 87 79 76 90 81 84 95 91 99 84

The data is a mix of various values with an index in the top row.

Our procedure is straightforward:

  1. Arrange the data from lowest to highest values
index          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
sorted orders 76 79 81 82 82 83 84 84 85 87 88 89 90 90 91 92 94 95 96 99

The data has also been re-indexed as shown in the top row of the table.

  2. Choose the number of bins (groups, intervals). Let’s choose 5 bins. Usually the number of bins is odd rather than even, and in practice it typically ranges from 3 to 9.

  3. Calculate the bin width. We let \(w\) be the width, \(n\) the number of bins, and \(x\) the sorted orders in the formula below.

\[ w = \frac{max(x)-min(x)}{n} = \frac{99 - 76}{5} = 4.6 \]

Already we are binning to describe our data. We have sorted it from the lowest value \(min(x)\) to the highest value \(max(x)\), and we are beginning to group it into \(n=5\) bins.
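The first three steps can be checked with a few lines of code. This is a minimal Python sketch using only the sample above; the variable names (`orders`, `sorted_orders`, `n_bins`, `width`) are our own choices for illustration.

```python
# Order sizes from the sample of 20 on-time in-full orders.
orders = [92, 94, 82, 83, 89, 82, 85, 88, 96, 90,
          87, 79, 76, 90, 81, 84, 95, 91, 99, 84]

sorted_orders = sorted(orders)  # step 1: arrange from lowest to highest
n_bins = 5                      # step 2: the chosen number of bins

# Step 3: bin width w = (max(x) - min(x)) / n
width = (max(sorted_orders) - min(sorted_orders)) / n_bins

print(sorted_orders)
print(f"min = {sorted_orders[0]}, max = {sorted_orders[-1]}, width = {width:.1f}")
```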

  4. In a vertical table arrange the bins from the lowest interval to the highest. Initially use 5 columns: category, begin, end, midpoint, and frequency. Each interval has a beginning and an ending value, so that groups of orders lie in non-overlapping intervals one bin width wide.

\[ begin \leq orders < end \]

We never want to double count the orders in a bin, thus the \(<\) relation for the end. The first \(begin\) value is the minimum order \(76\). The bin width is \(4.6\). The ending value of the first interval is then

\[ end = begin + width = 76 + 4.6 = 80.6 \]

Thus our first interval looks like this

\[ 76 \leq orders < 80.6 \]

The interval or class midpoint is the arithmetic average of the interval's beginning and ending values. So for the first group

\[ midpoint = \frac{begin+end}{2} = \frac{76 + 80.6}{2} = 78.3 \]
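The same arithmetic extends to every bin. The sketch below, in Python, repeats the begin, end, and midpoint calculations for all five intervals. Pinning the last ending value to the maximum (rather than adding the width five times) is our own precaution against floating-point rounding, not something the procedure requires.

```python
x_min, x_max, n_bins = 76, 99, 5
width = (x_max - x_min) / n_bins  # 4.6

intervals = []
for i in range(n_bins):
    begin = x_min + i * width
    # The last interval ends exactly at the maximum of the data.
    end = x_max if i == n_bins - 1 else x_min + (i + 1) * width
    midpoint = (begin + end) / 2
    # Round to one decimal place for display, matching the text.
    intervals.append((round(begin, 1), round(end, 1), round(midpoint, 1)))

for begin, end, midpoint in intervals:
    print(f"begin = {begin:5.1f}  end = {end:5.1f}  midpoint = {midpoint:5.1f}")
```

The first row reproduces the interval from 76 to 80.6 with midpoint 78.3 worked out above.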

Because we have sorted our data, it is a simple exercise to count the number of orders in this first bin just by examining the sorted series.


How many are there?



The relationships in the last interval are very important to remember. From the very low to the high intervals the relationships are

\[ begin \leq orders < end \]

But to obey the rubric that we must use all of our data (while still remembering never to double count), the last, very high interval has this relationship instead.

\[ begin \leq orders \leq max(orders) \]

We must always remember to use the \(\leq\) relationship to include the \(max(x)\) of our data series.

Let’s build out the table with 5 columns: category, begin, end, midpoint, frequency. Try this on paper.
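Once the paper version is done, it can be compared with a short program. The following Python sketch applies the two interval rules (half-open for the first four bins, closed for the last) to the sorted orders. The label "medium" for the middle bin is an assumption on our part; the text names only the very low, low, high, and very high categories.

```python
sorted_orders = [76, 79, 81, 82, 82, 83, 84, 84, 85, 87,
                 88, 89, 90, 90, 91, 92, 94, 95, 96, 99]

# "medium" is an assumed label for the middle bin.
categories = ["very low", "low", "medium", "high", "very high"]

n_bins = len(categories)
x_min, x_max = sorted_orders[0], sorted_orders[-1]
width = (x_max - x_min) / n_bins

table = []
for i, category in enumerate(categories):
    is_last = i == n_bins - 1
    begin = x_min + i * width
    end = x_max if is_last else x_min + (i + 1) * width
    if is_last:
        # Last bin: begin <= orders <= max(orders), so the maximum is counted.
        frequency = sum(1 for x in sorted_orders if begin <= x <= end)
    else:
        # All other bins: begin <= orders < end, so nothing is counted twice.
        frequency = sum(1 for x in sorted_orders if begin <= x < end)
    table.append({
        "category": category, "begin": begin, "end": end,
        "midpoint": (begin + end) / 2, "frequency": frequency,
    })

print(f"{'category':<10}{'begin':>7}{'end':>7}{'midpoint':>10}{'frequency':>11}")
for row in table:
    print(f"{row['category']:<10}{row['begin']:>7.1f}{row['end']:>7.1f}"
          f"{row['midpoint']:>10.1f}{row['frequency']:>11d}")
```

A quick sanity check is that the frequencies add up to the sample size of 20; if they do not, an observation has been dropped or counted twice.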



  5. A few more steps and we will have a table of derived metrics to help answer our questions about the distribution center. In another column we calculate the relative frequency of each interval: the bin’s frequency count as a percentage of the total count (20) of the sample. In the first interval there are 2 very low observations, so its relative frequency is 2/20 or 10% of the sample. We continue with the rest of the bins. Next we calculate the cumulative relative frequency: the running sum of the relative frequencies of all classes up to and including the current class. If the relative frequency of the second class is 7/20 or 35%, then the cumulative relative frequency across both the first (10%) and the second (35%) intervals is 45%. This means that 45% of the sample falls in the very low and low levels of order sizes.

Let’s finish building the table with two more columns. After that we can plot our handiwork.
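The two extra columns can be sketched in Python as follows. The frequencies are the counts obtained by applying the interval rules to the sorted orders (the text confirms the first two, 2 and 7); the "medium" label is again our assumption. Cumulative values are computed from running counts rather than by adding percentages, which avoids rounding drift.

```python
from itertools import accumulate

categories = ["very low", "low", "medium", "high", "very high"]  # "medium" assumed
frequencies = [2, 7, 3, 5, 3]  # counts per bin from the sorted orders
total = sum(frequencies)       # 20 observations

# Relative frequency: each bin's share of the sample.
relative = [f / total for f in frequencies]

# Cumulative relative frequency: running count divided by the total.
cumulative = [c / total for c in accumulate(frequencies)]

print(f"{'category':<10}{'freq':>5}{'rel':>7}{'cum':>7}")
for cat, f, rel, cum in zip(categories, frequencies, relative, cumulative):
    print(f"{cat:<10}{f:>5d}{rel:>7.0%}{cum:>7.0%}")
```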

  1. What do we notice about the cumulative relative frequency result in the last bin?

  2. How much of the data is high or very high in orders?



  6. Let’s plot our table. Draw a box whose left vertical side is the relative frequency axis in percent (label this axis), whose right vertical side is the cumulative relative frequency axis (label this secondary axis too), and whose bottom horizontal side shows the midpoints of the 5 bins (label it with both the midpoints and the categories for visual clarity). Plot relative frequency versus midpoint as a bar chart and cumulative relative frequency versus midpoint as a line plot.
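For readers who prefer to draw the chart by program, here is one possible sketch in Python. It assumes the third-party matplotlib library is installed; the function name `plot_frequency_table` and the axis wording are our own, and the "medium" category label is an assumption. The percentages are those derived from the orders sample.

```python
midpoints = [78.3, 82.9, 87.5, 92.1, 96.7]
relative_pct = [10, 35, 15, 25, 15]      # relative frequency (%)
cumulative_pct = [10, 45, 60, 85, 100]   # cumulative relative frequency (%)
categories = ["very low", "low", "medium", "high", "very high"]  # "medium" assumed


def plot_frequency_table(midpoints, relative_pct, cumulative_pct, labels, width):
    """Bar chart of relative frequency with a cumulative line on a second axis."""
    import matplotlib.pyplot as plt  # third-party; imported only when plotting

    fig, ax_rel = plt.subplots()

    # Left axis: bars one bin width wide, centered on the midpoints.
    ax_rel.bar(midpoints, relative_pct, width=width, edgecolor="black")
    ax_rel.set_xlabel("order size (bin midpoint and category)")
    ax_rel.set_ylabel("relative frequency (%)")
    ax_rel.set_xticks(midpoints)
    ax_rel.set_xticklabels([f"{m}\n{c}" for m, c in zip(midpoints, labels)])

    # Right (secondary) axis: cumulative relative frequency as a line.
    ax_cum = ax_rel.twinx()
    ax_cum.plot(midpoints, cumulative_pct, color="black", marker="o")
    ax_cum.set_ylabel("cumulative relative frequency (%)")
    ax_cum.set_ylim(0, 105)

    return fig
```

Calling `plot_frequency_table(midpoints, relative_pct, cumulative_pct, categories, width=4.6)` returns a figure that can be saved with `fig.savefig(...)` or displayed interactively.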

What do you get?


Practice, practice, practice …

If we do enough of these exercises together, we will (probably) understand what it is to describe data with empirical distributions and answer these two questions:

  1. How much or many?

  2. How often?

Suppose you help operate the project management office of a housing authority in the Bronx. Below is a listing of the variables in the database from which a Bronx sample is drawn, followed by a sorted sample of 20 values of the total number of housing units under the category of new construction for 2019.

Can you identify the four data types here?

## Observations: 4,103
## Variables: 41
## $ `Project ID`                         <dbl> 61875, 61875, 61875, 6187...
## $ `Project Name`                       <chr> "1199 HOUSING CORP.PLP.FY...
## $ `Project Start Date`                 <chr> "6/28/2019", "6/28/2019",...
## $ `Project Completion Date`            <chr> NA, NA, NA, NA, NA, "6/28...
## $ `Building ID`                        <dbl> 413, 804652, 804825, 8048...
## $ Number                               <chr> "2070", "420", "2090", "2...
## $ Street                               <chr> "1 AVENUE", "EAST 111 STR...
## $ Borough                              <chr> "Manhattan", "Manhattan",...
## $ Postcode                             <dbl> 10029, 10029, 10029, 1002...
## $ BBL                                  <dbl> 1017010001, 1017010001, 1...
## $ BIN                                  <dbl> 1083953, 1083956, 1083954...
## $ `Community Board`                    <chr> "MN-11", "MN-11", "MN-11"...
## $ `Council District`                   <dbl> 8, 8, 8, 8, 24, 17, 17, 9...
## $ `Census Tract`                       <dbl> 162, 162, 162, 162, 1267,...
## $ `NTA - Neighborhood Tabulation Area` <chr> "MN33", "MN33", "MN33", "...
## $ Latitude                             <dbl> 40.79037, 40.79228, 40.79...
## $ Longitude                            <dbl> -73.93951, -73.93653, -73...
## $ `Latitude (Internal)`                <dbl> 40.79088, 40.79088, 40.79...
## $ `Longitude (Internal)`               <dbl> -73.93768, -73.93768, -73...
## $ `Building Completion Date`           <chr> NA, NA, NA, NA, NA, "6/28...
## $ `Reporting Construction Type`        <chr> "Preservation", "Preserva...
## $ `Extended Affordability Only`        <chr> "No", "No", "No", "No", "...
## $ `Prevailing Wage Status`             <chr> "Non Prevailing Wage", "N...
## $ `Extremely Low Income Units`         <dbl> 74, 66, 68, 69, 75, 0, 0,...
## $ `Very Low Income Units`              <dbl> 352, 310, 327, 326, 0, 3,...
## $ `Low Income Units`                   <dbl> 0, 0, 0, 0, 124, 12, 5, 2...
## $ `Moderate Income Units`              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Middle Income Units`                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Other Income Units`                 <dbl> 0, 0, 1, 1, 1, 0, 1, 0, 0...
## $ `Studio Units`                       <dbl> 39, 39, 39, 38, 90, 0, 0,...
## $ `1-BR Units`                         <dbl> 155, 155, 155, 156, 49, 5...
## $ `2-BR Units`                         <dbl> 150, 130, 135, 135, 54, 1...
## $ `3-BR Units`                         <dbl> 64, 38, 48, 48, 7, 0, 0, ...
## $ `4-BR Units`                         <dbl> 18, 14, 19, 19, 0, 0, 0, ...
## $ `5-BR Units`                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `6-BR+ Units`                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Unknown-BR Units`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ `Counted Rental Units`               <dbl> 0, 0, 0, 0, 200, 15, 16, ...
## $ `Counted Homeownership Units`        <dbl> 426, 376, 396, 396, 0, 0,...
## $ `All Counted Units`                  <dbl> 426, 376, 396, 396, 200, ...
## $ `Total Units`                        <dbl> 426, 376, 396, 396, 200, ...

index                1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
sorted total units   1   1   1   1   1   1   1  13  41  59  62  75  86 102 118 122 199 249 250 281

  • Build a frequency table with 5 bins and categories from very low to very high. Include the beginning, ending and midpoint of intervals, the frequency, relative frequency, and cumulative relative frequency of each class interval.


  • Further visualize the table with a frequency histogram and cumulative frequency line plot.
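After working the exercise by hand, the result can be checked against a reusable version of the procedure. The Python function below is a sketch under the same rules used for the orders data (equal-width bins, half-open intervals, closed last interval); the name `frequency_table` and the "medium" label are assumptions made for illustration.

```python
def frequency_table(values, labels):
    """Return one row per bin: (label, begin, end, midpoint,
    frequency, relative frequency, cumulative relative frequency)."""
    data = sorted(values)
    n_bins = len(labels)
    x_min, x_max = data[0], data[-1]
    width = (x_max - x_min) / n_bins
    rows, running = [], 0
    for i, label in enumerate(labels):
        is_last = i == n_bins - 1
        begin = x_min + i * width
        end = x_max if is_last else x_min + (i + 1) * width
        # begin <= x < end, except the last bin, which also includes the maximum.
        count = sum(1 for x in data
                    if begin <= x and (x <= end if is_last else x < end))
        running += count
        rows.append((label, begin, end, (begin + end) / 2,
                     count, count / len(data), running / len(data)))
    return rows


total_units = [1, 1, 1, 1, 1, 1, 1, 13, 41, 59,
               62, 75, 86, 102, 118, 122, 199, 249, 250, 281]
labels = ["very low", "low", "medium", "high", "very high"]  # "medium" assumed

rows = frequency_table(total_units, labels)
for label, begin, end, mid, count, rel, cum in rows:
    print(f"{label:<10}{begin:>7.1f}{end:>7.1f}{mid:>7.1f}"
          f"{count:>4d}{rel:>6.0%}{cum:>6.0%}")
```

Compared with the orders sample, most of the observations here pile up in the lowest bin, which is exactly the kind of shape the histogram is meant to reveal.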


Try this

Apply the data distribution approach to …

TO BE CONTINUED!