Create and interpret graphs (dot plots, histograms, pie charts, bar charts) as a means of summarizing and communicating data meaningfully.
Identify the shape of a distribution (right-skewed, left-skewed, symmetric, or uniform).
The center of a distribution may refer to the mean, the weight balancing point; the median, the 50/50 breaking points; or a mode, a peak. For a geometric explanation, see Mean, Median and Mode in Distributions: Geometric Aspects
A dot plot includes all values from the data set, with one dot for each occurrence of an observed value from the set.
How to Construct
The data set contains 15 petal lengths of iris flower. Create a dot plot to describe the distribution of petal lengths.
1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2
Solution: For each number in the data set, we draw a dot. We stack dots of the same value from bottom to up.
The data set contains the heights of 20 Black Cherry Trees. Create a dot plot to describe the distribution of the heights.
64, 65, 66, 71, 72, 74, 74, 75, 75, 76, 76, 77, 79, 80, 80, 80, 81, 81, 86, 87
Histograms are frequently used to describe large sets of numerical data, in particular, data of a continuous variable.
A histogram consists of a horizontal axis, a vertical axis and adjoining vertical bars. The horizontal axis is labeled to break data values into bins (classes). The vertical axis is then scaled accordingly so that the area of a bar is the frequency or the relative frequency of values in the bin.
To construct a histogram, we need to decide boundary of bars which forms intervals, called bins or classes and find the frequency of values in each bin.
A frequency distribution is a table which contains bins, frequencies and relative frequencies which are proportions (percentage) defined by the formula \(\text{Relative frequency} =\dfrac{\text{Class frequency}}{\text{Sample size}}\).
The lower boundary of a bin is called a lower bin limit. The upper boundary of a bin is called an upper bin limit.
The bin width is the distance between the lower (or upper) bin limits of two consecutive bins.
The difference between the maximum and the minimum data entries is called the range.
The range is roughly, indeed slightly smaller than, \(\text{number of bins}\times \text{bin width}\).
The midpoint of a bin is the half of the sum of the lower and upper limits of the bin.
Bin limits should be chosen so that each value of the data is in exactly one bin.
The following data set show the mpg (mile per gallon) of \(30\) cars. Construct a frequency table and frequency histogram for the data set using \(7\) bins. What can be concluded from the histogram?
21, 21, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8, 19.7
Solution:
Bin | Frequency |
---|---|
[ 10.4 , 13.8 ) | 3 |
[ 13.8 , 17.2 ) | 7 |
[ 17.2 , 20.6 ) | 7 |
[ 20.6 , 24 ) | 6 |
[ 24 , 27.4 ) | 3 |
[ 27.4 , 30.8 ) | 2 |
[ 30.8 , 34.2 ) | 2 |
The blue dot-dash curve is called the density curve, the brown dashed line is over the mean, the red dotted line is over the median.
The following information can be obtained from the histogram.
The histogram has a single peak within the interval 13.8 and 20.6.
Majority of cars in the sample has mpg lower than 20.6.
The mean mpg is around 20.6.
The range of the mpg is between 13.8 and 34.2.
The right tail is longer.
There should be no space between any two bars.
The area of a bar represents the relative frequency for the bin. Equivalently, $$\text{area of bar} = \text{relative frequency} = \text{height of bar (density)}\times \text{bar width}.$$
The vertical axis is called the density scale.
For continuous variables, unequal bin width may also be used.
One convention about the bin width for \(k\) bins is to choose a number with the same or one more decimal place that is greater than \(\frac{\text{range}}{k}\), but no more than \(\frac{\text{range}}{k-1}\) as the bin width.
To determine the number of bins, there are practical rules. For example, the Rice rule takes the bin number \(k\) as the round up of \(n^{1 / 3}\). The webpage Statistic How To has more information.
The convenient starting point can be a value smaller but not too much smaller than the minimum.
The bin width can significantly affect the shape of the histogram. It is better to experiment with different choices.
The following data set show the petal length of 20 irises. Construct a frequency table and frequency histogram for the data set using 6 bins. What can you conclude from the histogram?
5.6, 5.1, 5, 6.7, 1.4, 5.9, 1.6, 1.5, 1.5, 3.9, 5.1, 1.2, 4.7, 4.3, 1.4, 4.7, 6.1, 4.2, 4.8, 6
Right skewed (or reverse \(J\)-shaped): A right-skewed distribution has a lot of data at lower variable values.
Left skewed (or \(J\)-shaped): A left skewed distribution has a lot of data at higher variable values with smaller amounts of data at lower variable values.
Symmetric with a central peak (or bell-shaped): A central peak with a tail in both directions. A bell-shaped distribution has a lot of data in the center with smaller amounts of data tapering off in each direction.
Uniform: A rectangular shape, the same amount of data for each variable value.
For examples of left skewed and uniform distributions, please see the example in Dotplot 2 of 2 in Concepts in Statistics
Statistics are used to compare and sometimes identify authors. The following lists shows a simple random sample that compares the letter counts for three authors.
Terry: 7, 9, 3, 3, 3, 4, 1, 3, 2, 2
Davis: 3, 4, 4, 4, 1, 4, 5, 2, 3, 1
Maris: 2, 3, 4, 4, 4, 6, 6, 6, 8, 3
Create a dot plot for each sample and describe the shape of the distribution of each sample.
A student survey was conducted at a major university. The following histogram shows distribution of alcoholic beverages consumed in a typical week.
The red line is over the median and the blue line is over the mean.
A frequency distribution for categorical data is a table that displays the possible categories along with the associated frequencies and/or relative frequencies.
The frequency of a category is the number of occurrences of elements in the category.
The proportion of a frequency to the size of the population or the sample is also called the relative frequency.
A relative frequency distribution is a frequency distribution that includes relative frequencies.
A bar chart consists of bars (rectangles), each represent a category and the area of each bar is constantly proportional to the relative frequency of that category.
A pie chart is a pie with sectors represents categories and the area of each sector is proportional to the relative frequency of that category.
Stacked bar char (segmented bar chart) is a bar divided into segments, with each segment representing a category and the area of the segment is proportional to the relative frequency for that category.
The counts of majors of 100 students in a sample are shown in the table on the right. Visualize the data using a bar, pie and stacked bar chart.
Major | Frequency (Counts) |
---|---|
Art | 30 |
Engineering | 50 |
Science | 20 |
Solution: The relative frequency table is shown below.
Major | Frequency | Relative Frequency |
---|---|---|
Art | 30 | 30% |
Engineering | 50 | 50% |
Science | 20 | 20% |
Total | 100 | 100% |
The following are the charts created in Excel.
Bar chart
Pie chart
Stacked bar chart
The following data table summarize passengers on Titanic. Using a chart to describe the data table.
Class | Passengers |
---|---|
1st | 325 |
2nd | 285 |
3rd | 706 |
Crew | 885 |
Lab Instructions in Excel
In Excel, to create a frequency table for a data array, we need a bin array. The values in a bin array in Excel are the first \(k-1\) upper bin limits. For example, if the bin array consists of 30, 40, and 50, then the bins will be \([\text{min},30]\), \((30,40]\), \((40, 50]\), \((50, \infty)\).
With a data array and a bin array, the Excel function FREQUENCY(data_array, bins_array)
can be used to create a frequency table.
Suppose the data set is in column A
and the bin array is in column B
.
column C
, select a column array of \(k\) cells, then enter =FEQUENCY(
,
)
.Hit Enter
(Ctrl + Shift + Enter
in older versions), you will get a frequency table.
Excel has many built-in chart functions. To create a charts,
Insert
tab, click on an appropriate chart in the Charts
command set.The appearance of chart can be changed after being created.
Select the data
On the Insert
tab, in the Charts
group, from the Insert Statistic Chart
dropdown list, select Histogram
:
Note: The histogram contains a special first bin which always contains the smallest number. This is different from many textbooks.
To format the histogram chart is similar to format a Pie chart. For example, you can change bin width from Format Axis
.
Right-click on the horizontal axis and choose Format Axis
in the popup menu:
In the Format Axis
pane, on the Axis Options
tab, you may try different options for bins.
Excel using a different convention to create histogram. The first bin is a closed interval and other bins are left open and right closed intervals.
Select the Overflow bin checkbox and type the number, all values above this number will be added to the last bin.
Select the Underflow bin checkbox and type the number, all values below and equal to this number will be added to the first bin.
Histograms show the shape and the spread of numerical data. For categorical data, discrete by its definition, bar charts are usually used to represent category frequencies.
Analysis ToolPak
Suppose your data set is in Column A
in Excel.
In the cell B1
, put the first lower bin limit, which is a number slightly less than the minimum but has more decimal places than the data set.
Create upper bin limits in column C.
In Data menu, look for the Data Analysis ToolPak (if not, go to File > Options > Add-ins > Manage Excel Add-ins, check Analysis ToolPak). In the popup windows, find Histogram.
In the input range, select your data set. In the bin range, select upper bins.
Check Chart Output and hit OK. You will see the frequency table and histogram in Sheet 2.
Change the gap between bars. Right-click a bar and choose Format Data Series...
and change the Gap Width
to 2% or 1%.
If you have a raw data set, follow the same procedure a creating a histogram but with a bin width equal the same accuracy of the data. For example, if you data set consists of integers, then choose 1 as the bin-width.
Change the format of bars in the histogram.
Right click a bar and select Format Data Series...
.
Find Fill & Line
and select both Picture or texture fill
and Stack and Scale with
.
Click the button Oneline...
and input dot in search bing
and hit enter.
Select a picture you like, and you will get a dot-plot.
Describe the distribution of percentage of college students attending college in home states. (To be demonstrated in-class)
93, 92, 91, 91, 90, 90, 90, 90, 89, 89, 89, 89, 89, 89, 89, 88, 87, 87, 85, 85, 85, 85, 84, 84, 83, 81, 81, 81, 80, 78, 77, 77, 76, 76, 76, 76, 72, 72, 70, 68, 67, 65, 65, 64, 62, 60, 58, 57, 57, 50
Data is taken from Example 3.15 in Introduction to Statistics and Data Analysis.
Consider the frequency table on the right.
Sleep Deficit | Morning Start | Afternoon Start |
---|---|---|
(in hours) | Rel. Freq. | Rel Freq. |
−6 to < −4 | 0.007 | 0.02 |
−4 to < −2 | 0.028 | 0.05 |
−2 to < 0 | 0.065 | 0.19 |
0 to < 2 | 0.442 | 0.57 |
2 to < 4 | 0.364 | 0.12 |
4 to < 6 | 0.078 | 0.04 |
6 to < 8 | 0.015 | 0.01 |
Source: Example 3.16 in Textbook Introduction to Statistics and Data Analysis | 6th Edition.
Use Excel to complete the following tasks:
Create a random sample of 30 two-digit integers.
Create a histogram with 6 bins for the sample.
Describe the shape of the distribution of the sample of 30 two-digit integers.