Introduction to Statistics and Data Analysis 6th Edition by Roxy Peck, Chris Olsen, Tom Short
Image source: Concepts in Statistics (lumen learning)
Data: collection of observations such as counts, measurements, responses or experiments.
Population: The entire collection of individuals or objects that are of interest.
Sample: a subset of a population that is selected for study.
Parameter: A number that is a property of the population.
Statistic: A number, such as a percentage, that represents a property of a sample.
Identify the statistical concepts in the following study: To learn the percentage of students who go to school by public transportation, 500 students at a college were surveyed. 50% said they go to school by public transportation.
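Answer: The population is all students at the college. The sample is the 500 students who were surveyed. The parameter is the percentage of all students at the college who go to school by public transportation. The statistic is 50%, the percentage of surveyed students who said they do.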
Variable: a characteristic or attribute of interest that we gather about individuals or objects. There are two types of variables according to their values: categorical (qualitative) variables and numerical (quantitative) variables.
Univariate data: a collection of observations on a single variable.
Bivariate data: a collection of paired observations on two variables.
Multivariate data: a collection of observations on two or more variables, recorded together for each individual or object.
A numerical variable is called a discrete variable if its values are countable. It is called a continuous variable if it can take all values in certain intervals.
The Higher Education Research Institute at UCLA surveys over 20,000 college seniors each year. One question on the survey asks seniors the following question: If you could make your college choice over, would you still choose to enroll at your current college? Possible responses are definitely yes (DY), probably yes (PY), probably no (PN), and definitely no (DN).
Question:
Answer:
Identify the population, the sample, the variable of study, the type of the variable, the population parameter, and the sample statistic.
An administrator wishes to estimate the passing rate of a certain course. She takes a random sample of 50 students and obtains their letter grades of that course. Among the 50 students, 32 students earned a grade C or better.
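As a quick arithmetic check, the sample statistic in this study can be computed directly from the counts given above. The variable names in this Python sketch are illustrative and are not part of the original exercise.

```python
# Counts stated in the study: 32 of the 50 sampled students earned a C or better.
sample_size = 50
passed = 32

# The sample proportion is a statistic. It estimates the unknown population
# parameter: the passing rate among all students in the course.
sample_proportion = passed / sample_size
print(f"Sample passing rate: {sample_proportion:.0%}")
```

The printed value describes only the 50 sampled students; the passing rate for the whole course remains unknown and is what the administrator is trying to estimate.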
A statistical study can usually be categorized as an observational study or an experiment, depending on how the study is carried out.
In an observational study, it is not possible to draw clear cause-and-effect conclusions because we cannot rule out the possibility that the observed effect is due to some other variables not being studied, known as extraneous variables.
Which type of study will answer each question?
What proportion of all college students in the United States have taken classes at a community college?
Does use of computer-aided instruction in college math classes improve test scores?
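Answer: The first question asks for an estimate of a population proportion, so an observational study (a survey of a sample of college students) will answer it. The second question asks whether computer-aided instruction causes an improvement in test scores, so it calls for an experiment.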
See Types of Statistical Studies (2 of 4) in the textbook Concepts in Statistics for more examples.
Identify the type of statistical study:
A study took a random sample of adults and asked them about their bedtime habits. The data showed that people who drank a cup of tea before bedtime were more likely to go to sleep earlier than those who didn’t drink tea.
Another study took a group of adults and randomly divided them into two groups. One group was told to drink tea every night for a week, while the other group was told not to drink tea that week. Researchers then compared when each group fell asleep.
Source: Khan Academy
Type of Research Question | Examples |
---|---|
Make an estimate about the population (often an estimate about an average value or a proportion with a given characteristic) | What is the average number of hours that community college students work each week? What proportion of all U.S. college students are enrolled at a community college? |
Test a claim about the population (often a claim about an average value or a proportion with a given characteristic) | Is the average course load for a community college student greater than 12 units? Do the majority of community college students qualify for federal student loans? |
Compare two populations (often a comparison of population averages or proportions with a given characteristic) | In community colleges, do female students have a higher GPA than male students? Are college athletes more likely than non-athletes to receive academic advising? |
Investigate a relationship between two variables in the population | Is there a relationship between the number of hours high school students spend each week on Facebook and their GPA? Is academic counseling associated with quicker completion of a college degree? |
A research question that focuses on a cause-and-effect relationship is common in disciplines that use experiments, such as medicine or psychology.
In a study of a relationship between two variables, one variable is the explanatory variable, and the other is the response variable.
To establish a cause-and-effect relationship, we want to eliminate the influence of confounding variables and make sure the explanatory variable is the only thing that affects the response variable.
Confounding variable: An extra variable that may have an effect on the response variable of interest.
Determine whether each question is a cause-and-effect question. What are the explanatory and response variables?
Answer:
This question investigates a cause-and-effect relationship. The explanatory variable is computer-aided instruction and the response variable is the test scores.
This question investigates a correlation between variables in a population and is not a cause-and-effect question. The explanatory variable is tutoring, and the response variable is the performance.
In general, we should not make cause-and-effect statements from observational studies unless the impact of confounding variables can be significantly reduced.
Example: A researcher studies the medical records of 500 randomly selected patients. Based on the information in the records, he divides the patients into two groups: those given the recommendation to take an aspirin every day and those with no such recommendation. He reports the percentage of each group that developed heart disease.
Determine whether the study supports the conclusion that taking aspirin lowers the risk of heart attacks.
Answer: The conclusion claims a cause-and-effect relationship. To support such a claim, we need an experiment. This study is observational: the researcher only read existing records and had no control over who was advised to take aspirin, so it cannot support the conclusion.
Does higher education attainment lead to higher salary?
To make accurate inferences, the sample must be representative of the population.
A sampling plan describes exactly how we will choose the sample.
A sampling plan is biased if it systematically favors certain outcomes.
In random sampling, every individual or object has an equal chance of being selected.
Simple random sample: a sample chosen so that every group of the same size has an equal chance of being selected. Tables of random numbers, calculators, and software are often used to generate random numbers.
Stratified random sample: The population is first split into groups. Then subjects from each group are selected randomly.
Cluster sample: The population is first split into groups. Then some groups are selected randomly.
Systematic sample: First, a starting number is chosen randomly. Then every \(n\)-th individual after it is selected.
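The four sampling methods above can be sketched in Python with the standard `random` module. This is a minimal illustration, not a prescribed procedure: the student ID numbers, the grouping into four class years, and all function names are hypothetical, and the cluster sample assumes that everyone in a selected group is included.

```python
import random

def simple_random_sample(population, k, rng):
    """Every group of k individuals is equally likely to be chosen."""
    return rng.sample(population, k)

def stratified_sample(strata, k_per_stratum, rng):
    """Draw a simple random sample from every group (stratum)."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole groups at random and keep everyone in the chosen groups."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def systematic_sample(population, step, rng):
    """Random starting point, then every step-th individual."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(2024)        # fixed seed so reruns give the same draws
students = list(range(1, 101))   # hypothetical ID numbers 1..100
# Hypothetical grouping: four class years of 25 students each.
years = {f"year{i + 1}": students[25 * i: 25 * (i + 1)] for i in range(4)}

print(simple_random_sample(students, 10, rng))
print(stratified_sample(years, 3, rng))
print(cluster_sample(years, 2, rng))
print(systematic_sample(students, 10, rng))
```

Note the difference between the two group-based methods: the stratified sample takes a few students from every year, while the cluster sample takes every student from a few years.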
Determine the type of sampling method.
A market researcher polls every tenth person who walks into a store.
A researcher selects the 100 students whose student ID numbers match 100 numbers generated by a computer randomization program.
The first 30 people who walk into a sporting event are polled on their television preferences.
Voluntary Response Bias / Self-Selection Bias: participants can choose whether to participate in the study. Example: “non-scientific” polls taken on television or websites.
Measurement or Response Bias: the method of observation tends to produce values that systematically differ from the true value. Example: The question “How many bottles of beer do you drink each day?” will likely suffer from response bias.
Nonresponse Bias: responses are not obtained from all selected individuals. Example: mall surveys.
Undercoverage Bias: the sample includes too few observations from a segment of the population. Example: surveying some of your classmates to estimate the average GPA of a college. This sampling method is known as convenience sampling.
Suppose that you want to estimate the proportion of students at your college that use the library.
Which sampling plan will produce the most reliable results?
Answer: The 4th sampling plan is the most reliable. The first three undercover the college's student population.
In general, the larger the sample size, the more accurate the conclusion. However, a large sample does not fix a biased sampling plan, so we have to avoid bad sampling.
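The two points above (larger random samples are more accurate, but size does not repair bias) can be illustrated with a small simulation. Everything in this sketch is assumed for illustration: a made-up campus of 10,000 students, made-up library-use rates for commuters and residents, and a biased plan that surveys only residents.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical campus: 1 = uses the library, 0 = does not.
# Assumed for illustration: about 40% of commuters and 70% of residents use it.
commuters = [1 if rng.random() < 0.40 else 0 for _ in range(6000)]
residents = [1 if rng.random() < 0.70 else 0 for _ in range(4000)]
population = commuters + residents
true_proportion = statistics.mean(population)

def typical_error(sample_size, trials=200):
    """Average absolute error of the sample proportion over repeated random samples."""
    errors = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)
        errors.append(abs(statistics.mean(sample) - true_proportion))
    return statistics.mean(errors)

for n in (25, 100, 400):
    print("random sample of", n, "-> typical error", round(typical_error(n), 3))

# A large but biased sample: 2,000 students, all surveyed in the residence halls.
biased = rng.sample(residents, 2000)
biased_error = abs(statistics.mean(biased) - true_proportion)
print("biased sample of 2000 -> error", round(biased_error, 3))
```

The typical error of the random samples shrinks as the sample size grows, while the much larger residence-hall sample stays far from the true proportion because commuters are left out.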
Control reduces the effects of variables other than the explanatory variables and the response variables, known as extraneous variables.
Three control strategies are control groups, placebos, and blinding.
A control group is a baseline group that receives no treatment or a neutral treatment.
A neutral treatment that has no “real” effect on the dependent variable is called a placebo, and a participant’s positive response to a placebo is called the placebo effect.
Blinding is the practice of not telling participants whether they are receiving a placebo. Double-blinding is the practice of keeping both the participants and the researchers unaware of which group is receiving the treatment and which is receiving the placebo.
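Randomization, that is, assigning participants to treatment groups at random, helps make the groups comparable so that the estimate of the treatment effect is statistically valid.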
Replication reduces variability in experimental results and increases their significance.
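Random assignment to a treatment group and a control group can be sketched as follows. This is a minimal illustration using Python's standard `random` module; the participant labels and the function name are hypothetical.

```python
import random

def randomly_assign(participants, rng):
    """Shuffle the participants, then split them into two equal-sized groups."""
    shuffled = list(participants)   # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

rng = random.Random(7)              # fixed seed so the assignment can be reproduced
participants = [f"P{i:02d}" for i in range(1, 21)]  # hypothetical labels P01..P20
treatment, control = randomly_assign(participants, rng)
print("treatment:", sorted(treatment))
print("control:  ", sorted(control))
```

Because chance alone decides who lands in each group, extraneous variables such as age or prior habits tend to balance out between the groups rather than line up with the treatment.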
A confounding variable has at least a partial effect on the response variable.
Example: In a study of the relation between a type of fertilizer and tomato size, the amount of sunshine is a confounding variable because it also contributes to the growth of the tomatoes.
A lurking variable is an unseen (unmeasured) third variable that affects both the explanatory and response variables and so creates an apparent association between them.
Example: People find that there is a positive association between the number of firefighters and the amount of damage. However, both are affected by the size of the fire.
Both confounding and lurking variables are extraneous variables, which are variables other than the explanatory variables that may have an effect on the response variable.
There is an ongoing debate about how many spaces should be placed after a period in typed documents. Alana read about a study where 100 participants all read the same document typed in Courier New font. Half of the participants were randomly assigned the document with one space after each period, and the other half were given the document with two spaces after each period.
Participants who read the document with two spaces after each period were able to finish reading significantly faster than those with one space after each period. Alana concluded that using two spaces after each period will help people read all documents faster.
Is this study appropriate? Why?
Source: Khan Academy
Lab Instructions in Excel
=RAND() returns a random real number greater than or equal to 0 and less than 1. To generate a random real number between \(a\) and \(b\), one can use =RAND()*(b-a)+a.
=RANDBETWEEN(bottom, top) returns a random integer between bottom and top, inclusive.
=RANDARRAY([rows],[columns],[min],[max],[whole_number]) returns an array of random numbers. You can specify the number of rows and columns to fill, minimum and maximum values, and whether to return whole numbers or decimal values (TRUE for whole numbers and FALSE for decimal values). All arguments of RANDARRAY are optional. RANDARRAY() is equivalent to RAND().
=UNIQUE(RANDARRAY([rows],[columns],[min],[max],[whole_number])) returns the random array with duplicate values removed.
RAND()
Randomly generate a number between 0 and 1.
How to:
Step 1: Choose a cell, say A1.
Step 2: Click the Insert Function button \(f_x\).
Step 3: In the popup window, search for “random” and select RAND.
Step 4: Click OK. You will get a randomly generated number.
Alternatively, you may enter the function =RAND() directly in the cell and press Enter.
RANDBETWEEN()
Generate 10 random integers of 2 digits.
How to:
Step 1: Generate a random integer, say in cell A1, using the Excel function =RANDBETWEEN(10,99).
Step 2: Move the mouse cursor to the lower right corner of cell A1. A solid plus sign (+) will appear.
Step 3: Hold down the left mouse button and drag horizontally or vertically across 10 cells to autofill them with random 2-digit integers.
RANDARRAY()
Generate 10 random integers of 2 digits without repetition.
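How to:
In a cell with 9 empty cells below it, say A1, apply the Excel function =UNIQUE(RANDARRAY(10, 1, 10, 99, TRUE)). You will get a column of integers without duplication. Because UNIQUE drops any repeated values, the column may contain fewer than 10 numbers; if so, recalculate (for example, by pressing F9) until it has 10.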
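For readers who want to check their understanding outside Excel, the same three ideas can be sketched with Python's standard `random` module. This is only a rough analogue of the Excel functions, not a replacement for the lab steps; the interval endpoints 5 and 8 are arbitrary values chosen for illustration.

```python
import random

rng = random.Random(42)   # fixed seed so the results can be reproduced

# Like =RAND(): a real number greater than or equal to 0 and less than 1.
u = rng.random()

# Like =RAND()*(b-a)+a: rescale to the interval from a to b.
a, b = 5, 8               # arbitrary endpoints, for illustration only
x = u * (b - a) + a

# Like =RANDBETWEEN(10, 99) filled across 10 cells: repeats are possible.
with_repeats = [rng.randint(10, 99) for _ in range(10)]

# Sampling without replacement always gives exactly 10 distinct values,
# whereas removing duplicates after the fact can leave fewer than 10.
distinct = rng.sample(range(10, 100), 10)

print(u, x)
print(with_repeats)
print(distinct)
```

The last step differs in design from the UNIQUE approach: it draws 10 different values directly instead of drawing 10 values and discarding the repeats.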
Generate a real number between 1 and 2.
Generate 10 integers of 2 digits that are less than 50.
Generate 10 integers of 2 digits that are less than 50 and without duplication.
If you have a desktop version of Excel, you may install the Analysis ToolPak add-in, which is frequently used for analyzing data.
To install the Analysis ToolPak add-in:
Step 1: In the Excel menu bar, select File.
Step 2: Click Options.
Step 3: In the popup window, click Add-ins.
Step 4: In the new display, look for Manage: Excel Add-ins and click Go next to it.
Step 5: In the new popup window, check Analysis ToolPak and then click the OK button.