Introduction to Statistics and Data Analysis 6th Edition by Roxy Peck, Chris Olsen, Tom Short
Image source: Concepts in Statistics (lumen learning)
Data: a collection of observations, such as counts, measurements, or responses, gathered from surveys or experiments.
Population: The entire collection of individuals or objects that are of interest.
Sample: a subset of a population that is selected for study.
Parameter: A number that is a property of the population.
Statistic: A number, such as a percentage, that represents a property of a sample.
Identify the statistical concepts in the following study: To learn the percentage of students who go to school by public transportation, 500 students at a college were surveyed. 50% said that they go to school by public transportation.
Answer:
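Variable: a characteristic or attribute of interest that we record about individuals or objects. There are two types of variables according to their values: categorical (qualitative) variables, whose values are categories or labels, and numerical (quantitative) variables, whose values are numbers.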
Univariate data: a collection of observations on a single variable.
Bivariate data: a collection of observations on two variables, recorded as pairs of values.
Multivariate data: a collection of arrays of values of two or more variables.
A numerical variable is called a discrete variable if its values are countable. It is called a continuous variable if it can take all values in certain intervals.
The Higher Education Research Institute at UCLA surveys over 20,000 college seniors each year. One item on the survey asks seniors: If you could make your college choice over, would you still choose to enroll at your current college? Possible responses are definitely yes (DY), probably yes (PY), probably no (PN), and definitely no (DN).
Question:
Answer:
Identify the population, the sample, the variable of study, the type of the variable, the population parameter, and the sample statistic.
An administrator wishes to estimate the passing rate of a certain course. She takes a random sample of 50 students and obtains their letter grades in that course. Among the 50 students, 32 earned a grade of C or better.
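The sample statistic in this study is a sample proportion. As a small worked step, writing \(\hat{p}\) for the sample proportion (this symbol is an assumption made here; the notes do not introduce notation for it), the arithmetic is:

```latex
\[
\hat{p} = \frac{\text{number in the sample with a grade of C or better}}{\text{sample size}}
        = \frac{32}{50} = 0.64 = 64\%
\]
```

The population parameter, the passing rate among all students in the course, stays unknown; 64% is only an estimate of it.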
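A statistical study can usually be categorized as an observational study or an experiment, according to how the study is carried out. In an observational study, researchers observe characteristics of a sample without imposing any treatment. In an experiment, researchers impose a treatment and observe the response.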
In an observational study, it is not possible to draw clear cause-and-effect conclusions because we cannot rule out the possibility that the observed effect is due to some other variables not being studied, known as extraneous variables.
Which type of study will answer each question?
What proportion of all college students in the United States have taken classes at a community college?
Does use of computer-aided instruction in college math classes improve test scores?
Answer:
See Types of Statistical Studies (2 of 4) in the textbook Concepts in Statistics for more examples.
Identify the type of statistical study:
A study took a random sample of adults and asked them about their bedtime habits. The data showed that people who drank a cup of tea before bedtime were more likely to go to sleep earlier than those who didn’t drink tea.
Another study took a group of adults and randomly divided them into two groups. One group was told to drink tea every night for a week, while the other group was told not to drink tea that week. Researchers then compared when each group fell asleep.
Source: Khan Academy
| Type of Research Question | Examples |
|---|---|
| Make an estimate about the population (often an estimate about an average value or a proportion with a given characteristic) | What is the average number of hours that community college students work each week? What proportion of all U.S. college students are enrolled at a community college? |
| Test a claim about the population (often a claim about an average value or a proportion with a given characteristic) | Is the average course load for a community college student greater than 12 units? Do the majority of community college students qualify for federal student loans? |
| Compare two populations (often a comparison of population averages or proportions with a given characteristic) | In community colleges, do female students have a higher GPA than male students? Are college athletes more likely than non-athletes to receive academic advising? |
| Investigate a relationship between two variables in the population | Is there a relationship between the number of hours high school students spend each week on Facebook and their GPA? Is academic counseling associated with quicker completion of a college degree? |
Many research studies, especially in medicine and psychology, focus on cause-and-effect relationships.
Example: Does drinking red wine lower the risk of a heart attack?
In a cause-and-effect study, the explanatory variable is the cause and the response variable is the effect. To properly establish a cause-and-effect relationship, researchers try to control extraneous variables, which are variables other than the explanatory variable that may affect the response.
A confounding variable is an extraneous variable that affects the response variable and makes it hard to see the real effect of the explanatory variable.
Example: In a study of the relationship between a type of fertilizer and tomato size, the amount of sunshine is a confounding variable because it also contributes to the growth of the tomatoes.
A lurking variable is an unmeasured third variable that influences both the explanatory and response variables.
Example: More firefighters are linked with more fire damage, but the real lurking variable is fire size, which increases both.
In general, we should not make cause-and-effect statements from observational studies unless the impact of confounding variables can be substantially reduced.
Determine whether each question is a cause-and-effect question. What are the explanatory and response variables?
Answer:
This question investigates a cause-and-effect relationship. The explanatory variable is computer-aided instruction and the response variable is the test scores.
This question investigates a correlation between variables in a population and is not a cause-and-effect question. The explanatory variable is tutoring, and the response variable is the performance.
Do individuals with higher educational attainment tend to earn higher salaries?
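Direct control: Hold extraneous variables constant so they cannot affect the comparison (e.g., every participant reads the same passage). Participants are placed into two groups: a control group (no treatment) and a treatment group (receives treatment).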
Randomization (random assignment): Participants are placed into groups by chance.
Replication: Use enough subjects in each group, and ensure the study can be repeated.
Blocking: Group similar participants into blocks based on a factor known to affect the outcome (e.g., age, prior skill). Then randomize within each block.
Placebo: A fake treatment with no active ingredient.
Placebo effect: Participants report improvement simply because they believe they received real treatment.
Purpose: Reduce or measure the placebo effect so it does not distort the results.
Blinding: Keep people unaware of which treatment they receive.
Purpose: Prevent bias in behavior, reporting, and measurement.
Goal: Test a new reading app for college students.
Block by class period (morning vs. afternoon).
Within each block, randomly assign students to app or no app.
Use many students per group.
Instructors and students do not know which version is “new” vs. “standard” (blinding, if possible).
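The blocking and randomization steps in this example can be sketched in Python. This is a minimal illustration, not the procedure of any particular study: the roster, the student labels, the function name `blocked_assignment`, and the fixed seed are all hypothetical.

```python
import random

def blocked_assignment(students, block_of, treatments=("app", "no app"), seed=0):
    """Randomly assign treatments within each block.

    students: list of student identifiers (hypothetical labels).
    block_of: dict mapping each student to a block label.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Blocking: group students by class period.
    blocks = {}
    for s in students:
        blocks.setdefault(block_of[s], []).append(s)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # randomization within the block
        # Deal treatments out in turn so group sizes stay balanced.
        for i, s in enumerate(shuffled):
            assignment[s] = treatments[i % len(treatments)]
    return assignment

# Hypothetical roster: 8 morning students and 8 afternoon students.
students = [f"S{i:02d}" for i in range(1, 17)]
block_of = {s: ("morning" if i < 8 else "afternoon") for i, s in enumerate(students)}

assignment = blocked_assignment(students, block_of)
for s in students:
    print(s, block_of[s], assignment[s])
```

Because the shuffle happens separately inside each block, every class period contributes the same number of students to each treatment, so class period cannot be confounded with the treatment.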
Example: A researcher studies the medical records of 500 randomly selected patients. Based on the information in the records, he divides the patients into two groups: those given the recommendation to take an aspirin every day and those with no such recommendation. He reports the percentage of each group that developed heart disease.
Determine whether the study supports the conclusion that taking aspirin lowers the risk of heart attacks.
Answer: The conclusion claims a cause-and-effect relationship, which requires an experiment. This study is observational: the researcher only examined existing records and did not assign patients to treatments, so it does not support the conclusion.
A teacher read about a study testing whether the background color of a digital reading passage affects reading speed. In the study, 80 students read the same passage and were randomly assigned to: Group A (40 students) with white background; Group B (40 students) with light yellow background.
Students in Group B finished reading significantly faster than those in Group A.
The teacher concluded that using a light yellow background will help all students read faster in all reading situations.
Answer:
Yes, a cause-and-effect conclusion is supported for the students in the study. It is an experiment with direct control (same passage) and random assignment, which helps support a cause-and-effect relationship.
No, the teacher's broader conclusion is not supported. The conclusion is overgeneralized: the study tested only one group of students, one passage, and one setting.
There is an ongoing debate about how many spaces should be placed after a period in typed documents. Alana read about a study where 100 participants all read the same document typed in Courier New font. Half of the participants were randomly assigned the document with one space after each period, and the other half were given the document with two spaces after each period.
Participants who read the document with two spaces after each period were able to finish reading significantly faster than those with one space after each period. Alana concluded that using two spaces after each period will help people read all documents faster.
Source: Khan Academy
To make accurate inferences, the sample must be representative of the population.
A sampling plan describes exactly how we will choose the sample.
A sampling plan is biased if it systematically favors certain outcomes.
In random sampling, every individual or object has an equal chance of being selected.
Simple random sample: a sample chosen so that every possible group of the same size has an equal chance of being selected. Tables of random numbers, calculators, and software are often used to generate random numbers.
Stratified random sample: The population is first split into groups. Then subjects from each group are selected randomly.
Cluster sample: The population is first split into groups (clusters). Then some groups are selected randomly, and every subject in the selected groups is included.
Systematic sample: First, a starting point is chosen randomly. Then every \(n\)-th individual or object after it is selected.
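The four sampling methods above can be compared side by side in a short Python sketch. The population of 100 student IDs, the four sections of 25, the sample sizes, and the function names are all hypothetical choices made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: student IDs 1..100 in 4 class sections of 25.
population = list(range(1, 101))
section_of = {sid: (sid - 1) // 25 for sid in population}
sections = {}
for sid in population:
    sections.setdefault(section_of[sid], []).append(sid)

def simple_random_sample(pop, k):
    # Every possible group of k individuals is equally likely.
    return rng.sample(pop, k)

def stratified_sample(groups, k_per_group):
    # Draw a separate random sample from every group (stratum).
    return [sid for members in groups.values()
            for sid in rng.sample(members, k_per_group)]

def cluster_sample(groups, n_groups):
    # Randomly pick whole groups (clusters) and keep everyone in them.
    chosen = rng.sample(sorted(groups), n_groups)
    return [sid for g in chosen for sid in groups[g]]

def systematic_sample(pop, step):
    # Random starting position, then every step-th individual.
    start = rng.randrange(step)
    return pop[start::step]

srs = simple_random_sample(population, 20)
strat = stratified_sample(sections, 5)
clus = cluster_sample(sections, 2)
syst = systematic_sample(population, 10)

print("simple random:", sorted(srs))
print("stratified:   ", sorted(strat))
print("cluster:      ", len(clus), "students from 2 whole sections")
print("systematic:   ", syst)
```

Note the difference between the two group-based methods: the stratified sample takes a few students from every section, while the cluster sample takes every student from a few sections.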
Determine the type of sampling method.
A market researcher polls every tenth person who walks into a store.
A researcher selects the 100 students whose student ID numbers match 100 numbers generated by a computer randomization program.
The first 30 people who walk into a sporting event are polled on their television preferences.
Voluntary Response Bias / Self-Selection Bias: participants can choose whether to participate in the study. Example: “non-scientific” polls taken on television or websites.
Measurement or Response Bias: the method of observation tends to produce values that systematically differ from the true value. Example: The question “How many bottles of beer do you drink each day?” will likely suffer from response bias.
Nonresponse Bias: responses are not obtained from all selected individuals. Example: mall surveys.
Undercoverage Bias: the sample includes too few observations from a segment of the population. Example: surveying some of your classmates to estimate the average GPA of a college. This sampling method is known as convenience sampling.
Suppose that you want to estimate the proportion of students at your college that use the library.
Which sampling plan will produce the most reliable results?
Answer: The 4th sampling plan is the most reliable. The first three plans undercover the college's student population.
In general, the larger the sample size, the more accurate the conclusion. However, a large sample does not make up for a biased sampling plan.
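A small simulation can make the trade-off between sample size and sampling bias concrete. The sketch below uses entirely made-up numbers: a hypothetical college of 10,000 students in which 30% use the library, and an assumed biased plan that surveys near the library entrance.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical college: 10,000 students, 30% of whom use the library.
population = [True] * 3000 + [False] * 7000
true_proportion = sum(population) / len(population)

# Biased plan (assumed): survey near the library entrance, where the
# reachable students are all 3,000 users but only 1,000 non-users.
biased_frame = [True] * 3000 + [False] * 1000
biased_sample = rng.sample(biased_frame, 2000)  # large but biased
biased_estimate = sum(biased_sample) / len(biased_sample)

# Unbiased plan: a simple random sample of only 100 students.
srs = rng.sample(population, 100)  # small but random
srs_estimate = sum(srs) / len(srs)

print("true proportion:        ", true_proportion)
print("biased plan, n = 2000:  ", round(biased_estimate, 3))
print("random sample, n = 100: ", round(srs_estimate, 3))
```

Under these assumptions the large sample lands far from the true proportion because it undercovers non-users, while the much smaller random sample lands near it. Increasing the size of the biased sample would not fix the problem.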
Practice on Sampling Techniques
Lab Instructions in Excel
=RAND() returns a random real number greater than or equal to 0 and less than 1. To generate a random real number between \(a\) and \(b\), one can use =RAND()*(b-a)+a.
=RANDBETWEEN(bottom, top) returns a random integer between bottom and top, inclusive.
=RANDARRAY([rows],[columns],[min],[max],[whole_number]) returns an array of random numbers. You can specify the number of rows and columns to fill, the minimum and maximum values, and whether to return whole numbers or decimal values (TRUE for whole numbers and FALSE for decimal values). All arguments of RANDARRAY are optional. RANDARRAY() is equivalent to RAND().
=UNIQUE(RANDARRAY([rows],[columns],[min],[max],[whole_number])) returns the array of random numbers with duplicates removed.
RAND()
Randomly generate a number between 0 and 1.
How to:
Step 1: Choose a cell, say A1.
Step 2: Click the Insert Function button \(f_x\).
Step 3: In the popup window, search “random” and select RAND.
Step 4: Click OK, you will get a randomly generated number.
Alternatively, you may enter the function manually: type =RAND() in the cell and press Enter.
RANDBETWEEN()
Generate 10 random integers of 2 digits.
How to:
Step 1: Generate a random integer, say in the cell A1, using the Excel function =RANDBETWEEN(10,99).
Step 2: Move the mouse cursor to the lower right corner of the cell A1. A solid plus + will appear.
Step 3: Hold down the left mouse button and drag horizontally or vertically to autofill the selected range with 10 random 2-digit integers.
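RANDARRAY()
Generate 10 random integers of 2 digits without repetition.
How to:
In a cell with at least 9 empty cells below it, say A1, enter the Excel function =UNIQUE(RANDARRAY(10, 1, 10, 99, TRUE)). You will get a column array of integers without duplication. Because UNIQUE removes any repeated values, the column can occasionally contain fewer than 10 integers; if so, recalculate the sheet (F9 on Windows) to draw again.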
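For readers who want to cross-check the spreadsheet results outside Excel, the same three ideas can be sketched with Python's standard `random` module. This is only a rough analogue: the variable names, the seed, and the interval endpoints 5 and 8 are arbitrary choices for illustration.

```python
import random

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Like =RAND(): a real number u with 0 <= u < 1.
u = rng.random()

# Like =RAND()*(b-a)+a: rescale u to a real number between a and b.
a, b = 5, 8
x = u * (b - a) + a

# Like =RANDBETWEEN(10, 99) filled into 10 cells: repeats are possible.
with_repeats = [rng.randint(10, 99) for _ in range(10)]

# 10 distinct two-digit integers: sample without replacement.
without_repeats = rng.sample(range(10, 100), 10)

print("u =", u)
print("x =", x)
print("with repeats:   ", with_repeats)
print("without repeats:", without_repeats)
```

Unlike wrapping RANDARRAY in UNIQUE, sampling without replacement always returns exactly the requested number of distinct values.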
Generate a real number between 1 and 2.
Generate 10 integers of 2 digits that are less than 50.
Generate 10 integers of 2 digits that are less than 50 and without duplication.