Topic 1: Statistical Studies

References

Introduction to Statistics and Data Analysis 6th Edition by Roxy Peck, Chris Olsen, Tom Short
MA336 Statistics on https://stats.libretexts.org

Learning Goals

Distinguish between a population and a sample.
Determine whether a study is an observational study or an experiment.
Determine the goal of a statistical study and what types of conclusions are appropriate.
Recognize typical forms of sampling biases such as convenience sample and voluntary response.
Explain why randomization should be used.
Describe how to implement a randomized design: simple random sample, stratified random sample, cluster random sample, systematic random sample.
Determine whether the conclusion of an experimental design is appropriate.
Identify lurking variables and confounding variables.

1 Why Study Statistics?

Critically evaluate the work of others and reports that appear in journals or the media.
Gain the tools you need to make informed judgments.
Understand and use data to make decisions.
Succeed in your professional life.

2 Process of Statistical Studies

Understanding the nature of the problem.
Deciding what to measure and how to measure it.
Collecting data.
Data summarization and preliminary analysis.
Formal data analysis.
Interpretation of results.

A picture showing how statistics works:

Image source: Concepts in Statistics (lumen learning)

3 Population vs Sample

Data: A collection of observations such as counts, measurements, responses, or experimental results.
Population: The entire collection of individuals or objects that are of interest.
Sample: A subset of a population that is selected for study.
Parameter: A number that is a property of the population.
Statistic: A number, such as a percentage, that represents a property of a sample.

4 Example: Identify Statistical Concepts

Determine whether each group is a population or a sample.
1. The grades of all students in a math class.
2. The 10 students in a math class who earned an “A”.

Answer:

1. Population.
2. Sample.

Identify the statistical concepts in the following study: To learn the percentage of students who go to school by public transportation, 500 students at a college were surveyed. 50% said they go to school by public transportation.

Answer:
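- Population: all students at the college
- Sample: the 500 students who were surveyed
- Parameter: the unknown percentage of all students at the college who go to school by public transportation
- Statistic: 50%, the percentage among the 500 students surveyed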
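
The difference between a parameter and a statistic can be illustrated with a small simulation. The following is a minimal Python sketch, not part of the original study: the population size (12,000) and the true share of public-transportation users (48%) are made-up assumptions, chosen only so that the parameter is known and can be compared with the statistic from one sample of 500.

```python
import random

def sample_percentage(population, sample_size, rng):
    """Return the percentage of 1s (yes answers) in a simple random sample."""
    sample = rng.sample(population, sample_size)
    return 100 * sum(sample) / sample_size

# Hypothetical population: 12,000 students, 1 = goes to school by public
# transportation. The 48% share is an assumption made up for this sketch;
# in a real study the parameter is unknown.
population = [1] * 5760 + [0] * 6240
parameter = 100 * sum(population) / len(population)

rng = random.Random(2026)  # fixed seed so the run is repeatable
statistic = sample_percentage(population, 500, rng)

print(f"Parameter (whole population): {parameter:.1f}%")
print(f"Statistic (sample of 500):    {statistic:.1f}%")
```

Running the sketch with different seeds gives different statistics, while the parameter stays fixed. That is the sense in which a statistic estimates a parameter.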

5 Type of Variables

Variable: a characteristic or attribute of interest that we record about individuals or objects. There are two types of variables according to their values.
- Categorical variables (or qualitative variables) represent attributes, labels, or nonnumerical entries, such as names and colors.
- Numerical variables (or quantitative variables) represent numerical measurements or counts, such as weights and the number of students in each class.
Univariate data: a collection of observations on a single variable.
Bivariate data: a collection of observations on two variables, recorded as pairs of values.
Multivariate data: a collection of observations on two or more variables, recorded as arrays of values.
A numerical variable is called a discrete variable if its possible values can be counted (for example, the number of students in a class). It is called a continuous variable if it can take any value in some interval (for example, a weight).

6 Example: College Choice Do-Over

The Higher Education Research Institute at UCLA surveys over 20,000 college seniors each year. One question on the survey asks seniors the following question: If you could make your college choice over, would you still choose to enroll at your current college? Possible responses are definitely yes (DY), probably yes (PY), probably no (PN), and definitely no (DN).

Question:

Identify a variable in the study. Is it categorical or numerical?
Identify a data set. Is it univariate, bivariate, or multivariate?

Answer:

A variable in the study is the student's do-over choice. It is a categorical variable.
A data set is the collection of do-over choices of the students surveyed. The data set is univariate.

Practice: Basic Statistical Concepts

Identify the population, the sample, the variable of study, the type of the variable, the population parameter, and the sample statistic.

An administrator wishes to estimate the passing rate of a certain course. She takes a random sample of 50 students and obtains their letter grades in that course. Among the 50 students, 32 earned a grade of C or better.

7 Types of Statistical Studies

A statistical study can usually be categorized as an observational study or an experiment, depending on how the study is carried out.

An observational study observes individuals and measures variables of interest. The main purpose of an observational study is to describe a group of individuals or to investigate an association between two variables.
An experiment intentionally manipulates one variable in an attempt to cause an effect on another variable. The primary goal of an experiment is to provide evidence for a cause-and-effect relationship between two variables.

In an observational study, it is not possible to draw clear cause-and-effect conclusions because we cannot rule out the possibility that the observed effect is due to some other variables not being studied, known as extraneous variables.

8 Example: Types of Statistical Studies

Which type of study will answer each question?

What proportion of all college students in the United States have taken classes at a community college?
Does use of computer-aided instruction in college math classes improve test scores?

Answer:

Observational study
Experiment

See Types of Statistical Studies (2 of 4) in the textbook Concepts in Statistics for more examples.

Practice: Observational vs Experimental I

Identify the type of statistical study:

A study took a random sample of adults and asked them about their bedtime habits. The data showed that people who drank a cup of tea before bedtime were more likely to go to sleep earlier than those who didn’t drink tea.
Another study took a group of adults and randomly divided them into two groups. One group was told to drink tea every night for a week, while the other group was told not to drink tea that week. Researchers then compared when each group fell asleep.

Source: Khan Academy

Practice: Observational vs Experimental II

9 Questions about Population (1 of 2)

Type of Research Question	Examples
Make an estimate about the population (often an estimate about an average value or a proportion with a given characteristic)	What is the average number of hours that community college students work each week? What proportion of all U.S. college students are enrolled at a community college?
Test a claim about the population (often a claim about an average value or a proportion with a given characteristic)	Is the average course load for a community college student greater than 12 units? Do the majority of community college students qualify for federal student loans?

10 Questions about Population (2 of 2)

Type of Research Question	Examples
Compare two populations (often a comparison of population averages or proportions with a given characteristic)	In community colleges, do female students have a higher GPA than male students? Are college athletes more likely than non-athletes to receive academic advising?
Investigate a relationship between two variables in the population	Is there a relationship between the number of hours high school students spend each week on Facebook and their GPA? Is academic counseling associated with quicker completion of a college degree?

11 Question on Cause-and-Effect

Many research studies, especially in medicine and psychology, focus on cause-and-effect relationships.

Example: Does drinking red wine lower the risk of a heart attack?
In a cause-and-effect study, the explanatory variable is the cause and the response variable is the effect. To properly establish a cause-and-effect relationship, researchers try to remove the influence of extraneous variables, which are variables other than the explanatory variable that may affect the response.

12 Confounding and Lurking Variables

A confounding variable is an extraneous variable that affects the response variable and makes it hard to see the real effect of the explanatory variable.

Example: In a study of the relation between a type of fertilizer and tomato size, the amount of sunshine is a confounding variable because it also contributes to the growth of the tomatoes.
A lurking variable is an unmeasured third variable that influences both the explanatory and response variables.

Example: More firefighters are linked with more fire damage, but the real lurking variable is fire size, which increases both.

In general, we should not make cause-and-effect statements from observational studies unless the impact of confounding variables can be significantly decreased.

13 Example: Type of Relationship

Determine whether each question is a cause-and-effect question. What are the explanatory and response variables?

Does use of computer-aided instruction in college math classes improve test scores?
Does tutoring correlate with improved performance on exams?

Answer:

This question investigates a cause-and-effect relationship. The explanatory variable is the use of computer-aided instruction, and the response variable is the test score.
This question investigates a correlation between variables in a population and is not a cause-and-effect question. The explanatory variable is tutoring, and the response variable is exam performance.

Practice: Confounding Variable Definition

Practice: Cause-and-Effect or Correlation

Do individuals with higher educational attainment tend to earn higher salaries?

Determine whether the question is a cause-and-effect question.
What are the explanatory and response variables?
If a student wants to study this question, what type of statistical study can be used? What kind of conclusion can be drawn?

Practice: Correlation or Causation

14 Principles of Experimental Design

Direct control: Hold extraneous variables constant across groups (for example, the same materials and setting), and compare a treatment group (receives the treatment) with a control group (no treatment).
- Purpose: Reduce the influence of extraneous variables.
Randomization (random assignment): Participants are placed into groups by chance.
- Purpose: Prevent systematic differences between groups. Randomization makes groups similar on average in both observed and unobserved factors, so any remaining differences are due to chance alone.
Replication: Use enough subjects in each group, and ensure the study can be repeated.
- Purpose: A single study, especially with a small sample, may produce results by chance. Replication helps confirm that the treatment effect is real.
Blocking: Group similar participants into blocks based on a factor known to affect the outcome (e.g., age, prior skill). Then randomize within each block.
- Purpose: Reduce variation from known influential factors.
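
Random assignment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the participant IDs are hypothetical, and the helper name `randomly_assign` is introduced here for the example.

```python
import random

def randomly_assign(participants, rng):
    """Shuffle participants and split them into two groups of (nearly) equal size."""
    shuffled = list(participants)   # copy so the original list is not changed
    rng.shuffle(shuffled)           # chance alone decides the order
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical participant IDs P01, ..., P20
participants = [f"P{i:02d}" for i in range(1, 21)]
groups = randomly_assign(participants, random.Random(1))

print("Treatment:", groups["treatment"])
print("Control:  ", groups["control"])
```

Because the split depends only on the shuffle, neither the researcher nor the participants can influence who ends up in which group.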

15 Strategies for Direct Control

Placebo: A fake treatment with no active ingredient.

Placebo effect: Participants report improvement simply because they believe they received real treatment.

Purpose: Reduce or measure the placebo effect so it does not distort the results.
Blinding: Keep people unaware of which treatment they receive.
- Single-blind: Participants do not know their group.
- Double-blind: Neither participants nor the researchers who interact with them or measure outcomes know the group assignments.
Purpose: Prevent bias in behavior, reporting, and measurement.

16 Mini‑example of Experimental Design

Goal: Test a new reading app for college students.

Block by class period (morning vs. afternoon).
Within each block, randomly assign students to app or no app.
Use many students per group.
Instructors and students do not know which version is “new” vs. “standard” (blinding, if possible).
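
The blocking and randomization steps of this mini-example might look as follows in Python. This is a sketch under assumptions: the 24 student names and their class periods are invented, and `blocked_assignment` is a hypothetical helper, not a library function.

```python
import random
from collections import defaultdict

def blocked_assignment(students, rng):
    """students: list of (name, block) pairs. Randomize separately within each block."""
    by_block = defaultdict(list)
    for name, block in students:
        by_block[block].append(name)

    assignment = {}
    for block, names in by_block.items():
        rng.shuffle(names)              # random order within this block only
        half = len(names) // 2
        for name in names[:half]:
            assignment[name] = (block, "app")
        for name in names[half:]:
            assignment[name] = (block, "no app")
    return assignment

# Hypothetical roster: 12 morning students and 12 afternoon students
students = [(f"S{i:02d}", "morning" if i <= 12 else "afternoon") for i in range(1, 25)]
assignment = blocked_assignment(students, random.Random(7))

for name, (block, group) in sorted(assignment.items()):
    print(name, block, group)
```

Each class period ends up with the same number of students in both groups, so a morning-versus-afternoon difference cannot be mistaken for an effect of the app.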

17 Example: Does the Study Support the Conclusion?

Example: A researcher studies the medical records of 500 randomly selected patients. Based on the information in the records, he divides the patients into two groups: those given the recommendation to take an aspirin every day and those with no such recommendation. He reports the percentage of each group that developed heart disease.
Determine whether the study supports the conclusion that taking aspirin lowers the risk of heart attacks.

Answer: The conclusion claims a cause-and-effect relationship, which requires an experiment. However, this is an observational study: the researcher did not assign patients to the groups, so extraneous variables are not controlled and the study does not support the conclusion.

18 Example: Is the Conclusion Appropriate?

A teacher read about a study testing whether the background color of a digital reading passage affects reading speed. In the study, 80 students read the same passage and were randomly assigned to: Group A (40 students) with white background; Group B (40 students) with light yellow background.

Students in Group B finished reading significantly faster than those in Group A.
The teacher concluded that using a light yellow background will help all students read faster in all reading situations.

Is this study appropriate? Why or why not?
Is the conclusion appropriate? Why or why not?

Answer:

Yes. It is an experiment with direct control (same passage) and random assignment, which helps support a cause‑and‑effect relationship.
No. The conclusion is overgeneralized. The study tested only one group of students, one passage, and one setting.

Practice: Principles of Experimental Design

Practice: Experimental Design

There is an ongoing debate about how many spaces should be placed after a period in typed documents. Alana read about a study where 100 participants all read the same document typed in Courier New font. Half of the participants were randomly assigned the document with one space after each period, and the other half were given the document with two spaces after each period.

Participants who read the document with two spaces after each period were able to finish reading significantly faster than those with one space after each period. Alana concluded that using two spaces after each period will help people read all documents faster.

Is this study appropriate? Why or why not?
Is the conclusion appropriate? Why or why not?

Source: Khan Academy

19 Sampling Plans

To make accurate inferences, the sample must be representative of the population.

A sampling plan describes exactly how we will choose the sample.
A sampling plan is biased if it systematically favors certain outcomes.
In random sampling, every individual or object has an equal chance of being selected.

20 Methods of Random Sampling (1 of 2)

Simple random sample: A sample of size \(n\) is selected so that every group of the same size has the same chance of being chosen. Tables of random numbers, calculators, and software are often used to generate random numbers.
Stratified random sample: The population is first split into groups (strata). Then subjects are selected randomly from each group.
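
These two methods can be sketched in Python with the standard `random` module. The population of 100 student IDs, the four class-year strata, and the function names are assumptions made up for this example.

```python
import random

def simple_random_sample(population, n, rng):
    """Every group of n individuals is equally likely to be chosen."""
    return rng.sample(population, n)

def stratified_random_sample(strata, n_per_stratum, rng):
    """strata: dict mapping stratum name -> list of members.
    Draw a simple random sample of the same size from every stratum."""
    return {name: rng.sample(members, n_per_stratum) for name, members in strata.items()}

rng = random.Random(20)

# Hypothetical population: student IDs 1..100, split by class year
population = list(range(1, 101))
strata = {
    "freshman": list(range(1, 41)),
    "sophomore": list(range(41, 71)),
    "junior": list(range(71, 91)),
    "senior": list(range(91, 101)),
}

srs = simple_random_sample(population, 20, rng)
stratified = stratified_random_sample(strata, 5, rng)

print("Simple random sample:", sorted(srs))
for name, members in stratified.items():
    print(f"{name}: {sorted(members)}")
```

The stratified sample is guaranteed to include every class year. A simple random sample of the same total size could, by chance, contain no seniors at all.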

21 Methods of Random Sampling (2 of 2)

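Cluster sample: The population is first split into groups (clusters). Then some groups are selected randomly, and all individuals in the selected groups are included in the sample.
Systematic sample: First, a starting point is chosen randomly. Then every \(n\)-th individual after the starting point is selected.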
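
A minimal Python sketch of these two methods follows. It assumes that every member of a chosen cluster is surveyed, and it uses invented data: six class sections of five students each, and a list of 100 customer numbers. The function names are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """clusters: dict mapping cluster name -> list of members.
    Randomly choose whole clusters and keep every member of each chosen cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return chosen, [member for name in chosen for member in clusters[name]]

def systematic_sample(population, step, rng):
    """Choose a random start among the first `step` items, then take every step-th item."""
    start = rng.randrange(step)
    return population[start::step]

rng = random.Random(21)

# Hypothetical clusters: sections A-F with five students each
clusters = {
    f"Section {letter}": [f"{letter}{i}" for i in range(1, 6)]
    for letter in "ABCDEF"
}
chosen, cluster_members = cluster_sample(clusters, 2, rng)

# Hypothetical list of 100 customers in arrival order
customers = list(range(1, 101))
every_tenth = systematic_sample(customers, 10, rng)

print("Chosen clusters:", chosen)
print("Cluster sample:", cluster_members)
print("Systematic sample:", every_tenth)
```

In a stratified sample, some individuals come from every group. In a cluster sample, only some groups appear, but each chosen group appears in full.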

Practice: Sampling Methods

Determine the type of sampling method.

A market researcher polls every tenth person who walks into a store.
100 students whose student ID numbers match 100 numbers generated by a computer randomization program.
The first 30 people who walk into a sporting event are polled on their television preferences.

22 Common Types of Biased Sampling

Voluntary Response Bias / Self-Selection Bias: Participants can choose whether to participate in the study. Example: “non-scientific” polls taken on television or websites.
Measurement or Response Bias: The method of observation tends to produce values that systematically differ from the true value. Example: The question “How many bottles of beer do you drink each day?” will likely suffer from response bias.
Nonresponse Bias: Responses are not obtained from all selected individuals. Example: mall surveys.
Undercoverage Bias: The sample contains too few observations from a segment of the population. Example: surveying some classmates to estimate the average GPA of a college. This sampling method is known as convenience sampling.

23 Example: Appropriate Sampling Design

Suppose that you want to estimate the proportion of students at your college that use the library.

Which sampling plan will produce the most reliable results?

Select 100 students at random from students in the library.
Select 200 students at random from students who use the Tutoring Center.
Select 300 students who have checked out a book from the library.
Select 50 students at random from the college.

Answer: The 4th sampling plan is the most reliable. The first three plans each undercover part of the college's student population.

In general, the larger the sample size, the more accurate the conclusion. However, a large sample cannot make up for a biased sampling plan.
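
A short simulation can show why a larger sample drawn from the wrong group does not help. This Python sketch rests on made-up assumptions: a college of 10,000 students in which exactly 35% use the library. It compares only the first and the fourth plan.

```python
import random

rng = random.Random(23)

# Hypothetical college: 10,000 students; the first 3,500 (35%) use the library.
# Both numbers are invented for this sketch.
students = [{"id": i, "uses_library": i < 3500} for i in range(10_000)]

def proportion_using_library(sample):
    """Share of sampled students who use the library."""
    return sum(s["uses_library"] for s in sample) / len(sample)

# Plan 1: 100 students chosen at random from students in the library
in_library = [s for s in students if s["uses_library"]]
plan_1 = rng.sample(in_library, 100)

# Plan 4: 50 students chosen at random from the whole college
plan_4 = rng.sample(students, 50)

print("True proportion:  0.35")
print("Plan 1 estimate: ", proportion_using_library(plan_1))
print("Plan 4 estimate: ", proportion_using_library(plan_4))
```

Plan 1 always reports a proportion of 1 because everyone it can reach uses the library. Plan 4 is smaller, but its estimate varies around the true value.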

Practice: Sampling Techniques

Click the link to open the practice in a new window.

Practice on Sampling Techniques

Lab Instructions in Excel

24 Introduction to Excel Spreadsheets

Click the link to open in a new window.

25 Random Numbers by Excel

=RAND() returns a random real number greater than or equal to 0 and less than 1. To generate a random real number between \(a\) and \(b\), one can use =RAND()*(b-a)+a.
=RANDBETWEEN(bottom, top) returns a random integer between bottom and top, with both endpoints included.
=RANDARRAY([rows],[columns],[min],[max],[whole_number]) returns an array of random numbers. You can specify the number of rows and columns to fill, minimum and maximum values, and whether to return whole numbers or decimal values (TRUE for whole number and FALSE for decimal values).

All arguments of RANDARRAY are optional. RANDARRAY() is equivalent to RAND().
To get an array of random values without duplication, one can use =UNIQUE(RANDARRAY([rows],[columns],[min],[max],[whole_number])). Because UNIQUE removes repeated values, the result may contain fewer values than requested.
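
For readers who also work outside Excel, the same ideas can be sketched with Python's standard `random` module. This is an analogy rather than an exact equivalent of the Excel functions; the interval endpoints and the array size below are arbitrary choices for illustration.

```python
import random

rng = random.Random()  # no seed: new values on every run, as in Excel

a, b = 1.0, 2.0
u = rng.random()            # like =RAND(): 0 <= u < 1
x = u * (b - a) + a         # like =RAND()*(b-a)+a: a value between a and b
k = rng.randint(10, 99)     # like =RANDBETWEEN(10,99): both endpoints included

# Like =RANDARRAY(4,3,10,99,TRUE): 4 rows and 3 columns of whole numbers
grid = [[rng.randint(10, 99) for _ in range(3)] for _ in range(4)]

print(u, x, k)
for row in grid:
    print(row)
```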

26 Example: Usage of `RAND()`

Randomly generate a number between 0 and 1.

How to:

Step 1: Choose a cell, say A1.
Step 2: Click the Insert Function button \(f_x\).
Step 3: In the popup window, search “random” and select RAND.
Step 4: Click OK. You will get a randomly generated number.

Alternatively, you may enter the function manually: type =RAND() in the cell and press Enter.

27 Example: Usage of `RANDBETWEEN()`

Generate 10 random integers of 2 digits.

How to:

Step 1: Generate a random integer, say in the cell A1, using the Excel function =RANDBETWEEN(10,99).
Step 2: Move the mouse cursor to the lower right corner of the cell A1. A solid plus sign (+) will appear.
Step 3: Hold the left mouse button and drag horizontally or vertically to autofill the selected range with 10 random 2-digit numbers.

28 Example: Usage of `RANDARRAY()`

Generate 10 random integers of 2 digits without repetition.

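How to:

In a cell with at least 9 empty cells below it, say A1, apply the Excel function =UNIQUE(RANDARRAY(10, 1, 10, 99, TRUE)). You will get a column array of integers without duplication. If RANDARRAY happens to generate repeated values, UNIQUE removes them and the column will contain fewer than 10 integers; recalculate, or generate more than 10 values and keep the first 10. One possible alternative, assuming a version of Excel with dynamic arrays, is =INDEX(SORTBY(SEQUENCE(90,1,10),RANDARRAY(90)),SEQUENCE(10)), which shuffles the integers 10 to 99 and keeps the first 10.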
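
Drawing values without repetition is the same as sampling without replacement. As a point of comparison outside Excel, here is a minimal Python sketch using the standard library function `random.sample`, which always returns exactly the requested number of distinct values.

```python
import random

rng = random.Random()

# Choose 10 distinct integers from 10, 11, ..., 99 (all 2-digit integers)
two_digit_numbers = range(10, 100)
picks = rng.sample(two_digit_numbers, 10)

print(picks)
```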

Practice: Random Numbers

Generate a real number between 1 and 2.
Generate 10 integers of 2 digits that are less than 50.
Generate 10 integers of 2 digits that are less than 50 and without duplication.

Topic 1: Statistical Studies

Fei Ye

February 2026
