Math Study Guide for the SAT Exam

Using Data Studies

Simply finding and recording data should not be the end of a statistical study. There is a great deal to be learned by studying the gathered data.

Inferences and Conclusions from Data

Statistics is not just about calculating. A major part of it is the interpretation of raw data to make meaningful statements about a larger group. On the SAT, you need the ability to draw inferences and conclusions from a given data set, chart, or study summary. It involves recognizing what the data suggest, understanding what can or cannot be concluded, and identifying whether a claim is supported by evidence.

Inferences help us extend the results from a sample to a population, while conclusions allow us to decide whether the data truly support a statement or hypothesis. However, not all conclusions are valid. Validity depends on how the data were collected and whether the study was properly designed. For example, it is generally acceptable to use results from a random sample to represent a larger population, but if the data were biased or the sample was not actually random, the results can lead to false generalizations. Ultimately, your goal is to read the data critically and use sound reasoning and evidence to make judgments.

Inferences

As we stated, inferences about a population can be made from the results of a sample survey as long as the sample is random. Take this statement for example:

“In a survey of a random sample of students at Ocean Springs Senior High School, $68\%$ said that they spend at least seven hours a day on social networks.”

Based on this survey, you can correctly infer that roughly $68\%$ of all students at Ocean Springs Senior High School spend at least seven hours a day on social networks. You cannot, however, infer that $68\%$ of male students at the high school spend at least seven hours a day on social networks because we do not know the gender breakdown of the survey.

Let’s try an inference problem.

A survey at Ocean Springs Senior High School consisted of a random sampling of $100$ out of $1\text{,}250$ students, and $12$ of those surveyed said that they spent less than three hours a day on social networks. Based on those results, approximately how many students at the school spend less than three hours a day on social networks?

Solution

The information given makes this a proportion question: the fraction of surveyed students who answered a certain way should roughly match the fraction of all students who would answer that way. We have three values and need to find the unknown fourth. So, we set up a proportion like this:

\[\frac {12}{100} = \frac {N}{1\text{,}250}\]

We cross-multiply to get $100N = 12 \times 1\text{,}250 = 15\text{,}000$, then divide both sides by $100$ to get our answer:

\[N = 150\]

Therefore, based on the random sample in the survey, we can infer that around $150$ students at Ocean Springs Senior High School spend less than three hours a day on social networks.
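
The proportion above can also be checked with a short Python sketch. The function name `estimate_population_count` and its parameter names are made up for illustration; the numbers are the ones from the survey problem.

```python
def estimate_population_count(sample_count, sample_size, population_size):
    """Scale a count observed in a random sample up to the whole population.

    Solves sample_count / sample_size = N / population_size for N.
    """
    return sample_count * population_size / sample_size


# 12 of 100 surveyed students, out of 1,250 students in the school
estimate = estimate_population_count(12, 100, 1250)
print(round(estimate))
```

The result is only an estimate, and it is only reasonable because the sample was random.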

Justifying Conclusions with Data

Questions on the SAT may ask you to judge whether a conclusion is valid based on the data collected. A common trap is assuming that numerical differences automatically prove cause and effect. Before jumping to conclusions, you should always analyze how the data were gathered and whether the methodology supports the claim.

Let’s consider the following situation.

A study proposes adopting Module A in a senior high school to improve students’ competency in mathematics. Two other modules were included in the study: the school’s current module (Module B) and another new module (Module C).

Three groups, each made up of two classrooms of juniors, were assigned one module each for two months. Afterward, students from each group were selected to take a test measuring their improvement across several topics. Below is a summary of the average test scores (out of $100$) earned by students in each module group:

Math Topic	Module A	Module B	Module C
Algebra	84	78	81
Geometry	87	80	83
Problem Solving	90	82	85
Data Analysis	86	79	82
Overall Average	86.8	79.8	82.8
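
As a quick check of the table's last row, the sketch below recomputes each module's overall average from its four topic scores. The dictionary layout and variable names are assumptions made for illustration.

```python
from statistics import mean

# Topic scores in table order: Algebra, Geometry, Problem Solving, Data Analysis
scores = {
    "Module A": [84, 87, 90, 86],
    "Module B": [78, 80, 82, 79],
    "Module C": [81, 83, 85, 82],
}

# Unrounded overall average for each module
averages = {module: mean(values) for module, values in scores.items()}

# Module with the highest overall average
highest = max(averages, key=averages.get)

for module, average in averages.items():
    print(module, average)
print("Highest average:", highest)
```

The unrounded averages are 86.75, 79.75, and 82.75, which the table reports to one decimal place. A higher average describes the test results only; it does not by itself show that a module caused the improvement.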

Based on the data in the table, can the school conclude that Module A is the most effective module for improving students’ mathematics performance?

Solution

At first glance, the numbers suggest that Module A performed the best. It earned the highest score in every topic and has the highest overall average ($86.8$ compared to $79.8$ for Module B and $82.8$ for Module C). However, before accepting this conclusion, we must examine the study design.

Here are the key questions you should ask:

How were students chosen to take the test? If students were not randomly selected, the results may be biased. For example, stronger students might have been chosen for Module A, skewing the data.
Were the groups randomly assigned to modules? Random assignment helps make the groups comparable. Without it, pre-existing differences between groups could explain the results rather than the modules themselves.
Were there any uncontrolled variables? Unaccounted factors like teacher experience or instructional time could have influenced the scores, affecting the validity of the conclusion.
Were all groups tested under identical conditions? Differences in testing conditions, or in the number of students tested from each group, can distort results, making it difficult to compare the modules fairly.

Although Module A shows higher average scores in every topic, the data alone are not enough to justify adopting Module A unless the study was conducted with proper controls, including random assignment and unbiased student selection. The results suggest Module A might be promising, but the validity of the conclusion depends on whether the study design eliminates other possible explanations.

Evaluating Data Collection Methods

It is important to know how researchers obtained the data in their studies, because this largely determines whether it is appropriate to draw and apply conclusions to the entire population.

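Data can be obtained using one of four collection methods:

census: a study of every member of a population
survey: a study of a sample, ideally random, drawn from a larger population
experimental study: a controlled study in which researchers assign treatments, which makes it possible to determine cause and effect
observational study: a study in which researchers observe subjects without assigning treatments; it can reveal associations but cannot by itself establish cause and effect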

Population Parameter

A population is a group of entities or events with a common characteristic. It often refers to a group of people, although it may refer to other entities, as well. Examples of a population are:

all the students in Ocean Springs High School
all musicians in Oregon
all subscribers of a daily paper in Maine

A parameter or population parameter is a characteristic of a population expressed using a numerical value. Examples of a population parameter:

the average height of students in Ocean Springs High School
the percentage of musicians in Oregon who are self-employed
the average income of subscribers of a daily paper in Maine

Measurement Error and Margin of Error

When estimating a population parameter based on a sample statistic, the resulting estimate is not expected to equal the exact (or true) value. What a properly randomized sample provides instead is an estimate that is likely to be close to the true value. A margin of error is often used to describe the precision of such an estimate.

On the SAT, you will not be asked to calculate margins of error. Instead, they usually appear as part of the given information, and you are expected to understand what they imply for the question.

For instance, if you are given a sample mean height of $121 \text{ cm}$, and the SAT question states that the margin of error is $1.3 \text{ cm}$, it means that the population’s true mean height is likely to fall within $121 \pm 1.3 \text{ cm}$, that is, between $119.7 \text{ cm}$ and $122.3 \text{ cm}$.
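
The "plus or minus" arithmetic in this example can be written as a tiny Python sketch. The function name `interval_from_margin` is a made-up name for illustration.

```python
def interval_from_margin(estimate, margin_of_error):
    """Return the (low, high) range implied by estimate ± margin_of_error."""
    return (estimate - margin_of_error, estimate + margin_of_error)


# Sample mean of 121 cm with a margin of error of 1.3 cm
low, high = interval_from_margin(121, 1.3)
print(f"{low:.1f} cm to {high:.1f} cm")
```

Any value inside this range is a plausible value for the true population mean; values outside it are less plausible.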

Here are things to remember about the margin of error:

1) A large margin of error can be decreased by increasing the sample size.
2) The larger the standard deviation, the larger the margin of error.
3) The margin of error applies to the true value of the parameter (e.g., the population mean) for the entire population.
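
Points 1 and 2 above can be illustrated with a common textbook approximation for the margin of error of a sample mean, $z \cdot s / \sqrt{n}$, where $s$ is the standard deviation, $n$ is the sample size, and $z \approx 1.96$ for a $95\%$ confidence level. This formula is not part of the guide and is not needed on the SAT; the sketch below, with the made-up function name `margin_of_error`, is only meant to show the direction of each effect.

```python
import math


def margin_of_error(std_dev, sample_size, z=1.96):
    """Approximate margin of error for a sample mean (assumed formula)."""
    return z * std_dev / math.sqrt(sample_size)


small_sample = margin_of_error(std_dev=10, sample_size=100)
large_sample = margin_of_error(std_dev=10, sample_size=400)
more_spread = margin_of_error(std_dev=20, sample_size=100)

print(small_sample, large_sample, more_spread)
```

Under this approximation, quadrupling the sample size cuts the margin of error in half, and doubling the standard deviation doubles it.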

Additional Statistical Concepts

Beyond the basics of collecting and summarizing data, it’s important to understand how estimates are made, how variables relate to each other, and how data variation affects results. The following sections expand on these ideas, introducing you to vital statistical tools for the SAT.

Confidence Interval

A confidence interval describes both the degree of accuracy and the uncertainty of an estimated value. It applies to a population parameter, such as the mean height of the entire population. On the SAT, confidence intervals will usually be described as having a $95\%$ confidence level. What does that mean, exactly? Suppose you are given this statement:

“The mean height of a sample is $121 \text{ cm}$ and the margin of error is $1.3 \text{ cm}$ at $95\%$ confidence level.”

This statement can be interpreted as such:

“We are $95\%$ confident that the true average height is between $119.7 \text{ cm}$ and $122.3 \text{ cm}$.”

Why $119.7 \text{ cm}$ and $122.3 \text{ cm}$? Because $121 - 1.3 = 119.7$ and $121 + 1.3 = 122.3$.

Note: The statement is about the mean of the population. Therefore, it should not be interpreted to mean that $95\%$ of the individuals in the population have a height between $119.7 \text{ cm}$ and $122.3 \text{ cm}$.
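
One way to see what "$95\%$ confident" means is to simulate many random samples and count how often the interval built from each sample contains the true mean. The sketch below is an illustration only: the population values (true mean $121$, standard deviation $5$), the sample size, and the $1.96$ multiplier are all assumptions, and the function name `coverage_rate` is made up.

```python
import random
import statistics


def coverage_rate(true_mean, sigma, n, trials, seed=0):
    """Share of simulated intervals (sample mean ± margin) that contain true_mean."""
    rng = random.Random(seed)
    # Assumed margin of error for a mean when sigma is known
    margin = 1.96 * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = statistics.fmean(sample)
        if sample_mean - margin <= true_mean <= sample_mean + margin:
            hits += 1
    return hits / trials


rate = coverage_rate(true_mean=121, sigma=5, n=50, trials=2000)
print(f"Share of intervals containing the true mean: {rate:.3f}")
```

The share should come out close to $0.95$. The confidence level describes how often the method captures the true mean, not how many individuals fall inside the interval.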

Univariate vs. Bivariate Data

Univariate data refers to data sets with one type of variable, such as the number of hot beverages sold by a café. The variable is the number of each type of hot beverage sold. It can be shown in this data set:

Beverage/Flavor	Number of Cups Sold
Hot chocolate	$25$
Chai soy latte	$22$
Spiced apple cider	$18$
Caramel macchiato	$24$
Mulled wine	$17$
Total	$106$

Bivariate data refers to data sets with two types of variables. If the café owner wanted to find a relationship between their sales on a particular day and the temperature on that day, they would collect data with two variables: their sales of the five hot beverages versus the temperature for each day of the week. Together, those two variables are defined as bivariate data, and they can be presented as such:

Hot Beverage Sales in Dollars by Daily Temperature

\[\begin{array}{|c|c|c|c|c|c|c|} \hline \text{Temp } (^\circ\text{F}) & \text{Hot Choc} & \text{Chai} & \text{Cider} & \text{Caramel} & \text{Wine} & \text{Total} \\ \hline 57.2 & 80 & 73 & 65 & 79 & 52 & 349 \\ \hline 51.8 & 95 & 78 & 67 & 84 & 52 & 376 \\ \hline 60.8 & 77 & 69 & 60 & 80 & 48 & 334 \\ \hline 68 & 62 & 47 & 39 & 57 & 48 & 253 \\ \hline 66.2 & 63 & 69 & 40 & 65 & 47 & 284 \\ \hline 62.6 & 70 & 72 & 62 & 65 & 48 & 317 \\ \hline 71.6 & 55 & 42 & 37 & 50 & 45 & 229 \\ \hline \text{Total} & 502 & 450 & 370 & 480 & 340 & 2142 \\ \hline \end{array}\]
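
To show how the two variables in this table can be examined together, the sketch below computes the Pearson correlation coefficient between daily temperature and total sales. The correlation coefficient goes beyond what this guide covers and is used here only as one possible way to summarize the relationship; the function name `pearson_r` is made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)


# Temperature and Total columns from the table, one entry per day
temps_f = [57.2, 51.8, 60.8, 68, 66.2, 62.6, 71.6]
total_sales = [349, 376, 334, 253, 284, 317, 229]

r = pearson_r(temps_f, total_sales)
print(f"r = {r:.2f}")
```

A value close to $-1$ indicates a strong negative association: in this data set, warmer days tend to have lower hot beverage sales. An association like this does not by itself prove that temperature causes the change in sales.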

Variability

Statistics computed from a sample are estimates of the corresponding population parameters. Such an estimate is not the exact true value, only a close approximation. The variability of the data around an estimate must be accounted for, and this is done by calculating measures of spread, which you learned about earlier.

Recall that the spread (or scatter) of data in a set is measured in various ways, with the most common measures being range, interquartile range (IQR), variance, and standard deviation. These are ways of describing spread in relation to the estimated value.
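
The four measures of spread named above can be computed for the cups-sold data from the univariate example with the sketch below. It makes two assumptions: quartiles are taken as the medians of the lower and upper halves of the sorted data, and the variance and standard deviation use the population formulas (dividing by the number of values). The function name `spread_summary` is made up for illustration.

```python
import statistics


def spread_summary(values):
    """Range, IQR, variance, and standard deviation (needs at least two values)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # For an odd number of values, the middle value belongs to neither half
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    return {
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
        "variance": statistics.pvariance(ordered),
        "std_dev": statistics.pstdev(ordered),
    }


# Cups sold: hot chocolate, chai, cider, caramel macchiato, mulled wine
cups_sold = [25, 22, 18, 24, 17]
summary = spread_summary(cups_sold)
print(summary)
```

For this data, the range is $25 - 17 = 8$ and the interquartile range is $24.5 - 17.5 = 7$.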

Randomization

Random sampling is necessary so that the results of a study can be generalized to the entire population. A random sample represents its population well if it was selected by a purely chance method, also called randomization, and no element of the population was excluded from the procedure. By this, we mean that every element of the population has an equal chance of being included in the sample, and the whole process is protected from bias.

There are various methods for ensuring randomization, including using a random number table or random number generator, flipping a coin, or throwing a die.
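
A random number generator can be used to draw a simple random sample, as in the sketch below. It assumes students are identified by the numbers $1$ through $1\text{,}250$, matching the school size in the earlier survey problem; the function name `draw_random_sample` and the ID scheme are made up for illustration.

```python
import random


def draw_random_sample(population, sample_size, seed=None):
    """Pick sample_size distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, sample_size)


# Hypothetical student ID numbers 1 through 1,250
student_ids = list(range(1, 1251))
sample = draw_random_sample(student_ids, 100, seed=42)
print(len(sample), "students selected")
```

Because every ID has the same chance of being picked, no group of students is favored by the selection process.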

Random assignment of the subjects to different treatments is also necessary so that the treatment groups are comparable, on average, before any treatment is applied. This makes it appropriate to draw conclusions about the cause and effect of each treatment.
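
Random assignment can be sketched in the same way: shuffle the subjects, then deal them out to the treatments in turn. The example reuses the setup of the module study (six classrooms, three modules), but the classroom labels and the function name `assign_randomly` are made up for illustration.

```python
import random


def assign_randomly(subjects, treatments, seed=None):
    """Shuffle subjects, then deal them out to treatments in turn."""
    rng = random.Random(seed)  # a fixed seed makes the assignment repeatable
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    assignment = {treatment: [] for treatment in treatments}
    for index, subject in enumerate(shuffled):
        treatment = treatments[index % len(treatments)]
        assignment[treatment].append(subject)
    return assignment


# Hypothetical labels for six classrooms of juniors
classrooms = [f"Classroom {number}" for number in range(1, 7)]
groups = assign_randomly(classrooms, ["Module A", "Module B", "Module C"], seed=1)
for module, assigned in groups.items():
    print(module, assigned)
```

Because chance alone decides which classrooms receive which module, pre-existing differences between classrooms are unlikely to line up systematically with any one module.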
