# How to Prepare for the Statistics and Probability Questions on the SBAC Mathematics Test

## General Information

You will need to understand and use concepts from statistics and probability while taking the SBAC Mathematics test. Here is a brief review of the important basics you’ll need to know. Be sure to seek additional information and practice if these concepts remain difficult or unclear.

# Skills to Understand and Be Able to Use

## Data and Statistical Concepts

The information gathered while investigating a question is data. Concepts of statistics are used to find patterns in data that will answer the questions.

### Single Count Variable

A count variable takes only whole-number values (0, 1, 2, and so on), for example, the number of people who signed up for a particular service. A single count variable measures just one aspect of a problem or situation.

#### Plots on the Real Number Line

Count variable data are integer values, so it is often effective to represent them on a number line. The common types of plots are summarized below.

dot plot— A dot plot shows how many times each value in a data set is observed. The horizontal axis shows the range of values, and the vertical axis shows how many times each value was observed. Each observation is represented by a dot, so the plot looks like a series of vertical stacks of dots.

histogram— A histogram is similar to a dot plot, but the number of observations is shown as a bar above each observed value or, more commonly, above each interval (bin) of values. It is best used when there are so many observations that individual dots would be impractical.

box plot— A box plot shows five characteristics of a data set: the minimum value, maximum value, median value, the first quartile, and third quartile. The observed values are shown on a horizontal axis. Above this a box shows the value of the quartiles and median while the minimum and maximum values are connected by lines on each side of the box.
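As a rough illustration of these plots, the following Python sketch builds a sideways text dot plot and a five-number summary (the numbers a box plot displays) for a small data set. The `pets` data and both function names are made up for this example. Quartiles are computed here as the medians of the lower and upper halves of the data, which is one common convention; other conventions can give slightly different values.

```python
from collections import Counter
from statistics import median

# Hypothetical count data: number of pets owned by 11 students
pets = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5, 8]

def dot_plot_rows(data):
    """Return one text row per value in the observed range.

    Each row shows the value followed by one '*' per observation,
    so the stacks run sideways instead of upward.
    """
    counts = Counter(data)
    return [f"{value:>2} | {'*' * counts[value]}"
            for value in range(min(data), max(data) + 1)]

def five_number_summary(data):
    """Return (minimum, first quartile, median, third quartile, maximum).

    Assumed convention: the quartiles are the medians of the lower and
    upper halves, leaving out the middle value when the count is odd.
    """
    ordered = sorted(data)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2:]
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

for row in dot_plot_rows(pets):
    print(row)
print(five_number_summary(pets))
```

The values 6 and 7 appear as empty rows, which makes the gap before the largest observation easy to see.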

#### Data Distribution

Histograms and dot plots show us how the observations are distributed over the observed range of data. The data might be spread out over the entire range or concentrated at one or more values. The observed data can be compared to known statistical distributions to draw conclusions about the data.

mean— The mean is the average of the data computed by adding up all of the values observed and dividing by the number of observations. It is a measure of the center of data that is best used when the distribution is symmetric.

median— The median is the value that separates the ordered data in half. Half of the observations lie below this value and half of the observations lie above it. It better represents the center of data that are skewed to high or low values.

mode— The mode is the value of the data that is observed the most. It may not represent the center of the data if there are very few data points.

quartiles— Quartiles are the three values that divide the ordered data into four equal groups. The second quartile is the median, and the first and third quartiles can be thought of as the medians of the lower and upper halves of the data.

center— If most of the observations fall near a particular value, that value is the center of the distribution and is the highest point on a histogram. The mean and median are commonly used to determine the center of the distribution. A bimodal distribution will have two high points.

spread— The spread of a distribution describes how close most of the data is to the center. A distribution with a narrow or tight spread has a tall central peak that falls off quickly on each side. A wide spread may show a flat distribution with a small peak near the center and many values far away on each side.

shape— The shape of a distribution may be symmetric, skewed to one side, bimodal, or flat with no obvious peaks. Symmetric plots with a peak in the middle are also called bell-shaped curves. Symmetric plots may also be U-shaped, which is a bimodal shape with peaks near the minimum and maximum values.

outliers— An outlier is a data value that is very much greater or very much smaller than the other observations. Outliers lie far from the box in a box plot, at the ends of the whiskers or as separate points beyond them, and near the extreme edges of a histogram. Part of statistical analysis is to define and determine what “very much” means for a given data set.

standard deviation— The standard deviation is a measure of how far the data typically lie from the mean. A low standard deviation indicates a narrow spread of data around the mean and, for repeated measurements, often indicates high precision. It can be computed for any data set, but it is most informative when the distribution is roughly symmetric, as in a normal distribution, because it is strongly affected by skew and outliers.

normal distribution— A normal distribution has data centered on a mean and is symmetric to the left and right. It is often used to model data that result from random processes. The normal distribution has a precise mathematical description that allows researchers to estimate the probability of observing any given range of values.

normal curve— The normal curve is the shape of the plot of the normal distribution. It is symmetric with the maximum value in the center and is often called the bell-shaped curve.

interquartile range— The interquartile range is the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile). It is another way of measuring the spread of the data.
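The measures of center and spread defined above can be compared on one small data set. This is a minimal sketch using Python's standard `statistics` module. The `minutes` data and the `quartiles` helper are invented for illustration, and the 1.5 × IQR rule used to flag outliers is a widely used convention assumed here, not something the definitions above prescribe.

```python
from statistics import mean, median, mode, pstdev

# Hypothetical data: minutes of homework for 9 students, one unusually large
minutes = [20, 25, 25, 30, 35, 40, 45, 50, 180]

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(data)
    n = len(ordered)
    return median(ordered[: n // 2]), median(ordered[(n + 1) // 2:])

q1, q3 = quartiles(minutes)
iqr = q3 - q1  # interquartile range

# Assumed convention: a value is an outlier if it lies more than
# 1.5 * IQR below Q1 or above Q3.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in minutes if x < low_fence or x > high_fence]

print("mean:", mean(minutes))      # pulled upward by the large value
print("median:", median(minutes))  # stays near the bulk of the data
print("mode:", mode(minutes))
print("Q1, Q3, IQR:", q1, q3, iqr)
print("outliers:", outliers)

# Population standard deviation with and without the extreme value
# (statistics.stdev would give the sample version instead).
print("SD with outlier:", pstdev(minutes))
print("SD without outlier:", pstdev(minutes[:-1]))
```

In this data set the single large value moves the mean well above the median and inflates the standard deviation, while the median and interquartile range barely react. That is why the median and IQR are usually preferred for skewed data.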

### Two Categorical and Quantitative Variables

Categorical variables describe traits or qualities that can be named or assigned to a category. It does not make sense to do numerical calculations with categorical data. Quantitative variables represent traits or qualities that can be measured numerically and used in calculations.

#### Two-Way Frequency Tables

A two-way frequency table provides a way to analyze the relationship between two categorical variables, for example, how responses to a survey may differ by gender and age. The table may show the actual counts or relative frequencies, which express each result as a percentage or fraction of the total. These tables may reveal associations between the variables.

marginal frequency— The marginal frequencies are the totals of each row and column in the table.

joint frequency— The joint frequencies are the entries in the body of the table, excluding the row and column totals. Dividing a joint frequency by its row or column total gives a conditional relative frequency.

When expressed as relative frequencies, a two-way frequency table shows associations between the variables, that is, a tendency for certain values of one variable to occur together with certain values of the other. In the survey example, one gender may show a preference for a particular answer.
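To make these terms concrete, here is a small sketch that computes marginal, joint relative, and conditional relative frequencies from a two-way table stored as nested dictionaries. The survey counts, category labels, and variable names are all hypothetical.

```python
# Hypothetical survey counts: rows are age groups, columns are answers
counts = {
    "under 18": {"yes": 30, "no": 20},
    "18 and over": {"yes": 15, "no": 35},
}

# Marginal frequencies: the total of each row and each column
row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
col_totals = {}
for cells in counts.values():
    for answer, n in cells.items():
        col_totals[answer] = col_totals.get(answer, 0) + n
grand_total = sum(row_totals.values())

# Joint relative frequencies: each cell as a fraction of the grand total
joint_relative = {
    row: {answer: n / grand_total for answer, n in cells.items()}
    for row, cells in counts.items()
}

# Conditional relative frequencies: each cell as a fraction of its row total
conditional_by_row = {
    row: {answer: n / row_totals[row] for answer, n in cells.items()}
    for row, cells in counts.items()
}

print(row_totals, col_totals, grand_total)
print(joint_relative)
print(conditional_by_row)
```

In this made-up table, 60% of the younger group answered "yes" compared with 30% of the older group. A gap like that between conditional relative frequencies is what suggests an association between the two variables.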

#### Scatter Plots

A scatter plot compares two quantitative variables. Each variable is used as an axis on an x-y graph and the data points for each observed pair are plotted on this graph.

using functions— Once a scatter plot is made for a data set, various functions are tested to see which one fits, or models, the data better. The context of the problem may suggest the use of certain functions or rule out others. It is common to try linear, exponential, and quadratic models to best understand the nature of the relationship between the variables.

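residuals— A residual is the difference between the actual value of a data point and the value the model predicts for it (observed minus predicted). Plotting and analyzing the residuals is an informal way to test how well a model fits the data: a good fit leaves residuals scattered randomly around zero with no clear pattern.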

linear association— A linear association, or linear relationship, exists when the data in a scatter plot can be best modelled by a straight-line function such as $$y = mx + b$$.
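The sketch below fits a straight line $$y = mx + b$$ to a small scatter of points with the usual least-squares formulas and then lists the residuals. The study-hours data and the function names are invented for illustration, and residuals are taken as observed minus predicted, the usual convention. On the test itself, a calculator or other technology would normally do this fitting.

```python
def fit_line(xs, ys):
    """Return the least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b

def residuals(xs, ys, m, b):
    """Return observed minus predicted for every data point."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]

# Hypothetical data: hours studied and quiz score for five students
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

m, b = fit_line(hours, scores)
res = residuals(hours, scores, m, b)
print("slope:", round(m, 2), "intercept:", round(b, 2))
print("residuals:", [round(r, 1) for r in res])
```

The residuals of a least-squares line always sum to zero (up to rounding). What matters for judging the fit is whether they show a pattern, such as a curve, when plotted against the predictor.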

### Linear Models

Linear models use one or more predictor variables to predict the behavior of a response variable. The process of creating a linear model is called linear regression analysis.

#### Interpretation

Once a model is generated, it must be tested to see if it really shows that the predictor variable has an effect. One common measure is the p-value, which is the probability of seeing a result at least as strong as the one observed if there were actually no effect. A p-value of 0.05 or lower is the conventional threshold for concluding that the effect is unlikely to be due to chance alone.

slope— The slope of the model shows how much the response variable changes for each one-unit increase in the predictor variable, and in what direction. A slope that is large in absolute value means that the response variable changes a lot. A negative slope means that the response variable decreases as the predictor variable increases.

intercept— The intercept is the value of the response variable when the predictor variable is zero. It often represents a fixed initial cost or overhead that is not affected by changes in the predictor variable.
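A short numeric check can make the roles of slope and intercept concrete. The repair-bill model below, with a 40-dollar call-out fee and a rate of 15 dollars per hour, is a hypothetical example, as is the `predict` function.

```python
def predict(x, slope, intercept):
    """Return the response predicted by the linear model for predictor x."""
    return slope * x + intercept

# Hypothetical model: repair bill = 40 dollars fixed fee + 15 dollars per hour
slope, intercept = 15, 40

# With the predictor at zero, the prediction equals the intercept
at_zero = predict(0, slope, intercept)

# Increasing the predictor by one unit changes the prediction by the slope
change_per_hour = predict(4, slope, intercept) - predict(3, slope, intercept)

print("bill for 0 hours:", at_zero)
print("extra cost for one more hour:", change_per_hour)
```

The same one-unit change holds anywhere along the line, which is what distinguishes a linear model from exponential or quadratic ones.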

#### Correlation

Correlation is a measure of how tightly two variables are related to each other. Measures of correlation usually must be computed with the aid of technology such as computer routines or specialized calculators.

correlation coefficient— The most common measure of correlation is the Pearson correlation coefficient, represented by r. It measures the strength of a linear relationship between two variables that are recorded as (x, y) data points.

linear fit— The correlation coefficient r varies between -1 and +1. Values near either extreme indicate a strong linear relationship between the variables. Values near zero indicate the lack of a linear relationship, although the variables may still be related in some other way, such as along a curve. Positive values mean that both variables either increase or decrease together. Negative values mean that as one increases, the other decreases.

correlation vs. causation— Just because two quantities are correlated does not mean that one causes the other. Correlation is a relationship discovered through statistical operations. Causation is a demonstrable chain of events linking one quantity to the other. Often, two highly correlated quantities are both driven by a separate third factor, such as the passage of time.
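Although r is normally found with a calculator or software, writing out the standard Pearson formula shows what those tools compute. The three small data sets and the function name below are made up for this sketch.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]
rising = [2 * x + 1 for x in xs]    # points exactly on a rising line
falling = [-3 * x + 4 for x in xs]  # points exactly on a falling line
curved = [x ** 2 for x in xs]       # exact relationship, but not linear

print(pearson_r(xs, rising))   # at the positive extreme
print(pearson_r(xs, falling))  # at the negative extreme
print(pearson_r(xs, curved))   # zero despite a perfect curved pattern
```

The `curved` case is a reminder that r only measures linear association. A scatter plot should always be checked alongside the coefficient.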

### Other Statistical Concepts

There are a number of concepts in statistics that do not focus on computation. Instead, they are more about the interpretation of those computations.

#### Random Processes

A process is random if there is no pattern that allows one to predict the next result of that process. However, a collection of data generated by a random process still has statistical properties such as mean and standard deviation.
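A simulated die gives a quick illustration: no single roll can be predicted, yet the collection of rolls has a stable mean and standard deviation. This is a sketch using Python's standard `random` and `statistics` modules. The function name, the number of rolls, and the fixed seed (used only to make the run repeatable) are choices made for this example.

```python
import random
from statistics import mean, pstdev

def roll_dice(n_rolls, seed=0):
    """Simulate n_rolls of a fair six-sided die.

    The seed only makes the example repeatable; it is an assumption
    of this sketch, not part of the idea being illustrated.
    """
    rng = random.Random(seed)
    return [rng.randint(1, 6) for _ in range(n_rolls)]

rolls = roll_dice(10_000)
sample_mean = mean(rolls)
sample_sd = pstdev(rolls)

print("first ten rolls:", rolls[:10])
print("mean of all rolls:", round(sample_mean, 3))
print("standard deviation:", round(sample_sd, 3))
```

For a fair die the theoretical mean is 3.5 and the standard deviation is about 1.71, and the simulated values should land close to both.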

#### Inferences and Conclusions

We use statistics to study populations that are too large to measure in their entirety. Statistical inference is the process of drawing conclusions about the entire population based on careful sampling and analysis of smaller portions of the population.

purpose recognition— The purpose of a statistical study or survey will determine how suitable it is for drawing conclusions about the whole population. Generally, a survey designed to study a specific characteristic may not be applicable to other purposes.

randomization— Randomization is important to any statistical study. The sample chosen must represent a random set of the whole population without any biases introduced by the choice of sampling method.

population mean— The population mean is the actual mean value for an entire population. However, measuring the whole population is often impractical, so the sample mean, or the mean value of a random subset, is used to estimate the population mean.

proportion— Instead of a mean value, you might want to know what proportion of a population shows a given characteristic. Again, the entire population is usually too large to measure, so the sample proportion of a random subset is used to estimate the population proportion.

margin of error— A sample mean or proportion will not be exactly equal to the true population value. The margin of error measures how far the sample value may differ from the true population value, often stated in percentage points, and is reported together with a confidence level, usually 95%.
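One common way to approximate the margin of error for a sample proportion at 95% confidence is 1.96 standard errors, where the standard error is the square root of p(1 − p)/n. That formula is an assumption brought in for this sketch (it relies on a reasonably large random sample), and the survey numbers and function name are hypothetical.

```python
from math import sqrt

def proportion_margin_of_error(successes, n, z=1.96):
    """Return (sample proportion, approximate margin of error).

    Assumes a simple random sample large enough for the normal
    approximation; z = 1.96 corresponds to about 95% confidence.
    """
    p_hat = successes / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin

# Hypothetical survey: 240 of 400 randomly chosen students favor a proposal
p_hat, margin = proportion_margin_of_error(240, 400)
print(f"sample proportion: {p_hat:.2f}")
print(f"margin of error: about {margin * 100:.1f} percentage points")
```

Because n sits under the square root, quadrupling the sample size only halves the margin of error.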

treatment comparison— When comparing results from two treatments, we need to know whether the difference between them could have happened by chance. One way to check is a simulation in which the results from all individuals are randomly reassigned to two new groups many times. If the re-randomized groups rarely show a difference as large as the original one, the original difference is unlikely to be due to chance alone.

evaluation of reports— Any data-based report should be evaluated on how well the study used statistical techniques such as random sampling, whether the statistics chosen were appropriate for the question asked and the population sampled, and whether the statistics were interpreted correctly.
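The re-randomization idea described under treatment comparison can be sketched in a few lines. The two groups of scores, the function name, the number of shuffles, and the fixed seed are all assumptions made for this illustration.

```python
import random
from statistics import mean

def randomization_test(group_a, group_b, n_shuffles=2000, seed=0):
    """Return (observed difference in means, fraction of shuffles
    whose difference is at least as large as the observed one)."""
    rng = random.Random(seed)  # fixed seed only for repeatability
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # randomly reassign results to two new groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_shuffles

# Hypothetical results from two treatments
treatment = [12, 14, 15, 16, 18, 19]
control = [8, 9, 10, 11, 12, 13]

observed, fraction = randomization_test(treatment, control)
print("observed difference in means:", round(observed, 2))
print("fraction of shuffles at least that large:", fraction)
```

A small fraction means that random regrouping almost never reproduces a gap as large as the observed one, which supports the conclusion that the treatments really differ. A large fraction means chance alone could easily explain the gap.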