Math Study Guide for the SAT Exam

Data Analysis

Data is all around us, from the number of students in your class to the scores you get on a test. The analysis of data helps us understand our world better. In this section, we will discuss various methods for gathering and analyzing data. This includes statistics and probability. Statistics is the broad study of collected data that involves analyzing, interpreting, and presenting it so conclusions may be drawn. Probability uses that data to predict future events.

Matching Graphs to Properties and Values

Different types of graphs display different properties and values of a data set, and some graphs are better suited to specific data types.

For instance, categorical data, such as the genre of music preferred by students in a high school, are appropriately presented by pie and bar graphs. On the other hand, numerical data, such as a company’s annual expenditures over a \(10\)-year period, are best plotted using line graphs, histograms, and scatter plots.

On the SAT, you may be given a graph and asked to interpret it. You will need to understand what you see and relate this graphical depiction to important features, such as the central tendency, spread, and shape.

Distributions and Measures of Center and Spread

In statistics, data sets are described using measures of central tendency and measures of spread. The measures of central tendency represent the typical value of data in a set, while the measures of spread show how much the values in the set vary.

Central Tendency

There are three basic measures of central tendency that a student taking the SAT must know. These are the mean, median, and mode. We’ll use the following data set to illustrate all three.

Twelve children have the following heights measured in centimeters:

\[100.5, \,98.0, \,98.5, \,98.4, \,98.7, \,100.0, \,100.4, \,100.7, \,104.0, \,98.8, \,98.0, \,98.5\]

In this data set, the mean (or average) value is approximately \(99.5\). It is determined by adding all the values and dividing the sum by the number of values:

\[\frac{100.5 + 98.0 + 98.5 + 98.4 + 98.7 + 100.0 + 100.4 + 100.7 + 104.0 + 98.8 + 98.0 + 98.5}{12} = \frac{1194.5}{12} \approx 99.5\]

The median is \(98.75\). To determine the median, the values must first be arranged in ascending order. The value in the middle is the median. In this sample set, because there is an even number of values, there are two middle numbers:

\[98.0, \;98.0, \;98.4, \;98.5, \;98.5,\;\, \mathbf{98.7}, \;\mathbf{98.8},\; 100.0, \;\,100.4, \;100.5, \;100.7, \;104.0\]

When that happens, we take their sum and divide that by two to get the median:

\[\frac{98.7 + 98.8}{2} = \frac{197.5}{2} = 98.75\]

The value that appears the most is the mode. It is common for a data set to have more than one mode, as is the case in this set, which is bimodal because it has two modes. Those modes are \(98.0\) and \(98.5\), because they both appear twice while the other values only appear once:

\[\mathbf{98.0}, \;\mathbf{98.0},\;98.4, \;\mathbf{98.5},\; \mathbf{98.5},\; 98.7, \;98.8, \;100.0, \;100.4, \;100.5, \;100.7, \;104.0\]
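The three measures can also be checked with a few lines of code. The following is a minimal sketch using Python's standard `statistics` module (`multimode` assumes Python 3.8 or later); the variable names are illustrative and are not part of the SAT material.

```python
import statistics

# Heights (cm) of the twelve children from the example above
heights = [100.5, 98.0, 98.5, 98.4, 98.7, 100.0,
           100.4, 100.7, 104.0, 98.8, 98.0, 98.5]

mean_height = statistics.mean(heights)      # sum of the values / number of values
median_height = statistics.median(heights)  # average of the two middle values
modes = statistics.multimode(heights)       # every value tied for most frequent

print(round(mean_height, 1))  # 99.5
print(median_height)          # 98.75
print(modes)                  # [98.0, 98.5]
```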

Spread

How much the values vary in a data set is described by measures of spread. For the SAT, it is more important that you understand the meaning of a measure of spread than that you know how to compute it. Two of the most common measures of spread are the range and the standard deviation. You should know what each of these measures is and what it implies about the data set.

Range

The range of a data set is the difference between the largest and the smallest value. It shows the spread or span of all the data. In the data set above, the range of children’s height is \(6 \text{ cm}\) (\(104 \text{ cm}-98 \text{ cm}\)).

Suppose there was another group of \(12\) children whose heights had a range of \(2\text{ cm}\). We could conclude that the height measurements of the children in the first group show greater variation than those in the second group. A smaller range (\(2 \text{ cm}\)) means that the height measurements of the children in this group are closer together. It also means that no two children in the group differ in height by more than \(2 \text{ cm}\).

Standard Deviation

Standard deviation (SD) is another measure of spread. It measures how far away from the mean the values are in a set. SD is computed by taking the square root of the variance of a data set. The variance is the average of the squared differences of each value from the mean.
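Written as a formula, that description looks like the following. The symbols are notation chosen here for illustration: \(x_1, x_2, \ldots, x_n\) are the values, \(\bar{x}\) is their mean, \(n\) is the number of values, and \(\sigma\) is the standard deviation. This is the version that divides by \(n\); some textbooks and calculators divide by \(n - 1\) instead.

\[\sigma = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}}\]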

If that sounds complicated, just know that the SAT will not ask you to compute standard deviation. It will be enough for you to understand what it is and what it means for a data set. In the given example with height measurements, the standard deviation is approximately \(1.6 \text{ cm}\). An SAT question may provide this information and ask, “How many children have heights within one standard deviation of the mean?”

“Within one standard deviation of the mean” refers to any value that is at most \(1.6 \text{ cm}\) above or below the mean. So you will need to check the data set for values falling within \(99.5 \pm 1.6 \text{ cm}\), that is, between \(97.9 \text{ cm}\) and \(101.1 \text{ cm}\), and count how many there are. Here, every height except \(104.0\) falls in that interval, so the answer is \(11\).
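Statements about the range and standard deviation of the height data can be checked with a short script. This is a sketch, assuming the standard deviation is the version that divides by the number of values (`statistics.pstdev` in Python); the variable names are illustrative.

```python
import statistics

# Heights (cm) of the twelve children from the example above
heights = [100.5, 98.0, 98.5, 98.4, 98.7, 100.0,
           100.4, 100.7, 104.0, 98.8, 98.0, 98.5]

height_range = max(heights) - min(heights)  # largest value minus smallest value
mean_height = statistics.mean(heights)
sd = statistics.pstdev(heights)             # square root of the variance (divides by n)

# Values no more than one standard deviation above or below the mean
low, high = mean_height - sd, mean_height + sd
within_one_sd = [h for h in heights if low <= h <= high]

print(height_range)        # 6.0
print(round(sd, 1))        # 1.6
print(len(within_one_sd))  # 11
```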

The Shape of Data

The shape of data can be symmetrical or asymmetrical. When the values in a data set are evenly distributed on both sides of the center and the mean is close in value to the median, the data is said to have a symmetric shape. When most of the values cluster in one area, that area is the head. The region where the frequencies taper off toward \(0\) (either to the left or right of the head) is known as the tail. We call these data sets asymmetric or skewed because the center is shifted to either the right or left. When the mean is greater than the median, the graph of the data is skewed to the right (or the tail is to the right of the head). When the mean is less than the median, the graph is skewed to the left (or the tail is to the left of the head).

[Figure: Data Shape]
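The mean-versus-median comparison can be turned into a rough check. The sketch below is only a rule of thumb, not a formal test of skewness; the function name `describe_shape`, its `tolerance` parameter, and the three sample lists are hypothetical.

```python
import statistics

def describe_shape(values, tolerance=0.05):
    """Rough label based only on comparing the mean with the median.

    `tolerance` is an assumed cutoff: how far apart the mean and median
    may be, as a fraction of the range, before the data is called skewed.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(mean - median) <= tolerance * spread:
        return "roughly symmetric"
    # Mean pulled above the median: tail on the right; below: tail on the left
    return "skewed right" if mean > median else "skewed left"

print(describe_shape([1, 2, 3, 4, 5]))                  # roughly symmetric
print(describe_shape([1, 2, 2, 3, 3, 3, 4, 20]))        # skewed right
print(describe_shape([1, 17, 18, 18, 19, 19, 19, 20]))  # skewed left
```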

Outliers

In a given data set, numbers that are too far away from the main group (either too small or too large compared to most of the values) are called outliers. Outliers affect the mean, although not so much the median or mode.

Suppose there were \(10\) students who were given special coaching by their teacher to improve their performance in class. Prior to the coaching, the mean score of these students on every test never exceeded \(76\). It has been a month, and the teacher wants to know if the coaching is making progress.

On the latest test, these were the scores:

\[82,\, 82, \,83, \,15, \,84, \,83, \,80,\, 80, \,81, \text{ and } 82\]

The score of \(15\) is an extreme value and is clearly an outlier.

The mean score for the latest test is \(75.2\). Does this mean that the teacher’s coaching failed? Probably not, because both the median and the mode are equal to \(82\). That means that most of the students improved, and there was just one exception, or outlier, which brought down the mean. Before the teacher can draw any conclusions about these results, they would want to investigate the reasons for the outlier. Maybe the student was sick, or maybe they didn’t understand the test.

As a general rule, when there are outliers, they can be removed if there is justification for doing so. Another solution is to use the median or mode instead of the mean because those measures are less affected by outliers.
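The effect of the outlier on each measure can be seen by computing the measures with and without the score of \(15\). This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Scores from the latest test, including the outlier of 15
scores = [82, 82, 83, 15, 84, 83, 80, 80, 81, 82]
scores_without_outlier = [s for s in scores if s != 15]

mean_all = statistics.mean(scores)
median_all = statistics.median(scores)
mode_all = statistics.mode(scores)
mean_trimmed = statistics.mean(scores_without_outlier)

print(mean_all)                # 75.2
print(median_all)              # 82.0
print(mode_all)                # 82
print(round(mean_trimmed, 1))  # 81.9
```

Removing the single outlier raises the mean by more than six points, while the median and mode of the full data set already sit at \(82\).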

Two-Variable Data

It is important to know how to read graphical representations of data. There are four types of data displays you can expect to see on the SAT: scatter plot, box-and-whisker plot, histogram, and two-way table (see the discussion of two-way tables below under “Relative Frequency”).

Scatter Plot

A scatter plot, also referred to as an \(XY\) plot, is usually the graph type of choice for showing the relationship between data with two variables (known as bivariate data). The data or values are plotted on the graph as \(x\)- and \(y\)-coordinates, with \(x\) as the independent variable and \(y\) as the dependent variable.

[Figure: Scatter Plot]

A point in the graph represents two values. For instance, point P in the above graph represents \(1.5\) hours of tutoring (the \(x\) value) and a score of \(68\) (the \(y\) value). Viewing the whole scatter plot, we see that, generally, as the number of hours spent on tutoring increases, the students’ scores also increase.

Line and Curve of Best Fit

When using scatter plots, we can draw the line of best fit. This line is useful for illustrating any potential trends in the data, allowing us to make estimates or projections by interpolation or extrapolation. From this line, the best-fit equation or regression equation can then be determined using algebra (straight lines and linear equations).

Variables have a linear relationship when one changes at a constant rate as the other changes. In other words, each unit increase in one variable goes with the same amount of increase (or decrease) in the other. See how the line below goes up? This tells us that there is a positive correlation between the two variables because as one variable increases, the other also increases. We could already see that with just the scatter plot of data, but the line makes the direction of the trend clearer. The more closely the points cluster around the line, the stronger the correlation. If the line were going down, it would be a negative correlation.

[Figure: Line of Best Fit]
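The SAT does not ask you to calculate a line of best fit by hand, but seeing one computed can make the idea concrete. The sketch below uses the least-squares method, which is one common way to fit a line; the five data points are made up for illustration and are not the ones in the figure.

```python
# Hypothetical data: hours of tutoring and test scores for five students
hours = [1, 2, 3, 4, 5]
scores = [62, 68, 71, 79, 85]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Least-squares slope and intercept for the line y = slope * x + intercept
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predict(x):
    """Estimated score after x hours of tutoring, read off the fitted line."""
    return slope * x + intercept

print(round(slope, 2), round(intercept, 2))  # 5.7 55.9
print(round(predict(3.5), 2))  # 75.85 (interpolation: inside the data)
print(round(predict(6), 2))    # 90.1  (extrapolation: beyond the data)
```

The positive slope matches the upward trend: in this made-up data, each extra hour of tutoring goes with an estimated increase of about \(5.7\) points.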

The equation will not always be linear. Sometimes it takes a quadratic or exponential model to create a curve of best fit. A U-shaped (parabolic) graph opening either upward or downward indicates a quadratic relationship. The rate of change is not constant, and there is either a maximum or a minimum value, which is seen in the graph and is called the vertex. A graph that changes very gradually at first (either increasing or decreasing) and then changes more and more rapidly indicates an exponential relationship. An exponential curve does not have a vertex.

[Figure: Curve of Best Fit]
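In symbols, the three kinds of models mentioned in this section can be written as follows. The letters \(m\), \(a\), \(b\), and \(c\) are generic constants chosen here for illustration; their actual values depend on the data being fitted.

\[\text{Linear: } y = mx + b\]

\[\text{Quadratic: } y = ax^2 + bx + c, \quad a \neq 0\]

\[\text{Exponential: } y = a \cdot b^x, \quad a \neq 0, \; b > 0, \; b \neq 1\]

For the quadratic model, the vertex is located at \(x = -\frac{b}{2a}\). It is a minimum when \(a > 0\) and a maximum when \(a < 0\).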

Box-and-Whisker Plot

A box-and-whisker plot, also referred to as a box plot, is made up of a rectangular box with a horizontal line (a whisker) extending from each end. It looks like this:

[Figure: Box-and-Whisker Plot]

A box-and-whisker plot breaks the data into four parts called quartiles. In the graph, the first vertical line represents the first quartile (Q\(1\)), the vertical line within the box marks the second quartile (Q\(2\)), or the median of the data, and the third vertical line represents the third quartile (Q\(3\)).

Points \(A\), \(B\), \(C\), and \(D\) are only marked for our purposes. The tip of the horizontal line marked as \(A\) is the smallest value in the data set (excluding outliers), while \(B\) on the other tip is the largest value in the data set (excluding outliers). There are cases, however, when there are outliers in the data set. These values are represented as dots disconnected from the plot, such as points \(C\) and \(D\).

In our example above, the median of the data set is \(35\). Without the outliers, the range is \(28\) \((49 - 21)\). The range describes the spread of all the data. With the outliers, the range will be quite large—around \(50\). You may also determine the interquartile range (IQR), or the range of the middle half of the data. From the plot, the IQR is approximately \(14\) (Q\(3\) – Q\(1\)).
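To see where numbers like these come from, the sketch below computes the quartiles, range, and IQR for a small data set. The data is hypothetical (it is not the data behind the figure) and was chosen so that its median, range, and IQR match the values quoted above. Quartiles are found here as the medians of the lower and upper halves, which is one common convention; other conventions can give slightly different values.

```python
import statistics

# Hypothetical data set with no outliers, sorted in ascending order
data = sorted([21, 25, 28, 30, 33, 35, 37, 40, 42, 45, 49])

half = len(data) // 2
lower_half = data[:half]              # values below the median position
upper_half = data[len(data) - half:]  # values above the median position

q1 = statistics.median(lower_half)    # first quartile
q2 = statistics.median(data)          # second quartile (the median)
q3 = statistics.median(upper_half)    # third quartile

data_range = data[-1] - data[0]       # largest value minus smallest value
iqr = q3 - q1                         # spread of the middle half of the data

print(q1, q2, q3)       # 28 35 42
print(data_range, iqr)  # 28 14
```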

A box-and-whisker plot can be skewed to the right, meaning most of the observations are on the left side (the mean and median are closer to the minimum than the maximum), pulling the box to the left, with the longer whisker stretched to the right. Or it can be skewed to the left, with most of the observations to the right (the mean and median are closer to the maximum than the minimum).

Histogram

A histogram is a graph that uses columns or bars on the \(xy\)-plane to show how often the values in a data set fall within each of a series of intervals. The labels on both the \(x\)- and \(y\)-axes represent quantitative data: for example, height ranges on the \(x\)-axis and the number of high school athletes in each range on the \(y\)-axis.

[Figure: Histogram]

The histogram shows the frequency with which each height range occurs in the data set. This one is skewed to the right, which means that most of the athletes are on the shorter end of the scale (\(52\) athletes have height measurements between \(165 \text{ cm}\) and \(180 \text{ cm}\)), with fewer athletes on the taller end.

On the SAT, the range of values in a histogram will follow this convention: Each bar in the histogram includes the end value on the left and excludes the end value on the right of the range. So a range of \(165\) to \(170 \text{ cm}\) includes all heights \(h\) with \(165 \le h < 170\), which means \(165\) is included but \(170\) is excluded.
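The left-included, right-excluded convention can be sketched in code. The function name `bin_for`, the starting value of \(165\), the bar width of \(5\), and the sample heights below are all assumptions made for illustration.

```python
from collections import Counter

def bin_for(value, start=165, width=5):
    """Return the (left, right) edges of the bar that contains `value`.

    The left edge is included and the right edge is excluded.
    """
    index = int((value - start) // width)
    left = start + index * width
    return (left, left + width)

# Hypothetical heights in cm
heights = [165, 169.9, 170, 172.5, 175, 180]
counts = Counter(bin_for(h) for h in heights)

print(bin_for(165))  # (165, 170)
print(bin_for(170))  # (170, 175)
print(counts[(165, 170)], counts[(170, 175)])  # 2 2
```

A height of exactly \(170 \text{ cm}\) is counted in the bar that starts at \(170\), not in the bar that ends there.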
