Mathematics: Probabilistic and Statistical Reasoning Study Guide for the TSIA2

Computing and Describing Data

There are many tools, techniques, and terms that are useful when working with data. They help you make sense of the lists of numbers and provide desired information for you to interpret. Here are some things you should know about data description.

Variables

Often, there is more than one quantity or attribute measured in an experiment or survey. Each of these items that is measured is referred to as a variable.

Independent Variables

An independent variable is one that is controlled by experimenters or is used to organize the data. Time is often used as an independent variable and trends over time are observed in the data.

Dependent Variables

The dependent variable is the variable of interest and it may depend on the independent variable in some way. The study is often done in an attempt to discover whether a relationship exists between the variables.

Mean, Median, and Mode

Mean, median, and mode are all ways to attempt to describe a set of data.

Mean

The mean is the arithmetic average of the values, found by adding up all of the values and dividing by the number of values.

Median

The median is the value where 50% of the values fall above it and 50% of the values fall below. To obtain the median, rewrite the data in order from smallest to largest. Counting toward the middle from both ends, the median is the number in the middle. If there is an even number of entries, there will be two “middle numbers.” Then you would take the average of those two.

Mode

The mode is the value that occurs most often.

Example

Suppose 10 students took a test and got the following scores:

83, 87, 92, 78, 83, 85, 95, 50, 90, and 81

The mean is 82.4.
The median is 84 (average of 83 and 85, which are the middle numbers).
The mode is 83, which occurs twice in this series.
The mean is a little lower than the median because of the one low score (50). That score is an outlier, which is any value that differs widely from the rest. Outliers pull the mean more than they pull the median or the mode.
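The calculations in this example can be checked with a short Python sketch. It uses the standard-library statistics module. The variable names are chosen here for illustration, and the second half removes the outlier only to show how much the mean moves compared with the median.

```python
import statistics

# Test scores from the example above.
scores = [83, 87, 92, 78, 83, 85, 95, 50, 90, 81]

mean = statistics.mean(scores)      # add the values, then divide by how many there are
median = statistics.median(scores)  # middle of the sorted list (average of 83 and 85 here)
mode = statistics.mode(scores)      # value that occurs most often

print(mean, median, mode)

# Drop the outlier (50) to see how strongly it affected each measure.
without_outlier = [s for s in scores if s != 50]
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)

print(mean_without, median_without)
```

With the outlier removed, the mean rises from 82.4 to 86, while the median only moves from 84 to 85.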

Descriptive vs. Inferential Statistics

Descriptive statistics are used to describe the characteristics of a set of data, such as how many people in a survey had an income above a certain level. Inferential statistics use the results from a sample, such as the people who took this survey, to make predictions about a larger population, such as the entire city where the survey took place.

Summary Statistics

Summary statistics are values that show different characteristics of the data set as a whole. They can tell you things like how well a math class did on an exam compared to previous classes, or how close the data points are to their mean. They help establish the reliability of the data set and any conclusions drawn from it.

Center and Spread of Data

Certain statistics, such as the mean, median, and mode, give a single value that shows where the data set is centered. Other statistics, such as the range and standard deviation, measure how spread out the data are.

An example of using these might be a physics student doing an experiment to measure the acceleration of gravity. The student might repeat the experiment 10 times and calculate the mean of those results to arrive at a value to report as her experimental result. At the same time, she could report on how close the 10 values were to each other (range and standard deviation). If they are all very close to each other, she would likely have more confidence in her result than if they were spread out over a wide range.
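As a rough illustration of the physics example, the sketch below compares two sets of 10 trials. Both data sets are invented for this guide and are not real measurements. The helper name summarize is also an assumption made for the example.

```python
import statistics

# Hypothetical measurements of g in m/s^2 (invented for illustration only).
tight_trials = [9.79, 9.81, 9.80, 9.82, 9.80, 9.81, 9.79, 9.80, 9.82, 9.81]
wide_trials = [9.2, 10.4, 9.9, 9.5, 10.1, 9.3, 10.6, 9.8, 9.6, 9.65]

def summarize(trials):
    """Return (mean, range, sample standard deviation) for a list of trials."""
    center = statistics.mean(trials)
    spread_range = max(trials) - min(trials)
    spread_sd = statistics.stdev(trials)
    return center, spread_range, spread_sd

tight_summary = summarize(tight_trials)
wide_summary = summarize(wide_trials)

print(tight_summary)
print(wide_summary)
```

Both sets were chosen to have essentially the same mean, so the center alone cannot tell them apart. The range and standard deviation are much smaller for the first set, which is the situation in which the student would have more confidence in her result.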

Quartiles and Percentiles

Quartiles and percentiles are ways of describing where a particular value falls in a data set. If a value is at the nth percentile, then about \(n\%\) of the data is below that value. The lowest value in a data set sits at the bottom of this scale, near 0%, and the highest value sits at the top, near 100%. Quartiles divide the ordered data into 4 ranges, each holding about one quarter of the values. The lowest quarter of the data is the first quartile, the data between one quarter and one half is the second quartile, and so on. The boundary between the second and third quartiles is the median.
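One way to make this concrete is to count how much of the data falls below a value. The sketch below reuses the test scores from the earlier example. The function percent_below is a helper written for this guide, not a standard function. The quartile cut points come from statistics.quantiles, and their exact values depend on the convention used, so different textbooks and calculators may report slightly different boundaries.

```python
import statistics

# Test scores from the earlier example.
scores = [83, 87, 92, 78, 83, 85, 95, 50, 90, 81]

def percent_below(values, x):
    """Percentage of entries in values that are strictly less than x."""
    return 100 * sum(1 for v in values if v < x) / len(values)

# 7 of the 10 scores are below 90, so a score of 90 is at about the 70th percentile.
pct_90 = percent_below(scores, 90)
print(pct_90)

# Three cut points that split the ordered data into four groups.
q1, q2, q3 = statistics.quantiles(scores, n=4)
print(q1, q2, q3)
```

The middle cut point matches the median of 84 found earlier.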

Standard Deviation

The standard deviation is a measure of how far the data points typically lie from the mean. A smaller standard deviation means that the points are clustered more tightly around the mean and usually indicates a more consistent, and therefore more reliable, result.
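The steps behind a standard deviation can be sketched by hand, again using the test scores from the earlier example. This guide does not say which version of the formula to use, so the sketch shows both common conventions: dividing by the number of values (population) and dividing by one less than the number of values (sample). The variable names are chosen for illustration.

```python
import math

# Test scores from the earlier example.
scores = [83, 87, 92, 78, 83, 85, 95, 50, 90, 81]

n = len(scores)
mean = sum(scores) / n

# Step 1: find how far each value is from the mean, then square it.
squared_deviations = [(x - mean) ** 2 for x in scores]

# Step 2: average the squared deviations and take the square root.
population_sd = math.sqrt(sum(squared_deviations) / n)    # divide by n
sample_sd = math.sqrt(sum(squared_deviations) / (n - 1))  # divide by n - 1

print(round(population_sd, 2), round(sample_sd, 2))
```

Most of the total comes from the single score of 50, which shows that outliers inflate the standard deviation just as they pull the mean.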

Interpreting Data

Once the data have been organized and displayed, the reader should be able to interpret them and tell what they mean. Part of doing this involves distinguishing correlation from causation. Data interpretation may also result in drawing conclusions about the topic and/or finding a solution to a problem.

Correlation vs. Causation

Statistics may show that there is a relationship between the variables. The dependent variable may consistently increase or decrease as the independent variable increases. This relationship is a correlation. However, a correlation alone does not establish that changing the independent variable actually causes the change in the dependent variable.

For example, a person could plot his or her age over the past 20 years as the independent variable and the cost of peanut butter as the dependent variable. There will likely be a strong correlation because both increase. However, this does not mean that getting older causes the increase in price, nor does it mean that the increasing price causes the person to get older. Both are affected by a third variable, time.

Linear Correlation

If the plot of the independent (x) and dependent (y) variables is in or close to a straight line, the variables are said to have a linear correlation. The model is usually written as \(y = mx + b\), where \(m\) is the slope of the line and \(b\) is the value where it crosses the y-axis. A positive correlation will slope upward to the right (\(m\) is positive) and a negative correlation will slope downward to the right (\(m\) is negative). A common measure of how close the plot is to a straight line is the correlation coefficient, which ranges from -1 to 1. Values closer to 1 or -1 indicate a stronger linear relationship, and values near 0 indicate a weak one.
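The slope, intercept, and correlation coefficient can be computed from a small data set as a check on these ideas. The sketch below uses the usual least-squares formulas for the line and the Pearson formula for the correlation coefficient. This guide does not name a specific method, so treat that choice as an assumption. The data and the function name linear_fit are invented for illustration.

```python
import math

# Hypothetical data, invented for illustration:
# hours studied (x) and quiz score (y).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def linear_fit(xs, ys):
    """Return (m, b, r): least-squares slope, intercept, and Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    syy = sum((yi - mean_y) ** 2 for yi in ys)
    m = sxy / sxx                  # slope of the best-fit line
    b = mean_y - m * mean_x        # y-intercept
    r = sxy / math.sqrt(sxx * syy) # correlation coefficient, between -1 and 1
    return m, b, r

m, b, r = linear_fit(x, y)
print(m, b, r)

# Flipping the sign of every y value reverses the direction of the trend.
m_neg, b_neg, r_neg = linear_fit(x, [-yi for yi in y])
print(m_neg, r_neg)
```

For this data the slope is positive and the correlation coefficient is roughly 0.77, a fairly strong positive correlation. The flipped data gives a negative slope and a coefficient of roughly -0.77, which is equally strong but in the opposite direction.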