Statistics

Statistics is the science of collecting, summarizing, representing, and
analyzing data in order to answer research questions. Any conclusions
reached must be accompanied by a measure of their reliability.

In this tutorial, we cover the following topics:

mode
median, percentiles, and quartiles
mean
standard deviation
hypothesis testing

Mode

The mode is the value in the data which occurs most often. The mode is often used with categorical data, that is, data whose values are labels for different possible categories. The mode can also be used with continuous quantitative data, in which case it is the value at which the probability density function attains its maximum, and we use the methods of calculus to obtain it.

Example:

Consider the data in the Table below.

Phone type	Number of sales
Samsung	205
iPhone	443
Google pixel	191
Xiaomi	328
Realme GT	263

What is the mode of this data?

Solution: the mode is the phone type with the highest number sold, viz: iPhone.
Nb. The data in the table are categorical (i.e., qualitative).
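
As a rough illustration, the mode of the table above can be found in Python. This is only a sketch: the dictionary name `sales` and the variable names are our own, and the counts are copied from the table.

```python
from statistics import mode

# Sales figures copied from the table above (phone type -> number of sales).
sales = {
    "Samsung": 205,
    "iPhone": 443,
    "Google pixel": 191,
    "Xiaomi": 328,
    "Realme GT": 263,
}

# With a frequency table, the mode is the category with the largest count.
mode_from_counts = max(sales, key=sales.get)

# Equivalent check on the raw data: one label per phone sold.
raw = [phone for phone, count in sales.items() for _ in range(count)]
mode_from_raw = mode(raw)

print(mode_from_counts, mode_from_raw)
```

Both approaches agree because a frequency table is just a compact way of writing the raw list of labels.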

Median, percentiles, and quartiles

The median of a dataset is a number which divides the sorted data into two equal halves: a lower half and an upper half.
The median and the mean are two of the three measures of center (the third being the mode). Both are used with quantitative data, whereas the mode is often the reported measure of center for categorical data. When data is symmetric, the better measure of center to report is the mean (or average), because it uses all the values in the data for its calculation. When data is skewed, as with income data, the better measure of center to report is the median: it uses only the middle value(s) for its computation, which makes it more resistant to outliers than the mean. An outlier is an observation or value that does not follow the general pattern in the data, such as a value which is much larger or much smaller than the rest of the data values.
So, the presence of an outlier can pull the mean lower or higher than it would have been without the outlier.

Example:

A quiz consisting of five questions was taken by 100 students and the following Table displays the results. Find the median score.

Score/5	Number of students
0	5
1	6
2	14
3	25
4	40
5	10

Solution: Since $n = 100$ is even, the median is the average of the two middle values, which are in the 50th and 51st positions. These positions are obtained from the index formula $\frac{1}{2} (n + 1)$ , where $n = 100$ is the size of the sample: $\frac{1}{2} (n + 1) = \frac{101}{2} = 50.5$ , which when rounded down and up gives 50 and 51, respectively. We now create the following cumulative frequency table to help locate the 50th and 51st values in the data, after the data has been sorted from smallest to largest (which was already the case for this question).
Cumulative frequency is the running total of all the frequencies up to and including the current frequency.
The first cumulative frequency is the same as the first frequency. Nb. Frequency here is the number of students
at each score level.

Score/5	Number of students	Cumulative frequency
0	5	5
1	6	11
2	14	25
3	25	50
4	40	90
5	10	100

From the cumulative frequency column, we see that the 50th value is a score of 3 and
the 51st value is a score of 4. So, the median score is the average of 3 and 4: $\frac{3 + 4}{2} = 3.5$ .
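
The cumulative-frequency lookup used in this solution can be sketched in Python as follows. The helper `value_at` is a hypothetical name introduced only for this illustration, and the result is cross-checked against the standard library's `statistics.median` on the expanded data.

```python
from statistics import median

# Frequency table from the quiz example: score -> number of students.
freq = {0: 5, 1: 6, 2: 14, 3: 25, 4: 40, 5: 10}
n = sum(freq.values())  # sample size

def value_at(position, freq):
    """Return the value at a 1-based position in the sorted data."""
    running = 0
    for score in sorted(freq):
        running += freq[score]  # cumulative frequency so far
        if position <= running:
            return score
    raise ValueError("position exceeds sample size")

# n is even, so the median averages the values at positions n/2 and n/2 + 1.
lo, hi = n // 2, n // 2 + 1
median_score = (value_at(lo, freq) + value_at(hi, freq)) / 2

# Cross-check by expanding the table into raw data.
raw = [score for score, count in freq.items() for _ in range(count)]
median_check = median(raw)

print(median_score, median_check)
```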

Percentiles are numbers which divide the sorted data into one hundred parts,
with the parts containing roughly the same number of observations. For example, a 70th percentile score on an exam
is a score such that 70% of observations are below or equal to it and 30% of observations are above or equal to it.
The $k$ th percentile is located at the index $\frac{k}{100} (n + 1)$ in the sorted data. If this index
is an exact integer, then the $k$ th percentile is the observation at that index. Otherwise, the index is
rounded down and up, and the average of the values at these two indices is the $k$ th percentile.
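
A minimal Python sketch of this index rule, using the index $\frac{k}{100} (n + 1)$ , is given here. The function name `percentile` is our own, and clamping the index to the range 1 to $n$ is an added assumption, since the rule as stated does not say what to do when the index falls outside the data. Note that statistical software often uses other interpolation conventions, so results may differ slightly from library functions.

```python
import math

def percentile(data, k):
    """k-th percentile using the 1-based index k * (n + 1) / 100."""
    values = sorted(data)
    n = len(values)
    index = k * (n + 1) / 100
    # Assumption: clamp the index so extreme percentiles stay inside the data.
    index = min(max(index, 1), n)
    lower, upper = math.floor(index), math.ceil(index)
    # If the index is a whole number, lower == upper and the average
    # is simply the observation at that index.
    return (values[lower - 1] + values[upper - 1]) / 2

# Quiz data from the median example, expanded from its frequency table.
freq = {0: 5, 1: 6, 2: 14, 3: 25, 4: 40, 5: 10}
raw = [score for score, count in freq.items() for _ in range(count)]

p25 = percentile(raw, 25)
p50 = percentile(raw, 50)
p75 = percentile(raw, 75)
print(p25, p50, p75)
```

The 50th percentile computed this way coincides with the median found in the quiz example.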

Quartiles are a special case of percentiles: they divide the dataset into four parts, with each
part containing approximately the same number of observations. There are three quartiles, viz:

first quartile, $Q_{1}$
second quartile, $Q_{2}$
third quartile, $Q_{3}$

The first quartile, $Q_{1}$ , is the same as the 25th percentile.
The second quartile, $Q_{2}$ , is the same as the 50th percentile or median.
The third quartile, $Q_{3}$ , is the same as the 75th percentile.
The $k$ th quartile is located at the index $\frac{k}{4} (n + 1)$ .

Example:

Find the five-number summary for the following data: $20, 4, 0, - 6, 100, 6, 9, 3, 10, 20, 72$

Solution: Sort the data from smallest to largest as follows:
$- 6, 0, 3, 4, 6, 9, 10, 20, 20, 72, 100$
$\text{minimum} = - 6$
lower half: -6, 0, 3, 4, 6; $Q_{1} = \text{median of lower half} = 3$
$\text{median} = Q_{2} = 9$
upper half: 10, 20, 20, 72, 100; $Q_{3} = \text{median of upper half} = 20$
$\text{maximum} = 100$
So, the five-number summary is: $(\text{min}, Q_{1}, Q_{2}, Q_{3}, \text{max}) = (- 6, 3, 9, 20, 100)$ .
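
The median-of-halves approach used in this solution can be sketched in Python. The function name `five_number_summary` is our own, and excluding the median from both halves when $n$ is odd is the convention followed in the worked solution; other conventions exist and can give slightly different quartiles.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]          # values below the median position
    upper = values[half + n % 2:]  # skips the median itself when n is odd
    return (values[0], median(lower), median(values), median(upper), values[-1])

data = [20, 4, 0, -6, 100, 6, 9, 3, 10, 20, 72]
summary = five_number_summary(data)
print(summary)  # (-6, 3, 9, 20, 100)
```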

Mean

The mean or average is a measure of center. For a data set, it is calculated by adding up the observations and then
dividing by the number of observations.
The mean of a random variable is also called the expected value. For a discrete random variable with probability mass function $p (x)$ , $E (X) = \sum x \cdot p (x)$ . For a continuous random variable with probability density function (pdf) $f (x)$ , $E (X) = \int_{- \infty}^{\infty} x f (x) d x$ .

Example:

A random variable $X$ has the following pdf.
$f (x) = \begin{cases} x^{2} & \text{if } 0 < x < 1 \\ \frac{1}{9} & \text{if } x = 1 \\ \frac{8}{9} x & \text{if } 1 < x < 1.5 \\ 0 & \text{otherwise} \end{cases}$
Compute $E (X)$ , the expected value of $X$ .

Solution: This is a mixed random variable because there is probability mass at $x = 1$ (this is the discrete part of
$X$ ) and the random variable also takes values in the intervals $(0, 1)$ and $(1, 1.5)$ (this is the continuous part of $X$ ).
$\begin{aligned} E (X) & = \int_{0}^{1} x f (x) d x + 1 \cdot f (1) + \int_{1}^{1.5} x f (x) d x \\ & = \int_{0}^{1} x \cdot x^{2} d x + 1 \cdot \frac{1}{9} + \int_{1}^{1.5} x \cdot \frac{8}{9} x d x \\ & = \frac{1}{4} x^{4} |_{0}^{1} + \frac{1}{9} + \frac{8}{27} x^{3} |_{1}^{1.5} \\ & = \frac{1}{4} + \frac{1}{9} + \frac{8}{27} ({1.5}^{3} - 1) \\ & = \frac{115}{108} \approx 1.06 \end{aligned}$
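
The arithmetic in this solution can be checked exactly with Python's `fractions` module. This sketch does not integrate symbolically; it plugs the limits into the antiderivatives $\frac{1}{4} x^{4}$ and $\frac{8}{27} x^{3}$ worked out by hand, and the variable names are our own. It also confirms that the continuous pieces and the point mass together carry total probability 1.

```python
from fractions import Fraction

a, b = Fraction(1), Fraction(3, 2)  # endpoints of the interval (1, 1.5)

# Continuous part on (0, 1): integral of x * x^2 is x^4 / 4, from 0 to 1.
part1 = Fraction(1, 4) * (1**4 - 0**4)
# Discrete part: the value 1 carries probability mass 1/9.
part2 = 1 * Fraction(1, 9)
# Continuous part on (1, 1.5): integral of x * (8/9) x is (8/27) x^3.
part3 = Fraction(8, 27) * (b**3 - a**3)

expected = part1 + part2 + part3

# Sanity check: total probability = 1/3 + 1/9 + (4/9)(b^2 - a^2).
total = Fraction(1, 3) + Fraction(1, 9) + Fraction(4, 9) * (b**2 - a**2)

print(expected, float(expected), total)
```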

Standard Deviation

The standard deviation of a random variable, $X$ , measures how far, on average, a randomly sampled observation
falls below or above the mean. There are a number of ways to find the standard deviation, $σ_{X}$ , for a random variable, $X$ ,
but we will use the following definition: $\text{variance} = σ_{X}^{2} = E [X^{2}] - [E (X)]^{2}; \quad σ_{X} = \sqrt{\text{variance}}$

Example:

A random variable $X$ has the following frequency distribution.
$\begin{array}{cc} X : \text{Score} / 5 & \text{Number of students} \\ 0 & 5 \\ 1 & 6 \\ 2 & 14 \\ 3 & 25 \\ 4 & 40 \\ 5 & 10 \end{array}$
Compute $σ_{X}$ . What percentage of the observations are within one standard deviation of the mean?

Solution: We obtain the following additional column of relative frequencies (i.e., probabilities).
$\begin{array}{ccc} X : \text{Score} / 5 & \text{Number of students} & \text{Relative frequency, } p (x) \\ 0 & 5 & 5 / 100 = 0.05 \\ 1 & 6 & 6 / 100 = 0.06 \\ 2 & 14 & 14 / 100 = 0.14 \\ 3 & 25 & 25 / 100 = 0.25 \\ 4 & 40 & 40 / 100 = 0.40 \\ 5 & 10 & 10 / 100 = 0.10 \\ \text{Total} & 100 & 1 \end{array}$
$E [X] = \sum x \cdot p (x) = 0 (.05) + 1 (.06) + 2 (.14) + 3 (.25) + 4 (.4) + 5 (.1) = 3.19$
$E [X^{2}] = \sum x^{2} \cdot p (x) = 0^{2} (.05) + 1^{2} (.06) + 2^{2} (.14) + 3^{2} (.25) + 4^{2} (.4) + 5^{2} (.1) = 11.77$
$\text{variance} = σ_{X}^{2} = 11.77 - {3.19}^{2} = 1.5939 ⟹ σ_{X} = \sqrt{1.5939} \approx 1.26$
The bounds for one standard deviation of the mean are: $[\text{mean} - 1 σ_{X}, \text{mean} + 1 σ_{X}] = [3.19 - 1.26, 3.19 + 1.26] = [1.93, 4.45]$ ,
so the scores between 1.93 and 4.45 are: 2, 3, 4. Therefore, the percentage of observations with scores of
2, 3, or 4 is: $0.14 + 0.25 + 0.40 = 0.79 = 79\%$ .
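
The same computation can be sketched in Python, using the definition $σ_{X}^{2} = E [X^{2}] - [E (X)]^{2}$ with relative frequencies as probabilities. The variable names are our own and the frequencies are copied from the table in the example.

```python
from math import sqrt

# Frequency table: score -> number of students.
freq = {0: 5, 1: 6, 2: 14, 3: 25, 4: 40, 5: 10}
n = sum(freq.values())

# Relative frequencies play the role of probabilities p(x).
p = {x: count / n for x, count in freq.items()}

mean = sum(x * px for x, px in p.items())        # E[X]
mean_sq = sum(x**2 * px for x, px in p.items())  # E[X^2]
variance = mean_sq - mean**2
sd = sqrt(variance)

# Share of observations within one standard deviation of the mean.
low, high = mean - sd, mean + sd
within = sum(count for x, count in freq.items() if low <= x <= high) / n

print(round(mean, 2), round(variance, 4), round(sd, 2), within)
```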

Test of Hypothesis

Obs	1	2	3	4	5	6	7	8	9	10	11	12
$x_{1}$	1.25	2.16	4.33	1.68	6.63	4.19	2.85	3.29	2.66	1.73	8.19	5.83
$x_{2}$	2.1	3.08	2.81	1.72	5.91	2.26	4.07	4.11	3.41	2.68	7.33	8.25
$x_{3}$	5.15	1.52	4.16	5.8	2.56	1.78	3.82	3.25	2.51	5.95	8.92	7.32
$x_{4}$	1.34	1.96	3.17	4.14	2.22	1.88	2.19	4.92	2.63	4.73	7.21	5.93
$x_{5}$	3.26	2.67	2.18	7.12	3.06	.86	1.43	1.02	4.08	3.21	8.56	5.99
$y$	1.4	.86	1.09	2.84	.96	.81	.23	.14	1.46	1.1	2.92	1.22

Consider the table above, consisting of five explanatory variables, $x_{1}$ , $x_{2}$ , $x_{3}$ , $x_{4}$ , and $x_{5}$ , and one response variable $y$ .
There are 12 observations (obs) in the data. An analyst claims that the variables $x_{3}$ and $x_{4}$ do not contribute significantly to predicting the
value of $y$ . Use the partial F-test to investigate this claim at the $α = 0.05$ significance level.

Solution: We want to test whether the reduced model,
$\text{Model 1}: y_{i} = β_{0} + β_{1} x_{1 i} + β_{2} x_{2 i} + β_{5} x_{5 i} + ϵ_{i}$ ,
is adequate compared with the full model,
$\text{Model 2}: y_{i} = β_{0} + β_{1} x_{1 i} + β_{2} x_{2 i} + β_{3} x_{3 i} + β_{4} x_{4 i} + β_{5} x_{5 i} + ϵ_{i}$ .
So, the null and alternative hypotheses are respectively:
$H_{0} : β_{3} = β_{4} = 0$
$H_{1} : \text{at least one of } β_{3} \text{ or } β_{4} \text{ is not zero}$
The test statistic is given by:
$F_{0} = \frac{[SSE(\text{reduced}) - SSE(\text{full})] / p}{MSE(\text{full})}$ , where $MSE(\text{full}) = \frac{SSE(\text{full})}{n - (k + 1)}$
(MSE: mean squared error; SSE: sum of squared errors),
$p =$ number of explanatory variables removed from the full model = 2 ( $x_{3}$ and $x_{4}$ ),
$k =$ number of explanatory variables in the full model (= 5),
$n =$ number of observations in the data set (= 12).
Using R, we get the least-squares regression equations
for the reduced and full models to be respectively:
$\text{Model 1} : y = .4276 + .1666 x_{1} - .2946 x_{2} + .3798 x_{5}$
$\text{Model 2} : y = .4309 + .1690 x_{1} - .2885 x_{2} + .0172 x_{3} - .0318 x_{4} + .3799 x_{5}$
From the R output, we see that the coefficients $β_{3}$ and $β_{4}$ are not significant, supporting
the analyst’s claim, hence we do not reject the null hypothesis.
Nb. If we had done this by hand, then we would have needed to compute the test statistic, $F_{0}$ ,
and then compare it to the critical value $F_{α}$ , which can be obtained from R using the
F-distribution quantile command. This command takes two degrees of freedom, one for the numerator ( $p = 2$ ) and the other
for the denominator ( $df_{\text{denom}} = n - (k + 1) = 12 - 6 = 6$ ): $\text{qf} (0.95, 2, 6) = 5.14$ . It is a good
exercise to verify that $F_{0} < F_{α}$ , so that we do not reject $H_{0}$ , and the explanatory
variables $x_{3}$ and $x_{4}$ do not significantly contribute to the prediction of $y$ .
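
For readers who want to attempt the exercise of computing $F_{0}$ , here is a hedged sketch in plain Python rather than R. It fits both models by solving the normal equations with Gaussian elimination, which is a simplification chosen to avoid external libraries and is not how R's `lm` works internally. The helper names `solve` and `fit_sse` are hypothetical, the data are copied from the table above, and the critical value 5.14 is the one quoted in the text.

```python
# Data copied from the table above: 12 observations.
x1 = [1.25, 2.16, 4.33, 1.68, 6.63, 4.19, 2.85, 3.29, 2.66, 1.73, 8.19, 5.83]
x2 = [2.1, 3.08, 2.81, 1.72, 5.91, 2.26, 4.07, 4.11, 3.41, 2.68, 7.33, 8.25]
x3 = [5.15, 1.52, 4.16, 5.8, 2.56, 1.78, 3.82, 3.25, 2.51, 5.95, 8.92, 7.32]
x4 = [1.34, 1.96, 3.17, 4.14, 2.22, 1.88, 2.19, 4.92, 2.63, 4.73, 7.21, 5.93]
x5 = [3.26, 2.67, 2.18, 7.12, 3.06, 0.86, 1.43, 1.02, 4.08, 3.21, 8.56, 5.99]
y = [1.4, 0.86, 1.09, 2.84, 0.96, 0.81, 0.23, 0.14, 1.46, 1.1, 2.92, 1.22]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    m = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * m
    for r in reversed(range(m)):  # back substitution
        s = sum(aug[r][c] * x[c] for c in range(r + 1, m))
        x[r] = (aug[r][m] - s) / aug[r][r]
    return x

def fit_sse(predictors, y):
    """Least-squares fit with an intercept; return (coefficients, SSE)."""
    rows = [[1.0] + list(obs) for obs in zip(*predictors)]
    m = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(m)]
    beta = solve(xtx, xty)
    sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return beta, sse

n, k, p = len(y), 5, 2
beta_reduced, sse_reduced = fit_sse([x1, x2, x5], y)
beta_full, sse_full = fit_sse([x1, x2, x3, x4, x5], y)

mse_full = sse_full / (n - (k + 1))
f0 = ((sse_reduced - sse_full) / p) / mse_full
f_crit = 5.14  # qf(0.95, 2, 6) from R, as quoted in the text

print("reduced coefficients:", [round(b, 4) for b in beta_reduced])
print("full coefficients:", [round(b, 4) for b in beta_full])
print("F0 =", round(f0, 4), "| reject H0:", f0 > f_crit)
```

The printed coefficients can be compared with the regression equations reported from R; small differences in the last decimal place would be expected from rounding.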
