Statistics

Statistics is the science of collecting, summarizing, presenting, and
analyzing data in order to answer research questions. Any conclusion
reached should be accompanied by a measure of its reliability.

In this tutorial, we cover the following topics:

  • mode
  • median, percentiles, and quartiles
  • mean
  • standard deviation
  • hypothesis testing

Mode

The mode is the value that occurs most often in the data. It is most often used with categorical data, that is, data whose values are labels for different possible categories. The mode can also be defined for continuous quantitative data, in which case it is the value at which the probability density function attains its maximum, and methods of calculus are used to find it.

Example:

Consider the data in the Table below.

Phone type      Number of sales
Samsung         205
iPhone          443
Google Pixel    191
Xiaomi          328
Realme GT       263

What is the mode of this data?

Solution: The mode is the phone type with the highest number of sales, namely iPhone (443 sales).
N.B. The data in the table are categorical (i.e., qualitative).
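
As a quick check, the mode of a frequency table can be found by looking up the largest count. The following is a minimal Python sketch using the sales figures from the table above; the helper name `mode_from_counts` is made up for this illustration, and it returns a list so that ties are reported rather than hidden.

```python
# Sales counts copied from the phone sales table in this example.
sales = {
    "Samsung": 205,
    "iPhone": 443,
    "Google Pixel": 191,
    "Xiaomi": 328,
    "Realme GT": 263,
}


def mode_from_counts(counts):
    """Return the category (or categories, if tied) with the largest count."""
    highest = max(counts.values())
    return [label for label, count in counts.items() if count == highest]


print(mode_from_counts(sales))  # ['iPhone']
```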

Median, percentiles, and quartiles

The median of a dataset is the number that divides the sorted data into two equal halves: a lower half and an upper half.
The median and the mean are two of the three measures of center (the third being the mode). Both are used with quantitative data, whereas the mode is usually the measure of center reported for categorical data. When the data are symmetric, the mean (or average) is the better measure of center to report because its calculation uses every value in the data. When the data are skewed, as income data often are, the median is the better choice: it depends only on the middle value(s), which makes it more resistant to outliers than the mean. An outlier is an observation that does not follow the general pattern in the data, such as a value that is much larger or much smaller than the rest of the data values.
So, the presence of an outlier can pull the mean well below or above what it would have been without the outlier, while the median is largely unaffected.
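
To see the effect of an outlier on the two measures, here is a small Python sketch using the standard `statistics` module. The income values are hypothetical and chosen only for illustration; they do not come from this tutorial.

```python
from statistics import mean, median

# Hypothetical incomes (in thousands); illustrative values only.
incomes = [32, 35, 38, 41, 44]
with_outlier = incomes + [400]  # add one very large value

print(mean(incomes), median(incomes))            # 38 38
print(mean(with_outlier), median(with_outlier))  # mean is about 98.33, median is 39.5
```

One extreme value more than doubles the mean, while the median moves only from 38 to 39.5.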

Example:

A quiz consisting of five questions was taken by 100 students and the following Table displays the results. Find the median score.

Score (out of 5)    Number of students
0                   5
1                   6
2                   14
3                   25
4                   40
5                   10

Solution: The position of the median in the sorted data is given by the index formula $\frac{1}{2}(n+1)$, where $n = 100$ is the size of the sample. Here $\frac{1}{2}(n+1) = \frac{101}{2} = 50.5$, which when rounded down and up gives 50 and 51, respectively. The median is therefore the average of the two middle values, the 50th and the 51st.
We now create the following cumulative frequency table to help locate the 50th and 51st values in the data, once the data have been sorted from smallest to largest or vice versa (which is already the case for this question).
Cumulative frequency is the running total of all the frequencies up to and including the current frequency, so the first cumulative frequency is the same as the first frequency. N.B. Frequency here is the number of students at each score level.

Score (out of 5)    Number of students    Cumulative frequency
0                   5                     5
1                   6                     11
2                   14                    25
3                   25                    50
4                   40                    90
5                   10                    100

From the cumulative frequency column, we see that the 50th value is a score of 3 and
the 51st value is a score of 4. So, the median score is the average of 3 and 4: $\frac{3+4}{2} = 3.5$.
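
The same lookup can be sketched in Python. This is a minimal illustration of the cumulative-frequency approach for the quiz data, not a general-purpose routine; the helper `value_at` is a name introduced here for the example.

```python
import math
from itertools import accumulate

# Quiz scores and the number of students at each score (from the table).
scores = [0, 1, 2, 3, 4, 5]
freqs = [5, 6, 14, 25, 40, 10]

cum_freqs = list(accumulate(freqs))  # running totals: [5, 11, 25, 50, 90, 100]
n = cum_freqs[-1]                    # sample size, 100


def value_at(position):
    """Return the score at a 1-based position in the sorted data."""
    for score, cum in zip(scores, cum_freqs):
        if position <= cum:
            return score
    raise ValueError("position is larger than the sample size")


index = (n + 1) / 2                  # 50.5
lower_pos, upper_pos = math.floor(index), math.ceil(index)  # 50 and 51
median_score = (value_at(lower_pos) + value_at(upper_pos)) / 2
print(median_score)  # 3.5
```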

Percentiles are numbers which divide the sorted data into one hundred parts,
with the parts containing roughly the same number of observations. For example, a 70th percentile score on an exam
is a score such that about 70% of the observations are less than or equal to it and about 30% are greater than or equal to it.
Under the convention used here, the kth percentile is located at the index $\frac{k}{100}(n+1)$. If this index
is an integer, then the kth percentile is the observation at that index. Otherwise, the index is
rounded down and up, and the kth percentile is the average of the values at these two indices.
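
The rule above can be sketched as a short Python function. Two things here are assumptions rather than part of the rule as stated: the function name `percentile`, and the clamping of the index to the range 1 to n, which is added only so that very small or very large values of k do not fall outside the data. The sample values 1 to 10 are illustrative.

```python
import math


def percentile(data, k):
    """kth percentile using the index k/100 * (n + 1) on the sorted data."""
    values = sorted(data)
    n = len(values)
    index = k * (n + 1) / 100
    # Round the index down and up; clamp to valid 1-based positions (assumption).
    lo = min(max(math.floor(index), 1), n)
    hi = min(max(math.ceil(index), 1), n)
    # If the index is an integer, lo == hi and the average is that observation.
    return (values[lo - 1] + values[hi - 1]) / 2


sample = list(range(1, 11))    # illustrative data: 1, 2, ..., 10
print(percentile(sample, 50))  # index 5.5 -> average of 5 and 6 = 5.5
print(percentile(sample, 70))  # index 7.7 -> average of 7 and 8 = 7.5
```

Other textbooks and software packages use slightly different interpolation rules, so results may differ from this convention for small samples.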

Quartiles are a special case of percentiles: they divide the dataset into four parts, with each
part containing approximately the same number of observations. There are three quartiles, viz:

  1. first quartile, Q1
  2. second quartile, Q2
  3. third quartile, Q3

The first quartile, Q1, is the same as the 25th percentile.
The second quartile, Q2, is the same as the 50th percentile or median.
The third quartile, Q3, is the same as the 75th percentile.
The kth quartile is located at the index $\frac{k}{4}(n+1)$.

 

Example:

Find the five-number summary for the following data: 20, 4, 0, -6, 100, 6, 9, 3, 10, 20, 72

Solution: Sort the data from smallest to largest as follows:
-6, 0, 3, 4, 6, 9, 10, 20, 20, 72, 100
minimum = -6
lower half: -6, 0, 3, 4, 6; Q1 = median of lower half = 3
median = Q2 = 9 (the 6th of the n = 11 sorted values)
upper half: 10, 20, 20, 72, 100; Q3 = median of upper half = 20
maximum = 100
So, the five-number summary is: (min, Q1, Q2, Q3, max) = (-6, 3, 9, 20, 100).
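
The median-of-halves procedure used in this solution can be sketched in Python as follows. The function name is introduced for illustration, and the data list assumes that the smallest value is -6, as shown in the lower half of the solution. When n is odd, the median itself is left out of both halves, as in the worked example.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = values[half + 1:] if n % 2 else values[half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


data = [20, 4, 0, -6, 100, 6, 9, 3, 10, 20, 72]
print(five_number_summary(data))  # (-6, 3, 9, 20, 100)
```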

Mean

The mean or average is a measure of center. For a dataset, it is calculated by adding up the observations and then
dividing by the number of observations.
For a random variable, the mean is also called the expected value. For a discrete random variable with probability function $p(x)$, $E(X) = \sum_x x\,p(x)$. For a continuous random variable with probability density function (pdf) $f(x)$, $E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$.

Example:

A random variable X has the following pdf.
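$$f(x) = \begin{cases} x^2 & \text{if } 0 < x < 1 \\ \frac{1}{9} & \text{if } x = 1 \\ \frac{8}{9}x & \text{if } 1 < x < 1.5 \\ 0 & \text{otherwise} \end{cases}$$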
Compute E(X), the expected value of X.

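Solution: This is a mixed random variable because there is probability mass at x = 1 (this is the discrete part of
X) and the random variable also takes values in the intervals (0, 1) and (1, 1.5) (this is the continuous part of X).
$$\begin{aligned} E(X) &= \int_0^1 x f(x)\,dx + 1 \cdot f(1) + \int_1^{1.5} x f(x)\,dx \\ &= \int_0^1 x \cdot x^2\,dx + 1 \cdot \frac{1}{9} + \int_1^{1.5} x \cdot \frac{8}{9}x\,dx \\ &= \left.\frac{1}{4}x^4\right|_0^1 + \frac{1}{9} + \left.\frac{8}{27}x^3\right|_1^{1.5} \\ &= \frac{1}{4} + \frac{1}{9} + \frac{8}{27}\left(1.5^3 - 1\right) = \frac{115}{108} \approx 1.06 \end{aligned}$$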
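
The arithmetic in this solution can be checked with exact fractions. The sketch below uses Python's `fractions` module and plugs the limits into the antiderivatives found above; it also confirms that the continuous and discrete parts together carry total probability 1. The variable names are illustrative.

```python
from fractions import Fraction

upper = Fraction(3, 2)  # the right endpoint 1.5

# Total probability: integral of x^2 on (0, 1), mass at x = 1,
# and integral of (8/9)x on (1, 1.5), whose antiderivative is (4/9)x^2.
total_prob = Fraction(1, 3) + Fraction(1, 9) + Fraction(4, 9) * (upper**2 - 1)

# Expected value, term by term.
part_low = Fraction(1, 4)                       # x^4 / 4 evaluated from 0 to 1
part_mass = 1 * Fraction(1, 9)                  # 1 * f(1)
part_high = Fraction(8, 27) * (upper**3 - 1)    # (8/27) x^3 evaluated from 1 to 1.5
expected = part_low + part_mass + part_high

print(total_prob)                 # 1
print(expected, float(expected))  # 115/108, about 1.0648
```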

Standard Deviation

The standard deviation of a random variable, X, measures how far, on average, a randomly sampled observation
falls below or above the mean. There are a number of ways to find the standard deviation, $\sigma_X$, for a random variable, X,
but we will use the following definition:
$$\text{variance} = \sigma_X^2 = E[X^2] - (E[X])^2; \qquad \sigma_X = \sqrt{\text{variance}}$$

Example:

A random variable X has the following frequency distribution.
X: Score (out of 5)    Number of students
0                      5
1                      6
2                      14
3                      25
4                      40
5                      10

Compute $\sigma_X$. What percentage of the observations are within one standard deviation of the mean?

Solution: We obtain the following additional column of relative frequencies (i.e., probabilities).

X: Score (out of 5)    Number of students    Relative frequency, p(x)
0                      5                     5/100 = 0.05
1                      6                     6/100 = 0.06
2                      14                    14/100 = 0.14
3                      25                    25/100 = 0.25
4                      40                    40/100 = 0.40
5                      10                    10/100 = 0.10
Total                  100                   1

$$E[X] = \sum x\,p(x) = 0(.05) + 1(.06) + 2(.14) + 3(.25) + 4(.4) + 5(.1) = 3.19$$
$$E[X^2] = \sum x^2 p(x) = 0^2(.05) + 1^2(.06) + 2^2(.14) + 3^2(.25) + 4^2(.4) + 5^2(.1) = 11.77$$
$$\text{variance} = \sigma_X^2 = 11.77 - 3.19^2 = 1.5939; \qquad \sigma_X = \sqrt{1.5939} \approx 1.26$$
The bounds for one standard deviation of the mean are $[\text{mean} - \sigma_X, \text{mean} + \sigma_X] = [3.19 - 1.26, 3.19 + 1.26] = [1.93, 4.45]$,
so the scores between 1.93 and 4.45 are 2, 3, and 4. Therefore, the percentage of observations with scores of
2, 3, or 4 is 0.14 + 0.25 + 0.40 = 0.79 = 79%.
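
The computation above translates directly into a few lines of Python. This is a sketch for this particular frequency table; the variable names are illustrative, and results are rounded only when printed.

```python
import math

# Scores and frequencies from the table.
scores = [0, 1, 2, 3, 4, 5]
freqs = [5, 6, 14, 25, 40, 10]
n = sum(freqs)

probs = [f / n for f in freqs]                           # relative frequencies p(x)
mean = sum(x * p for x, p in zip(scores, probs))         # E[X]
mean_sq = sum(x**2 * p for x, p in zip(scores, probs))   # E[X^2]
variance = mean_sq - mean**2
sd = math.sqrt(variance)

# Count the observations whose score lies within one standard deviation of the mean.
within = sum(f for x, f in zip(scores, freqs) if mean - sd <= x <= mean + sd)
percent_within = 100 * within / n

print(round(mean, 2), round(mean_sq, 2), round(variance, 4), round(sd, 2))  # 3.19 11.77 1.5939 1.26
print(percent_within)  # 79.0
```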

Test of Hypothesis

Obs   1      2      3      4      5      6      7      8      9      10     11     12
x1    1.25   2.16   4.33   1.68   6.63   4.19   2.85   3.29   2.66   1.73   8.19   5.83
x2    2.1    3.08   2.81   1.72   5.91   2.26   4.07   4.11   3.41   2.68   7.33   8.25
x3    5.15   1.52   4.16   5.8    2.56   1.78   3.82   3.25   2.51   5.95   8.92   7.32
x4    1.34   1.96   3.17   4.14   2.22   1.88   2.19   4.92   2.63   4.73   7.21   5.93
x5    3.26   2.67   2.18   7.12   3.06   0.86   1.43   1.02   4.08   3.21   8.56   5.99
y     1.4    0.86   1.09   2.84   0.96   0.81   0.23   0.14   1.46   1.1    2.92   1.22


Consider the table above, consisting of five explanatory variables, $x_1$, $x_2$, $x_3$, $x_4$, and $x_5$, and one response variable $y$.
There are 12 observations (Obs) in the data. An analyst claims that the variables $x_3$ and $x_4$ do not contribute significantly to predicting the
value of $y$. Use the partial F-test to investigate this claim at the $\alpha = 0.05$ significance level.

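Solution: We want to test whether the reduced model given by
$$\text{Model 1:}\quad y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_5 x_{5i} + \epsilon_i$$
is preferable to the full model given by
$$\text{Model 2:}\quad y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i} + \epsilon_i$$
So, the null and alternative hypotheses are respectively:
$$H_0: \beta_3 = \beta_4 = 0 \qquad H_1: \text{at least one of } \beta_3 \text{ or } \beta_4 \text{ is not zero}$$
The test statistic is given by:
$$F_0 = \frac{\left[SSE(\text{reduced}) - SSE(\text{full})\right]/p}{MSE(\text{full})}, \quad \text{where } MSE(\text{full}) = \frac{SSE(\text{full})}{n-(k+1)}$$
(MSE: mean squared error; SSE: sum of squared errors),
p = number of explanatory variables removed from the full model = 2 ($x_3$ and $x_4$),
k = number of explanatory variables in the full model (= 5),
n = number of observations in the data set (= 12).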
Using R (full result from R is shown below), we get the least-squares regression equations
for the reduced and full models to be respectively:
$$\text{Model 1:}\quad \hat{y} = .4276 + .1666x_1 - .2946x_2 + .3798x_5$$
$$\text{Model 2:}\quad \hat{y} = .4309 + .1690x_1 - .2885x_2 + .0172x_3 - .0318x_4 + .3799x_5$$
From the R output above, we see that the coefficients $\beta_3$ and $\beta_4$ are not significant, which supports
the analyst's claim; hence we do not reject the null hypothesis.
N.B. If we had done this by hand, then we would have needed to compute the test statistic, $F_0$,
and then compare it to the critical value $F_\alpha$, which can be obtained from R using the
F-distribution quantile command with two degrees-of-freedom arguments, one for the numerator (p = 2) and the other
for the denominator ($df_{denom} = n-(k+1) = 12-6 = 6$): qf(0.95, 2, 6) = 5.14. It is a good
exercise to verify that $F_0 < F_\alpha$, so that we do not reject $H_0$ and conclude that the explanatory
variables $x_3$ and $x_4$ do not significantly contribute to the prediction of $y$.
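
For readers who want to attempt the by-hand exercise, the following Python sketch computes $F_0$ from the two sums of squared errors. It is an illustration under stated assumptions, not the R analysis referred to above: the least-squares fit is done with a small hand-written normal-equations solver (the helper names `solve` and `fit_sse` are made up here), the data values are transcribed from the table at the start of this section and should be checked against your own copy, and the critical value 5.14 is simply the `qf(0.95, 2, 6)` figure quoted in the text.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (m[r][size] - tail) / m[r][r]
    return x


def fit_sse(columns, y):
    """Least-squares fit of y on an intercept plus the given columns; returns (coefficients, SSE)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    width = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(width)] for i in range(width)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(width)]
    beta = solve(xtx, xty)
    residuals = [yi - sum(bj * rj for bj, rj in zip(beta, r)) for r, yi in zip(rows, y)]
    return beta, sum(e * e for e in residuals)


# Data transcribed from the table (assumption: check against the original values).
x1 = [1.25, 2.16, 4.33, 1.68, 6.63, 4.19, 2.85, 3.29, 2.66, 1.73, 8.19, 5.83]
x2 = [2.1, 3.08, 2.81, 1.72, 5.91, 2.26, 4.07, 4.11, 3.41, 2.68, 7.33, 8.25]
x3 = [5.15, 1.52, 4.16, 5.8, 2.56, 1.78, 3.82, 3.25, 2.51, 5.95, 8.92, 7.32]
x4 = [1.34, 1.96, 3.17, 4.14, 2.22, 1.88, 2.19, 4.92, 2.63, 4.73, 7.21, 5.93]
x5 = [3.26, 2.67, 2.18, 7.12, 3.06, 0.86, 1.43, 1.02, 4.08, 3.21, 8.56, 5.99]
y = [1.4, 0.86, 1.09, 2.84, 0.96, 0.81, 0.23, 0.14, 1.46, 1.1, 2.92, 1.22]

n, k, p = len(y), 5, 2
beta_full, sse_full = fit_sse([x1, x2, x3, x4, x5], y)  # Model 2
beta_reduced, sse_reduced = fit_sse([x1, x2, x5], y)    # Model 1

mse_full = sse_full / (n - (k + 1))
f0 = ((sse_reduced - sse_full) / p) / mse_full
f_critical = 5.14  # qf(0.95, 2, 6), as quoted in the text

print("reduced coefficients:", [round(b, 4) for b in beta_reduced])
print("full coefficients:   ", [round(b, 4) for b in beta_full])
print("F0 =", round(f0, 4), "| reject H0:", f0 > f_critical)
```

The printed coefficients can be compared with the two fitted equations reported from R as a check on the transcription of the data.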