Statistical Foundations of Data Science


  • Experiments are subject to random variation, i.e. the outcome of an experiment cannot be predicted exactly
  • Statistics aims to do the following:
    • To keep uncertainty to a minimum
    • To quantify the remaining uncertainty
    • To distinguish between real differences and random variation
  • A random variable is a quantity whose value is subject to random variation; an observed value of it is known as an observation
  • Qualitative = nominal (e.g. gender) or ordinal (degree of severity of burns)
  • Quantitative = discrete (number of purchases) or continuous (weight or temperature)


Q) What are the aims of statistics? A) To keep uncertainty to a minimum, to quantify the remaining uncertainty, and to distinguish between real differences and random variation

Q) What are the different types of data? A) Quantitative (continuous or discrete) and qualitative (nominal or ordinal)

Numerical summaries

  • Summary statistics
    • Measures of location (mean, median, mode etc.)
    • Mean divided by median is a simple way of showing inequality (skew) in data: a ratio well above 1 means a few large values are pulling the mean up
    • Measures of spread (SD, variance etc.)

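Sample variance is described as:

  s^2 = (1 / (n − 1)) × Σ (x_i − x̄)^2, summing over the n observations x_1, ..., x_n, where x̄ is the sample mean

  • Standard deviation is just the square root of the variance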
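As a rough illustration of the summaries above, the sketch below uses Python's standard `statistics` module on a made-up sample (the numbers are hypothetical, not from these notes). It assumes the usual n − 1 divisor for the sample variance.

```python
import statistics

# Hypothetical sample, right-skewed by one large value
data = [12, 15, 15, 18, 21, 24, 90]

# Measures of location
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of spread
variance = statistics.variance(data)  # sample variance, n - 1 divisor
sd = statistics.stdev(data)           # square root of the sample variance

# Mean / median: above 1 here because the value 90 pulls the mean up
ratio = mean / median

print(mean, median, mode, variance, sd, ratio)
```

The single large value moves the mean well above the median, which is why the ratio comes out above 1.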

Graphical summaries

Graphical displays can be helpful for highlighting main features:

  • Shape, location, variation, outliers and clusters etc.
  • Stem-and-leaf plots and histograms are good for giving a summary of data

Classical Probability

  • Frequentist view: the probability of an event is its long-run relative frequency; the longer we repeat an experiment, the closer the observed average gets to the expected result. Casinos are a prime example of this
  • The union of two events A, B is the event consisting of all the outcomes that are in A, in B, or in both
  • The intersection of two events A, B is the event consisting of all the outcomes that are in both A and B
  • ∪ means ‘or’ (union); ∩ means ‘and’ (intersection)
  • Two events are mutually exclusive when they cannot both occur at the same time, i.e. Pr(A ∩ B) = 0; in general, two events do not need to be mutually exclusive
  • To calculate the probability of A or B, add the probabilities of the two events and then subtract the probability of both happening: Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
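The addition rule can be checked by counting outcomes. The sketch below assumes one roll of a fair six-sided die, with two hypothetical events chosen only for illustration.

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # one roll of a fair six-sided die
A = {2, 4, 6}                # event A: the roll is even
B = {4, 5, 6}                # event B: the roll is greater than 3


def pr(event):
    """Classical probability: favourable outcomes / all outcomes."""
    return Fraction(len(event), len(outcomes))


union = A | B         # outcomes in A, in B, or in both
intersection = A & B  # outcomes in both A and B

lhs = pr(union)
rhs = pr(A) + pr(B) - pr(intersection)

print(lhs, rhs)  # both are 2/3
```

Without subtracting Pr(A ∩ B), the outcomes 4 and 6 would be counted twice.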

Formulas for working out probability

Discrete Probability

Part 1

  • X is a random variable
  • Discrete random variables take a countable number of values
  • The probability mass function (pmf) of a random variable defines the probability of each possible observed value
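  • The cumulative distribution function (cdf) gives the probability that the random variable is less than or equal to each value

The PMF gives the probability of each individual value, whereas the CDF is the cumulative sum of these probabilities up to and including a given value.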
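To make the PMF and CDF relationship concrete, here is a minimal sketch for a fair six-sided die (an assumed example, not one from these notes). The CDF is built as a running total of the PMF.

```python
from fractions import Fraction
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6]

# PMF: probability of each individual value
pmf = {x: Fraction(1, 6) for x in values}

# CDF: running total of the PMF, i.e. Pr(X <= x)
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))

for x in values:
    print(x, pmf[x], cdf[x])
```

The last CDF value is always 1, because it adds up the probabilities of every possible value.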

Part 2

Discrete random variables include the Binomial, Geometric and Poisson. A distribution describes all the possible outcomes of the data and how frequently they occur.

  • Normal distribution – bell curve, arises naturally in many situations (note that it is continuous, not discrete)
  • Binomial distribution – the number of successes when a pass or fail experiment is replicated a fixed number of times; each outcome needs to be true or false
  • Poisson distribution – the number of events occurring in a set period of time

A binomial random variable needs a fixed number of independent trials, each with the same constant probability of success.

The Poisson distribution models ‘counts per fixed interval’; for a Poisson random variable the expectation is equal to the variance.
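The two mass functions can be written out directly. This sketch uses the standard binomial and Poisson formulas with hypothetical parameter values (n = 10, p = 0.3, rate 4), and checks numerically that the Poisson expectation equals its variance.

```python
import math


def binomial_pmf(k, n, p):
    """Pr(X = k) for k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Pr(X = k) for a count with mean lam per fixed interval."""
    return math.exp(-lam) * lam**k / math.factorial(k)


n, p = 10, 0.3  # hypothetical binomial parameters
lam = 4.0       # hypothetical Poisson rate

# The binomial probabilities over k = 0..n add up to 1
binom_total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

# Poisson mean and variance, truncating the infinite sum at k = 99
ks = range(100)
pois_mean = sum(k * poisson_pmf(k, lam) for k in ks)
pois_var = sum((k - pois_mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(binom_total, pois_mean, pois_var)
```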

Continuous Probability

For continuous variables such as height, you can’t have the probability of exactly one number, e.g. 175, because we could always measure it more accurately. Instead we work with intervals: a height recorded as 170 cm means 169.5 ≤ x < 170.5 cm

Cumulative distribution function is F(x) = Pr(X ≤ x)

The gradient of F(x), i.e. its derivative, is known as the probability density function, f(x)

It plays the equivalent role of the mass function, and is the rate of change of the cumulative distribution function at a given point

PDF can’t be negative

When working with continuous variables, probabilities are displayed as areas underneath the PDF

The total area underneath the PDF must equal 1, because this covers all possible outcomes, i.e. probability 1 (100%)

To find the probability between two numbers a and b, integrate f(x) between them: Pr(a < X ≤ b) = ∫ f(x) dx from a to b = F(b) − F(a).

Uniform distribution

How to calculate the probability between two points
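For a uniform distribution the PDF is flat, so the probability between two points is the area of a rectangle. The sketch below assumes X is uniform on a hypothetical range [a, b] = [0, 10] and compares that rectangle area with a simple numerical integral of the PDF.

```python
def uniform_pdf(x, a, b):
    """Flat density 1 / (b - a) on [a, b], zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_prob(c, d, a, b):
    """Pr(c < X < d) for X uniform on [a, b]: width of the overlap / (b - a)."""
    lo, hi = max(c, a), min(d, b)
    return max(hi - lo, 0.0) / (b - a)


def integrate(f, lo, hi, steps=10_000):
    """Midpoint-rule approximation to the integral of f from lo to hi."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


a, b = 0.0, 10.0  # hypothetical range of the uniform variable

exact = uniform_prob(2.0, 5.0, a, b)
approx = integrate(lambda x: uniform_pdf(x, a, b), 2.0, 5.0)
total_area = integrate(lambda x: uniform_pdf(x, a, b), a, b)

print(exact, approx, total_area)
```

The same `integrate` helper works for any PDF, which is the general rule: probability is the area under the curve between the two points.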

The exponential distribution

A random variable has an exponential distribution with rate lambda, where lambda is greater than 0, if its PDF is f(x) = lambda exp(−lambda x) for x ≥ 0.

The Poisson process

Events occurring along a continuous interval can be modelled as a Poisson process with rate lambda, where the time or distance between events is called the inter-arrival time. The inter-arrival time can be shown to have an exponential distribution with parameter lambda.
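One way to see the link between the two distributions: waiting longer than t for the next event is the same as seeing no events in an interval of length t. The sketch below checks this with hypothetical values lambda = 2 and t = 0.5, using the standard exponential CDF and Poisson PMF.

```python
import math


def exponential_cdf(t, lam):
    """Pr(T <= t) for an exponential inter-arrival time with rate lam."""
    return 1.0 - math.exp(-lam * t) if t >= 0 else 0.0


def poisson_pmf(k, mean):
    """Pr(N = k) for a Poisson count with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


lam = 2.0  # hypothetical rate: events per unit of time
t = 0.5    # hypothetical interval length

# Pr(inter-arrival time is longer than t)
p_wait_longer = 1.0 - exponential_cdf(t, lam)

# Pr(no events in an interval of length t); the count has mean lam * t
p_no_events = poisson_pmf(0, lam * t)

print(p_wait_longer, p_no_events)
```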

Normal distribution

If something is normally distributed then about 68.3% of the values lie within one standard deviation of the mean and about 95% lie within two standard deviations of the mean.

A normally distributed random variable with mean mu and standard deviation sigma has the PDF f(x) = (1 / (sigma √(2π))) exp(−(x − mu)^2 / (2 sigma^2))

This gives a bell-shaped curve; the PDF for the normal distribution can also be obtained through ‘dnorm’ in R.

For example, if we want to find Pr(X > 4.5) then we need to find the CDF at 4.5 and subtract it from 1: Pr(X > 4.5) = 1 − F(4.5).
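The same calculations can be sketched in Python with `statistics.NormalDist` from the standard library. The mean of 4 and standard deviation of 1 are assumptions made only for this example, since the notes do not give the parameters.

```python
from statistics import NormalDist

# Hypothetical parameters: mean 4, standard deviation 1
X = NormalDist(mu=4.0, sigma=1.0)

# Height of the bell curve at the mean (the same quantity dnorm returns in R)
density_at_mean = X.pdf(4.0)

# Pr(X > 4.5) = 1 - F(4.5)
upper_tail = 1.0 - X.cdf(4.5)

# Share of values within one and two standard deviations of the mean
within_one_sd = X.cdf(5.0) - X.cdf(3.0)
within_two_sd = X.cdf(6.0) - X.cdf(2.0)

print(density_at_mean, upper_tail, within_one_sd, within_two_sd)
```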

The sample mean and the central limit theorem

Assumptions are made that the samples used are independent and identically distributed, with a common expectation and a common variance

Expectation of the sample mean = population mean (mu)

Variance of the sample mean = population standard deviation squared, divided by the number of observations (sigma^2 / n)

The central limit theorem states that, whatever the individual distributions of the variables are (provided the variance is finite), the sample mean is approximately normally distributed when the sample size is large.
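A small simulation can illustrate this. The sketch below draws repeated samples from an exponential distribution with rate 1 (mean 1, variance 1, heavily skewed) and looks at how the sample means behave. The sample size, number of repeats and seed are arbitrary choices for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 30          # observations per sample (hypothetical choice)
repeats = 5000  # number of samples drawn

# Each observation is exponential with rate 1, which is far from normal
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(repeats)
]

mean_of_means = statistics.fmean(sample_means)    # should be close to mu = 1
var_of_means = statistics.variance(sample_means)  # should be close to sigma^2 / n
standard_error = (1.0 / n) ** 0.5

# For an approximately normal sample mean, about 95% should fall within 2 standard errors
within_two_se = sum(
    abs(m - 1.0) <= 2 * standard_error for m in sample_means
) / repeats

print(mean_of_means, var_of_means, within_two_se)
```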


Inference is just the process of learning from data, specifically from a sample of data when gathering the whole data is not feasible.

There are two schools of thought:

  • Bayesian, subjective viewpoint
  • Frequentist, limiting relative frequency

Estimation: Point estimation, Maximum likelihood, Confidence intervals


The sample needs to be representative of the population; random sampling is the usual way to achieve this

Parameter estimation

Previously it has been shown that the sample mean is a good estimator of the population mean, as it gives the correct answer on average over numerous samples. An estimator is written T_n in notation, and its distribution over repeated samples is called the sampling distribution.

Two alternative estimators are:

  • T1 = X1, so the estimate would be the first observation
  • T2 = X_n / n, so the final observation divided by the number of observations

Biased and unbiased estimators

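An estimator is unbiased if its expectation equals the true parameter value. Among unbiased estimators, choose the one with the smallest variance


The estimator is consistent if larger samples can be expected to give more precise estimates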
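The idea of choosing between estimators can be sketched by simulation. The example below compares the first observation X1 with the sample mean, both used to estimate the mean of a normal population. The population parameters, sample size and seed are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

mu, sigma = 10.0, 2.0  # hypothetical population mean and standard deviation
n = 25                 # observations per sample
repeats = 4000         # number of samples drawn

first_obs_estimates = []
sample_mean_estimates = []
for _ in range(repeats):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    first_obs_estimates.append(sample[0])                  # estimator X1
    sample_mean_estimates.append(statistics.fmean(sample))  # sample mean

# Both averages should be close to mu, so neither estimator is biased
print(statistics.fmean(first_obs_estimates), statistics.fmean(sample_mean_estimates))

# The variances differ: roughly sigma^2 for X1 and sigma^2 / n for the sample mean
var_first = statistics.variance(first_obs_estimates)
var_mean = statistics.variance(sample_mean_estimates)
print(var_first, var_mean)
```

Both estimators are right on average, but the sample mean is far less variable, and its variance keeps shrinking as n grows, which is what consistency describes.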

Likelihood methods

  • When the parameters of a distribution aren’t known, likelihood methods are used to try to construct good estimators of them.
  • The likelihood function measures the goodness of fit of the statistical model to sample data, for the given values of the unknown parameter. It is described differently for discrete and continuous distributions.
  • For example, for a coin whose probability of landing on heads is PH, if half of the observed tosses are heads then the likelihood function, plotted against PH, peaks at 0.5 and falls away on either side

Likelihood function – single observation

L(theta | x) = p(x | theta)

You need to use calculus to find the maximum of the function: differentiate the likelihood (or its logarithm) with respect to theta and set the derivative to zero to find the maximum point.
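As a numerical stand-in for the calculus, the sketch below evaluates a coin-toss log-likelihood on a grid of theta values and picks the largest. The data (7 heads in 10 tosses) are hypothetical, and a grid search is used only for illustration; differentiating gives the same answer, heads / tosses.

```python
import math


def log_likelihood(theta, heads, tosses):
    """Binomial log-likelihood of theta, dropping the constant that does not depend on theta."""
    return heads * math.log(theta) + (tosses - heads) * math.log(1 - theta)


heads, tosses = 7, 10  # hypothetical data

# Candidate values of theta strictly between 0 and 1
grid = [i / 1000 for i in range(1, 1000)]

# Maximum likelihood estimate: the theta with the highest log-likelihood
mle = max(grid, key=lambda theta: log_likelihood(theta, heads, tosses))

print(mle, heads / tosses)
```

Working with the logarithm does not move the maximum, and it turns products of probabilities into sums, which are easier to differentiate.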

Confidence Intervals

Confidence intervals are used to specify a range of plausible values around the estimate. A 95% confidence interval is built so that, over repeated samples, the interval would contain the true value 95% of the time, hence the name. If we know that the data is normally distributed then 95% of the values will fall within two standard deviations (more precisely 1.96) of the mean.
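A minimal sketch of a 95% confidence interval for a mean follows. It assumes normally distributed data with a known standard deviation, so the normal multiplier 1.96 applies; the sample values and sigma are hypothetical.

```python
import math
from statistics import NormalDist, fmean

# Hypothetical sample of measurements
sample = [9.1, 10.4, 11.2, 8.7, 10.9, 9.8, 10.1, 9.6, 11.5, 10.3]
sigma = 1.0  # ASSUMPTION: population standard deviation is known

# Multiplier that leaves 2.5% in each tail of the standard normal (about 1.96)
z = NormalDist().inv_cdf(0.975)

centre = fmean(sample)
half_width = z * sigma / math.sqrt(len(sample))  # z times the standard error
lower, upper = centre - half_width, centre + half_width

print(lower, upper)
```

The interval narrows as the sample size grows, because the standard error sigma / sqrt(n) shrinks.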

Hypothesis Testing

Data is normally collected to test a hypothesis, such as in a scientific test, a comparison or A/B testing. Each test gives a P-value as its result:

P-value Interpretation

  • p > 0.1, No evidence against H0: do not reject H0.
  • 0.05 < p < 0.1, Slight evidence against H0, but not enough to reject it.
  • 0.01 < p < 0.05, Moderate evidence against H0: reject it and go with H1.
  • 0.001 < p < 0.01, Strong evidence against H0: reject it and go with H1.
  • p < 0.001, Very strong evidence against H0: reject it and go with H1.


  • H0: Theta = 0.6
  • H1: Theta > 0.6
  • p = Pr(X ≥ 38 | Theta = 0.6) = 1 − Pr(X ≤ 37 | Theta = 0.6) = 0.0132

The P-value is less than 0.05, so we would say it is statistically significant
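A one-sided binomial P-value of this kind can be sketched as below. The notes do not state the number of trials, so n = 50 is an assumption made purely for illustration, and the tail is read as Pr(X ≥ 38) with Theta = 0.6 under H0.

```python
import math


def binomial_upper_tail(x, n, theta):
    """Pr(X >= x) for X ~ Binomial(n, theta)."""
    return sum(
        math.comb(n, k) * theta**k * (1 - theta) ** (n - k)
        for k in range(x, n + 1)
    )


n = 50         # ASSUMPTION: number of trials, not given in the notes
theta0 = 0.6   # value of Theta under H0
observed = 38  # observed number of successes

# One-sided P-value for H1: Theta > 0.6
p_value = binomial_upper_tail(observed, n, theta0)

# Same quantity written as 1 - Pr(X <= 37)
p_value_alt = 1.0 - (1.0 - binomial_upper_tail(observed, n, theta0))

reject_h0 = p_value < 0.05
print(p_value, reject_h0)
```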


The test statistic is a standardised difference between the sample mean and the hypothesised mean, so it is used as evidence to weigh up against H0

Linear Algebra

An m × n matrix is a grid that stores mn items of data: m rows, each of length n
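As a small sketch, a matrix can be held in Python as a list of rows. The values below are hypothetical; the point is only that m rows of length n hold mn items.

```python
# A 2 x 3 matrix stored as a list of rows (hypothetical values)
matrix = [
    [1, 2, 3],
    [4, 5, 6],
]

m = len(matrix)                         # number of rows
n = len(matrix[0])                      # length of each row
items = sum(len(row) for row in matrix) # total items stored

print(m, n, items)
```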