- Experiments are subject to random variation, i.e. the outcome of an experiment cannot be predicted exactly
- Statistics aims to do the following:
- To keep uncertainty to a minimum
- To quantify the remaining uncertainty
- To distinguish between real differences and random variation
- A random variable is a quantity whose value is subject to random variation; a recorded value of one is called an observation
- Qualitative = nominal (e.g. gender) or ordinal (degree of severity of burns)
- Quantitative = discrete (number of purchases) or continuous (weight or temperature)
Q) What are the aims of statistics? A) To keep uncertainty to a minimum and to quantify the remaining uncertainty
Q) What are the different types of data? A) Quantitative (continuous or discrete), Qualitative (nominal or ordinal)
- Summary statistics
- The measure of location (mean, mode, median etc)
- The ratio of the mean to the median is a rough way of showing skew/inequality in data
- The measure of spread (SD, variance etc.)
Sample variance is described as: s² = Σ (xᵢ − x̄)² / (n − 1)
- Standard deviation is just the square root of the variance
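As a quick sketch of the summary statistics above (Python's standard library; the sample values here are made up for illustration):

```python
import statistics

data = [2.1, 3.5, 3.6, 4.0, 4.2, 5.8, 9.9]  # illustrative sample

mean = statistics.mean(data)      # measure of location
median = statistics.median(data)  # robust measure of location
var = statistics.variance(data)   # sample variance (divides by n - 1)
sd = statistics.stdev(data)       # standard deviation = sqrt(variance)

print(mean, median, var, sd)
```

Note that `statistics.variance` uses the n − 1 divisor, matching the sample-variance formula in these notes.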
Graphical displays can be helpful for highlighting main features:
- Shape, location, variation, outliers and clusters etc.
- Stem and leaf plots are good for a summary of data, also histograms
- Frequentist view: probability is the long-run relative frequency of an outcome as the experiment is repeated; casinos etc. rely on this
- Union event is the event consisting of all the outcomes that are in either A or in B, or in both
- The intersection of two events A, B is the event consisting of all the outcomes that are in both A and B
- ∪ (union) means "or"; ∩ (intersection) means "both/and"
- Mutually exclusive means two events cannot both occur at the same time; in general, two events need not be mutually exclusive
- When calculating the probability of A or B, add the probabilities of the two events and then subtract the probability of both happening: Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B)
Formulas for working out probability
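The addition rule can be checked on a small example (a fair die, with two made-up events; exact fractions via the standard library):

```python
from fractions import Fraction

outcomes = set(range(1, 7))  # one roll of a fair die
A = {2, 4, 6}                # event "even"
B = {4, 5, 6}                # event "greater than 3"

def pr(event):
    # equally likely outcomes: probability = favourable / total
    return Fraction(len(event), len(outcomes))

# Addition rule: Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
lhs = pr(A | B)
rhs = pr(A) + pr(B) - pr(A & B)
print(lhs, rhs)  # both 2/3
```

The subtraction removes the double-counting of outcomes (here 4 and 6) that lie in both events.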
- X is a random variable
- Discrete random variables take a countable number of values
- The probability mass function (pmf) of a random variable defines the probability of each possible observed value
- Cumulative distribution function
The pmf gives the probability of each individual value, whereas the CDF is the running (cumulative) sum of these probabilities up to a given value.
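The pmf/CDF relationship can be sketched for a binomial variable (parameters n = 10, p = 0.3 are an arbitrary illustration):

```python
import math

n, p = 10, 0.3  # illustrative Binomial(n, p)

def pmf(k):
    # probability of exactly k successes in n trials
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def cdf(k):
    # CDF is the running sum of the pmf up to and including k
    return sum(pmf(j) for j in range(k + 1))

print(cdf(n))  # the pmf over all values sums to 1
```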
Discrete random variables can include: Binomial, Geometric, Poisson – Distribution is analysing the potential outcomes of all the data, and how frequently they occur.
- Normal distribution – bell curve, arises naturally in many situations
- Binomial distribution – the number of successes in a survey or experiment replicated many times; each outcome must be binary (pass/fail, true/false)
- Poisson distribution – the probability of a given number of events occurring in a set period of time
A binomial random variable needs a fixed number of independent trials with a constant probability of success.
The Poisson distribution models 'counts per fixed interval'; for a Poisson random variable the expectation is equal to the variance
For continuous variables such as height, you can't assign probability to exactly one number, e.g. 175 – because we might always be able to measure more accurately – so instead we would state a range such as 169.5 ≤ X < 170.5 cm
Cumulative distribution function: F(x) = Pr(X ≤ x)
The gradient (derivative) of F(x) is known as the probability density function (pdf)
It plays the equivalent role to the mass function: the rate of change of the CDF at a given point
PDF can’t be negative
When working with continuous variables, probabilities are displayed as areas underneath the pdf
The total area underneath the pdf must equal 1, because the probabilities of all possible outcomes sum to 1 (100%)
To find the probability between two numbers, you need to integrate f(x) between the two numbers given.
How to calculate the probability between two points
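A minimal numerical sketch: integrate a pdf between two points and compare with the closed form. The exponential pdf and the interval (1, 3) are arbitrary choices for illustration.

```python
import math

lam = 0.5                                 # illustrative rate
f = lambda x: lam * math.exp(-lam * x)    # exponential pdf

def integrate(f, a, b, n=100_000):
    # midpoint rule: approximate the area under f between a and b
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

a, b = 1.0, 3.0
numeric = integrate(f, a, b)
exact = math.exp(-lam * a) - math.exp(-lam * b)  # closed form for the exponential
print(numeric, exact)
```

In practice one integrates analytically when possible; the numeric check just illustrates that the probability is the area under the pdf.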
The exponential distribution
Its pdf is f(x) = lambda · e^(−lambda·x) for x ≥ 0, where the rate lambda is greater than 0.
The Poisson process
Modelling events along a continuous interval as a Poisson process with rate lambda, the time or distance between events is called the inter-arrival time. This is shown to have an exponential distribution with parameter lambda.
If something is normally distributed then about 95% of the values lie within two standard deviations of the mean and about 68.3% lie within one standard deviation of the mean.
A normally distributed random variable has the pdf f(x) = (1 / (sigma·√(2π))) · exp(−(x − mu)² / (2·sigma²))
This is seen as a bell-shaped curve; the pdf for the normal distribution can also be evaluated in R with 'dnorm'.
For example, if we want to find Pr(X > 4.5), we need to find the CDF at the point 4.5 and subtract it from 1.
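A sketch of that upper-tail calculation using the standard erf-based formula for the normal CDF (the parameters mu = 3, sigma = 1 are assumed for illustration; the notes don't state them):

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    # Phi(z) = (1 + erf(z / sqrt(2))) / 2
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 3.0, 1.0  # assumed illustrative parameters

# Pr(X > 4.5) = 1 - F(4.5)
p_upper = 1.0 - norm_cdf(4.5, mu, sigma)
print(p_upper)
```

This is the same computation R's `pnorm` performs (with `lower.tail = FALSE` for the upper tail).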
The sample mean and the central limit theorem
Assumptions are made that the samples used are independent and identically distributed, with a common expectation and a common variance
Expectation = Mean
Variance of the sample mean = the population variance divided by the sample size (sigma² / n)
The central limit theorem states that whatever distribution the individual variables have (provided the variance is finite), the sample mean is approximately normally distributed for large samples.
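A simulation sketch of the CLT (sample size and repetition count are arbitrary choices): draw repeated samples from a decidedly non-normal Uniform(0, 1) distribution and look at the sample means.

```python
import random
import statistics

random.seed(1)
n = 50  # sample size
means = [statistics.mean(random.random() for _ in range(n))
         for _ in range(5_000)]

# Uniform(0,1) has expectation 1/2 and variance 1/12, so the sample
# mean should be approximately Normal(1/2, 1/(12n))
print(statistics.mean(means), statistics.stdev(means))
```

The spread of the sample means matches sigma/√n, the square root of the variance formula above.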
Inference is the process of learning from data, specifically from a sample when gathering data on the whole population is not feasible.
There are two schools of thought:
- Bayesian, subjective viewpoint
- Frequentist, limiting relative frequency
Estimation: Point estimation, Maximum likelihood, Confidence intervals
The sample needs to be representative of the population; random sampling does not guarantee this for any single sample, but it avoids systematic bias
Previously it has been shown that the sample mean is a good estimator of the population mean, as it gives the correct answer on average over numerous samples. An estimator is written Tn, and its distribution over repeated samples is called the sampling distribution.
Two alternatives of this are:
- T1 = X1, so the estimate is just the first observation
- T2 = Xn / n, so the final observation divided by the sample size
Biased and unbiased estimators
Choose the one with the smallest variance
The estimator is consistent if larger samples can be expected to give more precise estimates
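The "smaller variance / more precision with larger samples" idea can be illustrated by simulation (the normal population with mean 5 and the sample sizes are arbitrary): the sample mean's variance across repeated samples shrinks as n grows.

```python
import random
import statistics

random.seed(2)
mu = 5.0  # illustrative population mean

def sample_mean(n):
    # one sample of size n from Normal(mu, 1), reduced to its mean
    return statistics.mean(random.gauss(mu, 1.0) for _ in range(n))

# variance of the estimator across repeated samples, for two sample sizes
reps = 2_000
v_small = statistics.variance(sample_mean(10) for _ in range(reps))
v_large = statistics.variance(sample_mean(100) for _ in range(reps))
print(v_small, v_large)  # larger samples -> smaller variance
```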
- When the parameters of a distribution are unknown, likelihood methods are used to construct good estimators of them.
- The likelihood function measures the goodness of fit of the statistical model to the sample data, for given values of the unknown parameter. It is defined slightly differently for discrete and continuous distributions.
- For example, if a coin lands heads with probability p and a single head is observed, the likelihood L(p) = p is a straight increasing line; over many tosses the likelihood function peaks at the observed proportion of heads (0.5 for a fair coin) and takes a bell-like shape
Likelihood function – single observation
L(theta|X) = p(x|theta)
You need to use calculus to find the maximum of the likelihood function: differentiate (usually the log-likelihood) and set the derivative to zero to locate the peak.
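A sketch of maximum likelihood for the coin example (the data, 7 heads in 10 tosses, is made up; a grid search stands in for the calculus here, and the analytic answer heads/n is what the derivative gives):

```python
# Maximise the likelihood of a coin's heads-probability p
# given 7 heads in 10 tosses (illustrative data).
heads, n = 7, 10

def likelihood(p):
    # Pr(data | p) up to the binomial coefficient, which doesn't affect the max
    return p**heads * (1 - p)**(n - heads)

# grid search stands in for differentiating; the analytic MLE is heads/n
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)
print(p_hat)  # 0.7
```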
Confidence intervals are used to specify a range of plausible values around the estimate; we want the range that would capture the true value 95% of the time, hence the name 95% confidence interval. If we know the data are normally distributed, then 95% of values fall within 1.96 standard deviations of the mean.
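A minimal sketch of the normal-based 95% interval for a mean (the sample values are made up; this uses the 1.96 multiplier from the normality assumption above):

```python
import statistics

data = [4.9, 5.1, 5.3, 4.7, 5.0, 5.2, 4.8, 5.4]  # illustrative sample
n = len(data)
xbar = statistics.mean(data)
se = statistics.stdev(data) / n ** 0.5  # standard error of the mean

# normal-based 95% interval: mean +/- 1.96 standard errors
lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
print(lo, hi)
```

For small samples a t-multiplier would normally replace 1.96; the normal version matches the notes.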
Data is normally collected to test a hypothesis: scientific tests, comparisons, A/B testing etc. Each test yields a p-value, interpreted as follows:
- p > 0.1, No evidence against H0: do not reject H0.
- 0.05 < p < 0.1, Slight evidence against H0, but not enough to reject it.
- 0.01 < p < 0.05, Moderate evidence against H0: reject it and go with H1.
- 0.001 < p < 0.01, Strong evidence against H0: reject it and go with H1.
- p < 0.001, Very strong evidence against H0: reject it and go with H1.
- H0: Theta = 0.6
- H1: Theta > 0.6
- p = Pr(X ≥ 38 | theta = 0.6) = 1 − Pr(X ≤ 37 | theta = 0.6) = 0.0132
The P-value is less than 0.05, so we would say it is statistically significant
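A sketch of how such an upper-tail binomial p-value is computed. The notes don't state the number of trials, so n = 50 here is an assumption for illustration, and the result won't necessarily reproduce 0.0132.

```python
import math

# Upper-tail p-value for H0: theta = 0.6 vs H1: theta > 0.6,
# having observed x = 38 successes. n = 50 is an ASSUMED sample
# size; the notes do not state it.
n, theta, x = 50, 0.6, 38

def pmf(k):
    # Binomial(n, theta) pmf under the null hypothesis
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

p_value = sum(pmf(k) for k in range(x, n + 1))  # Pr(X >= x | theta)
print(p_value)
```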
The test statistic is a standardised difference between the sample mean and the hypothesised mean, used as evidence to weigh against H0
A matrix is a grid that stores mn items of data: m rows and n columns