Statistical learning for Data Science

Multivariate Data

  • Multivariate data are measurements or observations of p (> 1) variables on each of n items/individuals. When p = 1 the data are univariate; when p = 2 they are bivariate.

Here p is the number of variables and n is the number of observations.

The types of prediction

  • Regression – to predict a quantitative response | Ridge regression, the LASSO
  • Classification – to predict a categorical response | Logistic regression, discriminant analysis
  • Both of the above are classed as supervised learning, because the model is trained on data where the response is observed alongside the explanatory variables. Predictive performance is then checked with techniques such as cross-validation.

Types of analysis

  • Dimension reduction techniques: to reduce the original number of variables while losing as little information as possible
  • Cluster analysis: to find homogeneous sub-groups amongst the observations

Before modelling, check the data. A histogram of the dependent variable shows its distribution and any obvious problems such as skewness or outliers. It is also worth creating scatter plots of the independent variables so that you can see how correlated they are. If the number of variables is large, a heatmap of the correlation matrix is easier to read.

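  • Sample covariance – measures how two variables vary together; for more than two variables the pairwise values are collected in the sample covariance matrix
  • Correlation – quantifies the strength of the linear relationship between pairs of variables on a scale from -1 to 1. Prefer it to the sample covariance when the variables are measured on very different scales, because covariance depends on the units of measurement.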
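The point about scales can be checked numerically. The sketch below uses NumPy and a small invented data set (the heights and weights are hypothetical, chosen only for illustration) to show that changing units alters the covariance but leaves the correlation unchanged.

```python
import numpy as np

# Hypothetical toy data: n = 5 observations of p = 2 variables
# (height in cm, weight in kg), invented for illustration only.
X = np.array([
    [150.0, 52.0],
    [160.0, 58.0],
    [170.0, 69.0],
    [180.0, 77.0],
    [190.0, 84.0],
])

# Sample covariance matrix (divisor n - 1); columns are the variables.
S = np.cov(X, rowvar=False)

# Sample correlation matrix: covariance rescaled by the standard deviations.
R = np.corrcoef(X, rowvar=False)

# Re-express height in metres instead of centimetres.
X_m = X.copy()
X_m[:, 0] /= 100.0
S_m = np.cov(X_m, rowvar=False)
R_m = np.corrcoef(X_m, rowvar=False)

print("covariance (cm):", S[0, 1], " covariance (m):", S_m[0, 1])
print("correlation (cm):", R[0, 1], " correlation (m):", R_m[0, 1])
```

The covariance shrinks by a factor of 100 with the change of units, while the correlation is identical in both versions.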

Multivariate scatter

  • Measures the overall variability of several variables at once, summarised in a single number
  • Generalised variance – the determinant of the sample covariance matrix
  • Total variation – the sum of the diagonal elements of the sample covariance matrix, i.e. the sum of the individual variances
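Both summaries follow directly from the sample covariance matrix. A minimal NumPy sketch, using simulated data (the random data and variable names are assumptions made for the example):

```python
import numpy as np

# Simulated data, for illustration only: n = 50 observations, p = 3 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Sample covariance matrix (p x p), columns treated as variables.
S = np.cov(X, rowvar=False)

# Generalised variance: determinant of the sample covariance matrix.
generalised_variance = np.linalg.det(S)

# Total variation: trace of S, i.e. the sum of the individual variances.
total_variation = np.trace(S)

print(generalised_variance, total_variation)
```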

Standardisation transformation

  • Sometimes it is better to transform the data matrix so that each transformed variable has mean 0 and standard deviation 1, because this puts all the variables on one scale.
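The transformation subtracts each column's mean and divides by its standard deviation. A short sketch in NumPy (the function name `standardise` and the simulated data are assumptions for the example):

```python
import numpy as np

def standardise(X):
    """Centre each column to mean 0 and scale it to sample standard deviation 1."""
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)  # ddof=1 gives the sample standard deviation
    return (X - mean) / sd

# Simulated variables on very different scales, for illustration only.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(loc=170.0, scale=10.0, size=40),     # e.g. a length in cm
    rng.normal(loc=0.002, scale=0.0005, size=40),   # e.g. a concentration
])

Z = standardise(X)
print(Z.mean(axis=0), Z.std(axis=0, ddof=1))
```

A useful consequence is that the covariance matrix of the standardised data equals the correlation matrix of the original data.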

Unsupervised Learning Techniques

This section focuses on principal components analysis (PCA).

The aim of PCA is to summarise a set of correlated variables with a smaller number of new variables. The data should be standardised before conducting PCA, so that the range of each continuous initial variable is comparable and each one contributes equally to the analysis.

Principal components are a few variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are constructed so that the new variables are uncorrelated, with most of the information stored in the first component and progressively less in each later one.

This allows you to reduce the number of variables without discarding too much information, and use these principal components as your new variables. Principal components represent the directions of the data that explain a maximal amount of variance.

A common cut-off for the number of components is the point at which they account for a chosen share, for example 95%, of the total variance in the data.
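The construction described above can be sketched with an eigendecomposition of the correlation matrix. This is one standard way to compute principal components, not necessarily the one used in the course. The function name `pca`, the simulated data and the 95% threshold are assumptions for the example.

```python
import numpy as np

def pca(X):
    """PCA on standardised data via eigendecomposition of the correlation matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardise first
    R = np.cov(Z, rowvar=False)                         # correlation matrix of X
    eigvals, eigvecs = np.linalg.eigh(R)                # eigh: R is symmetric
    order = np.argsort(eigvals)[::-1]                   # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                                # the principal components
    explained = eigvals / eigvals.sum()                 # share of variance per component
    return scores, eigvecs, explained

# Simulated data: three variables driven by one latent factor plus one noise variable.
rng = np.random.default_rng(1)
latent = rng.normal(size=200)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=200),
    2.0 * latent + 0.1 * rng.normal(size=200),
    -latent + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

scores, loadings, explained = pca(X)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 95%.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(explained, n_keep)
```

The columns of `scores` are uncorrelated with one another, and `explained` is sorted from largest to smallest, matching the description of the components.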

One of the downsides of PCA is that the variables need to be on the same scale; if they are not, variables with large variances dominate the components and the others are under-represented.

Be careful when interpreting individual principal components: each one is a mixture of all the original variables, so it does not translate directly back to any one of them.

Cluster Analysis

Clustering refers to methods for finding unknown subgroups or clusters in data sets. Cluster analysis is a type of unsupervised machine learning, as there is no dependent variable to guide it.

The most used dissimilarity measure is called Euclidean distance

Correlation-based dissimilarity can be a better choice than Euclidean distance when the data are not on similar scales; alternatively, standardise the variables first.

K-means clustering

A simple and intuitive approach to partitioning the observations into K distinct, non-overlapping clusters. The dissimilarity metric used is squared Euclidean distance.

K is the number of clusters into which the data are partitioned.

Choice of K

  • Sometimes an appropriate choice of K is obvious from the context, but normally we have to carry out analysis to find it. One strategy is to run the K-means algorithm for several values of K and plot the within-cluster sum of squares against K, looking for the point where further increases in K give little improvement.
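A minimal sketch of K-means (Lloyd's algorithm with squared Euclidean distance), and of running it for several values of K, is given below. The function name `kmeans`, the random initialisation, the number of restarts and the simulated three-group data are all assumptions for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns labels, centroids and within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids at k distinct, randomly chosen observations.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every observation to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wss = float(d2[np.arange(len(X)), labels].sum())
    return labels, centroids, wss

# Simulated data with three well-separated groups, for illustration only.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), size=(30, 2)),
    rng.normal(loc=(10.0, 0.0), size=(30, 2)),
    rng.normal(loc=(0.0, 10.0), size=(30, 2)),
])

# For each K keep the best of several random starts, since K-means
# can converge to a local optimum.
wss_by_k = {}
for k in range(1, 7):
    wss_by_k[k] = min(kmeans(X, k, seed=s)[2] for s in range(20))

for k, wss in wss_by_k.items():
    print(k, round(wss, 1))
```

Plotting `wss_by_k` against K and looking for the bend ("elbow") is one informal way to pick K; it is a heuristic rather than a formal test.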

Agglomerative hierarchical clustering

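  • K-means is fast, efficient and scales well to large data, but it requires K to be defined in advance. Hierarchical clustering avoids this. The most common form is agglomerative (bottom-up) clustering, which starts with every observation in its own cluster and repeatedly merges the two closest clusters.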
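The bottom-up idea can be sketched naively: start with every observation in its own cluster and repeatedly merge the two closest clusters. The version below assumes Euclidean distance and complete linkage (distance between clusters is the largest pairwise distance). Both are choices made for the example, and the function name `agglomerative` is hypothetical. It is written for clarity rather than speed.

```python
import numpy as np

def agglomerative(X, n_clusters):
    """Naive bottom-up clustering with complete linkage; returns a label per observation."""
    # Start with every observation in its own cluster.
    clusters = [[i] for i in range(len(X))]
    # Pairwise Euclidean distances between observations.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    while len(clusters) > n_clusters:
        best_d, best_a, best_b = np.inf, None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: largest distance between members of the two clusters.
                d = D[np.ix_(clusters[a], clusters[b])].max()
                if d < best_d:
                    best_d, best_a, best_b = d, a, b
        # Merge the closest pair of clusters.
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Two tight, well-separated groups of points, invented for illustration.
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
labels = agglomerative(X, n_clusters=2)
print(labels)
```

Stopping the merging at a different `n_clusters` gives a different cut of the same hierarchy, which is why the number of clusters does not have to be fixed before the analysis starts.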


Linear regression

  • The dependent variable is the one you want to predict. In the model equation each independent variable is multiplied by a coefficient, and an error term is added at the end.
  • Interaction terms – products of predictors which are thought to “interact” with each other.
  • Coefficient of determination R^2
    • R^2 = 1 – (residual sum of squares / total sum of squares)
    • The closer to 1, the less scatter there is around the line of best fit
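As an illustration of the points above, the sketch below fits a linear model with an interaction term by least squares and computes R^2 from the two sums of squares. The simulated data and the "true" coefficients are assumptions for the example.

```python
import numpy as np

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Response = intercept + main effects + interaction + error term.
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(scale=0.3, size=n)

# Design matrix: a column of ones (intercept), the predictors and their product.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])

# Least-squares estimates of the coefficients.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta
rss = np.sum((y - fitted) ** 2)      # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = 1.0 - rss / tss

print(beta, r2)
```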

Subset selection – Identify a “good” subset of p* < p explanatory variables, then fit a model using least squares on these p* predictors

Regularisation – Modify the residual sum of squares by adding a penalty which prescribes a “cost” to large values of the regression coefficients.

Best subset selection – involves using least squares to fit a linear regression model to each possible subset of the p explanatory variables.
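A brute-force sketch of best subset selection follows. It fits all 2^p subsets, so it is only practical for small p. To compare subsets of different sizes it uses adjusted R^2, which is one possible criterion chosen for the example. The function names and the simulated data are hypothetical.

```python
import itertools
import numpy as np

def rss_of(X, y, cols):
    """Residual sum of squares of the least-squares fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subset(X, y):
    """Fit every subset of the p columns and return the one with the largest adjusted R^2."""
    n, p = X.shape
    tss = float(((y - y.mean()) ** 2).sum())
    best_cols, best_adj = (), -np.inf
    for size in range(p + 1):
        for cols in itertools.combinations(range(p), size):
            rss = rss_of(X, y, cols)
            adj_r2 = 1.0 - (rss / (n - size - 1)) / (tss / (n - 1))
            if adj_r2 > best_adj:
                best_cols, best_adj = cols, adj_r2
    return best_cols, best_adj

# Simulated data: only columns 0 and 2 affect the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.5 * rng.normal(size=80)

cols, adj = best_subset(X, y)
print(cols, adj)
```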


Criteria for choosing between models with different numbers of variables:

  • Adjusted R-squared – choose the model which has the largest adjusted R^2 value
  • Mallows' Cp statistic – choose the model which has the smallest value
  • Bayesian information criterion (BIC) – choose the model with the smallest value

Stepwise selection

When the number of independent variables p is large, fitting every subset is not practical; an automated stepwise process is an efficient alternative.

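  • Forward – Starts with no variables and, at each step, adds the variable that improves the fit most (for example, the one with the lowest p-value)
  • Backward – Starts with all the variables and, at each step, removes the one that contributes least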
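Forward selection can be sketched as a greedy loop. The version below adds, at each step, the variable that lowers the residual sum of squares most; when adding a single term to a fixed model this picks the same variable as the smallest p-value. The function names and the simulated data are assumptions for the example.

```python
import numpy as np

def rss_of(X, y, cols):
    """Residual sum of squares of the least-squares fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, max_vars=None):
    """Return the sequence of nested models visited, as (columns, RSS) pairs."""
    p = X.shape[1]
    selected, remaining = [], list(range(p))
    path = [(tuple(selected), rss_of(X, y, selected))]   # intercept-only model
    while remaining and len(selected) < (max_vars or p):
        # Try each remaining variable and keep the one giving the lowest RSS.
        best_j = min(remaining, key=lambda j: rss_of(X, y, selected + [j]))
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((tuple(selected), rss_of(X, y, selected)))
    return path

# Simulated data: only columns 0 and 2 affect the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = 2.0 + 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.5 * rng.normal(size=80)

path = forward_stepwise(X, y)
for cols, rss in path:
    print(cols, round(rss, 2))
```

The loop fits far fewer models than an exhaustive search over all subsets. The final choice among the models on the path would still be made with a criterion such as adjusted R^2, Cp or BIC.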

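Assessing predictive error and cross-validation

  • Test error – the average prediction error on new observations that were not used to fit the model
  • Training error – the average prediction error on the observations used to fit the model; it tends to underestimate the test error
  • Mean squared error – the average of the squared differences between observed and predicted responses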
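These quantities can be estimated with k-fold cross-validation: split the data into k parts, fit on k - 1 of them, and measure the mean squared error on the part left out. A minimal sketch for a least-squares linear model follows. The function name `kfold_mse`, the choice of k = 5 and the simulated data are assumptions for the example.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Estimate the test mean squared error of a least-squares fit by k-fold cross-validation."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)   # shuffle before splitting
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)                # everything not in the held-out fold
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(fold)), X[fold]])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        errors.append(np.mean((y[fold] - A_test @ beta) ** 2))
    return float(np.mean(errors))

# Simulated data with error variance 1, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

# Training error: fit and evaluate on the same observations.
A = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
train_mse = float(np.mean((y - A @ beta) ** 2))

cv_mse = kfold_mse(X, y, k=5)
print(train_mse, cv_mse)
```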

Regularisation methods

  • These work by modifying the function we minimise, generally referred to as a loss function. Instead of finding the regression coefficients which minimise the SSE (the residual sum of squares), we look for the ones that minimise: loss function = SSE + penalty

Ridge regression

  • Imposes a penalty on the coefficients which shrinks them towards zero but does not set any of them exactly to zero, so it does not carry out feature selection
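A minimal sketch of ridge regression using its closed-form solution, assuming the penalty is lam times the sum of squared coefficients, with standardised predictors and a centred response so that the intercept is not penalised. The function name `ridge`, the symbol `lam` and the simulated data are assumptions for the example.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge coefficients for standardised predictors: minimises SSE + lam * sum(beta^2)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # put predictors on one scale
    yc = y - y.mean()                                   # centring removes the intercept
    p = Z.shape[1]
    # Closed form: (Z'Z + lam * I)^(-1) Z'y
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ yc)

# Simulated data, for illustration only; the third predictor has no effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)

for lam in (0.0, 10.0, 1000.0):
    print(lam, np.round(ridge(X, y, lam), 3))
```

With `lam = 0` this reduces to ordinary least squares. As `lam` grows the coefficients shrink in overall size, but they are not set exactly to zero.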

LASSO – Least absolute shrinkage and selection operator

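  • Performs both variable selection and regularisation in order to enhance the prediction accuracy and interpretability of the statistical model it produces. Its penalty can shrink some coefficients exactly to zero, which removes those variables from the model.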
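The variable-selection behaviour can be seen in a small coordinate-descent sketch. It assumes the loss is SSE + lam times the sum of absolute coefficients, with standardised predictors and a centred response. The function names, the value of `lam`, the fixed number of passes and the simulated data are all assumptions for the example, and this is one common way to fit the LASSO rather than the only one.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink a towards zero by t, returning exactly zero when |a| <= t."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso(X, y, lam, n_iter=200):
    """Coordinate descent for: minimise SSE + lam * sum(|beta|) on standardised predictors."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yc = y - y.mean()
    p = Z.shape[1]
    beta = np.zeros(p)
    col_ss = (Z ** 2).sum(axis=0)            # sum of squares of each column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with predictor j's current contribution added back in.
            partial = yc - Z @ beta + Z[:, j] * beta[j]
            rho = Z[:, j] @ partial
            # One-dimensional minimiser of the penalised loss in beta[j].
            beta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
    return beta

# Simulated data: only the first two predictors affect the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=100)

print(np.round(lasso(X, y, lam=0.0), 3))     # no penalty: least squares
print(np.round(lasso(X, y, lam=100.0), 3))   # moderate penalty
```

With a moderate penalty the coefficients of the irrelevant predictors are set exactly to zero while the two relevant ones are kept but shrunk. A very large `lam` sets every coefficient to zero.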