Machine learning


The difference between traditional programming and machine learning is that hand-written knowledge/rules are swapped for labels: the rules are learned from labelled data.

Traditional supervised learning, Deep learning and Unsupervised learning

The challenge with this sort of model is overfitting, which may be caused by unrepresentative training data or by too small a training set.

Polynomial curve fitting

  • As with the previous models, we need to minimise the mean squared error (MSE)
  • The model minimises an error function E(w); however, we also need to keep the gap between the training MSE and the test MSE small (a large gap indicates overfitting)
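A minimal sketch of the point above, assuming NumPy and synthetic data (the noisy sine curve, the polynomial degrees and the 30/30 split are illustrative assumptions, not from the notes). It fits polynomials of increasing degree and records training and test MSE so the gap between them can be compared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): noisy samples of sin(2*pi*x)
x = rng.uniform(0.0, 1.0, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Hold out half of the data as a test set
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return float(np.mean((y_true - y_pred) ** 2))


results = {}
for degree in (1, 3, 9):
    # np.polyfit returns the least-squares weights w for this degree
    w = np.polyfit(x_train, y_train, deg=degree)
    train_mse = mse(y_train, np.polyval(w, x_train))
    test_mse = mse(y_test, np.polyval(w, x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Training MSE can only go down as the degree grows, so the test MSE is the number to watch when choosing the degree.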

Linear regression

  • As written in the previous notes; the only thing to add is the need to split the data into a training set and a testing set
  • The aim is to find the optimal parameters, where the MSE is minimised
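A minimal sketch of linear regression with a train/test split, assuming NumPy and synthetic data generated from y = 3x + 2 plus noise (the data, the 80/20 split and the helper names are assumptions for illustration). Least squares finds the parameters that minimise the training MSE, and the test set checks how well they generalise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed): y = 3x + 2 plus a little noise
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=100)

# Shuffle the indices, then split 80/20 into training and test sets
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]


def add_bias(features):
    """Append a column of ones so the intercept is learned as a parameter."""
    return np.hstack([features, np.ones((len(features), 1))])


# Least squares returns the parameters that minimise the training MSE
params, *_ = np.linalg.lstsq(add_bias(X[train_idx]), y[train_idx], rcond=None)
slope, intercept = params


def mse(features, targets):
    return float(np.mean((add_bias(features) @ params - targets) ** 2))


train_mse = mse(X[train_idx], y[train_idx])
test_mse = mse(X[test_idx], y[test_idx])
print(f"slope={slope:.2f} intercept={intercept:.2f} "
      f"train MSE={train_mse:.4f} test MSE={test_mse:.4f}")
```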

Libraries for machine learning

  • scikit-learn (sklearn) is used for traditional machine learning
  • Keras is used for deep learning


Difference between classification and regression

  • Classification: categorical outcome (e.g. true or false)
  • Regression: numeric outcome (e.g. house prices)

Naïve Bayes Classifier

  • Naive Bayes is a family of probabilistic algorithms that take advantage of probability theory and Bayes’ Theorem
  • Pros: Easy to train, easy to understand and fast
  • Cons: Assumes the features are conditionally independent given the class, so not suitable when features are redundant (correlated)
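A minimal sketch of one member of the family, a Gaussian Naive Bayes classifier, assuming NumPy and two synthetic clusters (the data and the function names are illustrative assumptions). The independence assumption shows up as a plain sum of per-feature log-likelihoods.

```python
import numpy as np


def fit_gaussian_nb(X, y):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return model


def predict_gaussian_nb(model, X):
    """Pick the class with the largest log prior + log likelihood (Bayes' theorem)."""
    classes = list(model)
    scores = []
    for c in classes:
        prior, mean, var = model[c]
        # Independence assumption: the joint log-likelihood is a sum over features
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.array(classes)[np.argmax(np.array(scores), axis=0)]


rng = np.random.default_rng(0)
# Synthetic two-class data (assumed): clusters centred at (0, 0) and (4, 4)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = fit_gaussian_nb(X, y)
accuracy = float(np.mean(predict_gaussian_nb(model, X) == y))
print(f"training accuracy: {accuracy:.2f}")
```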

Gaussian MLE

  • In a density graph, integrating between two points gives the probability that a value falls between them (e.g. the proportion of people expected in that range) – the curve being integrated is the probability density function
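A minimal sketch of Gaussian maximum likelihood estimation followed by integration of the density between two points, using only the standard library. The "heights" data and the 160 to 180 range are assumptions for illustration; the integral is computed from the Gaussian cumulative distribution function via math.erf.

```python
import math
import random

random.seed(0)
# Synthetic "heights" in cm (assumed data, for illustration only)
heights = [random.gauss(170.0, 10.0) for _ in range(5000)]

# Gaussian maximum likelihood estimates: sample mean and (biased) sample std
mu = sum(heights) / len(heights)
sigma = math.sqrt(sum((h - mu) ** 2 for h in heights) / len(heights))


def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_between(a, b, mu, sigma):
    """Integral of the Gaussian density from a to b."""
    return gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)


p = prob_between(160.0, 180.0, mu, sigma)
empirical = sum(160.0 <= h <= 180.0 for h in heights) / len(heights)
print(f"mu={mu:.1f} sigma={sigma:.1f} model p={p:.3f} empirical={empirical:.3f}")
```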


Hyperparameters

  • Learning rate, regularisation coefficient, maximum iteration number
  • Many more hyperparameters in deep learning

Deep learning

  • Neural networks
  • Each neuron is connected to others and sends a signal on input
  • Changing the input produces a spike

Logistic Regression Classifier – Weighted sum of the parameters and the input, scaled to the range [0, 1] using the sigmoid function

Perceptron – Weighted sum of the parameters and the input; the model parameters go by different names (weights and bias)
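A minimal sketch of the two models side by side, using only the standard library. Both compute the same weighted sum; the parameter values are hypothetical, and thresholding the perceptron output at zero is an assumption here (it is how the classic perceptron is usually defined, the notes do not state it).

```python
import math


def weighted_sum(weights, bias, x):
    """Weighted sum of the inputs plus the bias term."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def logistic_predict(weights, bias, x):
    """Logistic regression: squash the weighted sum into (0, 1) with the sigmoid."""
    return 1.0 / (1.0 + math.exp(-weighted_sum(weights, bias, x)))


def perceptron_predict(weights, bias, x):
    """Classic perceptron (assumed step activation): threshold the same sum at zero."""
    return 1 if weighted_sum(weights, bias, x) >= 0 else 0


weights, bias = [2.0, -1.0], 0.5   # hypothetical parameter values
x = [1.0, 3.0]                     # weighted sum = 2 - 3 + 0.5 = -0.5

print(logistic_predict(weights, bias, x), perceptron_predict(weights, bias, x))
```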

Neural Networks

  • Outperform traditional methods when the data is very large

Artificial Neurons in the network

  • An artificial neuron has a number of weighted inputs x
  • An activation function is used to model the non-linearity
  • Common choices: Sigmoid (output in (0, 1)), Tanh (output in (-1, 1)) and ReLU (output in [0, ∞))
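A minimal sketch of an artificial neuron and the three activation functions listed above, assuming NumPy. The neuron helper and the example values are hypothetical, for illustration only.

```python
import numpy as np


def sigmoid(z):
    """Output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def tanh(z):
    """Output lies strictly between -1 and 1."""
    return np.tanh(z)


def relu(z):
    """Output is 0 for negative inputs and z otherwise."""
    return np.maximum(0.0, z)


def neuron(x, weights, bias, activation):
    """Weighted sum of the inputs passed through a non-linear activation."""
    return activation(np.dot(weights, x) + bias)


z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
for name, fn in (("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)):
    print(name, np.round(fn(z), 3))
```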

Parameter Training

Steps for finding the best parameters:

  • Guess initial parameters
  • Calculate the loss/error
  • Calculate the derivative (gradient) of the loss/error with respect to the parameters
  • Update the parameters
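The four steps above can be sketched as a plain gradient descent loop. This is a minimal illustration with a hypothetical one-parameter loss, loss(w) = (w - 3)^2; the learning rate and the number of iterations are also assumptions.

```python
def loss(w):
    """Hypothetical loss with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0               # step 1: guess the initial parameter
learning_rate = 0.1
history = []

for _ in range(100):
    history.append(loss(w))        # step 2: calculate the loss
    g = gradient(w)                # step 3: calculate the derivative of the loss
    w -= learning_rate * g         # step 4: update the parameter

print(f"w={w:.4f} final loss={loss(w):.2e}")
```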

Backpropagation (BP)

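We don’t know what the hidden units are supposed to output, as no target values are given for them

  • This is different from training a single unit, where we know the exact difference between its output and the target
  • Instead, we have the loss and the activity of each unit in the network, and the error is propagated backwards from the output through each layer using the chain rule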
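A minimal sketch of backpropagation through one hidden layer, assuming NumPy, a tanh hidden layer, a linear output and a squared-error loss (all of these choices, and the layer sizes, are assumptions for illustration). The error is known at the output and passed back to the hidden units by the chain rule; a finite-difference check compares one analytic gradient against a numerical one.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)           # one input example
t = 1.0                          # its target
W1 = rng.normal(size=(4, 3))     # input -> hidden weights
w2 = rng.normal(size=4)          # hidden -> output weights


def forward(W1, w2):
    h = np.tanh(W1 @ x)          # hidden activities
    y = w2 @ h                   # linear output
    return h, y, 0.5 * (y - t) ** 2


h, y, loss = forward(W1, w2)

# Backward pass (chain rule)
dy = y - t                       # dLoss/dy: known directly from the target
grad_w2 = dy * h                 # gradient for the output weights
dh = dy * w2                     # error signal passed back to the hidden units
dz = dh * (1.0 - h ** 2)         # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
grad_W1 = np.outer(dz, x)        # gradient for the hidden weights

# Central finite-difference check on a single hidden weight
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[2, 1] += eps
W1_minus[2, 1] -= eps
numeric = (forward(W1_plus, w2)[2] - forward(W1_minus, w2)[2]) / (2 * eps)
analytic = grad_W1[2, 1]
print(f"analytic={analytic:.6f} numeric={numeric:.6f}")
```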

Network architectures

  • The parameters to be learned are weight and bias; same as most machine learning techniques
  • In logistic regression, we try to minimise the log loss, which maximises the predicted probability of the correct label for each data point

Cross-Entropy loss

  • One-hot encoding the training labels
  • Cross-entropy loss is the log loss
  • Minimising the loss
  • Calculating the gradients
  • Updating parameters
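The steps above can be sketched for a softmax classifier, assuming NumPy and three synthetic clusters (the data, the learning rate of 0.1 and the 500 iterations are assumptions for illustration). For softmax with cross-entropy, the gradient with respect to the logits is the predicted probabilities minus the one-hot targets.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows."""
    return np.eye(num_classes)[labels]


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, targets):
    """Mean negative log probability assigned to the correct class."""
    return float(-np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1)))


rng = np.random.default_rng(0)
# Synthetic data (assumed): three clusters along the diagonal
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in (-2.0, 0.0, 2.0)])
labels = np.repeat([0, 1, 2], 30)
T = one_hot(labels, 3)

W, b = np.zeros((2, 3)), np.zeros(3)
learning_rate = 0.1
losses = []
for _ in range(500):
    P = softmax(X @ W + b)
    losses.append(cross_entropy(P, T))        # calculate the loss
    G = (P - T) / len(X)                      # gradient w.r.t. the logits
    W -= learning_rate * (X.T @ G)            # update the parameters
    b -= learning_rate * G.sum(axis=0)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```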

The model learns by minimising the loss on a training set; a separate test set is held out to check how well it generalises. However, with big data, computing the loss over the whole training set for every update can take a long time, so instead each update uses a mini-batch (a random subset of the training samples), which is efficient and robust.

The above is stochastic gradient descent; the data is reshuffled (typically once per pass over the training set) to make sure each mini-batch is well sampled.
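A minimal sketch of mini-batch stochastic gradient descent on a linear model, assuming NumPy and synthetic data (the true weights, batch size of 32, learning rate and 20 epochs are assumptions for illustration). The sample order is reshuffled at the start of every epoch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumed)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
learning_rate, batch_size = 0.05, 32

for epoch in range(20):
    order = rng.permutation(len(X))               # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]   # random subset of the samples
        error = X[batch] @ w - y[batch]
        grad = 2.0 * X[batch].T @ error / len(batch)  # MSE gradient on the batch
        w -= learning_rate * grad

print("learned weights:", np.round(w, 2))
```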

Dropout – Randomly ignores some neurons during a training update; their corresponding connections remain unchanged during that update

Hyperparameter: the percentage of neurons that are dropped at a given layer

Dropout is only applied at the training stage; at the testing stage all neurons are used
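A minimal sketch of dropout, assuming NumPy. Note that dropout is active during training and switched off at test time, when all neurons are used. The rescaling of the kept units ("inverted dropout") is an assumption about the variant used; the function name and the drop rate of 0.5 are hypothetical.

```python
import numpy as np


def dropout(activations, drop_rate, training, rng):
    """Inverted dropout: zero a random fraction of units during training only."""
    if not training or drop_rate == 0.0:
        return activations                 # test time: all neurons are kept
    keep_prob = 1.0 - drop_rate
    mask = rng.random(activations.shape) < keep_prob
    # Scale the kept units so the expected activation stays the same
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
h = np.ones((1000, 100))                   # stand-in for one layer's activations
h_train = dropout(h, drop_rate=0.5, training=True, rng=rng)
h_test = dropout(h, drop_rate=0.5, training=False, rng=rng)

print("fraction dropped in training:", float(np.mean(h_train == 0.0)))
```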

Multilayer Perceptron

  • An approximation for the non-linear classification decision boundary
  • Raw data input only; no need for feature engineering or domain knowledge


The general idea of deep learning

  • Limitation of linear classifier
  • The role of non-linear activation function in neural networks

Parameter training

  • Parameters and Backpropagation

From shallow to deep

  • Softmax LR and Multilayer perceptron
  • Unsupervised model: Autoencoder
  • Techniques  – Stochastic gradient descent and dropout

Relevant network architecture

Convolutional Neural Network

  • Weight sharing concept
  • Convolutional layer and pooling layer
  • Application: Digit reconstruction

Recurrent Neural Networks

  • Vanilla recurrent neural network and training
  • Long short-term memory (LSTM)
  • Application: Digit recognition and Sentiment analysis

Support vector machines

Linear and Kernel

  • There is a trade-off between the margin and the number of mistakes on the training data
  • You can use gradient descent to find the optimal parameters w, b for this linear classifier
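A minimal sketch of a soft-margin linear SVM trained by (sub)gradient descent on the hinge loss, assuming NumPy and synthetic data. The names w, b and C, the value C = 1.0, the learning rate and the iteration count are assumptions for illustration; C controls the trade-off between a large margin and mistakes on the training data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-class data (assumed), with labels in {-1, +1}
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)

C = 1.0               # larger C penalises training mistakes more heavily
learning_rate = 0.01
w, b = np.zeros(2), 0.0


def objective(w, b):
    """Margin term plus the average hinge loss weighted by C."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * hinge.mean()


start = objective(w, b)
for _ in range(500):
    margins = y * (X @ w + b)
    active = margins < 1.0        # points inside the margin or misclassified
    grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0) / len(X)
    grad_b = -C * y[active].sum() / len(X)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

accuracy = float(np.mean(np.sign(X @ w + b) == y))
print(f"objective {start:.3f} -> {objective(w, b):.3f}, accuracy {accuracy:.2f}")
```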

A Linear SVM (classifier)

  • Can use either a soft margin (with slack variables) or a hard margin (the special case of a soft margin where no slack is allowed)

The straightforward way to handle data that is not linearly separable is to project the raw features into a feature space where it is linearly separable; this is done either through deep learning (learned feature extraction) or through feature engineering.
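A minimal sketch of the feature engineering route, assuming NumPy and synthetic data made of an inner disc and an outer ring (the data, the squared-radius feature and the threshold are assumptions for illustration). No straight line separates the classes in the raw coordinates, but a single threshold does in the engineered feature.

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, 200)
# Class 0: radius in [0, 1); class 1: radius in [2, 3)
radii = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([0] * 100 + [1] * 100)

# Hand-engineered feature: squared distance from the origin
r2 = (X ** 2).sum(axis=1)

# In the new feature, one threshold (a linear boundary) separates the classes
threshold = 2.25      # any value between 1^2 and 2^2 works for this data
pred = (r2 > threshold).astype(int)
accuracy = float(np.mean(pred == y))
print(f"accuracy with the engineered feature: {accuracy:.2f}")
```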

Decision Tree and Random Forest

We aim to have the lowest entropy at the leaf nodes

The selection process is based on calculating the entropy of the whole dataset, then splitting the dataset on a candidate attribute and calculating the (weighted) entropy again. Finally, the attribute with the largest information gain is selected.
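A minimal sketch of entropy and information gain, using only the standard library. The toy dataset and the attribute names are hypothetical; the gain is the entropy of the whole set minus the weighted entropy of the subsets produced by the split.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the whole set minus the weighted entropy after the split."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Toy dataset (hypothetical)
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["yes", "yes", "no", "no"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "windy")}
best = max(gains, key=gains.get)   # attribute with the largest information gain
print(gains, "-> split on", best)
```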

Decision tree

  • Good interpretability, though prone to overfitting (low bias and high variance)
  • Need to balance overfitting with accuracy

Random forests don’t overfit as much as decision trees, though decision trees are easier to interpret.


Support vector machines

  • Large margin, slack variables
  • Primal form, dual form
  • Kernel trick
  • Hyperparameters selection through grid search

Decision tree and Random forest

  • ID3 algorithm: Entropy and information gain
  • Overfitting reduction through ensemble
  • Random Forest

K-nearest neighbour classifier

  • Majority voting
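A minimal sketch of a k-nearest neighbour classifier with majority voting, using only the standard library (math.dist needs Python 3.8 or later). The training points, labels and k = 3 are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(p, query), label) for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy training set (hypothetical)
train_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (0.5, 0.5)))
print(knn_predict(train_points, train_labels, (5.5, 5.5)))
```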


Clustering

  • K-means
  • Expectation maximisation (EM)
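A minimal sketch of K-means, assuming NumPy and two synthetic blobs (the data, the random initialisation from data points and the fixed number of iterations are assumptions for illustration). It alternates an assignment step and a centroid update step; the result can depend on the initial centroids.

```python
import numpy as np


def kmeans(X, k, iterations=20, seed=0):
    """Plain K-means: alternate assignment and centroid update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels


rng = np.random.default_rng(0)
# Synthetic data (assumed): two blobs centred at (0, 0) and (10, 10)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(10.0, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(np.round(centroids, 2))
```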

Dimensionality Reduction

  • Principal component analysis (PCA)
  • Linear discriminant analysis (LDA)
  • Clustering was covered in full in the last module, so it is worth going back over those notes
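A minimal sketch of principal component analysis via the singular value decomposition, assuming NumPy and synthetic 3-D data that lies close to a line (the data and the function name are assumptions for illustration). The data is centred, and the leading right singular vectors give the directions of greatest variance.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal components using the SVD."""
    X_centred = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    return X_centred @ components.T, components, explained_variance


rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Synthetic 3-D data (assumed) lying close to the direction (1, 2, -1)
X = np.column_stack([t, 2.0 * t, -t]) + rng.normal(scale=0.05, size=(200, 3))

Z, components, variance = pca(X, n_components=1)
print("first principal direction:", np.round(components[0], 3))
```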