## Introduction

The difference between traditional programming and machine learning is that hand-written knowledge/rules are replaced with labels: instead of coding the rules, the algorithm learns them from labelled data.

The three paradigms covered: **traditional supervised learning**, **deep learning** and **unsupervised learning**.

The challenge with this sort of model is overfitting, which may be caused by unrepresentative training data or too small a training set.

**Polynomial curve fitting**

- As with the previous models, we need to minimise the mean squared error (MSE)
- The model minimises an error function E(w); however, we also need to keep the gap between test-set and training-set MSE small (see the sketch below)
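A minimal sketch of this trade-off, using NumPy's `polyfit` on made-up noisy sine data (an illustrative assumption, not the course dataset): as the degree grows, training MSE keeps falling while test MSE eventually rises.

```python
import numpy as np

# Toy data: a noisy sine curve (illustrative assumption, not the course dataset)
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

for degree in (1, 3, 9):
    w = np.polyfit(x_train, y_train, degree)   # coefficients minimising squared error E(w)
    mse_train = np.mean((np.polyval(w, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```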

**Linear regression**

- As written in the previous notes; the only addition is the need to split the data into training and test sets (see the sketch below)
- The goal is to find the optimal parameters where the MSE is minimised
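A sketch of the train/test workflow with scikit-learn, on synthetic data (an assumption; swap in the real dataset):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data stands in for the real dataset
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # fit minimising MSE on the training set
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```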

Libraries for machine learning

- scikit-learn (sklearn) is used for traditional machine learning
- Keras is used for deep learning

## Probability

Difference between classification and regression

- Classification predicts a discrete outcome (e.g. true or false)
- Regression predicts a numeric outcome (e.g. house prices)

**Naïve Bayes Classifier**

- Naive Bayes is a family of probabilistic algorithms that take advantage of probability theory and Bayes’ Theorem
- **Pros:** easy to train, easy to understand, and fast
- **Cons:** relies on the conditional feature independence assumption, so it is not suitable when features are redundant
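A minimal sketch with scikit-learn's `GaussianNB` (the iris dataset here is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One Gaussian per feature per class: the "naive" independence assumption
clf = GaussianNB().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```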

**Gaussian MLE**

- For a **probability density function**, integrating between two points gives the probability of falling in that interval, e.g. how many people are likely to have a height between the two points (see the sketch below)
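A small worked example with made-up height data: the Gaussian MLE is just the sample mean and (biased) standard deviation, and integrating the fitted PDF between two heights gives the probability of falling between them.

```python
import numpy as np
from scipy.stats import norm

# Illustrative sample of heights in cm (made-up numbers)
heights = np.array([160.0, 165.0, 170.0, 172.0, 175.0, 180.0, 185.0])

# Gaussian MLE: sample mean and biased standard deviation
mu = heights.mean()
sigma = heights.std()   # np.std divides by N by default, which is the MLE

# Integrating the PDF between two points = difference of the CDF at those points
p = norm.cdf(180, mu, sigma) - norm.cdf(170, mu, sigma)
print(f"mu={mu:.1f}, sigma={sigma:.1f}, P(170 < height < 180) = {p:.2f}")
```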

**Hyper-parameters**

- Examples: learning rate, regularisation coefficient, maximum number of iterations
- There are many more hyperparameters in deep learning

**Deep learning**

- Neural networks
- Each neuron is connected to others and sends a signal when it receives input
- Changing the input produces a spike

**Logistic Regression Classifier**

- Weighted sum of the parameters and the input
- Scaled to the range [0, 1] using the sigmoid function

**Perceptron**

- Weighted sum of the parameters and the input
- Different names (weights and bias) for the model parameters; see the sketch below
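A sketch of the shared computation: both models form the same weighted sum; logistic regression squashes it with the sigmoid, while the perceptron thresholds it (weights and inputs are made-up values).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -0.3])   # weights (illustrative values)
b = 0.1                     # bias
x = np.array([1.0, 2.0])    # one input example

z = w @ x + b                                # the weighted sum both models share
print("logistic regression:", sigmoid(z))    # probability in [0, 1]
print("perceptron:", 1 if z > 0 else 0)      # hard threshold on the same sum
```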

**Neural Networks**

- Outperform traditional methods when the dataset is very large

**Artificial Neurons in the network**

- An artificial neuron has a number of weighted inputs x
- An activation function is used to model the non-linearity
- Sigmoid outputs in [0, 1], tanh in [-1, 1] and ReLU in [0, ∞) (see the sketch below)
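The three activations side by side, as a plain NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # 0 for negative inputs, identity otherwise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```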

**Parameter Training**

Steps for finding the best parameters:

- Guess initial parameters
- Calculate the loss/error
- Calculate the derivative of the loss/error with respect to each parameter
- Update the parameters and repeat (see the sketch below)
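A single-parameter sketch of those four steps, fitting y ≈ w·x by gradient descent on the MSE (the data and learning rate are illustrative):

```python
import numpy as np

# Toy data, roughly y = 2x with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + np.array([0.1, -0.1, 0.05, 0.0])

w = 0.0                 # step 1: guess an initial parameter
lr = 0.01               # learning rate (a hyperparameter)
for _ in range(200):
    error = w * x - y                 # step 2: the loss is mean(error ** 2)
    grad = 2 * np.mean(error * x)     # step 3: derivative of the loss w.r.t. w
    w -= lr * grad                    # step 4: update the parameter
print("learned w:", w)                # converges to roughly 2
```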

**Backpropagation (BP)**

We don’t know what the hidden units are supposed to output, since no target values are given for them

- This is different from training a single unit, where we know the exact error for that unit's output
- Instead, we only have the overall loss and the activity of each unit in the network; backpropagation passes the error backwards through the layers to assign each hidden unit its share

**Network architectures**

- The parameters to be learned are the weights and biases, the same as in most machine learning techniques
- In logistic regression, we minimise the log loss, which maximises the predicted probability of the correct label for each data point

**Cross-Entropy loss**

- One-hot encode the training labels
- Cross-entropy loss is the multi-class form of log loss (see the sketch below)
- Minimise the loss
- Calculate the gradients
- Update the parameters
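A NumPy sketch of the loss for one example (the logits are made-up values): softmax turns the raw outputs into probabilities, and the loss is the negative log probability of the true class.

```python
import numpy as np

# One training example with 3 classes; the true class is class 1
y_true = np.array([0, 1, 0])                 # one-hot encoded label
logits = np.array([0.2, 1.5, -0.3])          # raw network outputs (illustrative)

probs = np.exp(logits) / np.exp(logits).sum()   # softmax
loss = -np.sum(y_true * np.log(probs))          # cross-entropy = log loss of the true class
print("predicted probs:", probs, "loss:", loss)
```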

The model learns from the loss using a training set; this is normally done by splitting the data into a training set and a test set. However, with big data this can take a long time, so instead we use a mini-batch (a random subset of training samples), which is efficient and robust.

This is done using stochastic gradient descent (SGD), which reshuffles the data on every epoch to make sure the mini-batches are well sampled (see the sketch below).
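A minimal sketch of the epoch-level shuffling, with a hypothetical `minibatches` helper (the parameter update itself is elided):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle the training set, then yield random mini-batches (one epoch)."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(10, 1)   # toy features
y = np.arange(10, dtype=float)                  # toy targets
for X_b, y_b in minibatches(X, y, batch_size=4, rng=rng):
    pass  # compute loss and gradients on (X_b, y_b), then update parameters
```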

Dropout: randomly ignores some neurons during an update; the weights of the dropped connections remain unchanged in that step

Hyperparameter: the percentage of neurons that are dropped at a certain layer

Dropout is only applied at the training stage; at test time all neurons are active (see the sketch below)
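A NumPy sketch of the training-time mechanics; the rescaling by 1/(1 − p) is the common "inverted dropout" variant (an assumption, not necessarily how the course presents it):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5                               # the dropout-rate hyperparameter for this layer
activations = rng.uniform(size=8)          # outputs of one hidden layer (illustrative)

# Training step: randomly zero roughly a fraction p_drop of the neurons;
# the "inverted dropout" rescaling keeps the expected activation unchanged
mask = rng.uniform(size=activations.shape) > p_drop
train_out = activations * mask / (1 - p_drop)

# Test time: no mask is applied, every neuron contributes
test_out = activations
print(train_out, test_out, sep="\n")
```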

**Multilayer Perceptron**

- Approximates a non-linear classification decision boundary
- Takes raw data as input only; no need for feature engineering or domain knowledge (see the sketch below)
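A Keras sketch of such a network for 10-class digit classification (the input size and layer widths are assumptions), combining the pieces above: a non-linear hidden layer, dropout, and a softmax output trained with cross-entropy.

```python
from tensorflow import keras

# A small MLP for 10-class classification of flat 784-dim inputs (e.g. MNIST digits)
model = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),    # hidden layer: non-linear activation
    keras.layers.Dropout(0.5),                     # dropout rate is a hyperparameter
    keras.layers.Dense(10, activation="softmax"),  # class probabilities
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```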

**Summary**

The general idea of deep learning

- Limitations of linear classifiers
- The role of non-linear activation function in neural networks

Parameter training

- Parameters and Backpropagation

From shallow to deep

- Softmax LR and Multilayer perceptron
- Unsupervised model: Autoencoder
- Techniques – Stochastic gradient descent and dropout

### Relevant network architecture

**Convolutional Neural Network**

- Weight sharing concept
- Convolutional layer and pooling layer
- Application: Digit reconstruction
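A minimal Keras sketch showing the two layer types (the 28×28 greyscale input shape is an assumption):

```python
from tensorflow import keras

# Minimal convolutional network for 28x28 greyscale digit images
model = keras.Sequential([
    keras.layers.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(16, kernel_size=3, activation="relu"),  # weight sharing: one 3x3 filter slides over the whole image
    keras.layers.MaxPooling2D(pool_size=2),                     # pooling: downsample the feature maps
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```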

**Recurrent Neural Networks**

- Vanilla recurrent neural network and training
- Long short-term memory (LSTM)
- Application: Digit recognition and Sentiment analysis
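A Keras sketch of an LSTM sentiment classifier over integer-encoded word sequences (vocabulary size and sequence length are assumptions):

```python
from tensorflow import keras

# LSTM for binary sentiment analysis over sequences of word indices
model = keras.Sequential([
    keras.layers.Input(shape=(100,), dtype="int32"),        # sequences of 100 word indices
    keras.layers.Embedding(input_dim=10000, output_dim=32), # learned word vectors
    keras.layers.LSTM(64),                                  # long short-term memory layer
    keras.layers.Dense(1, activation="sigmoid"),            # positive/negative
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```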

**Support vector machines**

Linear and Kernel

- There is a trade-off between the width of the margin and the number of mistakes on the training data
- You can use gradient descent to find the optimal parameters w, b for this linear classifier

A Linear SVM (classifier)

- Can have either a soft margin (with slack variables) or a hard margin (a special case of the soft margin)

The straightforward way is to project the raw features into a linearly separable feature space; this can be done through deep learning for feature extraction or through manual feature engineering (see the sketch below).
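A scikit-learn sketch tying this together with the summary below: an RBF-kernel SVM with C and gamma chosen by grid search (the iris data and parameter grid are stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C trades margin width against training mistakes; the RBF kernel maps to a
# feature space where the classes may become linearly separable
params = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), params, cv=5).fit(X_train, y_train)
print("best hyperparameters:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```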

**Decision Tree and Random Forest**

We aim to have the lowest entropy at the leaf nodes

The attribute-selection process: calculate the entropy of the whole dataset, then split the dataset on a candidate attribute and recalculate the entropy of the resulting subsets; finally, select the attribute with the largest information gain (see the sketch below).
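A small worked example of the calculation, with made-up labels and one binary attribute:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Toy dataset: does this binary attribute split the labels well? (made-up values)
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
attribute = np.array([1, 1, 1, 1, 0, 0, 0, 0])

before = entropy(labels)   # entropy of the whole dataset
after = sum((attribute == v).mean() * entropy(labels[attribute == v]) for v in (0, 1))
print("information gain:", before - after)   # ID3 picks the attribute maximising this
```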

**Decision tree**

- Good interpretability, though prone to overfitting: low bias but high variance
- Need to balance overfitting against accuracy

Random forests don’t overfit as much as decision trees, though a single decision tree is easier to interpret (see the sketch below).
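A quick scikit-learn comparison on stand-in data (iris), contrasting a single tree with an ensemble of 100 trees:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)                       # single tree: interpretable, high variance
forest = RandomForestClassifier(n_estimators=100, random_state=0)   # ensemble: averaging reduces variance

print("tree:", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())
```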

**Summary**

**Support vector machines**

- Large margin, slack variables
- Primal form, dual form
- Kernel trick
- Hyperparameter selection through grid search

**Decision tree and Random forest**

- ID3 algorithm: Entropy and information gain
- Overfitting reduction through ensemble
- Random Forest

**K-nearest neighbour classifier**

- Classifies a point by majority voting among its k nearest neighbours (see the sketch below)
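A minimal scikit-learn sketch (iris as a stand-in; k = 5 is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test point takes the majority label among its k nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```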

**Clustering**

- K-means
- Expectation maximisation (EM)

**Dimensionality Reduction**

- Principal component analysis (PCA)
- Linear discriminant analysis (LDA)
- Clustering was covered fully in the last module, so it is worth going back and reviewing those notes