## Introduction

The difference between traditional programming and machine learning is that hand-written knowledge/rules are replaced with labels: instead of coding the rules, the algorithm learns them from labelled data.

The three paradigms covered: **traditional supervised learning**, **deep learning** and **unsupervised learning**.

The challenge with this sort of model is overfitting, which may be caused by unrepresentative training data or too small a training set.

**Polynomial curve fitting**

- As with the previous models, we need to minimise the mean squared error (MSE)
- The model minimises an error function E(w); however, we also need to keep the gap between test-set and training-set MSE small (see the sketch below)
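A minimal sketch of this trade-off, using NumPy's `polyfit` on made-up noisy sine data (an illustrative assumption, not the course dataset): as the degree grows, training MSE keeps falling while test MSE eventually rises.

```python
import numpy as np

# Toy data: a noisy sine curve (illustrative assumption, not the course dataset)
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, x_test.size)

for degree in (1, 3, 9):
    w = np.polyfit(x_train, y_train, degree)   # coefficients minimising squared error E(w)
    mse_train = np.mean((np.polyval(w, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```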

**Linear regression**

- As written in the previous notes; the only addition is the need to split the data into training and test sets (see the sketch below)
- The goal is to find the optimal parameters where the MSE is minimised
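A sketch of the train/test workflow with scikit-learn, on synthetic data (an assumption; swap in the real dataset):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data stands in for the real dataset
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # fit minimising MSE on the training set
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```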

Libraries for machine learning

- scikit-learn (sklearn) is used for traditional machine learning
- Keras is used for deep learning

## Probability

Difference between classification and regression

- Classification predicts a discrete outcome (e.g. true or false)
- Regression predicts a numeric outcome (e.g. house prices)

**Naïve Bayes Classifier**

- Naive Bayes is a family of probabilistic algorithms that take advantage of probability theory and Bayes’ Theorem
- **Pros:** easy to train, easy to understand, and fast
- **Cons:** relies on the conditional feature independence assumption, so it is not suitable when features are redundant
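A minimal sketch with scikit-learn's `GaussianNB` (the iris dataset here is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One Gaussian per feature per class: the "naive" independence assumption
clf = GaussianNB().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```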

**Gaussian MLE**

- For a **probability density function**, integrating between two points gives the probability of falling in that interval, e.g. how many people are likely to have a height between the two points (see the sketch below)
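A small worked example with made-up height data: the Gaussian MLE is just the sample mean and (biased) standard deviation, and integrating the fitted PDF between two heights gives the probability of falling between them.

```python
import numpy as np
from scipy.stats import norm

# Illustrative sample of heights in cm (made-up numbers)
heights = np.array([160.0, 165.0, 170.0, 172.0, 175.0, 180.0, 185.0])

# Gaussian MLE: sample mean and biased standard deviation
mu = heights.mean()
sigma = heights.std()   # np.std divides by N by default, which is the MLE

# Integrating the PDF between two points = difference of the CDF at those points
p = norm.cdf(180, mu, sigma) - norm.cdf(170, mu, sigma)
print(f"mu={mu:.1f}, sigma={sigma:.1f}, P(170 < height < 180) = {p:.2f}")
```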

**Hyper-parameters**

- Examples: learning rate, regularisation coefficient, maximum number of iterations
- There are many more hyperparameters in deep learning

**Deep learning**

- Neural networks
- Each neuron is connected to others and sends a signal when it receives input
- Changing the input produces a spike

**Logistic Regression Classifier**

- Weighted sum of the parameters and the input
- Scaled to the range [0, 1] using the sigmoid function

**Perceptron**

- Weighted sum of the parameters and the input
- Different names (weights and bias) for the model parameters; see the sketch below
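A sketch of the shared computation: both models form the same weighted sum; logistic regression squashes it with the sigmoid, while the perceptron thresholds it (weights and inputs are made-up values).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -0.3])   # weights (illustrative values)
b = 0.1                     # bias
x = np.array([1.0, 2.0])    # one input example

z = w @ x + b                                # the weighted sum both models share
print("logistic regression:", sigmoid(z))    # probability in [0, 1]
print("perceptron:", 1 if z > 0 else 0)      # hard threshold on the same sum
```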

**Neural Networks**

- Outperform traditional methods when the dataset is very large

**Artificial Neurons in the network**

- An artificial neuron has a number of weighted inputs x
- An activation function is used to model the non-linearity
- Sigmoid outputs in [0, 1], tanh in [-1, 1] and ReLU in [0, ∞) (see the sketch below)
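The three activations side by side, as a plain NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # 0 for negative inputs, identity otherwise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```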

**Parameter Training**

Steps for finding the best parameters:

- Guess initial parameters
- Calculate the loss/error
- Calculate the derivative of the loss/error with respect to each parameter
- Update the parameters and repeat (see the sketch below)
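A single-parameter sketch of those four steps, fitting y ≈ w·x by gradient descent on the MSE (the data and learning rate are illustrative):

```python
import numpy as np

# Toy data, roughly y = 2x with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + np.array([0.1, -0.1, 0.05, 0.0])

w = 0.0                 # step 1: guess an initial parameter
lr = 0.01               # learning rate (a hyperparameter)
for _ in range(200):
    error = w * x - y                 # step 2: the loss is mean(error ** 2)
    grad = 2 * np.mean(error * x)     # step 3: derivative of the loss w.r.t. w
    w -= lr * grad                    # step 4: update the parameter
print("learned w:", w)                # converges to roughly 2
```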

**Backpropagation (BP)**

We don’t know what the hidden units are supposed to output, since no target values are given for them

- This is different from training a single unit, where we know the exact error for that unit's output
- Instead, we only have the overall loss and the activity of each unit in the network; backpropagation passes the error backwards through the layers to assign each hidden unit its share

**Network architectures**

- The parameters to be learned are the weights and biases, the same as in most machine learning techniques
- In logistic regression, we minimise the log loss, which maximises the predicted probability of the correct label for each data point

**Cross-Entropy loss**

- One-hot encode the training labels
- Cross-entropy loss is the multi-class form of log loss (see the sketch below)
- Minimise the loss
- Calculate the gradients
- Update the parameters
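A NumPy sketch of the loss for one example (the logits are made-up values): softmax turns the raw outputs into probabilities, and the loss is the negative log probability of the true class.

```python
import numpy as np

# One training example with 3 classes; the true class is class 1
y_true = np.array([0, 1, 0])                 # one-hot encoded label
logits = np.array([0.2, 1.5, -0.3])          # raw network outputs (illustrative)

probs = np.exp(logits) / np.exp(logits).sum()   # softmax
loss = -np.sum(y_true * np.log(probs))          # cross-entropy = log loss of the true class
print("predicted probs:", probs, "loss:", loss)
```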

The model learns from the loss using a training set; this is normally done by splitting the data into a training set and a test set. However, with big data this can take a long time, so instead we use a mini-batch (a random subset of training samples), which is efficient and robust.

This is done using stochastic gradient descent (SGD), which reshuffles the data on every epoch to make sure the mini-batches are well sampled (see the sketch below).
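A minimal sketch of the epoch-level shuffling, with a hypothetical `minibatches` helper (the parameter update itself is elided):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Shuffle the training set, then yield random mini-batches (one epoch)."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(10, 1)   # toy features
y = np.arange(10, dtype=float)                  # toy targets
for X_b, y_b in minibatches(X, y, batch_size=4, rng=rng):
    pass  # compute loss and gradients on (X_b, y_b), then update parameters
```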

Dropout: randomly ignores some neurons during an update; the weights of the dropped connections remain unchanged in that step

Hyperparameter: the percentage of neurons that are dropped at a certain layer

Dropout is only applied at the training stage; at test time all neurons are active (see the sketch below)
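A NumPy sketch of the training-time mechanics; the rescaling by 1/(1 − p) is the common "inverted dropout" variant (an assumption, not necessarily how the course presents it):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.5                               # the dropout-rate hyperparameter for this layer
activations = rng.uniform(size=8)          # outputs of one hidden layer (illustrative)

# Training step: randomly zero roughly a fraction p_drop of the neurons;
# the "inverted dropout" rescaling keeps the expected activation unchanged
mask = rng.uniform(size=activations.shape) > p_drop
train_out = activations * mask / (1 - p_drop)

# Test time: no mask is applied, every neuron contributes
test_out = activations
print(train_out, test_out, sep="\n")
```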

**Multilayer Perceptron**

- Approximates a non-linear classification decision boundary
- Takes raw data as input only; no need for feature engineering or domain knowledge (see the sketch below)
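A Keras sketch of such a network for 10-class digit classification (the input size and layer widths are assumptions), combining the pieces above: a non-linear hidden layer, dropout, and a softmax output trained with cross-entropy.

```python
from tensorflow import keras

# A small MLP for 10-class classification of flat 784-dim inputs (e.g. MNIST digits)
model = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),    # hidden layer: non-linear activation
    keras.layers.Dropout(0.5),                     # dropout rate is a hyperparameter
    keras.layers.Dense(10, activation="softmax"),  # class probabilities
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```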

**Summary**

The general idea of deep learning

- Limitations of linear classifiers
- The role of non-linear activation function in neural networks

Parameter training

- Parameters and Backpropagation

From shallow to deep

- Softmax LR and Multilayer perceptron
- Unsupervised model: Autoencoder
- Techniques – Stochastic gradient descent and dropout

### Relevant network architecture

**Convolutional Neural Network**

- Weight sharing concept
- Convolutional layer and pooling layer
- Application: Digit reconstruction
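A minimal Keras sketch showing the two layer types (the 28×28 greyscale input shape is an assumption):

```python
from tensorflow import keras

# Minimal convolutional network for 28x28 greyscale digit images
model = keras.Sequential([
    keras.layers.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(16, kernel_size=3, activation="relu"),  # weight sharing: one 3x3 filter slides over the whole image
    keras.layers.MaxPooling2D(pool_size=2),                     # pooling: downsample the feature maps
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```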

**Recurrent Neural Networks**

- Vanilla recurrent neural network and training
- Long short-term memory (LSTM)
- Application: Digit recognition and Sentiment analysis
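A Keras sketch of an LSTM sentiment classifier over integer-encoded word sequences (vocabulary size and sequence length are assumptions):

```python
from tensorflow import keras

# LSTM for binary sentiment analysis over sequences of word indices
model = keras.Sequential([
    keras.layers.Input(shape=(100,), dtype="int32"),        # sequences of 100 word indices
    keras.layers.Embedding(input_dim=10000, output_dim=32), # learned word vectors
    keras.layers.LSTM(64),                                  # long short-term memory layer
    keras.layers.Dense(1, activation="sigmoid"),            # positive/negative
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```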

**Support vector machines**

Linear and Kernel

- There is a trade-off between the width of the margin and the number of mistakes on the training data
- You can use gradient descent to find the optimal parameters w, b for this linear classifier

A Linear SVM (classifier)

- Can have either a soft margin (with slack variables) or a hard margin (a special case of the soft margin)

The straightforward way is to project the raw features into a linearly separable feature space; this can be done through deep learning for feature extraction or through manual feature engineering (see the sketch below).
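A scikit-learn sketch tying this together with the summary below: an RBF-kernel SVM with C and gamma chosen by grid search (the iris data and parameter grid are stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C trades margin width against training mistakes; the RBF kernel maps to a
# feature space where the classes may become linearly separable
params = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), params, cv=5).fit(X_train, y_train)
print("best hyperparameters:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))
```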

**Decision Tree and Random Forest**

We aim to have the lowest entropy at the leaf nodes

The attribute-selection process: calculate the entropy of the whole dataset, then split the dataset on a candidate attribute and recalculate the entropy of the resulting subsets; finally, select the attribute with the largest information gain (see the sketch below).
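A small worked example of the calculation, with made-up labels and one binary attribute:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Toy dataset: does this binary attribute split the labels well? (made-up values)
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
attribute = np.array([1, 1, 1, 1, 0, 0, 0, 0])

before = entropy(labels)   # entropy of the whole dataset
after = sum((attribute == v).mean() * entropy(labels[attribute == v]) for v in (0, 1))
print("information gain:", before - after)   # ID3 picks the attribute maximising this
```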

**Decision tree**

- Good interpretability, though prone to overfitting: low bias but high variance
- Need to balance overfitting against accuracy

Random forests don’t overfit as much as decision trees, though a single decision tree is easier to interpret (see the sketch below).
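A quick scikit-learn comparison on stand-in data (iris), contrasting a single tree with an ensemble of 100 trees:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)                       # single tree: interpretable, high variance
forest = RandomForestClassifier(n_estimators=100, random_state=0)   # ensemble: averaging reduces variance

print("tree:", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())
```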

**Summary**

**Support vector machines**

- Large margin, slack variables
- Primal form, dual form
- Kernel trick
- Hyperparameter selection through grid search

**Decision tree and Random forest**

- ID3 algorithm: Entropy and information gain
- Overfitting reduction through ensemble
- Random Forest

**K-nearest neighbour classifier**

- Classifies a point by majority voting among its k nearest neighbours (see the sketch below)
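A minimal scikit-learn sketch (iris as a stand-in; k = 5 is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test point takes the majority label among its k nearest training points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```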

**Clustering**

- K-means
- Expectation maximisation (EM)

**Dimensionality Reduction**

- Principal component analysis (PCA)
- Linear discriminant analysis (LDA)
- Clustering was covered fully in the last module, so it is worth going back and reviewing those notes