Machine Learning 1 VO

1 Introduction

Learning

Learning is any process by which a system improves performance from experience. - Herbert Simon

Classification vs. Regression

Unsupervised vs. Supervised Learning

Unsupervised: Find interesting patterns in data. No known mapping of inputs to outputs.

Supervised: Use labeled dataset (training data) to predict labels for other data points (test input). See image below.

2 Mathematical Basics I

3 Mathematical Basics II

Gradient Descent

If step-size is sufficiently small, convergence to a local minimum. For convex functions, any local minimum is also a global minimum.

Stochastic Gradient Descent (SGD)

Sometimes one might not be able to evaluate f exactly, but just a "noisy version".

For theoretical convergence, the step-size needs to be reduced over number of iterations t.

Multivariate Gaussian Density

As this is a probability density, integrating over an area gives the corresponding probability.

4 Linear Regression

Regression: Predict a continuous target variable from one or more input variables. Hence, a form of supervised learning.

When the hypothesis space is restricted to all affine functions (y=(d=1Dwdxd)+by = (\sum_{d=1}^D w_d x_d) + b), we speak of linear regression.

Loss

A loss function (objective, cost function) is a notion of fitness. The lower loss, the better.

Least Squares Loss Function

Average the squared error over the whole data set.

L(θ)=1Ni=1N(y^iyi)2L(\theta) = \frac{1}{N} \sum_{i=1}^N ( \hat{y}_i - y_i )^2

Least-squares has an interesting connection to Gaussian density estimation and the maximum likelihood principle.

5 Non-Linear Regression, Logistic Regression

Non-Linear Features

Idea: We can apply any pre-processing (non-linear function) to the input features and add them to the design matrix! Regression with non-linear features works exactly the same as before.

Radial Basis Functions

Classification, Logistic Regression

Confusing name: It’s called logistic regression, even though it is a classification model

Minimizing Cross-Entropy

Cross-Entropy Loss is differentiable and convex, thus we can use gradient descent to find a global minimum.

6 Principal Component Analysis

PCA is a unsupervised learning technique for dimensionality reduction. Idea of PCA: Learn an "interesting" subspace V.

Targets and Applications of Dimensionality Reduction

Potential target characteristics (qualitative):

Applications of dimensionality reduction:

Principal Directions

Which of the three directions b', b'', b''' is the most "interesting" one? Answer: b' is the direction where

  1. the variance of the projected data is maximal
  2. or, equivalently, the sum of squared projection errors is minimal

The vector b' is called the first principal direction of dataset.

Computing Principal Components

Linear Discriminant Analysis

7 Neural Networks I

Excellent Tutorial by 3Blue1Brown https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

Neuron

Multilayer Perceptron (MLP)

Common Activation Functions Φ\Phi

Why do we need non-linear activation functions Φ\Phi?

  1. Linear functions often not adequate for real-world datasets.
  2. Plug-and-play philosophy of ANNs: Construct complicated functions out of simple ones. This does not work with linear neurons!

Expressive Efficiency

Automatic Differentiation (AD) (autodiff) (backpropagation)

Neural networks are powerful function approximators, with hundreds to hundreds of billions of parameters. How do we learn these parameters? Gradient descent of the loss function!

In neural networks a special case of AD, namely reverse mode AD is usually called backpropagation of error (backprop)

8 Model selection, Evaluation

We never observe the "true" pattern or principle, always just an imperfect and noisy version of it: the training data D. We use the training data to guide our search for a model (hypothesis), in the hope that it comes close to the ground truth.

Overfitting

When using a too large (powerful) model class H, we run risk to select a model which fits the data D well, but not the true underlying concept. We say "the model overfits to the data and does not generalize well."

Occam’s Razor, Lex Parsimoniae: considering competing hypotheses with similar predictions and explanatory power, one should prefer the hypothesis that requires the fewest assumptions.

Underfitting

The converse effect of overfitting is underfitting - using a too weak model class, such that the model does fit neither the training data nor the true underlying concept well.

Underfitting is usually not much of a problem, since it is rather easy to increase the model power. Overfitting requires more sophisticated strategies.

Model Evaluation

After optimization, the empirical test loss is an overly optimistic estimate of the true loss. How can we better estimate the true loss and detect overfitting? We need two datasets:

Model Selection

Most ML models have several hyper-parameters:

When selecting a model according to minimal test error, we are actually learning on the test set!

Really, we need three datasets:

Cross-Validation

Reserving samples for the validation set is called the holdout method. The data is split once into training and validation set. In cross-validation, we split the data multiple times into training and validation set. The validation results for each split are averaged.

K-fold cross-validation is the most widely used method for cross-validation. Randomly split the data in K parts (folds). Train in turn on all except the kth fold and compute the loss. The final validation loss is the mean of the k losses.

Alternatives:

9 Neural Networks II

Autodiff in Neural Nets

Classification with MLPs

10 K-Nearest Neighbors, Decision Trees and Random Forests

Bayes Optimal Classifier

Assume a classification setting and that we know the true data-generating distribution. Then, we can use Bayes law to calculate

p(yx)=p(x,y)yp(x,y)p(y|x) = \frac{p(x, y)}{\sum_y p(x, y)}

giving us the Bayes optimal classifier which suffers the least true classification error.

y^=arg maxyp(yx)\hat{y} = \argmax_y p(y|x)

K-Nearest Neighbors (KNN)

KNN converges towards the Bayes optimal classifier given NN \rightarrow \infty, KK \rightarrow \infty, and KN0\frac{K}{N} \rightarrow 0. A classifier with this property is called consistent.

Decision Trees

Decision trees represent a partition of the input space. Decision nodes (internal nodes) are associated with one or more features and a boolean function. Prediction nodes (leaves) are associated with either a fixed label y, or a distribution over labels p(Y).

CART Algorithm

Stopping Criterion
Costs - Impurity Measures

Random Forests

Individual models (like decision trees) easily overfit, i.e. they have low bias but high variance. Idea: instead of training one model, train K models and aggregate them

Bagging = Bootstrapping and Aggregating

11 Support Vector Machines (SVM), Kernel Trick

Learning Support Vector Machines

SVMs with Soft-Margin

Dual SVM and the Kernel Trick

It can be shown that the following is equivalent (dual problem):

12 K-Means Clustering

Organize N objects in K clusters, i.e., groups of similar objects. Relation to classification: "classes" are not given, but should be found in unsupervised way.

Clustering is not well-defined and there are multiple questions:

K-Means Problem

Mixed Integer Quadratic Program (MIQP). NP-hard.

K-Means Algorithm (Lloyd’s algorithm)

Iterative algorithm alternating between

Will always converge, but does not necessarily find the global optimum. The found solution depends on the initialization and can in theory be arbitrarily bad.

K-Means++

Basic idea: use data points as initial cluster seeds µ(1), ..., µ(K), spreaded over the whole dataset.

The main phase converges much faster, so that in total k-means++ is usually faster than vanilla k-means.

Guaranteed to achieve an objective which is only a multiplicative factor of O(log K).

13 Gaussian Mixture Models (GMMs)

Gaussian mixture models (GMMs) which can be seen as a probabilistic version of k-means. Main purpose of GMMs is density estimation, i.e., approximating the true data generating distribution ptruep_\text{true}.

The model can be used for

Maximum Likelihood Principle

Gaussian Mixture Model

GMMs are universal approximators of densities, that is, any density can be approximated arbitrarily well by a GMM. This is true even when restricted Gaussian with only diagonal covariances. However, we don’t know a big K needs to be and how to set the parameters.

Expectation-Maximization (EM)

14 Graphical Models