1. Four branches of machine learning¶

We have seen three specific types of machine learning problems: binary classification, multiclass classification, and scalar regression.
All three are instances of supervised learning.
Machine learning algorithms generally fall into four broad categories, described below.

Supervised learning¶

This is by far the most common case.
It consists of learning to map input data to known targets (also called annotations), given a set of examples (often annotated by humans).
Generally, almost all applications of deep learning that are in the spotlight these days belong in this category.
- E.g., optical character recognition, speech recognition, image classification, and language translation
Although supervised learning mostly consists of classification and regression, there are more variants as well.
- Sequence generation: Given a picture, predict a caption describing it.
  
  https://cs.stanford.edu/people/karpathy/sfmltalk.pdf
- Syntax tree prediction: Given a sentence, predict its decomposition into a syntax tree.
- Object detection: Given a picture, draw a bounding box around certain objects inside the picture.
  
  https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088
- Object segmentation: Given a picture, draw a pixel-level mask on a specific object.
  
  https://medium.com/@jonathan_hui/object-detection-speed-and-accuracy-comparison-faster-r-cnn-r-fcn-ssd-and-yolo-5425656ae359

Unsupervised learning¶

Unsupervised learning consists of finding interesting transformations of the input data without the help of any targets, for the purposes of:
- data visualization
- data compression
- data denoising
- better understanding the correlations present in the data at hand
Dimensionality reduction and clustering are well-known categories of unsupervised learning.

Self-supervised Learning¶

Self-supervised learning is a specific instance of supervised learning: it is supervised learning without human-annotated labels.
There are still labels involved, but they are generated from the input data.
Examples
- autoencoders where the generated targets are the input
- predicting the next frame in a video, given past frames
- predicting the next word in a text, given previous words
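As a minimal sketch of how labels can be generated from the input itself, the snippet below builds (context, next word) training pairs from raw text. The helper name `next_word_pairs` and the context size of 2 are assumptions chosen for illustration; they do not come from the original material.

```python
def next_word_pairs(text, context_size=2):
    """Build (context, target) pairs where the target is the word following the context."""
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])  # the previous words
        pairs.append((context, words[i]))           # the label is taken from the input itself
    return pairs


pairs = next_word_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

No human annotation is involved here: every target is simply the next word of the same text.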

Reinforcement learning¶

Reinforcement learning started to get a lot of attention after Google DeepMind successfully applied it to learning to play Atari games.
- https://www.youtube.com/watch?v=V1eYniJ0Rnk&vl=en
In RL, an agent receives information about its environment and learns to choose actions that will maximize some reward.
- For example, a neural network that "looks" at a video-game screen and outputs game actions in order to maximize its score can be trained via RL.
It can be applied to a large range of real-world applications:
- self-driving cars, robotics, resource management, education, and so on.

2. Evaluating machine-learning models¶

In the previous examples, we split the data into a training set, a validation set, and a test set.
In machine learning, the goal is to achieve models that generalize - that perform well on never-before-seen data - and overfitting is the central obstacle.
Here, we focus on how to measure generalization, that is, how to evaluate machine-learning models.

Training, validation, and test sets¶

Evaluating a model comes down to splitting the available data into three sets: training, validation, and test.
- We train on the training data and evaluate our model on the validation data.
- Once the model is ready, we test it one final time on the test data.
Why not have just two sets: a training set and a test set?
The reason is that developing a model always involves tuning its configuration.
- For example, choosing the number of layers or the size of the layers
  - They are called the hyperparameters of the model, to distinguish them from the parameters, which are the network's weights.
- We do this tuning by using the performance of the model on the validation data.
- This tuning is a form of learning: a search for a good configuration in some parameter space.
- As a result, tuning the configuration of the model can quickly result in overfitting to the validation set, even though your model is never directly trained on it.
Central to this phenomenon is the notion of information leak.
- Every time you tune a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data leaks into the model.
As a result, the model can perform artificially well on the validation set, and this does not guarantee similar performance on the test set. This is why a separate test set, never used during tuning, is needed.

Simple hold-out validation

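The simplest evaluation protocol: set apart some fraction of the data as a validation set, train on the remaining data, and evaluate on the held-out part.
- If little data is available, then the validation and test sets may contain too few samples to be statistically representative of the data at hand.
  - This is easy to recognize: if different random shuffling rounds of the data before splitting yield very different performance measures, you have this issue.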
Code example

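# Hold-out validation.
# `data`, `test_data`, and `get_model()` are placeholders for your own
# dataset (NumPy arrays) and model factory.
import numpy as np

num_validation_samples = 10000

# Shuffling the data is usually appropriate.
np.random.shuffle(data)

# The first samples form the validation set, the rest form the training set.
validation_data = data[:num_validation_samples]
training_data = data[num_validation_samples:]

# Train on the training data and evaluate on the validation data.
model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# At this point you can tune your model:
# retrain it, evaluate it, tune it again...

# Once the hyperparameters are tuned, train the final model from scratch
# on all non-test data, then evaluate it once on the test data.
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)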

k-fold validation
- Split the data into k partitions of equal size.
- For each partition i, train a model on the remaining k-1 partitions, and evaluate it on partition i.
- The final score is the average of the k scores.

Code example

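# k-fold cross-validation.
# `data`, `test_data`, and `get_model()` are placeholders for your own
# dataset (NumPy arrays) and model factory.
import numpy as np

k = 4
num_validation_samples = len(data) // k

np.random.shuffle(data)

validation_scores = []
for fold in range(k):
    # Select the validation partition for this fold.
    validation_data = data[num_validation_samples*fold : num_validation_samples*(fold+1)]
    # Use the rest as training data. np.concatenate joins the two slices;
    # `+` on NumPy arrays would add them element-wise instead.
    training_data = np.concatenate([data[:num_validation_samples*fold],
                                    data[num_validation_samples*(fold+1):]])

    # Create a brand-new (untrained) model for each fold.
    model = get_model()
    model.train(training_data)
    validation_score = model.evaluate(validation_data)
    validation_scores.append(validation_score)

# The final validation score is the average over the k folds.
validation_score = np.average(validation_scores)

# Train the final model on all non-test data, then evaluate it on the test data.
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)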

Iterated k-fold validation with shuffling
- Apply k-fold validation multiple times, shuffling the data every time before splitting it k ways.
- The final score is the average of the scores obtained at each run of k-fold validation.
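The two steps above can be sketched as follows. This is only an illustration under stated assumptions: `iterated_k_fold` and `mean_baseline_score` are hypothetical helpers, and the "model" is a trivial stand-in that predicts the training mean and reports negative mean squared error as its score.

```python
import numpy as np


def iterated_k_fold(data, targets, train_and_score, k=4, iterations=3, seed=0):
    """Average the validation score over several runs of k-fold, reshuffling before each run."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(iterations):
        indices = rng.permutation(len(data))   # new shuffle for every run
        folds = np.array_split(indices, k)     # k roughly equal partitions
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(train_and_score(data[train_idx], targets[train_idx],
                                          data[val_idx], targets[val_idx]))
    return float(np.mean(scores))


def mean_baseline_score(x_train, y_train, x_val, y_val):
    """Hypothetical stand-in model: predict the training mean, return negative MSE."""
    prediction = y_train.mean()
    return -float(np.mean((y_val - prediction) ** 2))


x = np.arange(20, dtype=float).reshape(20, 1)
y = 2.0 * x[:, 0] + 1.0
final_score = iterated_k_fold(x, y, mean_baseline_score, k=4, iterations=3)
print(final_score)
```

Note that this trains `k * iterations` models, so the protocol can be expensive.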

Things to keep in mind¶

Data representativeness
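- Both the training set and the test set should be representative of the data. If the samples are sorted by class, a plain split can leave some classes out of the training set entirely.
- For this reason, you should usually randomly shuffle the data before splitting it.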
The arrow of time
- If you are trying to predict the future given the past, you should not randomly shuffle the data before splitting it.
Redundancy in your data
- If some data points appear twice, then shuffling and splitting can put copies of the same point in both the training set and the validation set, so the measured performance is over-estimated.
- Make sure your training set and validation set are disjoint.
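A minimal sketch of the last two points, assuming each sample carries a timestamp and can be compared row by row. The helper names `chronological_split` and `count_shared_rows` are hypothetical and used only for illustration.

```python
import numpy as np


def chronological_split(timestamps, validation_fraction=0.2):
    """Return (train_idx, val_idx) with every validation sample later than every training sample."""
    order = np.argsort(timestamps, kind="stable")
    num_val = int(round(len(order) * validation_fraction))
    split = len(order) - num_val
    return order[:split], order[split:]


def count_shared_rows(train_rows, val_rows):
    """Count distinct rows that appear in both sets (a sign of leakage through duplicates)."""
    train_set = {tuple(row) for row in train_rows}
    val_set = {tuple(row) for row in val_rows}
    return len(train_set & val_set)


timestamps = np.array([5, 1, 4, 2, 3])
train_idx, val_idx = chronological_split(timestamps, validation_fraction=0.4)
print(timestamps[train_idx], timestamps[val_idx])

shared = count_shared_rows([[1, 2], [3, 4]], [[3, 4], [5, 6]])
print(shared)
```

A nonzero count from `count_shared_rows` means the two sets are not disjoint and the validation score is likely optimistic.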

3. Data preprocessing, feature engineering, and feature learning¶

How do we prepare the input data and targets before feeding them into a neural network?
Many data-preprocessing and feature-engineering techniques are domain specific.

Data preprocessing for neural networks¶

Vectorization
- (input, target) --> tensors of floating-point data
Value normalization
- Normalize each feature independently so that it has a mean of 0 and a standard deviation of 1.
Handling missing values
- With neural networks, it is generally safe to input missing values as 0, provided that 0 is not already a meaningful value.
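The normalization step can be sketched with NumPy as below. The helper names are hypothetical, and computing the statistics on the training data only (then reusing them for validation and test data) is a common practice assumed here rather than something stated above.

```python
import numpy as np


def fit_normalizer(train_data):
    """Compute the per-feature mean and standard deviation on the training data."""
    mean = train_data.mean(axis=0)
    std = train_data.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)  # avoid dividing by zero for constant features
    return mean, std


def normalize(data, mean, std):
    """Center each feature on 0 and scale it to unit standard deviation."""
    return (data - mean) / std


train = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])
mean, std = fit_normalizer(train)
train_normalized = normalize(train, mean, std)
print(train_normalized.mean(axis=0), train_normalized.std(axis=0))
```

The same `mean` and `std` would then be passed to `normalize` for any validation or test array.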

Feature engineering¶

The process of using our own knowledge about the data to make the algorithm work better by applying hardcoded (nonlearned) transformations to the data.
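A classic example is reading the time on a clock: the input can be the raw pixels of the image, the (x, y) coordinates of the tips of the hands, or simply the angle of each hand. The better the features, the easier the problem becomes.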
Before deep learning, feature engineering used to be critical.
- Because classical shallow algorithms did not have hypothesis spaces rich enough to learn useful features by themselves.
- E.g., MNIST --> the number of loops, the height of each digit, a histogram of pixel values, etc.
Modern deep learning removes the need for most feature engineering.
- Because neural networks are capable of automatically extracting useful features from raw data.
However, feature engineering is still important for two reasons:
- Good features allow us to solve problems more elegantly while using fewer resources.
- Good features let us solve a problem with far less data.

4. Overfitting and underfitting¶

The fundamental issue in machine learning is the tension between optimization and generalization.
- Optimization refers to the process of adjusting a model to get the best performance on the training data.
- Generalization refers to how well the trained model performs on data it has never seen before.
- The goal is to get good generalization, but we can only adjust the model based on the training data.
At the beginning of training, optimization and generalization are correlated.
- The lower the loss on training data, the lower the loss on test data.
- While this is happening, the model is said to be underfit: there is still progress to be made.
After a certain number of iterations, generalization stops improving.
- The model is starting to overfit.
To prevent overfitting, the best solution is to get more training data.
When that isn't possible, the next-best solution is to modulate the quantity of information that the model is allowed to store, or to add constraints on what information it is allowed to store.
- The process of fighting overfitting this way is called regularization.
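The turning point between underfitting and overfitting can be located from the validation loss curve. The sketch below uses made-up loss values purely for illustration; `best_epoch` is a hypothetical helper, not part of any library.

```python
import numpy as np


def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return int(np.argmin(val_losses)) + 1


# Made-up loss curves for illustration only.
train_losses = [0.60, 0.45, 0.35, 0.28, 0.22, 0.18]  # keeps decreasing
val_losses = [0.62, 0.50, 0.44, 0.43, 0.47, 0.52]    # stalls, then degrades

epoch = best_epoch(val_losses)
print(epoch)  # 4
```

In this toy curve the training loss keeps improving after epoch 4 while the validation loss gets worse, which is the signature of overfitting.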

Reducing the network's size¶

The simplest way to prevent overfitting is to reduce the size of the model: the number of learnable parameters in the model.
In deep learning, the number of learnable parameters in a model is often referred to as the model's capacity.
There is a compromise to be found between too much capacity and not enough capacity.
Unfortunately, there is no magical formula to determine the right number of layers or the right size for each layer.

Let's revisit the movie-review classification network.

The original model

from tensorflow.keras import models
from tensorflow.keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

Smaller network (low capacity)

model = models.Sequential()
model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

A comparison of the validation losses of the original network and the smaller network
Bigger model (high capacity)

model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

A comparison between the original network and the bigger network
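To make the notion of capacity concrete, the number of learnable parameters of the three networks above can be computed by hand: a Dense layer with `n` inputs and `m` units has `n * m` weights plus `m` biases. The helper `dense_param_count` is a hypothetical function written for this illustration.

```python
def dense_param_count(input_dim, layer_sizes):
    """Count the weights and biases of a stack of fully connected (Dense) layers."""
    total = 0
    for units in layer_sizes:
        total += input_dim * units + units  # weight matrix + bias vector
        input_dim = units                   # this layer's output feeds the next layer
    return total


original = dense_param_count(10000, [16, 16, 1])
smaller = dense_param_count(10000, [4, 4, 1])
bigger = dense_param_count(10000, [512, 512, 1])
print(original, smaller, bigger)  # 160305 40029 5383681
```

The bigger network has more than 30 times the parameters of the original one, which gives it far more room to memorize the training data.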

Adding weight regularization¶

The principle of Occam's razor
- Given two explanations for something, the explanation most likely to be correct is the simplest one - the one that makes fewer assumptions.
A simple model in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters).
A common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular.
This is called weight regularization.
- It is done by adding to the loss function of the network a cost associated with having large weights.
- L1 regularization
  - The cost added is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).
- L2 regularization
  - The cost added is proportional to the square of the value of the weight coefficients (the squared L2 norm of the weights).
  - It is also called weight decay in the context of neural networks.
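The two penalties can be written out in NumPy as a rough sketch. The function names, the example weights, and the `data_loss` value are made up for illustration; the factor `rate` plays the role of the coefficient passed to a regularizer.

```python
import numpy as np


def l1_penalty(weights, rate):
    """Cost proportional to the sum of the absolute values of the weights."""
    return rate * np.sum(np.abs(weights))


def l2_penalty(weights, rate):
    """Cost proportional to the sum of the squared weights."""
    return rate * np.sum(np.square(weights))


weights = np.array([[0.5, -1.0],
                    [2.0, 0.0]])
data_loss = 0.30  # hypothetical loss on the training batch

# The regularized loss is the data loss plus the penalty on the weights.
total_loss_l1 = data_loss + l1_penalty(weights, 0.001)
total_loss_l2 = data_loss + l2_penalty(weights, 0.001)
print(total_loss_l1, total_loss_l2)
```

Because the penalty grows with the size of the weights, minimizing the total loss pushes the weights toward small values.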
L2 weight regularization in Keras

from tensorflow.keras import regularizers

model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

The impact of the L2 regularization

Adding dropout¶

Dropout is one of the most effective and most commonly used regularization techniques for neural networks.
It consists of randomly dropping out (setting to zero) a number of output features of the layer during training.
- E.g., [0.2, 0.5, 1.3, 0.8, 1.1] --> (dropout) --> [0, 0.5, 1.3, 0, 1.1]
The dropout rate is the fraction of the features that are zeroed out.
At test time, no units are dropped out.
- Instead, the layer's output values are scaled down by a factor equal to (1 - dropout rate), to compensate for the fact that more units are active than at training time.
Implementation using NumPy

import numpy as np

# At training time, we zero out about 50% of the activations.
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)

# At test time, we scale down the output.
layer_output *= 0.5

Another implementation, used in practice: both operations happen at training time, and the output is left unchanged at test time.

# At training time, zero out about 50% of the activations, then scale up.
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
layer_output /= 0.5
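The snippets above hard-code a dropout rate of 0.5. A sketch that generalizes the second variant to an arbitrary rate might look like this; the function `inverted_dropout` and its signature are assumptions for illustration, not how any particular library implements dropout.

```python
import numpy as np


def inverted_dropout(layer_output, rate, rng, training=True):
    """Zero out a fraction `rate` of the activations and rescale the rest at training time."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return layer_output                 # nothing is dropped at test time
    keep_prob = 1.0 - rate
    mask = rng.random(layer_output.shape) < keep_prob
    # Dividing by keep_prob keeps the expected value of each activation unchanged.
    return layer_output * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = inverted_dropout(activations, rate=0.3, rng=rng)
print(dropped.mean())  # close to 1.0 on average
```

Scaling at training time means the test-time code path needs no dropout-specific logic at all.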

In Keras,

model.add(layers.Dropout(0.5))

Adding dropout to the IMDB network

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

5. The universal workflow of machine learning¶

Defining the problem and assembling a dataset¶

First, we must define the problem at hand:
- What will your input data be?
- What are you trying to predict?
- What type of problem are you facing?
The hypotheses you make at this stage:
- The outputs can be predicted given the inputs.
- The available data is sufficiently informative to learn the relationship between inputs and outputs.
Not all problems can be solved: having a dataset (X, Y) doesn't mean that X contains enough information to predict Y.
Keep in mind that machine learning can only be used to learn patterns that are present in the training data.
- For instance, using machine learning trained on past data to predict the future is making the assumption that the future will behave like the past.

Deciding on an evaluation protocol¶

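You must decide how you will measure your current progress. Three common evaluation protocols:
- Maintaining a hold-out validation set: the way to go when you have plenty of data
- Doing k-fold cross-validation: the right choice when you have too few samples for hold-out validation to be reliable
- Doing iterated k-fold validation: for highly accurate model evaluation when little data is available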

Preparing the data¶

Once you know what you’re training on, what you’re optimizing for, and how to evaluate your approach, you’re almost ready to begin training models.
Formatting the data
- The data should be formatted as tensors.
- The values taken by these tensors should usually be scaled to small values, for example in the [-1, 1] or [0, 1] range.
- If different features take values in different ranges, then the data should be normalized.
- Some feature engineering may be needed, especially for small-data problems.

Developing a model that does better than a baseline¶

The goal at this stage is to achieve statistical power: to develop a small model that is capable of beating a dumb baseline.
Three key choices to build the network:
- Last-layer activation
- Loss function
- Optimization configuration
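For the first two choices, commonly used defaults can be summarized in a small lookup table. The sketch below encodes these conventional pairings as Keras-style string names; the dictionary and the helper `suggest_last_layer_and_loss` are hypothetical and only meant as a starting point, not a rule.

```python
# Conventional (last-layer activation, loss) pairings per problem type.
# None means the last layer has no activation (a purely linear output).
COMMON_CHOICES = {
    "binary classification": ("sigmoid", "binary_crossentropy"),
    "multiclass, single-label classification": ("softmax", "categorical_crossentropy"),
    "multiclass, multilabel classification": ("sigmoid", "binary_crossentropy"),
    "regression to arbitrary values": (None, "mse"),
    "regression to values between 0 and 1": ("sigmoid", "mse"),
}


def suggest_last_layer_and_loss(problem_type):
    """Return the conventional (activation, loss) pair for a problem type."""
    try:
        return COMMON_CHOICES[problem_type]
    except KeyError:
        raise ValueError(f"Unknown problem type: {problem_type!r}") from None


activation, loss = suggest_last_layer_and_loss("binary classification")
print(activation, loss)  # sigmoid binary_crossentropy
```

For the movie-review network used earlier, this gives a `sigmoid` output trained with `binary_crossentropy`, matching its single sigmoid unit.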

Scaling up: developing a model that overfits¶

Once you’ve obtained a model that has statistical power, the question becomes, is your model sufficiently powerful?
Developing a model that overfits:
- Add more layers
- Make the layers bigger
- Train for more epochs
Always monitor the training loss and validation loss, as well as the training and validation values for any metrics you care about.

Regularizing the model and tuning the hyperparameters¶

Repeatedly modify the model, train it, evaluate on the validation data, again and again.
We can try:
- Add dropout
- Try different architectures
- Add regularization terms
- Try different hyperparameters
- Optionally, iterate on feature engineering
Keep in mind that every time you use feedback from your validation process to tune your model, you leak information about the validation process into the model.
Once you’ve developed a satisfactory model configuration, you can train your final production model on all the available data (training and validation) and evaluate it one last time on the test set.
