1. Four branches of machine learning¶
- We have seen three specific types of machine learning problems: binary classification, multiclass classification, and scalar regression.
- All three are instances of supervised learning.
- Machine learning algorithms generally fall into four broad categories, described below.
Supervised learning¶
- The most common case
- It consists of learning to map input data to known targets (also called annotations), given a set of examples (often annotated by humans).
- Almost all applications of deep learning that are in the spotlight these days belong in this category.
- E.g., optical character recognition, speech recognition, image classification, and language translation
- Although supervised learning mostly consists of classification and regression, there are more variants as well:
- Sequence generation: Given a picture, predict a caption describing it.
- Syntax tree prediction: Given a sentence, predict its decomposition into a syntax tree.
- Object detection: Given a picture, draw a bounding box around certain objects inside the picture.
https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088
- Object segmentation: Given a picture, draw a pixel-level mask on a specific object.
Unsupervised learning¶
- Finding interesting transformations of the input data without the help of any targets, for the purpose of:
- data visualization
- data compression
- data denoising
- better understanding of the correlations present in the data at hand
- Dimensionality reduction and clustering are well-known categories of unsupervised learning.
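- As a minimal illustration (a sketch only, assuming scikit-learn is available and using randomly generated data in place of a real dataset), both techniques run without any labels:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(200, 10)                                    # 200 unlabeled samples, 10 features
X_2d = PCA(n_components=2).fit_transform(X)                    # dimensionality reduction (e.g., for visualization)
cluster_ids = KMeans(n_clusters=3, n_init=10).fit_predict(X)   # clustering without any targets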
Self-supervised learning¶
- A specific instance of supervised learning
- Self-supervised learning is supervised learning without human-annotated labels.
- There are still labels involved, but they are generated from the input data.
- Examples
- autoencoders where the generated targets are the input
- predicting the next frame in a video, given past frames
- predicting the next word in a text, given previous words
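- A minimal sketch of the next-word case (with a made-up token list): the targets are generated from the input text itself, with no human annotation.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
# Each (input, target) pair uses the next word in the sequence as the label
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g., (["the", "cat", "sat"], "on")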
Reinforcement learning¶
- Reinforcement learning started to get a lot of attention after Google DeepMind successfully applied it to learning to play Atari games.
- In RL, an agent receives information about its environment and learns to choose actions that will maximize some reward.
- For example, a neural network that "looks" at a video-game screen and outputs game actions in order to maximize its score can be trained via RL.
- RL can be applied to a large range of real-world applications: self-driving cars, robotics, resource management, education, and so on.
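- A toy sketch of the agent-environment loop described above (the environment, policy, and reward here are made up purely for illustration):
import random

def env_step(state, action):
    """Toy environment: the agent is rewarded for moving right until position 10."""
    next_state = state + (1 if action == "right" else -1)
    reward = 1.0 if (action == "right" and next_state <= 10) else -1.0
    return next_state, reward

state, total_reward = 0, 0.0
for _ in range(20):
    action = random.choice(["left", "right"])   # a real RL agent would learn this policy
    state, reward = env_step(state, action)
    total_reward += reward                      # the quantity an RL agent learns to maximize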
2. Evaluating machine-learning models¶
- In the previous examples, we split the data into a training set, a validation set, and a test set.
- In machine learning, the goal is to achieve models that generalize - that perform well on never-before-seen data - and overfitting is the central obstacle.
- Here, we will focus on how to measure generalization: how to evaluate machine-learning models.
Training, validation, and test sets¶
- Splitting the available data into three sets: training, validation, and test.
- We train on the training data and evaluate our model on the validation data.
- Once the model is ready, we test it one final time on the test data.
- Why not have just two sets: a training set and a test set?
- The reason is that developing a model always involves tuning its configuration.
- For example, choosing the number of layers or the size of the layers
- Such configuration choices are called the hyperparameters of the model, to distinguish them from the parameters, which are the network's weights.
- We do this tuning by using the performance of the model on the validation data.
- This tuning is a form of learning: a search for a good configuration in some parameter space.
- As a result, tuning the configuration of the model can quickly result in overfitting to the validation set, even though your model is never directly trained on it.
- Central to this phenomenon is the notion of information leak.
- Every time you tune a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data leaks into the model.
- A model that performs artificially well on the validation set is not guaranteed to perform as well on the test set.
- Simple hold-out validation
- The simplest evaluation protocol
- One flaw: if little data is available, then the validation and test sets may contain too few samples to be statistically representative of the data at hand.
- This is easy to recognize: if different random shuffling rounds of the data before splitting yield very different measures of model performance, you have this issue.
- Code example (pseudocode: get_model(), train(), and evaluate() are placeholders, and test_data is assumed to be held out separately)
import numpy as np

# Hold-out validation
num_validation_samples = 10000

np.random.shuffle(data)                           # shuffle before splitting

validation_data = data[:num_validation_samples]   # define the validation set
data = data[num_validation_samples:]
training_data = data[:]                           # define the training set

model = get_model()
model.train(training_data)
validation_score = model.evaluate(validation_data)

# At this point you can tune your model, retrain it, and evaluate it again.

# Once the hyperparameters are tuned, train a final model on all non-test data.
model = get_model()
model.train(np.concatenate([training_data, validation_data]))
test_score = model.evaluate(test_data)
- k-fold validation
- Split the data into k partitions of equal size.
- For each partition i, train a model on the remaining k-1 partitions, and evaluate it on partition i.
- Then, the final score is the average of the k scores obtained.
- Code example (same pseudocode conventions as above; data is assumed to be a Python list)
k = 4
num_validation_samples = len(data) // k

np.random.shuffle(data)

validation_scores = []
for fold in range(k):
    # Select partition `fold` as the validation data
    validation_data = data[num_validation_samples * fold:
                           num_validation_samples * (fold + 1)]
    # Use the remaining k-1 partitions as training data (list concatenation)
    training_data = (data[:num_validation_samples * fold] +
                     data[num_validation_samples * (fold + 1):])
    model = get_model()
    model.train(training_data)
    validation_scores.append(model.evaluate(validation_data))

# The final validation score is the average over the k folds
validation_score = np.average(validation_scores)

# Train the final model on all non-test data
model = get_model()
model.train(data)
test_score = model.evaluate(test_data)
- Iterated k-fold validation with shuffling
- Applying k-fold validation multiple times, shuffling the data every time before splitting it k ways
- The final score is the average of the scores obtained at each run of k-fold validation.
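- A minimal sketch, reusing the pseudocode conventions of the examples above (get_model(), train(), and evaluate() are placeholders, and data is assumed to be a Python list):
p = 3                                    # number of shuffling iterations
k = 4
num_validation_samples = len(data) // k
all_scores = []
for _ in range(p):
    np.random.shuffle(data)              # reshuffle before every k-fold run
    for fold in range(k):
        validation_data = data[num_validation_samples * fold:
                               num_validation_samples * (fold + 1)]
        training_data = (data[:num_validation_samples * fold] +
                         data[num_validation_samples * (fold + 1):])
        model = get_model()
        model.train(training_data)
        all_scores.append(model.evaluate(validation_data))
final_score = np.average(all_scores)     # average over all p * k runs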
Things to keep in mind¶
- Data representativeness
- If the data is sorted (for example, by class), a naive split may leave some classes out of the training or validation set.
- For this reason, you should usually randomly shuffle the data before splitting it (see the sketch after this list).
- The arrow of time
- If you are trying to predict the future given the past, you should not randomly shuffle the data before splitting it.
- Redundancy in your data
- If some data points appear twice and end up in both the training and validation sets, the measured performance will be overestimated.
- Make sure your training set and validation set are disjoint.
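- A minimal sketch contrasting the two kinds of splits (the data and the 80/20 ratio are hypothetical):
import numpy as np

data = np.arange(1000)                  # stand-in for samples, possibly sorted by class
num_train = int(0.8 * len(data))

# i.i.d. data: shuffle before splitting so all classes appear in both sets
shuffled = np.random.permutation(data)
train_data, val_data = shuffled[:num_train], shuffled[num_train:]

# Temporal data: keep the original order so the validation data is strictly "in the future"
train_t, val_t = data[:num_train], data[num_train:]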
3. Data preprocessing, feature engineering, and feature learning¶
- How do we prepare the input data and targets before feeding them into a neural network?
- Many data-preprocessing and feature-engineering techniques are domain specific.
Data preprocessing for neural networks¶
- Vectorization
- (input, target) --> tensors of floating-point data
- Value normalization
- Normalize each feature independently so that it has a mean of 0 and a standard deviation of 1 (see the sketch after this list).
- Handling missing values
- With neural networks, it is generally safe to input missing values as 0, as long as 0 is not already a meaningful value; the network will learn that 0 means missing data.
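- A minimal NumPy sketch of the normalization and missing-value steps above (x is a hypothetical 2D array of shape (samples, features)):
import numpy as np

x = np.random.randn(500, 20)            # hypothetical (samples, features) data
x[::50, 3] = np.nan                     # pretend a few values are missing

x = np.nan_to_num(x, nan=0.0)           # input missing values as 0

x -= x.mean(axis=0)                     # mean 0 for each feature
x /= x.std(axis=0)                      # standard deviation 1 for each feature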
Feature engineering¶
- The process of using our own knowledge about the data to make the algorithm work better by applying hardcoded (nonlearned) transformations to the data.
- Example: reading the time from an image of a clock (see the sketch at the end of this section).
- Before deep learning, feature engineering used to be critical.
- Because classical shallow algorithms did not have hypothesis spaces rich enough to learn useful features by themselves.
- E.g., MNIST --> the number of loops, the height of each digit, a histogram of pixel values, etc.
- Modern deep learning removes the need for most feature engineering.
- Because neural networks are capable of automatically extracting useful features from raw data.
- However, feature engineering is still important for two reasons:
- Good features allow us to solve problems more elegantly while using fewer resources.
- Good features let us solve a problem with far less data.
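- A minimal sketch of the clock example mentioned above: instead of feeding raw pixels, hand-engineer the angle of each clock hand from (hypothetical) hand-tip coordinates.
import math

def hand_angle(x, y):
    """Clockwise angle (degrees) from 12 o'clock of a hand whose tip is at (x, y),
    measured relative to the clock center."""
    return math.degrees(math.atan2(x, y)) % 360

# Hypothetical coordinates extracted from an image
hour_angle = hand_angle(0.7, 0.7)       # 45 degrees -> hour hand roughly at 1:30
minute_angle = hand_angle(0.0, -1.0)    # 180 degrees -> minute hand on the 6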
4. Overfitting and underfitting¶
- The fundamental issue in machine learning is the tension between optimization and generalization.
- Optimization refers to the process of adjusting a model to get the best performance on the training data.
- Generalization refers to how well the trained model performs on data it has never seen before.
- The goal is to get good generalization, but we can only adjust the model based on the training data.
- At the beginning of training, optimization and generalization are correlated.
- The lower the loss on training data, the lower the loss on test data.
- While this is happening, the model is said to be underfit.
- After a certain number of iterations, generalization stops improving.
- The model is starting to overfit.
- To prevent overfitting, the best solution is to get more training data.
- When that isn't possible, the next-best solution is to modulate the quantity of information the model is allowed to store, or to add constraints on what information it is allowed to store.
- The process of fighting overfitting this way is called regularization.
Reducing the network's size¶
- The simplest way to prevent overfitting is to reduce the size of the model: the number of learnable parameters in the model.
- In deep learning, the number of learnable parameters in a model is often referred to as the model's capacity.
- There is a compromise to be found between too much capacity and not enough capacity.
- Unfortunately, there is no magical formula to determine the right number of layers or the right size for each layer.
- Let's revisit the movie-review classification network.
- The original model
from tensorflow.keras import models
from tensorflow.keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
- Smaller network (low capacity)
model = models.Sequential()
model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
- A comparison of the validation losses of the original network and the smaller network
- Bigger model (high capacity)
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
- A comparison between the original network and the bigger network
Adding weight regularization¶
- The principle of Occam's razor
- Given two explanations for something, the explanation most likely to be correct is the simplest one - the one that makes fewer assumptions.
- A simple model in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters).
- A common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular.
- This is called weight regularization.
- It is done by adding to the loss function of the network a cost associated with having large weights.
- L1 regularization
- The cost added is proportional to the absolute value of the weight coefficients.
- The L1 norm of the weights
- L2 regularization
- The cost added is proportional to the square of the value of the weight coefficients.
- The L2 norm of the weights
- It is also called weight decay in the context of neural networks.
- L2 weight regularization in Keras
from tensorflow.keras import regularizers
model = models.Sequential()
# l2(0.001) means every coefficient in the layer's weight matrix adds
# 0.001 * weight_coefficient_value ** 2 to the total loss of the network.
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
- The impact of the L2 regularization
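- For reference, Keras also provides an L1 regularizer and a combined L1+L2 regularizer, which can be passed in the same way via kernel_regularizer:
from tensorflow.keras import regularizers
regularizers.l1(0.001)                    # L1 regularization
regularizers.l1_l2(l1=0.001, l2=0.001)    # simultaneous L1 and L2 regularization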
Adding dropout¶
- Dropout is one of the most effective and most commonly used regularization techniques for neural networks.
- It consists of randomly dropping out (setting to zero) a number of output features of the layer during training.
- E.g., [0.2, 0.5, 1.3, 0.8, 1.1] --> (dropout) --> [0, 0.5, 1.3, 0, 1.1]
- The dropout rate is the fraction of the features that are zeroed out; it is usually set between 0.2 and 0.5.
- At test time, no units are dropped out.
- Instead, the layer's output values are scaled down by a factor equal to (1 - dropout rate), to balance for the fact that more units are active than at training time.
- Implementation using NumPy
# At training time, we zero out 50% of the activations in the output.
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
# At test time, we scale down the output by the same factor (0.5).
layer_output *= 0.5
- Another implementation (in practice): drop and scale up at training time, so nothing changes at test time
# At training time
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
layer_output /= 0.5   # note the scaling up here, rather than scaling down at test time
- In Keras,
model.add(layers.Dropout(0.5))
- Adding dropout to the IMDB network
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
5. The universal workflow of machine learning¶
Defining the problem and assembling a dataset¶
- First, we must define the problem at hand:
- What will your input data be?
- What are you trying to predict?
- What type of problem are you facing?
- The hypotheses you make at this stage:
- The outputs can be predicted given the inputs.
- The available data is sufficiently informative to learn the relationship between inputs and outputs.
- Not all problems can be solved: having a dataset (X, Y) doesn't mean X contains enough information to predict Y.
- Keep in mind that machine learning can only be used to learn patterns that are present in the training data.
- For instance, using machine learning trained on past data to predict the future is making the assumption that the future will behave like the past.
Deciding on an evaluation protocol¶
- How you will measure progress; three common protocols:
- Maintaining a hold-out validation set
- Doing k-fold cross validation
- Doing iterated k-fold validation
Preparing the data¶
- Once you know what you’re training on, what you’re optimizing for, and how to evaluate your approach, you’re almost ready to begin training models.
- Formatting the data
- The data should be formatted as tensors.
- The values taken by these tensors should be scaled to small values.
- If different features take values in different ranges, then the data should be normalized.
- Some feature engineering may be needed, especially for small-data problems.
Developing a model that does better than a baseline¶
- Developing a small model that is capable of beating a dumb baseline; such a model is said to have statistical power.
- Three key choices to build the network:
- Last-layer activation
- Loss function
- Optimization configuration
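- For instance, for the binary movie-review classification problem used throughout these notes, reasonable choices are a sigmoid last-layer activation, a binary crossentropy loss, and rmsprop with its default settings (a sketch, not the only valid configuration):
from tensorflow.keras import models, layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(1, activation='sigmoid'))   # last-layer activation

model.compile(optimizer='rmsprop',                 # optimization configuration
              loss='binary_crossentropy',          # loss function
              metrics=['accuracy'])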
Scaling up: developing a model that overfits¶
- Once you’ve obtained a model that has statistical power, the question becomes, is your model sufficiently powerful?
- Developing a model that overfits:
- Add more layers
- Make the layers bigger
- Train for more epochs
- Always monitor the training loss and validation loss, as well as the training and validation values for any metrics you care about.
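- A minimal sketch of such monitoring in Keras (x_train, y_train, x_val, y_val are assumed to already exist, and model is a compiled model as above):
# fit() returns a History object that records per-epoch metrics
history = model.fit(x_train, y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

train_loss = history.history['loss']        # training loss per epoch
val_loss = history.history['val_loss']      # validation loss per epoch
# When val_loss starts rising while loss keeps falling, the model has begun to overfit.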
Regularizing the model and tuning the hyperparameters¶
- Repeatedly modify the model, train it, and evaluate it on the validation data, again and again.
- We can try:
- Add dropout
- Try different architectures
- Add regularization terms
- Try different hyperparameters
- Optionally, iterate on feature engineering
- Keep in mind that every time you use feedback from your validation process to tune your model, you leak information about the validation process into the model.
- Once you’ve developed a satisfactory model configuration, you can train your final production model on all the available data (training and validation) and evaluate it one last time on the test set.