
fundamentals of machine learning

by 볼록티 2020. 6. 20.

 

 

 

 

 1. Four branches of machine learning

 
  • We have seen three specific types of machine learning problems: binary classification, multiclass classification, and scalar regression.
  • All three are instances of supervised learning.
  • Machine learning algorithms generally fall into four broad categories, described below.
 

Supervised learning

 
 

Unsupervised learning

 
  • Finding interesting transformations of the input data without the help of any targets for the purpose of:

    • data visualization
    • data compression
    • data denoising
    • to better understand the correlations present in the data at hand
  • Dimensionality reduction and clustering are well-known categories of unsupervised learning.

 

Self-supervised Learning

 
  • A specific instance of supervised learning
  • Self-supervised learning is supervised learning without human-annotated labels.
  • There are still labels involved, but they are generated from the input data.
  • Examples
    • autoencoders where the generated targets are the input
    • predicting the next frame in a video, given past frames
    • predicting the next word in a text, given previous words
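The next-word example can be sketched in a few lines: the targets are cut out of the raw text itself, so no human annotation is involved. The helper name `make_next_word_pairs` and its `context_size` parameter are illustrative assumptions, not part of the original notes.

```python
def make_next_word_pairs(text, context_size=3):
    """Build (context, target) pairs where the target is the next word.

    No human annotation is needed: every label comes from the text itself.
    """
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = words[i - context_size:i]
        target = words[i]
        pairs.append((context, target))
    return pairs


pairs = make_next_word_pairs("the cat sat on the mat", context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Each printed pair is one supervised training example whose label was generated from the input data.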
 

Reinforcement learning

 
  • Reinforcement learning started to get a lot of attention after Google DeepMind successfully applied it to learning to play Atari games.

  • In RL, an agent receives information about its environment and learns to choose actions that will maximize some reward.

    • For example, a neural network that "looks" at a video-game screen and outputs game actions in order to maximize its score can be trained via RL.
  • It can be applied to a large range of real-world applications:

    • self-driving cars, robotics, resource management, education, and so on.
 

2. Evaluating machine-learning models

 
  • In the previous examples, we split the data into a training set, a validation set, and a test set.

  • In machine learning, the goal is to achieve models that generalize - that perform well on never-before-seen data - and overfitting is the central obstacle.

  • Here, we will focus on how to measure generalization: how to evaluate machine-learning models.

 

Training, validation, and test sets

 
  • Splitting the available data into three sets: training, validation, and test.

    • We train on the training data and evaluate our model on the validation data.
    • Once the model is ready, we test it one final time on the test data.
  • Why not have just two sets: a training set and a test set?

  • The reason is that developing a model always involves tuning its configuration.

    • For example, choosing the number of layers or the size of the layers
      • They are called the hyperparameters of the model, to distinguish them from the parameters, which are the network's weights.
    • We do this tuning by using the performance of the model on the validation data.
    • This tuning is a form of learning: a search for a good configuration in some parameter space.
    • As a result, tuning the configuration of the model can quickly result in overfitting to the validation set, even though your model is never directly trained on it.
  • Central to this phenomenon is the notion of information leak.

    • Every time you tune a hyperparameter of your model based on the model's performance on the validation set, some information about the validation data leaks into the model.
  • A model that performs artificially well on the validation set does not guarantee similar performance on the test set.
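The information leak can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data is synthetic, the labels are coin flips that no model can truly predict, and "tuning" is reduced to trying random linear rules and keeping the one with the best validation accuracy. All names below are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labels are pure coin flips, independent of the features, so no model can
# truly do better than 50% accuracy on unseen data.
num_features = 20
x_val = rng.normal(size=(200, num_features))
y_val = rng.integers(0, 2, size=200)
x_test = rng.normal(size=(10000, num_features))
y_test = rng.integers(0, 2, size=10000)


def accuracy(weights, x, y):
    """Accuracy of the linear rule 'predict 1 when x @ weights > 0'."""
    predictions = (x @ weights > 0).astype(int)
    return float(np.mean(predictions == y))


# "Tuning": try many random configurations and keep the one that scores
# best on the validation set.
candidates = rng.normal(size=(500, num_features))
val_scores = [accuracy(w, x_val, y_val) for w in candidates]
best = candidates[int(np.argmax(val_scores))]

best_val_accuracy = max(val_scores)
test_accuracy = accuracy(best, x_test, y_test)
print(f"validation accuracy of the selected configuration: {best_val_accuracy:.3f}")
print(f"test accuracy of the same configuration:           {test_accuracy:.3f}")
```

The selected configuration looks better than chance on the validation set only because it was chosen for that score; on the untouched test set it falls back to roughly 50%.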

 
  • Simple hold-out validation

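    • The simplest evaluation protocol: set apart some fraction of the data as a validation set, train on the rest, and evaluate on the held-out part.

      • If little data is available, then the validation and test sets may contain too few samples to be statistically representative of the data at hand.
        • This is easy to detect: if different random shuffling rounds of the data before splitting yield very different measures of model performance, the protocol suffers from this issue.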
    • Code example

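      # Hold-out validation. `data`, `test_data`, and `get_model()` are
      # placeholders assumed to be defined elsewhere.
      import numpy as np

      num_validation_samples = 10000

      # Shuffle before splitting so that both subsets are representative.
      np.random.shuffle(data)

      validation_data = data[:num_validation_samples]
      training_data = data[num_validation_samples:]

      # Train on the training split and score on the held-out split.
      model = get_model()
      model.train(training_data)
      validation_score = model.evaluate(validation_data)

      # At this point you can tune your model, retrain it, and evaluate it again.

      # Once the hyperparameters are fixed, train a fresh model on all
      # non-test data and evaluate it once on the test set.
      model = get_model()
      model.train(np.concatenate([training_data, validation_data]))
      test_score = model.evaluate(test_data)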
    
 
  • k-fold validation

    • Split the data into k partitions of equal size.
    • For each partition i, train a model on the remaining k-1 partitions, and evaluate it on partition i.
    • Then, the average of the k scores is obtained as the final score.

  • Code example

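    # `data`, `test_data`, and `get_model()` are placeholders assumed to be
    # defined elsewhere.
    import numpy as np

    k = 4
    num_validation_samples = len(data) // k

    np.random.shuffle(data)

    validation_scores = []
    for fold in range(k):
      # Select the validation partition for this fold.
      validation_data = data[num_validation_samples*fold : num_validation_samples*(fold+1)]
      # Use everything else for training. np.concatenate joins the two
      # slices; `+` would add NumPy arrays element-wise instead.
      training_data = np.concatenate([data[:num_validation_samples*fold],
                                      data[num_validation_samples*(fold+1):]])

      # Train a fresh, untrained model for every fold.
      model = get_model()
      model.train(training_data)
      validation_score = model.evaluate(validation_data)
      validation_scores.append(validation_score)

    # The final validation score is the average over the k folds.
    validation_score = np.average(validation_scores)

    # Train the final model on all non-test data.
    model = get_model()
    model.train(data)
    test_score = model.evaluate(test_data)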
    
 
  • Iterated k-fold validation with shuffling

    • Applying k-fold validation multiple times, shuffling the data every time before splitting it k ways
    • The final score is the average of the scores obtained at each run of k-fold validation.
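Iterated k-fold validation with shuffling can be sketched as follows. The nearest-centroid classifier is only a toy stand-in for "train a model, then evaluate it", the data is synthetic, and the function names are assumptions made for this illustration.

```python
import numpy as np


def nearest_centroid_score(train_x, train_y, val_x, val_y):
    """Toy stand-in for 'train a model, then evaluate it': returns accuracy."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Distance from every validation sample to every class centroid.
    distances = np.linalg.norm(val_x[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[np.argmin(distances, axis=1)]
    return float(np.mean(predictions == val_y))


def iterated_k_fold(x, y, k=4, iterations=3, seed=0):
    """Average validation score over `iterations` shuffled runs of k-fold."""
    rng = np.random.default_rng(seed)
    run_scores = []
    for _ in range(iterations):
        # Reshuffle before every run so that each run uses different folds.
        order = rng.permutation(len(x))
        folds = np.array_split(order, k)
        fold_scores = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            fold_scores.append(
                nearest_centroid_score(x[train_idx], y[train_idx], x[val_idx], y[val_idx])
            )
        run_scores.append(float(np.mean(fold_scores)))
    return float(np.mean(run_scores)), run_scores


# Two well-separated Gaussian blobs as synthetic data.
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-2.0, 1.0, size=(100, 2)), rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

final_score, run_scores = iterated_k_fold(x, y, k=4, iterations=3)
print("score per run:", run_scores)
print("final score:", final_score)
```

This trains `iterations * k` models in total, so the protocol is expensive; it is mostly worth it when little data is available and the evaluation needs to be as precise as possible.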
 

Things to keep in mind

 
  • Data representativeness

    • If the data is sorted by class, taking the last portion as the validation set yields a split that does not represent every class.
    • For this reason, the data is usually shuffled randomly before splitting it.
  • The arrow of time

    • If you are trying to predict the future given the past, you should not randomly shuffle the data before splitting it.
  • Redundancy in your data

    • If some data points appear twice and the copies land in both the training and validation sets, you are in effect evaluating on training data, so the performance might be overestimated.
    • Make sure your training set and validation set are disjoint.
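The three pitfalls above can each be shown on tiny synthetic arrays. This is a sketch only; the array contents and variable names are invented for the illustration.

```python
import numpy as np

# Synthetic labels sorted by class: 80 samples of class 0, then 20 of class 1.
labels = np.array([0] * 80 + [1] * 20)
num_validation = 20

# 1) Data representativeness: without shuffling, the validation split
#    (the last 20 samples) contains only class 1.
unshuffled_validation = labels[-num_validation:]

rng = np.random.default_rng(0)
shuffled_validation = rng.permutation(labels)[-num_validation:]
print("classes without shuffling:", np.unique(unshuffled_validation))
print("class counts with shuffling:", np.bincount(shuffled_validation, minlength=2))

# 2) The arrow of time: keep chronological order and validate on the most
#    recent samples, so that training never sees the future.
timestamps = np.arange(100)  # already in time order
train_times = timestamps[:-num_validation]
validation_times = timestamps[-num_validation:]

# 3) Redundancy: drop duplicated rows before splitting, so that the same
#    sample cannot land in both the training and the validation set.
samples = np.array([[1, 2], [3, 4], [1, 2], [5, 6]])
unique_samples = np.unique(samples, axis=0)
print("rows before / after deduplication:", len(samples), len(unique_samples))
```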
 

3. Data preprocessing, feature engineering, and feature learning

 
  • How do we prepare the input data and targets before feeding them into a neural network?
  • Many data-preprocessing and feature-engineering techniques are domain specific.
 

Data preprocessing for neural networks

 
  • Vectorization

    • (input, target) --> tensors of floating-point data
  • Value normalization

    • Normalize each feature independently so that it has a mean of 0 and a standard deviation of 1.
  • Handling missing values

    • With neural networks, it is generally safe to input missing values as 0, provided that 0 is not already a meaningful value.
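Value normalization can be sketched with NumPy as follows. The data is synthetic, and reusing the training-set statistics for the test set is a common practice assumed here rather than something stated in the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features on very different scales (columns around 1, 100, 10000).
train_data = rng.normal(loc=[1.0, 100.0, 10000.0], scale=[0.5, 20.0, 3000.0], size=(1000, 3))
test_data = rng.normal(loc=[1.0, 100.0, 10000.0], scale=[0.5, 20.0, 3000.0], size=(200, 3))

# Statistics are computed on the training data only and then reused for
# the test data, so that no information from the test set leaks in.
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

train_normalized = (train_data - mean) / std
test_normalized = (test_data - mean) / std

print("train mean per feature:", train_normalized.mean(axis=0).round(3))
print("train std per feature: ", train_normalized.std(axis=0).round(3))
```

The test features end up close to, but not exactly at, mean 0 and standard deviation 1, because they are scaled with statistics they did not contribute to.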
 

Feature engineering

 
  • The process of using our own knowledge about the data to make the algorithm work better by applying hardcoded (nonlearned) transformations to the data.

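  • Example: reading the time on a clock. Taking the coordinates of the clock hands' tips, or better yet their angles, as input instead of the raw pixels of the image makes the problem far easier.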

  • Before deep learning, feature engineering used to be critical.

    • Because classical shallow algorithms did not have hypothesis spaces rich enough to learn useful features by themselves.
    • E.g., MNIST --> the number of loops, the height of each digit, a histogram of pixel values, etc.
  • Modern deep learning removes the need for most feature engineering.

    • Because neural networks are capable of automatically extracting useful features from raw data.
  • However, this is still important for two reasons:

    • Good features allow us to solve problems more elegantly while using fewer resources.
    • Good features let us solve a problem with far less data.
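As a sketch of what a hardcoded (nonlearned) transformation looks like in the clock example: once the hand tips are given as coordinates, converting them to angles makes reading the time a matter of simple arithmetic. The coordinate convention and both function names are assumptions made for this illustration.

```python
import math


def time_from_hand_tips(hour_tip, minute_tip):
    """Convert (x, y) hand-tip coordinates into (hour, minute).

    Assumes the clock centre is at (0, 0), the y axis points up, and
    12 o'clock lies along the positive y axis. Hour 0 stands for 12 o'clock.
    """
    def clockwise_angle(tip):
        x, y = tip
        # atan2(x, y) measures the angle clockwise from the positive y axis.
        return math.atan2(x, y) % (2 * math.pi)

    hour = int(clockwise_angle(hour_tip) / (2 * math.pi) * 12)
    minute = round(clockwise_angle(minute_tip) / (2 * math.pi) * 60) % 60
    return hour, minute


def hand_tip(fraction_of_turn, length=1.0):
    """Hypothetical helper: tip position after a clockwise fraction of a turn."""
    angle = 2 * math.pi * fraction_of_turn
    return (length * math.sin(angle), length * math.cos(angle))


# 6:15 -> the hour hand is 6.25/12 of a turn, the minute hand 15/60 of a turn.
print(time_from_hand_tips(hand_tip(6.25 / 12, 0.6), hand_tip(15 / 60)))
```

With angles as features, no learning is needed at all for this task, which is the point of the example: a good representation can make a hard problem trivial.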
 

4. Overfitting and underfitting

 
  • The fundamental issue in machine learning is the tension between optimization and generalization.

    • Optimization refers to the process of adjusting a model to get the best performance on the training data.
    • Generalization refers to how well the trained model performs on data it has never seen before.
    • The goal is to get good generalization, but we can only adjust the model based on the training data.
  • At the beginning of training, optimization and generalization are correlated.

    • The lower the loss on training data, the lower the loss on test data.
    • While this is happening, the model is said to be underfit.
  • After a certain number of iterations, generalization stops improving.

    • The model is starting to overfit.
  • To prevent overfitting, the best solution is to get more training data.

  • When that isn't possible, the next-best solution is to

    • modulate the quantity of information that the model is allowed to store, or
    • add constraints on what information it's allowed to store.
    • The process of fighting overfitting this way is called regularization.
 

Reducing the network's size

 
  • The simplest way to prevent overfitting is to reduce the size of the model: the number of learnable parameters in the model.

  • In deep learning, the number of learnable parameters in a model is often referred to as the model's capacity.

  • There is a compromise to be found between too much capacity and not enough capacity.

  • Unfortunately, there is no magical formula to determine the right number of layers or the right size for each layer.

  • Let's revisit the movie-review classification network.

    • The original model
      from tensorflow.keras import models
      from tensorflow.keras import layers

      model = models.Sequential()
      model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
      model.add(layers.Dense(16, activation='relu'))
      model.add(layers.Dense(1, activation='sigmoid'))
    
    • Smaller network (low capacity)
      model = models.Sequential()
      model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
      model.add(layers.Dense(4, activation='relu'))
      model.add(layers.Dense(1, activation='sigmoid'))
    
    • A comparison of the validation losses of the original network and the smaller network

    • Bigger model (high capacity)

      model = models.Sequential()
      model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
      model.add(layers.Dense(512, activation='relu'))
      model.add(layers.Dense(1, activation='sigmoid'))
    
    • A comparison between the original network and the bigger network
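To put numbers on "capacity", the learnable parameters of the three networks above can be counted by hand: a fully connected layer with `n` inputs and `m` units has `n * m` weights plus `m` biases. The helper below is a plain-Python sketch with an assumed name, not a Keras function.

```python
def dense_parameter_count(input_size, layer_sizes):
    """Count weights and biases of a stack of fully connected layers."""
    total = 0
    for units in layer_sizes:
        total += input_size * units + units  # weight matrix + bias vector
        input_size = units
    return total


original = dense_parameter_count(10000, [16, 16, 1])
smaller = dense_parameter_count(10000, [4, 4, 1])
bigger = dense_parameter_count(10000, [512, 512, 1])
print(original, smaller, bigger)  # 160305 40029 5383681
```

The bigger network has over 30 times as many parameters as the original one, which gives it far more room to memorize the training data.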

 

Adding weight regularization

 
  • The principle of Occam's razor

    • Given two explanations for something, the explanation most likely to be correct is the simplest one - the one that makes fewer assumptions.
  • A simple model in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters).

  • A common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular.

  • This is called weight regularization.

    • It is done by adding to the loss function of the network a cost associated with having large weights.
    • L1 regularization
      • The cost added is proportional to the absolute value of the weight coefficients.
      • The L1 norm of the weights
    • L2 regularization
      • The cost added is proportional to the square of the value of the weight coefficients.
      • The squared L2 norm of the weights
      • It is also called weight decay in the context of neural networks.
  • L2 weight regularization in Keras

    from tensorflow.keras import regularizers

    model = models.Sequential()
    model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu', input_shape=(10000,)))
    model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))
  • The impact of the L2 regularization
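The cost that weight regularization adds to the loss can be computed by hand. The NumPy sketch below uses a tiny made-up weight matrix and a hypothetical data-loss value; with a factor of 0.001, as in `regularizers.l2(0.001)` above, every weight coefficient `w` adds `0.001 * w**2` to the total loss.

```python
import numpy as np


def l1_penalty(weights, factor):
    """L1 cost: factor * sum of absolute weight values."""
    return factor * np.sum(np.abs(weights))


def l2_penalty(weights, factor):
    """L2 cost: factor * sum of squared weight values."""
    return factor * np.sum(np.square(weights))


weights = np.array([[1.0, -2.0], [3.0, -4.0]])
data_loss = 0.5  # hypothetical loss value before regularization

total_l1 = data_loss + l1_penalty(weights, 0.001)  # 0.5 + 0.001 * 10
total_l2 = data_loss + l2_penalty(weights, 0.001)  # 0.5 + 0.001 * 30
print(total_l1, total_l2)
```

Because the penalty is part of the loss that training minimizes, large weights are only kept when they reduce the data loss by more than they cost.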

 

Adding dropout

 
  • Dropout is one of the most effective and most commonly used regularization techniques for neural networks.

  • It consists of randomly dropping out (setting to zero) a number of output features of the layer during training.

    • E.g., [0.2, 0.5, 1.3, 0.8, 1.1] --> (dropout) --> [0, 0.5, 1.3, 0, 1.1]
  • The dropout rate is the fraction of the features that are zeroed out.

  • At test time, no units are dropped out.

    • Instead, the layer's output values are scaled down by a factor equal to (1-the dropout rate) to balance for the fact that more units are active than at training time.
  • Implementation using Numpy

    # At training time, we zero out 50% of activations.
    layer_output *= np.random.randint(0, high=2, size=layer_output.shape)

    # At test time, we scale down the output.
    layer_output *= 0.5
  • Another implementation (in practice)
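    # At training time, zero out 50% of activations and scale the rest up,
    # so that the output can be left unchanged at test time.
    layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
    layer_output /= 0.5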
  • In Keras,
    model.add(layers.Dropout(0.5))
  • Adding dropout to the IMDB network
    model = models.Sequential()
    model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(16, activation='relu'))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation='sigmoid'))
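The NumPy snippets above hardcode a 50% rate. A self-contained sketch of the second ("in practice") variant for an arbitrary dropout rate might look like this; the function name and signature are assumptions for the illustration, not the Keras implementation.

```python
import numpy as np


def dropout(layer_output, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training:
        return layer_output  # nothing is dropped or rescaled at test time
    keep_mask = rng.random(layer_output.shape) >= rate
    return layer_output * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((1000, 100))

train_out = dropout(activations, rate=0.3, training=True, rng=rng)
test_out = dropout(activations, rate=0.3, training=False, rng=rng)

print("fraction zeroed:", np.mean(train_out == 0))
print("mean activation (train):", train_out.mean())
print("mean activation (test):", test_out.mean())
```

Dividing by `1 - rate` during training keeps the expected activation the same in both modes, which is why the test-time path can return the input untouched.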
 

5. The universal workflow of machine learning

 

Defining the problem and assembling a dataset

 
  • First, we must define the problem at hand:

    • What will your input data be?
    • What are you trying to predict?
    • What type of problem are you facing?
  • The hypotheses you make at this stage:

    • The outputs can be predicted given the inputs.
    • The available data is sufficiently informative to learn the relationship between inputs and outputs.
  • Not all problems can be solved: having a dataset (X, Y) doesn't mean that X contains enough information to predict Y.

  • Keep in mind that machine learning can only be used to learn patterns that are present in the training data.

    • For instance, using machine learning trained on past data to predict the future is making the assumption that the future will behave like the past.
 

Deciding on an evaluation protocol

 
  • How you will measure the current progress
    • Maintaining a hold-out validation set
    • Doing k-fold cross validation
    • Doing iterated k-fold validation
 

Preparing the data

 
  • Once you know what you’re training on, what you’re optimizing for, and how to evaluate your approach, you’re almost ready to begin training models.

  • Formatting the data

    • The data should be formatted as tensors.
    • The values taken by these tensors should be scaled to small values.
    • If different features take values in different ranges, then the data should be normalized.
    • Some feature engineering may be needed, especially for small-data problems.
 

Developing a model that does better than a baseline

 
  • Developing a small model that is capable of beating a dumb baseline

  • Three key choices to build the network:

    • Last-layer activation
    • Loss function
    • Optimization configuration
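For a classification task, one simple "dumb baseline" is to always predict the most frequent class. The sketch below computes that accuracy on synthetic labels and lists commonly used pairings of last-layer activation and loss, written as Keras identifier strings. The pairings are conventional starting points rather than rules from these notes, and the names `majority_class_baseline` and `COMMON_CHOICES` are made up for the illustration.

```python
from collections import Counter


def majority_class_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Common (last-layer activation, loss) pairings per problem type.
COMMON_CHOICES = {
    "binary classification": ("sigmoid", "binary_crossentropy"),
    "multiclass, single-label classification": ("softmax", "categorical_crossentropy"),
    "multiclass, multilabel classification": ("sigmoid", "binary_crossentropy"),
    "regression to arbitrary values": (None, "mse"),
    "regression to values between 0 and 1": ("sigmoid", "mse"),
}

labels = [0] * 70 + [1] * 30  # synthetic, imbalanced labels
print("accuracy to beat:", majority_class_baseline(labels))
```

On these labels a model needs more than 70% accuracy before it can be said to have learned anything from the inputs.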
 

Scaling up: developing a model that overfits

 
  • Once you’ve obtained a model that has statistical power, the question becomes, is your model sufficiently powerful?

  • Developing a model that overfits:

    • Add more layers
    • Make the layers bigger
    • Train for more epochs
  • Always monitor the training loss and validation loss, as well as the training and validation values for any metrics you care about.
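Monitoring the two loss curves boils down to finding the epoch after which the validation loss stops improving. A minimal sketch follows; the loss values are made up purely for illustration and do not come from any real training run.

```python
def best_epoch(validation_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    return best_index + 1


# Made-up loss curves for illustration only: the training loss keeps
# falling, while the validation loss bottoms out and then rises.
training_losses = [0.60, 0.45, 0.35, 0.28, 0.22, 0.17, 0.13]
validation_losses = [0.62, 0.50, 0.43, 0.41, 0.44, 0.49, 0.55]

epoch = best_epoch(validation_losses)
print(f"validation loss is lowest at epoch {epoch}; later epochs overfit")
```

Once the validation loss starts to rise while the training loss keeps falling, the model has crossed into overfitting, which is the signal this step of the workflow is looking for.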

 

Regularizing the model and tuning the hyperparameters

 
  • Repeatedly modify the model, train it, evaluate on the validation data, again and again.

  • We can try:

    • Add dropout
    • Try different architectures
    • Add regularization terms
    • Try different hyperparameters
    • Optionally, iterate on feature engineering
  • Keep in mind that every time you use feedback from your validation process to tune your model, you leak information about the validation process into the model.

  • Once you’ve developed a satisfactory model configuration, you can train your final production model on all the available data (training and validation) and evaluate it one last time on the test set.

 

 

SangheumHwang[deep learning class]
