Hyperparameters: we tune them in roughly this order of priority
- $\alpha$: learning rate
- $\beta$: momentum term
- # hidden units
- Mini-batch size
- # layers
- Learning rate decay
- $\beta_1, \beta_2, \epsilon$: Adam optimization algorithm
Try random values rather than a grid search.
Coarse to fine: once a region of the search space looks promising, sample more densely within that smaller region.
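A minimal sketch of both ideas, assuming a hypothetical search over # hidden units and mini-batch size (the ranges, sample counts, and `sample_point` helper are illustrative, not from the lectures):

```python
# Random search over two hyperparameters, followed by a finer second pass.
import numpy as np

rng = np.random.default_rng(0)

def sample_point(hidden_range, batch_choices):
    """Draw one random hyperparameter setting; no grid."""
    return {
        "n_hidden": int(rng.integers(hidden_range[0], hidden_range[1] + 1)),
        "batch_size": int(rng.choice(batch_choices)),
    }

# Coarse pass: wide ranges, many random points.
coarse = [sample_point((50, 500), [32, 64, 128, 256]) for _ in range(30)]

# After evaluating each point, suppose settings near 200 hidden units with
# batch size 64 did best; re-sample more densely in that smaller region.
fine = [sample_point((150, 250), [32, 64, 128]) for _ in range(30)]
```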
Sampling at random doesn't mean sampling uniformly over the range of valid values.
We sample on a logarithmic scale so that no order of magnitude is overrepresented; sampling $\alpha$ uniformly between 0.0001 and 1, for instance, would spend about 90% of the samples between 0.1 and 1.
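For example, to sample the learning rate $\alpha$ on a log scale between 0.0001 and 1 (the range is illustrative), draw the exponent uniformly and exponentiate:

```python
import numpy as np

rng = np.random.default_rng(0)

r = rng.uniform(-4, 0)   # exponent drawn uniformly from [-4, 0]
alpha = 10 ** r          # alpha is log-uniform over [0.0001, 1]
```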
Observe that when we search $0.9000 \leq \beta \leq 0.9005$, the exponentially weighted average covers roughly the last 10 gradients at either end of the range, but when we search over $0.999 \leq \beta \leq 0.9995$, it covers roughly the last 1000 to 2000 gradients, since the effective window is about $1/(1-\beta)$.
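So it makes more sense to sample $1-\beta$ on a log scale. A minimal sketch, assuming the range $0.9 \leq \beta \leq 0.999$:

```python
import numpy as np

rng = np.random.default_rng(0)

r = rng.uniform(-3, -1)   # exponent for 1 - beta, uniform over [-3, -1]
beta = 1 - 10 ** r        # beta lies between 0.9 and 0.999, sampled densely near 1
window = 1 / (1 - beta)   # approximate number of past gradients averaged
```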
Intuitions about hyperparameter settings from one application area may or may not transfer to a different one.