Hyperparameters: we tune them in roughly this order of priority
- $\alpha$: learning rate
- $\beta$: momentum term
- # hidden units
- Mini-batch size
- # layers
- Learning rate decay
- $\beta_1, \beta_2, \epsilon$: Adam optimization algorithm
Try random values rather than a grid search.
Coarse to fine: once a region of the search space looks promising, sample more densely within that smaller region.
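A minimal sketch of both ideas, assuming a hypothetical search over # hidden units and mini-batch size (the ranges, sample counts, and `sample_point` helper are illustrative, not from the lectures):

```python
# Random search over two hyperparameters, followed by a finer second pass.
import numpy as np

rng = np.random.default_rng(0)

def sample_point(hidden_range, batch_choices):
    """Draw one random hyperparameter setting; no grid."""
    return {
        "n_hidden": int(rng.integers(hidden_range[0], hidden_range[1] + 1)),
        "batch_size": int(rng.choice(batch_choices)),
    }

# Coarse pass: wide ranges, many random points.
coarse = [sample_point((50, 500), [32, 64, 128, 256]) for _ in range(30)]

# After evaluating each point, suppose settings near 200 hidden units with
# batch size 64 did best; re-sample more densely in that smaller region.
fine = [sample_point((150, 250), [32, 64, 128]) for _ in range(30)]
```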
Sampling at random doesn't mean sampling uniformly over the range of valid values.
We sample on a logarithmic scale so that no order of magnitude is overrepresented; sampling $\alpha$ uniformly between 0.0001 and 1, for instance, would spend about 90% of the samples between 0.1 and 1.
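For example, to sample the learning rate $\alpha$ on a log scale between 0.0001 and 1 (the range is illustrative), draw the exponent uniformly and exponentiate:

```python
import numpy as np

rng = np.random.default_rng(0)

r = rng.uniform(-4, 0)   # exponent drawn uniformly from [-4, 0]
alpha = 10 ** r          # alpha is log-uniform over [0.0001, 1]
```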
Observe that when we search $0.9000 \leq \beta \leq 0.9005$, the exponentially weighted average covers roughly the last 10 gradients at either end of the range, but when we search over $0.999 \leq \beta \leq 0.9995$, it covers roughly the last 1000 to 2000 gradients, since the effective window is about $1/(1-\beta)$.
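So it makes more sense to sample $1-\beta$ on a log scale. A minimal sketch, assuming the range $0.9 \leq \beta \leq 0.999$:

```python
import numpy as np

rng = np.random.default_rng(0)

r = rng.uniform(-3, -1)   # exponent for 1 - beta, uniform over [-3, -1]
beta = 1 - 10 ** r        # beta lies between 0.9 and 0.999, sampled densely near 1
window = 1 / (1 - beta)   # approximate number of past gradients averaged
```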
Intuitions about hyperparameter settings from one application area may or may not transfer to a different one.