
Mini-batch Gradient Descent

What if we have too many training examples to process efficiently, even with a vectorized implementation?

Do we have to finish processing the entire dataset before gradient descent can take a single step? No.

It turns out you can get a faster algorithm if you let gradient descent start making progress before you finish processing your entire giant training set.

In some cases, if the scale of the features is very different, normalizing the input data will speed up the training process.
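As a quick illustration, here is a minimal sketch of input normalization, assuming X is stored as an (n_features, m) array with one column per example, following the course convention; the helper name `normalize_inputs` is my own:

```python
import numpy as np

def normalize_inputs(X):
    """Standardize each feature to zero mean and unit variance.

    Assumes X has shape (n_features, m), one column per training example.
    """
    mu = np.mean(X, axis=1, keepdims=True)      # per-feature mean
    sigma = np.std(X, axis=1, keepdims=True)    # per-feature standard deviation
    X_norm = (X - mu) / (sigma + 1e-8)          # epsilon guards against constant features
    return X_norm, mu, sigma                    # reuse mu/sigma to normalize dev/test data
```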

Mini-batch notation

$X^{\lbrace l \rbrace}$ refers to mini-batch $l$ (and $Y^{\lbrace l \rbrace}$ refers to the corresponding labels)


Suppose we have a training set of 5 million examples and we split it into 5,000 mini-batches of 1,000 examples each.


A single pass over the entire training set is called an epoch.

Processing a single mini-batch (one gradient descent step) is called an iteration.
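Here is a minimal sketch of how those mini-batches could be built, assuming X has shape (n_x, m) and Y has shape (1, m); the helper name `random_mini_batches` is my own:

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=1000, seed=0):
    """Shuffle (X, Y) and split them into mini-batches.

    Assumes X has shape (n_x, m) and Y has shape (1, m); with m = 5,000,000 and
    mini_batch_size = 1000 this yields 5,000 mini-batches.
    """
    np.random.seed(seed)
    m = X.shape[1]
    permutation = np.random.permutation(m)       # shuffle examples once per epoch
    X_shuffled = X[:, permutation]
    Y_shuffled = Y[:, permutation]

    mini_batches = []
    num_complete = m // mini_batch_size
    for t in range(num_complete):
        X_t = X_shuffled[:, t * mini_batch_size:(t + 1) * mini_batch_size]
        Y_t = Y_shuffled[:, t * mini_batch_size:(t + 1) * mini_batch_size]
        mini_batches.append((X_t, Y_t))
    if m % mini_batch_size != 0:                 # last, smaller mini-batch if m is not divisible
        X_t = X_shuffled[:, num_complete * mini_batch_size:]
        Y_t = Y_shuffled[:, num_complete * mini_batch_size:]
        mini_batches.append((X_t, Y_t))
    return mini_batches
```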

Understanding Mini-Batch Gradient Descent

With batch gradient descent the cost should decrease on every iteration, whereas with mini-batch gradient descent the per-iteration cost is noisy: it trends downward overall but oscillates, because each iteration uses a different subset of the data and some mini-batches are harder to fit than others.
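To make that per-iteration cost concrete, here is a self-contained sketch of the mini-batch gradient descent loop, using plain logistic regression as a stand-in model (the model choice, the hyperparameter values, and the `random_mini_batches` helper from the previous sketch are all my assumptions, not the course's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mini_batch_gradient_descent(X, Y, learning_rate=0.01, num_epochs=10, mini_batch_size=1000):
    """Mini-batch gradient descent on logistic regression (a stand-in model to keep
    the sketch runnable; the same loop structure applies to a deep network).

    X: (n_x, m) inputs, Y: (1, m) labels in {0, 1}.
    """
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    costs = []                                        # one cost per iteration -> noisy curve
    for epoch in range(num_epochs):                   # one epoch = one full pass over the data
        mini_batches = random_mini_batches(X, Y, mini_batch_size, seed=epoch)
        for X_t, Y_t in mini_batches:                 # one iteration = one mini-batch
            m_t = X_t.shape[1]
            A = sigmoid(w.T @ X_t + b)                # forward pass on this mini-batch only
            cost = -np.mean(Y_t * np.log(A + 1e-8) + (1 - Y_t) * np.log(1 - A + 1e-8))
            dZ = A - Y_t                              # gradients from this mini-batch
            dw = (X_t @ dZ.T) / m_t
            db = np.sum(dZ) / m_t
            w -= learning_rate * dw                   # take a step before seeing the rest of the data
            b -= learning_rate * db
            costs.append(cost)                        # the curve trends down but oscillates
    return w, b, costs
```

Plotting `costs` from this loop reproduces the noisy-but-downward curve described above.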

Choosing the mini-batch size

  1. If the mini-batch size = m = len(training set): we get Batch Gradient Descent
    1. Each iteration takes too long
  2. If the mini-batch size = 1: we get Stochastic Gradient Descent; every example is its own mini-batch
    1. Stochastic gradient descent never actually converges: it oscillates and wanders around the region of the minimum, but it never heads to the minimum and stays there
    2. We lose the speedup from vectorization

In practice: we use a mini-batch size somewhere between 1 and m, neither too big nor too small. Typical choices are powers of two (64, 128, 256, 512), and each mini-batch should fit in CPU/GPU memory.