What if the training set is too large to process efficiently, even with a vectorized implementation?
Do we have to finish a pass over the entire dataset before gradient descent can take a single step? No.
It turns out you can get a faster algorithm if you let gradient descent start making progress before you finish processing the entire giant training set.
Separately, when the scales of the features are very different, normalizing the input data will speed up training.
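As a minimal sketch of that idea (the data and feature scales below are made up), one common way to normalize inputs is to standardize each feature to zero mean and unit variance:

```python
import numpy as np

# Hypothetical data: 4 examples, 2 features on very different scales.
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [0.5, 1500.0],
              [3.0, 4500.0]])

# Standardize each feature to zero mean and unit variance.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma

print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # ~[1, 1]
```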
Mini-batch notation
$X^{\lbrace l \rbrace}$ refers to mini-batch $l$ of the training data
Suppose we have a training set of 5 million examples and we split it into 5,000 mini-batches of 1,000 examples each.
A single pass through the entire training set is called an epoch.
Processing a single mini-batch (one gradient descent step) is called an iteration.
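A quick sketch of this bookkeeping (the `(n_x, m)` columns-as-examples layout and the small random stand-in matrix are assumptions for illustration):

```python
import numpy as np

m = 5_000_000       # training examples, as in the text
batch_size = 1_000  # examples per mini-batch
num_batches = m // batch_size
print(num_batches)  # 5000 iterations per epoch

# Slicing mini-batch l out of a data matrix of shape (n_x, m), where each
# column is one example; a small random stand-in is used here instead of
# the full 5-million-example matrix.
n_x = 3
X = np.random.randn(n_x, 10 * batch_size)
l = 2
X_batch = X[:, l * batch_size:(l + 1) * batch_size]  # this is X^{l}
print(X_batch.shape)  # (3, 1000)
```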
In practice: we use a mini-batch size somewhere between 1 and $m$, neither too big nor too small (a power of two such as 64, 128, 256, or 512 is a common choice).
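Putting it together, here is a minimal sketch of the training loop, using a made-up linear regression problem and a batch size of 128 (the dataset, model, and hyperparameters are illustrative assumptions, not from the text):

```python
import numpy as np

# Mini-batch gradient descent on a synthetic linear regression problem.
rng = np.random.default_rng(0)
m, n_x = 10_000, 3
X = rng.normal(size=(m, n_x))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=m)

w = np.zeros(n_x)
learning_rate = 0.1
batch_size = 128          # a typical "not too big, not too small" choice
epochs = 5

for epoch in range(epochs):                 # one epoch = one pass over the data
    perm = rng.permutation(m)               # shuffle examples each epoch
    X_shuf, y_shuf = X[perm], y[perm]
    for start in range(0, m, batch_size):   # one iteration per mini-batch
        xb = X_shuf[start:start + batch_size]
        yb = y_shuf[start:start + batch_size]
        grad = 2 * xb.T @ (xb @ w - yb) / len(yb)  # MSE gradient on this batch
        w -= learning_rate * grad           # parameters improve before the full pass ends
    print(f"epoch {epoch}: w = {w.round(3)}")
```

Each inner step updates the parameters using only one mini-batch, which is why gradient descent can start making progress long before it has seen all $m$ examples.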