Optimizers
#
MXNet.mx.AbstractLearningRateScheduler — Type.
AbstractLearningRateScheduler
Base type for all learning rate schedulers.
#
MXNet.mx.AbstractMomentumScheduler — Type.
AbstractMomentumScheduler
Base type for all momentum schedulers.
#
MXNet.mx.AbstractOptimizer — Type.
AbstractOptimizer
Base type for all optimizers.
#
MXNet.mx.AbstractOptimizerOptions — Type.
AbstractOptimizerOptions
Base type for all optimizer options.
#
MXNet.mx.OptimizationState — Type.
OptimizationState
Attributes:
- batch_size: The size of the mini-batch used in stochastic training.
- curr_epoch: The current epoch count. Epoch 0 means no training has happened yet. During the first pass through the data, the epoch count will be 1; during the second pass, it will be 2, and so on.
- curr_batch: The current mini-batch count. The batch count is reset during every epoch. The batch count 0 means the beginning of each epoch, with no mini-batch seen yet. During the first mini-batch, the mini-batch count will be 1.
- curr_iter: The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count does not reset in each epoch, so it tracks the total number of mini-batches seen so far.
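The relationship between the three counters can be illustrated with a small plain-Python sketch. This is not the MXNet API; `simulate_counters` is a hypothetical helper that only mimics the counting rules described above.

```python
def simulate_counters(n_epochs, batches_per_epoch):
    """Yield (curr_epoch, curr_batch, curr_iter) as each mini-batch is processed."""
    curr_iter = 0  # never reset: total number of mini-batches seen so far
    for curr_epoch in range(1, n_epochs + 1):  # epoch 0 would mean "no training yet"
        curr_batch = 0  # reset at the beginning of every epoch
        for _ in range(batches_per_epoch):
            curr_batch += 1
            curr_iter += 1
            yield curr_epoch, curr_batch, curr_iter


for state in simulate_counters(n_epochs=2, batches_per_epoch=3):
    print(state)
```

In the second epoch the batch count starts again at 1 while the iteration count continues from where the first epoch ended.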
#
MXNet.mx.get_learning_rate — Function.
get_learning_rate(scheduler, state)
Arguments
- scheduler::AbstractLearningRateScheduler: a learning rate scheduler.
- state::OptimizationState: the current state about epoch, mini-batch and iteration count.
Returns the current learning rate.
#
MXNet.mx.get_momentum — Function.
get_momentum(scheduler, state)
- scheduler::AbstractMomentumScheduler: the momentum scheduler.
- state::OptimizationState: the state about current epoch, mini-batch and iteration count.
Returns the current momentum.
#
MXNet.mx.get_updater — Method.
get_updater(optimizer)
A utility function that creates an updater function, which uses its closure to store all the state needed for each weight.
- optimizer::AbstractOptimizer: the underlying optimizer.
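The closure idea can be sketched in plain Python. The names `make_updater`, `step_fn`, and `init_state_fn`, as well as the `(index, grad, weight)` argument layout, are assumptions made for illustration and are not taken from the MXNet source.

```python
def make_updater(step_fn, init_state_fn):
    """Return an updater whose closure keeps one state object per weight index."""
    states = {}  # captured by the closure: weight index -> optimizer state

    def updater(index, grad, weight):
        if index not in states:
            # Lazily create the state the first time a weight is seen.
            states[index] = init_state_fn(weight)
        new_weight, states[index] = step_fn(weight, grad, states[index])
        return new_weight

    return updater


# Toy optimizer (hypothetical): the state is the number of updates seen so far,
# and the step size shrinks as lr / n.
def decaying_step(weight, grad, n_updates, lr=0.1):
    n_updates += 1
    return weight - (lr / n_updates) * grad, n_updates


updater = make_updater(decaying_step, init_state_fn=lambda w: 0)
w0 = updater(0, grad=1.0, weight=1.0)  # creates the state for index 0
w0 = updater(0, grad=1.0, weight=w0)   # reuses and advances that state
w1 = updater(1, grad=1.0, weight=1.0)  # index 1 gets its own, fresh state
```

The caller never handles optimizer state directly; it only passes the index, gradient, and weight on every update.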
#
MXNet.mx.normalized_gradient — Method.
normalized_gradient(opts, state, weight, grad)
- opts::AbstractOptimizerOptions: options for the optimizer, should contain the fields grad_clip and weight_decay.
- state::OptimizationState: the current optimization state.
- weight::NDArray: the trainable weights.
- grad::NDArray: the original gradient of the weights.
Get the properly normalized gradient (re-scaled and clipped if necessary).
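A minimal sketch of what "re-scaled and clipped" could look like, written in plain Python on lists of floats. The order of operations (divide by the batch size, clip, then add the weight-decay term) is an assumption made for this illustration, not a statement about the MXNet implementation.

```python
def normalized_gradient(grad, weight, batch_size, grad_clip=0.0, weight_decay=0.0):
    """Rescale, optionally clip, and add an L2 term to a gradient (illustrative only)."""
    # Assumed step 1: average the gradient over the mini-batch.
    g = [gi / batch_size for gi in grad]
    # Assumed step 2: clip element-wise into [-grad_clip, grad_clip] if enabled.
    if grad_clip > 0:
        g = [max(-grad_clip, min(grad_clip, gi)) for gi in g]
    # Assumed step 3: weight decay acts like the gradient of an L2 regularizer.
    return [gi + weight_decay * wi for gi, wi in zip(g, weight)]


print(normalized_gradient([10.0, -10.0, 1.0], [1.0, 1.0, 1.0],
                          batch_size=2, grad_clip=2.0, weight_decay=0.5))
```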
#
MXNet.mx.LearningRate.Exp — Type.
LearningRate.Exp
$\eta_t = \eta_0 \gamma^t$. Here $t$ is the epoch count, or the iteration count if decay_on_iteration is set to true.
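As a plain-Python illustration of the exponential schedule (the function name `exp_learning_rate` is hypothetical and not part of MXNet):

```python
def exp_learning_rate(lr0, gamma, t):
    """Exponential decay: the initial rate lr0 is multiplied by gamma once per step t."""
    return lr0 * gamma ** t


for t in range(4):
    print(t, exp_learning_rate(lr0=0.1, gamma=0.5, t=t))
```

With gamma = 0.5 the learning rate halves every step, whether a step is an epoch or an iteration.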
#
MXNet.mx.LearningRate.Fixed — Type.
LearningRate.Fixed
The fixed learning rate scheduler always returns the same learning rate.
#
MXNet.mx.LearningRate.Inv — Type.
LearningRate.Inv
$\eta_t = \eta_0 (1 + \gamma t)^{-power}$. Here $t$ is the epoch count, or the iteration count if decay_on_iteration is set to true.
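The inverse decay schedule can be sketched in plain Python as follows (the function name `inv_learning_rate` is hypothetical and not part of MXNet):

```python
def inv_learning_rate(lr0, gamma, power, t):
    """Inverse decay: lr0 * (1 + gamma * t) ** (-power)."""
    return lr0 * (1.0 + gamma * t) ** (-power)


for t in range(4):
    print(t, inv_learning_rate(lr0=0.1, gamma=1.0, power=1.0, t=t))
```

With gamma = 1 and power = 1 this reduces to lr0 / (1 + t), a polynomial rather than geometric decay.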
#
MXNet.mx.Momentum.Fixed — Type.
Momentum.Fixed
Fixed momentum scheduler always returns the same value.
#
MXNet.mx.Momentum.NadamScheduler — Type.
Momentum.NadamScheduler
Nesterov-accelerated adaptive momentum scheduler.
Description in "Incorporating Nesterov Momentum into Adam." http://cs229.stanford.edu/proj2015/054_report.pdf
$\mu_t = \mu_0 (1 - \gamma \alpha^{t \delta})$. Here
- $t$ is the iteration count
- $\delta$: default 0.004, the scheduler decay
- $\gamma$: default 0.5
- $\alpha$: default 0.96
- $\mu_0$: default 0.99
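To get a feel for the schedule, the formula can be evaluated directly in plain Python. The function name `nadam_momentum` is hypothetical; only the formula and the default values come from the description above.

```python
def nadam_momentum(t, mu0=0.99, delta=0.004, gamma=0.5, alpha=0.96):
    """Momentum at iteration t: mu0 * (1 - gamma * alpha ** (t * delta))."""
    return mu0 * (1.0 - gamma * alpha ** (t * delta))


for t in (0, 1000, 10000, 100000):
    print(t, nadam_momentum(t))
```

With the defaults, the momentum starts at half of $\mu_0$ and slowly rises toward $\mu_0$ as the iteration count grows.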
#
MXNet.mx.Momentum.Null — Type.
Momentum.Null
The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.
Built-in optimizers
Stochastic Gradient Descent
#
MXNet.mx.SGD — Type.
SGD
Stochastic gradient descent optimizer.
SGD(; kwargs...)
Arguments:
- lr::Real: default 0.01, learning rate.
- lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, it overrides the lr parameter.
- momentum::Real: default 0.0, the momentum.
- momentum_scheduler::AbstractMomentumScheduler: default nothing, a dynamic momentum scheduler. If set, it overrides the momentum parameter.
- grad_clip::Real: default 0; if positive, clips the gradient into the bounded range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.0001; weight decay is equivalent to adding a global l2 regularizer to the parameters.
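How lr and momentum interact can be sketched with the common "velocity" formulation of SGD with momentum, shown here in plain Python on a scalar. The function name `sgd_step` is hypothetical, and this is an illustration of the textbook rule under that assumption, not a copy of the MXNet implementation.

```python
def sgd_step(weight, grad, velocity, lr=0.01, momentum=0.0):
    """One SGD step; with momentum=0.0 this is plain gradient descent."""
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity


# Minimize f(w) = w^2, whose gradient is 2w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_step(w, grad=2.0 * w, velocity=v, lr=0.1, momentum=0.5)
print(w)
```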
ADAM
#
MXNet.mx.ADAM — Type.
ADAM
The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].
ADAM(; kwargs...)
- lr::Real: default 0.001, learning rate.
- lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, it overrides the lr parameter.
- beta1::Real: default 0.9.
- beta2::Real: default 0.999.
- epsilon::Real: default 1e-8.
- grad_clip::Real: default 0; if positive, clips the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001; weight decay is equivalent to adding a global l2 regularizer for all the parameters.
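The roles of beta1, beta2, and epsilon can be seen in a scalar sketch of the update rule from the Adam paper, written in plain Python. The function name `adam_step` is hypothetical, and the MXNet implementation may differ in details such as where epsilon is added.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step on a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v


w, m, v = adam_step(w=1.0, g=3.0, m=0.0, v=0.0, t=1)
print(w)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly lr regardless of the gradient's magnitude.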
AdaGrad
#
MXNet.mx.AdaGrad — Type.
AdaGrad
Scale learning rates by dividing by the square root of accumulated squared gradients. See [1] for further description.
AdaGrad(; kwargs...)
Attributes
- lr::Real: default 0.1, the learning rate controlling the size of update steps
- epsilon::Real: default 1e-6, small value added for numerical stability
- grad_clip::Real: default 0; if positive, clips the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001; weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
Using step size lr, AdaGrad calculates the learning rate for feature i at time step t as: $\eta_{t,i} = \frac{lr}{\sqrt{\sum^t_{t^\prime} g^2_{t^\prime,i} + \epsilon}} g_{t,i}$; as such, the learning rate is monotonically decreasing. Epsilon is not included in the typical formula, see [2].
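The formula above can be traced with a scalar sketch in plain Python. The function name `adagrad_step` is hypothetical; the code only follows the formula as written, including the placement of epsilon inside the square root.

```python
import math


def adagrad_step(w, g, accum, lr=0.1, epsilon=1e-6):
    """One AdaGrad step on a scalar weight; accum is the running sum of squared gradients."""
    accum = accum + g * g
    w = w - lr * g / math.sqrt(accum + epsilon)
    return w, accum


w, accum = 1.0, 0.0
for step in range(3):
    w_new, accum = adagrad_step(w, g=2.0, accum=accum)
    print(step, w - w_new)  # the step size shrinks even though the gradient is constant
    w = w_new
```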
References
- [1]: Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.
- [2]: Chris Dyer: Notes on AdaGrad. http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf
AdaDelta
#
MXNet.mx.AdaDelta — Type.
AdaDelta
Scale learning rates by the ratio of accumulated gradients to accumulated updates, see [1] and notes for further description.
AdaDelta(; kwargs...)
Attributes
- lr::Real: default 1.0, the learning rate controlling the size of update steps
- rho::Real: default 0.9, squared gradient moving average decay factor
- epsilon::Real: default 1e-6, small value added for numerical stability
- grad_clip::Real: default 0; if positive, clips the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001; weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
rho should be between 0 and 1. A value of rho close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.
rho = 0.95 and epsilon = 1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech). In the paper, no learning rate is considered (so lr = 1.0). Probably best to keep it at this value.
epsilon is important for the very first update (so the numerator does not become 0).
Using the step size lr (written $\eta$ below) and a decay factor rho, the learning rate is calculated as:
$$
\begin{aligned}
r_t &= \rho r_{t-1} + (1 - \rho) g^2 \\
\eta_t &= \eta \frac{\sqrt{s_{t-1} + \epsilon}}{\sqrt{r_t + \epsilon}} \\
s_t &= \rho s_{t-1} + (1 - \rho) (\eta_t g)^2
\end{aligned}
$$
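The three recurrences can be followed step by step in a scalar plain-Python sketch. The function name `adadelta_step` is hypothetical; `r` and `s` correspond to the two moving averages in the formulas, and `lr` plays the role of the global step size.

```python
import math


def adadelta_step(w, g, r, s, lr=1.0, rho=0.9, epsilon=1e-6):
    """One AdaDelta step on a scalar weight."""
    r = rho * r + (1 - rho) * g * g                              # average of squared gradients
    eta_t = lr * math.sqrt(s + epsilon) / math.sqrt(r + epsilon)  # uses s from the previous step
    s = rho * s + (1 - rho) * (eta_t * g) ** 2                   # average of squared updates
    return w - eta_t * g, r, s


w, r, s = adadelta_step(w=1.0, g=1.0, r=0.0, s=0.0)
print(w, r, s)
```

On the first step s is still 0, so the numerator is sqrt(epsilon); this is why epsilon matters for the very first update.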
References
- [1]: Zeiler, M. D. (2012): ADADELTA: An Adaptive Learning Rate Method. arXiv Preprint arXiv:1212.5701.
AdaMax
#
MXNet.mx.AdaMax — Type.
AdaMax
This is a variant of the Adam algorithm based on the infinity norm. See [1] for further description.
AdaMax(; kwargs...)
Attributes
- lr::Real: default 0.002, the learning rate controlling the size of update steps
- beta1::Real: default 0.9, exponential decay rate for the first moment estimates
- beta2::Real: default 0.999, exponential decay rate for the weighted infinity norm estimates
- epsilon::Real: default 1e-8, small value added for numerical stability
- grad_clip::Real: default 0; if positive, clips the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001; weight decay is equivalent to adding a global l2 regularizer for all the parameters.
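A scalar sketch of the AdaMax rule from the Adam paper, in plain Python, shows how the infinity norm replaces the second-moment estimate. The function name `adamax_step` is hypothetical, and adding epsilon to the denominator is an assumption made here for numerical stability; the paper's formula has no epsilon.

```python
def adamax_step(w, g, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One AdaMax step on a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g   # first-moment estimate, as in Adam
    u = max(beta2 * u, abs(g))        # exponentially weighted infinity norm
    # Assumption: epsilon guards against division by zero when u == 0.
    w = w - (lr / (1 - beta1 ** t)) * m / (u + epsilon)
    return w, m, u


w, m, u = adamax_step(w=1.0, g=-4.0, m=0.0, u=0.0, t=1)
print(w, m, u)
```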
References
- [1]: Kingma, Diederik, and Jimmy Ba (2014): Adam: A Method for Stochastic Optimization. http://arxiv.org/abs/1412.6980v8.
RMSProp
#
MXNet.mx.RMSProp — Type.
RMSProp
Scale learning rates by dividing by the moving average of the root mean squared (RMS) gradients. See [1] for further description.
RMSProp(; kwargs...)
Attributes
- lr::Real: default 0.1, the learning rate controlling the size of update steps
- rho::Real: default 0.9, gradient moving average decay factor
- epsilon::Real: default 1e-6, small value added for numerical stability
- grad_clip::Real: default 0; if positive, clips the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001; weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
rho should be between 0 and 1. A value of rho close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.
Using the step size $lr$ and a decay factor $\rho$, the learning rate $\eta_t$ is calculated as:
$$
\begin{aligned}
r_t &= \rho r_{t-1} + (1 - \rho) g^2 \\
\eta_t &= \frac{lr}{\sqrt{r_t + \epsilon}}
\end{aligned}
$$
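A scalar plain-Python sketch of these two equations follows. The function name `rmsprop_step` is hypothetical, and applying the update as `w - eta_t * g` is the usual convention, assumed here because the formulas above only define the learning rate.

```python
import math


def rmsprop_step(w, g, r, lr=0.1, rho=0.9, epsilon=1e-6):
    """One RMSProp step on a scalar weight; r is the moving average of squared gradients."""
    r = rho * r + (1 - rho) * g * g
    eta_t = lr / math.sqrt(r + epsilon)
    return w - eta_t * g, r  # assumed update: step along the negative gradient


w, r = rmsprop_step(w=1.0, g=2.0, r=0.0)
print(w, r)
```

Unlike AdaGrad's running sum, the moving average can shrink again when gradients become small, so the learning rate is not forced to decrease monotonically.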
References
- [1]: Tieleman, T. and Hinton, G. (2012): Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. Coursera. http://www.youtube.com/watch?v=O3sxAc4hxZU (formula @5:20)
Nadam
#
MXNet.mx.Nadam — Type.
Nadam
Nesterov Adam optimizer: Adam (RMSprop with momentum) using Nesterov momentum, see [1] and notes for further description.
Nadam(; kwargs...)
Attributes
- lr::Real: default 0.001, learning rate.
- beta1::Real: default 0.99.
- beta2::Real: default 0.999.
- epsilon::Real: default 1e-8, small value added for numerical stability
- grad_clip::Real: default 0; if positive, clips the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001; weight decay is equivalent to adding a global l2 regularizer for all the parameters.
- lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, it overrides the lr parameter.
- momentum_scheduler::AbstractMomentumScheduler: default NadamScheduler, of the form $\mu_t = beta1 * (1 - 0.5 * 0.96^{t * 0.004})$
Notes
Default parameters follow those provided in the paper. It is recommended to leave the parameters of this optimizer at their default values.
References
- [1]: Incorporating Nesterov Momentum into Adam. http://cs229.stanford.edu/proj2015/054_report.pdf
- [2]: On the importance of initialization and momentum in deep learning http://www.cs.toronto.edu/~fritz/absps/momentum.pdf