Optimizers
MXNet.mx.AbstractLearningRateScheduler
— Type.
AbstractLearningRateScheduler
Base type for all learning rate schedulers.
MXNet.mx.AbstractMomentumScheduler
— Type.
AbstractMomentumScheduler
Base type for all momentum schedulers.
MXNet.mx.AbstractOptimizer
— Type.
AbstractOptimizer
Base type for all optimizers.
MXNet.mx.AbstractOptimizerOptions
— Type.
AbstractOptimizerOptions
Base class for all optimizer options.
MXNet.mx.OptimizationState
— Type.
OptimizationState
Attributes:
- batch_size: The size of the mini-batch used in stochastic training.
- curr_epoch: The current epoch count. Epoch 0 means no training yet; during the first pass through the data, the epoch count will be 1; during the second pass, it will be 2, and so on.
- curr_batch: The current mini-batch count. The batch count is reset at every epoch. A batch count of 0 means the beginning of an epoch, with no mini-batch seen yet; during the first mini-batch, the count will be 1.
- curr_iter: The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count is not reset in each epoch, so it tracks the total number of mini-batches seen so far.
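To make the counter semantics concrete, here is a small illustrative sketch (plain Julia, not library code) of how the three counters evolve over two epochs of three mini-batches each:

```julia
# Illustrative only: how curr_epoch, curr_batch and curr_iter relate.
let curr_epoch = 0, curr_batch = 0, curr_iter = 0
    for epoch in 1:2
        curr_epoch = epoch
        curr_batch = 0            # resets at the start of every epoch
        for batch in 1:3
            curr_batch += 1
            curr_iter  += 1       # never resets; counts all mini-batches
        end
        println("epoch=$curr_epoch batch=$curr_batch iter=$curr_iter")
    end
end
# epoch=1 batch=3 iter=3
# epoch=2 batch=3 iter=6
```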
MXNet.mx.get_learning_rate
— Function.
get_learning_rate(scheduler, state)
Arguments
- scheduler::AbstractLearningRateScheduler: a learning rate scheduler.
- state::OptimizationState: the current state about epoch, mini-batch and iteration count.
Returns the current learning rate.
MXNet.mx.get_momentum
— Function.
get_momentum(scheduler, state)
Arguments
- scheduler::AbstractMomentumScheduler: the momentum scheduler.
- state::OptimizationState: the state about current epoch, mini-batch and iteration count.
Returns the current momentum.
MXNet.mx.get_updater
— Method.
get_updater(optimizer)
A utility function to create an updater function that uses its closure to store all the state needed for each weight.
Arguments
- optimizer::AbstractOptimizer: the underlying optimizer.
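For example, a hedged usage sketch; the call order of the returned closure, (index, grad, weight), is assumed here from the usual MXNet updater convention, so check the MXNet.jl source for your version:

```julia
using MXNet

opt     = mx.SGD(lr=0.01, momentum=0.9)
updater = mx.get_updater(opt)

# The closure lazily creates and stores per-parameter state (here: momentum)
# keyed by the integer index, then applies one update step in place.
w = mx.zeros(2, 3)
g = mx.ones(2, 3)
updater(1, g, w)   # assumed call order: (index, grad, weight)
```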
MXNet.mx.normalized_gradient
— Method.
normalized_gradient(opts, state, weight, grad)
Arguments
- opts::AbstractOptimizerOptions: options for the optimizer; should contain the fields grad_clip and weight_decay.
- state::OptimizationState: the current optimization state.
- weight::NDArray: the trainable weights.
- grad::NDArray: the original gradient of the weights.
Returns the properly normalized gradient (re-scaled and clipped if necessary).
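The steps are roughly: rescale by the mini-batch size, clip if grad_clip is positive, and add the weight-decay term. A plain-Julia sketch of that arithmetic (illustrative only; the library operates on NDArrays):

```julia
# Hedged sketch of the usual normalization steps, on plain arrays.
function normalized_gradient_sketch(grad, weight; batch_size=32,
                                    grad_clip=0.0, weight_decay=0.0001)
    g = grad ./ batch_size                    # average over the mini-batch
    if grad_clip > 0
        g = clamp.(g, -grad_clip, grad_clip)  # clip into [-grad_clip, grad_clip]
    end
    return g .+ weight_decay .* weight        # global L2 regularizer term
end
```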
MXNet.mx.LearningRate.Exp
— Type.
LearningRate.Exp
$\eta_t = \eta_0 \gamma^t$. Here $t$ is the epoch count, or the iteration count if decay_on_iteration is set to true.
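For illustration, the decay rule in plain Julia (example values, not library calls):

```julia
η0, γ = 0.1, 0.9
η(t) = η0 * γ^t
η(0), η(10)   # (0.1, ≈0.0349)
```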
MXNet.mx.LearningRate.Fixed
— Type.
LearningRate.Fixed
A fixed learning rate scheduler that always returns the same learning rate.
MXNet.mx.LearningRate.Inv
— Type.
LearningRate.Inv
$\eta_t = \eta_0 (1 + \gamma t)^{-\mathrm{power}}$. Here $t$ is the epoch count, or the iteration count if decay_on_iteration is set to true.
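Again in plain Julia for illustration (example values, not library calls):

```julia
η0, γ, power = 0.1, 0.9, 0.5
η(t) = η0 * (1 + γ * t)^(-power)
η(0), η(10)   # (0.1, ≈0.0316)
```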
MXNet.mx.Momentum.Fixed
— Type.
Momentum.Fixed
A fixed momentum scheduler that always returns the same value.
MXNet.mx.Momentum.NadamScheduler
— Type.
Momentum.NadamScheduler
Nesterov-accelerated adaptive momentum scheduler.
Description in "Incorporating Nesterov Momentum into Adam." http://cs229.stanford.edu/proj2015/054_report.pdf
$\mu_t = \mu_0 (1 - \gamma \alpha^{t \delta})$. Here
- $t$ is the iteration count
- $\delta$: default 0.004, the scheduler decay
- $\gamma$: default 0.5
- $\alpha$: default 0.96
- $\mu_0$: default 0.99
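Plugging in the defaults, the schedule in plain Julia (illustrative only):

```julia
μ0, γ, α, δ = 0.99, 0.5, 0.96, 0.004
μ(t) = μ0 * (1 - γ * α^(t * δ))
μ(1), μ(1000)   # ≈ (0.495, 0.570)
```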
MXNet.mx.Momentum.Null
— Type.
Momentum.Null
The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.
Built-in optimizers
Stochastic Gradient Descent
MXNet.mx.SGD
— Type.
SGD
Stochastic gradient descent optimizer.
SGD(; kwargs...)
Arguments:
- lr::Real: default 0.01, learning rate.
- lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
- momentum::Real: default 0.0, the momentum.
- momentum_scheduler::AbstractMomentumScheduler: default nothing, a dynamic momentum scheduler. If set, will overwrite the momentum parameter.
- grad_clip::Real: default 0, if positive, will clip the gradient into the bounded range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.0001, weight decay is equivalent to adding a global l2 regularizer to the parameters.
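A brief usage sketch with the keywords listed above; passing the optimizer when fitting a FeedForward model is the usual entry point, though the fit call shown in the comment is illustrative and may differ between MXNet.jl versions:

```julia
using MXNet

opt = mx.SGD(lr=0.01, momentum=0.9, weight_decay=0.0001, grad_clip=5.0)

# Illustrative training call (check your MXNet.jl version for the exact signature):
# mx.fit(model, opt, train_provider, n_epoch=10)
```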
ADAM
MXNet.mx.ADAM
— Type.
ADAM
The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].
ADAM(; kwargs...)
Arguments:
- lr::Real: default 0.001, learning rate.
- lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
- beta1::Real: default 0.9.
- beta2::Real: default 0.999.
- epsilon::Real: default 1e-8.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
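Constructed like the other optimizers, using only the keywords listed above (the values shown simply make the documented defaults explicit):

```julia
using MXNet

opt = mx.ADAM(lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)
```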
AdaGrad
MXNet.mx.AdaGrad
— Type.
AdaGrad
Scale learning rates by dividing with the square root of accumulated squared gradients. See [1] for further description.
AdaGrad(; kwargs...)
Attributes
- lr::Real: default 0.1, the learning rate controlling the size of update steps.
- epsilon::Real: default 1e-6, small value added for numerical stability.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
Using step size lr, AdaGrad calculates the learning rate for feature $i$ at time step $t$ as $\eta_{t,i} = \frac{lr}{\sqrt{\sum^t_{t^\prime} g^2_{t^\prime,i} + \epsilon}} g_{t,i}$, such that the learning rate is monotonically decreasing. Epsilon is not included in the typical formula, see [2].
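A plain-array sketch of the accumulation behind this rule (illustrative only, not the library implementation):

```julia
lr, ϵ = 0.1, 1e-6
acc = zeros(3)                         # running sum of squared gradients per coordinate
function adagrad_step!(w, g)
    acc .+= g .^ 2
    w  .-= lr .* g ./ sqrt.(acc .+ ϵ)  # step size shrinks monotonically
    return w
end
```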
References
- [1]: Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.
- [2]: Chris Dyer: Notes on AdaGrad. http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf
AdaDelta
MXNet.mx.AdaDelta
— Type.
AdaDelta
Scale learning rates by the ratio of accumulated gradients to accumulated updates, see [1] and notes for further description.
AdaDelta(; kwargs...)
Attributes
- lr::Real: default 1.0, the learning rate controlling the size of update steps.
- rho::Real: default 0.9, squared gradient moving average decay factor.
- epsilon::Real: default 1e-6, small value added for numerical stability.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
rho should be between 0 and 1. A value of rho close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.
rho = 0.95 and epsilon = 1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech). In the paper, no learning rate is considered (so lr = 1.0). Probably best to keep it at this value.
epsilon is important for the very first update (so the numerator does not become 0).
Using the step size $\eta$ (the lr parameter) and a decay factor $\rho$ (rho), the learning rate is calculated as:
$r_t = \rho r_{t-1} + (1 - \rho) g^2$
$\eta_t = \eta \frac{\sqrt{s_{t-1} + \epsilon}}{\sqrt{r_t + \epsilon}}$
$s_t = \rho s_{t-1} + (1 - \rho) (\eta_t g)^2$
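A plain-array sketch of one step following the three formulas above (illustrative only, not the library implementation):

```julia
ρ, ϵ, lr = 0.95, 1e-6, 1.0
r = zeros(3)   # accumulated squared gradients
s = zeros(3)   # accumulated squared (scaled) updates
function adadelta_step!(w, g)
    r .= ρ .* r .+ (1 - ρ) .* g .^ 2
    Δ  = lr .* sqrt.(s .+ ϵ) ./ sqrt.(r .+ ϵ) .* g
    s .= ρ .* s .+ (1 - ρ) .* Δ .^ 2
    w .-= Δ
    return w
end
```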
References
- [1]: Zeiler, M. D. (2012): ADADELTA: An Adaptive Learning Rate Method. arXiv Preprint arXiv:1212.5701.
AdaMax
MXNet.mx.AdaMax
— Type.
AdaMax
This is a variant of the Adam algorithm based on the infinity norm. See [1] for further description.
AdaMax(; kwargs...)
Attributes
- lr::Real: default 0.002, the learning rate controlling the size of update steps.
- beta1::Real: default 0.9, exponential decay rate for the first moment estimates.
- beta2::Real: default 0.999, exponential decay rate for the weighted infinity norm estimates.
- epsilon::Real: default 1e-8, small value added for numerical stability.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
References
- [1]: Kingma, Diederik, and Jimmy Ba (2014): Adam: A Method for Stochastic Optimization. http://arxiv.org/abs/1412.6980v8.
RMSProp
MXNet.mx.RMSProp
— Type.
RMSProp
Scale learning rates by dividing with the moving average of the root mean squared (RMS) gradients. See [1] for further description.
RMSProp(; kwargs...)
Attributes
- lr::Real: default 0.1, the learning rate controlling the size of update steps.
- rho::Real: default 0.9, gradient moving average decay factor.
- epsilon::Real: default 1e-6, small value added for numerical stability.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
rho should be between 0 and 1. A value of rho close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.
Using the step size $lr$ and a decay factor $\rho$, the learning rate $\eta_t$ is calculated as:
$r_t = \rho r_{t-1} + (1 - \rho) g^2$
$\eta_t = \frac{lr}{\sqrt{r_t + \epsilon}}$
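A plain-array sketch of one step following the two formulas above (illustrative only, not the library implementation):

```julia
lr, ρ, ϵ = 0.1, 0.9, 1e-6
r = zeros(3)                        # moving average of squared gradients
function rmsprop_step!(w, g)
    r .= ρ .* r .+ (1 - ρ) .* g .^ 2
    w .-= lr ./ sqrt.(r .+ ϵ) .* g  # η_t = lr / sqrt(r_t + ϵ)
    return w
end
```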
References
- [1]: Tieleman, T. and Hinton, G. (2012): Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. Coursera. http://www.youtube.com/watch?v=O3sxAc4hxZU (formula @5:20)
Nadam
MXNet.mx.Nadam
— Type.
Nadam
Nesterov Adam optimizer: Adam with Nesterov momentum; see [1] and the notes below for further description.
Nadam(; kwargs...)
Attributes
- lr::Real: default 0.001, learning rate.
- beta1::Real: default 0.99.
- beta2::Real: default 0.999.
- epsilon::Real: default 1e-8, small value added for numerical stability.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
- lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
- momentum_scheduler::AbstractMomentumScheduler: default NadamScheduler of the form $\mu_t = \beta_1 (1 - 0.5 \times 0.96^{t \times 0.004})$.
Notes
Default parameters follow those provided in the paper. It is recommended to leave the parameters of this optimizer at their default values.
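Accordingly, the plain constructor call with no keywords is the typical usage:

```julia
using MXNet

opt = mx.Nadam()   # defaults follow the paper, as recommended above
```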
References
- [1]: Incorporating Nesterov Momentum into Adam. http://cs229.stanford.edu/proj2015/054_report.pdf
- [2]: On the importance of initialization and momentum in deep learning http://www.cs.toronto.edu/~fritz/absps/momentum.pdf