Optimizers

# MXNet.mx.AbstractLearningRateSchedulerType.

AbstractLearningRateScheduler

Base type for all learning rate schedulers.

source

# MXNet.mx.AbstractMomentumSchedulerType.

AbstractMomentumScheduler

Base type for all momentum schedulers.

source

# MXNet.mx.AbstractOptimizerType.

AbstractOptimizer

Base type for all optimizers.

source

# MXNet.mx.AbstractOptimizerOptionsType.

AbstractOptimizerOptions

Base type for all optimizer options.

source

# MXNet.mx.OptimizationStateType.

OptimizationState

Attributes:

  • batch_size: The size of the mini-batch used in stochastic training.
  • curr_epoch: The current epoch count. Epoch 0 means no training has happened yet; during the first pass through the data the epoch count is 1, during the second pass it is 2, and so on.
  • curr_batch: The current mini-batch count. The batch count resets at the start of every epoch: a count of 0 means no mini-batch has been seen yet in this epoch, and it becomes 1 during the first mini-batch.
  • curr_iter: The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count does not reset between epochs, so it tracks the total number of mini-batches seen so far (see the sketch below).
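
A minimal sketch (plain Julia, illustrative only) of how the three counters relate over two epochs of three mini-batches each:

```julia
# Illustrative only: epoch and batch counters reset, the iteration counter does not.
let curr_iter = 0
    for curr_epoch in 1:2
        curr_batch = 0                 # resets at the start of every epoch
        for _ in 1:3                   # three mini-batches per epoch
            curr_batch += 1
            curr_iter  += 1            # never resets
        end
        println("epoch $curr_epoch: batches = $curr_batch, total iterations = $curr_iter")
    end
end
```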

source

# MXNet.mx.get_learning_rateFunction.

get_learning_rate(scheduler, state)

Arguments

  • scheduler::AbstractLearningRateScheduler: a learning rate scheduler.
  • state::OptimizationState: the current state about epoch, mini-batch and iteration count.

Returns the current learning rate.

source

# MXNet.mx.get_momentumFunction.

get_momentum(scheduler, state)
  • scheduler::AbstractMomentumScheduler: the momentum scheduler.
  • state::OptimizationState: the state about current epoch, mini-batch and iteration count.

Returns the current momentum.

source

# MXNet.mx.get_updaterMethod.

get_updater(optimizer)

A utility function that creates an updater function, which uses a closure to store the state needed for each weight.

  • optimizer::AbstractOptimizer: the underlying optimizer.
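
A hedged usage sketch; the per-call signature `updater(index, grad, weight)` and the manual state initialization are assumptions, since in normal training the model sets up the optimizer state itself:

```julia
using MXNet

opt = mx.SGD(lr = 0.01, momentum = 0.9)
opt.state = mx.OptimizationState(64)   # assumption: the batch size is normally set by the training loop
updater = mx.get_updater(opt)

weight = mx.ones(2, 3)                 # a trainable parameter
grad   = mx.ones(2, 3)                 # its gradient for the current mini-batch
updater(1, grad, weight)               # update parameter #1 in place, keeping its SGD state in the closure
```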

source

# MXNet.mx.normalized_gradientMethod.

normalized_gradient(opts, state, weight, grad)

Get the properly normalized gradient (re-scaled and clipped if necessary).

  • opts::AbstractOptimizerOptions: options for the optimizer; should contain the fields grad_clip and weight_decay.
  • state::OptimizationState: the current optimization state.
  • weight::NDArray: the trainable weights.
  • grad::NDArray: the original gradient of the weights.
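
An illustrative plain-Julia sketch of the normalization steps described above (re-scaling by the batch size, optional clipping, and the weight-decay term); this is a reading aid, not the NDArray implementation, and the exact order of operations inside the library may differ:

```julia
function normalized_gradient_sketch(grad, weight; batch_size = 1, grad_clip = 0.0, weight_decay = 0.0)
    g = grad ./ batch_size                    # re-scale to a per-example gradient
    if grad_clip > 0
        g = clamp.(g, -grad_clip, grad_clip)  # bound each entry to [-grad_clip, grad_clip]
    end
    return g .+ weight_decay .* weight        # L2 regularization enters as a gradient term
end
```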

source

# MXNet.mx.LearningRate.ExpType.

LearningRate.Exp

$\eta_t = \eta_0 \gamma^t$. Here $t$ is the epoch count, or the iteration count if decay_on_iteration is set to true.
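
For example, with illustrative values $\eta_0 = 0.1$ and $\gamma = 0.9$:

```julia
η₀, γ = 0.1, 0.9
[η₀ * γ^t for t in 0:4]    # 0.1, 0.09, 0.081, 0.0729, 0.06561
```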

source

# MXNet.mx.LearningRate.FixedType.

LearningRate.Fixed

Fixed learning rate scheduler; always returns the same learning rate.

source

# MXNet.mx.LearningRate.InvType.

LearningRate.Inv

$\eta_t = \eta_0 (1 + \gamma t)^{-\mathrm{power}}$. Here $t$ is the epoch count, or the iteration count if decay_on_iteration is set to true.
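
For example, with illustrative values $\eta_0 = 0.1$, $\gamma = 0.1$ and power $= 0.5$:

```julia
η₀, γ, p = 0.1, 0.1, 0.5
[η₀ * (1 + γ * t)^(-p) for t in 0:4]    # polynomial decay, slower than the exponential schedule above
```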

source

# MXNet.mx.Momentum.FixedType.

Momentum.Fixed

Fixed momentum scheduler always returns the same value.

source

# MXNet.mx.Momentum.NadamSchedulerType.

Momentum.NadamScheduler

Nesterov-accelerated adaptive momentum scheduler.

Description in "Incorporating Nesterov Momentum into Adam." http://cs229.stanford.edu/proj2015/054_report.pdf

$\mu_t = \mu_0 (1 - \gamma \alpha^{t \cdot \delta})$. Here

  • $t$ is the iteration count,
  • $\delta$: default 0.004, the scheduler decay,
  • $\gamma$: default 0.5,
  • $\alpha$: default 0.96,
  • $\mu_0$: default 0.99.
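
A quick check of the schedule with the documented defaults:

```julia
μ₀, γ, α, δ = 0.99, 0.5, 0.96, 0.004
μ(t) = μ₀ * (1 - γ * α^(t * δ))
μ(1), μ(1000)    # starts near 0.495 and increases slowly toward μ₀ as t grows
```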

source

# MXNet.mx.Momentum.NullType.

Momentum.Null

The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.

source

Built-in optimizers

Stochastic Gradient Descent

# MXNet.mx.SGDType.

SGD

Stochastic gradient descent optimizer.

SGD(; kwargs...)

Arguments:

  • lr::Real: default 0.01, learning rate.
  • lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
  • momentum::Real: default 0.0, the momentum.
  • momentum_scheduler::AbstractMomentumScheduler: default nothing, a dynamic momentum scheduler. If set, will overwrite the momentum parameter.
  • grad_clip::Real: default 0, if positive, will clip the gradient into the bounded range [-grad_clip, grad_clip].
  • weight_decay::Real: default 0.0001, weight decay is equivalent to adding a global l2 regularizer to the parameters.
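
A hedged construction sketch using the keyword arguments listed above; the mx.fit call in the comment, along with the model and train_provider names, assumes the usual MXNet.jl training entry point:

```julia
using MXNet

opt = mx.SGD(lr = 0.01, momentum = 0.9, weight_decay = 0.0001, grad_clip = 5.0)

# The optimizer is then handed to the training loop, typically something like
#   mx.fit(model, opt, train_provider, n_epoch = 10)
```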

source

ADAM

# MXNet.mx.ADAMType.

ADAM

The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].

ADAM(; kwargs...)
  • lr::Real: default 0.001, learning rate.
  • lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
  • beta1::Real: default 0.9.
  • beta2::Real: default 0.999.
  • epsilon::Real: default 1e-8.
  • grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
  • weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
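
The standard Adam update from the cited paper, sketched in plain Julia as a reading aid; the library implementation may differ in details such as where bias correction is applied:

```julia
function adam_step!(w, m, v, g, t; lr = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    m .= β1 .* m .+ (1 - β1) .* g           # first-moment estimate
    v .= β2 .* v .+ (1 - β2) .* g .^ 2      # second-moment estimate
    m̂ = m ./ (1 - β1^t)                     # bias correction
    v̂ = v ./ (1 - β2^t)
    w .-= lr .* m̂ ./ (sqrt.(v̂) .+ ϵ)
    return w
end
```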

source

AdaGrad

# MXNet.mx.AdaGradType.

AdaGrad

Scale learning rates by dividing with the square root of accumulated squared gradients. See [1] for further description.

AdaGrad(; kwargs...)

Attributes

  • lr::Real: default 0.1, the learning rate controlling the size of update steps
  • epsilon::Real: default 1e-6, small value added for numerical stability
  • grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
  • weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.

Notes

Using step size lr, AdaGrad calculates the learning rate for feature i at time step t as $\eta_{t,i} = \frac{lr}{\sqrt{\sum^t_{t^\prime} g^2_{t^\prime,i} + \epsilon}} g_{t,i}$, so the learning rate is monotonically decreasing. Epsilon is not included in the typical formula, see [2].
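
A plain-Julia reading aid for this rule (not the NDArray implementation):

```julia
function adagrad_step!(w, acc, g; lr = 0.1, ϵ = 1e-6)
    acc .+= g .^ 2                       # accumulate squared gradients per coordinate
    w .-= lr .* g ./ sqrt.(acc .+ ϵ)     # per-coordinate step shrinks monotonically
    return w
end
```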

References

  • [1]: Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.
  • [2]: Chris Dyer: Notes on AdaGrad. http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf

source

AdaDelta

# MXNet.mx.AdaDeltaType.

AdaDelta

Scale learning rates by the ratio of accumulated gradients to accumulated updates, see [1] and notes for further description.

AdaDelta(; kwargs...)

Attributes

  • lr::Real: default 1.0, the learning rate controlling the size of update steps
  • rho::Real: default 0.9, squared gradient moving average decay factor
  • epsilon::Real: default 1e-6, small value added for numerical stability
  • grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
  • weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.

Notes

rho should be between 0 and 1. A value of rho close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.

rho = 0.95 and epsilon = 1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech). In the paper, no learning rate is considered (so lr = 1.0). Probably best to keep it at this value.

epsilon is important for the very first update (so the numerator does not become 0).

Using the step size lr and a decay factor rho, the learning rate is calculated as:

$$r_t = \rho r_{t-1} + (1 - \rho) g^2$$
$$\eta_t = \eta \frac{\sqrt{s_{t-1} + \epsilon}}{\sqrt{r_t + \epsilon}}$$
$$s_t = \rho s_{t-1} + (1 - \rho) (\eta_t g)^2$$
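
The same recurrences in plain Julia, as a reading aid:

```julia
function adadelta_step!(w, r, s, g; lr = 1.0, ρ = 0.9, ϵ = 1e-6)
    r .= ρ .* r .+ (1 - ρ) .* g .^ 2               # r_t: accumulated squared gradients
    Δ = lr .* sqrt.(s .+ ϵ) ./ sqrt.(r .+ ϵ) .* g  # η_t * g from the formula above
    s .= ρ .* s .+ (1 - ρ) .* Δ .^ 2               # s_t: accumulated squared updates
    w .-= Δ
    return w
end
```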

References

  • [1]: Zeiler, M. D. (2012): ADADELTA: An Adaptive Learning Rate Method. arXiv Preprint arXiv:1212.5701.

source

AdaMax

# MXNet.mx.AdaMaxType.

AdaMax

This is a variant of the Adam algorithm based on the infinity norm. See [1] for further description.

AdaMax(; kwargs...)

Attributes

  • lr::Real: default 0.002, the learning rate controlling the size of update steps
  • beta1::Real: default 0.9, exponential decay rate for the first moment estimates
  • beta2::Real: default 0.999, exponential decay rate for the weighted infinity norm estimates
  • epsilon::Real: default 1e-8, small value added for numerical stability
  • grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
  • weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
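
A hedged plain-Julia sketch of the AdaMax rule from [1] (first moment as in Adam, second moment replaced by an exponentially weighted infinity norm); the library implementation may differ in details:

```julia
function adamax_step!(w, m, u, g, t; lr = 0.002, β1 = 0.9, β2 = 0.999, ϵ = 1e-8)
    m .= β1 .* m .+ (1 - β1) .* g              # biased first-moment estimate
    u .= max.(β2 .* u, abs.(g))                # exponentially weighted infinity norm
    w .-= (lr / (1 - β1^t)) .* m ./ (u .+ ϵ)   # bias-corrected step
    return w
end
```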

References

  • [1]: Kingma, D. P., & Ba, J. (2014): Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.

source

RMSProp

# MXNet.mx.RMSPropType.

RMSProp

Scale learning rates by dividing with the moving average of the root mean squared (RMS) gradients. See [1] for further description.

RMSProp(; kwargs...)

Attributes

  • lr::Real: default 0.1, the learning rate controlling the size of update steps
  • rho::Real: default 0.9, gradient moving average decay factor
  • epsilon::Real: default 1e-6, small value added for numerical stability
  • grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
  • weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.

Notes

rho should be between 0 and 1. A value of rho close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.

Using the step size $lr$ and a decay factor $\rho$, the learning rate $\eta_t$ is calculated as:

$$r_t = \rho r_{t-1} + (1 - \rho) g^2$$
$$\eta_t = \frac{lr}{\sqrt{r_t + \epsilon}}$$
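
The same rule in plain Julia, as a reading aid:

```julia
function rmsprop_step!(w, r, g; lr = 0.1, ρ = 0.9, ϵ = 1e-6)
    r .= ρ .* r .+ (1 - ρ) .* g .^ 2    # moving average of squared gradients
    w .-= lr .* g ./ sqrt.(r .+ ϵ)      # step scaled by the RMS of recent gradients
    return w
end
```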

References

  • [1]: Tieleman, T., & Hinton, G. (2012): Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. Coursera.

source

Nadam

# MXNet.mx.NadamType.

Nadam

Nesterov Adam optimizer: Adam with Nesterov momentum, see [1] and the notes for further description.

Nadam(; kwargs...)

Attributes

  • lr::Real: default 0.001, learning rate.
  • beta1::Real: default 0.99.
  • beta2::Real: default 0.999.
  • epsilon::Real: default 1e-8, small value added for numerical stability
  • grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
  • weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
  • lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
  • momentum_scheduler::AbstractMomentumScheduler: default NadamScheduler of the form $\mu_t = \beta_1 (1 - 0.5 \cdot 0.96^{t \cdot 0.004})$.

Notes

Default parameters follow those provided in the paper. It is recommended to leave the parameters of this optimizer at their default values.
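
A hedged construction sketch; with the defaults, the NadamScheduler described above drives the momentum, so usually only the learning rate needs choosing:

```julia
opt = mx.Nadam(lr = 0.001)
updater = mx.get_updater(opt)    # then used like any other optimizer
```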

References

  • [1]: Dozat, T. (2015): Incorporating Nesterov Momentum into Adam. http://cs229.stanford.edu/proj2015/054_report.pdf

source