Optimizers
MXNet.mx.AbstractLearningRateScheduler
— Type.
AbstractLearningRateScheduler
Base type for all learning rate schedulers.
MXNet.mx.AbstractMomentumScheduler
— Type.
AbstractMomentumScheduler
Base type for all momentum schedulers.
MXNet.mx.AbstractOptimizer
— Type.
AbstractOptimizer
Base type for all optimizers.
MXNet.mx.AbstractOptimizerOptions
— Type.
AbstractOptimizerOptions
Base class for all optimizer options.
MXNet.mx.OptimizationState
— Type.
OptimizationState
Attributes:
- batch_size: The size of the mini-batch used in stochastic training.
- curr_epoch: The current epoch count. Epoch 0 means no training yet; during the first pass through the data, the epoch count will be 1; during the second pass, it will be 2, and so on.
- curr_batch: The current mini-batch count. The batch count is reset at every epoch. A batch count of 0 means the beginning of an epoch, with no mini-batch seen yet; during the first mini-batch, the count will be 1.
- curr_iter: The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count is not reset in each epoch, so it tracks the total number of mini-batches seen so far.
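To make the counter semantics concrete, here is a small illustrative sketch (plain Julia, not library code) of how the three counters evolve over two epochs of three mini-batches each:

```julia
# Illustrative only: how curr_epoch, curr_batch and curr_iter relate.
let curr_epoch = 0, curr_batch = 0, curr_iter = 0
    for epoch in 1:2
        curr_epoch = epoch
        curr_batch = 0            # resets at the start of every epoch
        for batch in 1:3
            curr_batch += 1
            curr_iter  += 1       # never resets; counts all mini-batches
        end
        println("epoch=$curr_epoch batch=$curr_batch iter=$curr_iter")
    end
end
# epoch=1 batch=3 iter=3
# epoch=2 batch=3 iter=6
```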
MXNet.mx.get_learning_rate
— Function.
get_learning_rate(scheduler, state)
Arguments
- scheduler::AbstractLearningRateScheduler: a learning rate scheduler.
- state::OptimizationState: the current state about epoch, mini-batch and iteration count.
Returns the current learning rate.
MXNet.mx.get_momentum
— Function.
get_momentum(scheduler, state)
Arguments
- scheduler::AbstractMomentumScheduler: the momentum scheduler.
- state::OptimizationState: the state about current epoch, mini-batch and iteration count.
Returns the current momentum.
MXNet.mx.get_updater
— Method.
get_updater(optimizer)
A utility function to create an updater function that uses its closure to store all the state needed for each weight.
Arguments
- optimizer::AbstractOptimizer: the underlying optimizer.
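For example, a hedged usage sketch; the call order of the returned closure, (index, grad, weight), is assumed here from the usual MXNet updater convention, so check the MXNet.jl source for your version:

```julia
using MXNet

opt     = mx.SGD(lr=0.01, momentum=0.9)
updater = mx.get_updater(opt)

# The closure lazily creates and stores per-parameter state (here: momentum)
# keyed by the integer index, then applies one update step in place.
w = mx.zeros(2, 3)
g = mx.ones(2, 3)
updater(1, g, w)   # assumed call order: (index, grad, weight)
```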
MXNet.mx.normalized_gradient
— Method.
normalized_gradient(opts, state, weight, grad)
Arguments
- opts::AbstractOptimizerOptions: options for the optimizer; should contain the fields grad_clip and weight_decay.
- state::OptimizationState: the current optimization state.
- weight::NDArray: the trainable weights.
- grad::NDArray: the original gradient of the weights.
Returns the properly normalized gradient (re-scaled and clipped if necessary).
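The steps are roughly: rescale by the mini-batch size, clip if grad_clip is positive, and add the weight-decay term. A plain-Julia sketch of that arithmetic (illustrative only; the library operates on NDArrays):

```julia
# Hedged sketch of the usual normalization steps, on plain arrays.
function normalized_gradient_sketch(grad, weight; batch_size=32,
                                    grad_clip=0.0, weight_decay=0.0001)
    g = grad ./ batch_size                    # average over the mini-batch
    if grad_clip > 0
        g = clamp.(g, -grad_clip, grad_clip)  # clip into [-grad_clip, grad_clip]
    end
    return g .+ weight_decay .* weight        # global L2 regularizer term
end
```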
MXNet.mx.LearningRate.Exp
— Type.
LearningRate.Exp
$\eta_t = \eta_0 \gamma^t$. Here $t$ is the epoch count, or the iteration count if decay_on_iteration is set to true.
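For illustration, the decay rule in plain Julia (example values, not library calls):

```julia
η0, γ = 0.1, 0.9
η(t) = η0 * γ^t
η(0), η(10)   # (0.1, ≈0.0349)
```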
MXNet.mx.LearningRate.Fixed
— Type.
LearningRate.Fixed
A fixed learning rate scheduler that always returns the same learning rate.
MXNet.mx.LearningRate.Inv
— Type.
LearningRate.Inv
$\eta_t = \eta_0 (1 + \gamma t)^{-\mathrm{power}}$. Here $t$ is the epoch count, or the iteration count if decay_on_iteration is set to true.
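Again in plain Julia for illustration (example values, not library calls):

```julia
η0, γ, power = 0.1, 0.9, 0.5
η(t) = η0 * (1 + γ * t)^(-power)
η(0), η(10)   # (0.1, ≈0.0316)
```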
MXNet.mx.Momentum.Fixed
— Type.
Momentum.Fixed
A fixed momentum scheduler that always returns the same value.
MXNet.mx.Momentum.NadamScheduler
— Type.
Momentum.NadamScheduler
Nesterov-accelerated adaptive momentum scheduler.
Description in "Incorporating Nesterov Momentum into Adam." http://cs229.stanford.edu/proj2015/054_report.pdf
$\mu_t = \mu_0 (1 - \gamma \alpha^{t \delta})$. Here
- $t$ is the iteration count
- $\delta$: default 0.004, the scheduler decay
- $\gamma$: default 0.5
- $\alpha$: default 0.96
- $\mu_0$: default 0.99
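Plugging in the defaults, the schedule in plain Julia (illustrative only):

```julia
μ0, γ, α, δ = 0.99, 0.5, 0.96, 0.004
μ(t) = μ0 * (1 - γ * α^(t * δ))
μ(1), μ(1000)   # ≈ (0.495, 0.570)
```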
MXNet.mx.Momentum.Null
— Type.
Momentum.Null
The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.
Built-in optimizers
Stochastic Gradient Descent
MXNet.mx.SGD
— Type.
SGD
Stochastic gradient descent optimizer.
SGD(; kwargs...)
Arguments:
- lr::Real: default 0.01, learning rate.
- lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
- momentum::Real: default 0.0, the momentum.
- momentum_scheduler::AbstractMomentumScheduler: default nothing, a dynamic momentum scheduler. If set, will overwrite the momentum parameter.
- grad_clip::Real: default 0, if positive, will clip the gradient into the bounded range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.0001, weight decay is equivalent to adding a global l2 regularizer to the parameters.
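A brief usage sketch with the keywords listed above; passing the optimizer when fitting a FeedForward model is the usual entry point, though the fit call shown in the comment is illustrative and may differ between MXNet.jl versions:

```julia
using MXNet

opt = mx.SGD(lr=0.01, momentum=0.9, weight_decay=0.0001, grad_clip=5.0)

# Illustrative training call (check your MXNet.jl version for the exact signature):
# mx.fit(model, opt, train_provider, n_epoch=10)
```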
ADAM
MXNet.mx.ADAM
— Type.
ADAM
The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].
ADAM(; kwargs...)
Arguments:
- lr::Real: default 0.001, learning rate.
- lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
- beta1::Real: default 0.9.
- beta2::Real: default 0.999.
- epsilon::Real: default 1e-8.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
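Constructed like the other optimizers, using only the keywords listed above (the values shown simply make the documented defaults explicit):

```julia
using MXNet

opt = mx.ADAM(lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)
```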
AdaGrad
MXNet.mx.AdaGrad
— Type.
AdaGrad
Scale learning rates by dividing with the square root of accumulated squared gradients. See [1] for further description.
AdaGrad(; kwargs...)
Attributes
- lr::Real: default 0.1, the learning rate controlling the size of update steps.
- epsilon::Real: default 1e-6, small value added for numerical stability.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
Using step size lr, AdaGrad calculates the learning rate for feature $i$ at time step $t$ as $\eta_{t,i} = \frac{lr}{\sqrt{\sum^t_{t^\prime} g^2_{t^\prime,i} + \epsilon}} g_{t,i}$, such that the learning rate is monotonically decreasing. Epsilon is not included in the typical formula, see [2].
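A plain-array sketch of the accumulation behind this rule (illustrative only, not the library implementation):

```julia
lr, ϵ = 0.1, 1e-6
acc = zeros(3)                         # running sum of squared gradients per coordinate
function adagrad_step!(w, g)
    acc .+= g .^ 2
    w  .-= lr .* g ./ sqrt.(acc .+ ϵ)  # step size shrinks monotonically
    return w
end
```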
References
- [1]: Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.
- [2]: Chris Dyer: Notes on AdaGrad. http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf
AdaDelta
MXNet.mx.AdaDelta
— Type.
AdaDelta
Scale learning rates by the ratio of accumulated gradients to accumulated updates, see [1] and notes for further description.
AdaDelta(; kwargs...)
Attributes
- lr::Real: default 1.0, the learning rate controlling the size of update steps.
- rho::Real: default 0.9, squared gradient moving average decay factor.
- epsilon::Real: default 1e-6, small value added for numerical stability.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
rho should be between 0 and 1. A value of rho close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.
rho = 0.95 and epsilon = 1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech). In the paper, no learning rate is considered (so lr = 1.0). Probably best to keep it at this value.
epsilon is important for the very first update (so the numerator does not become 0).
Using the step size $\eta$ (the lr parameter) and a decay factor $\rho$ (rho), the learning rate is calculated as:
$r_t = \rho r_{t-1} + (1 - \rho) g^2$
$\eta_t = \eta \frac{\sqrt{s_{t-1} + \epsilon}}{\sqrt{r_t + \epsilon}}$
$s_t = \rho s_{t-1} + (1 - \rho) (\eta_t g)^2$
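A plain-array sketch of one step following the three formulas above (illustrative only, not the library implementation):

```julia
ρ, ϵ, lr = 0.95, 1e-6, 1.0
r = zeros(3)   # accumulated squared gradients
s = zeros(3)   # accumulated squared (scaled) updates
function adadelta_step!(w, g)
    r .= ρ .* r .+ (1 - ρ) .* g .^ 2
    Δ  = lr .* sqrt.(s .+ ϵ) ./ sqrt.(r .+ ϵ) .* g
    s .= ρ .* s .+ (1 - ρ) .* Δ .^ 2
    w .-= Δ
    return w
end
```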
References
- [1]: Zeiler, M. D. (2012): ADADELTA: An Adaptive Learning Rate Method. arXiv Preprint arXiv:1212.5701.
AdaMax
MXNet.mx.AdaMax
— Type.
AdaMax
This is a variant of the Adam algorithm based on the infinity norm. See [1] for further description.
AdaMax(; kwargs...)
Attributes
- lr::Real: default 0.002, the learning rate controlling the size of update steps.
- beta1::Real: default 0.9, exponential decay rate for the first moment estimates.
- beta2::Real: default 0.999, exponential decay rate for the weighted infinity norm estimates.
- epsilon::Real: default 1e-8, small value added for numerical stability.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
References
- [1]: Kingma, Diederik, and Jimmy Ba (2014): Adam: A Method for Stochastic Optimization. http://arxiv.org/abs/1412.6980v8.
RMSProp
MXNet.mx.RMSProp
— Type.
RMSProp
Scale learning rates by dividing with the moving average of the root mean squared (RMS) gradients. See [1] for further description.
RMSProp(; kwargs...)
Attributes
- lr::Real: default 0.1, the learning rate controlling the size of update steps.
- rho::Real: default 0.9, gradient moving average decay factor.
- epsilon::Real: default 1e-6, small value added for numerical stability.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
Notes
rho should be between 0 and 1. A value of rho close to 1 will decay the moving average slowly and a value close to 0 will decay the moving average fast.
Using the step size $lr$ and a decay factor $\rho$, the learning rate $\eta_t$ is calculated as:
$r_t = \rho r_{t-1} + (1 - \rho) g^2$
$\eta_t = \frac{lr}{\sqrt{r_t + \epsilon}}$
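A plain-array sketch of one step following the two formulas above (illustrative only, not the library implementation):

```julia
lr, ρ, ϵ = 0.1, 0.9, 1e-6
r = zeros(3)                        # moving average of squared gradients
function rmsprop_step!(w, g)
    r .= ρ .* r .+ (1 - ρ) .* g .^ 2
    w .-= lr ./ sqrt.(r .+ ϵ) .* g  # η_t = lr / sqrt(r_t + ϵ)
    return w
end
```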
References
- [1]: Tieleman, T. and Hinton, G. (2012): Neural Networks for Machine Learning, Lecture 6.5 - rmsprop. Coursera. http://www.youtube.com/watch?v=O3sxAc4hxZU (formula @5:20)
Nadam
MXNet.mx.Nadam
— Type.
Nadam
Nesterov Adam optimizer: Adam with Nesterov momentum; see [1] and the notes below for further description.
Nadam(; kwargs...)
Attributes
- lr::Real: default 0.001, learning rate.
- beta1::Real: default 0.99.
- beta2::Real: default 0.999.
- epsilon::Real: default 1e-8, small value added for numerical stability.
- grad_clip::Real: default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay::Real: default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
- lr_scheduler::AbstractLearningRateScheduler: default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
- momentum_scheduler::AbstractMomentumScheduler: default NadamScheduler of the form $\mu_t = \beta_1 (1 - 0.5 \times 0.96^{t \times 0.004})$.
Notes
Default parameters follow those provided in the paper. It is recommended to leave the parameters of this optimizer at their default values.
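Accordingly, the plain constructor call with no keywords is the typical usage:

```julia
using MXNet

opt = mx.Nadam()   # defaults follow the paper, as recommended above
```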
References
- [1]: Incorporating Nesterov Momentum into Adam. http://cs229.stanford.edu/proj2015/054_report.pdf
- [2]: On the importance of initialization and momentum in deep learning http://www.cs.toronto.edu/~fritz/absps/momentum.pdf