API

Data Input

XGBoost.DMatrixType
DMatrix <: AbstractMatrix{Union{Missing,Float32}}

Data structure for storing data which can be understood by an xgboost Booster. These can store both features and targets. Values of the DMatrix can be accessed as with any other AbstractMatrix; however, doing so incurs additional allocations. Performant indexing and matrix operation code should not use DMatrix directly.

Aside from a primary array, the DMatrix object can have various "info" fields associated with it. Training target variables are stored as a special info field with the name label, see setinfo! and setinfos!. These can be retrieved with getinfo and getlabel.

Note that the xgboost library internally uses Float32 to represent all data, so input data is automatically copied unless provided in this format. Additionally, because C uses row-major and Julia column-major memory layout, any non-Transpose matrix will also be copied.

On Missing Values

Xgboost supports training on missing data. Such data is simply omitted from tree splits. Because the DMatrix is internally a Float32 matrix, libxgboost uses a settable sentinel value to represent missing values, see the missing_value keyword argument below (default NaN32). The sentinel is applied only at matrix construction: input elements equal to it are ultimately converted to missing. The most obvious consequence of this is that, with default arguments, NaN32 values are automatically converted to missing. The provided constructors ensure that missing values are preserved.

TL;DR: DMatrix supports missing, and NaNs will be converted to missing.
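
For illustration, a minimal sketch of the default conversion (the values here are arbitrary):

X = [1.0f0 NaN32; 2.0f0 3.0f0]
dm = DMatrix(X)
ismissing(dm[1, 2])  # true: the NaN32 element was converted to missing on construction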

Constructors

DMatrix(X::AbstractMatrix; kw...)
DMatrix(X::AbstractMatrix, y::AbstractVector; kw...)
DMatrix((X, y); kw...)
DMatrix(tbl; kw...)
DMatrix(tbl, y; kw...)
DMatrix(tbl, yname::Symbol; kw...)

Arguments

  • X: A matrix that is the primary data wrapped by the DMatrix. Elements can be missing. Matrices with Float32 elements do not need to be copied.
  • y: Data to assign to the label info field. This is the target variable used in training. Can also be set with the label keyword.
  • tbl: The input matrix in tabular form. tbl must satisfy the Tables.jl interface. If data is passed in tabular form, feature names will be set automatically but can be overridden with the feature_names keyword argument.
  • yname: If passed a tabular argument tbl, yname is the name of the column which holds the label data. It will automatically be omitted from the features.

Keyword Arguments

  • missing_value: The Float32 value of elements of input data to be interpreted as missing, defaults to NaN32.
  • label: Training target data, this is the same as the y argument above, i.e. DMatrix(X,y) and DMatrix(X, label=y) are equivalent.
  • weight: An AbstractVector of weights for each data point. This array must have length equal to the number of rows of the main data matrix.
  • base_margin: Sets the global bias for a boosted model trained on this dataset, see https://xgboost.readthedocs.io/en/stable/prediction.html#base-margin

Examples

using XGBoost
using DataFrames

(X, y) = (randn(10,3), randn(10))

# the following are all equivalent
DMatrix(X, y)
DMatrix((X, y))
DMatrix(X, label=y)

DMatrix(X, y, feature_names=["a", "b", "c"])  # explicitly set feature names

df = DataFrame(A=randn(10), B=randn(10))
DMatrix(df)  # has feature names ["A", "B"] but no label
source
XGBoost.loadMethod
load(DMatrix, fname; silent=true, format=:libsvm, kw...)

Load a DMatrix from the file fname. The matrix must have been serialized with a call to save(::DMatrix, fname). Unless silent, the xgboost library will print logs to stdout. Additional keyword arguments are passed to the DMatrix on construction. format describes the file format; valid options are :binary, :csv and :libsvm.

source
XGBoost.saveMethod
save(dm::DMatrix, fname; silent=true)

Save the DMatrix to file fname in an opaque (xgboost-specific) serialization format. Will print logs to stdout unless silent. Files created with this function can be loaded using XGBoost.load(DMatrix, fname, format=:binary).
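
For example, a round-trip through the binary format might look like the following minimal sketch (the file name is arbitrary):

dm = DMatrix(randn(10, 3))
fname = joinpath(tempdir(), "train.dmatrix")
XGBoost.save(dm, fname)
dm′ = XGBoost.load(DMatrix, fname, format=:binary)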

source
XGBoost.setfeaturenames!Function
setfeaturenames!(dm::DMatrix, names)

Sets the names of the features in dm. This can be used by Booster for reporting. names must be a rank-1 array of strings with length equal to the number of features. Note that this will be set automatically by DMatrix constructors from table objects.

source
XGBoost.setinfo!Function
setinfo!(dm::DMatrix, name, info)

Set DMatrix ancillary info, for example :label or :weight. name can be a string or a Symbol. See DMatrix.

source
XGBoost.setinfos!Function
setinfos!(dm::DMatrix; kw...)

Make arbitrarily many calls to setinfo! via keyword arguments. This function is called by all DMatrix constructors, i.e. DMatrix(X; kw...) is equivalent to setinfos!(DMatrix(X); kw...).
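
For example, a minimal sketch (the weight values are arbitrary):

dm = DMatrix(randn(10, 3))
setinfos!(dm, label=randn(10), weight=ones(10))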

source
XGBoost.setlabel!Function
setlabel!(dm::DMatrix, y)

Set the label data of dm to y. Equivalent to setinfo!(dm, "label", y).

source
XGBoost.getinfoFunction
getinfo(dm::DMatrix, T, name)

Get DMatrix info with name name. Users must specify the underlying data type due to limitations of the xgboost library. One must have T<:AbstractFloat to get floating point data (e.g. label, weight), or T<:Integer to get integer data. The output will be converted to Vector{T} in all cases. name can be either a string or Symbol.
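
For example, a minimal sketch:

dm = DMatrix(randn(10, 3), randn(10))
getinfo(dm, Float32, :label)  # the label data as a Vector{Float32}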

source
XGBoost.sliceFunction
slice(dm::DMatrix, idx; kw...)

Create a new DMatrix out of the subset of rows of dm given by indices idx. For performance reasons it is recommended to take slices before converting to DMatrix. Additional keyword arguments are passed to the newly constructed slice.

This can also be called via Base.getindex; for example, the following are equivalent:

slice(dm, 1:4)
dm[1:4, :]  # second argument *must* be `:` as column slices are not supported.
source
XGBoost.ncolsFunction
ncols(dm::DMatrix)

Returns the number of columns of the DMatrix. Note that this will only count columns of the main data (the X argument to the constructor). The value returned is independent of the presence of labels. In particular size(X,2) == ncols(DMatrix(X)).

source
Base.sizeMethod
size(dm::DMatrix, [dim])

Returns the size of the primary data of the DMatrix. Note that this only accounts for the primary data and is independent of whether labels or any other ancillary data are present. In particular size(X) == size(DMatrix(X)).

source
XGBoost.isgpuFunction
isgpu(dm::DMatrix)

Whether or not the DMatrix data was initialized for a GPU. Boosters trained on such data utilize the GPU for training.

source
XGBoost.getlabelFunction
getlabel(dm::DMatrix)

Retrieve the label (training target) data from the DMatrix. Returns Float32[] if not set.

source
XGBoost.setfeatureinfo!Function
setfeatureinfo!(dm::DMatrix, info_name, strs)

Sets feature metadata in dm. Valid options for info_name are "feature_name" and "feature_type". strs must be a rank-1 array of strings. See setfeaturenames!.

source
XGBoost.setproxy!Function
setproxy!(dm::DMatrix, X::AbstractMatrix; kw...)

Set data in a "proxy" DMatrix like one created with proxy(DMatrix). Keyword arguments are applied to the passed matrix.

source
XGBoost.DataIteratorType
DataIterator

A data structure which wraps an iterator that iteratively provides data for a DMatrix. This can be used, for example, to load data residing in external memory into a DMatrix object that can be used by Booster.

Users should not typically have to deal with DataIterator directly as it is essentially a wrapper around a normal Julia iterator for the purpose of achieving compatibility with the underlying xgboost library calls. See fromiterator for how to construct a DMatrix from an iterator.

source
XGBoost.fromiteratorFunction
fromiterator(DMatrix, itr; cache_prefix=joinpath(tempdir(),"xgb-cache"), nthreads=nothing, kw...)

Create a DMatrix from an iterable object. itr can be any object that implements Julia's Base iteration protocol. Objects returned by the iterator must be key-value collections with Symbol keys, with X as the main matrix and y as the labels. For example

(X=randn(10,2), y=randn(10))

Other keys will be interpreted as keyword arguments to DMatrix.

When this is called, XGBoost will start caching data provided by the iterator on disk in a format that it likes. All cache files generated this way will have the prefix cache_prefix, which is in /tmp by default.

What exactly xgboost does with nthreads is a bit mysterious; nothing gives the library's default.

Additional keyword arguments are passed to a DMatrix constructor.
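
A minimal sketch, assuming an iterator yielding NamedTuple batches with X and y keys:

batches = ((X=randn(10, 2), y=randn(10)) for _ in 1:3)
dm = XGBoost.fromiterator(DMatrix, batches)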

source

Training and Prediction

XGBoost.xgboostFunction
xgboost(data; num_round=10, watchlist=Dict(), kw...)
xgboost(data, ℓ′, ℓ″; kw...)

Creates an xgboost gradient booster object on training data data and runs num_round rounds of training. This is essentially an alias for constructing a Booster with data and keyword arguments followed by update! for num_round rounds.

watchlist is a dict whose keys are strings giving names for the datasets to watch and whose values are DMatrix objects containing the data. When early_stopping_rounds is used and watchlist has more than one element, an OrderedDict is mandatory to ensure XGBoost uses the correct and intended dataset to perform early stopping.

early_stopping_rounds activates early stopping if set to > 0. The validation metric needs to improve at least once in every early_stopping_rounds rounds. If watchlist is not explicitly provided, the training dataset is used to evaluate the stopping criterion; otherwise the last data element in watchlist and the last metric in eval_metric (if more than one is given) are used. Note that watchlist cannot be empty if early_stopping_rounds is enabled.

maximize: If early_stopping_rounds is set, then this parameter must be set as well. When it is false, the smaller the evaluation score the better; when it is true, the larger the better.

All other keyword arguments are passed to Booster. With few exceptions these are model training hyper-parameters, see https://xgboost.readthedocs.io/en/stable/parameter.html for a comprehensive list.

A custom loss function can be provided via its first and second derivatives (ℓ′ and ℓ″ respectively). See updateone! for more details.

Examples

# Example 1: Basic usage of XGBoost
(X, y) = (randn(100,3), randn(100))

b = xgboost((X, y), num_round=10, max_depth=10, η=0.1)

ŷ = predict(b, X)

# Example 2: Using early stopping (using a validation set) with a watchlist
using OrderedCollections: OrderedDict

dtrain = DMatrix((randn(100,3), randn(100)))
dvalid = DMatrix((randn(100,3), randn(100)))

watchlist = OrderedDict(["train" => dtrain, "valid" => dvalid])

b = xgboost(dtrain, num_round=10, early_stopping_rounds = 2, watchlist = watchlist, max_depth=10, η=0.1)

# note that ntree_limit in the predict function helps assign the upper bound for iteration_range in the XGBoost API 1.4+
ŷ = predict(b, dvalid, ntree_limit = b.best_iteration)
source
XGBoost.BoosterType
Booster

Data structure containing xgboost decision trees or other model objects. Booster is used in all methods for training and prediction.

Booster can only consume data from DMatrix objects but most methods can convert provided data implicitly. Note that Booster does not store any of its input or output data.

See xgboost which is shorthand for a Booster constructor followed by training.

The Booster object records all non-default model hyper-parameters set either at construction or with setparam!. The xgboost library does not support retrieval of such parameters so these should be considered for UI purposes only; they are reported in the default show methods of the Booster.

Constructors

Booster(train; kw...)
Booster(trains::AbstractVector; kw...)

Arguments

  • train: Training data. If not a DMatrix this will be passed to the DMatrix constructor. For example, it can be a training matrix or a (training matrix, target) pair.
  • trains: An array of objects used as training data, each of which will be passed to a DMatrix constructor.

Keyword Arguments

All keyword arguments excepting only those listed below will be interpreted as model parameters, see https://xgboost.readthedocs.io/en/stable/parameter.html for a comprehensive list. Both parameter names and their values must be provided exactly as they appear in the linked documentation. Model parameters can also be set after construction, see setparam! and setparams!.

  • tree_method: This parameter gets special handling. By default it is nothing which uses the default from libxgboost as per the documentation unless GPU arrays are used in which case it defaults to "gpu_hist". If an explicit option is set, it will always be used.
  • feature_names: Sets the feature names of training data. This will use the feature names set in the input data if available (e.g. if tabular data was passed this will use column names).
  • model_buffer: A buffer (AbstractVector{UInt8} or IO) from which to load an existing booster model object.
  • model_file: Name of a file from which to load an existing booster model object, see save.
source
XGBoost.updateone!Function
updateone!(b::Booster, data; round_number=getnrounds(b)+1,
           watchlist=Dict("train"=>data), update_feature_names=false
          )

Run one round of gradient boosting with booster b on data data. data can be any object that is accepted by a DMatrix constructor. round_number is the number of the current round and is used for logs only. Info logs will be printed for training sets in watchlist; keys give the name of that dataset for logging purposes only.

source
updateone!(b::Booster, data, ℓ′, ℓ″; kw...)

Run one round of gradient boosting with a loss function ℓ. ℓ′ and ℓ″ are the first and second scalar derivatives of the loss function. For example

ℓ(ŷ, y) = (ŷ - y)^2
ℓ′(ŷ, y) = 2(ŷ - y)
ℓ″(ŷ, y) = 2

where the derivatives are with respect to the first argument (the prediction).

Other arguments are the same as they would be provided to other methods of updateone!.
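
Putting this together, a minimal sketch of training with this custom squared-error loss (assuming the derivatives above):

(X, y) = (randn(100, 3), randn(100))
b = Booster(DMatrix((X, y)))
ℓ′(ŷ, y) = 2(ŷ - y)
ℓ″(ŷ, y) = 2.0
updateone!(b, (X, y), ℓ′, ℓ″)  # one boosting round with the custom loss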

source
XGBoost.update!Function
update!(b::Booster, data; num_round=1, kw...)
update!(b::Booster, data, ℓ′, ℓ″; kw...)

Run num_round rounds of gradient boosting on Booster b.

The first and second derivatives of the loss function (ℓ′ and ℓ″ respectively) can be provided for custom loss.

source
XGBoost.predictFunction
predict(b::Booster, data; margin=false, training=false, ntree_limit=0)

Use the model b to run predictions on data. This will return a Vector{Float32} which can be compared to training or test target data.

If ntree_limit > 0 only the first ntree_limit trees will be used in prediction.

Examples

(X, y) = (randn(100,3), randn(100))
b = xgboost((X, y), num_round=10)

ŷ = predict(b, X)
source
XGBoost.predict_nocopyFunction
predict_nocopy(b::Booster, data; kw...)

Same as predict, but the output array is not copied. Data in the array output by this function may be overwritten by future calls to predict_nocopy or predict.

source
XGBoost.setparam!Function
setparam!(b::Booster, name, val)

Set a model parameter in the booster. The complete list of model parameters can be found at https://xgboost.readthedocs.io/en/stable/parameter.html. Any non-default parameters set via this method will be stored so they can be seen in REPL text output, however the xgboost library does not support parameter retrieval. name can be either a string or a Symbol.
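
For example, assuming a previously constructed Booster b:

setparam!(b, :max_depth, 6)
setparam!(b, "eta", 0.1)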

source
XGBoost.getnroundsFunction
getnrounds(b::Booster)

Get the number of rounds run by the Booster object. Normally this will correspond to the total number of trees stored in the Booster.

source
XGBoost.load!Function
load!(b::Booster, file_or_buffer)

Load a serialized Booster object from a file or buffer into an existing model object. file_or_buffer can be a string giving the name of the file to load from, an AbstractVector{UInt8} buffer, or an IO.

This should load models stored via save (not serialize which may give incompatible buffers).

source
XGBoost.loadMethod
load(Booster, file_or_buffer)

Load a saved Booster model object from a file or buffer. file_or_buffer can be a string giving the name of the file to load from, an AbstractVector{UInt8} buffer or an IO.

This should load models stored via save (not serialize which may give incompatible buffers).

source
XGBoost.saveMethod
save(b::Booster, fname; format="json")
save(b::Booster, Vector{UInt8}; format="json")
save(b::Booster, io::IO; format="json")

Save the Booster object. This saves to formats which are intended to be stored on disk, unlike the opaque binary format produced by serialize. A model saved with this function can be retrieved with load or load!. Valid formats are "json" and "ubj" (universal binary JSON).

source
XGBoost.serializeFunction
serialize(b::Booster)

Serialize the model b into an opaque binary format. Returns a Vector{UInt8}. The output of this function can be loaded with deserialize.

source
XGBoost.nfeaturesFunction
nfeatures(b::Booster)

Get the number of features on which b is being trained. Note that this can return nothing if the Booster object is uninitialized (was created with no data arguments).

source
XGBoost.deserializeFunction
deserialize(Booster, buf, data=[]; kw...)

Deserialize the data in buffer buf to a new Booster object. The data in buf should have been created with serialize. data and keyword arguments are sent to a Booster constructor.
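
A minimal round-trip sketch, assuming a trained Booster b:

buf = XGBoost.serialize(b)
b′ = XGBoost.deserialize(Booster, buf)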

source

Introspection

XGBoost.treesFunction
trees(b::Booster; with_stats=true)

Return all trees of the model of the Booster b as Node objects. The output of this function is a Vector of Nodes each representing the root of a separate tree.

If with_stats the output Node objects will contain the computed statistics gain and cover.
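
For example, since Node satisfies the AbstractTrees.jl interface, the first tree of a trained Booster b can be displayed with print_tree (a minimal sketch):

using AbstractTrees: print_tree

print_tree(first(trees(b)))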

source
XGBoost.importancetableFunction
importancetable(b::Booster)

Return a Tables.jl compatible table (named tuple of Vectors) giving a summary of all available feature importance statistics for b. This table is mainly intended for display purposes, see importance for a more direct way of retrieving importance statistics. See importancereport for a convenient display of this table.

source
XGBoost.importanceFunction
importance(b::Booster, type="gain")

Compute feature importance metric from a trained Booster. Valid options for type are

  • "gain"
  • "weight"
  • "cover"
  • "total_gain"
  • "total_cover"

The output is an OrderedDict with keys corresponding to feature names and values corresponding to importances. The importances are always returned as Vectors, typically with length 1 but possibly longer in multi-class cases. If feature names were not set the keys of the output dict will be integers giving the feature column number. The output will be sorted with the highest importance feature listed first and the lowest importance feature listed last.

See importancetable for a way to generate a tabular summary of all available feature importances and importancereport for a convenient text display of it.
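
For example, assuming a trained Booster b:

imp = importance(b, "total_gain")
first(keys(imp))  # name (or column number) of the most important feature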

source
XGBoost.importancereportFunction
importancereport(b::Booster)

Show a convenient text display of the table output by importancetable.

This is intended entirely for display purposes, see importance for how to retrieve feature importance statistics directly.

Note

In Julia >= 1.9, you have to load Term.jl to be able to use this functionality.

source
XGBoost.NodeType
Node

A data structure representing a node of an XGBoost tree. These are constructed from the dicts returned by dump.

Nodes satisfy the AbstractTrees.jl interface with all nodes being of type Node.

Use trees(booster) to return all trees in the model as Node objects, see trees.

All properties of this struct should be considered public, see propertynames(node) for a list. Leaf nodes will have their value given by leaf.

source
XGBoost.dumpFunction
dump(b::Booster; with_stats=false)

Return the model stored by Booster as a set of hierarchical objects (i.e. from parsed JSON). This can be used to inspect the state of the model. See trees and Node which parse the output from this into a more useful format which satisfies the AbstractTrees interface.

source
XGBoost.dumprawFunction
dumpraw(b::Booster; format="json", with_stats=false)

Dump the models stored by b to a string format. Valid options for format are "json" or "text". See also dump which returns the same thing as parsed JSON.

source

Default Parameters

XGBoost.regressionFunction
regression(;kw...)

Default parameters for performing a regression. Returns a named tuple that can be used to supply arguments in the usual way.

Example

using XGBoost: regression

regression()  # will merely return default parameters as named tuple

xgboost((X, y); num_round=10, regression(max_depth=8)...)
xgboost((X, y); num_round=10, regression()..., max_depth=8)
source
XGBoost.countregressionFunction
countregression(;kw...)

Default parameters for performing a regression on a Poisson-distributed variable.

source
XGBoost.randomforestFunction
randomforest(;kw...)

Default parameters for training as a random forest. Note that a conventional random forest would involve using these parameters with exactly 1 round of boosting, however there is nothing stopping you from boosting n random forests.

Parameters that are particularly relevant to random forests are:

  • num_parallel_tree: number of trees in the forest.
  • subsample: Fraction of rows sampled from the training data (occurs once per boosting iteration).
  • colsample_bynode: Fraction of columns (features) sampled at each node split.
  • η: Learning rate, when set to 1 there is no shrinking of updates.

See https://xgboost.readthedocs.io/en/stable/tutorials/rf.html for more details.

Examples

using XGBoost: regression, randomforest

xgboost((X, y); num_round=1, regression()..., randomforest()...)
source