API
XGBoost.Booster
XGBoost.DMatrix
XGBoost.DataIterator
XGBoost.Node
Base.size
XGBoost.classification
XGBoost.countregression
XGBoost.deserialize
XGBoost.deserialize!
XGBoost.dump
XGBoost.dumpraw
XGBoost.fromiterator
XGBoost.getfeatureinfo
XGBoost.getfeaturenames
XGBoost.getinfo
XGBoost.getlabel
XGBoost.getnrounds
XGBoost.getweights
XGBoost.importance
XGBoost.importancereport
XGBoost.importancetable
XGBoost.isgpu
XGBoost.load
XGBoost.load
XGBoost.load!
XGBoost.ncols
XGBoost.nfeatures
XGBoost.nrows
XGBoost.predict
XGBoost.predict_nocopy
XGBoost.randomforest
XGBoost.regression
XGBoost.save
XGBoost.save
XGBoost.serialize
XGBoost.setfeatureinfo!
XGBoost.setfeaturenames!
XGBoost.setinfo!
XGBoost.setinfos!
XGBoost.setlabel!
XGBoost.setparam!
XGBoost.setparams!
XGBoost.setproxy!
XGBoost.slice
XGBoost.trees
XGBoost.update!
XGBoost.updateone!
XGBoost.xgboost
Data Input
XGBoost.DMatrix — Type

DMatrix <: AbstractMatrix{Union{Missing,Float32}}

Data structure for storing data in a form understood by an xgboost Booster. These can store both features and targets. Values of the DMatrix can be accessed as with any other AbstractMatrix, but doing so causes additional allocations; performance-critical indexing and matrix-operation code should not use DMatrix directly.
Aside from a primary array, the DMatrix object can have various "info" fields associated with it. Training target variables are stored as a special info field with the name label; see setinfo! and setinfos!. These can be retrieved with getinfo and getlabel.
Note that the xgboost library internally uses Float32 to represent all data, so input data is automatically copied unless provided in this format. Unfortunately, because of the differing memory layouts used by C (row-major) and Julia (column-major), any non-Transpose matrix will also be copied.
On missing Values

Xgboost supports training on missing data; such data is simply omitted from tree splits. Because the DMatrix is internally a Float32 matrix, libxgboost uses a settable value to represent missing values; see the missing_value keyword argument below (default NaN32). This value is used only during matrix construction: input matrix elements equal to it are ultimately converted to missing. The most obvious consequence is that, with default arguments, NaN32 values are automatically converted to missing. The provided constructors ensure that missing values are preserved.

TL;DR: DMatrix supports missing, and NaN's will be converted to missing.
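For illustration, NaN elements of the input become missing in the constructed DMatrix (note that indexing a DMatrix allocates, so this is for inspection only):

X = [1.0 2.0; NaN 4.0]
dm = DMatrix(X)
ismissing(dm[2, 1])  # true: NaN matches the default missing_value
dm[1, 1]             # 1.0f0, stored internally as Float32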
Constructors
DMatrix(X::AbstractMatrix; kw...)
DMatrix(X::AbstractMatrix, y::AbstractVector; kw...)
DMatrix((X, y); kw...)
DMatrix(tbl; kw...)
DMatrix(tbl, y; kw...)
DMatrix(tbl, yname::Symbol; kw...)
Arguments

- X: A matrix that is the primary data wrapped by the DMatrix. Elements can be missing. Matrices with Float32 elements do not need to be copied.
- y: Data to assign to the label info field. This is the target variable used in training. Can also be set with the label keyword.
- tbl: The input matrix in tabular form. tbl must satisfy the Tables.jl interface. If data is passed in tabular form, feature names will be set automatically but can be overridden with the feature_names keyword argument.
- yname: If passed a tabular argument tbl, yname is the name of the column which holds the label data. It will automatically be omitted from the features.
Keyword Arguments

- missing_value: The Float32 value of elements of the input data to be interpreted as missing, defaults to NaN32.
- label: Training target data; this is the same as the y argument above, i.e. DMatrix(X, y) and DMatrix(X, label=y) are equivalent.
- weight: An AbstractVector of weights for each data point. This array must have length equal to the number of rows of the main data matrix.
- base_margin: Sets the global bias for a boosted model trained on this dataset, see https://xgboost.readthedocs.io/en/stable/prediction.html#base-margin
Examples
(X, y) = (randn(10,3), randn(10))
# the following are all equivalent
DMatrix(X, y)
DMatrix((X, y))
DMatrix(X, label=y)
DMatrix(X, y, feature_names=["a", "b", "c"]) # explicitly set feature names
df = DataFrame(A=randn(10), B=randn(10))
DMatrix(df) # has feature names ["A", "B"] but no label
XGBoost.load — Method

load(DMatrix, fname; silent=true, format=:libsvm, kw...)

Load a DMatrix from the file with name fname. The matrix must have been serialized with a call to save(::DMatrix, fname). The xgboost library will print logs to stdout unless silent. Additional keyword arguments are passed to the DMatrix on construction. format describes the file format; valid options are :binary, :csv and :libsvm.
XGBoost.save — Method

save(dm::DMatrix, fname; silent=true)

Save the DMatrix to file fname in an opaque (xgboost-specific) serialization format. Will print logs to stdout unless silent. Files created with this function can be loaded using XGBoost.load(DMatrix, fname, format=:binary).
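A minimal save/load round trip (a sketch; the file name is hypothetical and tempdir() is assumed writable):

dm = DMatrix(randn(8, 2), label=randn(8))
fname = joinpath(tempdir(), "data.dmatrix")  # hypothetical file name
XGBoost.save(dm, fname)
dm2 = XGBoost.load(DMatrix, fname, format=:binary)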
XGBoost.setfeaturenames! — Function

setfeaturenames!(dm::DMatrix, names)

Sets the names of the features in dm. These can be used by Booster for reporting. names must be a rank-1 array of strings with length equal to the number of features. Note that this will be set automatically by DMatrix constructors from table objects.
XGBoost.getfeaturenames — Function

getfeaturenames(dm::DMatrix)

Get the names of the features in dm.
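A minimal sketch of setting and retrieving feature names:

dm = DMatrix(randn(4, 2))
setfeaturenames!(dm, ["a", "b"])
getfeaturenames(dm)  # ["a", "b"]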
XGBoost.setinfo! — Function

setinfo!(dm::DMatrix, name, info)

Set DMatrix ancillary info, for example :label or :weight. name can be a string or a Symbol. See DMatrix.
XGBoost.setinfos! — Function

setinfos!(dm::DMatrix; kw...)

Make arbitrarily many calls to setinfo! via keyword arguments. This function is called by all DMatrix constructors, i.e. DMatrix(X; kw...) is equivalent to setinfos!(DMatrix(X); kw...).
XGBoost.setlabel! — Function

setlabel!(dm::DMatrix, y)

Set the label data of dm to y. Equivalent to setinfo!(dm, "label", y).
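For example, attaching info fields after construction:

dm = DMatrix(randn(10, 3))
setlabel!(dm, randn(10))        # same as setinfo!(dm, :label, ...)
setinfos!(dm; weight=rand(10))  # keyword form, arbitrarily many fields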
XGBoost.getinfo — Function

getinfo(dm::DMatrix, T, name)

Get DMatrix info with name name. Users must specify the underlying data type due to limitations of the xgboost library: one must have T<:AbstractFloat to get floating point data (e.g. label, weight), or T<:Integer to get integer data. The output will be converted to Vector{T} in all cases. name can be either a string or a Symbol.
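For example, retrieving the label via getinfo is equivalent to getlabel:

dm = DMatrix(randn(10, 3), label=randn(10))
getinfo(dm, Float32, :label) == getlabel(dm)  # true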
XGBoost.slice — Function

slice(dm::DMatrix, idx; kw...)

Create a new DMatrix out of the subset of rows of dm given by the indices idx. For performance reasons it is recommended to take slices before converting to DMatrix. Additional keyword arguments are passed to the newly constructed slice.

This can also be called via Base.getindex; for example, the following are equivalent:

slice(dm, 1:4)
dm[1:4, :]  # second argument *must* be `:` as column slices are not supported
XGBoost.nrows — Function

nrows(dm::DMatrix)

Returns the number of rows of the DMatrix.
XGBoost.ncols — Function

ncols(dm::DMatrix)

Returns the number of columns of the DMatrix. Note that this only counts columns of the main data (the X argument to the constructor); the value returned is independent of the presence of labels. In particular, size(X,2) == ncols(DMatrix(X)).
Base.size — Method

size(dm::DMatrix, [dim])

Returns the size of the primary data of the DMatrix. Note that this only accounts for the primary data and is independent of whether labels or any other ancillary data are present. In particular, size(X) == size(DMatrix(X)).
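A quick sketch of how these dimension queries relate:

X = randn(10, 3)
dm = DMatrix(X, label=randn(10))
(nrows(dm), ncols(dm))  # (10, 3)
size(dm) == size(X)     # true; the label does not affect the size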
XGBoost.isgpu — Function

isgpu(dm::DMatrix)

Whether or not the DMatrix data was initialized for a GPU. Boosters trained on such data utilize the GPU for training.
XGBoost.getlabel — Function

getlabel(dm::DMatrix)

Retrieve the label (training target) data from the DMatrix. Returns Float32[] if not set.
XGBoost.getweights — Function

getweights(dm::DMatrix)

Get the training weights of the data. Returns Float32[] if not set.
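A short sketch of both getters:

dm = DMatrix(randn(5, 2), label=ones(5), weight=rand(5))
getlabel(dm)    # 5-element Vector{Float32} of ones
getweights(dm)  # the weights set above, as Float32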
XGBoost.setfeatureinfo! — Function

setfeatureinfo!(dm::DMatrix, info_name, strs)

Sets feature metadata in dm. Valid options for info_name are "feature_name" and "feature_type". strs must be a rank-1 array of strings. See setfeaturenames!.
XGBoost.getfeatureinfo — Function

getfeatureinfo(dm::DMatrix, info_name)

Get feature info that was set via setfeatureinfo!. Valid options for info_name are "feature_name" and "feature_type". See getfeaturenames.
XGBoost.setproxy! — Function

setproxy!(dm::DMatrix, X::AbstractMatrix; kw...)

Set data in a "proxy" DMatrix such as one created with proxy(DMatrix). Keyword arguments are set on the passed matrix.
XGBoost.DataIterator — Type

DataIterator

A data structure which wraps an iterator that iteratively provides data for a DMatrix. This can be used, e.g., to aid with loading external-memory data into a DMatrix object that can be used by Booster.

Users should not typically have to deal with DataIterator directly, as it is essentially a wrapper around a normal Julia iterator for the purpose of achieving compatibility with the underlying xgboost library calls. See fromiterator for how to construct a DMatrix from an iterator.
XGBoost.fromiterator — Function

fromiterator(DMatrix, itr; cache_prefix=joinpath(tempdir(),"xgb-cache"), nthreads=nothing, kw...)

Create a DMatrix from an iterable object. itr can be any object that implements Julia's Base iteration protocol. Objects returned by the iterator must be key-value collections with Symbol keys, with X as the main matrix and y as labels. For example:

(X=randn(10,2), y=randn(10))

Other keys will be interpreted as keyword arguments to DMatrix.

When this is called, XGBoost will start caching data provided by the iterator on disk in a format of its choosing. All cache files generated this way will have the prefix cache_prefix, which is in tempdir() by default.

What exactly xgboost does with nthreads is a bit mysterious; nothing gives the library's default.

Additional keyword arguments are passed to a DMatrix constructor.
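A minimal sketch, assuming each batch follows the key-value convention above:

# two hypothetical data batches provided lazily by a generator
batches = ((X=randn(10, 2), y=randn(10)) for _ in 1:2)
dm = XGBoost.fromiterator(DMatrix, batches)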
Training and Prediction
XGBoost.xgboost — Function

xgboost(data; num_round=10, watchlist=Dict(), kw...)
xgboost(data, ℓ′, ℓ″; kw...)

Creates an xgboost gradient booster object on training data data and runs num_round rounds of training. This is essentially an alias for constructing a Booster with data and keyword arguments followed by update! for num_round rounds.

watchlist is a dict whose keys are strings naming the data to watch and whose values are DMatrix objects containing the data. It is mandatory to use an OrderedDict when early_stopping_rounds is used and watchlist has more than one element, to ensure XGBoost uses the intended dataset for early stopping.

early_stopping_rounds activates early stopping if set to > 0. The validation metric needs to improve at least once every early_stopping_rounds rounds. If watchlist is not explicitly provided, the training dataset is used to evaluate the stopping criterion; otherwise the last data element in watchlist and the last metric in eval_metric (if more than one) are used. Note that watchlist cannot be empty if early_stopping_rounds is enabled.

maximize must be set if early_stopping_rounds is set. When false, a smaller evaluation score is better; when true, a larger evaluation score is better.

All other keyword arguments are passed to Booster. With few exceptions these are model training hyper-parameters; see https://xgboost.readthedocs.io/en/stable/parameter.html for a comprehensive list.

A custom loss function can be provided via its first and second derivatives (ℓ′ and ℓ″ respectively). See updateone! for more details.
Examples
# Example 1: Basic usage of XGBoost
(X, y) = (randn(100,3), randn(100))
b = xgboost((X, y), num_round=10, max_depth=10, η=0.1)
ŷ = predict(b, X)
# Example 2: Using early stopping (using a validation set) with a watchlist
dtrain = DMatrix((randn(100,3), randn(100)))
dvalid = DMatrix((randn(100,3), randn(100)))
watchlist = OrderedDict(["train" => dtrain, "valid" => dvalid])
b = xgboost(dtrain, num_round=10, early_stopping_rounds = 2, watchlist = watchlist, max_depth=10, η=0.1)
# note that ntree_limit in the predict function helps assign the upper bound for iteration_range in the XGBoost API 1.4+
ŷ = predict(b, dvalid, ntree_limit = b.best_iteration)
XGBoost.Booster — Type

Booster

Data structure containing xgboost decision trees or other model objects. Booster is used in all methods for training and prediction.

Booster can only consume data from DMatrix objects, but most methods can convert provided data implicitly. Note that Booster does not store any of its input or output data.

See xgboost, which is shorthand for a Booster constructor followed by training.

The Booster object records all non-default model hyper-parameters set either at construction or with setparam!. The xgboost library does not support retrieval of such parameters, so these should be considered for UI purposes only; they are reported in the default show methods of the Booster.
Constructors
Booster(train; kw...)
Booster(trains::AbstractVector; kw...)
Arguments

- train: Training data. If not a DMatrix, this will be passed to the DMatrix constructor. For example, it can be a training matrix or a (training matrix, target) pair.
- trains: An array of objects used as training data, each of which will be passed to a DMatrix constructor.
Keyword Arguments

All keyword arguments other than those listed below will be interpreted as model parameters; see https://xgboost.readthedocs.io/en/stable/parameter.html for a comprehensive list. Both parameter names and their values must be provided exactly as they appear in the linked documentation. Model parameters can also be set after construction, see setparam! and setparams!.

- tree_method: This parameter gets special handling. By default it is nothing, which uses the default from libxgboost as per the documentation, unless GPU arrays are used, in which case it defaults to "gpu_hist". If an explicit option is set, it will always be used.
- feature_names: Sets the feature names of the training data. This will use the feature names set in the input data if available (e.g. if tabular data was passed, this will use the column names).
- model_buffer: A buffer (AbstractVector{UInt8} or IO) from which to load an existing booster model object.
- model_file: Name of a file from which to load an existing booster model object, see save.
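A minimal construction sketch (hyper-parameter values are illustrative only):

dtrain = DMatrix(randn(20, 3), randn(20))
b = Booster(dtrain; max_depth=4, η=0.2)  # keyword args become model parameters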
XGBoost.updateone! — Function

updateone!(b::Booster, data; round_number=getnrounds(b)+1,
           watchlist=Dict("train"=>data), update_feature_names=false
          )

Run one round of gradient boosting with booster b on data data. data can be any object that is accepted by a DMatrix constructor. round_number is the number of the current round and is used only for logging. Info logs will be printed for training sets in watchlist; the keys give the name of the dataset for logging purposes only.

updateone!(b::Booster, data, ℓ′, ℓ″; kw...)

Run one round of gradient boosting with a loss function ℓ. ℓ′ and ℓ″ are the first and second scalar derivatives of the loss function. For example

ℓ(ŷ, y) = (ŷ - y)^2
ℓ′(ŷ, y) = 2(ŷ - y)
ℓ″(ŷ, y) = 2

where the derivatives are with respect to the first argument (the prediction).

Other arguments are the same as they would be provided to other methods of updateone!.
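A minimal sketch of the custom-loss form, using the squared-error derivatives above:

(X, y) = (randn(100, 3), randn(100))
b = Booster(DMatrix(X, y))
ℓ′(ŷ, y) = 2(ŷ - y)
ℓ″(ŷ, y) = 2.0
updateone!(b, DMatrix(X, y), ℓ′, ℓ″)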
XGBoost.update! — Function

update!(b::Booster, data; num_round=1, kw...)
update!(b::Booster, data, ℓ′, ℓ″; kw...)

Run num_round rounds of gradient boosting on Booster b.

The first and second derivatives of the loss function (ℓ′ and ℓ″ respectively) can be provided for a custom loss.
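For example:

dtrain = DMatrix(randn(50, 3), randn(50))
b = Booster(dtrain)
update!(b, dtrain, num_round=5)
getnrounds(b)  # 5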
XGBoost.predict — Function

predict(b::Booster, data; margin=false, training=false, ntree_limit=0)

Use the model b to run predictions on data. This will return a Vector{Float32} which can be compared to training or test target data.

If ntree_limit > 0, only the first ntree_limit trees will be used in prediction.
Examples
(X, y) = (randn(100,3), randn(100))
b = xgboost((X, y), num_round=10)
ŷ = predict(b, X)
XGBoost.predict_nocopy — Function

predict_nocopy(b::Booster, data; kw...)

Same as predict, but the output array is not copied. Data in the array output by this function may be overwritten by future calls to predict_nocopy or predict.
XGBoost.setparam! — Function

setparam!(b::Booster, name, val)

Set a model parameter in the booster. The complete list of model parameters can be found at https://xgboost.readthedocs.io/en/stable/parameter.html. Any non-default parameters set via this method will be stored so that they can be seen in REPL text output; however, the xgboost library does not support parameter retrieval. name can be either a string or a Symbol.
XGBoost.setparams! — Function

setparams!(b::Booster; kw...)

Set arbitrarily many model parameters via keyword arguments, see setparam!.
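A minimal sketch (parameter values are illustrative only):

b = Booster(DMatrix(randn(10, 2), randn(10)))
setparam!(b, :max_depth, 4)
setparams!(b; η=0.2, subsample=0.8)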
XGBoost.getnrounds — Function

getnrounds(b::Booster)

Get the number of rounds run by the Booster object. Normally this will correspond to the total number of trees stored in the Booster.
XGBoost.load! — Function

load!(b::Booster, file_or_buffer)

Load a serialized Booster object from a file or buffer into an existing model object. file_or_buffer can be a string giving the name of the file to load from, an AbstractVector{UInt8} buffer, or an IO.

This should load models stored via save (not serialize, which may give incompatible buffers).
XGBoost.load — Method

load(Booster, file_or_buffer)

Load a saved Booster model object from a file or buffer. file_or_buffer can be a string giving the name of the file to load from, an AbstractVector{UInt8} buffer, or an IO.

This should load models stored via save (not serialize, which may give incompatible buffers).
XGBoost.save — Method

save(b::Booster, fname; format="json")
save(b::Booster, Vector{UInt8}; format="json")
save(b::Booster, io::IO; format="json")

Save the Booster object. This saves to formats which are intended to be stored on disk, but the formats used are a lot saner than those used by serialize. A model saved with this function can be retrieved with load or load!. Valid formats are "json" and "ubj" (universal binary JSON).
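A minimal save/load round trip (a sketch; the file name is hypothetical and tempdir() is assumed writable):

b = xgboost((randn(20, 2), randn(20)), num_round=2)
fname = joinpath(tempdir(), "model.json")  # hypothetical file name
XGBoost.save(b, fname)
b2 = XGBoost.load(Booster, fname)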
XGBoost.serialize — Function

serialize(b::Booster)

Serialize the model b into an opaque binary format. Returns a Vector{UInt8}. The output of this function can be loaded with deserialize.
XGBoost.nfeatures — Function

nfeatures(b::Booster)

Get the number of features on which b is being trained. Note that this can return nothing if the Booster object is uninitialized (i.e. was created with no data arguments).
XGBoost.deserialize! — Function

deserialize!(b::Booster, buf)

Deserialize a buffer created with serialize into the provided Booster object.
XGBoost.deserialize — Function

deserialize(Booster, buf, data=[]; kw...)

Deserialize the data in the buffer buf to a new Booster object. The data in buf should have been created with serialize. data and keyword arguments are sent to a Booster constructor.
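For example, a serialize/deserialize round trip:

b = xgboost((randn(20, 2), randn(20)), num_round=2)
buf = XGBoost.serialize(b)              # Vector{UInt8}
b2 = XGBoost.deserialize(Booster, buf)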
Introspection
XGBoost.trees — Function

trees(b::Booster; with_stats=true)

Return all trees of the model of the Booster b as Node objects. The output of this function is a Vector of Nodes, each representing the root of a separate tree.

If with_stats, the output Node objects will contain the computed statistics gain and cover.
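A short sketch; print_tree is from AbstractTrees.jl, whose interface Node satisfies:

using AbstractTrees
b = xgboost((randn(100, 3), randn(100)), num_round=3)
ts = trees(b)
print_tree(first(ts))  # display the first tree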
XGBoost.importancetable — Function

importancetable(b::Booster)

Return a Tables.jl compatible table (a named tuple of Vectors) giving a summary of all available feature importance statistics for b. This table is mainly intended for display purposes; see importance for a more direct way of retrieving importance statistics. See importancereport for a convenient display of this table.
XGBoost.importance — Function

importance(b::Booster, type="gain")

Compute a feature importance metric from a trained Booster. Valid options for type are

- "gain"
- "weight"
- "cover"
- "total_gain"
- "total_cover"

The output is an OrderedDict with keys corresponding to feature names and values corresponding to importances. The importances are always returned as Vectors, typically with length 1 but possibly longer in multi-class cases. If feature names were not set, the keys of the output dict will be integers giving the feature column number. The output is sorted with the highest-importance feature listed first and the lowest-importance feature listed last.

See importancetable for a way to generate a tabular summary of all available feature importances, and importancereport for a convenient text display of it.
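For example:

b = xgboost((randn(100, 3), randn(100)), num_round=5)
imp = importance(b, "gain")
first(keys(imp))  # the most important feature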
XGBoost.importancereport — Function

importancereport(b::Booster)

Show a convenient text display of the table output by importancetable.

This is intended entirely for display purposes; see importance for how to retrieve feature importance statistics directly.

In Julia >= 1.9, you have to load Term.jl to be able to use this functionality.
XGBoost.Node — Type

Node

A data structure representing a node of an XGBoost tree. These are constructed from the dicts returned by dump.

Nodes satisfy the AbstractTrees.jl interface, with all nodes being of type Node.

Use trees(booster) to return all trees in the model as Node objects, see trees.

All properties of this struct should be considered public; see propertynames(node) for a list. Leaf nodes will have their value given by leaf.
XGBoost.dump — Function

dump(b::Booster; with_stats=false)

Return the model stored by Booster as a set of hierarchical objects (i.e. as parsed JSON). This can be used to inspect the state of the model. See trees and Node, which parse the output from this into a more useful format satisfying the AbstractTrees interface.
XGBoost.dumpraw — Function

dumpraw(b::Booster; format="json", with_stats=false)

Dump the models stored by b to a string format. Valid options for format are "json" or "text". See also dump, which returns the same thing as parsed JSON.
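A short sketch of both dump forms:

b = xgboost((randn(50, 2), randn(50)), num_round=2)
d = XGBoost.dump(b)                    # vector of parsed-JSON dicts, one per tree
s = XGBoost.dumpraw(b; format="text")  # the same information as a raw string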
Default Parameters
XGBoost.regression — Function

regression(;kw...)

Default parameters for performing a regression. Returns a named tuple that can be used to supply arguments in the usual way.

Example

using XGBoost: regression

regression()  # will merely return the default parameters as a named tuple

xgboost((X, y); num_round=10, regression(max_depth=8)...)
xgboost((X, y); num_round=10, regression()..., max_depth=8)
XGBoost.countregression — Function

countregression(;kw...)

Default parameters for performing a regression on a Poisson-distributed variable.
XGBoost.classification — Function

classification(;kw...)

Default parameters for performing a classification.
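By analogy with the regression example above (a sketch; assumes 0/1 integer labels):

using XGBoost: classification
(X, y) = (randn(100, 3), rand(0:1, 100))
b = xgboost((X, y); num_round=10, classification()...)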
XGBoost.randomforest — Function

randomforest(;kw...)

Default parameters for training as a random forest. Note that a conventional random forest would involve using these parameters with exactly 1 round of boosting; however, there is nothing stopping you from boosting n random forests.

Parameters that are particularly relevant to random forests are:

- num_parallel_tree: Number of trees in the forest.
- subsample: Sampling fraction of the data (occurs once per boosting iteration).
- colsample_bynode: Sampling fraction of the data on node splits.
- η: Learning rate; when set to 1 there is no shrinking of updates.

See https://xgboost.readthedocs.io/en/stable/tutorials/rf.html for more details.
Examples

using XGBoost: regression, randomforest

xgboost((X, y); num_round=1, regression()..., randomforest()...)