Interface

Implements the class gbart, which mimics the gbart function of the R BART package.

class bartz.BART.DataFrame(*args, **kwargs)[source]

DataFrame duck-type for gbart.

Variables:

columns (Sequence[str]) – The names of the columns.

to_numpy()[source]

Convert the dataframe to a 2d numpy array with columns on the second axis.

Return type:

ndarray

class bartz.BART.Series(*args, **kwargs)[source]

Series duck-type for gbart.

Variables:

name (str | None) – The name of the series.

to_numpy()[source]

Convert the series to a 1d numpy array.

Return type:

ndarray
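
As a rough illustration of the two duck types above, the following sketch defines minimal stand-ins that expose only the documented members (columns and to_numpy for the dataframe, name and to_numpy for the series). The class names ToyDataFrame and ToySeries are hypothetical and not part of bartz; in practice one would more likely pass a dataframe from an existing library, and whether gbart accepts arbitrary objects of this shape is an assumption based on the "duck-type" wording.

```python
import numpy as np


class ToyDataFrame:
    """Hypothetical stand-in exposing `columns` and `to_numpy()`."""

    def __init__(self, data):
        # data: mapping from column name to a sequence of values
        self._data = {k: np.asarray(v, dtype=float) for k, v in data.items()}
        self.columns = list(self._data)

    def to_numpy(self):
        # 2d array with columns on the second axis: shape (n, p)
        return np.column_stack([self._data[c] for c in self.columns])


class ToySeries:
    """Hypothetical stand-in exposing `name` and `to_numpy()`."""

    def __init__(self, values, name=None):
        self._values = np.asarray(values, dtype=float)
        self.name = name

    def to_numpy(self):
        # 1d array: shape (n,)
        return self._values


frame = ToyDataFrame({"age": [31, 45, 52], "dose": [0.1, 0.4, 0.2]})
series = ToySeries([1.2, 3.4, 2.8], name="response")
print(frame.columns, frame.to_numpy().shape, series.to_numpy().shape)
```

Note that the dataframe layout has one predictor per column, whereas plain arrays passed to gbart have one predictor per row.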

class bartz.BART.gbart(x_train, y_train, *, x_test=None, type='wbart', xinfo=None, usequants=False, rm_const=True, sigest=None, sigdf=3.0, sigquant=0.9, k=2.0, power=2.0, base=0.95, lamda=None, tau_num=None, offset=None, w=None, ntree=None, numcut=100, ndpost=1000, nskip=100, keepevery=None, printevery=100, seed=0, maxdepth=6, init_kw=None, run_mcmc_kw=None)[source]

Nonparametric regression with Bayesian Additive Regression Trees (BART).

Regress y_train on x_train with a latent mean function represented as a sum of decision trees. The inference is carried out by sampling the posterior distribution of the tree ensemble with an MCMC.

Parameters:
  • x_train (Real[Array, 'p n'] | DataFrame) – The training predictors.

  • y_train (Bool[Array, 'n'] | Float32[Array, 'n'] | Series) – The training responses.

  • x_test (Real[Array, 'p m'] | DataFrame | None, default: None) – The test predictors.

  • type (Literal['wbart', 'pbart'], default: 'wbart') – The type of regression. ‘wbart’ for continuous regression, ‘pbart’ for binary regression with probit link.

  • xinfo (Float[Array, 'p n'] | None, default: None) –

    A matrix with the cutpoints to use to bin each predictor. If not specified, it is generated automatically according to usequants and numcut.

    Each row shall contain a sorted list of cutpoints for a predictor. If a predictor has fewer cutpoints than the number of columns in the matrix, fill the remaining cells with NaN.

    xinfo shall be a matrix even if x_train is a dataframe.

  • usequants (bool, default: False) – Whether to use predictor quantiles instead of a uniform grid to bin the predictors. Ignored if xinfo is specified.

  • rm_const (bool | None, default: True) – How to treat predictors with no associated decision rules (i.e., there are no available cutpoints for that predictor). If True (default), they are ignored. If False, an error is raised if there are any. If None, no check is performed, and the output of the MCMC may not make sense if there are predictors without cutpoints. The option None is provided only to allow jax tracing.

  • sigest (float | Float[Any, ''] | None, default: None) – An estimate of the residual standard deviation on y_train, used to set lamda. If not specified, it is estimated by linear regression (with intercept, and without taking into account w). If y_train has fewer than two elements, it is set to 1. If n <= p, it is set to the standard deviation of y_train. Ignored if lamda is specified.

  • sigdf (float | Float[Any, ''], default: 3.0) – The degrees of freedom of the scaled inverse-chisquared prior on the noise variance.

  • sigquant (float | Float[Any, ''], default: 0.9) – The quantile of the prior on the noise variance that shall match sigest to set the scale of the prior. Ignored if lamda is specified.

  • k (float | Float[Any, ''], default: 2.0) – The inverse scale of the prior standard deviation on the latent mean function, relative to half the observed range of y_train. If y_train has fewer than two elements, k is ignored and the scale is set to 1.

  • power (float | Float[Any, ''], default: 2.0)

  • base (float | Float[Any, ''], default: 0.95) – Together with power, the parameters of the prior on tree node generation. The probability that a node at depth d (0-based) is non-terminal is base / (1 + d) ** power.

  • lamda (float | Float[Any, ''] | None, default: None) – The prior harmonic mean of the error variance. (The harmonic mean of x is 1/mean(1/x).) If not specified, it is set based on sigest and sigquant.

  • tau_num (float | Float[Any, ''] | None, default: None) – The numerator in the expression that determines the prior standard deviation of leaves. If not specified, it defaults to (max(y_train) - min(y_train)) / 2 (or 1 if y_train has fewer than two elements) for continuous regression, and 3 for binary regression.

  • offset (float | Float[Any, ''] | None, default: None) – The prior mean of the latent mean function. If not specified, it is set to the mean of y_train for continuous regression, and to Phi^-1(mean(y_train)) for binary regression. If y_train is empty, offset is set to 0. With binary regression, if y_train is all False or True, it is set to Phi^-1(1/(n+1)) or Phi^-1(n/(n+1)), respectively.

  • w (Float[Array, 'n'] | None, default: None) – Coefficients that rescale the error standard deviation on each datapoint. Not specifying w is equivalent to setting it to 1 for all datapoints. Note: w is ignored in the automatic determination of sigest, so either the weights should be O(1), or sigest should be specified by the user.

  • ntree (int | None, default: None) – The number of trees used to represent the latent mean function. By default 200 for continuous regression and 50 for binary regression.

  • numcut (int, default: 100) –

    If usequants is False: the exact number of cutpoints used to bin the predictors, ranging between the minimum and maximum observed values (excluded).

    If usequants is True: the maximum number of cutpoints to use for binning the predictors. Each predictor is binned such that its distribution in x_train is approximately uniform across bins. The number of bins is at most the number of unique values appearing in x_train, and never more than numcut + 1.

    Before running the algorithm, the predictors are compressed to the smallest integer type that fits the bin indices, so numcut is best set to the maximum value of an unsigned integer type, like 255.

    Ignored if xinfo is specified.

  • ndpost (int, default: 1000) – The number of MCMC samples to save, after burn-in.

  • nskip (int, default: 100) – The number of initial MCMC samples to discard as burn-in.

  • keepevery (int | None, default: None) – The thinning factor for the MCMC samples, after burn-in. By default, 1 for continuous regression and 10 for binary regression.

  • printevery (int | None, default: 100) –

    The number of iterations (including thinned-away ones) between each log line. Set to None to disable logging.

    printevery has a few unexpected side effects. On CPU, interrupting with ^C halts the MCMC only at the next log line. Also, the total number of iterations is a multiple of printevery, so if nskip + keepevery * ndpost is not a multiple of printevery, some of the last iterations will not be saved.

  • seed (int | Key[Array, ''], default: 0) – The seed for the random number generator.

  • maxdepth (int, default: 6) – The maximum depth of the trees. This is 1-based, so with the default maxdepth=6, the depths of the levels range from 0 to 5.

  • init_kw (dict | None, default: None) – Additional arguments passed to bartz.mcmcstep.init.

  • run_mcmc_kw (dict | None, default: None) – Additional arguments passed to bartz.mcmcloop.run_mcmc.
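
To make the roles of base, power and maxdepth more concrete, the sketch below evaluates the prior formula base / (1 + d) ** power from the parameter list at every depth level allowed by the default maxdepth=6. The function name p_nonterminal is hypothetical and not part of bartz; the code only reproduces the arithmetic of the stated formula and does not model how the depth cap is enforced.

```python
def p_nonterminal(d, base=0.95, power=2.0):
    """Prior probability that a node at 0-based depth d is non-terminal."""
    return base / (1 + d) ** power


maxdepth = 6  # 1-based, so depth levels run from 0 to maxdepth - 1

# Evaluate the formula at each depth level. With a depth cap, nodes on the
# last level cannot split regardless of this value; that is not modeled here.
probs = [p_nonterminal(d) for d in range(maxdepth)]

for d, prob in enumerate(probs):
    print(f"depth {d}: {prob:.4f}")
```

With the defaults, the probability falls from 0.95 at the root to 0.2375 at depth 1, so the prior strongly favors shallow trees.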

Variables:
  • offset (Float32[Array, '']) – The prior mean of the latent mean function.

  • sigest (Float32[Array, ''] | None) – The estimated standard deviation of the error used to set lamda.

  • sigma (Float32[Array, 'nskip+ndpost'] | None) – The standard deviation of the error, including burn-in samples.

  • yhat_test (Float32[Array, 'ndpost m'] | None) – The conditional posterior mean at x_test for each MCMC iteration.
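
The default rule for offset given in the parameter list can be written out as a small function. This is a sketch of the documented rule only, not the library's code: the function name default_offset and its binary flag are hypothetical, and the inverse standard normal CDF is taken from the Python standard library.

```python
from statistics import NormalDist


def default_offset(y_train, binary):
    """Sketch of the documented default for `offset` (hypothetical helper)."""
    n = len(y_train)
    if n == 0:
        return 0.0  # empty y_train: offset is 0
    mean = sum(y_train) / n
    if not binary:
        return mean  # continuous regression: mean of y_train
    if mean == 0:  # all False
        mean = 1 / (n + 1)
    elif mean == 1:  # all True
        mean = n / (n + 1)
    # binary regression: Phi^-1 of the (adjusted) mean
    return NormalDist().inv_cdf(mean)


print(default_offset([1.0, 3.0], binary=False))
print(default_offset([True, True, True], binary=True))
print(default_offset([False, False, False], binary=True))
```

The adjustment for all-False or all-True responses keeps the argument of Phi^-1 strictly between 0 and 1, where the inverse CDF is finite.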

Notes

This interface imitates the function gbart from the R package BART, but with these differences:

  • If x_train and x_test are matrices, they have one predictor per row instead of per column.

  • If usequants=False, R BART switches to quantiles anyway if there are fewer predictor values than the required number of bins, while bartz always follows the specification.

  • The error variance parameter is called lamda instead of lambda.

  • Some functionality is missing (e.g., variable selection).

  • There are some additional attributes, and some missing.

  • The trees have a maximum depth.

  • rm_const refers to predictors without decision rules instead of predictors that are constant in x_train.
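
The first difference in the list above is the easiest one to trip over when porting R code. The sketch below, which uses only NumPy and synthetic data, shows the transposition needed to go from the R convention (one predictor per column) to the array layout expected here (one predictor per row). The variable names are illustrative, and the commented-out call is shown only to indicate where the arrays would go.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 2  # n datapoints, p predictors

# R-style design matrix: one row per datapoint, one column per predictor.
x_r_style = rng.normal(size=(n, p))
y_train = rng.normal(size=n)

# Array layout documented for gbart: shape (p, n), one predictor per row.
x_train = x_r_style.T

print(x_r_style.shape, "->", x_train.shape)

# The arrays would then be passed along these lines (not executed here):
# fit = bartz.BART.gbart(x_train, y_train)
```

Dataframe inputs are not affected, since they already carry one predictor per column by construction.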

property prob_test: Float32[Array, 'ndpost m'] | None[source]

The posterior probability of y being True at x_test for each MCMC iteration.

property prob_test_mean: Float32[Array, 'm'] | None[source]

The marginal posterior probability of y being True at x_test.

property prob_train: Float32[Array, 'ndpost n'] | None[source]

The posterior probability of y being True at x_train for each MCMC iteration.

property prob_train_mean: Float32[Array, 'n'] | None[source]

The marginal posterior probability of y being True at x_train.
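
The relationship between the per-iteration probabilities and their marginal means can be illustrated with a toy array. The sketch assumes that, under the probit link, each probability is the standard normal CDF applied to the corresponding latent mean draw; this is an assumption about how the properties relate, not a statement about the library's internals, and the numbers are made up.

```python
import numpy as np
from statistics import NormalDist

# Standard normal CDF, vectorized over arrays (assumed probit link).
phi = np.vectorize(NormalDist().cdf)

# Toy latent mean draws with shape (ndpost, n) = (2, 2).
latent = np.array([[0.0, 2.0],
                   [0.0, 0.0]])

prob_train = phi(latent)                 # per-iteration probabilities
prob_train_mean = prob_train.mean(axis=0)  # average over MCMC iterations

print(prob_train_mean)
```

Averaging is done on the probability scale, over the first (iteration) axis, which gives one value per datapoint.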

property sigma_mean: Float32[Array, ''] | None[source]

The mean of sigma, only over the post-burnin samples.
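
Since sigma includes the burn-in samples while sigma_mean does not, the two are related by a slice. A minimal sketch with made-up numbers, assuming the burn-in samples come first in the array:

```python
import numpy as np

nskip, ndpost = 3, 4

# Toy trace of length nskip + ndpost; the first nskip entries are burn-in.
sigma = np.array([5.0, 4.0, 3.0, 1.0, 1.2, 0.8, 1.0])

# Average only over the post-burn-in part of the trace.
sigma_mean = sigma[nskip:].mean()

print(sigma_mean)
```

Averaging the whole trace instead would pull the estimate toward the early, unconverged values.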

property varcount: Int32[Array, 'ndpost p'][source]

Histogram of predictor usage for decision rules in the trees.

property varcount_mean: Float32[Array, 'p'][source]

Average of varcount across MCMC iterations.
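
The following toy example shows what a per-iteration usage histogram and its average look like. The input representation (a list of predictor indices, one per decision rule, for each MCMC iteration) is hypothetical and chosen only for illustration; it is not how bartz stores trees.

```python
import numpy as np

p = 3  # number of predictors

# Hypothetical input: for each MCMC iteration, the predictor index used by
# each decision rule across all trees.
rules_per_iteration = [
    [0, 0, 2],
    [1, 2, 2, 2],
]

# One histogram row per iteration: shape (ndpost, p).
varcount = np.stack(
    [np.bincount(rules, minlength=p) for rules in rules_per_iteration]
)

# Average usage of each predictor across iterations: shape (p,).
varcount_mean = varcount.mean(axis=0)

print(varcount)
print(varcount_mean)
```

A predictor with a consistently low count is rarely used for splitting, which is a common informal way to read this kind of summary.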

property yhat_test_mean: Float32[Array, 'm'] | None[source]

The marginal posterior mean at x_test.

Not defined with binary regression because it is error-prone; typically the right quantity to consider is prob_test_mean.

property yhat_train: Float32[Array, 'ndpost n'][source]

The conditional posterior mean at x_train for each MCMC iteration.

property yhat_train_mean: Float32[Array, 'n'] | None[source]

The marginal posterior mean at x_train.

Not defined with binary regression because it is error-prone; typically the right quantity to consider is prob_train_mean.

predict(x_test)[source]

Compute the posterior mean at x_test for each MCMC iteration.

Parameters:

x_test (Real[Array, 'p m'] | DataFrame) – The test predictors.

Returns:

Float32[Array, 'ndpost m'] – The conditional posterior mean at x_test for each MCMC iteration.

Raises:

ValueError – If x_test has a different format than x_train.