Data processing
Functions to preprocess data.
- bartz.prepcovars.quantilized_splits_from_matrix(X, max_bins)
Determine bins that make the distribution of each predictor uniform.
- Parameters:
X (array (p, n)) – A matrix with p predictors and n observations.
max_bins (int) – The maximum number of bins to produce.
- Returns:
splits (array (p, m)) – A matrix containing, for each predictor, the boundaries between bins. m is min(max_bins, n) - 1, which is an upper bound on the number of splits. Each predictor may have a different number of splits; unused values at the end of each row are filled with the maximum value representable in the type of X.
max_split (array (p,)) – The number of actually used values in each row of splits.
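The padding convention and the shape bound m = min(max_bins, n) - 1 can be illustrated for a single predictor with a dependency-free sketch. This is not the library's implementation: quantile_splits_row is a hypothetical helper, the midpoint rule for placing boundaries is an assumption made for illustration, and sys.float_info.max stands in for "the maximum value representable in the type of X".

```python
import sys

def quantile_splits_row(row, max_bins):
    """Hypothetical helper: quantile-style bin boundaries for one predictor."""
    values = sorted(set(row))            # distinct observed values
    m = min(max_bins, len(row)) - 1      # padded row length, as documented
    if len(values) <= max_bins:
        # Few distinct values: one boundary between each adjacent pair.
        splits = [(a + b) / 2 for a, b in zip(values, values[1:])]
    else:
        # Many distinct values: boundaries at evenly spaced ranks
        # (assumed rule, chosen so that bins hold similar counts).
        splits = []
        for k in range(1, max_bins):
            j = k * len(values) // max_bins
            splits.append((values[j - 1] + values[j]) / 2)
    max_split = len(splits)              # number of actually used entries
    pad = sys.float_info.max             # stand-in for the type's maximum
    return splits + [pad] * (m - max_split), max_split

# Three distinct values, max_bins = 4: two real splits, one padding slot.
splits, max_split = quantile_splits_row([3, 1, 2, 2, 1, 3, 3, 3], 4)
```

Here max_split is 2 and the third entry of splits is the padding value, so downstream code can tell used boundaries from unused ones.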
- bartz.prepcovars.uniform_splits_from_matrix(X, num_bins)
Make an evenly spaced binning grid.
- Parameters:
X (array (p, n)) – A matrix with p predictors and n observations.
num_bins (int) – The number of bins to produce.
- Returns:
splits (array (p, num_bins - 1)) – A matrix containing, for each predictor, the boundaries between bins. The excluded endpoints are the minimum and maximum value in each row of X.
max_split (array (p,)) – The number of cutpoints in each row of splits, i.e., num_bins - 1.
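As a rough illustration of the grid described above, the following plain-Python sketch computes the num_bins - 1 interior cutpoints for each row, leaving out the minimum and maximum. The helper uniform_splits_row is hypothetical and only mirrors the documented output shape; the library itself operates on arrays.

```python
def uniform_splits_row(row, num_bins):
    """Hypothetical helper: evenly spaced interior cutpoints for one predictor."""
    lo, hi = min(row), max(row)
    step = (hi - lo) / num_bins
    # k = 0 (the minimum) and k = num_bins (the maximum) are excluded.
    return [lo + k * step for k in range(1, num_bins)]

# Two predictors (rows), three observations each.
X = [[0.0, 4.0, 1.0], [10.0, 20.0, 12.0]]
splits = [uniform_splits_row(row, 4) for row in X]
max_split = [len(s) for s in splits]   # always num_bins - 1 per row
```

With num_bins = 4, every row gets exactly three cutpoints, so no padding is needed in this case.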
- bartz.prepcovars.bin_predictors(X, splits, **kw)
Bin the predictors according to the given splits.
A value x is mapped to bin i iff splits[i - 1] < x <= splits[i].
- Parameters:
X (array (p, n)) – A matrix with p predictors and n observations.
splits (array (p, m)) – A matrix containing, for each predictor, the boundaries between bins. m is the maximum number of splits; each row may have shorter actual length, marked by padding unused locations at the end of the row with the maximum value allowed by the type.
**kw (dict) – Additional arguments are passed to jax.numpy.searchsorted.
- Returns:
X_binned (int array (p, n)) – A matrix with p predictors and n observations, where each predictor has been replaced by the index of the bin it falls into.
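The binning rule splits[i - 1] < x <= splits[i] is the same as a left-sided sorted search, which is what the standard library's bisect_left computes. The sketch below uses it as a stand-in for jax.numpy.searchsorted on one row at a time. The helper bin_row and the sample data are hypothetical, and sys.float_info.max plays the role of the padding value.

```python
import sys
from bisect import bisect_left

def bin_row(row, splits):
    """Hypothetical helper: map each x to i with splits[i - 1] < x <= splits[i]."""
    return [bisect_left(splits, x) for x in row]

PAD = sys.float_info.max               # stand-in for the type's maximum

# Two real boundaries followed by one padded, unused slot.
row_splits = [1.5, 2.5, PAD]
row = [1.0, 1.5, 2.0, 3.0, 100.0]
binned = bin_row(row, row_splits)      # [0, 0, 1, 2, 2]

# Applied to every predictor of a (p, n) matrix:
X = [row, [0.0, 5.0, 2.0, 2.5, 1.6]]
all_splits = [row_splits, [2.0, PAD, PAD]]
X_binned = [bin_row(r, s) for r, s in zip(X, all_splits)]
```

Because the padding is at least as large as any value in the row, observations above the last real boundary land in bin max_split and never in a bin created by the unused slots. A value equal to a boundary falls into the lower bin, as 1.5 does here.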