| Title: | Auto Stats |
|---|---|
| Description: | Automatically do statistical exploration. Create formulas using 'tidyselect' syntax, and then determine cross-validated model accuracy and variable contributions using 'glm' and 'xgboost'. Contains additional helper functions to create and modify formulas. Has a flagship function to quickly determine relationships between categorical and continuous variables in the data set. |
| Authors: | Harrison Tietze [aut, cre] |
| Maintainer: | Harrison Tietze <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.4.1 |
| Built: | 2025-03-05 03:27:09 UTC |
| Source: | https://github.com/harrison4192/autostats |
A wrapper around lm and anova to run a regression of a continuous variable against categorical variables. Used to determine whether the mean of a continuous variable differs significantly across the levels of a categorical variable.
auto_anova( data, ..., baseline = c("mean", "median", "first_level", "user_supplied"), user_supplied_baseline = NULL, sparse = FALSE, pval_thresh = 0.1 )
| Argument | Description |
|---|---|
| data | a data frame |
| ... | tidyselect specification or cols |
| baseline | choose from "mean", "median", "first_level", "user_supplied". What is the baseline to compare each category to? Can use the mean or median of the target variable as a global baseline |
| user_supplied_baseline | if baseline is "user_supplied", a numeric value can be entered |
| sparse | default FALSE; if TRUE, returns a truncated output with only significant results |
| pval_thresh | controls the significance level for sparse output filtering |
Columns can be input as unquoted names or a tidyselect specification. Continuous and categorical variables are automatically determined. If no character or factor column is present, the column with the lowest number of unique values is treated as the categorical variable.
Description of columns in the output:

- continuous variables
- categorical variables
- levels in the categorical variables
- difference between level target mean and baseline
- target mean per level
- rows in predictor level
- standard error of target in predictor level
- p.value for t.test of whether the target mean differs significantly between level and baseline
- level p.value represented by stars
- p.value for significance of the entire predictor given by F test
- predictor p.value represented by stars
- text interpretation of tests
data frame
iris %>% auto_anova(tidyselect::everything()) -> iris_anova1

iris_anova1 %>% print(width = Inf)
Wraps geom_boxplot to simplify creating boxplots.
auto_boxplot( .data, continuous_outcome, categorical_variable, categorical_facets = NULL, alpha = 0.3, width = 0.15, color_dots = "black", color_box = "red" )
| Argument | Description |
|---|---|
| .data | data |
| continuous_outcome | continuous y variable; unquoted column name |
| categorical_variable | categorical x variable; unquoted column name |
| categorical_facets | categorical facet variable; unquoted column name |
| alpha | alpha transparency of the jittered points |
| width | width of the jitter |
| color_dots | dot color |
| color_box | box color |
ggplot
iris %>% auto_boxplot(continuous_outcome = Petal.Width, categorical_variable = Species)
Finds the correlation between numeric variables in a data frame, chosen using tidyselect.
Additional parameters for the correlation test can be specified as in cor.test
auto_cor( .data, ..., use = c("pairwise.complete.obs", "all.obs", "complete.obs", "everything", "na.or.complete"), method = c("pearson", "kendall", "spearman", "xicor"), include_nominals = TRUE, max_levels = 5L, sparse = TRUE, pval_thresh = 0.1 )
| Argument | Description |
|---|---|
| .data | data frame |
| ... | tidyselect cols |
| use | method for dealing with NA values. Default is to remove rows with NA |
| method | correlation method. Default is pearson; also supports xicor |
| include_nominals | logical, default TRUE. Dummify nominal variables? |
| max_levels | maximum number of dummies to be created from nominal variables |
| sparse | logical, default TRUE. Filters and arranges the correlation table |
| pval_thresh | threshold to filter out weak correlations |
includes the asymmetric correlation coefficient xi from xicor
data frame of correlations
iris %>% auto_cor()

# don't use sparse if you're interested in only one target variable
iris %>% auto_cor(sparse = FALSE) %>% dplyr::filter(x == "Petal.Length")
Runs a cross-validated xgboost and a regularized linear regression, and reports accuracy metrics. Automatically determines whether the provided formula is for regression or classification.
auto_model_accuracy(
  data,
  formula,
  ...,
  n_folds = 4,
  as_flextable = TRUE,
  include_linear = FALSE,
  theme = "tron",
  seed = 1,
  mtry = 1,
  trees = 15L,
  min_n = 1L,
  tree_depth = 6L,
  learn_rate = 0.3,
  loss_reduction = 0,
  sample_size = 1,
  stop_iter = 10L,
  counts = FALSE,
  penalty = 0.015,
  mixture = 0.35
)
| Argument | Description |
|---|---|
| data | data frame |
| formula | formula |
| ... | any other params for xgboost |
| n_folds | number of cross validation folds |
| as_flextable | if FALSE, returns a tibble |
| include_linear | if TRUE, includes a regularized linear model |
| theme | make_flextable theme |
| seed | seed |
| mtry | # Randomly Selected Predictors; defaults to .75 (xgboost: colsample_bynode) (type: numeric, range 0 - 1; or type: integer if counts = TRUE) |
| trees | # Trees (xgboost: nrounds) (type: integer, default: 500L) |
| min_n | Minimal Node Size (xgboost: min_child_weight) (type: integer, default: 2L); typical range: 2-10. Keep a small value for highly imbalanced class data where leaf nodes can have smaller size groups; otherwise increase the size to prevent overfitting outliers. |
| tree_depth | Tree Depth (xgboost: max_depth) (type: integer, default: 7L); typical values: 3-10 |
| learn_rate | Learning Rate (xgboost: eta) (type: double, default: 0.05); typical values: 0.01-0.3 |
| loss_reduction | Minimum Loss Reduction (xgboost: gamma) (type: double, default: 1.0); range: 0 to Inf; typical values: 0-20, assuming low-mid tree depth |
| sample_size | Proportion Observations Sampled (xgboost: subsample) (type: double, default: .75); typical values: 0.5-1 |
| stop_iter | # Iterations Before Stopping (xgboost: early_stop) (type: integer, default: 15L); only enabled if a validation set is provided |
| counts | if TRUE, mtry is interpreted as an integer count of columns; if FALSE (the default), as a proportion of columns between 0 and 1 |
| penalty | linear regularization parameter |
| mixture | linear model parameter, combines l1 and l2 regularization |
a table
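For orientation, here is a minimal usage sketch (not part of the package examples), mirroring the get_params example later in this reference: dummies are created first, a formula is built with tidy_formula, and as_flextable = FALSE returns the metrics as a tibble.

iris %>% framecleaner::create_dummies() -> iris_dummies

iris_dummies %>% tidy_formula(target = Petal.Length) -> p_form

# cross-validated regression metrics, returned as a tibble rather than a flextable
iris_dummies %>%
  auto_model_accuracy(p_form, n_folds = 4, as_flextable = FALSE)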
Performs a t.test on 2 populations for numeric variables.
auto_t_test(data, col, ..., var_equal = FALSE, abbrv = TRUE)
| Argument | Description |
|---|---|
| data | dataframe |
| col | a column with 2 categories representing the 2 populations |
| ... | numeric variables to perform the t.test on. Default is to select all numeric variables |
| var_equal | default FALSE; t.test parameter |
| abbrv | default TRUE; remove some extra columns from the output |
dataframe
iris %>% dplyr::filter(Species != "setosa") %>% auto_t_test(col = Species)
Automatically tunes an xgboost model using grid or Bayesian optimization.
auto_tune_xgboost(
  .data,
  formula,
  tune_method = c("grid", "bayes"),
  event_level = c("first", "second"),
  n_fold = 5L,
  n_iter = 100L,
  seed = 1,
  save_output = FALSE,
  parallel = TRUE,
  trees = tune::tune(),
  min_n = tune::tune(),
  mtry = tune::tune(),
  tree_depth = tune::tune(),
  learn_rate = tune::tune(),
  loss_reduction = tune::tune(),
  sample_size = tune::tune(),
  stop_iter = tune::tune(),
  counts = FALSE,
  tree_method = c("auto", "exact", "approx", "hist", "gpu_hist"),
  monotone_constraints = 0L,
  num_parallel_tree = 1L,
  lambda = 1,
  alpha = 0,
  scale_pos_weight = 1,
  verbosity = 0L
)
| Argument | Description |
|---|---|
| .data | dataframe |
| formula | formula |
| tune_method | method of tuning; defaults to grid |
| event_level | for binary classification, which factor level is the positive class. Specify "second" for the second level |
| n_fold | integer; n folds in resamples |
| n_iter | n iterations for tuning (bayes); parameter grid size (grid) |
| seed | seed |
| save_output | default FALSE. If set to TRUE, will write the output as an rds file |
| parallel | default TRUE; if TRUE, enables parallel processing on resamples for grid tuning |
| trees | # Trees (xgboost: nrounds) (type: integer, default: 500L) |
| min_n | Minimal Node Size (xgboost: min_child_weight) (type: integer, default: 2L); typical range: 2-10. Keep a small value for highly imbalanced class data where leaf nodes can have smaller size groups; otherwise increase the size to prevent overfitting outliers. |
| mtry | # Randomly Selected Predictors; defaults to .75 (xgboost: colsample_bynode) (type: numeric, range 0 - 1; or type: integer if counts = TRUE) |
| tree_depth | Tree Depth (xgboost: max_depth) (type: integer, default: 7L); typical values: 3-10 |
| learn_rate | Learning Rate (xgboost: eta) (type: double, default: 0.05); typical values: 0.01-0.3 |
| loss_reduction | Minimum Loss Reduction (xgboost: gamma) (type: double, default: 1.0); range: 0 to Inf; typical values: 0-20, assuming low-mid tree depth |
| sample_size | Proportion Observations Sampled (xgboost: subsample) (type: double, default: .75); typical values: 0.5-1 |
| stop_iter | # Iterations Before Stopping (xgboost: early_stop) (type: integer, default: 15L); only enabled if a validation set is provided |
| counts | if TRUE, mtry is interpreted as an integer count of columns; if FALSE (the default), as a proportion of columns between 0 and 1 |
| tree_method | xgboost tree_method; default is "auto" |
| monotone_constraints | an integer vector with the length of the predictor cols, with entries of -1, 0, or 1 indicating a decreasing constraint, no constraint, or an increasing constraint |
| num_parallel_tree | should be set to the size of the forest being trained; default 1L |
| lambda | [default=.5] L2 regularization term on weights. Increasing this value will make the model more conservative. |
| alpha | [default=.1] L1 regularization term on weights. Increasing this value will make the model more conservative. |
| scale_pos_weight | [default=1] Controls the balance of positive and negative weights; useful for unbalanced classes. If set to TRUE, calculates sum(negative instances) / sum(positive instances). If the first level is the majority class, use values < 1; otherwise values > 1 are normally used to balance the class distribution. |
| verbosity | [default=1] Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug). |
Default is to tune all 7 xgboost parameters. Individual parameter values can be optionally fixed to reduce tuning complexity.
workflow object
iris %>% framecleaner::create_dummies() -> iris1

iris1 %>% tidy_formula(target = Petal.Length) -> petal_form

iris1 %>% rsample::initial_split() -> iris_split

iris_split %>% rsample::analysis() -> iris_train
iris_split %>% rsample::assessment() -> iris_val

## Not run:
iris_train %>%
  auto_tune_xgboost(formula = petal_form, n_iter = 10,
                    parallel = FALSE, tune_method = "grid", mtry = .5) -> xgb_tuned

xgb_tuned %>% parsnip::fit(iris_train) %>% parsnip::extract_fit_engine() -> xgb_tuned_fit

xgb_tuned_fit %>% tidy_predict(newdata = iris_val, form = petal_form) -> iris_val1
## End(Not run)
Returns a variable importance plot and coefficient plot from a linear model. Used to easily visualize the contributions of explanatory variables in a supervised model.
auto_variable_contributions(data, formula, scale = TRUE)
| Argument | Description |
|---|---|
| data | dataframe |
| formula | formula |
| scale | logical. If FALSE, puts coefficients on the original scale |
a ggplot object
iris %>%
  framecleaner::create_dummies() %>%
  auto_variable_contributions(
    tidy_formula(., target = Petal.Width)
  )

iris %>%
  auto_variable_contributions(
    tidy_formula(., target = Species)
  )
Caps the outliers of a numeric vector by percentiles. Also outputs a plot of the capped distribution.
cap_outliers(x, q = 0.05, type = c("both", "upper", "lower"))
| Argument | Description |
|---|---|
| x | numeric vector |
| q | decimal input to the quantile function to set the cap. Default .05 caps at the 5th and 95th percentiles |
| type | chr vector; where to cap: both, upper, or lower |
numeric vector
cap_outliers(iris$Petal.Width)
Helper function to create the integer vector to pass to the monotone_constraints argument in xgboost.
create_monotone_constraints( .data, formula, decreasing = NULL, increasing = NULL )
| Argument | Description |
|---|---|
| .data | dataframe; training data for tidy_xgboost |
| formula | formula used for tidy_xgboost |
| decreasing | character vector or tidyselect regular expression to designate decreasing cols |
| increasing | character vector or tidyselect regular expression to designate increasing cols |
a named integer vector with entries of 0, 1, -1
iris %>% framecleaner::create_dummies(Species) -> iris_dummy

iris_dummy %>% tidy_formula(target = Petal.Length) -> petal_form

iris_dummy %>%
  create_monotone_constraints(petal_form,
                              decreasing = tidyselect::matches("Petal|Species"),
                              increasing = "Sepal.Width")
Automatically evaluates predictions created by tidy_predict. No need to supply column names.
eval_preds(.data, ..., softprob_model = NULL)
| Argument | Description |
|---|---|
| .data | dataframe resulting from tidy_predict |
| ... | additional metrics from yardstick to be calculated |
| softprob_model | character name of the model used to create multiclass probabilities |
tibble of summarized metrics
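As a sketch of the intended workflow (assembled from the tidy_xgboost example later in this reference), predictions produced by tidy_predict can be piped straight into eval_preds:

iris %>% framecleaner::create_dummies(Species) -> iris_dummy

iris_dummy %>% tidy_formula(target = Petal.Length) -> petal_form

# fit, predict, then summarize regression metrics without naming any columns
iris_dummy %>%
  tidy_xgboost(petal_form, trees = 20, mtry = .5) %>%
  tidy_predict(newdata = iris_dummy, form = petal_form) %>%
  eval_preds()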
Takes the lhs and rhs of a formula as character vectors and outputs a formula.
f_charvec_to_formula(lhs, rhs)
| Argument | Description |
|---|---|
| lhs | lhs atomic chr vec |
| rhs | rhs chr vec |
formula
lhs <- "Species" rhs <- c("Petal.Width", "Custom_Var") f_charvec_to_formula(lhs, rhs)
lhs <- "Species" rhs <- c("Petal.Width", "Custom_Var") f_charvec_to_formula(lhs, rhs)
Accepts a formula and returns the rhs as a character vector.
f_formula_to_charvec(f, include_lhs = FALSE, .data = NULL)
| Argument | Description |
|---|---|
| f | formula |
| include_lhs | default FALSE. If TRUE, appends the lhs to the beginning of the vector |
| .data | dataframe to supply the names, if necessary |
chr vector
iris %>% tidy_formula(target = Species, tidyselect::everything()) -> f

f

f %>% f_formula_to_charvec()
Modify components of a formula by adding / removing vars from the rhs or replacing the lhs.
f_modify_formula( f, rhs_remove = NULL, rhs_add = NULL, lhs_replace = NULL, negate = TRUE )
| Argument | Description |
|---|---|
| f | formula |
| rhs_remove | regex or character vector for dropping variables from the rhs |
| rhs_add | character vector for adding variables to the rhs |
| lhs_replace | string to replace the formula lhs, if supplied |
| negate | should the rhs_remove selection be negated? Default TRUE |
formula
iris %>% tidy_formula(target = Species, tidyselect::everything()) -> f

f

f %>%
  f_modify_formula(
    rhs_remove = c("Petal.Width", "Sepal.Length"),
    rhs_add = "Custom_Variable"
  )

f %>%
  f_modify_formula(
    rhs_remove = "Petal",
    lhs_replace = "Petal.Length"
  )
S3 method to extract the parameters of a model, with names consistent for use in the 'autostats' package.
get_params(model, ...)

## S3 method for class 'xgb.Booster'
get_params(model, ...)

## S3 method for class 'workflow'
get_params(model, ...)
| Argument | Description |
|---|---|
| model | a model |
| ... | additional arguments |
list of params
iris %>% framecleaner::create_dummies() -> iris_dummies

iris_dummies %>% tidy_formula(target = Petal.Length) -> p_form

iris_dummies %>%
  tidy_xgboost(p_form, mtry = .5, trees = 5L, loss_reduction = 2, sample_size = .7) -> xgb

## reuse these parameters to find the cross validated error
rlang::exec(auto_model_accuracy, data = iris_dummies, formula = p_form, !!!get_params(xgb))
Imputes missing values of a numeric matrix using stochastic gradient descent, via the recosystem package.
impute_recosystem(
  .data,
  lrate = c(0.05, 0.1),
  costp_l1 = c(0, 0.05),
  costq_l1 = c(0, 0.05),
  costp_l2 = c(0, 0.05),
  costq_l2 = c(0, 0.05),
  nthread = 8,
  loss = "l2",
  niter = 15,
  verbose = FALSE,
  nfold = 4,
  seed = 1
)
| Argument | Description |
|---|---|
| .data | long format data frame |
| lrate | learning rate |
| costp_l1 | l1 cost p |
| costq_l1 | l1 cost q |
| costp_l2 | l2 cost p |
| costq_l2 | l2 cost q |
| nthread | number of threads |
| loss | loss function; default "l2", can also use "l1" |
| niter | training iterations for tuning |
| verbose | show training loss? |
| nfold | folds for tuning validation |
| seed | seed for randomness |
Input is a long data frame with 3 columns: an ID col, an Item col (the column names from pivoting longer), and the ratings (the values from pivoting longer).
Pre-processing generally requires pivoting a wide user x item matrix to long format. The missing values from the matrix must be retained as NA values in the rating column; these values will be predicted and filled in by the algorithm. Output is a long data frame with the same number of rows as the input, but no missing values.
This function automatically tunes the recosystem learner before applying it. Parameter values can be supplied for tuning. To avoid tuning, use single values for the parameters.
long format data frame
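A hypothetical sketch of that pre-processing, assuming tidyr for the pivot; the ratings_wide data and the column names user/item/rating are illustrative only, and single parameter values are used to skip tuning:

# toy wide user x item matrix with missing ratings (illustrative data)
ratings_wide <- data.frame(
  user  = c("u1", "u2", "u3", "u4"),
  item1 = c(5, NA, 3, 4),
  item2 = c(NA, 4, 2, 5),
  item3 = c(1, 2, NA, NA)
)

# pivot to the long 3-column format: ID col, item col, ratings (NAs retained)
ratings_wide %>%
  tidyr::pivot_longer(-user, names_to = "item", values_to = "rating") -> ratings_long

# single parameter values avoid tuning; output has the same rows but no missing ratings
ratings_long %>%
  impute_recosystem(lrate = 0.1, costp_l2 = 0.05, costq_l2 = 0.05, niter = 10) -> ratings_imputed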
Runs a conditional inference forest.
tidy_cforest(data, formula, seed = 1)
| Argument | Description |
|---|---|
| data | dataframe |
| formula | formula |
| seed | seed integer |
a cforest model
iris %>%
  tidy_cforest(
    tidy_formula(., Petal.Width)
  ) -> iris_cfor

iris_cfor

iris_cfor %>% visualize_model()
Tidy conditional inference tree. Creates easily interpretable decision tree models that can be shown with the visualize_model function. The statistical significance required for a split, and the minimum necessary samples in a terminal leaf, can be controlled to create the desired tree visual.
tidy_ctree(.data, formula, minbucket = 7L, mincriterion = 0.95, ...)
| Argument | Description |
|---|---|
| .data | dataframe |
| formula | formula |
| minbucket | minimum number of samples in terminal leaves; default 7 |
| mincriterion | (1 - alpha), a value between 0 and 1; default .95. Lowering this value creates more splits, but they are less significant |
| ... | optional parameters passed to the underlying ctree call |
a ctree object
iris %>% tidy_formula(., Sepal.Length) -> sepal_form

iris %>% tidy_ctree(sepal_form) %>% visualize_model()

iris %>% tidy_ctree(sepal_form, minbucket = 30) %>% visualize_model(plot_type = "box")
Takes a dataframe and allows for use of tidyselect to construct a formula.
tidy_formula(data, target, ...)
| Argument | Description |
|---|---|
| data | dataframe |
| target | lhs of the formula; an unquoted column |
| ... | tidyselect specification for the rhs |
a formula
iris %>% tidy_formula(Species, tidyselect::everything())
Runs either a linear regression, logistic regression, or multinomial classification. The model is automatically determined based on the nature of the target variable.
tidy_glm(data, formula)
| Argument | Description |
|---|---|
| data | dataframe |
| formula | formula |
glm model
# linear regression
iris %>% tidy_glm(tidy_formula(., target = Petal.Width)) -> glm1

glm1

glm1 %>% visualize_model()

# multinomial classification
tidy_formula(iris, target = Species) -> species_form

iris %>% tidy_glm(species_form) -> glm2

glm2 %>% visualize_model()

# logistic regression
iris %>%
  dplyr::filter(Species != "setosa") %>%
  tidy_glm(species_form) -> glm3

suppressWarnings({
  glm3 %>% visualize_model()
})
tidy predict
tidy_predict(model, newdata, form = NULL, olddata = NULL, bind_preds = FALSE, ...)

## S3 method for class 'Rcpp_ENSEMBLE'
tidy_predict(model, newdata, form = NULL, ...)

## S3 method for class 'glm'
tidy_predict(model, newdata, form = NULL, ...)

## Default S3 method:
tidy_predict(model, newdata, form = NULL, ...)

## S3 method for class 'BinaryTree'
tidy_predict(model, newdata, form = NULL, ...)

## S3 method for class 'xgb.Booster'
tidy_predict(model, newdata, form = NULL, olddata = NULL, bind_preds = FALSE, ...)

## S3 method for class 'lgb.Booster'
tidy_predict(model, newdata, form = NULL, olddata = NULL, bind_preds = FALSE, ...)
| Argument | Description |
|---|---|
| model | model |
| newdata | dataframe |
| form | the formula used for the model |
| olddata | training data set |
| bind_preds | set to TRUE if newdata is a dataset without any labels, to bind the new and old data with the predictions under the original target name |
| ... | other parameters to pass to the underlying predict method |
dataframe
iris %>% framecleaner::create_dummies(Species) -> iris_dummy

iris_dummy %>% tidy_formula(target = Petal.Length) -> petal_form

iris_dummy %>%
  tidy_xgboost(
    petal_form,
    trees = 20,
    mtry = .5
  ) -> xg1

xg1 %>%
  tidy_predict(newdata = iris_dummy, form = petal_form) %>%
  head()
Plot and summarize shapley values from an xgboost model.
tidy_shap(model, newdata, form = NULL, ..., top_n = 12, aggregate = NULL)
| Argument | Description |
|---|---|
| model | xgboost model |
| newdata | dataframe similar to the model input |
| form | formula used for the model |
| ... | additional parameters for the shapley value calculation |
| top_n | top n features |
| aggregate | a character vector. Predictors containing the string will be aggregated and renamed to that string |
Returns a list with the following entries:

- a table of shapley values
- a table summarizing the shapley values, including the correlation between shaps and feature values
- one plot showing the relation between shaps and features
- the top 9 most important features, as determined by the sum of absolute shapley values, shown as a faceted scatterplot of feature vs shap
list
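A possible usage sketch, reusing the regression fit from the tidy_xgboost example; here aggregate = "Species" would collapse the Species dummy columns into a single feature in the shapley output:

iris %>% framecleaner::create_dummies(Species) -> iris_dummy

iris_dummy %>% tidy_formula(target = Petal.Length) -> petal_form

iris_dummy %>% tidy_xgboost(petal_form, trees = 20, mtry = .5) -> xg1

# list containing the shapley table, summary table, and plots
xg1 %>%
  tidy_shap(newdata = iris_dummy, form = petal_form, aggregate = "Species") -> shap_list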
Accepts a formula to run an xgboost model. Automatically determines whether the formula is for classification or regression. Returns the xgboost model.
tidy_xgboost(
  .data,
  formula,
  ...,
  mtry = 0.75,
  trees = 500L,
  min_n = 2L,
  tree_depth = 7L,
  learn_rate = 0.05,
  loss_reduction = 1,
  sample_size = 0.75,
  stop_iter = 15L,
  counts = FALSE,
  tree_method = c("auto", "exact", "approx", "hist", "gpu_hist"),
  monotone_constraints = 0L,
  num_parallel_tree = 1L,
  lambda = 0.5,
  alpha = 0.1,
  scale_pos_weight = 1,
  verbosity = 0L,
  validate = TRUE,
  booster = c("gbtree", "gblinear")
)
| Argument | Description |
|---|---|
| .data | dataframe |
| formula | formula |
| ... | additional parameters passed on to the underlying xgboost engine |
| mtry | # Randomly Selected Predictors; defaults to .75 (xgboost: colsample_bynode) (type: numeric, range 0 - 1; or type: integer if counts = TRUE) |
| trees | # Trees (xgboost: nrounds) (type: integer, default: 500L) |
| min_n | Minimal Node Size (xgboost: min_child_weight) (type: integer, default: 2L); typical range: 2-10. Keep a small value for highly imbalanced class data where leaf nodes can have smaller size groups; otherwise increase the size to prevent overfitting outliers. |
| tree_depth | Tree Depth (xgboost: max_depth) (type: integer, default: 7L); typical values: 3-10 |
| learn_rate | Learning Rate (xgboost: eta) (type: double, default: 0.05); typical values: 0.01-0.3 |
| loss_reduction | Minimum Loss Reduction (xgboost: gamma) (type: double, default: 1.0); range: 0 to Inf; typical values: 0-20, assuming low-mid tree depth |
| sample_size | Proportion Observations Sampled (xgboost: subsample) (type: double, default: .75); typical values: 0.5-1 |
| stop_iter | # Iterations Before Stopping (xgboost: early_stop) (type: integer, default: 15L); only enabled if a validation set is provided |
| counts | if TRUE, mtry is interpreted as an integer count of columns; if FALSE (the default), as a proportion of columns between 0 and 1 |
| tree_method | xgboost tree_method; default is "auto" |
| monotone_constraints | an integer vector with the length of the predictor cols, with entries of -1, 0, or 1 indicating a decreasing constraint, no constraint, or an increasing constraint |
| num_parallel_tree | should be set to the size of the forest being trained; default 1L |
| lambda | [default=.5] L2 regularization term on weights. Increasing this value will make the model more conservative. |
| alpha | [default=.1] L1 regularization term on weights. Increasing this value will make the model more conservative. |
| scale_pos_weight | [default=1] Controls the balance of positive and negative weights; useful for unbalanced classes. If set to TRUE, calculates sum(negative instances) / sum(positive instances). If the first level is the majority class, use values < 1; otherwise values > 1 are normally used to balance the class distribution. |
| verbosity | [default=1] Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3 (debug). |
| validate | default TRUE; report accuracy metrics on a validation set |
| booster | defaults to 'gbtree' for tree boosting but can be set to 'gblinear' |
In binary classification the target variable must be a factor with the first level set to the event of interest. A higher probability will predict the first level.
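For instance, a hedged sketch of a binary classification setup in which 'versicolor' is moved to the first factor level and therefore treated as the event of interest; forcats is assumed here only for the releveling:

iris %>%
  dplyr::filter(Species != "setosa") %>%
  dplyr::mutate(
    Species = forcats::fct_drop(Species),                 # drop the unused 'setosa' level
    Species = forcats::fct_relevel(Species, "versicolor") # first level = event of interest
  ) -> iris_binary

iris_binary %>% tidy_formula(target = Species) -> binary_form

# predicted probabilities refer to the first level, 'versicolor'
iris_binary %>% tidy_xgboost(binary_form, trees = 20) -> xg_binary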
reference for parameters: xgboost docs
xgb.Booster model
options(rlang_trace_top_env = rlang::current_env())

# regression on numeric variable
iris %>% framecleaner::create_dummies(Species) -> iris_dummy

iris_dummy %>% tidy_formula(target = Petal.Length) -> petal_form

iris_dummy %>%
  tidy_xgboost(
    petal_form,
    trees = 20,
    mtry = .5
  ) -> xg1

xg1 %>% tidy_predict(newdata = iris_dummy, form = petal_form) -> iris_preds

iris_preds %>% eval_preds()
S3 method to automatically visualize the output of a model object. Additional arguments can be supplied for the original function. Check the corresponding plot function documentation for any custom arguments.
visualize_model(model, ...)

## S3 method for class 'RandomForest'
visualize_model(model, ..., method)

## S3 method for class 'BinaryTree'
visualize_model(model, ..., method)

## S3 method for class 'glm'
visualize_model(model, ..., method)

## S3 method for class 'multinom'
visualize_model(model, ..., method)

## S3 method for class 'xgb.Booster'
visualize_model(
  model,
  top_n = 10L,
  aggregate = NULL,
  as_table = FALSE,
  formula = NULL,
  measure = c("Gain", "Cover", "Frequency"),
  ...,
  method
)

## Default S3 method:
visualize_model(model, ..., method)
| Argument | Description |
|---|---|
| model | a model |
| ... | additional arguments |
| method | choose amongst different visualization methods |
| top_n | return the top n elements |
| aggregate | character vector used to aggregate (summarize) predictors |
| as_table | default FALSE; if TRUE, returns a table instead of a graph |
| formula | formula |
| measure | importance measure: one of "Gain", "Cover", "Frequency" |
a plot
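A short sketch of the generic in use, based on the tidy_glm and tidy_xgboost examples above; for the xgb.Booster method, as_table = TRUE would return the importance table instead of a plot:

# coefficient plot from a glm
iris %>%
  tidy_glm(tidy_formula(., target = Petal.Width)) %>%
  visualize_model()

# variable importance plot from an xgboost fit, restricted to the top features
iris %>% framecleaner::create_dummies(Species) -> iris_dummy

iris_dummy %>% tidy_formula(target = Petal.Length) -> petal_form

iris_dummy %>%
  tidy_xgboost(petal_form, trees = 20, mtry = .5) %>%
  visualize_model(top_n = 5L)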