Title: | Make Tidy Bins |
---|---|
Description: | Multiple ways to bin numeric columns with a tidy output. Wraps a variety of existing binning methods into one function, and includes a new method for binning by equal value, which is useful for sales data. Provides a function to automatically summarize the properties of the binned columns. |
Authors: | Harrison Tietze [aut, cre] |
Maintainer: | Harrison Tietze <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.1.1 |
Built: | 2024-11-09 03:09:02 UTC |
Source: | https://github.com/harrison4192/tidybins |
Wraps KMeans_rcpp to create a column that is a cluster formed from select columns in the data frame.
Clusters are named with capital letters.
add_clusters(.data, ..., n_clusters = 4, cluster_name = "cluster")
.data | dataframe |
... | columns to cluster (tidyselect) |
n_clusters | integer |
cluster_name | column name |
data frame
iris %>%
  tibble::as_tibble() %>%
  add_clusters(Sepal.Width, Sepal.Length, n_clusters = 3, cluster_name = "Sepal_Cluster") -> iris1

iris1

iris1 %>%
  numeric_summary(original_col = Sepal.Width, bucket_col = Sepal_Cluster)
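The idea behind add_clusters can be sketched in base R. This is only an illustration under stated assumptions: it uses stats::kmeans rather than KMeans_rcpp, and the object name iris_clustered is made up for the example, so the cluster assignments will not necessarily match the package output.

```r
# Sketch only: stats::kmeans stands in for KMeans_rcpp (assumption).
set.seed(1)  # k-means starts from random centers

cols <- c("Sepal.Width", "Sepal.Length")
km <- stats::kmeans(iris[, cols], centers = 3)

# Name clusters with capital letters: 1 -> "A", 2 -> "B", 3 -> "C".
iris_clustered <- iris
iris_clustered$Sepal_Cluster <- LETTERS[km$cluster]

table(iris_clustered$Sepal_Cluster)
```

The letter assigned to a given group depends on the random start, so only the grouping itself is meaningful, not which group is called "A".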
Make bins in a tidy fashion. Adds a column to your data frame containing the integer codes of the specified bins of a certain column. Specifying multiple columns is only intended for supervised binning, so that multiple columns can be binned simultaneously and optimally with respect to a target variable.
bin_cols(
  .data,
  col,
  n_bins = 10,
  bin_type = "frequency",
  ...,
  target = NULL,
  pretty_labels = FALSE,
  seed = 1,
  method = "mdlp"
)
.data | a data frame |
col | a column, vector of columns, or tidyselect |
n_bins | number of bins |
bin_type | method to make bins |
... | params to be passed to selected binning method |
target | unquoted column for supervised binning |
pretty_labels | logical. If TRUE, returns interval labels rather than integer ranks |
seed | seed for stochastic binning (xgboost) |
method | method for mdlp binning |
Description of the arguments for bin_type:
"frequency": creates bins of equal content via quantiles. Wraps bin with method "content". Similar to ntile.
"width": creates bins of equal numeric width. Wraps bin with method "length".
"kmeans": creates bins using 1-dimensional kmeans. Wraps bin with method "clusters".
"value": each bin has an equal sum of values.
"xgboost": column is binned by best predictor of a target column using step_discretize_xgb.
If the col does not have enough distinct values, xgboost will fail and automatically revert to step_discretize_cart.
Weight of evidence: column is binned by weight of evidence. Requires a binary target.
Logistic regression: column is binned by logistic regression. Requires a binary target.
Supervised discretization: uses the discretizeDF.supervised algorithm with a variety of methods.
a data frame
iris %>%
  bin_cols(Sepal.Width, n_bins = 5, pretty_labels = TRUE) %>%
  bin_cols(Petal.Width, n_bins = 3, bin_type = c("width", "kmeans")) %>%
  bin_cols(Sepal.Width, bin_type = "xgboost", target = Species, seed = 1) -> iris1

#binned columns are named by original name + method abbreviation + number bins created.
#Sometimes the actual number of bins is less than n_bins if the col lacks enough variance.
iris1 %>%
  print(width = Inf)

iris1 %>%
  bin_summary() %>%
  print(width = Inf)
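The difference between the "frequency" and "width" bin types can be illustrated with a short base R sketch. It uses cut and quantile directly rather than the wrapped bin function, so it shows the concept only, and the variable names are assumptions made for the example.

```r
# Sketch only: base R stand-ins for equal-width and equal-frequency binning.
x <- iris$Sepal.Width
n_bins <- 5

# Equal width: split the range of x into n_bins intervals of the same length.
width_bin <- cut(x, breaks = n_bins, labels = FALSE)

# Equal frequency: cut at quantiles so each bin holds roughly the same number of rows.
# unique() drops repeated break points, which can leave fewer than n_bins bins.
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
freq_bin <- cut(x, breaks = breaks, labels = FALSE, include.lowest = TRUE)

table(width_bin)  # counts vary a lot between bins
table(freq_bin)   # counts are closer to each other
```

Tied values are why a column with few distinct values can end up with fewer bins than requested.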
Bins a numeric column such that each bin contains an equal share of the column's total value (about 10% per bin with the default n_bins = 10). Intended for positive numeric vectors that make sense to sum, such as sales. Negative values and NAs are treated as 0. The function never puts two rows with the same value into different bins. Accessed by the "value" method of the bin_cols function.
bin_equal_value(mdb, col, n_bins = 10)
mdb | dataframe |
col | a numeric vector |
n_bins | number of bins |
an integer vector
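One way to picture equal-value binning is the following base R sketch. The helper equal_value_bins is hypothetical and is not the package's implementation; it assumes bins are assigned per distinct value according to the cumulative share of the total, which keeps equal values in the same bin.

```r
# Hypothetical helper illustrating equal-value binning (not the tidybins code).
equal_value_bins <- function(x, n_bins = 10) {
  v <- ifelse(is.na(x) | x < 0, 0, x)        # negatives and NAs count as 0
  u <- sort(unique(v))                       # distinct values, ascending
  idx <- match(v, u)                         # position of each row's value in u
  totals <- u * tabulate(idx, nbins = length(u))  # total contributed by each distinct value
  if (sum(totals) == 0) return(rep(1L, length(x)))
  share <- cumsum(totals) / sum(totals)      # cumulative share of the grand total
  bin_of_value <- pmin(n_bins, pmax(1, ceiling(share * n_bins)))
  as.integer(bin_of_value[idx])
}

sales <- c(1, 1, 2, 2, 4, 10, NA, -3)
bins <- equal_value_bins(sales, n_bins = 2)

# Sum of sales per bin, treating NA and negatives as 0.
tapply(ifelse(is.na(sales) | sales < 0, 0, sales), bins, sum)
```

In this small example the five smallest positive sales together carry as much value as the single largest one, so each of the two bins sums to 10 even though the row counts are very different.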
Returns a summary of all bins created by 'bin_cols' in a data frame. Requires only the data frame, but relies on regular expressions based on the 'bin_cols' output to identify the corresponding columns.
bin_summary(mdb, ...)
mdb | dataframe output from bin_cols |
... | optional tidyselect specification for specific cols |
a tibble
iris %>% bin_cols(Sepal.Width) %>% bin_summary()
Drops the original column from the dataframe once bins are made. Throws an error if the same column has multiple bin cols.
drop_original_cols(.data, ..., restore_names = FALSE)
.data | dataframe output from bin_cols |
... | tidyselect. default chooses all cols created from binning |
restore_names | Logical, default FALSE. rename the binned cols with the original column names? |
dataframe
iris %>%
  bin_cols(Sepal.Length) %>%
  bin_cols(Sepal.Width, pretty_labels = TRUE) -> iris1

iris1

iris1 %>%
  drop_original_cols(restore_names = TRUE)

iris1 %>%
  drop_original_cols(restore_names = FALSE)
The five number summary of a numeric vector you would get from 'summary' but returned with a tidy output.
five_number_summary(x)
x | a numeric vector |
a tibble
iris$Petal.Width %>% five_number_summary()
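For comparison, a tidy one-row five number summary could be assembled in base R roughly as follows. The helper tidy_five_number and its column names are assumptions for illustration and may differ from what five_number_summary actually returns.

```r
# Hypothetical helper: one-row data frame of the five number summary.
tidy_five_number <- function(x) {
  # Default quantile type, the same one summary() uses for quartiles.
  q <- stats::quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
  data.frame(min = q[1], q1 = q[2], median = q[3], q3 = q[4], max = q[5])
}

petal_summary <- tidy_five_number(iris$Petal.Width)
petal_summary
```

Returning a data frame instead of the named vector from summary makes the result easy to bind row-wise across several columns or groups.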
This function summarizes an arbitrary bin column, with respect to its original column. Can be used to summarize bins created from any package, or any arbitrary categorical column paired with a numeric column.
numeric_summary(mdb, original_col, bucket_col)
mdb | a data frame |
original_col | original numeric column |
bucket_col | column of bins |
a tibble
iris %>% numeric_summary(original_col = Sepal.Length, bucket_col = Species)
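The kind of per-bin summary described here can be approximated in base R with aggregate. This is a sketch of the idea only: the chosen statistics (count, min, mean, max) and the name bin_stats are assumptions, and the actual columns returned by numeric_summary may differ.

```r
# Sketch only: summarize a numeric column within each level of a grouping column.
bin_stats <- aggregate(
  Sepal.Length ~ Species,
  data = iris,
  FUN = function(v) c(n = length(v), min = min(v), mean = mean(v), max = max(v))
)

# aggregate() stores the statistics as a matrix column; flatten it into plain columns.
bin_stats <- do.call(data.frame, bin_stats)

bin_stats
```

Because the grouping column only needs to be categorical, the same pattern applies to bins produced by any package, which is the use case the description mentions.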