Title: | Clean Data Frames |
---|---|
Description: | Provides a friendly interface for modifying data frames with a sequence of piped commands built upon the 'tidyverse' Wickham et al., (2019) <doi:10.21105/joss.01686> . The majority of commands wrap 'dplyr' mutate statements in a convenient way to concisely solve common issues that arise when tidying small to medium data sets. Includes smart defaults and allows flexible selection of columns via 'tidyselect'. |
Authors: | Harrison Tietze [aut, cre] |
Maintainer: | Harrison Tietze <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.1 |
Built: | 2024-10-11 04:14:55 UTC |
Source: | https://github.com/harrison4192/framecleaner |
coerce to integer. if too large, coerces to 64-bit integer
as_integer16_or_64(x)
x |
integerish vec |
int or int64
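The need for a 64-bit fallback comes from the range of R's native integer type. The base R sketch below shows the limit involved; fits_in_int32 is a hypothetical helper written for illustration and is not part of framecleaner.

```r
# Hypothetical helper (not a framecleaner function): TRUE when every
# non-missing value fits in R's native 32-bit integer type.
fits_in_int32 <- function(x) {
  x <- as.numeric(x)
  all(is.na(x) | abs(x) <= .Machine$integer.max)
}

.Machine$integer.max                              # largest native integer
small_ok <- fits_in_int32(c(1, 2, 10))            # fits in 32 bits
large_ok <- fits_in_int32(c(1, 30303022192))      # would need 64 bits
```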
Call from a saved R script. Automatically sets your working directory to the directory that you saved the current R script in. Takes no arguments.
auto_setwd()
No return value.
Uses the functions of framecleaner and other operations to apply cleaning operations to a data frame
clean_frame(.data)
.data |
a data frame |
Functions applied in clean_frame:
rename_with(.fn = enc2utf8)
clean_names(case = "all_caps", ascii = FALSE)
data frame
iris %>% clean_frame()
Adapted from the dummy_cols function. Adds the option to truncate the dummy column names and to specify dummy columns using tidyselect.
create_dummies(
  .data,
  ...,
  append_col_name = TRUE,
  max_levels = 10L,
  remove_first_dummy = FALSE,
  remove_most_frequent_dummy = FALSE,
  clean_names = TRUE,
  ignore_na = FALSE,
  split = NULL,
  remove_selected_columns = TRUE
)
.data |
data frame |
... |
tidyselect columns. default selection is all character or factor variables |
append_col_name |
logical, default TRUE. Appends original column name to dummy col name |
max_levels |
uses |
remove_first_dummy |
logical, default FALSE. |
remove_most_frequent_dummy |
logical, default FALSE |
clean_names |
logical, default TRUE. apply |
ignore_na |
logical, default FALSE |
split |
NULL |
remove_selected_columns |
logical, default TRUE |
See the fastDummies package for documentation on the original function.
data frame
iris %>% create_dummies(Species, append_col_name = FALSE) %>% tibble::as_tibble()
create flag
create_flag(.data, col, flag, full_name = FALSE, drop = FALSE)
.data |
data frame |
col |
column |
flag |
column entry |
full_name |
logical, default FALSE. If TRUE, the new column name is the original name + flag; otherwise it is just the flag |
drop |
logical. default F. If T, drop original column. |
data frame
iris %>% create_flag( col = Species, flag = "versicolor", drop = TRUE) %>% head()
creates a semesterly date vector from a date vector
date_yh(x)
x |
a date |
date vector
seq.Date(lubridate::ymd(20200101), lubridate::ymd(20220101), length.out = 10) -> d1
d1 %>% tibble::enframe() %>% dplyr::mutate(YH = date_yh(value))
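One plausible reading of a "semesterly" date is the first day of the half-year that contains each date. The base R sketch below illustrates that reading only; floor_half_year is a hypothetical helper, and the exact output of date_yh may differ.

```r
# Hypothetical helper (an assumption, not the package's implementation):
# maps each date to the first day of its half-year (January 1 or July 1).
floor_half_year <- function(x) {
  lt <- as.POSIXlt(x)
  start_month <- ifelse(lt$mon < 6, 1, 7)  # lt$mon is zero-based (0 = January)
  as.Date(sprintf("%04d-%02d-01", lt$year + 1900, start_month))
}

half_starts <- floor_half_year(as.Date(c("2020-03-15", "2020-09-30")))
half_starts
```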
creates a monthly date vector from a date vector
date_ym(x)
x |
a date |
date vector
seq.Date(lubridate::ymd(20200101), lubridate::ymd(20220101), length.out = 10) -> d1
d1 %>% tibble::enframe() %>% dplyr::mutate(YM = date_ym(value))
creates a quarterly date vector from a date vector
date_yq(x)
x |
a date |
date vector
seq.Date(lubridate::ymd(20200101), lubridate::ymd(20220101), length.out = 10) -> d1
d1 %>% tibble::enframe() %>% dplyr::mutate(YQ = date_yq(value))
Use tidyselect to fill NA values.
The default behavior is to fill all integer or double columns with 0, preserving their types.
fill_na(.data, ..., fill = 0L, missing_type = c("all", "NA", "NaN", "Inf"))
.data |
data frame |
... |
tidyselect specification. Default selection: none |
fill |
value to fill missings |
missing_type |
character vector. Chooses which type of missing value to fill. Default is all types. Choose from "all", "NA", "NaN", "Inf" |
data frame
tibble::tibble(x = c(NA, 1L, 2L, NA, NaN, 5L, Inf)) -> tbl
tbl %>% fill_na()
tbl %>% fill_na(fill = 1L, missing_type = "Inf")
tbl %>% fill_na(missing_type = "NaN")
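The three missing types have to be told apart explicitly because base R's is.na() is also TRUE for NaN. The sketch below shows that distinction in base R on the same vector; it illustrates the idea and is not the package's internal code.

```r
x <- c(NA, 1, 2, NA, NaN, 5, Inf)

is_nan     <- is.nan(x)               # NaN only
is_na_only <- is.na(x) & !is.nan(x)   # NA but not NaN
is_inf     <- is.infinite(x)          # Inf or -Inf

# Fill only the infinite entries with 1, leaving NA and NaN untouched
filled_inf <- replace(x, is_inf, 1)
filled_inf
```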
Filter for all instances of a column that meet a specific condition at least once.
filter_for(.data, what, where)
.data |
data frame |
what |
unquoted column or vector of unquoted columns |
where |
a logical condition used for filter |
data frame
# An example using some time series data
tibble::tibble(
  CLIENT_ID = c("A1001", "B1001", "C1001", "A1001", "B1001", "C1001", "A1001", "B1001", "C1001"),
  YEAR = c(2019L, 2019L, 2019L, 2020L, 2020L, 2020L, 2021L, 2021L, 2021L),
  SALES = c(3124, 56424, 3214132, 65534, 2342, 6566, 87654, 2332, 6565)
) %>% dplyr::arrange(CLIENT_ID, YEAR) -> sales_data
sales_data
# filter for clients that had sales greater than 4000 in the year 2019.
# this way we can see how the same clients' sales looked in subsequent years
sales_data %>% filter_for(what = CLIENT_ID, where = YEAR == 2019 & SALES > 4000L)
# filter for clients whose sales were less than 4000 in the year 2021
sales_data %>% filter_for(what = CLIENT_ID, where = YEAR == 2021 & SALES < 4000L)
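The "meets the condition at least once" logic can be sketched in base R as two steps: collect the identifiers that satisfy the condition in any row, then keep every row belonging to those identifiers. This is an illustration of the idea on the same data, not the package's implementation.

```r
sales <- data.frame(
  CLIENT_ID = rep(c("A1001", "B1001", "C1001"), times = 3),
  YEAR      = rep(2019:2021, each = 3),
  SALES     = c(3124, 56424, 3214132, 65534, 2342, 6566, 87654, 2332, 6565)
)

# Step 1: clients that met the condition in at least one row
hit_ids <- unique(sales$CLIENT_ID[sales$YEAR == 2019 & sales$SALES > 4000])

# Step 2: keep all rows (every year) for those clients
kept <- sales[sales$CLIENT_ID %in% hit_ids, ]
kept
```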
More complex wrapper around dplyr::filter(!is.na()) to remove NA rows using tidyselect. If any specified column contains an NA, the whole row is removed. Reports the number of removed rows that contain NaN, NA, and Inf, tallied in that order.
For example, if one row contains Inf in one column and NA in another, the removed row is counted in the NA tally.
filter_missing(.data, ..., remove_inf = TRUE)
## S3 method for class 'data.frame'
filter_missing(.data, ..., remove_inf = TRUE, condition = c("any", "all"))
.data |
dataframe |
... |
tidyselect. default selection is all columns |
remove_inf |
logical. default is to also remove |
condition |
defaults to "any". in which case removes rows if |
S3 method, can also be used on vectors
data frame
tibble::tibble(x = c(NA, 1L, 2L, NA, NaN, 5L, Inf), y = c(1L, NA, 2L, NA, Inf, 5L, Inf)) -> tbl1
tbl1
# remove any row with a missing or Inf
tbl1 %>% filter_missing()
# remove any row with NA or NaN in the x column
tbl1 %>% filter_missing(x, remove_inf = FALSE)
# only remove rows where every entry is NA, NaN, or Inf
tbl1 %>% filter_missing(condition = "all")
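The ordered tally described above (NaN first, then NA, then Inf) can be sketched in base R by assigning each row to the first category it matches. This is a sketch of the counting rule as documented, not the package's code; the variable names are illustrative.

```r
tbl1 <- data.frame(
  x = c(NA, 1, 2, NA, NaN, 5, Inf),
  y = c(1, NA, 2, NA, Inf, 5, Inf)
)
m <- as.matrix(tbl1)

has_nan <- apply(m, 1, function(r) any(is.nan(r)))
has_na  <- apply(m, 1, function(r) any(is.na(r) & !is.nan(r)))
has_inf <- apply(m, 1, function(r) any(is.infinite(r)))

# Each row is counted once, under the first category it matches
category <- ifelse(has_nan, "NaN",
            ifelse(has_na,  "NA",
            ifelse(has_inf, "Inf", "complete")))
table(category)
```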
import directory
import_dir(
  dir,
  ...,
  method = c("rio", "vroom", "vroom_jp", "read_csv"),
  return_type = c("df", "list")
)
dir |
dir path |
... |
arguments passed to import method |
method |
import method, chosen from those available in import_tibble |
return_type |
default is to bind dataframes together and remove duplicates. only recommended for a folder of files with the same data format. otherwise specify return as list of data frames |
data frame
Wrapper around multiple file readers. The default is import, set to return a tibble. Also available are vroom and vroom_jp for Japanese characters.
import_tibble(
  path,
  ...,
  method = c("rio", "vroom", "vroom_jp", "read_csv", "read_excel")
)
path |
filepath |
... |
other arguments |
method |
method of import. default is rio |
Supports multiple types of importing through method
a tibble
Set elements to NA values using tidyselect specification. Don't use this function on columns of different modes at once. Defaults to choosing all character columns.
## S3 method for class 'data.frame'
make_na(.data, ..., vec = c("-", "", " ", "null", "NA", "NA_"))
make_na(.data, ..., vec = c("-", "", " ", "null", "NA", "NA_"))
.data |
data frame |
... |
tidyselect. Default selection: all chr cols |
vec |
vector of possible elements to replace with NA |
data frame
# easily set NA values. blank space and empty space are default options
tibble::tibble(x = c("a", "b", "", "d", " ", "", "e")) %>% make_na()
Automatically pads elements of a column to the largest sized element. Useful when an integer code with leading zeros is read in as an integer and needs to be fixed.
pad_auto(mdb, ..., side = "left", pad = "0")
mdb |
data frame |
... |
tidyselect specification |
side |
str_pad side |
pad |
str_pad pad |
data frame
# good for putting leading 0's
tibble::tibble(x = 1:10) %>% pad_auto(x)
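Padding to the widest element amounts to measuring the longest value and left-padding everything else to that width. A minimal base R sketch of the idea, assuming integer input and a "0" pad (the package itself is described as using str_pad):

```r
x <- 1:10

# Width of the widest element once written out as text
width <- max(nchar(as.character(x)))

# Left-pad every value with zeros up to that width
padded <- sprintf(paste0("%0", width, "d"), x)
padded
```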
wrapper around mutate and str_pad
pad_col(mdb, ..., width, pad = "0", side = "left")
mdb |
data frame |
... |
tidyselect |
width |
str_pad width |
pad |
str_pad pad |
side |
str_pad side |
data frame
# manually pad with 0's (or other value)
# use case over [pad_auto()]: the desired width is greater than the widest element
tibble::tibble( ID = c(2, 13, 86, 302) ) %>% pad_col(ID, width = 4)
recode_chr
recode_chr(df, col, old_names, new_name, regex = FALSE, negate = FALSE)
df |
data frame |
col |
unquoted col |
old_names |
character vector or regular expression |
new_name |
atomic chr string |
regex |
Logical, default F. Specify elements for old_names using a regex? |
negate |
logical, default FALSE. Set to TRUE to negate the regex |
df
# Use a negative regex to rename all species other than "virginica" to "none"
iris %>% recode_chr( col = Species, old_names = "vir", new_name = "none", regex = TRUE, negate = TRUE) %>% dplyr::count(Species)
# Specify old names using a regex
iris %>% recode_chr( col = Species, old_names = "set|vir", new_name = "other", regex = TRUE) %>% dplyr::count(Species)
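The negated-regex case can be sketched in base R with grepl: values that do not match the pattern are replaced and the rest are kept. This shows the matching logic only and is not the package's implementation.

```r
species <- as.character(iris$Species)

# TRUE where the value matches the pattern "vir" (i.e. "virginica")
matches <- grepl("vir", species)

# Negated recode: everything that does NOT match becomes "none"
recoded <- ifelse(!matches, "none", species)
table(recoded)
```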
Arranges columns alphabetically and then by type. The user can supply a tidyselect argument to specify columns that should come first.
relocate_all(.data, ..., regex = NULL)
.data |
data frame |
... |
a tidyselect specification |
regex |
a regular expression to match columns that will be put at the front of the df |
data frame
iris %>% head %>% relocate_all(matches("Petal"))
Remove whitespace from columns using a tidyselect specification.
remove_whitespace(.data, ...)
.data |
data frame |
... |
tidyselect specification (default selection: all character columns) |
data frame
tibble::tibble(a = c(" a ", "b ", " c")) -> t1
t1
t1 %>% remove_whitespace()
flexible select operator that powers the tidy consultant universe. Used to set sensible defaults and flexibly return the chosen columns. A developer focused function, but may be useful in interactive programming due to the ability to return different types.
select_otherwise(
  .data,
  ...,
  otherwise = NULL,
  col = NULL,
  return_type = c("names", "index", "df")
)
.data |
dataframe |
... |
tidyselect. columns to choose |
otherwise |
tidyselect. default columns to choose if ... is not specified |
col |
tidyselect. column to choose regardless of ... or otherwise specifications |
return_type |
choose to return column index, names, or df. defaults to index |
integer vector by default. possibly data frame or character vector
iris %>% select_otherwise(where(is.double), return_type = "index")
set character
set_chr(.data, ...)
.data |
dataframe |
... |
tidyselect. Default selection: none |
dataframe
iris %>% tibble::as_tibble() %>% set_chr(tidyselect::everything())
set dates manually or automatically
set_date(.data, ..., date_fn = lubridate::ymd)
.data |
dataframe |
... |
tidyselect |
date_fn |
a function to convert to a date object |
Note: set_date can be called without any ... arguments, in which case it automatically determines which character columns are actually dates and sets them. It checks for the date format specified in date_fn and also for ymd_hms.
In auto-detect mode, it sets ymd_hms output to ymd dates instead of datetimes with hms. This is because of the common occurrence of extracting a ymd date from an Excel workbook and having it come with an extra 00:00:00. If you need a datetime, manually supply the appropriate lubridate function.
Auto mode is experimental. A commonly detected error is a long character string of integers being interpreted as a date.
tibble
tibble::tibble(date_col1 = c("20190101", "20170205"), date_col2 = c("20201015", "20180909"), not_date_col = c("a345", "b040")) -> t1
t1
t1 %>% set_date()
t1 %>% set_date(date_col1)
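Auto-detection of date columns can be sketched in base R as "try to parse every value, and treat the column as a date only if nothing fails". The helper looks_like_ymd below is hypothetical and uses as.Date with a fixed "%Y%m%d" format as a stand-in for the lubridate parsers; it illustrates the idea, including why an all-digit identifier column could be misread as dates.

```r
t1 <- data.frame(
  date_col1    = c("20190101", "20170205"),
  date_col2    = c("20201015", "20180909"),
  not_date_col = c("a345", "b040"),
  stringsAsFactors = FALSE
)

# Hypothetical detector: a character column counts as a date column
# when every value parses under the assumed yyyymmdd format.
looks_like_ymd <- function(x) {
  is.character(x) && !anyNA(as.Date(x, format = "%Y%m%d"))
}

is_date <- vapply(t1, looks_like_ymd, logical(1))

# Convert only the detected columns
t1[is_date] <- lapply(t1[is_date], as.Date, format = "%Y%m%d")
str(t1)
```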
set double
set_dbl(.data, ...)
## S3 method for class 'character'
set_dbl(.data, ...)
## S3 method for class 'factor'
set_dbl(.data, ...)
## S3 method for class 'Date'
set_dbl(.data, ...)
## S3 method for class 'numeric'
set_dbl(.data, ...)
## S3 method for class 'integer64'
set_dbl(.data, ...)
## S3 method for class 'data.frame'
set_dbl(.data, ...)
.data |
dataframe |
... |
tidyselect. Default selection: none |
tibble
date_col <- c(lubridate::ymd(20180101), lubridate::ymd(20210420))
tibble::tibble(int = c(1L, 2L), fct = factor(c(10, 11)), date = date_col, chr = c("a2.1", "rtg50.5")) -> t1
t1
t1 %>% set_dbl(tidyselect::everything())
# s3 method works for vectors individually
# custom date coercion to represent date as a number. For lubridate's coercion method, use set_int
date_col %>% set_dbl
allows option to manually set the first level of the factor, for consistency with yardstick which automatically considers the first level as the "positive class" when evaluating classification.
set_fct(
  .data,
  ...,
  first_level = NULL,
  order_fct = FALSE,
  labels = NULL,
  max_levels = Inf
)
## S3 method for class 'data.frame'
set_fct(.data, ..., first_level = NULL, order_fct = FALSE, max_levels = Inf)
## Default S3 method:
set_fct(.data, ..., first_level = NULL, order_fct = FALSE, max_levels = Inf)
.data |
dataframe |
... |
tidyselect (default selection: all character columns) |
first_level |
character string to set the first level of the factor |
order_fct |
logical. ordered factor? |
labels |
chr vector of labels, length equal to factor levels |
max_levels |
integer. uses |
tibble
## simply set the first level of a factor
iris$Species %>% levels
iris %>% set_fct(Species, first_level = "virginica") %>% dplyr::pull(Species) %>% levels()
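For comparison, moving a chosen level to the front of an existing factor can be done in base R with stats::relevel. This is a base R analogue of the first_level idea, shown as a sketch rather than as what set_fct does internally.

```r
original_levels <- levels(iris$Species)

# Move "virginica" to the front; the remaining levels keep their order
releveled  <- relevel(iris$Species, ref = "virginica")
new_levels <- levels(releveled)

original_levels
new_levels
```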
set integer
set_int(.data, ...)
## S3 method for class 'data.frame'
set_int(.data, ...)
## S3 method for class 'grouped_df'
set_int(.data, ...)
.data |
dataframe |
... |
tidyselect. Default selection: integerish doubles or integerish characters |
tibble
int_vec <- c("1", "2", "10")
tibble::tibble(
  chr_int = int_vec,
  dbl_int = c(1.0, 5.0, 20.0),
  chr_int64 = c("1033493932", "4432500065", "30303022192"),
  string_int = c("SALES2020", "SALES2021", "SALES2022")) -> tbl
# automatically coerce integerish cols in a tibble
tbl
# integerish doubles or chars will be detected for coercion automatically
tbl %>% set_int()
# string_int requires parsing, so it must be specified directly for coercion
tbl %>% set_int(matches("str|chr"))
# s3 method works for vectors as well
int_vec
int_vec %>% set_int()
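What "integerish" means for the default selection can be sketched in base R: a column qualifies when every value converts to a number with no fractional part. The helper is_integerish below is hypothetical and only approximates the detection; the package's own rule may differ.

```r
# Hypothetical detector (an assumption, not framecleaner's code):
# TRUE when all values convert to whole numbers.
is_integerish <- function(x) {
  num <- suppressWarnings(as.numeric(x))
  !anyNA(num) && all(num == round(num))
}

cols <- list(
  chr_int    = c("1", "2", "10"),
  dbl_int    = c(1.0, 5.0, 20.0),
  chr_int64  = c("1033493932", "4432500065", "30303022192"),
  string_int = c("SALES2020", "SALES2021", "SALES2022")
)

detected <- vapply(cols, is_integerish, logical(1))
detected
```

Under this sketch string_int is not detected, which matches the note above that it has to be selected explicitly.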
note: for non-binary data, all values other than the true_level will be set to false
## S3 method for class 'data.frame'
set_lgl(.data, ..., true_level = 1L)
set_lgl(.data, ..., true_level = 1L)
## Default S3 method:
set_lgl(.data, ...)
## S3 method for class 'numeric'
set_lgl(.data, ..., true_level = 1L)
## S3 method for class 'character'
set_lgl(.data, ..., true_level = c("T", "TRUE"))
.data |
dataframe |
... |
tidyselect. Default selection: none |
true_level |
specify the value to set as TRUE. Default value is 1 for seamless conversion between logicals and integers. Can be given as a vector of values. |
dataframe
# convert a 1/0 vector back into T/F
tibble::tibble(x = c(1, 0, 0, 1, 0, 1)) %>% set_lgl(x)
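The rule that every value other than true_level becomes FALSE can be sketched in base R with %in%. The helper to_lgl is hypothetical; in this sketch NA also maps to FALSE, which is a property of %in% and not a documented behavior of set_lgl.

```r
# Hypothetical helper: TRUE for values found in true_level, FALSE otherwise
to_lgl <- function(x, true_level = 1L) {
  x %in% true_level
}

num_lgl <- to_lgl(c(1, 0, 0, 1, 0, 1))
chr_lgl <- to_lgl(c("T", "F", "TRUE", "yes"), true_level = c("T", "TRUE"))

num_lgl
chr_lgl
```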