Title: | Validate Data Frames |
---|---|
Description: | Functions for validating the structure and properties of data frames. Answers essential questions about a data set after initial import or modification. What are the unique or missing values? What columns form a primary key? What are the properties of the numeric or categorical columns? What kind of overlap or mapping exists between 2 columns? |
Authors: | Harrison Tietze [aut, cre] |
Maintainer: | Harrison Tietze <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2024-11-11 06:02:05 UTC |
Source: | https://github.com/harrison4192/validata |
Confirm whether the rows of a data frame can be uniquely identified by the keys in the selected columns. Also reports whether the dataframe has duplicates. If so, it is best to remove duplicates and re-run the function.
confirm_distinct(.data, ...)
.data | A dataframe |
... | (ID) columns |
A logical value, returned invisibly, with a description printed to the console.
iris %>% confirm_distinct(Species, Sepal.Width)
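The idea behind this check can be sketched in base R. This is only an illustration of what it means for columns to form a unique key, not the package's implementation; `is_distinct_key` and `iris_id` are hypothetical names introduced here.

```r
# Hypothetical helper: do the selected columns uniquely identify each row?
is_distinct_key <- function(df, cols) {
  keys <- df[, cols, drop = FALSE]
  # anyDuplicated() returns 0 when no combination of key values repeats
  anyDuplicated(keys) == 0
}

# Species and Sepal.Width together repeat across rows, so they are not a key
is_distinct_key(iris, c("Species", "Sepal.Width"))

# An explicit row identifier always forms a key
iris_id <- cbind(row_id = seq_len(nrow(iris)), iris)
is_distinct_key(iris_id, "row_id")
```

The first call returns `FALSE` and the second returns `TRUE`.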
The mapping between elements of 2 columns can have 4 different relationships: one - one, one - many, many - one, many - many. This function returns a view of the mappings by row, and prints a summary to the console.
confirm_mapping(.data, col1, col2, view = T)
.data | a data frame |
col1 | column 1 |
col2 | column 2 |
view | View results? |
A view of mappings. Also returns the view as a data frame invisibly.
iris %>% confirm_mapping(Species, Sepal.Width, view = FALSE)
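The four relationships can be illustrated with a small base R sketch. The helper `classify_mapping` is hypothetical and is not part of the package; it labels the relationship from the first vector's point of view ("one - many" meaning one value of `x` pairs with several values of `y`), which may differ from how the package reports it.

```r
# Hypothetical helper: classify the relationship between two vectors
classify_mapping <- function(x, y) {
  # keep each distinct (x, y) pairing once
  pair_tbl <- unique(data.frame(x = x, y = y))
  # does any x value pair with more than one y value, and vice versa?
  x_has_many_y <- any(table(pair_tbl$x) > 1)
  y_has_many_x <- any(table(pair_tbl$y) > 1)
  if (!x_has_many_y && !y_has_many_x) {
    "one - one"
  } else if (x_has_many_y && !y_has_many_x) {
    "one - many"
  } else if (!x_has_many_y && y_has_many_x) {
    "many - one"
  } else {
    "many - many"
  }
}

classify_mapping(c("a", "b", "c"), c(1, 2, 3))   # "one - one"
classify_mapping(c("a", "a", "b"), c(1, 2, 3))   # "one - many"
classify_mapping(c("a", "b", "c"), c(1, 1, 2))   # "many - one"
classify_mapping(iris$Species, iris$Sepal.Width) # "many - many"
```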
Prints a Venn-diagram-style summary of the unique-value overlap between two columns and invisibly returns a data frame that can be assigned to a variable and queried with the overlap helpers. The helpers return the values that appear only in the first column, only in the second column, or in both columns.
confirm_overlap(vec1, vec2, return_tibble = F)
co_find_only_in_1(co_output)
co_find_only_in_2(co_output)
co_find_in_both(co_output)
vec1 | vector 1 |
vec2 | vector 2 |
return_tibble | logical. If TRUE, returns a tibble. Otherwise, by default, returns the data frame invisibly to be queried by the helper functions. |
co_output | data frame output from confirm_overlap |
tibble. overlap summary or overlap table
confirm_overlap(iris$Sepal.Width, iris$Sepal.Length) -> iris_overlap
iris_overlap
iris_overlap %>% co_find_only_in_1()
iris_overlap %>% co_find_only_in_2()
iris_overlap %>% co_find_in_both()
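The three overlap groups correspond to ordinary set operations on the unique values. The base R sketch below shows that correspondence. It does not reproduce the package's output format, and the variable names are chosen here for illustration.

```r
# Unique values of each column
v1 <- unique(iris$Sepal.Width)
v2 <- unique(iris$Sepal.Length)

only_in_1 <- setdiff(v1, v2)    # values that appear only in the first vector
only_in_2 <- setdiff(v2, v1)    # values that appear only in the second vector
in_both   <- intersect(v1, v2)  # values shared by both vectors

# A Venn-style count summary
c(only_in_1 = length(only_in_1),
  only_in_2 = length(only_in_2),
  in_both   = length(in_both))
```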
Returns a count table of string lengths for a character column. The helper function choose_strlen filters the data frame for rows containing a specific string length in the specified column.
confirm_strlen(mdb, col)
choose_strlen(cs_output, len)
mdb | dataframe |
col | unquoted column |
cs_output | dataframe. output from confirm_strlen |
len | integer vector. |
confirm_strlen: prints a summary and returns a data frame invisibly.
choose_strlen: a data frame with the original columns, filtered to rows with the specified string length.
iris %>% tibble::as_tibble() %>% confirm_strlen(Species) -> iris_cs_output
iris_cs_output
iris_cs_output %>% choose_strlen(6)
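For intuition, the same two steps (counting string lengths, then filtering on one length) can be sketched with `nchar()` in base R. This is an illustration only and does not reproduce the package's printed summary; the variable names are assumptions made here.

```r
# String length of every Species entry
species <- as.character(iris$Species)

# Count table of string lengths
len_counts <- as.data.frame(table(strlen = nchar(species)))
len_counts

# Rows whose Species value has exactly 6 characters ("setosa")
six_char_rows <- iris[nchar(species) == 6, ]
nrow(six_char_rows)
```

In `iris`, the three species names have 6, 9, and 10 characters, and each occurs 50 times.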
Uses confirm_distinct in an iterative fashion to determine the primary keys.
determine_distinct(df, ..., listviewer = TRUE)
df | a data frame |
... | columns or a tidyselect specification. defaults to everything |
listviewer | logical. defaults to TRUE to view output using the listviewer package |
The goal of this function is to automatically determine which columns uniquely identify the rows of a data frame. The output is a printed description of the combinations of columns that form unique identifiers at each level. At level 1, the function tests whether individual columns are primary keys. At level 2, the function tests all n-choose-2 combinations of columns to see whether they form primary keys. The final level tests all columns at once.
Completely unique columns are recorded in level 1 and then dropped from the data frame to facilitate the determination of multi-column primary keys.
If the dataset contains duplicated rows, they are eliminated before proceeding.
list
sample_data1 %>% head
## on level 1, each column is tested as a unique identifier. the VAL columns have no
## duplicates and hence qualify, even though they normally would not be considered as IDs
## on level 3, combinations of 3 columns are tested, implying that ID_COL 1,2,3 form a unique key
## level 2 does not appear, implying that combinations of any 2 ID_COLs do not form a unique key
sample_data1 %>% determine_distinct(listviewer = FALSE)
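The level-by-level search described above can be sketched in base R with `combn()`. This is a minimal illustration under stated assumptions, not the package's code: `find_keys` and `toy` are hypothetical names, and the sketch reports every qualifying combination at each level, including supersets of smaller keys, which may differ from the package's behavior.

```r
# Hypothetical helper: search for key columns level by level
find_keys <- function(df) {
  df <- unique(df)  # eliminate duplicated rows before testing
  keys <- list()

  # Level 1: individual columns with no repeated values
  single <- names(df)[vapply(df, function(col) anyDuplicated(col) == 0, logical(1))]
  if (length(single) > 0) keys[["level_1"]] <- as.list(single)

  # Drop the unique columns, then test combinations of the remaining ones
  rest <- setdiff(names(df), single)
  if (length(rest) >= 2) {
    for (k in 2:length(rest)) {
      combos <- combn(rest, k, simplify = FALSE)
      is_key <- vapply(
        combos,
        function(cc) anyDuplicated(df[, cc, drop = FALSE]) == 0,
        logical(1)
      )
      if (any(is_key)) keys[[paste0("level_", k)]] <- combos[is_key]
    }
  }
  keys
}

toy <- data.frame(
  id1 = c(1, 1, 2, 2),
  id2 = c("a", "b", "a", "b"),
  val = c(10, 20, 30, 40)
)
key_levels <- find_keys(toy)
key_levels
```

Here `val` qualifies at level 1 because it has no duplicates, and `id1` with `id2` qualifies at level 2.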
Determine pairwise structural mappings
determine_mapping(df, ..., listviewer = TRUE)
df | a data frame |
... | columns or a tidyselect specification |
listviewer | logical. defaults to TRUE to view output using the listviewer package |
description of mappings
iris %>% determine_mapping(listviewer = FALSE)
Uses confirm_overlap in a pairwise fashion to show a Venn-style comparison of unique values between the columns chosen by a tidyselect specification.
determine_overlap(db, ...)
db | a data frame |
... | tidyselect specification. Default being everything. |
tibble
iris %>% determine_overlap()
Pipe in a data frame to return a diagnosis of the missing and unique values in each column. The default behavior is to diagnose all columns, but a subset can be specified in the dots with tidyselect.
diagnose(df, ...)
df | dataframe |
... | tidyselect |
This function is inspired by the excellent dlookr package. It takes a data frame and returns a summary of the unique and missing values of its columns.
dataframe summary
iris %>% diagnose()
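A per-column summary of this kind can be approximated in base R as follows. The function name `diagnose_sketch` and the output column names are assumptions made for illustration; the package's actual output columns may differ.

```r
# Hypothetical helper: count missing and unique values per column
diagnose_sketch <- function(df) {
  data.frame(
    variable  = names(df),
    type      = vapply(df, function(col) class(col)[1], character(1)),
    missing_n = vapply(df, function(col) sum(is.na(col)), integer(1)),
    # unique() treats NA as a value, so a column with NAs counts it once
    unique_n  = vapply(df, function(col) length(unique(col)), integer(1)),
    row.names = NULL
  )
}

iris_diag <- diagnose_sketch(iris)
iris_diag
```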
Counts the distinct entries of categorical variables. The max_distinct argument limits the scope to categorical variables with at most that many unique entries, to prevent overflow.
diagnose_category(.data, ..., max_distinct = 5)
.data | dataframe |
... | tidyselect |
max_distinct | integer |
dataframe
iris %>% diagnose_category()
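The effect of a `max_distinct` cutoff can be illustrated with a base R sketch. The helper `count_categories` is hypothetical, it treats only factor and character columns as categorical (an assumption), and its output layout is not the package's.

```r
# Hypothetical helper: count entries of low-cardinality categorical columns
count_categories <- function(df, max_distinct = 5) {
  is_small_cat <- vapply(
    df,
    function(col) (is.factor(col) || is.character(col)) &&
      length(unique(col)) <= max_distinct,
    logical(1)
  )
  # One count table per qualifying column
  lapply(df[is_small_cat], function(col) as.data.frame(table(value = col)))
}

counts <- count_categories(iris)
counts$Species

# With a stricter cutoff, Species (3 distinct values) is excluded
length(count_categories(iris, max_distinct = 2))
```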
Faster than diagnose when the emphasis is on diagnosing missing values. Shows only the columns that contain missing values.
diagnose_missing(df, ...)
df | dataframe |
... | optional tidyselect |
tibble summary
iris %>% framecleaner::make_na(Species, vec = "setosa") %>% diagnose_missing()
Inputs a data frame and returns various summary statistics of the numeric columns. For example, zeros returns the ratio of 0 values in that column, minus counts negative values, and infs counts Inf values. Other, rarer metrics that may be helpful for quick diagnosis or understanding of numeric data are also returned: mode returns the most common value in the column (chosen at random in case of a tie), and mode_ratio returns its frequency as a ratio of the total rows.
diagnose_numeric(.data, ...)
.data | dataframe |
... | tidyselect. Default: all numeric columns |
dataframe
iris %>% diagnose_numeric() %>% print(width = Inf)
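The metrics named above can be computed for a single numeric vector with a few lines of base R. This sketch is an approximation for illustration: `numeric_summary` is a hypothetical helper, and unlike the description above it breaks mode ties by taking the first value in sorted order rather than choosing at random.

```r
# Hypothetical helper: a few of the described metrics for one numeric vector
numeric_summary <- function(x) {
  counts <- table(x)        # frequency of each non-missing value
  top <- which.max(counts)  # first most frequent value (ties are not randomized)
  c(
    zeros      = mean(x == 0, na.rm = TRUE),   # ratio of 0 values
    minus      = sum(x < 0, na.rm = TRUE),     # count of negative values
    infs       = sum(is.infinite(x)),          # count of Inf and -Inf values
    mode       = as.numeric(names(counts)[top]),
    mode_ratio = as.numeric(counts[top]) / length(x)
  )
}

x <- c(0, 0, -1, 2, Inf)
x_summary <- numeric_summary(x)
x_summary

# Apply to every numeric column of iris
sapply(iris[vapply(iris, is.numeric, logical(1))], numeric_summary)
```

For `x`, two of five values are zero, so `zeros` and `mode_ratio` are both 0.4, `minus` and `infs` are 1, and `mode` is 0.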
n_dupes
n_dupes(x)
x | a df |
an integer; number of dupe rows
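"Number of dupe rows" can be read in two ways, and the base R sketch below shows both. Which one `n_dupes` reports is not stated here, so treat this as an illustration of the concept rather than of the function; `dup_df` is a hypothetical example data frame.

```r
dup_df <- data.frame(a = c(1, 1, 2, 1), b = c("x", "x", "y", "x"))

# Extra copies beyond the first occurrence of each row
extra_copies <- sum(duplicated(dup_df))

# All rows that belong to a duplicated group, including the first occurrence
rows_in_dup_groups <- sum(duplicated(dup_df) | duplicated(dup_df, fromLast = TRUE))

extra_copies        # 2
rows_in_dup_groups  # 3
```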
View rows of the data frame where the columns in the tidyselect specification contain missing values. By default, detects missing values in any column. The result is displayed in the viewer pane by default; optionally, it can be returned as a tibble.
view_missing(df, ..., view = TRUE)
df | dataframe |
... | tidyselect |
view | logical. if false, returns tibble |
tibble
iris %>% framecleaner::make_na(Species, vec = "setosa") %>% view_missing(view = FALSE)
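The row filter itself can be sketched in base R with `complete.cases()`, which avoids the framecleaner dependency used above. The variable names are assumptions for illustration, and the sketch returns plain data frames rather than opening a viewer.

```r
# Introduce a few missing values for demonstration
iris_na <- iris
iris_na$Sepal.Length[c(1, 5)] <- NA
iris_na$Species[10] <- NA

# Rows with a missing value in any column
rows_any <- iris_na[!complete.cases(iris_na), ]

# Rows with a missing value in a selected column only
rows_species <- iris_na[!complete.cases(iris_na[, "Species", drop = FALSE]), ]

nrow(rows_any)      # 3
nrow(rows_species)  # 1
```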