--- title: "validata" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{validata} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) iris <- tibble::tibble(iris) ``` ```{r setup} library(validata) library(tidyselect) ``` # Distinct ## Confirm Distinct In data analysis tasks we often have data sets with multiple possible ID columns, but it's not always clear which combination uniquely identifies each row. sample_data1 has 125 row with 3 ID type columns and 3 value columns. ```{r} head(sample_data1) ``` Let's use `confirm_distinct` iteratively to find the uniquely identifying columns of sample_data1. ```{r} sample_data1 %>% confirm_distinct(ID_COL1) ``` ```{r} sample_data1 %>% confirm_distinct(ID_COL1, ID_COL2) ``` ```{r} sample_data1 %>% confirm_distinct(ID_COL1, ID_COL2, ID_COL3) ``` Here we can conclude that the combination of 3 ID columns is the primary key for the data. ## Determine Distinct These steps can be automated with the wrapper function `determine distinct`. ```{r} sample_data1 %>% determine_distinct(matches("ID")) ``` # Mapping `confirm_mapping` tells you the mapping between two columns in a data frame: - 1 - 1 mapping - 1 - many mapping - many - 1 mapping - many - many mapping ## Confirm mapping `confirm_mapping` gives the option to view which type of mapping is associated with each individual row. ```{r} sample_data1 %>% confirm_mapping(ID_COL1, ID_COL2, view = F) ``` ## Determine mapping ```{r} sample_data1 %>% determine_mapping(everything()) ``` # Overlap The `overlap` functions give a venn style description of the values in 2 columns. This is especially useful before performing a `join` function, and you want to confirm that the dataframes have matching keys. ## Confirm Overlap `confirm_overlap` is different from the other `confirm` functions in that it takes 2 vectors as arguments, instead of a data frame. This is to allow the user to test overlap between different dataframes, or arbitrary vectors if necessary ```{r} confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap ``` `confirm_overlap` returns a summary data frame invisibly allowing you to access individual elements using the helper functions. ```{r} print(iris_overlap) ``` Find the elements unique to the first column ```{r} iris_overlap %>% co_find_only_in_1() %>% head() ``` Find the elements unique to the second column ```{r} iris_overlap %>% co_find_only_in_2() %>% head() ``` Find the elements shared by both columns ```{r} iris_overlap %>% co_find_in_both() %>% head() ``` ## Determine Overlap `determine_overlap` takes a dataframe and a tidyselect specification, and returns a tibble summarizing all of the pairwise overlaps. Only pairs with matching types are tested. ```{r eval=FALSE, include=FALSE,} iris %>% determine_overlap(everything()) ``` Note that the `overlap` functions only test pairwise overlaps. For multi-column and large-scale overlap testing, see [Complex Upset Plots](https://krassowski.github.io/complex-upset/) # string length ## confirm string length Get a frequency table of string lengths in a character column. Table is printed while the original df is returned invisibly with a column indicating the string lengths. ```{r} iris %>% confirm_strlen(Species) -> species_len ``` output is a dataframe ```{r} head(species_len) ``` ## choose string length A helped function for the output of `confirm_strlen` that filters the database for chosen string lengths. ```{r} species_len %>% choose_strlen(len = 6) %>% head() ``` # diagnose Reproduction of diagnose from the dlookr package. Usually a good choice for first analyzing a data set. ```{r} iris %>% diagnose() ```