---
title: "validata"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{validata}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
iris <- tibble::tibble(iris)
```
```{r setup}
library(validata)
library(tidyselect)
```
# Distinct
## Confirm Distinct
In data analysis tasks we often have data sets with multiple possible ID columns, but it's not always clear which combination uniquely identifies each row.
`sample_data1` has 125 rows, with 3 ID-type columns and 3 value columns.
```{r}
head(sample_data1)
```
Let's use `confirm_distinct` iteratively to find the uniquely identifying columns of `sample_data1`.
```{r}
sample_data1 %>%
  confirm_distinct(ID_COL1)
```
```{r}
sample_data1 %>%
  confirm_distinct(ID_COL1, ID_COL2)
```
```{r}
sample_data1 %>%
  confirm_distinct(ID_COL1, ID_COL2, ID_COL3)
```
Here we can conclude that the combination of all 3 ID columns is the primary key for the data.
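
For intuition, the check behind each call above can be approximated in base R: a set of columns is a key when no combination of their values repeats. This is only a sketch and not validata's implementation; `toy` and `is_distinct()` are hypothetical names, and the toy data is not the packaged `sample_data1`.

```r
# Hypothetical toy data (an assumption for illustration only).
toy <- data.frame(
  id1 = c("a", "a", "b", "b"),
  id2 = c(1, 2, 1, 1),
  id3 = c("x", "x", "x", "y"),
  stringsAsFactors = FALSE
)

# TRUE when the selected columns contain no repeated value combination.
is_distinct <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0L
}

is_distinct(toy, "id1")
is_distinct(toy, c("id1", "id2"))
is_distinct(toy, c("id1", "id2", "id3"))
```

In this toy data only the last call finds a unique combination, which mirrors the step-by-step narrowing shown with `confirm_distinct`.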
## Determine Distinct
These steps can be automated with the wrapper function `determine_distinct`.
```{r}
sample_data1 %>%
  determine_distinct(matches("ID"))
```
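
One way such an automated search could work is to try column combinations from smallest to largest and stop at the first one that is unique. The base R sketch below illustrates that idea under stated assumptions; it is not taken from validata's source, and `toy` and `find_key()` are hypothetical.

```r
# Hypothetical toy data (an assumption for illustration only).
toy <- data.frame(
  id1 = c("a", "a", "b", "b"),
  id2 = c(1, 2, 1, 1),
  id3 = c("x", "x", "x", "y"),
  stringsAsFactors = FALSE
)

# Return the first (smallest) set of candidate columns whose value
# combinations are all unique, or NULL when no combination qualifies.
find_key <- function(df, candidates) {
  for (k in seq_along(candidates)) {
    combos <- combn(candidates, k, simplify = FALSE)
    for (cols in combos) {
      if (anyDuplicated(df[, cols, drop = FALSE]) == 0L) {
        return(cols)
      }
    }
  }
  NULL
}

find_key(toy, c("id1", "id2", "id3"))
```

The number of combinations grows quickly with the number of candidate columns, so restricting the candidates with a selection such as `matches("ID")` keeps the search small.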
# Mapping
`confirm_mapping` tells you the mapping between two columns in a data frame:

- 1 - 1 mapping
- 1 - many mapping
- many - 1 mapping
- many - many mapping
## Confirm Mapping
`confirm_mapping` gives the option to view which type of mapping is associated with each individual row.
```{r}
sample_data1 %>%
  confirm_mapping(ID_COL1, ID_COL2, view = FALSE)
```
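
The four mapping types can be told apart by looking at the distinct value pairs of the two columns. The following base R sketch shows one way to do that; `mapping_type()` is a hypothetical helper, and both its logic and its labels (left side refers to the first vector) are assumptions rather than validata's own implementation.

```r
# Classify the relationship between two equal-length vectors.
mapping_type <- function(x, y) {
  pairs <- unique(data.frame(x = x, y = y, stringsAsFactors = FALSE))
  x_to_many <- anyDuplicated(pairs$x) > 0L  # some x value pairs with several y values
  y_to_many <- anyDuplicated(pairs$y) > 0L  # some y value pairs with several x values
  if (x_to_many && y_to_many) {
    "many - many"
  } else if (x_to_many) {
    "1 - many"
  } else if (y_to_many) {
    "many - 1"
  } else {
    "1 - 1"
  }
}

mapping_type(c("a", "b", "c"), c(1, 2, 3))
mapping_type(c("a", "a", "b"), c(1, 2, 3))
mapping_type(c("a", "b", "c"), c(1, 1, 2))
mapping_type(c("a", "a", "b"), c(1, 2, 2))
```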
## Determine Mapping
```{r}
sample_data1 %>%
  determine_mapping(everything())
```
# Overlap
The `overlap` functions give a Venn-style description of the values in 2 columns. This is especially useful before performing a join, when you want to confirm that the data frames have matching keys.
## Confirm Overlap
`confirm_overlap` is different from the other `confirm` functions in that it takes 2 vectors as arguments instead of a data frame. This allows the user to test overlap between different data frames, or between arbitrary vectors if necessary.
```{r}
iris_overlap <- confirm_overlap(iris$Sepal.Width, iris$Petal.Length)
```
`confirm_overlap` returns a summary data frame invisibly, allowing you to access individual elements using the helper functions.
```{r}
print(iris_overlap)
```
Find the elements unique to the first column:
```{r}
iris_overlap %>%
  co_find_only_in_1() %>%
  head()
```
Find the elements unique to the second column:
```{r}
iris_overlap %>%
  co_find_only_in_2() %>%
  head()
```
Find the elements shared by both columns:
```{r}
iris_overlap %>%
  co_find_in_both() %>%
  head()
```
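
The three helpers correspond to the three regions of a two-set Venn diagram. As a rough base R analogue, and not a description of how validata computes them, the same regions can be obtained with set operations on two hypothetical vectors `x` and `y`:

```r
# Hypothetical key vectors (an assumption for illustration only).
x <- c(1, 2, 3, 4)
y <- c(3, 4, 5)

only_in_x <- setdiff(x, y)    # values that appear only in x
only_in_y <- setdiff(y, x)    # values that appear only in y
in_both   <- intersect(x, y)  # values shared by x and y

only_in_x
only_in_y
in_both
```

Before a join, a non-empty `only_in_x` or `only_in_y` signals keys that would be dropped by an inner join or left unmatched by an outer join.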
## Determine Overlap
`determine_overlap` takes a dataframe and a tidyselect specification, and returns a tibble summarizing all of the pairwise overlaps. Only pairs with matching types are tested.
```{r eval=FALSE, include=FALSE}
iris %>%
  determine_overlap(everything())
```
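
To illustrate what a pairwise overlap summary restricted to matching types might look like, here is a base R sketch. The function `pairwise_overlap()`, the toy data, and the output column names are all hypothetical; the actual columns returned by `determine_overlap` may differ.

```r
# Hypothetical toy data (an assumption for illustration only).
toy <- data.frame(
  a = c(1, 2, 3),
  b = c(2, 3, 4),
  c = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Summarise the overlap of distinct values for every pair of columns
# that share the same class.
pairwise_overlap <- function(df) {
  pairs <- combn(names(df), 2, simplify = FALSE)
  same_class <- function(p) identical(class(df[[p[1]]]), class(df[[p[2]]]))
  pairs <- Filter(same_class, pairs)
  rows <- lapply(pairs, function(p) {
    x <- unique(df[[p[1]]])
    y <- unique(df[[p[2]]])
    data.frame(
      col1 = p[1],
      col2 = p[2],
      only_in_1 = length(setdiff(x, y)),
      only_in_2 = length(setdiff(y, x)),
      in_both = length(intersect(x, y)),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

overlap_summary <- pairwise_overlap(toy)
overlap_summary
```

Here only the pair `a` and `b` is compared, because column `c` is character while the other two are numeric.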
Note that the `overlap` functions only test pairwise overlaps. For multi-column and large-scale overlap testing, see [Complex Upset Plots](https://krassowski.github.io/complex-upset/).
# String Length
## Confirm String Length
Get a frequency table of string lengths in a character column.
The table is printed, while the original data frame is returned invisibly with a column indicating the string lengths.
```{r}
species_len <- iris %>%
  confirm_strlen(Species)
```
The output is a data frame:
```{r}
head(species_len)
```
## Choose String Length
A helper function for the output of `confirm_strlen` that filters the data frame for chosen string lengths.
```{r}
species_len %>%
  choose_strlen(len = 6) %>%
  head()
```
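
The same two steps, tabulating string lengths and then filtering on a chosen length, can be sketched in base R with `nchar()` and `table()`. The vector `species` below is a hypothetical stand-in, and this is not how `confirm_strlen` or `choose_strlen` are implemented.

```r
# Hypothetical character vector (an assumption for illustration only).
species <- c("setosa", "versicolor", "virginica", "setosa")

len <- nchar(species)        # string length of each element
len_table <- table(len)      # frequency table of the lengths
len_table

six_chars <- species[len == 6]  # keep only the strings of length 6
six_chars
```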
# Diagnose
A reproduction of `diagnose` from the dlookr package. It is usually a good first step when analyzing a data set.
```{r}
iris %>%
  diagnose()
```
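
As a rough idea of the kind of per-column summary a first-pass diagnostic reports, the base R sketch below collects the type, missing count, and distinct-value count of each column. `diagnose_sketch()` and its output column names are hypothetical and are not meant to reproduce the exact output of `diagnose`.

```r
# Summarise each column of a data frame: class, number of missing
# values, and number of distinct values (NA counts as a value here).
diagnose_sketch <- function(df) {
  data.frame(
    variable = names(df),
    type = vapply(df, function(col) class(col)[1], character(1), USE.NAMES = FALSE),
    missing_count = vapply(df, function(col) sum(is.na(col)), integer(1), USE.NAMES = FALSE),
    unique_count = vapply(df, function(col) length(unique(col)), integer(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Hypothetical toy data (an assumption for illustration only).
toy <- data.frame(
  x = c(1, NA, 3),
  y = c("a", "a", "b"),
  stringsAsFactors = FALSE
)

toy_summary <- diagnose_sketch(toy)
toy_summary
```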