In data analysis tasks we often have data sets with multiple possible ID columns, but it’s not always clear which combination uniquely identifies each row.
sample_data1 has 125 row with 3 ID type columns and 3 value columns.
head(sample_data1)
#> # A tibble: 6 × 6
#> ID_COL1 ID_COL2 ID_COL3 VAL1 VAL2 VAL3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2413 1034 1014 -0.0639 -1.16 -0.302
#> 2 2413 1034 1322 0.363 1.62 0.165
#> 3 2413 1034 2999 -0.00466 1.23 0.819
#> 4 2413 1034 3544 1.83 -2.58 -0.525
#> 5 2413 1034 9901 0.837 -0.442 -0.341
#> 6 2413 1122 1014 -0.894 -1.11 0.768
Let’s use confirm_distinct
iteratively to find the
uniquely identifying columns of sample_data1.
sample_data1 %>%
confirm_distinct(ID_COL1, ID_COL2)
#> database has 100 duplicates at ID_COL1, ID_COL2
sample_data1 %>%
confirm_distinct(ID_COL1, ID_COL2, ID_COL3)
#> database is distinct at ID_COL1, ID_COL2, ID_COL3
Here we can conclude that the combination of 3 ID columns is the primary key for the data.
These steps can be automated with the wrapper function
determine distinct
.
sample_data1 %>%
determine_distinct(matches("ID"))
#> Warning: There was 1 warning in `dplyr::summarize()`.
#> ℹ In argument: `dplyr::across(.fns = ~dplyr::n_distinct(.) == rws)`.
#> Caused by warning:
#> ! Using `across()` without supplying `.cols` was deprecated in dplyr 1.1.0.
#> ℹ Please supply `.cols` instead.
confirm_mapping
tells you the mapping between two
columns in a data frame:
confirm_mapping
gives the option to view which type of
mapping is associated with each individual row.
The overlap
functions give a venn style description of
the values in 2 columns. This is especially useful before performing a
join
function, and you want to confirm that the dataframes
have matching keys.
confirm_overlap
is different from the other
confirm
functions in that it takes 2 vectors as arguments,
instead of a data frame. This is to allow the user to test overlap
between different dataframes, or arbitrary vectors if necessary
confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap
#> # A tibble: 1 × 5
#> only_in_iris_Sepal.Width only_in_iris_Petal.Length shared_names total_names
#> <int> <int> <int> <int>
#> 1 12 32 11 55
#> # ℹ 1 more variable: pct_shared <chr>
confirm_overlap
returns a summary data frame invisibly
allowing you to access individual elements using the helper
functions.
print(iris_overlap)
#> # A tibble: 55 × 4
#> x iris_Sepal.Width iris_Petal.Length both_flags
#> <dbl> <dbl> <dbl> <dbl>
#> 1 3.5 1 1 2
#> 2 3 1 1 2
#> 3 3.2 1 0 1
#> 4 3.1 1 0 1
#> 5 3.6 1 1 2
#> 6 3.9 1 1 2
#> 7 3.4 1 0 1
#> 8 2.9 1 0 1
#> 9 3.7 1 1 2
#> 10 4 1 1 2
#> # ℹ 45 more rows
Find the elements unique to the first column
iris_overlap %>%
co_find_only_in_1() %>%
head()
#> # A tibble: 6 × 1
#> iris_Sepal.Width
#> <dbl>
#> 1 3.2
#> 2 3.1
#> 3 3.4
#> 4 2.9
#> 5 2.3
#> 6 2.8
Find the elements unique to the second column
iris_overlap %>%
co_find_only_in_2() %>%
head()
#> # A tibble: 6 × 1
#> iris_Petal.Length
#> <dbl>
#> 1 1.4
#> 2 1.3
#> 3 1.5
#> 4 1.7
#> 5 1.6
#> 6 1.1
Find the elements shared by both columns
determine_overlap
takes a dataframe and a tidyselect
specification, and returns a tibble summarizing all of the pairwise
overlaps. Only pairs with matching types are tested.
Note that the overlap
functions only test pairwise
overlaps. For multi-column and large-scale overlap testing, see Complex Upset
Plots
Get a frequency table of string lengths in a character column. Table is printed while the original df is returned invisibly with a column indicating the string lengths.
iris %>%
confirm_strlen(Species) -> species_len
#> Species_chr_len n percent
#> 6 50 33.3%
#> 9 50 33.3%
#> 10 50 33.3%
output is a dataframe
head(species_len)
#> # A tibble: 6 × 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 5.1 3.5 1.4 0.2 setosa 6
#> 2 4.9 3 1.4 0.2 setosa 6
#> 3 4.7 3.2 1.3 0.2 setosa 6
#> 4 4.6 3.1 1.5 0.2 setosa 6
#> 5 5 3.6 1.4 0.2 setosa 6
#> 6 5.4 3.9 1.7 0.4 setosa 6
A helped function for the output of confirm_strlen
that
filters the database for chosen string lengths.
species_len %>%
choose_strlen(len = 6) %>%
head()
#> # A tibble: 6 × 6
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#> <dbl> <dbl> <dbl> <dbl> <fct> <int>
#> 1 5.1 3.5 1.4 0.2 setosa 6
#> 2 4.9 3 1.4 0.2 setosa 6
#> 3 4.7 3.2 1.3 0.2 setosa 6
#> 4 4.6 3.1 1.5 0.2 setosa 6
#> 5 5 3.6 1.4 0.2 setosa 6
#> 6 5.4 3.9 1.7 0.4 setosa 6
Reproduction of diagnose from the dlookr package. Usually a good choice for first analyzing a data set.
iris %>%
diagnose()
#> # A tibble: 5 × 6
#> variables types missing_count missing_percent unique_count unique_rate
#> <chr> <chr> <int> <dbl> <int> <dbl>
#> 1 Sepal.Length numeric 0 0 35 0.233
#> 2 Sepal.Width numeric 0 0 23 0.153
#> 3 Petal.Length numeric 0 0 43 0.287
#> 4 Petal.Width numeric 0 0 22 0.147
#> 5 Species factor 0 0 3 0.02