Getting Started • affirm

library(affirm)
library(dplyr)

The affirm package was built to run data checks, or affirmations, against data that is continually updated. This brief tutorial walks through the basics of using the package to identify and report errors in data.

Validate EDC Data

Validating data from an electronic data capture (EDC) system differs in some respects from other types of validation, and the examples below illustrate issues that often arise during EDC validation.

Initiate

affirm_init(replace = TRUE)
#> ✔ We're ready to make data affirmations...

# this option sets the ID columns that are auto-selected for the report listings
# (by default, listings show these IDs and the columns that appear in the condition being tested)
options('affirm.id_cols' = "SUBJECT")

Affirm

Affirm that there are no missing subject IDs

affirm_true(
  RAND,
  label = "RAND: Subject ID is not missing",
  condition = !is.na(SUBJECT),
  id = 1L,
  priority = 1,
  data_frames = "RAND"
)
#> • RAND: Subject ID is not missing
#>   0 issues identified.
#> # A tibble: 5 × 3
#>   SUBJECT RAND_GROUP RAND_STRATA
#>     <dbl> <chr>      <chr>      
#> 1       1 Drug A     <65yr      
#> 2       2 Drug B     >=65yr     
#> 3       3 Drug A     <65yr      
#> 4       4 Drug B     >=65yr     
#> 5       5 NA         >=65yr

This time we’ll affirm that the randomization assignment is not missing. If it is missing, we take the further action of removing those rows from the returned data frame.

RAND <-
  affirm_true(
    RAND,
    label = "RAND: Randomization Group is not missing",
    condition = !is.na(RAND_GROUP),
    data_action = filter(., !is.na(RAND_GROUP)),
    id = 2L,
    priority = 1,
    data_frames = "RAND"
  )
#> • RAND: Randomization Group is not missing
#>   1 issue identified.
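
As a rough illustration of what the data_action expression does, the sketch below applies the same filter with dplyr alone. RAND_example is a hypothetical stand-in rebuilt from the output printed earlier, and the snippet only mimics the effect of the action; it does not show how affirm evaluates it internally.

```r
library(dplyr)

# Hypothetical stand-in for RAND, copied from the output printed earlier
RAND_example <- tibble(
  SUBJECT = 1:5,
  RAND_GROUP = c("Drug A", "Drug B", "Drug A", "Drug B", NA),
  RAND_STRATA = c("<65yr", ">=65yr", "<65yr", ">=65yr", ">=65yr")
)

# Rows failing the condition (the single issue reported above)
n_issues <- sum(is.na(RAND_example$RAND_GROUP))

# Same expression as data_action, with `.` replaced by the data frame
kept <- filter(RAND_example, !is.na(RAND_GROUP))

n_issues
nrow(kept)
```

The four remaining rows correspond to the four subjects carried into the next affirmation.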

In this affirmation, we merge in data from the DM data set, and check whether the reported subject age aligns with the age group in the randomization stratification variable.

RAND |>
  left_join(
    DM |> prepend_df_name() |> select(SUBJECT, DM.AGE),
    by = "SUBJECT"
  ) |> 
  affirm_true(
    label = "RAND: Randomization strata match recorded subject age",
    condition =
      (RAND_STRATA %in% "<65yr" & DM.AGE < 65) | (RAND_STRATA %in% ">=65yr" & DM.AGE >= 65),
    id = 3L,
    priority = 1,
    data_frames = "RAND, DM"
  )
#> • RAND: Randomization strata match recorded subject age
#>   1 issue identified.
#> # A tibble: 4 × 4
#>   SUBJECT RAND_GROUP RAND_STRATA DM.AGE
#>     <dbl> <chr>      <chr>        <dbl>
#> 1       1 Drug A     <65yr           40
#> 2       2 Drug B     >=65yr          70
#> 3       3 Drug A     <65yr           50
#> 4       4 Drug B     >=65yr          60
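
To see which row fails, the condition can be evaluated by hand in base R. The sketch below uses a hypothetical data frame named checked, with values copied from the output above. It also shows one plausible reason for writing the condition with %in% rather than ==: %in% never returns NA, so a missing stratum evaluates to FALSE instead of NA.

```r
# Values copied from the printed output above
checked <- data.frame(
  SUBJECT = 1:4,
  RAND_STRATA = c("<65yr", ">=65yr", "<65yr", ">=65yr"),
  DM.AGE = c(40, 70, 50, 60)
)

# Same logical condition as in the affirmation
strata_matches_age <- with(
  checked,
  (RAND_STRATA %in% "<65yr" & DM.AGE < 65) |
    (RAND_STRATA %in% ">=65yr" & DM.AGE >= 65)
)

# Subjects whose recorded age disagrees with the stratum
failing_subjects <- checked$SUBJECT[!strata_matches_age]
failing_subjects

# Behaviour with a missing value
na_with_in <- NA %in% "<65yr"  # FALSE
na_with_eq <- NA == "<65yr"    # NA
```

Subject 4 is stratified as >=65yr but has a recorded age of 60, which accounts for the single issue identified.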

In this example, we will modify the data frame that will be reported to a data management team. We will return all rows from the data frame and include a flag for rows with bad inputs.

affirm_true(
  DM,
  label = "DM: Subject race is one of 'Asian', 'Black or African American', 'Native Hawaiian or Other Pacific Islander', 'American Indian or Alaska Native', 'White'",
  condition = RACE %in% c('Asian', 'Black or African American', 'Native Hawaiian or Other Pacific Islander', 'American Indian or Alaska Native', 'White'),
  report_listing =
    select(., SUBJECT, RACE) |> 
    mutate(..flag.. = ifelse(!lgl_condition, label, NA)),
  id = 4L,
  data_frames = "DM"
)
#> • DM: Subject race is one of 'Asian', 'Black or African American', 'Native
#>   Hawaiian or Other Pacific Islander', 'American Indian or Alaska Native',
#>   'White'
#>   1 issue identified.
#> # A tibble: 4 × 3
#>   SUBJECT   AGE RACE                                     
#>     <dbl> <dbl> <chr>                                    
#> 1       1    40 Asian                                    
#> 2       2    70 Black or African American                
#> 3       3    50 Native American                          
#> 4       4    60 Native Hawaiian or Other Pacific Islander

# we'll take a peek at the 'report_listing' data frame now
affirm_report_raw_data() |> 
  filter(id == 4L) |> 
  pull(data)
#> [[1]]
#> # A tibble: 4 × 3
#>   SUBJECT RACE                                      ..flag..                    
#>     <dbl> <chr>                                     <chr>                       
#> 1       1 Asian                                     NA                          
#> 2       2 Black or African American                 NA                          
#> 3       3 Native American                           DM: Subject race is one of …
#> 4       4 Native Hawaiian or Other Pacific Islander NA

Report

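Get a summary of the collection of data affirmations in a gt table with affirm_report_gt(). The table includes each affirmation's ID, label, priority, data frames, and columns, along with the number of errors, the total number of checks, and the error rate.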

affirm_report_gt()

ID	Affirmation	Priority	Data Frames	Columns	No. Errors	Total No. Checks	Error Rate
1	RAND: Subject ID is not missing	1	RAND	SUBJECT	0	5	0.0%
2	RAND: Randomization Group is not missing	1	RAND	RAND_GROUP	1	5	20.0%
3	RAND: Randomization strata match recorded subject age	1	RAND, DM	RAND_STRATA, DM.AGE	1	4	25.0%
4	DM: Subject race is one of 'Asian', 'Black or African American', 'Native Hawaiian or Other Pacific Islander', 'American Indian or Alaska Native', 'White'	—	DM	RACE	1	4	25.0%

Validate Derived Variables

Using EDC data to derive new variables requires a different style of data validation. When validating raw EDC data, we must report bad or inconsistent data to a data manager, who will then investigate and correct the data in the source database. When validating derived variables based on raw EDC data, we make assumptions about the data. Validations can be used to ensure that whatever assumptions we made on the day we first derived a new variable are still met as the raw EDC data continues to be updated.

For example, imagine you are classifying tumor locations into a broader tumor region variable. The first time you write the code, you will classify every location into a broader region, but there is no way to know what may be entered as a new tumor location in the future. Therefore, you can write a validation that each location is mapped to a region. If a location is not mapped, rather than reporting this to a data management team, you may opt to return an error so you know that the new location needs to be handled. To return an error, pass error = TRUE to affirm_true(). The error message will reference the affirmation label, making it clear why a script has erred.

In this case, you may want to set the following option at the top of a script that derives analysis variables.

options("affirm.error" = TRUE)
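
The tumor-location scenario described above might look like the following sketch. The TUMOR data, its column names, and the location-to-region mapping are all hypothetical and exist only for illustration. The tryCatch() wrapper is there so the snippet can run to completion; in a real derivation script you would likely let the error stop the script.

```r
library(affirm)
library(dplyr)

affirm_init(replace = TRUE)

# Hypothetical data: "Pancreas" stands in for a location entered
# after the mapping below was first written
TUMOR <- tibble(
  SUBJECT = 1:4,
  TUMOR_LOCATION = c("Lung", "Colon", "Rectum", "Pancreas")
)

# Hypothetical mapping; locations not listed here become NA
TUMOR_DERIVED <- TUMOR |>
  mutate(
    TUMOR_REGION = case_when(
      TUMOR_LOCATION %in% "Lung" ~ "Thoracic",
      TUMOR_LOCATION %in% c("Colon", "Rectum") ~ "Colorectal"
    )
  )

# error = TRUE asks affirm_true() to raise an error when the condition fails
result <- tryCatch(
  affirm_true(
    TUMOR_DERIVED,
    label = "TUMOR: Every tumor location is mapped to a region",
    condition = !is.na(TUMOR_REGION),
    id = 1L,
    data_frames = "TUMOR",
    error = TRUE
  ),
  error = function(e) e
)

# Expected to be TRUE here, because "Pancreas" has no region
inherits(result, "error")
```

Once the new location is added to the mapping, the same affirmation should pass and the script can continue.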

Every newly derived variable should be associated with multiple affirmations to ensure the derivation remains correct into the future.