cohort-joiner

cohort-joiner performs a database-style left join between several left-hand cohorts and a single right-hand cohort. Typically, the left-hand cohorts are several weekly or monthly extracts of pseudonymised patient data for variables that are expected to change by week or by month. The right-hand cohort is a single extract of these data for variables that are not expected to change. Extracting and joining the cohorts in this way is more efficient than several weekly or monthly extracts of these data for all variables.

Using cohort-joiner has several advantages over writing a scripted action:

Compatibility: cohort-joiner uses the same logic as cohort-extractor to save files, meaning that it's compatible with "downstream" actions, such as the measures framework.
Efficiency: cohort-joiner uses an efficient strategy for joining the extracts. This strategy uses roughly 2.9 times less memory than an alternative, previously documented, strategy.
Transparency: cohort-joiner doesn't replace the extracts; instead, it saves the joined extracts in a new output directory. Replacing the extracts makes it harder to construct an audit trail, which reduces computational and analytical transparency; core principles of the OpenSAFELY platform.

Usage

In summary:

Use cohort-extractor to extract the left-hand cohorts. These are several weekly or monthly extracts for variables that are expected to change, such as whether a patient has a systolic blood pressure record.
Use cohort-extractor to extract the right hand cohort. This is a single extract for variables that are not expected to change, such as ethnicity.
Use cohort-joiner to join the cohorts, and then to save them to an output directory.
Optionally, use cohort-extractor to generate one or more measures from the joined cohorts.

Let's walk through an example project.yaml.

The following cohort-extractor action extracts the left-hand cohorts:

generate_cohort:
  run: >
    cohortextractor:latest generate_cohort
      --study-definition study_definition
      --index-date-range "2021-01-01 to 2021-06-30 by month"
  outputs:
    highly_sensitive:
      cohort: output/input_2021-*.csv

The following cohort-extractor action extracts the right hand cohort:

generate_ethnicity_cohort:
  run: >
    cohortextractor:latest generate_cohort
      --study-definition study_definition_ethnicity
  outputs:
    highly_sensitive:
      cohort: output/input_ethnicity.csv

The following cohort-joiner reusable action joins the cohorts, and then saves them to a new output directory. Remember to replace [version] with a cohort-joiner version:

join_cohorts:
  run: >
    cohort-joiner:[version]
      --lhs output/input_2021-*.csv
      --rhs output/input_ethnicity.csv
      --output-dir output/joined
  needs: [generate_cohort, generate_ethnicity_cohort]
  outputs:
    highly_sensitive:
      cohort: output/joined/input_2021-*.csv

Optionally, the following cohort-extractor action generates one or more measures from the joined cohorts:

generate_measures:
  run: >
    cohortextractor:latest generate_measures
      --study-definition study_definition
      --output-dir output/joined
  needs: [join_cohorts]
  outputs:
    moderately_sensitive:
      measure: output/joined/measure_*.csv

FAQs

Dummy data

I'm using the expectations framework to generate dummy data for the left-hand cohorts and the right-hand cohort. Why are so many rows in the joined cohorts missing data from the right-hand cohort?

Each time we run generate_cohort, the expectations framework generates random patient IDs. Consequently, patient IDs in the left-hand cohorts are independent of patient IDs in the right-hand cohort. In other words, even if the population of the right-hand cohort must include the population of the left-hand cohorts when using "real" data, this isn't the case when using dummy data that is generated by the expectations framework.

Because cohort-joiner performs a database-style left join, patient IDs in the left-hand cohorts are always in the joined cohorts. However, where these patient IDs are not joined to patient IDs in the right-hand cohort, there will be rows in the joined cohorts that are missing data from the right-hand cohort.

Notes for developers

Please see DEVELOPERS.md.