Loading Data from OMOP Common Data Model into EHRData#

This tutorial demonstrates how to load electronic health record data from the OMOP Common Data Model (CDM) with ehrdata.

What is OMOP CDM?#

The OMOP CDM standardizes observational healthcare data across different sources to enable reproducible, multi-site analyses.

We’ll use the MIMIC-IV demo dataset (100 patients) in OMOP format as our example [KTP+21] [GAG+00].

Setup and Data Download#

Note

ehrdata provides ready-to-use OMOP datasets!

The ed.dt.mimic_iv_omop() function automatically:

  1. Downloads a real data sample from MIMIC IV in an OMOP table format

  2. Loads the tables into a database connection

  3. Makes them ready for you to construct EHRData objects

This workflow mirrors how ehrdata can be applied to any dataset in the OMOP CDM format: (Download OMOP tables) → Connect to databaseCreate EHRData structure (using arguments explained below)

import ehrdata as ed
import duckdb

We’ll use ehrdata’s built-in mimic_iv_omop() function, which downloads the data as OMOP tables and loads them into a database connection:

# Create an in-memory database
con = duckdb.connect(":memory:")

# Download data and load OMOP tables into the database
# This creates tables like: person, measurement, observation, etc.
data_path = ed.dt.mimic_iv_omop(backend_handle=con)
print(f"Data downloaded to: {data_path}")

Quick Look at OMOP Tables#

The OMOP CDM organizes data into standardized tables. Let’s explore them:

# See what tables we have
tables = con.execute("SHOW TABLES").df()
print(f"Number of tables: {len(tables)}")
print("\nAvailable tables:")
print(tables)
Number of tables: 30

Available tables:
                    name
0              care_site
1             cdm_source
2                 cohort
3      cohort_definition
4                concept
5   concept_relationship
6          condition_era
7   condition_occurrence
8                   cost
9                  death
10       device_exposure
11              dose_era
12              drug_era
13         drug_exposure
14     fact_relationship
15              location
16           measurement
17              metadata
18                  note
19              note_nlp
20           observation
21    observation_period
22     payer_plan_period
23                person
24  procedure_occurrence
25              provider
26              specimen
27          visit_detail
28      visit_occurrence
29            vocabulary

The person table#

The person table contains demographic information about each patient.

person_df = con.execute("SELECT * FROM person LIMIT 5").df()
person_df.head()
person_id gender_concept_id year_of_birth month_of_birth day_of_birth birth_datetime race_concept_id ethnicity_concept_id location_id provider_id care_site_id person_source_value gender_source_value gender_source_concept_id race_source_value race_source_concept_id ethnicity_source_value ethnicity_source_concept_id
0 3589912774911670296 8507 2095 None None NaT 0 38003563 None None None 10009628 M 0 None 0 HISPANIC/LATINO 2000001408
1 -3210373572193940939 8507 2079 None None NaT 0 38003563 None None None 10011398 M 0 None 0 HISPANIC/LATINO 2000001408
2 -775517641933593374 8507 2149 None None NaT 8516 0 None None None 10004235 M 0 BLACK/AFRICAN AMERICAN 2000001406 None 0
3 -2575767131279873665 8507 2050 None None NaT 8516 0 None None None 10024043 M 0 BLACK/AFRICAN AMERICAN 2000001406 None 0
4 -8970844422700220177 8507 2114 None None NaT 8527 0 None None None 10038933 M 0 WHITE 2000001404 None 0

The measurement table#

The measurement table contains lab values, vital signs, and other measurements:

measurement_df = con.execute("SELECT * FROM measurement LIMIT 5").df()
measurement_df.head()
measurement_id person_id measurement_concept_id measurement_date measurement_datetime measurement_time measurement_type_concept_id operator_concept_id value_as_number value_as_concept_id unit_concept_id range_low range_high provider_id visit_occurrence_id visit_detail_id measurement_source_value measurement_source_concept_id unit_source_value value_source_value
0 7620661609057829801 -7437341330444582833 3007913 2113-09-14 2113-09-14 10:41:00 None 32856 None 586.0 <NA> 8876 NaN NaN None 3697313480337443666 None 50801 2000001001 mm Hg 586
1 -6166868866082303206 -2312013739856114142 3012501 2116-07-05 2116-07-05 05:51:00 None 32856 None -4.0 <NA> 9557 NaN NaN None -5005846256467230136 None 50802 2000001002 mEq/L -4
2 -5240588523649662838 -4234372750442829205 3012501 2154-01-02 2154-01-02 20:18:00 None 32856 None -2.0 <NA> 9557 NaN NaN None -5317261811030552609 None 50802 2000001002 mEq/L -2
3 6455538495061268502 8805478484003283429 3012501 2114-06-20 2114-06-20 09:59:00 None 32856 None 3.0 <NA> 9557 NaN NaN None 4821424463988433938 None 50802 2000001002 mEq/L 3
4 8972684215776493449 6339505631013617478 3012501 2111-11-15 2111-11-15 03:09:00 None 32856 None -2.0 <NA> 9557 NaN NaN None -6354047184485090226 None 50802 2000001002 mEq/L -2

Building an EHRData Object from OMOP#

Now let’s construct an EHRData object from the OMOP database. We’ll do this in three steps:

  1. Setup rows (.obs) - Define what each row represents (patients, visits, etc.)

  2. Setup variables (.var and .layers) - Extract clinical measurements as time series

  3. Explore the result - See how OMOP data maps to EHRData structure

Step 1: Setup rows (.obs)#

The first step is to define what each row in our final EHRData object should represent. We use ed.io.omop.setup_obs() for this.

In OMOP, we can choose different observation units:

  • "person" - Each row is one patient

  • "person_visit_occurrence" - Each row is one hospital visit

  • "person_observation_period" - Each row is one observation period

  • "person_cohort”` - Each row is a person of a cohort

Important

The choice of observation table determines the reference timepoint (t=0) for time series data:

Observation Table

Timepoint 0 Reference

"person"

Patient’s birth date

"person_visit_occurrence"

Start of the hospital visit

"person_observation_period"

Start of the observation period

"person_cohort"

Start of the cohort

This allows you to leverage the temporal information already defined in the OMOP database to align your time series data to clinically meaningful reference points.

For this tutorial, we’ll use "person_visit_occurrence" because we care about monitoring and modeling individual hospital visits. This is particularly useful when:

  • Patients have multiple hospital encounters

  • Each visit has distinct clinical characteristics

  • You want to predict outcomes at the visit level (e.g., in-hospital mortality, length of stay)

Let’s see the first step in action:

# Step 1: Setup observations from the person table
edata = ed.io.omop.setup_obs(
    backend_handle=con,
    observation_table="person_visit_occurrence",
    death_table=True,  # Include death information
)
edata
EHRData object with n_obs × n_vars × n_t = 852 × 0 × 1
    obs: 'person_id', 'gender_concept_id', 'year_of_birth', 'month_of_birth', 'day_of_birth', 'birth_datetime', 'race_concept_id', 'ethnicity_concept_id', 'location_id', 'provider_id', 'care_site_id', 'person_source_value', 'gender_source_value', 'gender_source_concept_id', 'race_source_value', 'race_source_concept_id', 'ethnicity_source_value', 'ethnicity_source_concept_id', 'visit_occurrence_id', 'person_id_1', 'visit_concept_id', 'visit_start_date', 'visit_start_datetime', 'visit_end_date', 'visit_end_datetime', 'visit_type_concept_id', 'provider_id_1', 'care_site_id_1', 'visit_source_value', 'visit_source_concept_id', 'admitting_source_concept_id', 'admitting_source_value', 'discharge_to_concept_id', 'discharge_to_source_value', 'preceding_visit_occurrence_id', 'death_date', 'death_datetime', 'death_type_concept_id', 'cause_concept_id', 'cause_source_value', 'cause_source_concept_id'
    uns: 'omop_io_observation_table'

Let’s examine the .obs table - it contains patient demographics from the OMOP person table:

edata.obs.head()
person_id gender_concept_id year_of_birth month_of_birth day_of_birth birth_datetime race_concept_id ethnicity_concept_id location_id provider_id ... admitting_source_value discharge_to_concept_id discharge_to_source_value preceding_visit_occurrence_id death_date death_datetime death_type_concept_id cause_concept_id cause_source_value cause_source_concept_id
0 -4502092208250381979 8532 2071 None None NaT 8527 0 None None ... TRANSFER FROM HOSPITAL 581476 HOME HEALTH CARE <NA> NaT NaT NaN NaN NaN NaN
1 4239478333578644568 8507 2111 None None NaT 8527 0 None None ... PHYSICIAN REFERRAL 581476 HOME <NA> NaT NaT NaN NaN NaN NaN
2 -8090189584974691216 8507 2118 None None NaT 8527 0 None None ... EMERGENCY ROOM 581476 HOME <NA> NaT NaT NaN NaN NaN NaN
3 2188642953583197091 8532 2102 None None NaT 8527 0 None None ... None <NA> None -6726268352887624443 NaT NaT NaN NaN NaN NaN
4 3129727379702505063 8532 2145 None None NaT 8516 0 None None ... EMERGENCY ROOM <NA> None <NA> NaT NaT NaN NaN NaN NaN

5 rows × 41 columns

We can find person-level information here: columns such as gender_concept_id, and year_of_birth have been loaded from the person table.

We can further find visit-level information:

edata.obs.iloc[:, 18:35].head()
visit_occurrence_id person_id_1 visit_concept_id visit_start_date visit_start_datetime visit_end_date visit_end_datetime visit_type_concept_id provider_id_1 care_site_id_1 visit_source_value visit_source_concept_id admitting_source_concept_id admitting_source_value discharge_to_concept_id discharge_to_source_value preceding_visit_occurrence_id
0 -9176297757944464068 -4502092208250381979 581385 2154-02-05 2154-02-05 17:09:00 2154-02-09 2154-02-09 15:15:00 32817 None None 10018328|26706939 2000001807 8717 TRANSFER FROM HOSPITAL 581476 HOME HEALTH CARE <NA>
1 -9149771978458038515 4239478333578644568 8883 2177-03-12 2177-03-12 07:15:00 2177-03-19 2177-03-19 14:25:00 32817 None None 10022880|27708593 2000001808 38004207 PHYSICIAN REFERRAL 581476 HOME <NA>
2 -9133360720296560252 -8090189584974691216 8870 2174-05-26 2174-05-26 04:20:00 2174-05-31 2174-05-31 14:15:00 32817 None None 10009049|22995465 2000001806 8870 EMERGENCY ROOM 581476 HOME <NA>
3 -9128519808785176541 2188642953583197091 38004207 2146-08-26 2146-08-26 15:36:00 2146-08-26 2146-08-26 15:36:00 32817 None None 10008287|2146-08-26 2000001801 <NA> None <NA> None -6726268352887624443
4 -9127810274408915712 3129727379702505063 581385 2197-04-16 2197-04-16 22:57:00 2197-04-17 2197-04-17 09:48:00 32817 None None 10002930|25282382 2000001805 8870 EMERGENCY ROOM <NA> None <NA>

Since we used the visit_occurrence table to define our rows.

Particularly interesting here is the visit_concept_id, encoding information about what type the visit is, as well as the visit_start_datetime.

Finally, the .obs column contains information about a persons death:

edata.obs.iloc[:, 35:].head(12)
death_date death_datetime death_type_concept_id cause_concept_id cause_source_value cause_source_concept_id
0 NaT NaT NaN NaN NaN NaN
1 NaT NaT NaN NaN NaN NaN
2 NaT NaT NaN NaN NaN NaN
3 NaT NaT NaN NaN NaN NaN
4 NaT NaT NaN NaN NaN NaN
5 NaT NaT NaN NaN NaN NaN
6 NaT NaT NaN NaN NaN NaN
7 NaT NaT NaN NaN NaN NaN
8 NaT NaT NaN NaN NaN NaN
9 NaT NaT NaN NaN NaN NaN
10 NaT NaT NaN NaN NaN NaN
11 2155-12-07 2155-12-07 15:30:00 32817.0 0.0 None 0.0

Step 2: Setup Variables (.var and .layers)#

Now we extract clinical measurements from the OMOP measurement table and organize them as time series data.

We need to specify:

  • data_tables: Which OMOP tables to use (e.g., measurement, observation). Here, we are interested in the measurement table.

  • data_field_to_keep: Which field contains the values we want to extract (e.g., "value_as_number").

  • time_precision: Whether to look for date or datetime fields in the OMOP tables. Since we care about hourly intervals, datetime is more suitable.

  • interval_length and num_intervals: How to discretize time. Here, we choose to create intervals of 1h, and create 24 of these intervals. This monitors the first 24h after the beginning of the visit stay here.

edata = ed.io.omop.setup_variables(
    edata=edata,
    layer="tem_data",
    backend_handle=con,
    data_tables=["measurement"],
    data_field_to_keep=["value_as_number"],
    time_precision="datetime",
    interval_length_number=1,
    interval_length_unit="h",
    num_intervals=24,
)

Step 3: Explore the Result#

Let’s examine what was constructed from the OMOP database:

First, let’s compare the original OMOP person table with the .obs attribute of our EHRData:

edata.obs[["person_id", "visit_occurrence_id", "visit_start_datetime", "gender_source_value", "year_of_birth"]]
person_id visit_occurrence_id visit_start_datetime gender_source_value year_of_birth
0 -4502092208250381979 -9176297757944464068 2154-02-05 17:09:00 F 2071
1 4239478333578644568 -9149771978458038515 2177-03-12 07:15:00 M 2111
2 -8090189584974691216 -9133360720296560252 2174-05-26 04:20:00 M 2118
3 2188642953583197091 -9128519808785176541 2146-08-26 15:36:00 F 2102
4 3129727379702505063 -9127810274408915712 2197-04-16 22:57:00 F 2145
... ... ... ... ... ...
847 7918537411740862407 9159184765222535078 2132-02-12 09:43:00 F 2055
848 -626229666378242477 9165125063680661115 2174-11-28 10:30:00 M 2117
849 4985579811051920670 9197703010583516730 2112-02-05 14:48:00 F 2064
850 7131048714591189903 9197704996243617072 2191-11-18 15:00:00 M 2133
851 4985579811051920670 9218061359648594772 2111-06-22 10:37:00 F 2064

852 rows × 5 columns

print(f"Total patients: {edata.obs['person_id'].nunique()}")
print(f"Total visits: {len(edata)}")
Total patients: 100
Total visits: 852

The .var table contains information about each clinical variable (mapped from OMOP concepts):

edata.var.head(10)
data_table_concept_id
0 0
1 1175625
2 3000067
3 3000068
4 3000099
5 3000285
6 3000330
7 3000348
8 3000456
9 3000461

Understanding .layers - Time Series Data#

The .layers tensor contains the time series data with shape (visits × variables × time).

edata.layers
Layers3D with keys: tem_data
print(f"edata.layers['tem_data'] shape: {edata.layers['tem_data'].shape}")
print(f"  → {edata.layers['tem_data'].shape[0]} visits")
print(f"  → {edata.layers['tem_data'].shape[1]} variables")
print(f"  → {edata.layers['tem_data'].shape[2]} time intervals")
edata.layers['tem_data'] shape: (852, 450, 24)
  → 852 visits
  → 450 variables
  → 24 time intervals

Let’s explore this for a specific visit, and the measurement concept id 3016723, which represents a measurement of Creatinine [Mass/volume] in Serum or Plasma.

Note

You can look up concept ids such as 3016723 using OHDSI’s Athena.

visit_occurrence_id = -9149771978458038515
variable_concept_id = 3016723

visit_index = edata.obs["visit_occurrence_id"] == visit_occurrence_id
variable_index = edata.var["data_table_concept_id"] == variable_concept_id

patient_id = edata.obs[visit_index]["person_id"]
visit_id = edata.obs[visit_index]["visit_occurrence_id"]
visit_start = edata.obs[visit_index]["visit_start_datetime"]


print(f"Example: Visit {visit_occurrence_id} (Patient {patient_id})")
print(f"Visit start: {visit_start}")

# Show measurements for this visit from the database
print("\nOriginal measurements from OMOP for this visit:")
measurements = con.execute(f"""
    SELECT measurement_datetime, measurement_concept_id, value_as_number
    FROM measurement
    WHERE visit_occurrence_id = {visit_occurrence_id}
    AND measurement_concept_id = {variable_concept_id}
    ORDER BY measurement_datetime
    LIMIT 1
""").df()

measurements
Example: Visit -9149771978458038515 (Patient 1    4239478333578644568
Name: person_id, dtype: int64)
Visit start: 1   2177-03-12 07:15:00
Name: visit_start_datetime, dtype: datetime64[us]

Original measurements from OMOP for this visit:
measurement_datetime measurement_concept_id value_as_number
0 2177-03-12 12:47:00 3016723 1.1

Now let’s see how to access this data in the .layers tensor:

Looking at the measurement above:

  • Visit start: 2177-03-12 07:15:00

  • Measurement time: 2177-03-12 12:47:00

  • Time elapsed: ~5.5 hours (5 hours and 32 minutes)

This measurement falls into interval 5 (the time bin from 5-6 hours after visit start). Since we’re using 1-hour intervals starting from the visit start time, the measurement at 5.5 hours gets aggregated into the 6th hourly bin (index 5, since we start counting from 0).

print(edata[visit_index, variable_index, :].layers["tem_data"])
[[[nan nan nan nan nan 1.1 nan nan nan nan nan nan nan nan nan nan nan
   nan nan nan nan nan nan nan]]]

ehrapy provides functions for exploration, vizualisation, and analysis of data in the EHRData structure - check out its tutorials for insights that do not rely on the basic print statements we used here for simplicity.

How OMOP Maps to EHRData#

Here’s how the OMOP CDM tables map to the EHRData structure:

OMOP CDM

EHRData

Description

person table

.obs rows

Each patient becomes an observation

person columns

.obs columns

Demographics (age, gender, etc.)

measurement concepts

.var rows

Each unique concept_id becomes a variable

measurement values

.layers tensor

Time series data (patients × variables × time)

concept table

.var enrichment

Concept names and metadata

Time intervals

.tem

Discretized time points

Key transformations:

  • OMOP’s long format (many rows per patient) → EHRData’s tensor format (3D array)

  • OMOP’s timestamp-based data → Discretized time intervals

  • OMOP concept IDs → Human-readable variable names (via enrichment)

Next Tutorial#

Continue with OMOP Machine Learning to learn how ehrdata quickstarts ML workflows based on an OMOP dataset.

Further Resources#