ehrdata.io.omop.setup_interval_variables

ehrdata.io.omop.setup_interval_variables(edata, *, backend_handle, layer=None, data_tables, data_field_to_keep, time_precision='date', interval_length_number, interval_length_unit, num_intervals, concept_ids='all', aggregation_strategy='last', enrich_var_with_feature_info=False, keep_date='start', instantiate_tensor=True)

Extracts selected tables of a time-span character from the OMOP CDM.

The distinct concept_ids encountered in the selected tables form the variables in the EHRData object. Within each data table, the variables are sorted by concept_id in ascending order; the per-table blocks are then stacked in the order in which the data_tables are specified. The data_field_to_keep parameter specifies which field in the selected table is read out as the value of a variable.
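The ordering rule above can be illustrated with a small, library-independent sketch. The concept_id values and the helper name variable_order are hypothetical and serve only as an illustration; this is not the function's actual implementation.

```python
# Hypothetical concept_ids observed per table (illustrative values only).
observed = {
    "drug_exposure": [1503297, 19078461, 1503297, 40163554],
    "condition_occurrence": [192671, 4112343, 192671],
}


def variable_order(data_tables, observed_concept_ids):
    """Return (data_table, concept_id) pairs in the documented variable order."""
    order = []
    for table in data_tables:  # tables keep the order in which they were passed
        # Distinct concept_ids of this table, sorted ascending.
        distinct_sorted = sorted(set(observed_concept_ids[table]))
        order.extend((table, concept_id) for concept_id in distinct_sorted)
    return order


variables = variable_order(["drug_exposure", "condition_occurrence"], observed)
for table, concept_id in variables:
    print(table, concept_id)
```

Passing the tables in the opposite order would move the condition_occurrence block in front of the drug_exposure block, while the ascending order inside each block stays the same.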

In contrast to setup_variables, tables without unit information can be present here. Hence, this function does not verify that a single unit per feature (=`concept_id`) is used, and it does not write a unit report. Should this be relevant for your work, please open an issue on theislab/ehrdata.

Stores one table per selected data table, named long_person_timestamp_feature_value_<data_table>, in long format in the RDBMS. These tables are instantiated into edata.r if instantiate_tensor is set to True; otherwise, they are only stored in the RDBMS for later use.

Parameters:
edata

Data object to which the variables should be added.

backend_handle DuckDBPyConnection

The backend handle to the database.

layer str | None (default: None)

The layer to store the data in. If not specified, it uses X.

data_tables Sequence[Literal['drug_exposure', 'condition_occurrence', 'procedure_occurrence', 'device_exposure', 'drug_era', 'dose_era', 'condition_era', 'episode']] | Literal['drug_exposure', 'condition_occurrence', 'procedure_occurrence', 'device_exposure', 'drug_era', 'dose_era', 'condition_era', 'episode']

The tables to be used.

data_field_to_keep str | Sequence[str] | dict[str, str | Sequence[str]]

The CDM field in the data tables to be kept. Can be e.g. 'value_as_number' or 'value_as_concept_id', provided the field exists in the selected table. Importantly, can be 'is_present' to obtain a one-hot encoding of the presence of the feature in a patient in an interval. If multiple data tables are used, this should be a dictionary specifying the data fields to keep per table. For example, if data_tables=['drug_exposure', 'condition_occurrence'], data_field_to_keep={'drug_exposure': 'is_present', 'condition_occurrence': 'is_present'}.

time_precision Literal['date', 'datetime'] (default: 'date')

The precision of the timestamp used in the table indicated in setup_obs(). If "date", uses the date field (e.g. visit_start_date for "person_visit_occurrence"). If "datetime", uses the datetime field (e.g. visit_start_datetime for "person_visit_occurrence").

interval_length_number int

Numeric value of the length of one interval.

interval_length_unit str

Unit of the interval length, needs to be a unit of pandas.Timedelta.

num_intervals int

Number of intervals.
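How the three interval parameters combine can be sketched without the library. The example below uses the standard library's datetime.timedelta as a stand-in for pandas.Timedelta, and it assumes half-open, consecutive intervals counted from a hypothetical reference date; the helper name interval_bounds and the reference date are illustrative assumptions, not part of ehrdata.

```python
from datetime import date, timedelta


def interval_bounds(start, interval_length_number, num_intervals):
    """Return (start, end) pairs for consecutive fixed-length day intervals."""
    # Stand-in for a pandas.Timedelta built from interval_length_number and unit "day".
    step = timedelta(days=interval_length_number)
    return [(start + i * step, start + (i + 1) * step) for i in range(num_intervals)]


# Hypothetical reference date; mirrors interval_length_number=20,
# interval_length_unit="day", num_intervals=20.
bounds = interval_bounds(date(2020, 1, 1), interval_length_number=20, num_intervals=20)
total_span = bounds[-1][1] - bounds[0][0]
print(len(bounds), total_span.days)
```

Under these assumptions, the covered time span is interval_length_number times num_intervals in the chosen unit, so halving the interval length while doubling num_intervals covers the same span at a finer resolution.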

concept_ids Literal['all'] | Sequence[int] (default: 'all')

Concept IDs to use from the data tables. If not specified, all concept IDs found in the data tables are used.

aggregation_strategy Literal['last', 'first', 'mean', 'median', 'mode', 'sum', 'count', 'min', 'max', 'std'] (default: 'last')

Strategy to use when aggregating multiple data points within one interval.
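What each strategy means for several data points that fall into the same interval can be sketched in plain Python. This is an illustration only: the function itself aggregates inside the database, the helper name aggregate is hypothetical, and treating 'std' as the sample standard deviation is an assumption.

```python
import statistics

# Maps each strategy name to a function over the time-ordered values of one interval.
AGGREGATORS = {
    "last": lambda values: values[-1],
    "first": lambda values: values[0],
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
    "sum": sum,
    "count": len,
    "min": min,
    "max": max,
    "std": statistics.stdev,  # assumption: sample standard deviation
}


def aggregate(values, strategy="last"):
    """Reduce the time-ordered values of one interval to a single value."""
    return AGGREGATORS[strategy](values)


# Hypothetical measurements of one feature for one person within one interval.
values = [2.0, 4.0, 4.0, 6.0]
for name in AGGREGATORS:
    print(name, aggregate(values, name))
```

The 'last' and 'first' strategies depend on the temporal order of the data points, whereas the remaining strategies depend only on the values themselves.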

enrich_var_with_feature_info bool (default: False)

Whether to enrich the var table with feature information. If a concept_id is not found in the concept table, its alternate concept_id from the concept_relationship table is retrieved and used to add the available feature information. If no such information is available, the feature information will be NaN.

keep_date Literal['start', 'end', 'interval'] (default: 'start')

Whether to keep the start or end date, or the interval span.

instantiate_tensor bool (default: True)

Whether to instantiate the tensor into the .r field of the EHRData object.

Returns:

An EHRData object with fields.

Examples

>>> import ehrdata as ed
>>> import duckdb
>>> con_gi = duckdb.connect(database=":memory:", read_only=False)
>>> ed.dt.gibleed_omop(
...     con_gi,
... )
>>> edata_gi = ed.io.omop.setup_obs(
...     con_gi,
...     observation_table="person_observation_period",
... )
>>> edata_gi = ed.io.omop.setup_interval_variables(
...     edata=edata_gi,
...     backend_handle=con_gi,
...     layer="tem_data",
...     data_tables=["drug_exposure", "condition_occurrence"],
...     data_field_to_keep={"drug_exposure": "is_present", "condition_occurrence": "is_present"},
...     interval_length_number=20,
...     interval_length_unit="day",
...     num_intervals=20,
...     concept_ids="all",
...     aggregation_strategy="last",
...     enrich_var_with_feature_info=True,
... )
>>> edata_gi