ehrdata.io.omop.setup_variables

ehrdata.io.omop.setup_variables(edata, *, backend_handle, layer=None, data_tables, data_field_to_keep, interval_length_number, interval_length_unit, time_precision='date', num_intervals, concept_ids='all', aggregation_strategy='last', enrich_var_with_feature_info=False, enrich_var_with_unit_info=False, instantiate_tensor=True)

Extracts selected data-point tables (measurement, observation, or specimen) from the OMOP CDM.

The distinct concept_ids encountered in the selected tables form the variables in the EHRData object. Within each data table, the variables are sorted by concept_id in ascending order; the per-table blocks are then stacked in the order in which data_tables is specified.
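The ordering rule above can be illustrated with a minimal sketch. The helper order_variables and the concept IDs are hypothetical and serve only to show the sort-then-stack behavior; this is not the library's implementation.

```python
from typing import Mapping, Sequence


def order_variables(
    concept_ids_by_table: Mapping[str, Sequence[int]],
    data_tables: Sequence[str],
) -> list:
    """Hypothetical helper: sort distinct concept IDs per table, then stack by table order."""
    ordered = []
    for table in data_tables:
        # Deduplicate, then sort ascending within the table.
        for concept_id in sorted(set(concept_ids_by_table.get(table, ()))):
            ordered.append((table, concept_id))
    return ordered


# Made-up concept IDs, for illustration only.
encountered = {
    "measurement": [3004249, 3000963, 3004249],
    "observation": [4058286, 40766240],
}

# "observation" is listed first, so its variables come first.
variables = order_variables(encountered, ["observation", "measurement"])
print(variables)
```

Swapping the order of the tables in the second argument swaps the order of the two blocks, while the ascending order inside each block stays the same.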

The data_field_to_keep parameter specifies which Field in the selected table is to be used for the read-out of the value of a variable.

The function fails if there is more than one unit_concept_id per variable. It writes a unit report of the features to edata.uns['unit_report_<data_tables>'] and the setup arguments to edata.uns['omop_io_variable_setup'].
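As a rough illustration of the one-unit-per-variable rule, the sketch below flags features that were observed with more than one unit. The helper find_multi_unit_features and the IDs are hypothetical. It also assumes that a missing unit (None) counts as one distinct unit value, which is one reading of how missing units are treated; the library's actual check may differ.

```python
from collections import defaultdict


def find_multi_unit_features(rows):
    """Hypothetical check: return features observed with more than one unit.

    `rows` is an iterable of (concept_id, unit_concept_id) pairs for observed
    data points. Assumption: a missing unit (None) is one unit value of its own.
    """
    units = defaultdict(set)
    for concept_id, unit_concept_id in rows:
        units[concept_id].add(unit_concept_id)
    return {cid: seen for cid, seen in units.items() if len(seen) > 1}


# Made-up concept and unit IDs, for illustration only.
rows = [
    (3004249, 8876), (3004249, 8876),  # one unit: fine
    (3000963, None), (3000963, None),  # always missing: treated as one unit
    (3012888, 8876), (3012888, None),  # unit and missing unit mixed: conflict
]

conflicts = find_multi_unit_features(rows)
print(conflicts)
```

A non-empty result corresponds to the situation in which the function would be expected to fail.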

Stores one or more tables named long_person_timestamp_feature_value_<data_table> in long format in the RDBMS. The table is instantiated into edata.r if instantiate_tensor is set to True; otherwise, it is only stored in the RDBMS for later use.

Parameters:
edata

Data object to which the variables should be added.

backend_handle DuckDBPyConnection

The backend handle to the database.

layer str | None (default: None)

The layer to store the data in. If not specified, uses X.

data_tables Sequence[Literal['measurement', 'observation', 'specimen']] | Literal['measurement', 'observation', 'specimen']

The tables to be used.

data_field_to_keep str | Sequence[str] | dict[str, str | Sequence[str]]

The CDM Field in the data tables to be kept, e.g. 'value_as_number' or 'value_as_concept_id'. It can also be 'is_present' to get a one-hot encoding of the presence of the feature in a patient in an interval. If multiple data tables are used, this should be a dictionary specifying the data fields to keep per table. For example, if data_tables=['measurement', 'observation'], data_field_to_keep={'measurement': 'value_as_number', 'observation': 'value_as_number'}.

time_precision Literal['date', 'datetime'] (default: 'date')

The precision of the timestamp used in the table indicated in setup_obs(). If "date", uses the date field (e.g. visit_start_date for "person_visit_occurrence"). If "datetime", uses the datetime field (e.g. visit_start_datetime for "person_visit_occurrence").

interval_length_number int

Numeric value of the length of one interval.

interval_length_unit str

Unit of the interval length; must be a unit accepted by pandas.Timedelta.

num_intervals int

Number of intervals.

concept_ids Literal['all'] | Sequence[int] (default: 'all')

Concept IDs to use from the data tables. If not specified, all concept IDs are used.

aggregation_strategy Literal['last', 'first', 'mean', 'median', 'mode', 'sum', 'count', 'min', 'max', 'std'] (default: 'last')

Strategy to use when aggregating multiple data points within one interval.

enrich_var_with_feature_info bool (default: False)

Whether to enrich the var table with feature information. If a concept_id is not found in the concept table, its alternate concept_id from the concept_relationship table is retrieved to add the available feature information. Otherwise the feature information will be NaN.

enrich_var_with_unit_info bool (default: False)

Whether to enrich the var table with unit information. Raises an error if multiple units are found for at least one feature. For entirely missing data points, the units are ignored. For observed data points with missing unit information (NULL in either 'unit_concept_id' or 'unit_source_value'), the value NULL/NaN is considered a single unit.

instantiate_tensor bool (default: True)

Whether to instantiate the tensor into the .r field of the EHRData object.
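To make the interplay of interval_length_number, interval_length_unit, num_intervals, and aggregation_strategy more concrete, here is a minimal standard-library sketch. The helper aggregate_by_interval is hypothetical, uses datetime.timedelta in place of pandas.Timedelta, assumes intervals are counted from a per-person reference timestamp called start, and marks empty intervals with None. It illustrates the idea only and does not reproduce the library's SQL-based implementation.

```python
from datetime import datetime, timedelta
from statistics import mean


def aggregate_by_interval(points, start, interval_length, num_intervals, strategy="last"):
    """Hypothetical sketch: bucket (timestamp, value) pairs into fixed intervals, then aggregate."""
    buckets = [[] for _ in range(num_intervals)]
    # Sort by timestamp so that "first" and "last" are well defined.
    for timestamp, value in sorted(points):
        index = (timestamp - start) // interval_length  # integer interval index
        if 0 <= index < num_intervals:
            buckets[index].append(value)
    reducers = {
        "last": lambda values: values[-1],
        "first": lambda values: values[0],
        "mean": mean,
        "min": min,
        "max": max,
        "sum": sum,
    }
    reduce = reducers[strategy]
    # Assumption: intervals without any data point are reported as None.
    return [reduce(values) if values else None for values in buckets]


start = datetime(2020, 1, 1)
interval_length = timedelta(days=20)  # interval_length_number=20, interval_length_unit="day"
points = [
    (datetime(2020, 1, 2), 5.0),   # interval 0
    (datetime(2020, 1, 15), 7.0),  # interval 0
    (datetime(2020, 2, 1), 1.0),   # interval 1
    (datetime(2020, 6, 1), 9.0),   # beyond the last interval, dropped
]

last_values = aggregate_by_interval(points, start, interval_length, 3, "last")
mean_values = aggregate_by_interval(points, start, interval_length, 3, "mean")
print(last_values, mean_values)
```

With three intervals of 20 days each, the covered time span is 60 days from start; data points outside that span do not contribute to any interval in this sketch.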

Returns:

An EHRData object with populated .var and, if instantiate_tensor is True, .r fields.

Examples

>>> import ehrdata as ed
>>> import duckdb
>>> con_gi = duckdb.connect(database=":memory:", read_only=False)
>>> ed.dt.gibleed_omop(
...     con_gi,
... )
>>> edata_gi = ed.io.omop.setup_obs(
...     con_gi,
...     observation_table="person_observation_period",
... )
>>> edata_gi = ed.io.omop.setup_variables(
...     edata=edata_gi,
...     backend_handle=con_gi,
...     layer="tem_data",
...     data_tables=["observation", "measurement"],
...     data_field_to_keep={"observation": "observation_source_value", "measurement": "is_present"},
...     interval_length_number=20,
...     interval_length_unit="day",
...     num_intervals=20,
...     concept_ids="all",
...     aggregation_strategy="last",
...     enrich_var_with_feature_info=True,
...     enrich_var_with_unit_info=True,
... )
>>> edata_gi