ehrdata.io.omop.setup_variables
- ehrdata.io.omop.setup_variables(edata, *, backend_handle, layer=None, data_tables, data_field_to_keep, interval_length_number, interval_length_unit, time_precision='date', num_intervals, concept_ids='all', aggregation_strategy='last', enrich_var_with_feature_info=False, enrich_var_with_unit_info=False, instantiate_tensor=True)
Extracts selected tables of a data-point character from the OMOP CDM.
The distinct concept_ids encountered in the selected tables form the variables in the EHRData object. The variables are sorted by concept_id in ascending order within each data table, and stacked together in the order in which the data_tables are specified.
The data_field_to_keep parameter specifies which Field in the selected table is used to read out the value of a variable.
The function fails if there is more than one unit_concept_id per variable. Writes a unit report of the features to edata.uns['unit_report_<data_tables>']. Writes the setup arguments into edata.uns['omop_io_variable_setup'].
Stores one table per data table, named long_person_timestamp_feature_value_<data_table>, in long format in the RDBMS. This table is instantiated into edata.r if instantiate_tensor is set to True; otherwise, the table is only stored in the RDBMS for later use.
- Parameters:
- edata
Data object to which the variables should be added.
- backend_handle
DuckDBPyConnection The backend handle to the database.
- layer
str | None (default: None) The layer to store the data in. If not specified, uses X.
- data_tables
Sequence[Literal['measurement', 'observation', 'specimen']] | Literal['measurement', 'observation', 'specimen'] The tables to be used.
- data_field_to_keep
str | Sequence[str] | dict[str, str | Sequence[str]] The CDM Field in the data tables to be kept, e.g. 'value_as_number' or 'value_as_concept_id'. Importantly, can be 'is_present' to obtain a one-hot encoding of the presence of the feature in a patient in an interval. If multiple data tables are used, this should be a dictionary that specifies the data fields to keep per table. For example, if data_tables=['measurement', 'observation'], data_field_to_keep={'measurement': 'value_as_number', 'observation': 'value_as_number'}.
- time_precision
Literal['date', 'datetime'] (default: 'date') The precision of the timestamp used in the table indicated in setup_obs(). If "date", uses the date field (e.g. visit_start_date for "person_visit_occurrence"). If "datetime", uses the datetime field (e.g. visit_start_datetime for "person_visit_occurrence").
- interval_length_number
int Numeric value of the length of one interval.
- interval_length_unit
str Unit of the interval length; needs to be a unit of pandas.Timedelta.
- num_intervals
int Number of intervals.
- concept_ids
Literal['all'] | Sequence[int] (default: 'all') Concept IDs to use from the data tables. If not specified, all concept IDs are used.
- aggregation_strategy
Literal['last', 'first', 'mean', 'median', 'mode', 'sum', 'count', 'min', 'max', 'std'] (default: 'last') Strategy to use when aggregating multiple data points within one interval.
- enrich_var_with_feature_info
bool (default: False) Whether to enrich the var table with feature information. If a concept_id is not found in the concept table, its alternate concept_id from the concept_relationship table is retrieved to add the available feature information. Otherwise, the feature information will be NaN.
- enrich_var_with_unit_info
bool (default: False) Whether to enrich the var table with unit information. Raises an error if multiple units are found for at least one feature. For entirely missing data points, the units are ignored. For observed data points with missing unit information (NULL in either 'unit_concept_id' or 'unit_source_value'), the value NULL/NaN is considered a single unit.
- instantiate_tensor
bool (default: True) Whether to instantiate the tensor into the .r field of the EHRData object.
- Returns:
An EHRData object with populated .r and .var fields.
Examples
>>> import ehrdata as ed
>>> import duckdb
>>> con_gi = duckdb.connect(database=":memory:", read_only=False)
>>> ed.dt.gibleed_omop(
...     con_gi,
... )
>>> edata_gi = ed.io.omop.setup_obs(
...     con_gi,
...     observation_table="person_observation_period",
... )
>>> edata_gi = ed.io.omop.setup_variables(
...     edata=edata_gi,
...     backend_handle=con_gi,
...     layer="tem_data",
...     data_tables=["observation", "measurement"],
...     data_field_to_keep={"observation": "observation_source_value", "measurement": "is_present"},
...     interval_length_number=20,
...     interval_length_unit="day",
...     num_intervals=20,
...     concept_ids="all",
...     aggregation_strategy="last",
...     enrich_var_with_feature_info=True,
...     enrich_var_with_unit_info=True,
... )
>>> edata_gi
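To build intuition for how interval_length_number, num_intervals and aggregation_strategy="last" interact, the following is a minimal standard-library sketch of the idea. It is not ehrdata's implementation: the helper bin_last, the concept_id 1001, the values, the half-open day-based intervals anchored at a per-person start date, and the tie-breaking rule are all assumptions made for illustration.

```python
from datetime import date


def bin_last(records, start, interval_days, num_intervals):
    """Hypothetical helper: keep the latest value per (feature, interval).

    records: iterable of (day, feature, value) tuples.
    Intervals are assumed half-open, [start + i*len, start + (i+1)*len).
    """
    latest = {}
    for day, feature, value in records:
        offset = (day - start).days
        if offset < 0:
            continue  # before the first interval
        idx = offset // interval_days
        if idx >= num_intervals:
            continue  # after the last interval
        key = (feature, idx)
        # Assumption: on equal dates, the record seen later wins.
        if key not in latest or day >= latest[key][0]:
            latest[key] = (day, value)
    return {key: value for key, (_, value) in latest.items()}


start = date(2020, 1, 1)
records = [
    (date(2020, 1, 3), 1001, 13.1),   # interval 0
    (date(2020, 1, 15), 1001, 12.4),  # interval 0, later -> kept
    (date(2020, 1, 25), 1001, 11.9),  # interval 1
    (date(2021, 6, 1), 1001, 10.0),   # beyond 20 intervals -> dropped
]
binned = bin_last(records, start, interval_days=20, num_intervals=20)
print(binned)  # {(1001, 0): 12.4, (1001, 1): 11.9}
```

With interval_length_number=20, interval_length_unit="day" and num_intervals=20, as in the example above, each person is described by 20 consecutive 20-day windows. Under the assumptions of this sketch, data points outside those 400 days do not contribute to any interval.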