ehrdata.io.from_pandas

ehrdata.io.from_pandas(df, *, layer=None, columns_obs_only=None, index_column=None, format='flat', wide_format_time_suffix=None, long_format_keys=None, fill_time_gaps=False)

Transform a given DataFrame into an EHRData object.

Note that columns containing boolean values (0/1, True/False, or true/false) are stored as boolean columns. All other non-numerical columns are stored as categorical values.

Parameters:
df DataFrame

The dataframe to be transformed.

layer str | None (default: None)

The layer to store the data in. If not specified, the data is stored in X.

columns_obs_only Iterable[str] | None (default: None)

Column names that should belong to obs only and not X.

index_column str | int | None (default: None)

The index column of obs. This can be either a column name (or its numerical index in the DataFrame) or the index of the dataframe.

format Literal['flat', 'wide', 'long'] (default: 'flat')

The format of the input dataframe. If the data is not longitudinal, choose format="flat". If the data is longitudinal in the long format, choose format="long". If the data is longitudinal in a wide format, choose format="wide".
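To make the difference between the two longitudinal layouts concrete, the sketch below holds the same measurements first as wide rows (one column per variable and time point) and then as long rows (one row per observation, variable, and time). It uses plain Python dictionaries instead of pandas to stay self-contained; the column names and the `_t_` separator are illustrative assumptions, not something ehrdata requires.

```python
# Wide layout: one row per patient, one column per (variable, time) pair.
wide_rows = [
    {"patient_id": "0", "sbp_t_0": 120, "sbp_t_1": 125},
    {"patient_id": "1", "sbp_t_0": 130, "sbp_t_1": 135},
]

# Long layout: one row per (patient, variable, time) triple.
long_rows = []
for row in wide_rows:
    for column, value in row.items():
        if column == "patient_id":
            continue
        # Split e.g. "sbp_t_0" into the variable name "sbp" and the time index "0".
        variable, _, time_index = column.partition("_t_")
        long_rows.append(
            {
                "observation_id": row["patient_id"],
                "variable": variable,
                "time": f"t_{time_index}",
                "value": value,
            }
        )

print(len(long_rows))  # 4
print(long_rows[0])
```

Data shaped like `wide_rows` would correspond to format="wide", and data shaped like `long_rows` to format="long".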

wide_format_time_suffix str | None (default: None)

Use only if format="wide". Suffixes in the variable columns that indicate the time of the observation. The collected suffixes are sorted lexicographically, and the variables are ordered accordingly along the 3rd axis of the EHRData object.
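Because the ordering is lexicographic rather than numeric, unpadded numeric suffixes can land in an unexpected order once there are ten or more time points. The sketch below uses only Python's built-in `sorted` to show the effect; the suffix names and the zero-padding workaround are illustrative assumptions and not part of the ehrdata API.

```python
# Hypothetical time suffixes as they might appear in wide-format column names.
suffixes = ["t_0", "t_1", "t_2", "t_10"]

# Lexicographic (string) order places "t_10" before "t_2".
lexicographic = sorted(suffixes)
print(lexicographic)  # ['t_0', 't_1', 't_10', 't_2']

# Zero-padding the numeric part makes string order match numeric order.
padded = [f"t_{int(s.split('_')[1]):02d}" for s in suffixes]
print(sorted(padded))  # ['t_00', 't_01', 't_02', 't_10']
```

Renaming columns with padded suffixes before the conversion is one way to keep the time axis in chronological order.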

long_format_keys dict[Literal['observation_column', 'variable_column', 'time_column', 'value_column'], str] | None (default: None)

Use only if format="long". The keys of the dataframe in the long format. The dictionary should have the following structure: {"observation_column": "<the column name of the observation ids>", "variable_column": "<the column name of the variable ids>", "time_column": "<the column name of the time>", "value_column": "<the column name of the values>"}.
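As a rough illustration, such a dictionary could be built and checked against the dataframe's column names before the conversion. The column names (`subject`, `feature`, `visit`, `measurement`) and the helper `check_long_format_keys` are hypothetical and exist only for this sketch; ehrdata may perform its own validation.

```python
# The four keys documented for long_format_keys.
REQUIRED_KEYS = {"observation_column", "variable_column", "time_column", "value_column"}


def check_long_format_keys(keys: dict[str, str], columns: list[str]) -> None:
    """Hypothetical pre-flight check: raise if a key is missing or names an unknown column."""
    missing = REQUIRED_KEYS - keys.keys()
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")
    unknown = [name for name in keys.values() if name not in columns]
    if unknown:
        raise ValueError(f"columns not in dataframe: {unknown}")


# Assumed column names of a long-format dataframe (in practice, list(df.columns)).
columns = ["subject", "feature", "visit", "measurement"]

long_format_keys = {
    "observation_column": "subject",
    "variable_column": "feature",
    "time_column": "visit",
    "value_column": "measurement",
}
check_long_format_keys(long_format_keys, columns)  # passes silently
```

The dictionary would then be passed along with the dataframe, for example `ed.io.from_pandas(df, format="long", long_format_keys=long_format_keys)`.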

fill_time_gaps bool (default: False)

Use only if format="long". If True, fills gaps in the numeric time axis with NaN values so that the 3rd dimension is a continuous integer range from 0 to the maximum time value. For example, if the data contains time indices [0, 1, 2, 5], the resulting time axis will be [0, 1, 2, 3, 4, 5] with NaN values at indices 3 and 4 for all observations and variables.
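The gap-filling behaviour described above can be pictured with a small standalone sketch for a single observation and variable. It reproduces only the documented outcome in plain Python, under the assumption of integer time indices, and is not ehrdata's implementation; the measurement values are made up for illustration.

```python
import math

# Sparse measurements for one observation and one variable, keyed by time index.
observed = {0: 120.0, 1: 125.0, 2: 130.0, 5: 135.0}

# Continuous integer time axis from 0 to the maximum observed time.
time_axis = list(range(max(observed) + 1))

# Time points without a measurement are filled with NaN.
filled = [observed.get(t, math.nan) for t in time_axis]

print(time_axis)  # [0, 1, 2, 3, 4, 5]
print(filled)     # [120.0, 125.0, 130.0, nan, nan, 135.0]
```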

Return type:

EHRData

Examples

>>> import ehrdata as ed
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "patient_id": ["0", "1", "2", "3", "4"],
...         "age": [65, 72, 58, 78, 82],
...         "sex": ["M", "F", "F", "M", "F"],
...     }
... )
>>> edata = ed.io.from_pandas(df, layer="tem_data", index_column="patient_id")
>>> edata
EHRData object with n_obs × n_vars × n_t = 5 × 2 × 1
    layers: 'tem_data'
    shape of .tem_data: (5, 2, 1)
>>> df_wide = pd.DataFrame(
...     {
...         "patient_id": ["0", "1"],
...         "sex": ["F", "M"],
...         "systolic_blood_pressure_t_0": [120, 130],
...         "systolic_blood_pressure_t_2": [125, np.nan],  # the suffix strings are sorted lexicographically
...         "systolic_blood_pressure_t_1": [np.nan, 135],
...     }
... )
>>> edata = ed.io.from_pandas(df_wide, layer="tem_data", format="wide", columns_obs_only=["patient_id", "sex"])
>>> edata
EHRData object with n_obs × n_vars × n_t = 2 × 1 × 3
    obs: 'patient_id', 'sex'
    layers: 'tem_data'
    shape of .tem_data: (2, 1, 3)
>>> df_long = pd.DataFrame(
...     {
...         "observation_id": ["0", "0", "0", "1", "1", "1"],
...         "variable": [
...             "sex",
...             "systolic_blood_pressure",
...             "systolic_blood_pressure",
...             "sex",
...             "systolic_blood_pressure",
...             "systolic_blood_pressure",
...         ],
...         "time": ["t_0", "t_0", "t_1", "t_0", "t_0", "t_2"],
...         "value": ["F", 120, 125, "M", 130, 135],
...     }
... )
>>> edata = ed.io.from_pandas(df_long, layer="tem_data", format="long", columns_obs_only=["sex"])
>>> edata
EHRData object with n_obs × n_vars × n_t = 2 × 1 × 3
    obs: 'sex'
    layers: 'tem_data'
    shape of .tem_data: (2, 1, 3)