ehrdata.io.from_pandas
- ehrdata.io.from_pandas(df, *, layer=None, columns_obs_only=None, index_column=None, format='flat', wide_format_time_suffix=None, long_format_keys=None, fill_time_gaps=False)
Transform a given DataFrame into an EHRData object.

Note that columns containing boolean values (either 0/1 or True/true/False/false) will be stored as boolean columns. All other non-numerical columns will be stored as categorical values.
- Parameters:
  - df : DataFrame
    The dataframe to be transformed.
  - layer : str | None (default: None)
    The layer to store the data in. If not specified, the data is stored in X.
  - columns_obs_only : Iterable[str] | None (default: None)
    Column names that should belong to obs only and not X.
  - index_column : str | int | None (default: None)
    The index column of obs. This can be either a column name (or its numerical index in the DataFrame) or the index of the dataframe.
  - format : Literal['flat', 'wide', 'long'] (default: 'flat')
    The format of the input dataframe. If the data is not longitudinal, choose format="flat". If the data is longitudinal in the long format, choose format="long". If the data is longitudinal in a wide format, choose format="wide".
  - wide_format_time_suffix : str | None (default: None)
    Use only if format="wide". Suffixes in the variable columns that indicate the time of the observation. The collected suffixes will be sorted lexicographically. The variables will be ordered accordingly along the 3rd axis of the EHRData object.
  - long_format_keys : dict[Literal['observation_column', 'variable_column', 'time_column', 'value_column'], str] | None (default: None)
    Use only if format="long". The keys of the dataframe in the long format. The dictionary should have the following structure: {"observation_column": "<the column name of the observation ids>", "variable_column": "<the column name of the variable ids>", "time_column": "<the column name of the time>", "value_column": "<the column name of the values>"}.
  - fill_time_gaps : bool (default: False)
    Use only if format="long". If True, fills gaps in the numeric time axis with NaN values so that the 3rd dimension is a continuous integer range from 0 to the maximum time value. For example, if the data contains time indices [0, 1, 2, 5], the resulting time axis will be [0, 1, 2, 3, 4, 5] with NaN values at indices 3 and 4 for all observations and variables.
- Return type:
  EHRData
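As a rough sketch of the long_format_keys structure described above, the snippet below builds such a dictionary and shows how its four entries jointly address one cell (observation, variable, time) of the data. It uses only plain Python and does not call ehrdata. The column names patient_id, measurement, visit and result are hypothetical and chosen purely for illustration.

```python
# Hypothetical column names, chosen only for illustration.
long_format_keys = {
    "observation_column": "patient_id",
    "variable_column": "measurement",
    "time_column": "visit",
    "value_column": "result",
}

# The four keys named in the parameter description.
expected_keys = {
    "observation_column",
    "variable_column",
    "time_column",
    "value_column",
}
missing_keys = expected_keys - long_format_keys.keys()

# Long-format rows as plain dicts: one row per observation/variable/time triple.
rows = [
    {"patient_id": "0", "measurement": "systolic_blood_pressure", "visit": 0, "result": 120},
    {"patient_id": "0", "measurement": "systolic_blood_pressure", "visit": 1, "result": 125},
    {"patient_id": "1", "measurement": "systolic_blood_pressure", "visit": 0, "result": 130},
]

# Index every value by (observation, variable, time). This is only a
# conceptual picture of the long layout, not the library's implementation.
cells = {
    (
        row[long_format_keys["observation_column"]],
        row[long_format_keys["variable_column"]],
        row[long_format_keys["time_column"]],
    ): row[long_format_keys["value_column"]]
    for row in rows
}

print(missing_keys)
print(cells[("0", "systolic_blood_pressure", 1)])
```

A dictionary of this shape would then be passed as long_format_keys=long_format_keys together with format="long".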
Examples
>>> import ehrdata as ed
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "patient_id": ["0", "1", "2", "3", "4"],
...         "age": [65, 72, 58, 78, 82],
...         "sex": ["M", "F", "F", "M", "F"],
...     }
... )
>>> edata = ed.io.from_pandas(df, layer="tem_data", index_column="patient_id")
>>> edata
EHRData object with n_obs × n_vars × n_t = 5 × 2 × 1
    layers: 'tem_data'
    shape of .tem_data: (5, 2, 1)
>>> import numpy as np
>>> df_wide = pd.DataFrame(
...     {
...         "patient_id": ["0", "1"],
...         "sex": ["F", "M"],
...         "systolic_blood_pressure_t_0": [120, 130],
...         "systolic_blood_pressure_t_2": [125, np.nan],  # the suffix strings are sorted lexicographically
...         "systolic_blood_pressure_t_1": [np.nan, 135],
...     }
... )
>>> edata = ed.io.from_pandas(df_wide, layer="tem_data", format="wide", columns_obs_only=["patient_id", "sex"])
>>> edata
EHRData object with n_obs × n_vars × n_t = 2 × 1 × 3
    obs: 'patient_id', 'sex'
    layers: 'tem_data'
    shape of .tem_data: (2, 1, 3)
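Because the time suffixes are sorted lexicographically, suffixes with multi-digit numbers may not end up in numeric order. The following sketch uses plain Python string sorting (not ehrdata itself) to show the effect. The suffix values and the zero-padding workaround are illustrative assumptions, not part of the library's API.

```python
# Hypothetical time suffixes as they might appear at the end of column names.
suffixes = ["t_0", "t_2", "t_1", "t_10"]

# Lexicographic sorting compares character by character,
# so "t_10" sorts before "t_2".
lexicographic = sorted(suffixes)
print(lexicographic)  # ['t_0', 't_1', 't_10', 't_2']

# One possible workaround: zero-pad the numeric part before building
# the column names, so that string order matches numeric order.
padded = sorted(f"t_{int(s.split('_')[1]):02d}" for s in suffixes)
print(padded)  # ['t_00', 't_01', 't_02', 't_10']
```

If numeric ordering along the time axis matters, renaming the columns with zero-padded suffixes before calling from_pandas is one way to keep the two orderings aligned.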
>>> df_long = pd.DataFrame(
...     {
...         "observation_id": ["0", "0", "0", "1", "1", "1"],
...         "variable": [
...             "sex",
...             "systolic_blood_pressure",
...             "systolic_blood_pressure",
...             "sex",
...             "systolic_blood_pressure",
...             "systolic_blood_pressure",
...         ],
...         "time": ["t_0", "t_0", "t_1", "t_0", "t_0", "t_2"],
...         "value": ["F", 120, 125, "M", 130, 135],
...     }
... )
>>> edata = ed.io.from_pandas(df_long, layer="tem_data", format="long", columns_obs_only=["sex"])
>>> edata
EHRData object with n_obs × n_vars × n_t = 2 × 1 × 3
    obs: "sex"
    layers: 'tem_data'
    shape of .tem_data: (2, 1, 3)
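The effect of fill_time_gaps=True on a numeric time axis can be pictured with a small standalone sketch. It reproduces the [0, 1, 2, 5] case from the parameter description in plain Python for a single observation and variable. The measurement values are made up, and the code is only a conceptual illustration, not the library's implementation.

```python
import math

# Time indices present in the data, as in the parameter description.
observed_times = [0, 1, 2, 5]

# Hypothetical values for one observation and one variable.
values_by_time = {0: 120.0, 1: 125.0, 2: 118.0, 5: 130.0}

# Continuous integer range from 0 to the maximum observed time value.
full_axis = list(range(0, max(observed_times) + 1))

# Time points without a measurement are filled with NaN.
filled = [values_by_time.get(t, math.nan) for t in full_axis]

print(full_axis)  # [0, 1, 2, 3, 4, 5]
print(filled)     # [120.0, 125.0, 118.0, nan, nan, 130.0]
```

Without gap filling, the time axis would only contain the four observed time points.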