ehrdata.dt.physionet2019
- ehrdata.dt.physionet2019(data_path=None, *, interval_length_number=1, interval_length_unit='h', num_intervals=48, aggregation_strategy='last', drop_samples=None, n_samples=None, subsample_seed=0, layer=None)
Loads the dataset of the PhysioNet challenge 2019 (v1.0.0).
This dataset was designed to encourage the development of algorithms for sepsis prediction using physiological data [RJJ+20] [GAG+00].
The data consists of 35 time-dependent features and 5 static features (Age, Gender, Unit1, Unit2, HospAdmTime). More information on the features can be found on the link above.
The full dataset consists of 40,336 patients, with values for the 35 dynamic features recorded hourly and marked as missing where a value is not available. This amounts to a final dataset shape of 40,336 x 35 x number of considered time steps.
The generated EHRData object truncates a sample that has more than num_intervals steps, and pads a sample that has fewer than num_intervals steps with missing values.
The tensor stored in .layers[layer_name] is fully compatible with e.g. the PyPOTS [Du23] package, as the .layers field of EHRData objects generally is.
- Parameters:
- data_path
Path | str | None (default: None) Path to the raw data. If the path exists, the data is loaded from there. Otherwise, the data is downloaded. Hint: if you have already downloaded the data from the link above, set this path to the training folder.
- interval_length_number
int (default: 1) Numeric value of the length of one interval.
- interval_length_unit
str (default: 'h') Unit belonging to the interval length.
- num_intervals
int (default: 48) Number of intervals.
- aggregation_strategy
str (default: 'last') Aggregation strategy for the time series data when multiple measurements of a person’s parameter are available within one time interval. Available are 'first' and 'last', as used in drop_duplicates().
- drop_samples
Iterable[str] | None (default: None) Samples to drop from the dataset (indicate their RecordID).
- n_samples
int | None (default: None) Number of samples to subsample from the dataset. If not specified, all samples are used.
- subsample_seed
int | None (default: 0) Seed for the subsampling. If not specified, a random seed is used.
- layer
str | None (default: None) Name of the layer in the EHRData object that will store the time series data. If not specified, it uses X.
- Return type:
- Returns:
The processed physionet2019 dataset. The raw data is also downloaded, stored and available under the data_path.
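The interplay of the interval and aggregation parameters can be illustrated with a small, self-contained sketch. This is not the library's implementation: the helper aggregate_into_intervals and the toy measurements are hypothetical, and the sketch assumes measurement times are given in hours since the start of a record. Each measurement is assigned to an interval by integer division of its time by the interval length; when several measurements fall into the same interval, either the first or the last one is kept.

```python
import math


def aggregate_into_intervals(measurements, interval_length=1.0,
                             num_intervals=48, strategy="last"):
    """Bin (time_in_hours, value) pairs into fixed-length intervals.

    Hypothetical helper for illustration only; not part of ehrdata.
    """
    if strategy not in ("first", "last"):
        raise ValueError("strategy must be 'first' or 'last'")
    # Start with every interval marked as missing.
    series = [math.nan] * num_intervals
    filled = set()
    # Process measurements in chronological order.
    for t, value in sorted(measurements, key=lambda m: m[0]):
        idx = int(t // interval_length)
        if not 0 <= idx < num_intervals:
            continue  # outside the considered window
        if strategy == "first" and idx in filled:
            continue  # keep the earliest measurement of this interval
        series[idx] = value  # with 'last', later measurements overwrite earlier ones
        filled.add(idx)
    return series


# Toy measurements: two of them fall into the first hour.
hr = [(0.2, 80.0), (0.7, 84.0), (2.5, 90.0)]
last = aggregate_into_intervals(hr, num_intervals=4, strategy="last")
first = aggregate_into_intervals(hr, num_intervals=4, strategy="first")
print(last)   # [84.0, nan, 90.0, nan]
print(first)  # [80.0, nan, 90.0, nan]
```

Intervals without any measurement stay missing, which matches the description of unavailable values being indicated as missing.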
Examples
>>> import ehrdata as ed
>>> edata = ed.dt.physionet2019(layer="tem_data")
>>> edata
EHRData object with n_obs × n_vars × n_t = 40336 × 35 × 48
    obs: 'Age', 'Gender', 'Unit1', 'Unit2', 'HospAdmTime', 'training_Set'
    var: 'Parameter'
    tem: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47'
    layers: 'tem_data'
    shape of .tem_data: (40336, 35, 48)
Inspect static information
>>> edata.obs.head()
            Age  Gender  Unit1  Unit2  HospAdmTime   training_Set
RecordID
p014977   77.27     1.0    0.0    1.0       -69.14  training_setA
p000902   65.55     1.0    NaN    NaN        -0.02  training_setA
p009098   52.16     0.0    NaN    NaN        -0.03  training_setA
p008386   24.35     1.0    NaN    NaN        -0.03  training_setA
p018195   82.51     1.0    1.0    0.0      -907.88  training_setA
Inspect the 48-hour trajectory of the variable SepsisLabel:
>>> edata[edata.obs.index == "p020378", edata.var_names == "SepsisLabel"].layers["tem_data"]
[[[nan, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]]]
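The trailing nan entries in the trajectory above are consistent with a record shorter than 48 steps being padded up to num_intervals. A minimal sketch of the truncate-or-pad rule on plain Python lists follows; the helper fit_to_num_intervals is hypothetical and the library may implement this step differently.

```python
import math


def fit_to_num_intervals(values, num_intervals=48):
    """Truncate or pad a per-step series to exactly num_intervals entries.

    Hypothetical helper for illustration only; not part of ehrdata.
    """
    # Keep at most num_intervals steps ...
    fitted = list(values[:num_intervals])
    # ... and fill any remaining steps with missing values.
    fitted.extend([math.nan] * (num_intervals - len(fitted)))
    return fitted


short = fit_to_num_intervals([0.0, 0.0, 1.0], num_intervals=5)
long = fit_to_num_intervals(list(range(60)), num_intervals=48)
print(short)      # [0.0, 0.0, 1.0, nan, nan]
print(len(long))  # 48
```

Applying the same rule to every sample gives all samples the same number of steps, which is what allows them to be stacked into a single tensor.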