ehrdata.dt.physionet2019


ehrdata.dt.physionet2019(data_path=None, *, interval_length_number=1, interval_length_unit='h', num_intervals=48, aggregation_strategy='last', drop_samples=None, n_samples=None, subsample_seed=0, layer=None)

Loads the dataset of the PhysioNet challenge 2019 (v1.0.0).

This dataset was designed to encourage the development of algorithms for sepsis prediction using physiological data [RJJ+20] [GAG+00].

The data consists of 35 time-dependent features and 5 static features (Age, Gender, Unit1, Unit2, HospAdmTime). More information on the features can be found at the link above.

The full dataset consists of 40’336 patients, with values for the 35 dynamic features recorded hourly and marked as missing where a value is not available. This yields a final dataset shape of 40’336 x 35 x the number of considered time steps.

The generated EHRData object truncates a sample if it has more than num_intervals steps, and pads it with missing values if it has fewer than num_intervals steps.
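The truncate-or-pad rule can be sketched with plain NumPy. This is only an illustration and not the ehrdata implementation: the helper name pad_or_truncate is hypothetical, and the sketch assumes that truncation keeps the earliest steps and that padding is appended at the end, neither of which is stated explicitly here.

```python
import numpy as np


def pad_or_truncate(series: np.ndarray, num_intervals: int) -> np.ndarray:
    """Return `series` (n_vars x n_steps) with exactly `num_intervals` steps.

    Hypothetical helper for illustration only; not part of ehrdata.
    """
    n_vars, n_steps = series.shape
    if n_steps >= num_intervals:
        # Record is long enough: keep only the first num_intervals steps (assumption).
        return series[:, :num_intervals].astype(float)
    # Record is too short: fill the trailing steps with NaN (assumption).
    out = np.full((n_vars, num_intervals), np.nan)
    out[:, :n_steps] = series
    return out


short_rec = np.arange(6, dtype=float).reshape(2, 3)   # 2 variables, 3 steps
long_rec = np.arange(20, dtype=float).reshape(2, 10)  # 2 variables, 10 steps

padded = pad_or_truncate(short_rec, num_intervals=5)
truncated = pad_or_truncate(long_rec, num_intervals=5)
```

Both results have shape (2, 5), so records of different lengths can be stacked into one n_obs × n_vars × n_t array.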

The tensor stored in .layers[layer_name] is fully compatible with e.g. the PyPOTS [Du23] package, as the .layers field of EHRData objects generally is.

Parameters:
data_path Path | str | None (default: None)

Path to the raw data. If the path exists, the data is loaded from there. Otherwise, the data is downloaded. Hint: if you have already downloaded the data from the link above, set this path to the training folder.

interval_length_number int (default: 1)

Numeric value of the length of one interval.

interval_length_unit str (default: 'h')

Unit belonging to the interval length.

num_intervals int (default: 48)

Number of intervals.

aggregation_strategy str (default: 'last')

Aggregation strategy for the time series data when multiple measurements of a person’s parameter are available within a time interval. Available are 'first' and 'last', as used in drop_duplicates().

drop_samples Iterable[str] | None (default: None)

Samples to drop from the dataset, identified by their RecordID.

n_samples int | None (default: None)

Number of samples to subsample from the dataset. If not specified, all samples are used.

subsample_seed int | None (default: 0)

Seed for the subsampling. If None, a random seed is used.

layer str | None (default: None)

Name of the layer in the EHRData object that will store the time series data. If not specified, it uses X.
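To illustrate the aggregation_strategy parameter described above, the following sketch shows how 'first' and 'last' behave as the keep argument of pandas' DataFrame.drop_duplicates, which is presumably the drop_duplicates() the description refers to. The toy table and its column names (interval, value) are assumptions made for the example and may not match the loader's internal representation.

```python
import pandas as pd

# Toy long-format measurements; column names are hypothetical.
measurements = pd.DataFrame(
    {
        "RecordID": ["p1", "p1", "p1", "p2"],
        "interval": [0, 0, 1, 0],
        "Parameter": ["HR", "HR", "HR", "HR"],
        "value": [80.0, 95.0, 90.0, 70.0],
    }
)

# One row per (patient, interval, parameter) should remain after aggregation.
keys = ["RecordID", "interval", "Parameter"]

# 'first' keeps the earliest row of each duplicate group, 'last' the latest.
first = measurements.drop_duplicates(subset=keys, keep="first")
last = measurements.drop_duplicates(subset=keys, keep="last")

print(first["value"].tolist())  # [80.0, 90.0, 70.0]
print(last["value"].tolist())   # [95.0, 90.0, 70.0]
```

Only patient p1 has two heart-rate values in interval 0, so that is the only cell where the two strategies differ.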

Return type:

EHRData

Returns:

The processed physionet2019 dataset. The raw data is also downloaded, stored and available under the data_path.

Examples

>>> import ehrdata as ed
>>> edata = ed.dt.physionet2019(layer="tem_data")
>>> edata
EHRData object with n_obs × n_vars × n_t = 40336 × 35 × 48
    obs: 'Age', 'Gender', 'Unit1', 'Unit2', 'HospAdmTime', 'training_Set'
    var: 'Parameter'
    tem: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47'
    layers: 'tem_data'
    shape of .tem_data: (40336, 35, 48)

Inspect static information:

>>> edata.obs.head()
            Age  Gender  Unit1  Unit2  HospAdmTime   training_Set
RecordID
p014977     77.27     1.0    0.0    1.0       -69.14  training_setA
p000902     65.55     1.0    NaN    NaN        -0.02  training_setA
p009098     52.16     0.0    NaN    NaN        -0.03  training_setA
p008386     24.35     1.0    NaN    NaN        -0.03  training_setA
p018195     82.51     1.0    1.0    0.0      -907.88  training_setA

Inspect the 48-hour trajectory of the variable SepsisLabel:

>>> edata[edata.obs.index == "p020378", edata.var_names == "SepsisLabel"].layers["tem_data"]
[[[nan,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
      0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
      0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., nan,
     nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]]]
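A trajectory like the one above can be summarised with NumPy. The sketch below rebuilds the printed array by hand, so it runs without downloading the dataset, then counts the observed intervals and finds the first interval with a positive label. The variable names are illustrative and not part of ehrdata.

```python
import numpy as np

nan = np.nan
# SepsisLabel trajectory copied from the output above (shape 1 x 1 x 48):
# 1 missing value, 24 zeros, 10 ones, 13 missing values.
traj = np.array([[[nan] + [0.0] * 24 + [1.0] * 10 + [nan] * 13]])

labels = traj[0, 0]

# Intervals that carry a recorded (non-missing) label.
observed = ~np.isnan(labels)
n_observed = int(observed.sum())

# First interval index with SepsisLabel == 1, or None if the label never turns positive.
# Comparisons with NaN are False, so missing intervals are ignored here.
positive = labels == 1.0
onset = int(np.argmax(positive)) if positive.any() else None

print(n_observed)  # 34
print(onset)       # 25
```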