ehrdata.dt.physionet2012


ehrdata.dt.physionet2012(data_path=None, *, interval_length_number=1, interval_length_unit='h', num_intervals=48, aggregation_strategy='last', drop_samples=['147514', '142731', '145611', '140501', '155655', '143656', '156254', '150309', '140936', '141264', '150649', '142998'], layer=None)

Loads the dataset of the PhysioNet challenge 2012 (v1.0.0).

This dataset was designed to encourage the development of algorithms for mortality rate prediction using physiological data [SMS+12] [GAG+00].

If interval_length_number is 1, interval_length_unit is "h" (hour), and num_intervals is 48, the result is the same as the SAITS preprocessing [DCoteL23]. A sample is truncated if it has more than num_intervals steps and padded if it has fewer than num_intervals steps.

By default, the following 12 samples are dropped because they contain no time series information at all: 147514, 142731, 145611, 140501, 155655, 143656, 156254, 150309, 140936, 141264, 150649, 142998.

With the defaults of interval_length_number, interval_length_unit, num_intervals, and drop_samples, the tensor stored in .layers[layer_name] of edata is the same as the one produced by the PyPOTS preprocessing [Du23]. One difference is the axis order: the tensor in ehrdata has shape n_obs x n_vars x n_intervals (with defaults, 3000x37x48), while the tensor in PyPOTS has shape n_obs x n_intervals x n_vars (3000x48x37). The tensor stored in .layers[layer_name] is hence also fully compatible with the PyPOTS package, as the .layers field of EHRData objects generally is.

Note: In the original dataset, some missing values are encoded as -1 for some entries of the variables 'DiasABP', 'NIDiasABP', and 'Weight'. Here, these are replaced with NaNs.
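The truncation, padding, and axis-order points above can be illustrated with a minimal NumPy sketch. It uses toy arrays rather than the real dataset, and the helper pad_or_truncate is hypothetical (not part of ehrdata). It also assumes that padding is appended at the end with NaN and that truncation keeps the earliest steps, which the description above does not spell out.

```python
import numpy as np


def pad_or_truncate(series, num_intervals):
    """Fit a (n_vars, n_steps) array to exactly num_intervals time steps.

    Hypothetical helper for illustration only; not an ehrdata function.
    """
    n_vars, n_steps = series.shape
    out = np.full((n_vars, num_intervals), np.nan)  # NaN marks padded steps
    keep = min(n_steps, num_intervals)
    out[:, :keep] = series[:, :keep]  # assumption: keep the earliest steps
    return out


# Toy stand-ins for two samples with 3 variables each (not real PhysioNet data).
short = np.arange(6, dtype=float).reshape(3, 2)   # 2 steps: gets padded
long = np.arange(18, dtype=float).reshape(3, 6)   # 6 steps: gets truncated

# ehrdata layout: n_obs x n_vars x n_intervals
layer = np.stack([pad_or_truncate(s, 4) for s in (short, long)])

# PyPOTS layout: n_obs x n_intervals x n_vars (swap the last two axes)
pypots_layout = np.transpose(layer, (0, 2, 1))

print(layer.shape)          # (2, 3, 4)
print(pypots_layout.shape)  # (2, 4, 3)
```

Assuming the layer of a real EHRData object holds a NumPy array, the same np.transpose(..., (0, 2, 1)) call would reorder it into the n_obs x n_intervals x n_vars layout.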

Parameters:
data_path Path | str | None (default: None)

Path to the raw data. If the path exists, the data is loaded from there. Else, the data is downloaded.

interval_length_number int (default: 1)

Numeric value of the length of one interval.

interval_length_unit str (default: 'h')

Unit belonging to the interval length.

num_intervals int (default: 48)

Number of intervals.

aggregation_strategy str (default: 'last')

Aggregation strategy for the time series data when multiple measurements of a person’s parameter are available within a time interval. Available are 'first' and 'last', as used in drop_duplicates().

drop_samples Iterable[str] | None (default: ['147514', '142731', '145611', '140501', '155655', '143656', '156254', '150309', '140936', '141264', '150649', '142998'])

Samples to drop from the dataset (indicate their RecordID).

layer str | None (default: None)

Name of the layer in the EHRData object that will store the time series data. If not specified, it uses X.
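How the interval and aggregation parameters interact can be sketched with pandas. This is an illustration under stated assumptions, not the ehrdata implementation: the column names (RecordID, Parameter, Time, Value), the toy values, and the floor-division binning are all made up for the example. Only the use of drop_duplicates() with 'first' or 'last' is taken from the parameter description above.

```python
import pandas as pd

# Defaults from the signature: 1-hour intervals, 48 of them.
interval_length = pd.Timedelta(1, unit="h")
num_intervals = 48

# Toy long-format measurements (hypothetical columns and values).
measurements = pd.DataFrame(
    {
        "RecordID": ["132539"] * 4,
        "Parameter": ["HR"] * 4,
        "Time": pd.to_timedelta([7, 37, 97, 2885], unit="min"),
        "Value": [73.0, 77.0, 60.0, 80.0],
    }
)

# Assumption: a measurement falls into interval floor(Time / interval_length).
measurements["interval"] = measurements["Time"] // interval_length

# Measurements beyond the last interval are discarded (truncation).
measurements = measurements[measurements["interval"] < num_intervals]

# Keep one measurement per (sample, parameter, interval).
keys = ["RecordID", "Parameter", "interval"]
last = measurements.drop_duplicates(subset=keys, keep="last")
first = measurements.drop_duplicates(subset=keys, keep="first")

print(last[["interval", "Value"]])
print(first[["interval", "Value"]])
```

In this toy data the two readings at 7 and 37 minutes share interval 0, so 'last' keeps 77.0 and 'first' keeps 73.0, while the reading at 2885 minutes lies past the 48th interval and is dropped.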

Return type:

EHRData

Returns:

The processed physionet2012 dataset. The raw data is also downloaded, stored and available under the data_path.

Examples

>>> import ehrdata as ed
>>> edata = ed.dt.physionet2012(layer="tem_data")
>>> edata
EHRData object with n_obs × n_vars × n_t = 11988 × 37 × 48
    obs: 'set', 'Age', 'Gender', 'Height', 'ICUType', 'SAPS-I', 'SOFA', 'Length_of_stay', 'Survival', 'In-hospital_death'
    var: 'Parameter'
    tem: '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47'
    layers: 'tem_data'
    shape of .tem_data: (11988, 37, 48)

Inspect the static information:

>>> edata.obs.head()
            set   Age  Gender  Height  ICUType  SAPS-I  SOFA  Length_of_stay  Survival  In-hospital_death
RecordID
132539    set-a  54.0     0.0    -1.0      4.0       6     1               5        -1                  0
132540    set-a  76.0     1.0   175.3      2.0      16     8               8        -1                  0
132541    set-a  44.0     0.0    -1.0      3.0      21    11              19        -1                  0
132543    set-a  68.0     1.0   180.3      3.0       7     1               9       575                  0
132545    set-a  88.0     0.0    -1.0      3.0      17     2               4       918                  0

Inspect the 48-hour trajectory of the variable RespRate for the sample with RecordID 132539:

>>> edata[edata.obs.index == "132539", edata.var_names == "RespRate"].layers["tem_data"]
[[[19., 18., 19., 20., 20., 17., nan, 15., 14., 17., 15., 15.,
     12., 15., 15., 12., 14., 13., 18., 13., 12., 20., 15., 24.,
     nan, 16., 19., 18., nan, 16., nan, 18., nan, 18., nan, 20.,
     nan, 24., 21., 16., 18., 14., 23., 17., 20., 20., 20., 23.]]]