Fed-Heart Disease

The Heart Disease dataset [1] was collected in 1988 in four centers: Cleveland, Hungary, Switzerland and Long Beach V. We do not own the copyright of the data: everyone using this dataset should abide by its license and give proper attribution to the original authors. It is available for download here.

Dataset description

Please refer to the dataset website for an exhaustive data sheet. The table below provides a high-level description of the dataset.

	Dataset description
Description	Heart Disease dataset.
Dataset size	39.6 KB.
Centers	4 centers - Cleveland, Hungary, Switzerland and Long Beach V.
Records per center	Train/Test: 199/104, 172/89, 30/16, 85/45.
Inputs shape	16 features (tabular data).
Total number of points	740 (486 train, 254 test), summed from the records per center.
Task	Binary classification
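
The per-center counts in the table can be checked with a few lines of Python. This is only a sketch: the dictionary and the `summarize` helper are illustrative names, not part of FLamby, and the counts are copied by hand from the "Records per center" row.

```python
# Train/test record counts per center, copied from the table above.
CENTER_SPLITS = {
    "Cleveland": (199, 104),
    "Hungary": (172, 89),
    "Switzerland": (30, 16),
    "Long Beach V": (85, 45),
}


def summarize(splits):
    """Return per-center totals and the pooled train, test and overall counts."""
    per_center = {name: train + test for name, (train, test) in splits.items()}
    n_train = sum(train for train, _ in splits.values())
    n_test = sum(test for _, test in splits.values())
    return per_center, n_train, n_test, n_train + n_test


per_center, n_train, n_test, n_total = summarize(CENTER_SPLITS)
for name, n in per_center.items():
    # Share of the pooled dataset held by each center.
    print(f"{name}: {n} records ({n / n_total:.1%} of the pooled data)")
print(f"Pooled: {n_train} train + {n_test} test = {n_total} records")
```

The centers are strongly unbalanced: Switzerland holds only 46 records, while Cleveland holds 303.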

License and data usage terms

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC-BY 4.0) license by its authors. Anyone using this dataset should abide by its license and give proper attribution to the original authors.

Ethics

As per the dataset website, sensitive entries of the dataset were removed by the original authors:

The names and social security numbers of the patients were recently removed from the database, replaced with dummy values.

Download and preprocessing instructions

To download the data, first cd into the dataset_creation_scripts folder:

cd flamby/datasets/fed_heart_disease/dataset_creation_scripts

Then run the following command:

python download.py --output-folder ./heart_disease_dataset

This will download 38.6 KB of data.

IMPORTANT: If you relocate the dataset after downloading it, you must run the following script; otherwise, subsequent scripts will not find it:

python update_config.py --new-path /new/path/towards/dataset

Using the dataset

Now that the dataset is ready for use, you can load it using the low- or high-level API:

from torch.utils.data import DataLoader

from flamby.datasets.fed_heart_disease import FedHeartDisease, HeartDiseaseRaw

# To load the first center as a PyTorch dataset
center0 = FedHeartDisease(center=0, train=True)
# To load the second center as a PyTorch dataset
center1 = FedHeartDisease(center=1, train=True)
# To sample batches from each of the local datasets, use the standard PyTorch API
loader = DataLoader(center0, batch_size=16, shuffle=True, num_workers=0)
# Python 3 iterators have no .next() method; use the built-in next() instead
X, y = next(iter(loader))
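
With batch_size=16 and the DataLoader default drop_last=False, one epoch over a center yields ceil(n / 16) batches, and the last batch may be smaller. The sketch below works this out from the training sizes in the table. It assumes that center indices 0 to 3 follow the order in which the centers are listed; `batches_per_epoch` is a hypothetical helper, not a FLamby or PyTorch function.

```python
import math

# Training-set sizes from the table; the index-to-center order is an assumption.
TRAIN_SIZES = {0: 199, 1: 172, 2: 30, 3: 85}


def batches_per_epoch(n_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader-style iterator yields per epoch."""
    if drop_last:
        # Incomplete trailing batch is discarded.
        return n_samples // batch_size
    # Incomplete trailing batch is kept.
    return math.ceil(n_samples / batch_size)


for center, n in TRAIN_SIZES.items():
    print(f"center {center}: {batches_per_epoch(n, 16)} batches of up to 16 samples")
```

For the smallest center (30 training records), this gives two batches per epoch, the second holding only 14 samples.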

More information on how to train models and handle FLamby datasets in general is available in the Quickstart.

Benchmarking the baseline on a pooled setting

To benchmark the baseline on the pooled dataset, download and preprocess the dataset, then launch the following script:

python benchmark.py

This will train a logistic regression classifier, which is the strongest baseline according to the UCI ML Repository.
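
For readers unfamiliar with the model, here is a minimal, dependency-free sketch of logistic regression trained by full-batch gradient descent. It is not the code in benchmark.py: the data are synthetic two-dimensional points rather than the 16 Heart Disease features, and every function name and hyperparameter below is an assumption chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.1, epochs=200):
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5) for xi in X]


# Synthetic toy data: two Gaussian clusters, one per class (not the real dataset).
rng = random.Random(0)
X, y = [], []
for _ in range(200):
    label = rng.randint(0, 1)
    mean = 2.0 if label else -2.0
    X.append([rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)])
    y.append(label)

w, b = train_logistic_regression(X, y)
accuracy = sum(int(p == t) for p, t in zip(predict(X, w, b), y)) / len(y)
print(f"training accuracy on toy data: {accuracy:.2f}")
```

The model is a single linear layer followed by a sigmoid, which is why it is cheap to train on each center.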

References

[1] Janosi, Andras, Steinbrunn, William, Pfisterer, Matthias & Detrano, Robert (1988). Heart Disease. UCI Machine Learning Repository.