Fed-TCGA-BRCA
The dataset used in this repo comes from The Cancer Genome Atlas (TCGA) from the GDC data portal.
We selected one single cancer type: Breast Invasive Carcinoma (BRCA) and only use clinical tabular data. We replicate the preprocessing used by Andreux et al. from data originally computed from TCGA by Liu et al.:
Liu J, Lichtenberg T, Hoadley KA, Poisson LM, Lazar AJ, Cherniack AD, Kovatich AJ, Benz CC, Levine DA, Lee AV, Omberg L, Wolf DM, Shriver CD, Thorsson V; Cancer Genome Atlas Research Network, Hu H. An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell. 2018 Apr 5;173(2):400-416.e11. doi: 10.1016/j.cell.2018.02.052. PMID: 29625055; PMCID: PMC6066282.
Andreux, M., Manoel, A., Menuet, R., Saillard, C., and Simpson, C., “Federated Survival Analysis with Discrete-Time Cox Models”, arXiv e-prints, 2020.
i.e. a subset of the features in the raw TCGA-BRCA dataset (categorical variables are one-hot encoded).
Terms of use
The data terms can be found here. Note that we only use unrestricted data. We do not guarantee that the use of this data can be done freely by the user. As such it is mandatory that one should check the applicability of the licence associated with this data before using it.
In particular, as per the GDC data access policy, users should > not attempt to identify individual human research participants from whom the data were obtained.
Ethics
As per the TCGA policies, special care was devoted to ensure privacy protection of research subjects, including but not limited to HIPAA compliance. Note that we do not use the genetic part of TCGA whose access is restricted due to its sensitivity.
In particular, as per the GDC data access policy, the terms bind users as to “not attempt to identify individual human research participants from whom the data were obtained.”
Dataset description
Dataset description |
|
---|---|
Description |
Clinical data from the TCGA-BRCA study with 1,088 patients. |
Dataset size |
117,5 KB (stored in this repository). |
Centers |
6 regions - Northeast, South, West, Midwest, Europe, Canada. |
Records per center |
Train/Test: 248/63, 156/40, 164/42, 129/33, 129/33, 40/11. |
Inputs shape |
39 features (tabular data). |
Targets shape |
(E,T). E: relative risk, continuous variable. T: T: event observed (1) or censorship (0) |
Total nb of points |
|
Task |
Survival analysis. |
For a more thorough presentation of data, raw TCGA-BRCA data can be viewed, investigated, and downloaded here.
Data
Preprocessed data is stored in this repo in the file brca.csv
, so
the dataset does not need to be downloaded. The medical centers (with
their geographic regions) are stored in the file centers.csv
. From
this file and the patients’ TCGA barcodes, we can extract the region of
origin of each patient’s tissue sort site (TSS). The numbers of sites
being too large (64) we regroup them in 6 different regions (Northeast,
South, West, Midwest, Europe, Canada). The patients’ stratified split by
region is static and stored in the train_test_split.csv file.
Downloading and preprocessing
This dataset is the only one that doesn’t need to be downloaded as it is already included in FLamby.
Using the dataset
Now that the dataset is ready for use you can load it using the low or high-level API by doing:
from flamby.datasets.fed_tcga_brca import FedTcgaBrca
# To load the first center as a pytorch dataset
center0 = FedTcgaBrca(center=0, train=True)
# To load the second center as a pytorch dataset
center1 = FedTcgaBrca(center=1, train=True)
# To sample batches from each of the local datasets use the traditional pytorch API
from torch.utils.data import DataLoader as dl
X, y = iter(dl(center0, batch_size=16, shuffle=True, num_workers=0)).next()
More informations on how to train model and handle flamby datasets in general are available in the Quickstart
Baseline training and evaluation in a pooled setting
To train and evaluate a model for the pooled dataset, run:
python benchmark.py --GPU 0 --workers 4