FedPyDESeq2 demo on the TCGA-LUAD dataset.
This example demonstrates how to run a FedPyDESeq2 experiment on the TCGA-LUAD dataset from a single machine, using Substra's simulation mode.
We will show how to perform a simple differential expression analysis, comparing samples
with "Advanced" vs. "Non-advanced" tumor stage.
from pathlib import Path
import pandas as pd
from fedpydeseq2_datasets.process_and_split_data import setup_tcga_dataset
from IPython.display import display
from fedpydeseq2.fedpydeseq2_pipeline import run_fedpydeseq2_experiment
Dataset setup
In a real federated setup, the data is distributed across multiple medical centers
and must be registered with Substra beforehand. Each center would therefore have a folder
containing two CSV files (one for the counts and one for the metadata), as well as an
opener Python file and a Markdown readme file (see Substra's documentation on how to register a datasample).
We would then only need to pass the dataset_datasample_keys path.
In this tutorial, however, we use FedPyDESeq2's simulation mode, which allows us to emulate a federated setup from a single machine.
The simulation mode assumes the data to be organized in the following structure:
processed_data_path/
├── centers_data/
│ └── tcga/
│ └── {experiment_id}/
│ ├── center_0/
│ │ ├── counts.csv
│ │ └── metadata.csv
│ ├── center_1/
│ │ ├── counts.csv
│ │ └── metadata.csv
│ └── ...
└── pooled_data/
└── tcga/
└── {experiment_id}/
├── counts.csv
└── metadata.csv
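Before launching a long run, it can help to confirm that the processed data follows the layout above. The following is a minimal sketch using only the standard library; the helper names expected_files and missing_files are hypothetical and are not part of FedPyDESeq2 or fedpydeseq2_datasets.

```python
from pathlib import Path

FILE_NAMES = ("counts.csv", "metadata.csv")


def expected_files(processed_data_path, experiment_id, n_centers):
    """List every file the simulation layout shown above should contain.

    Hypothetical helper: it only mirrors the directory tree from this tutorial.
    """
    base = Path(processed_data_path)
    centers_root = base / "centers_data" / "tcga" / experiment_id
    pooled_root = base / "pooled_data" / "tcga" / experiment_id
    files = [
        centers_root / f"center_{i}" / name
        for i in range(n_centers)
        for name in FILE_NAMES
    ]
    files += [pooled_root / name for name in FILE_NAMES]
    return files


def missing_files(processed_data_path, experiment_id, n_centers):
    """Return the expected files that are not present on disk."""
    return [
        path
        for path in expected_files(processed_data_path, experiment_id, n_centers)
        if not path.is_file()
    ]


# Example with the values used in this tutorial; an empty list means the layout is complete.
missing = missing_files("data/processed", "TCGA-LUAD-stage", n_centers=7)
print(f"{len(missing)} expected file(s) missing")
```

An empty result only shows that the files exist; it does not check their contents.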
In this tutorial, we have already downloaded the data in the data/raw directory.
The setup_tcga_dataset function from fedpydeseq2_datasets will automatically organize the data in the data/processed directory.
It will split the TCGA-LUAD dataset into 7 centers according to the geographical origin of the samples, as described in the FedPyDESeq2 paper.
See also the fedpydeseq2_datasets repository for more details.
dataset_name = "TCGA-LUAD"
raw_data_path = Path("data/raw").resolve()
processed_data_path = Path("data/processed").resolve()
design_factors = "stage"
setup_tcga_dataset(
    raw_data_path,
    processed_data_path,
    dataset_name=dataset_name,
    small_samples=False,
    small_genes=False,
    only_two_centers=False,
    design_factors=design_factors,
    force=True,
)
experiment_id = "TCGA-LUAD-stage"
Out:
2025-06-24 07:42:11.627 | INFO | fedpydeseq2_datasets.process_and_split_data:setup_tcga_dataset:144 - Setting up TCGA dataset: TCGA-LUAD-stage
2025-06-24 07:42:11.627 | INFO | fedpydeseq2_datasets.process_and_split_data:setup_tcga_dataset:150 - First center metadata does not exist or force=True. Setting up the dataset.
2025-06-24 07:42:11.628 | INFO | fedpydeseq2_datasets.process_and_split_data:_setup_tcga_dataset:282 - Processing the data for the TCGA dataset: TCGA-LUAD-stage
2025-06-24 07:42:14.678 | INFO | fedpydeseq2_datasets.process_and_split_data:_setup_tcga_dataset:302 - Saving the data for each center /home/runner/work/fedpydeseq2/fedpydeseq2/docs/examples/data/processed/centers_data/tcga/TCGA-LUAD-stage
2025-06-24 07:42:53.220 | INFO | fedpydeseq2_datasets.process_and_split_data:_setup_tcga_dataset:364 - Saving the pooled data at /home/runner/work/fedpydeseq2/fedpydeseq2/docs/examples/data/processed/pooled_data/tcga/TCGA-LUAD-stage
Running the experiment
We can now run the experiment.
Substra, the FL framework on which FedPyDESeq2 is built, supports a simulation mode that runs locally on a single machine; we use it here.
The experiment is launched with the run_fedpydeseq2_experiment wrapper function, which takes the following parameters:
- n_centers=7: Our data is distributed across 7 different medical centers.
- backend="subprocess" and simulate=True: We run the analysis locally on our machine to simulate a federated setup, rather than in a real distributed environment.
- register_data=True: We register our dataset with Substra before the analysis. In a real federated setup, this would be set to False if the data had already been registered with Substra.
- asset_directory: This directory should contain an opener.py file, containing an Opener class, and a datasets.description.md file. Here, we copy them from fedpydeseq2_datasets/assets/tcga.
- centers_root_directory: Where the processed data for each center is stored.
- compute_plan_name: We call this analysis "Example-TCGA-LUAD-pipeline" in Substra.
- dataset_name: We are working with the TCGA-LUAD lung cancer dataset.
- design_factors: This should be a list of the design factors we wish to include in our analysis (a single factor may be passed as a string, as in the call in this example). Here, we study how "stage" (the cancer stage) affects gene expression.
- ref_levels: We set "Non-advanced" as our baseline cancer stage.
- contrast: This should be a list of three strings, of the form ["factor", "alternative_level", "baseline_level"]. To compare gene expression between "Advanced" and "Non-advanced" stages, we set contrast=["stage", "Advanced", "Non-advanced"].
- refit_cooks=True: After finding outliers using Cook's distance, we refit the model without them for more robust results.
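Since a malformed contrast is an easy mistake to make, a small check before launching a long run may be useful. The sketch below is illustrative only: check_contrast and contrast_label are hypothetical helpers, not part of the FedPyDESeq2 API. The label pattern {factor}_{alternative}_vs_{baseline} is inferred from the column name used in the results section of this example and should be treated as an assumption rather than a documented guarantee.

```python
def check_contrast(contrast, design_factors):
    """Validate a contrast of the form [factor, alternative_level, baseline_level].

    Hypothetical helper, not part of FedPyDESeq2.
    """
    # Accept a single factor given as a string, as in this tutorial.
    if isinstance(design_factors, str):
        design_factors = [design_factors]
    if len(contrast) != 3 or not all(isinstance(item, str) for item in contrast):
        raise ValueError("contrast must be a list of three strings")
    factor, alternative, baseline = contrast
    if factor not in design_factors:
        raise ValueError(f"'{factor}' is not one of the design factors {design_factors}")
    if alternative == baseline:
        raise ValueError("alternative and baseline levels must differ")
    return factor, alternative, baseline


def contrast_label(contrast):
    """Build a label following the assumed '{factor}_{alternative}_vs_{baseline}' pattern."""
    factor, alternative, baseline = contrast
    return f"{factor}_{alternative}_vs_{baseline}"


contrast = ["stage", "Advanced", "Non-advanced"]
check_contrast(contrast, design_factors="stage")
print(contrast_label(contrast))
```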
fl_results = run_fedpydeseq2_experiment(
    n_centers=7,
    backend="subprocess",
    simulate=True,
    register_data=True,
    asset_directory=Path("assets/tcga").resolve(),
    centers_root_directory=processed_data_path
    / "centers_data"
    / "tcga"
    / experiment_id,
    compute_plan_name="Example-TCGA-LUAD-pipeline",
    dataset_name="TCGA-LUAD",
    design_factors="stage",
    ref_levels={"stage": "Non-advanced"},
    contrast=["stage", "Advanced", "Non-advanced"],
    refit_cooks=True,
)
Out:
2025-06-24 07:43:31.488 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:182 - Setting up organizations...
2025-06-24 07:43:31.490 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:233 - Registering the datasets...
2025-06-24 07:43:31.490 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:290 - Adding dataset to client MyOrg2MSP
2025-06-24 07:43:31.491 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:295 - Dataset added. Key: e0f3b562-296f-45e4-9396-f1238eb58a13
2025-06-24 07:43:31.492 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:290 - Adding dataset to client MyOrg3MSP
2025-06-24 07:43:31.492 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:295 - Dataset added. Key: 6da35e45-07e0-45b3-b99f-cce57c2c8424
2025-06-24 07:43:31.493 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:290 - Adding dataset to client MyOrg4MSP
2025-06-24 07:43:31.493 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:295 - Dataset added. Key: 5e86bd61-cde7-48f9-a666-9b333e9a3a48
2025-06-24 07:43:31.493 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:290 - Adding dataset to client MyOrg5MSP
2025-06-24 07:43:31.494 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:295 - Dataset added. Key: dceea926-cdba-482e-8a14-b9c6719efed8
2025-06-24 07:43:31.494 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:290 - Adding dataset to client MyOrg6MSP
2025-06-24 07:43:31.495 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:295 - Dataset added. Key: b6c0ad4d-c167-4be8-951d-32dd4fa8fa18
2025-06-24 07:43:31.495 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:290 - Adding dataset to client MyOrg7MSP
2025-06-24 07:43:31.495 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:295 - Dataset added. Key: a10d950a-1b25-427d-a5fc-a2be88f7e159
2025-06-24 07:43:31.496 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:290 - Adding dataset to client MyOrg8MSP
2025-06-24 07:43:31.496 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:295 - Dataset added. Key: 71268858-2a9b-4598-8137-41c743826a2d
2025-06-24 07:43:31.496 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:316 - Datasets registered.
2025-06-24 07:43:31.496 | INFO | fedpydeseq2.substra_utils.federated_experiment:run_federated_experiment:318 - Dataset keys: {'MyOrg2MSP': 'e0f3b562-296f-45e4-9396-f1238eb58a13', 'MyOrg3MSP': '6da35e45-07e0-45b3-b99f-cce57c2c8424', 'MyOrg4MSP': '5e86bd61-cde7-48f9-a666-9b333e9a3a48', 'MyOrg5MSP': 'dceea926-cdba-482e-8a14-b9c6719efed8', 'MyOrg6MSP': 'b6c0ad4d-c167-4be8-951d-32dd4fa8fa18', 'MyOrg7MSP': 'a10d950a-1b25-427d-a5fc-a2be88f7e159', 'MyOrg8MSP': '71268858-2a9b-4598-8137-41c743826a2d'}
2025-06-24 07:43:55,740 - INFO - Simulating the execution of the compute plan.
2025-06-24 07:43:55.741 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:66 - Building design matrices...
2025-06-24 07:43:56.183 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:76 - Finished building design matrices.
2025-06-24 07:43:56.183 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:82 - Computing size factors...
2025-06-24 07:43:56.631 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:97 - Finished computing size factors.
2025-06-24 07:43:56.632 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:101 - Running LFC and dispersions.
2025-06-24 07:43:56.632 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:94 - Fit genewise dispersions...
2025-06-24 07:49:06.711 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:108 - Finished fitting genewise dispersions.
2025-06-24 07:49:06.712 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:112 - Compute dispersion prior...
2025-06-24 07:49:09.886 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:125 - Finished computing dispersion prior.
2025-06-24 07:49:09.886 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:145 - Fit MAP dispersions...
2025-06-24 07:52:52.629 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:158 - Finished fitting MAP dispersions.
2025-06-24 07:52:52.629 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:161 - Compute log fold changes...
2025-06-24 07:56:43.133 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:171 - Finished computing log fold changes.
2025-06-24 07:56:43.134 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:112 - Finished running LFC and dispersions.
2025-06-24 07:56:43.134 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:114 - Computing Cook distances...
2025-06-24 08:01:07.624 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:128 - Finished computing Cook distances.
2025-06-24 08:01:07.624 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:132 - Refitting Cook outliers...
2025-06-24 08:03:33.763 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:94 - Fit genewise dispersions...
2025-06-24 08:11:23.314 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:108 - Finished fitting genewise dispersions.
2025-06-24 08:11:26.877 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:145 - Fit MAP dispersions...
2025-06-24 08:12:08.064 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:158 - Finished fitting MAP dispersions.
2025-06-24 08:12:08.064 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:161 - Compute log fold changes...
2025-06-24 08:19:42.188 | INFO | fedpydeseq2.core.deseq2_core.deseq2_lfc_dispersions.deseq2_lfc_dispersions:run_deseq2_lfc_dispersions:171 - Finished computing log fold changes.
2025-06-24 08:19:46.047 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:164 - Finished refitting Cook outliers.
2025-06-24 08:19:46.048 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:168 - Running DESeq2 statistics.
2025-06-24 08:19:46.048 | INFO | fedpydeseq2.core.deseq2_core.deseq2_stats.deseq2_stats:run_deseq2_stats:64 - Running Wald tests.
2025-06-24 08:19:59.428 | INFO | fedpydeseq2.core.deseq2_core.deseq2_stats.deseq2_stats:run_deseq2_stats:74 - Finished running Wald tests.
2025-06-24 08:19:59.428 | INFO | fedpydeseq2.core.deseq2_core.deseq2_stats.deseq2_stats:run_deseq2_stats:77 - Running Cook's filtering...
2025-06-24 08:20:24.318 | INFO | fedpydeseq2.core.deseq2_core.deseq2_stats.deseq2_stats:run_deseq2_stats:86 - Finished running Cook's filtering.
2025-06-24 08:20:24.318 | INFO | fedpydeseq2.core.deseq2_core.deseq2_stats.deseq2_stats:run_deseq2_stats:87 - Computing adjusted p-values...
2025-06-24 08:20:30.188 | INFO | fedpydeseq2.core.deseq2_core.deseq2_stats.deseq2_stats:run_deseq2_stats:99 - Finished computing adjusted p-values.
2025-06-24 08:20:30.188 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:178 - Finished running DESeq2 statistics.
2025-06-24 08:20:30.188 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:182 - Saving pipeline results.
2025-06-24 08:20:34.190 | INFO | fedpydeseq2.core.deseq2_core.deseq2_full_pipe:run_deseq_pipe:191 - Finished saving pipeline results.
2025-06-24 08:20:34,190 - INFO - Experiment summary saved to /tmp/tmpcmxzzy1b/2025_06_24_07_43_55_simu-15d0c308-b532-4be1-8a2a-c3bdb1312b18.json
2025-06-24 08:20:34,191 - INFO - The compute plan has been simulated, its key is simu-15d0c308-b532-4be1-8a2a-c3bdb1312b18.
Results
The results are stored in the fl_results dictionary, which does not contain any individual sample information.
fl_results.keys()
Out:
dict_keys(['gene_names', 'MAP_dispersions', 'dispersions', 'genewise_dispersions', 'non_zero', 'fitted_dispersions', 'LFC', 'padj', 'p_values', 'wald_statistics', 'wald_se', 'replaced', 'refitted', 'prior_disp_var', '_squared_logres', 'contrast'])
We can then extract the results for our contrast of interest, and store them in a pandas DataFrame.
res_df = pd.DataFrame()
res_df["LFC"] = fl_results["LFC"]["stage_Advanced_vs_Non-advanced"]
res_df["pvalue"] = fl_results["p_values"]
res_df["padj"] = fl_results["padj"]
res_df = res_df.loc[fl_results["non_zero"], :]
display(res_df)
Out:
LFC pvalue padj
ENSG00000223972 0.338364 0.052181 0.253126
ENSG00000278267 0.038183 0.821323 0.937148
ENSG00000227232 0.104173 0.207068 0.516585
ENSG00000284332 0.174178 0.742368 NaN
ENSG00000243485 0.291516 0.192281 0.497940
... ... ... ...
ENSG00000215506 -0.585729 0.145101 NaN
ENSG00000227629 0.074826 0.933744 NaN
ENSG00000231514 0.026989 0.853410 0.948483
ENSG00000237917 0.118084 0.668620 0.867462
ENSG00000235857 -1.069862 0.247121 NaN
[57832 rows x 3 columns]
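A common next step is to keep only the genes that pass a significance threshold. The sketch below assumes a DataFrame with the same LFC, pvalue and padj columns as res_df; the select_significant helper, the 0.05 threshold and the toy values are illustrative assumptions, not results from this experiment. Rows whose padj is NaN, as for several genes in the table above, are dropped before filtering.

```python
import pandas as pd


def select_significant(res_df, alpha=0.05, min_abs_lfc=0.0):
    """Keep genes with padj < alpha and |LFC| >= min_abs_lfc, sorted by padj.

    Hypothetical helper; genes without an adjusted p-value (NaN) are excluded.
    """
    tested = res_df.dropna(subset=["padj"])
    mask = (tested["padj"] < alpha) & (tested["LFC"].abs() >= min_abs_lfc)
    return tested.loc[mask].sort_values("padj")


# Toy data with made-up values, only to show the expected column layout.
toy = pd.DataFrame(
    {
        "LFC": [1.2, -0.8, 0.1, 2.0],
        "pvalue": [0.0001, 0.002, 0.6, 0.03],
        "padj": [0.001, 0.01, 0.9, float("nan")],
    },
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
)
significant = select_significant(toy, alpha=0.05)
print(significant)
```

On the real results, the same call would be select_significant(res_df); the appropriate thresholds depend on the analysis.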
Total running time of the script: ( 38 minutes 33.045 seconds)