fedeca.utils¶

Utility functions for data generation.

generate_cox_data_and_substra_clients(n_clients=2, ndim=10, split_method_kwargs=None, backend_type='subprocess', data_path=None, urls=None, tokens=None, seed=42, n_per_client=200, add_treated=False, ncategorical=0)¶

Generate Cox data on disk for several clients.

Generate Cox data and register them with different fake clients.

Parameters:

n_clients (int, (optional)) – Number of clients. Defaults to 2.
ndim (int, (optional)) – Number of covariates. Defaults to 10.
split_method_kwargs (Union[dict, None]) – The arguments passed to the uniform split_method. Defaults to None.
backend_type (str, (optional)) – Type of backend. Defaults to “subprocess”.
data_path (str, (optional)) – Path to save the data. Defaults to None.
seed (int, (optional)) – Random seed. Defaults to 42.
n_per_client (int, (optional)) – Number of samples per client. Defaults to 200.
add_treated (bool, (optional)) – Whether or not to keep the treated column. Defaults to False.
ncategorical (int, (optional)) – Number of features to make categorical a posteriori (moving away from Cox assumptions). Defaults to 0.
urls (None | list) –
tokens (None | list) –

split_control_over_centers(df, n_clients, treatment_info='treatment_allocation', use_random=True, seed=42)¶

Split patients in the control group over the centers.

Parameters:

df (pandas.DataFrame,) – Dataframe containing features of the patients.
n_clients (int,) – Number of clients.
treatment_info (str, (optional)) – Column name for the treatment allocation covariate. Defaults to “treatment_allocation”.
use_random (bool) – Whether or not to shuffle the control group indices before splitting.
seed (int) – The seed of the shuffling.
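As a rough illustration of this kind of split, here is a minimal sketch. It is not fedeca's implementation: the function name `split_control_sketch`, the assumption that treated patients are encoded as 1, the choice to place them on the first client, and the use of positional row indices are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def split_control_sketch(df, n_clients, treatment_info="treatment_allocation",
                         use_random=True, seed=42):
    """Hypothetical helper: treated rows go to client 0, controls to the others."""
    if n_clients < 2:
        raise ValueError("At least two clients are needed for this split.")
    # Assumption: the treatment column is binary, with 1 meaning treated.
    is_treated = df[treatment_info].to_numpy() == 1
    treated_idx = np.flatnonzero(is_treated)
    control_idx = np.flatnonzero(~is_treated)
    if use_random:
        # Shuffle only the control indices, reproducibly.
        control_idx = np.random.default_rng(seed).permutation(control_idx)
    # Spread the controls as evenly as possible over the remaining clients.
    control_chunks = np.array_split(control_idx, n_clients - 1)
    return [treated_idx.tolist()] + [chunk.tolist() for chunk in control_chunks]


toy = pd.DataFrame({"treatment_allocation": [1, 0, 0, 1, 0, 0, 0]})
parts = split_control_sketch(toy, n_clients=3)
```

Here `parts` is a list with one list of row positions per client, which matches the list[list[int]] shape that split callables are documented to return.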

split_dataframe_across_clients(df, n_clients, split_method='uniform', split_method_kwargs=None, backend_type='subprocess', data_path=None, urls=[], tokens=[])¶

Split patients over the centers.

Parameters:

df (pandas.DataFrame,) – Dataframe containing features of the patients.
n_clients (int,) – Number of clients.
split_method (Union[Callable, str]) – How to split the dataset across all clients. If callable, it should have the signature df, n_clients, kwargs -> list[list[int]]. If str, it should be an existing key, which invokes the corresponding callable. Possible values are uniform, which splits the patients uniformly across centers, or split_control_over_centers, where one center has all the treated patients and the control group is split over the remaining ones.
split_method_kwargs (Union[dict, None]) – Optional kwargs for the split_method method.
backend_type (str, (optional)) – Backend type. Default is “subprocess”.
data_path (Union[str, None],) – Path where the data is saved on disk.
urls (List,) – List of urls.
tokens (List,) – List of tokens.

uniform_split(df, n_clients, use_random=True, seed=42)¶

Split patients uniformly over n_clients.

Parameters:

df (pandas.DataFrame,) – Dataframe containing features of the patients.
n_clients (int,) – Number of clients.
use_random (bool) – Whether or not to shuffle data before splitting. Defaults to True.
seed (int, (optional)) – Seed for the shuffling. Defaults to 42.
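A uniform split of this kind can be sketched in a few lines. This is only an illustration under stated assumptions, not the library's code: the name `uniform_split_sketch` is hypothetical, and returning positional row indices as plain Python lists is an assumption.

```python
import numpy as np
import pandas as pd


def uniform_split_sketch(df, n_clients, use_random=True, seed=42):
    """Hypothetical helper: near-equal chunks of row positions, one per client."""
    indices = np.arange(len(df))
    if use_random:
        # Reproducible shuffle before cutting the index range into chunks.
        indices = np.random.default_rng(seed).permutation(indices)
    # array_split tolerates sizes that do not divide evenly.
    return [chunk.tolist() for chunk in np.array_split(indices, n_clients)]


toy_df = pd.DataFrame({"x": range(10)})
uniform_parts = uniform_split_sketch(toy_df, n_clients=3)
```

With 10 rows and 3 clients the chunk sizes differ by at most one, and every row lands on exactly one client.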

A module containing utils to compute high-order moments using Newton’s formula.

aggregation_mean(local_means, n_local_samples)¶

Aggregate local means.

Aggregate the local means into a global mean by using the local number of samples.

Parameters:

local_means (List[Any]) – List of local means. Could be array, float, Series.
n_local_samples (List[int]) – List of number of samples used for each local mean.

Returns:

Aggregated mean, of the same type as the local means.

Return type:

Any
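The aggregation described here amounts to a sample-size-weighted average of the local means. The sketch below illustrates that idea; `aggregation_mean_sketch` is a hypothetical name and the body is an assumption about the arithmetic, not the library's source.

```python
import numpy as np


def aggregation_mean_sketch(local_means, n_local_samples):
    """Hypothetical helper: weight each local mean by its sample count."""
    total = sum(n_local_samples)
    weighted_sum = sum(m * n for m, n in zip(local_means, n_local_samples))
    return weighted_sum / total


client_a = np.array([1.0, 2.0, 3.0])
client_b = np.array([10.0, 20.0])
pooled_mean = aggregation_mean_sketch(
    [client_a.mean(), client_b.mean()], [len(client_a), len(client_b)]
)
```

The result equals the mean of the concatenated data, which is why only the local means and the local sample counts need to be shared.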

compute_centered_moment(uncentered_moments)¶

Compute the centered moment of order k.

Given a list of the first k uncentered moments, compute the centered moment of order k. For high values of the moments the results can differ from scipy.stats.moment. We are interested in computing:

\begin{aligned}
\hat{\mu}_k &= \frac{1}{\hat{\sigma}^k}
    \mathbb{E}_Z \left[ (Z - \hat{\mu})^k \right] \\
&= \frac{1}{\hat{\sigma}^k}
    \mathbb{E}_Z \left[ \sum_{l=0}^k \binom{k}{l} Z^{k-l} (-1)^l \hat{\mu}^l \right] \\
&= \frac{1}{\hat{\sigma}^k}
    \sum_{l=0}^k (-1)^l \binom{k}{l} \mathbb{E}_Z \left[ Z^{k-l} \right]
    \mathbb{E}_Z \left[ Z \right]^l
\end{aligned}

Thus we only need the list of uncentered moments up to order k.

Parameters:

uncentered_moments (List[Any]) – List of the first k uncentered moments.

Returns:

The centered k-th moment.

Return type:

Any
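The binomial expansion can be checked numerically with a short sketch. The function below is hypothetical and is not the library's implementation: it assumes the input list holds E[Z^1], ..., E[Z^k] in that order, and it deliberately leaves out the 1/σ^k normalization so that the result can be compared with a directly computed centered moment.

```python
from math import comb

import numpy as np


def centered_moment_sketch(uncentered_moments):
    """Hypothetical helper: k-th centered moment from E[Z^1], ..., E[Z^k]."""
    k = len(uncentered_moments)
    # Prepend E[Z^0] = 1 so that raw[j] is the uncentered moment of order j.
    raw = [1.0] + list(uncentered_moments)
    mean = raw[1]
    return sum(
        (-1) ** ell * comb(k, ell) * raw[k - ell] * mean**ell
        for ell in range(k + 1)
    )


rng = np.random.default_rng(0)
z = rng.normal(loc=1.0, scale=2.0, size=1000)
raw_moments = [np.mean(z**j) for j in range(1, 4)]
third_centered = centered_moment_sketch(raw_moments)
direct = np.mean((z - z.mean()) ** 3)
```

Both `third_centered` and `direct` estimate the same quantity; dividing either by the standard deviation raised to the power k would give the normalized version shown in the formula.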

compute_global_moments(shared_states)¶

Aggregate local moments.

Parameters:

shared_states (list) – List of outputs from compute_uncentered_moment.

Returns:

The results of the aggregation with both centered and uncentered moments.

Return type:

dict

compute_uncentered_moment(data, order, weights=None)¶

Compute the uncentered moment.

Parameters:

data (pd.DataFrame, np.array) – Data for which the moment is computed.
order (int) – Order of the moment.
weights (np.array) – Weights for the aggregation. Defaults to None.

Returns:

Uncentered moment of the requested order.

Return type:

pd.DataFrame, np.array

Raises:

NotImplementedError – Raised if the data is neither a pd.DataFrame nor an np.ndarray.
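Taken together, the moment helpers suggest a simple federated workflow: each client computes its local uncentered moments, and a server pools them using the local sample counts. The sketch below illustrates that workflow under stated assumptions. The names `local_uncentered_moments` and `pool_moments` are hypothetical, the shared states are represented as plain lists rather than whatever structure the library exchanges, and the `weights` argument is not modeled.

```python
import numpy as np
import pandas as pd


def local_uncentered_moments(df, max_order):
    """Hypothetical helper: local E[Z^j] for j = 1..max_order, per column."""
    return [(df**order).mean() for order in range(1, max_order + 1)]


def pool_moments(per_client_moments, n_samples):
    """Hypothetical helper: sample-weighted average of the local moments."""
    total = sum(n_samples)
    max_order = len(per_client_moments[0])
    return [
        sum(m[j] * n for m, n in zip(per_client_moments, n_samples)) / total
        for j in range(max_order)
    ]


rng = np.random.default_rng(1)
clients = [pd.DataFrame({"age": rng.normal(60, 10, size=n)}) for n in (50, 80)]
pooled = pool_moments(
    [local_uncentered_moments(c, max_order=2) for c in clients],
    [len(c) for c in clients],
)
reference = pd.concat(clients)
```

The pooled moments match those computed on the concatenated data, so no client has to share individual rows.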

Utility functions for Substra.

class Experiment(strategies, num_rounds_list, ds_client=None, train_data_nodes=None, test_data_nodes=None, aggregation_node=None, evaluation_frequency=None, experiment_folder='./experiments', clean_models=False, fedeca_path=None, algo_dependencies=None, partner_client=None)¶

Bases: object

Experiment class.

Parameters:

strategies (list) –
num_rounds_list (list[int]) –
ds_client (Client | None) –
train_data_nodes (list[substrafl.nodes.train_data_node.TrainDataNode] | None) –
test_data_nodes (list[substrafl.nodes.test_data_node.TestDataNode] | None) –
aggregation_node (AggregationNode | None) –
evaluation_frequency (int | None) –
experiment_folder (str) –
clean_models (bool) –
fedeca_path (str | None) –
algo_dependencies (list | None) –
partner_client (Client | None) –

fit(data, nb_clients=None, split_method='uniform', split_method_kwargs=None, data_path=None, backend_type='subprocess', urls=None, tokens=None)¶

Fit strategies on global data split across clients.

For testing, we use the test_data_nodes from init if provided; otherwise we fall back to the train_data_nodes, in which case train=test.

Parameters:

data (pd.DataFrame) – The global data to be split has to be a dataframe as we only support one opener type.
nb_clients (Union[int, None], optional) – The number of clients used to split data across, by default None
split_method (Union[Callable, None], optional) – How to split data across the nb_clients, by default “uniform”.
split_method_kwargs (Union[Callable, None], optional) – Argument of the function used to split data, by default None.
data_path (Union[str, None]) – Where to store the data on disk when backend is not remote.
backend_type (str) – The backend to use for substra. Can be either: [“subprocess”, “docker”, “remote”]. Defaults to “subprocess”.
urls (Union[list[str], None]) – Urls corresponding to clients API if using remote backend_type. Defaults to None.
tokens (Union[list[str], None]) – Tokens necessary to authenticate each client API if backend_type is remote. Defaults to None.

get_outmodel(task_name, strategy_idx=0, idx_task=0)¶

Get the output model.

Parameters:

task_name (str) – Name of the task.
strategy_idx (int, optional) – Index of the strategy, by default 0.
idx_task (int, optional) – Index of the task, by default 0.

reset_experiment()¶

Reset the state of the object.

So it can be fit with a new dataset.

run(num_strategies_to_run=None)¶

Run the experiment.

Parameters:

num_strategies_to_run (int, optional) – Number of strategies to run, by default None.

class SubstraflTorchDataset(data_from_opener, is_inference, target_columns=None, columns_to_drop=None, fit_cols=None, dtype='float64', return_torch_tensors=False)¶

Bases: Dataset

Substra torch dataset class.

Parameters:

is_inference (bool) –
target_columns (list | None) –
columns_to_drop (list | None) –
fit_cols (list | None) –

download_train_task_models_by_round(client, dest_folder, compute_plan_key, round_idx)¶

Download models associated with a specific round of a train task.

get_outmodel_function(task_name, client, compute_plan_key=None, idx_task=0, tasks_dict={})¶

Retrieve an output model from a task or tasks_dict.

get_simu_state_from_round(simu_memory, client_id, round_idx=None)¶

Get the simulation state from a specific round and client.

Parameters:

simu_memory (SimuStatesMemory) – Simulation memory.
client_id (str) – Client ID.
round_idx (Optional[int], optional) – Round index, by default None.

make_accuracy_function(treatment_col)¶

Build accuracy function.

Parameters:

treatment_col (str,) – Column name for the treatment allocation.

make_c_index_function(duration_col, event_col)¶

Build C-index function.

Parameters:

duration_col (str,) – Column name for the duration.
event_col (str,) – Column name for the event.

make_substrafl_torch_dataset_class(target_cols, event_col, duration_col, fit_cols=None, dtype='float64', return_torch_tensors=False, client_identifier=None)¶

Create a custom SubstraflTorchDataset class for survival analysis.

Parameters:

target_cols (list) – List of target columns.
event_col (str) – Name of the event column.
duration_col (str) – Name of the duration column.
fit_cols (Union[list, None], optional) – List of columns to fit on, by default None will use all columns that can be cast to numeric except target_columns.
dtype (str, optional) – Data type, by default “float64”.
return_torch_tensors (bool, optional) – Returns torch.Tensor. Defaults to False.
client_identifier (Union[None, str], optional) – Name of the column that identifies the client and that is to be dropped. By default assumes there is no client identifier.

Returns:

Custom SubstraflTorchDataset class.

Return type:

type

Functions for tensors comparison.

compare_tensors_lists(tensor_list_a, tensor_list_b, rtol=1e-05, atol=1e-08)¶

Compare list of tensors up to a certain precision.

The criterion checked is |x - y| <= |y| * rtol + atol, so there are two terms to consider: the first is relative (rtol) and the second is absolute (atol). The default for atol is a bit low for float32 tensors. We keep the defaults everywhere to be safe, except in the tests that compare computed gradients with theory, where we raise atol to 1e-6. This makes sense in that case because it matches the expected precision for slightly different float32 ops that should theoretically give the exact same result.

Parameters:

tensor_list_a (list) – a list of tensors
tensor_list_b (list) – a list of tensors
atol (float, optional) – absolute difference tolerance for tensor-to-tensor comparison. Defaults to 1e-8.
rtol (float, optional) – relative difference tolerance for tensor-to-tensor comparison. Defaults to 1e-5.
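To see how the two tolerance terms interact, the criterion can be reproduced on NumPy arrays. This is a sketch under stated assumptions rather than the library's function: `compare_arrays_lists_sketch` is a hypothetical name, it works on NumPy arrays instead of torch tensors, and returning a boolean (rather than raising) is an assumption.

```python
import numpy as np


def compare_arrays_lists_sketch(list_a, list_b, rtol=1e-5, atol=1e-8):
    """Hypothetical helper: elementwise |x - y| <= |y| * rtol + atol on each pair."""
    if len(list_a) != len(list_b):
        return False
    return all(
        bool(np.all(np.abs(x - y) <= np.abs(y) * rtol + atol))
        for x, y in zip(list_a, list_b)
    )


# Values close to zero: the relative term is negligible, so atol decides.
a = [np.zeros(2, dtype=np.float32)]
b = [np.full(2, 5e-7, dtype=np.float32)]
strict = compare_arrays_lists_sketch(a, b)
relaxed = compare_arrays_lists_sketch(a, b, atol=1e-6)
```

With the default atol the small float32 discrepancy near zero is rejected, while raising atol to 1e-6 accepts it, which is the situation described for the gradient tests.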

Module defining type aliases.