fedeca.utils

Utility functions for data generation.

generate_cox_data_and_substra_clients(n_clients=2, ndim=10, split_method_kwargs=None, backend_type='subprocess', data_path=None, urls=None, tokens=None, seed=42, n_per_client=200, add_treated=False, ncategorical=0)

Generate Cox data on disk for several clients.

Generate Cox data and register them with different fake clients.

Parameters:
  • n_clients (int, (optional)) – Number of clients. Defaults to 2.

  • ndim (int, (optional)) – Number of covariates. Defaults to 10.

  • split_method_kwargs (Union[dict, None]) – The arguments to the split_method uniform. Defaults to None.

  • backend_type (str, (optional)) – Type of backend. Defaults to “subprocess”.

  • data_path (str, (optional)) – Path to save the data. Defaults to None.

  • seed (int, (optional)) – Random seed. Defaults to 42.

  • n_per_client (int, (optional)) – Number of samples per client. Defaults to 200.

  • add_treated (bool, (optional)) – Whether or not to keep the treated column. Defaults to False.

  • ncategorical (int, (optional)) – Number of features to make categorical a posteriori (moving away from Cox assumptions). Defaults to 0.

  • urls (None | list) –

  • tokens (None | list) –

split_control_over_centers(df, n_clients, treatment_info='treatment_allocation', use_random=True, seed=42)

Split patients in the control group over the centers.

Parameters:
  • df (pandas.DataFrame) – Dataframe containing features of the patients.

  • n_clients (int) – Number of clients.

  • treatment_info (str, (optional)) – Column name for the treatment allocation covariate. Defaults to “treatment_allocation”.

  • use_random (bool) – Whether or not to shuffle the control group indices before splitting. Defaults to True.

  • seed (int) – The seed used for the shuffling. Defaults to 42.
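
As an illustration of the idea (one center receives all treated patients and the control group is spread over the remaining centers), here is a minimal sketch in plain Python. It is not the fedeca implementation: the name split_control_sketch, the use of a plain list of 0/1 treatment flags instead of a DataFrame column, the choice of the first center for the treated patients, and the round-robin assignment are all assumptions made for this example.

```python
import random


def split_control_sketch(treatment, n_clients, use_random=True, seed=42):
    """Hypothetical sketch: return one list of row indices per center.

    `treatment` is a list of 0/1 flags standing in for the treatment column.
    Assumes n_clients >= 2.
    """
    treated = [i for i, t in enumerate(treatment) if t == 1]
    control = [i for i, t in enumerate(treatment) if t != 1]
    if use_random:
        # Seeded local generator so the shuffle is reproducible.
        random.Random(seed).shuffle(control)
    n_control_centers = n_clients - 1
    # Round-robin assignment of control patients to the remaining centers.
    control_splits = [control[k::n_control_centers] for k in range(n_control_centers)]
    # Assumption: the treated patients all go to the first center.
    return [treated] + control_splits


treatment = [1, 0, 1, 0, 0, 0, 1, 0]
splits = split_control_sketch(treatment, n_clients=3)
```

Here splits[0] holds the indices of the treated patients, while splits[1] and splits[2] partition the control patients.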

split_dataframe_across_clients(df, n_clients, split_method='uniform', split_method_kwargs=None, backend_type='subprocess', data_path=None, urls=[], tokens=[])

Split patients over the centers.

Parameters:
  • df (pandas.DataFrame) – Dataframe containing features of the patients.

  • n_clients (int) – Number of clients.

  • split_method (Union[Callable, str]) – How to split the dataset across clients. If callable, it should have the signature df, n_clients, kwargs -> list[list[int]]. If str, it should be an existing key, which invokes the corresponding callable. Possible values are uniform, which splits the patients uniformly across centers, and split_control_over_centers, where one center has all the treated patients and the control group is split over the remaining ones. Defaults to “uniform”.

  • split_method_kwargs (Union[dict, None]) – Optional kwargs for the split_method method.

  • backend_type (str, (optional)) – Backend type. Default is “subprocess”.

  • data_path (Union[str, None]) – Path where the data is saved on disk.

  • urls (List) – List of urls.

  • tokens (List) – List of tokens.

uniform_split(df, n_clients, use_random=True, seed=42)

Split patients uniformly over n_clients.

Parameters:
  • df (pandas.DataFrame) – Dataframe containing features of the patients.

  • n_clients (int) – Number of clients.

  • use_random (bool) – Whether or not to shuffle data before splitting. Defaults to True.

  • seed (int, (optional)) – Seed for the shuffling. Defaults to 42.
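
To make the behaviour concrete, the following sketch shows one way a uniform split could work. It is an assumption-laden illustration rather than the library code: uniform_split_sketch is a hypothetical name, it takes a number of rows instead of a DataFrame, and the way the remainder is distributed (the first centers get one extra row) is a choice made for this example.

```python
import random


def uniform_split_sketch(n_samples, n_clients, use_random=True, seed=42):
    """Hypothetical sketch: return one list of row indices per client."""
    indices = list(range(n_samples))
    if use_random:
        # Seeded local generator so the shuffle is reproducible.
        random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_clients)
    splits, start = [], 0
    for client in range(n_clients):
        # Assumption: the first `extra` clients receive one additional row.
        size = base + (1 if client < extra else 0)
        splits.append(indices[start:start + size])
        start += size
    return splits


splits = uniform_split_sketch(n_samples=10, n_clients=3)
```

With a DataFrame df, each index list could then be used to select rows, for example with df.iloc[indices].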

A module containing utils to compute high-order moments using Newton’s formula.

aggregation_mean(local_means, n_local_samples)

Aggregate local means.

Aggregate the local means into a global mean by using the local number of samples.

Parameters:
  • local_means (List[Any]) – List of local means. Could be array, float, Series.

  • n_local_samples (List[int]) – List of number of samples used for each local mean.

Returns:

Aggregated mean. Same type as the local means.

Return type:

Any
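
The aggregation described above amounts to a sample-size-weighted average. A minimal sketch, assuming float inputs (the name aggregation_mean_sketch is hypothetical and this is not the library code):

```python
def aggregation_mean_sketch(local_means, n_local_samples):
    """Hypothetical sketch: weight each local mean by its sample count."""
    total = sum(n_local_samples)
    weighted_sum = sum(m * n for m, n in zip(local_means, n_local_samples))
    return weighted_sum / total


# Center A: mean 1.0 over 1 sample; center B: mean 4.0 over 2 samples.
global_mean = aggregation_mean_sketch([1.0, 4.0], [1, 2])
```

The result equals the mean that would be obtained by pooling all samples, which is why the local sample counts are needed in addition to the local means.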

compute_centered_moment(uncentered_moments)

Compute the centered moment of order k.

Given a list of the first k unnormalized moments, compute the centered moment of order k. For high values of the moments the results can differ from scipy.stats.moment. We are interested in computing

\begin{aligned}
\hat{\mu}_k &= \frac{1}{\hat{\sigma}^k} \mathbb{E}_Z \left[ (Z - \hat{\mu})^k \right] \\
            &= \frac{1}{\hat{\sigma}^k} \mathbb{E}_Z \left[ \sum_{l=0}^k \binom{k}{l} Z^{k-l} (-1)^l \hat{\mu}^l \right] \\
            &= \frac{1}{\hat{\sigma}^k} \sum_{l=0}^k (-1)^l \binom{k}{l} \mathbb{E}_Z \left[ Z^{k-l} \right] \mathbb{E}_Z \left[ Z \right]^l
\end{aligned}

where the last line uses the linearity of the expectation and \hat{\mu} = \mathbb{E}_Z[Z]. Thus we only need the list of uncentered moments up to order k.

Parameters:

uncentered_moments (List[Any]) – List of the first k uncentered moments.

Returns:

The centered k-th moment.

Return type:

Any
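
The binomial expansion above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library code: centered_moment_sketch is a hypothetical name, the input is assumed to be [E[Z], E[Z^2], ..., E[Z^k]] as floats, and the division by \hat{\sigma}^k is deliberately left out so that the function returns the plain centered moment.

```python
from math import comb


def centered_moment_sketch(uncentered_moments):
    """Hypothetical sketch: centered moment of order k from raw moments 1..k."""
    k = len(uncentered_moments)
    mean = uncentered_moments[0]
    # raw[j] = E[Z^j], with E[Z^0] = 1 prepended for convenience.
    raw = [1.0] + list(uncentered_moments)
    return sum(
        (-1) ** l * comb(k, l) * raw[k - l] * mean ** l
        for l in range(k + 1)
    )


# Raw moments of the sample [1, 2, 3, 4]: E[Z] = 2.5, E[Z^2] = 7.5, E[Z^3] = 25.
variance = centered_moment_sketch([2.5, 7.5])
third_moment = centered_moment_sketch([2.5, 7.5, 25.0])
```

For k = 2 the sum reduces to E[Z^2] - E[Z]^2, the usual variance formula. Because the terms alternate in sign, cancellation errors can grow with k in floating point, which is one plausible reason for small differences with other implementations.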

compute_global_moments(shared_states)

Aggregate local moments.

Parameters:

shared_states (list) – list of outputs from compute_uncentered_moment.

Returns:

The results of the aggregation with both centered and uncentered moments.

Return type:

dict
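
As a rough sketch of how such an aggregation could work, each center can share its sample count and its local uncentered moments, and the global uncentered moments are then sample-weighted averages of the local ones. The dictionary keys ("n_samples", "moments") and both function names below are hypothetical and chosen for this example only; the actual structure of shared_states may differ.

```python
def local_raw_moments(values, max_order):
    """Hypothetical local step: uncentered moments of orders 1..max_order."""
    n = len(values)
    moments = [sum(v ** j for v in values) / n for j in range(1, max_order + 1)]
    return {"n_samples": n, "moments": moments}


def aggregate_raw_moments(states):
    """Hypothetical global step: sample-weighted average of local moments."""
    total = sum(s["n_samples"] for s in states)
    max_order = len(states[0]["moments"])
    return [
        sum(s["moments"][j] * s["n_samples"] for s in states) / total
        for j in range(max_order)
    ]


center_a = local_raw_moments([1.0, 2.0], max_order=2)
center_b = local_raw_moments([3.0, 4.0], max_order=2)
global_moments = aggregate_raw_moments([center_a, center_b])
```

The aggregated values match the uncentered moments of the pooled sample [1, 2, 3, 4], without any center having to share its raw data.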

compute_uncentered_moment(data, order, weights=None)

Compute the uncentered moment.

Parameters:
  • data (pd.DataFrame, np.array) – Data on which the moment is computed.

  • order (int) – Order of the moment.

  • weights (np.array) – Weights for the aggregation. Defaults to None.

Returns:

Moment of order k.

Return type:

pd.DataFrame, np.array

Raises:

NotImplementedError – Raised if the data is neither a pd.DataFrame nor an np.ndarray.

Utility functions for Substra.

class Experiment(strategies, num_rounds_list, ds_client=None, train_data_nodes=None, test_data_nodes=None, aggregation_node=None, evaluation_frequency=None, experiment_folder='./experiments', clean_models=False, fedeca_path=None, algo_dependencies=None, partner_client=None)

Bases: object

Experiment class.

Parameters:
  • strategies (list) –

  • num_rounds_list (list[int]) –

  • ds_client (Client | None) –

  • train_data_nodes (list[substrafl.nodes.train_data_node.TrainDataNode] | None) –

  • test_data_nodes (list[substrafl.nodes.test_data_node.TestDataNode] | None) –

  • aggregation_node (AggregationNode | None) –

  • evaluation_frequency (int | None) –

  • experiment_folder (str) –

  • clean_models (bool) –

  • fedeca_path (str | None) –

  • algo_dependencies (list | None) –

  • partner_client (Client | None) –

fit(data, nb_clients=None, split_method='uniform', split_method_kwargs=None, data_path=None, backend_type='subprocess', urls=None, tokens=None)

Fit strategies on global data split across clients.

For testing, we use the test_data_nodes passed at init if provided; otherwise we use the train_data_nodes, in which case train=test.

Parameters:
  • data (pd.DataFrame) – The global data to be split has to be a dataframe as we only support one opener type.

  • nb_clients (Union[int, None], optional) – The number of clients used to split data across, by default None

  • split_method (Union[Callable, str], optional) – How to split data across the nb_clients, by default “uniform”.

  • split_method_kwargs (Union[dict, None], optional) – Arguments of the function used to split data, by default None.

  • data_path (Union[str, None]) – Where to store the data on disk when backend is not remote.

  • backend_type (str) – The backend to use for substra. Can be either: [“subprocess”, “docker”, “remote”]. Defaults to “subprocess”.

  • urls (Union[list[str], None]) – Urls corresponding to clients API if using remote backend_type. Defaults to None.

  • tokens (Union[list[str], None]) – Tokens necessary to authenticate each client API if backend_type is remote. Defaults to None.

get_outmodel(task_name, strategy_idx=0, idx_task=0)

Get the output model.

Parameters:
  • task_name (str) – Name of the task.

  • strategy_idx (int, optional) – Index of the strategy, by default 0.

  • idx_task (int, optional) – Index of the task, by default 0.

reset_experiment()

Reset the state of the object.

This allows the experiment to be fit on a new dataset.

run(num_strategies_to_run=None)

Run the experiment.

Parameters:

num_strategies_to_run (int, optional) – Number of strategies to run, by default None.

class SubstraflTorchDataset(data_from_opener, is_inference, target_columns=None, columns_to_drop=None, fit_cols=None, dtype='float64', return_torch_tensors=False)

Bases: Dataset

Substra torch dataset class.

Parameters:
  • is_inference (bool) –

  • target_columns (list | None) –

  • columns_to_drop (list | None) –

  • fit_cols (list | None) –

download_train_task_models_by_round(client, dest_folder, compute_plan_key, round_idx)

Download models associated with a specific round of a train task.

get_outmodel_function(task_name, client, compute_plan_key=None, idx_task=0, tasks_dict={})

Retrieve an output model from a task or tasks_dict.

get_simu_state_from_round(simu_memory, client_id, round_idx=None)

Get the simulation state from a specific round and client.

Parameters:
  • simu_memory (SimuStatesMemory) – Simulation memory.

  • client_id (str) – Client ID.

  • round_idx (Optional[int], optional) – Round index, by default None.

make_accuracy_function(treatment_col)

Build accuracy function.

Parameters:

treatment_col (str) – Column name for the treatment allocation.

make_c_index_function(duration_col, event_col)

Build C-index function.

Parameters:
  • duration_col (str) – Column name for the duration.

  • event_col (str) – Column name for the event.
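
The factory pattern used here (fix the column names once, get back a metric function) can be sketched as follows. This is only an illustration: make_c_index_sketch is a hypothetical name, rows are plain dictionaries standing in for DataFrame rows, the convention that a higher score means a higher risk is an assumption, and the pairwise loop is a textbook concordance index rather than the computation fedeca actually performs.

```python
def make_c_index_sketch(duration_col, event_col):
    """Hypothetical sketch: build a C-index function bound to column names."""

    def c_index(rows, risk_scores):
        concordant, comparable = 0.0, 0
        for i, row_i in enumerate(rows):
            if not row_i[event_col]:
                continue  # A censored row cannot be the earlier event of a pair.
            for j, row_j in enumerate(rows):
                if row_i[duration_col] < row_j[duration_col]:
                    comparable += 1
                    if risk_scores[i] > risk_scores[j]:
                        concordant += 1.0  # Earlier event has the higher risk.
                    elif risk_scores[i] == risk_scores[j]:
                        concordant += 0.5  # Ties count for half.
        return concordant / comparable

    return c_index


rows = [
    {"time": 1, "event": 1},
    {"time": 2, "event": 1},
    {"time": 3, "event": 0},
]
c_index = make_c_index_sketch("time", "event")
perfect = c_index(rows, [3.0, 2.0, 1.0])
inverted = c_index(rows, [1.0, 2.0, 3.0])
```

A score ordering that ranks earlier events as riskier yields the maximum value, while the reversed ordering yields the minimum.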

make_substrafl_torch_dataset_class(target_cols, event_col, duration_col, fit_cols=None, dtype='float64', return_torch_tensors=False, client_identifier=None)

Create a custom SubstraflTorchDataset class for survival analysis.

Parameters:
  • target_cols (list) – List of target columns.

  • event_col (str) – Name of the event column.

  • duration_col (str) – Name of the duration column.

  • fit_cols (Union[list, None], optional) – List of columns to fit on, by default None will use all columns that can be cast to numeric except target_columns.

  • dtype (str, optional) – Data type, by default “float64”.

  • return_torch_tensors (bool, optional) – Returns torch.Tensor. Defaults to False.

  • client_identifier (Union[None, str], optional) – Name of the column that identifies the client and that is to be dropped. By default assumes there is no client identifier.

Returns:

Custom SubstraflTorchDataset class.

Return type:

type

Functions for tensors comparison.

compare_tensors_lists(tensor_list_a, tensor_list_b, rtol=1e-05, atol=1e-08)

Compare list of tensors up to a certain precision.

The criterion that is checked is the following: |x - y| <= |y| * rtol + atol. So there are two terms to consider. The first one is relative (rtol) and the second is absolute (atol). The default for atol is a bit low for float32 tensors. We keep the defaults everywhere to be safe, except in the tests comparing computed gradients with theory, where we raise atol to 1e-6. It makes sense in this case because it matches the expected precision for slightly different float32 ops that should theoretically give the exact same result.

Parameters:
  • tensor_list_a (list) – a list of tensors

  • tensor_list_b (list) – a list of tensors

  • atol (float, optional) – absolute difference tolerance for tensor-to-tensor comparison. Defaults to 1e-8.

  • rtol (float, optional) – relative difference tolerance for tensor-to-tensor comparison. Defaults to 1e-5.
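
To see why atol matters for values close to zero, the criterion can be sketched with plain Python lists of floats standing in for tensors. The function names are hypothetical and returning a boolean is an assumption; with real tensors, torch.allclose applies the same elementwise test.

```python
def tensors_close(a, b, rtol=1e-05, atol=1e-08):
    """Elementwise check of |x - y| <= |y| * rtol + atol on two flat lists."""
    return len(a) == len(b) and all(
        abs(x - y) <= abs(y) * rtol + atol for x, y in zip(a, b)
    )


def compare_lists_sketch(list_a, list_b, rtol=1e-05, atol=1e-08):
    """Hypothetical sketch: True if every pair of 'tensors' is close."""
    return len(list_a) == len(list_b) and all(
        tensors_close(a, b, rtol, atol) for a, b in zip(list_a, list_b)
    )


reference = [[1.0, 2.0], [0.0]]
perturbed = [[1.0 + 1e-7, 2.0], [5e-7]]
strict = compare_lists_sketch(perturbed, reference)
relaxed = compare_lists_sketch(perturbed, reference, atol=1e-6)
```

When the reference value is exactly zero the relative term vanishes, so only atol decides: the 5e-7 deviation fails with the default atol but passes once atol is raised to 1e-6.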

Module defining type aliases.