fedeca.utils
Utility functions for data generation.
- generate_cox_data_and_substra_clients(n_clients=2, ndim=10, split_method_kwargs=None, backend_type='subprocess', data_path=None, urls=None, tokens=None, seed=42, n_per_client=200, add_treated=False, ncategorical=0)
Generate Cox data on disk for several clients.
Generate Cox data and register them with different fake clients.
- Parameters:
n_clients (int, optional) – Number of clients. Defaults to 2.
ndim (int, optional) – Number of covariates. Defaults to 10.
split_method_kwargs (Union[dict, None]) – The argument to the split_method uniform.
backend_type (str, optional) – Type of backend. Defaults to “subprocess”.
data_path (str, optional) – Path to save the data. Defaults to None.
seed (int, optional) – Random seed. Defaults to 42.
n_per_client (int, optional) – Number of samples per client. Defaults to 200.
add_treated (bool, optional) – Whether or not to keep treated column.
ncategorical (int, optional) – Number of features to make categorical a posteriori (moving away from Cox assumptions).
urls (None | list) –
tokens (None | list) –
- split_control_over_centers(df, n_clients, treatment_info='treatment_allocation', use_random=True, seed=42)
Split patients in the control group over the centers.
- Parameters:
df (pandas.DataFrame) – Dataframe containing features of the patients.
n_clients (int) – Number of clients.
treatment_info (str, optional) – Column name for the treatment allocation covariate. Defaults to “treatment_allocation”.
use_random (bool) – Whether or not to shuffle the control group indices before splitting.
seed (int) – The seed of the shuffling.
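The splitting idea can be illustrated with a minimal sketch. This is not the fedeca implementation: the helper name `split_control_sketch`, the coding of treated patients as 1, and the choice to give all treated patients to the first client are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def split_control_sketch(df, n_clients, treatment_info="treatment_allocation",
                         use_random=True, seed=42):
    """Illustrative only: treated patients on one client, controls on the rest."""
    if n_clients < 2:
        raise ValueError("Need one treated center and at least one control center.")
    # Assumption: treated patients are coded as 1 in the treatment column.
    treated = df.index[df[treatment_info] == 1].to_numpy()
    control = df.index[df[treatment_info] != 1].to_numpy()
    if use_random:
        # Shuffle the control indices reproducibly before splitting.
        rng = np.random.default_rng(seed)
        control = rng.permutation(control)
    # Assumption: the first client receives all treated patients,
    # and the controls are divided over the remaining clients.
    control_chunks = np.array_split(control, n_clients - 1)
    return [treated.tolist()] + [chunk.tolist() for chunk in control_chunks]


df = pd.DataFrame({"treatment_allocation": [1, 0, 0, 1, 0, 0, 0]})
indices = split_control_sketch(df, n_clients=3)
```

Here `indices[0]` holds the treated rows and the two remaining lists partition the control rows.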
- split_dataframe_across_clients(df, n_clients, split_method='uniform', split_method_kwargs=None, backend_type='subprocess', data_path=None, urls=[], tokens=[])
Split patients over the centers.
- Parameters:
df (pandas.DataFrame) – Dataframe containing features of the patients.
n_clients (int) – Number of clients.
split_method (Union[Callable, str]) – How to split the dataset across all clients. If callable, it should have the signature df, n_clients, kwargs -> list[list[int]]. If str, it should be an existing key, which will invoke the corresponding callable. Possible values are uniform, which splits the patients uniformly across centers, or split_control_over_centers, where one center has all the treated patients and the control is split over the remaining ones.
split_method_kwargs (Union[dict, None]) – Optional kwargs for the split_method method.
backend_type (str, optional) – Backend type. Default is “subprocess”.
data_path (Union[str, None]) – Path where the data is saved on disk.
urls (List) – List of urls.
tokens (List) – List of tokens.
- uniform_split(df, n_clients, use_random=True, seed=42)
Split patients uniformly over n_clients.
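As a rough sketch of what a uniform split callable returning list[list[int]] might look like, the snippet below shuffles positional row indices and cuts them into near-equal chunks. The name `uniform_split_sketch` and the use of positional indices are assumptions, not the library's actual code.

```python
import numpy as np
import pandas as pd


def uniform_split_sketch(df, n_clients, use_random=True, seed=42):
    """Illustrative only: spread row positions evenly over n_clients."""
    indices = np.arange(len(df))
    if use_random:
        # Reproducible shuffle before cutting into chunks.
        rng = np.random.default_rng(seed)
        indices = rng.permutation(indices)
    # array_split tolerates lengths that are not a multiple of n_clients.
    return [chunk.tolist() for chunk in np.array_split(indices, n_clients)]


df = pd.DataFrame({"x": range(10)})
parts = uniform_split_sketch(df, n_clients=3)
sizes = [len(part) for part in parts]
```

With 10 rows and 3 clients, the chunk sizes differ by at most one and every row is assigned exactly once.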
A module containing utils to compute high-order moments using Newton’s formula.
- aggregation_mean(local_means, n_local_samples)
Aggregate local means.
Aggregate the local means into a global mean by using the local number of samples.
- Parameters:
local_means (List[Any]) – List of local means. Could be array, float, Series.
n_local_samples (List[int]) – List of number of samples used for each local mean.
- Returns:
Aggregated mean, of the same type as the local means.
- Return type:
Any
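The aggregation described above amounts to a sample-size-weighted average. A minimal sketch, assuming the global mean is the sum of n_i times mean_i divided by the total sample count (the helper name `aggregate_means_sketch` is hypothetical):

```python
import numpy as np


def aggregate_means_sketch(local_means, n_local_samples):
    """Illustrative only: weight each local mean by its sample count."""
    total = sum(n_local_samples)
    weighted_sum = sum(n * m for m, n in zip(local_means, n_local_samples))
    return weighted_sum / total


center_a = np.array([1.0, 2.0, 3.0])
center_b = np.array([10.0])
global_mean = aggregate_means_sketch(
    [center_a.mean(), center_b.mean()], [len(center_a), len(center_b)]
)
pooled_mean = np.concatenate([center_a, center_b]).mean()
```

The weighted result matches the mean of the pooled data, which is the property that makes this aggregation usable without sharing raw samples.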
- compute_centered_moment(uncentered_moments)
Compute the centered moment of order k.
Given a list of the k first unnormalized moments, compute the centered moment of order k. For high values of the moments the results can differ from scipy.stats.moment. We are interested in computing

.. math::

   \hat{\mu}_k = \frac{1}{\hat{\sigma}^k} \mathbb E_Z \left[ (Z - \hat{\mu})^k \right]

   \hat{\mu}_k = \frac{1}{\hat{\sigma}^k} \mathbb E_Z \left[ \sum_{l=0}^k \binom{k}{l} Z^{k-l} (-1)^l \hat{\mu}^l \right]

   \hat{\mu}_k = \frac{1}{\hat{\sigma}^k} \sum_{l=0}^k (-1)^l \binom{k}{l} \mathbb E_Z \left[ Z^{k-l} \right] \mathbb E_Z \left[ Z \right]^l

thus we only need the list of uncentered moments up to order k.
- Parameters:
uncentered_moments (List[Any]) – List of the k first non-centered moments.
- Returns:
The centered k-th moment.
- Return type:
Any
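The binomial expansion above can be checked numerically with a short sketch. It assumes the input list is [E[Z], E[Z^2], ..., E[Z^k]] with k at least 2, and that the standard deviation is the population one derived from the first two moments. The helper name `centered_moment_sketch` is hypothetical and this is not the library's code.

```python
from math import comb

import numpy as np


def centered_moment_sketch(uncentered_moments):
    """Illustrative only: standardized centered moment from raw moments."""
    k = len(uncentered_moments)
    if k < 2:
        raise ValueError("At least the first two moments are needed for sigma.")
    # Prepend E[Z^0] = 1 so that moments[j] == E[Z^j].
    moments = [1.0] + list(uncentered_moments)
    mean = moments[1]
    # Population standard deviation from the first two raw moments.
    sigma = np.sqrt(moments[2] - mean**2)
    # sum_l (-1)^l * C(k, l) * E[Z^(k-l)] * E[Z]^l
    total = sum(
        (-1) ** ell * comb(k, ell) * moments[k - ell] * mean**ell
        for ell in range(k + 1)
    )
    return total / sigma**k


z = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
k = 3
uncentered = [np.mean(z**j) for j in range(1, k + 1)]
result = centered_moment_sketch(uncentered)
# Direct computation on the raw data, for comparison.
direct = np.mean((z - z.mean()) ** k) / z.std() ** k
```

The expansion and the direct computation agree up to floating-point error on small examples like this one. For large k the alternating sum can lose precision.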
- compute_global_moments(shared_states)
Aggregate local moments.
- compute_uncentered_moment(data, order, weights=None)
Compute the uncentered moment.
- Parameters:
data (pd.DataFrame, np.array) – dataframe.
order (int) – order of the moment.
weights (np.array) – weight for the aggregation.
- Returns:
Moment of order k.
- Return type:
pd.DataFrame, np.array
- Raises:
NotImplementedError – Raised if the data type is neither a DataFrame nor an np.ndarray.
Utility functions for Substra.
- class Experiment(strategies, num_rounds_list, ds_client=None, train_data_nodes=None, test_data_nodes=None, aggregation_node=None, evaluation_frequency=None, experiment_folder='./experiments', clean_models=False, fedeca_path=None, algo_dependencies=None, partner_client=None)
Bases: object
Experiment class.
- Parameters:
strategies (list) –
ds_client (Client | None) –
train_data_nodes (list[substrafl.nodes.train_data_node.TrainDataNode] | None) –
test_data_nodes (list[substrafl.nodes.test_data_node.TestDataNode] | None) –
aggregation_node (AggregationNode | None) –
evaluation_frequency (int | None) –
experiment_folder (str) –
clean_models (bool) –
fedeca_path (str | None) –
algo_dependencies (list | None) –
partner_client (Client | None) –
- fit(data, nb_clients=None, split_method='uniform', split_method_kwargs=None, data_path=None, backend_type='subprocess', urls=None, tokens=None)
Fit strategies on global data split across clients.
For testing, we use the test_data_nodes passed at initialization if provided. Otherwise we use the train_data_nodes, in which case train=test.
- Parameters:
data (pd.DataFrame) – The global data to be split. It has to be a dataframe, as we only support one opener type.
nb_clients (Union[int, None], optional) – The number of clients used to split data across, by default None.
split_method (Union[Callable, None], optional) – How to split data across the nb_clients, by default “uniform”.
split_method_kwargs (Union[Callable, None], optional) – Argument of the function used to split data, by default None.
data_path (Union[str, None]) – Where to store the data on disk when backend is not remote.
backend_type (str) – The backend to use for substra. Can be one of: [“subprocess”, “docker”, “remote”]. Defaults to “subprocess”.
urls (Union[list[str], None]) – URLs corresponding to clients API if using remote backend_type. Defaults to None.
tokens (Union[list[str], None]) – Tokens necessary to authenticate each client API if backend_type is remote. Defaults to None.
- get_outmodel(task_name, strategy_idx=0, idx_task=0)
Get the output model.
- reset_experiment()
Reset the state of the object so that it can be fit with a new dataset.
- class SubstraflTorchDataset(data_from_opener, is_inference, target_columns=None, columns_to_drop=None, fit_cols=None, dtype='float64', return_torch_tensors=False)
Bases: Dataset
Substra torch dataset class.
- download_train_task_models_by_round(client, dest_folder, compute_plan_key, round_idx)
Download models associated with a specific round of a train task.
- get_outmodel_function(task_name, client, compute_plan_key=None, idx_task=0, tasks_dict={})
Retrieve an output model from a task or tasks_dict.
- get_simu_state_from_round(simu_memory, client_id, round_idx=None)
Get the simulation state from a specific round and client.
- Parameters:
simu_memory (SimuStatesMemory) – Simulation memory.
client_id (str) – Client ID.
round_idx (Optional[int], optional) – Round index, by default None.
- make_accuracy_function(treatment_col)
Build accuracy function.
- Parameters:
treatment_col (str) – Column name for the treatment allocation.
- make_c_index_function(duration_col, event_col)
Build C-index function.
- Parameters:
duration_col (str) – Column name for the duration.
event_col (str) – Column name for the event.
- make_substrafl_torch_dataset_class(target_cols, event_col, duration_col, fit_cols=None, dtype='float64', return_torch_tensors=False, client_identifier=None)
Create a custom SubstraflTorchDataset class for survival analysis.
- Parameters:
target_cols (list) – List of target columns.
event_col (str) – Name of the event column.
duration_col (str) – Name of the duration column.
fit_cols (Union[list, None], optional) – List of columns to fit on, by default None will use all columns that can be cast to numeric except target_columns.
dtype (str, optional) – Data type, by default “float64”.
return_torch_tensors (bool, optional) – Returns torch.Tensor. Defaults to False.
client_identifier (Union[None, str], optional) – Name of the column that identifies the client and that is to be dropped. By default assumes there is no client identifier.
- Returns:
Custom SubstraflTorchDataset class.
- Return type:
Functions for tensor comparison.
- compare_tensors_lists(tensor_list_a, tensor_list_b, rtol=1e-05, atol=1e-08)
Compare list of tensors up to a certain precision.
The criterion checked is the following: |x - y| <= |y| * rtol + atol. So there are two terms to consider. The first one is relative (rtol) and the second is absolute (atol). The default for atol is a bit low for float32 tensors. We keep the defaults everywhere to be safe, except in the tests comparing computed gradients with theory, where we raise atol to 1e-6. It makes sense in this case because it matches the expected precision for slightly different float32 ops that should theoretically give the exact same result.
- Parameters:
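To make the tolerance criterion concrete, here is a small sketch that applies |x - y| <= |y| * rtol + atol element-wise. It uses NumPy arrays as stand-ins for tensors, and the helper names `is_close_sketch` and `compare_lists_sketch` are hypothetical, not part of fedeca.

```python
import numpy as np


def is_close_sketch(x, y, rtol=1e-05, atol=1e-08):
    """Illustrative only: element-wise |x - y| <= |y| * rtol + atol."""
    x, y = np.asarray(x), np.asarray(y)
    return bool(np.all(np.abs(x - y) <= np.abs(y) * rtol + atol))


def compare_lists_sketch(list_a, list_b, rtol=1e-05, atol=1e-08):
    """Illustrative only: every pair of arrays must satisfy the criterion."""
    return len(list_a) == len(list_b) and all(
        is_close_sketch(a, b, rtol, atol) for a, b in zip(list_a, list_b)
    )


a = [np.array([1.0, 0.0])]
b = [np.array([1.0, 5e-7])]
# Near zero the relative term vanishes, so only atol matters.
strict = compare_lists_sketch(a, b)
relaxed = compare_lists_sketch(a, b, atol=1e-6)
```

The difference of 5e-7 near zero fails with the default atol of 1e-8 but passes once atol is raised to 1e-6, which is the situation the description refers to for float32 operations.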
Module defining type aliases.