fedeca.strategies

class WebDisco(algo, metric_functions=None, standardize_data=True, tol=1e-16)

Bases: Strategy

WebDisco strategy class.

It can only be used with traditional Cox models on pandas.DataFrames. This strategy is one of a kind in that respect: it only works with the linear CoxPH models defined in fedeca.utils.survival_utils. All models are therefore initialized with zeroed weights (as in lifelines), tested, and all possible use cases are covered through the dtype and ndim arguments. The strategy splits the computation of the gradient and Hessian between workers in order to perform a centralized batch Newton-Raphson update on Breslow's partial log-likelihood (to handle tied events it uses Breslow's approximation, unlike lifelines which uses Efron's by default; Efron's, however, is not separable). It uses lifelines' adaptive step size, starting from initial_step_size, to converge faster, and lifelines' safe way of inverting the Hessian. As lifelines standardizes the data by default, we allow the user to do so optionally.
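To make the separability argument concrete, here is a minimal numpy sketch (not fedeca's actual API; function names are hypothetical) of how the building blocks of Breslow's gradient and Hessian decompose into per-worker aggregates that a server can sum before taking a Newton-Raphson step:

```python
import numpy as np

def local_risk_aggregates(X, beta, at_risk_mask):
    """Per-worker pieces of Breslow's gradient/Hessian at one event time.

    Each worker only ships these three aggregates; covariates stay local.
    """
    w = np.exp(X @ beta) * at_risk_mask           # exp(beta^T x_j) over the local risk set
    s0 = w.sum()                                  # sum_j w_j
    s1 = w @ X                                    # sum_j w_j x_j
    s2 = (X * w[:, None]).T @ X                   # sum_j w_j x_j x_j^T
    return s0, s1, s2

def event_grad_hess(x_event, shards, tol=1e-16):
    """Centralized gradient/Hessian contribution of one event.

    `shards` is a list of (s0, s1, s2) tuples, one per worker; summing
    them gives the same result as computing on the pooled data.
    """
    s0 = sum(s[0] for s in shards) + tol          # cap the division, as WebDisco's `tol` does
    s1 = sum(s[1] for s in shards)
    s2 = sum(s[2] for s in shards)
    grad = x_event - s1 / s0                      # gradient contribution of this event
    hess = -(s2 / s0 - np.outer(s1, s1) / s0**2)  # Hessian contribution of this event
    return grad, hess
```

Because the aggregates are plain sums, splitting the risk set across any number of workers leaves the update unchanged, which is exactly why Breslow's approximation is used instead of Efron's.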

Reference

param statistics_computed:

Whether the statistics appearing in each gradient and Hessian have already been computed and passed as attributes to the server.

type statistics_computed:

bool

param initial_step_size:

The initial step size of the Newton-Raphson algorithm at the server side. The following steps will use lifelines heuristics to adapt the step-size. Defaults to 0.9.

type initial_step_size:

float, optional

param tol:

Capping every division to avoid dividing by 0. Defaults to 1e-16.

type tol:

float, optional

param standardize_data:

Whether or not to standardize the data before computing updates. Defaults to True.

type standardize_data:

bool, optional

param penalizer:

Add a regularizer in case of ill-conditioned hessians, which happen quite often with large covariates. Defaults to 0.

type penalizer:

float, optional

param l1_ratio:

When using a penalizer the ratio between L1 and L2 regularization as in sklearn. Defaults to 0.

type l1_ratio:

float, optional
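For reference, the penalizer and l1_ratio combine into the usual elastic-net term of the sklearn convention mentioned above. A minimal sketch (hypothetical helper, not part of fedeca's API):

```python
import numpy as np

def elastic_net_penalty(beta, penalizer=0.0, l1_ratio=0.0):
    """Elastic-net term added to the negative partial log-likelihood.

    l1_ratio=0 gives a pure L2 (ridge) penalty, l1_ratio=1 a pure L1
    (lasso) penalty, following the sklearn convention.
    """
    l1 = np.abs(beta).sum()
    l2 = 0.5 * (beta ** 2).sum()
    return penalizer * (l1_ratio * l1 + (1.0 - l1_ratio) * l2)
```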

aggregate_moments(shared_states)

Compute the global centered moments given the local results.

Parameters:

shared_states (List) – List of results (local_m1, local_m2, n_samples) from training nodes.

Returns:

Global results to be shared with train nodes via shared_state.

Return type:

dict
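As an illustration, the pooled mean and standard deviation can be recovered exactly from per-center first and second moments. A hypothetical sketch of the aggregation (fedeca's actual shared-state keys may differ):

```python
import numpy as np

def aggregate_moments(shared_states):
    """Pool per-center moments into global mean and std.

    Each state carries the local mean (local_m1), the local mean of
    squares (local_m2), and the local sample count.
    """
    n_total = sum(s["n_samples"] for s in shared_states)
    global_m1 = sum(s["local_m1"] * s["n_samples"] for s in shared_states) / n_total
    global_m2 = sum(s["local_m2"] * s["n_samples"] for s in shared_states) / n_total
    global_var = global_m2 - global_m1 ** 2       # E[X^2] - E[X]^2
    return {"mean": global_m1, "std": np.sqrt(global_var), "n_samples": n_total}
```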

build_compute_plan(train_data_nodes, aggregation_node, evaluation_strategy, num_rounds, clean_models)

Build the computation graph of the strategy.

It removes initialization round, which is useless in this case as all models start at 0.

Parameters:
  • train_data_nodes (typing.List[TrainDataNode],) – list of the train organizations

  • aggregation_node (typing.Optional[AggregationNode],) – aggregation node, necessary for centralized strategy, unused otherwise

  • evaluation_strategy (Optional[EvaluationStrategy],) – When and how to compute performance.

  • num_rounds (int,) – The number of rounds to perform.

  • clean_models (bool (default=True),) – Clean the intermediary models on the Substra platform. Set it to False if you want to download or re-use intermediary models. This causes the disk space to fill quickly so should be set to True unless needed. Defaults to True.

perform_evaluation(test_data_nodes, train_data_nodes, round_idx)

Evaluate model on the given test_data_nodes.

Parameters:
  • test_data_nodes (List[TestDataNode],) – test data nodes to intersect with train data nodes to evaluate the model on.

  • train_data_nodes (List[TrainDataNode],) – train data nodes the model has been trained on.

  • round_idx (int,) – round index.

Raises:

NotImplementedError

perform_round(train_data_nodes, aggregation_node, round_idx, clean_models, additional_orgs_permissions=None)

Perform one round of webdisco.

One round of the WebDisco strategy consists in:
  • optionally compute global means and stds for all features if standardize_data is True

  • compute global survival statistics that will be reused at each round

  • compute the building blocks of the gradient and Hessian based on global risk sets

  • perform a Newton-Raphson update on each train data node

Parameters:
  • train_data_nodes (typing.List[TrainDataNode],) – List of the nodes on which to train

  • aggregation_node (AggregationNode) – node without data, used to perform operations on the shared states of the models

  • round_idx (int,) – Round number, it starts at 0.

  • clean_models (bool,) – Clean the intermediary models of this round on the Substra platform. Set it to False if you want to download or re-use intermediary models. This causes the disk space to fill quickly so should be set to True unless needed.

  • additional_orgs_permissions (typing.Optional[set],) – Additional permissions to give to the model outputs after training, in order to test the model on another organization.
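The Newton-Raphson update of the last step can be sketched as follows. The damping heuristic below is a simplified stand-in for lifelines' adaptive step-size scheme, not a reproduction of it:

```python
import numpy as np

def damped_newton_update(beta, grad, hess, step_size=0.9):
    """One damped Newton-Raphson step on a convex objective.

    `grad` and `hess` are the gradient and (positive-definite) Hessian
    of the objective being minimized. The step is shrunk when the
    proposed update is large, crudely mimicking lifelines' adaptive
    step-size behaviour.
    """
    delta = np.linalg.solve(hess, grad)      # Newton direction
    norm = np.linalg.norm(delta)
    if norm > 1.0:                           # crude damping: cap the step length at 1
        step_size = min(step_size, 1.0 / norm)
    return beta - step_size * delta
```

For a quadratic objective a full step (step_size=1.0) jumps straight to the minimizer, which is why the adaptive scheme ramps the step size back up near convergence.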

property name: StrategyName

The name of the strategy.

Returns:

Name of the strategy

Return type:

StrategyName

Bootstrap substra strategy in an efficient fashion.

make_bootstrap_metric_function(metric_functions)

Take the metric_functions dict, and bootstrap each metric.

Parameters:

metric_functions (dict) – The metric functions to hook.

Return type:

dict
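A plausible shape for this hook (hypothetical sketch; fedeca's actual signatures may differ): each wrapped metric maps over a list of per-bootstrap outputs and returns the list of individual metric values:

```python
import numpy as np

def make_bootstrap_metric_function(metric_functions):
    """Wrap each metric so it applies once per bootstrap.

    The hooked metric receives a list of per-bootstrap argument tuples
    and returns the list of metric values, one per bootstrap.
    """
    def hook(metric):
        def bootstrapped_metric(list_of_outputs):
            return [metric(*outputs) for outputs in list_of_outputs]
        return bootstrapped_metric
    return {name: hook(metric) for name, metric in metric_functions.items()}
```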

make_bootstrap_strategy(strategy, n_bootstrap=None, bootstrap_seeds=None, bootstrap_function=None, client_specific_kwargs=None)

Bootstrap a substrafl strategy without impacting the number of compute tasks.

In order to reduce the Substra bottleneck when bootstrapping a strategy, we go over the strategy's compute plan and modify each local atomic task to execute n_bootstrap times on bootstrapped data. Each modified task returns a list of the n_bootstrap original outputs, one per bootstrap. Each aggregation task is then modified to aggregate the n_bootstrap outputs independently. This code heavily uses code patterns invented by Arthur Pignet.

Parameters:
  • strategy (Strategy) – The strategy to bootstrap.

  • n_bootstrap (Union[int, None]) – Number of bootstraps to be performed. If None, len(bootstrap_seeds) is used instead. If bootstrap_seeds is given, those seeds are used for the generation; otherwise seeds are generated randomly.

  • bootstrap_seeds (Union[list[int], None]) – The list of seeds used for bootstrapping random states. If None, n_bootstrap seeds are generated randomly; if both are given, bootstrap_seeds always takes precedence.

  • bootstrap_function (Union[Callable, None]) – A function with signature f(data, seed) that returns a bootstrapped version of the data. If None, use the BootstrapMixin function. Note that this can be used to provide splits/cross-validation capabilities as well where seed would be the fold number in a flattened list of folds.

  • client_specific_kwargs (Union[None, list[dict]]) – A list of dictionaries containing the kwargs to be passed to the algos if they are different for each bootstrap. It is useful to chain bootstrapped compute plans for instance.

Returns:

The resulting efficiently bootstrapped strategy

Return type:

Strategy
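For illustration, a valid f(data, seed) candidate and the seed-precedence logic described above might look like this (hypothetical helpers, not part of fedeca):

```python
import numpy as np
import pandas as pd

def bootstrap_function(data, seed):
    """Resample rows with replacement; a valid `f(data, seed)` candidate."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(data), size=len(data))
    return data.iloc[idx].reset_index(drop=True)

def resolve_seeds(n_bootstrap=None, bootstrap_seeds=None):
    """Seed precedence as documented above: bootstrap_seeds always wins."""
    if bootstrap_seeds is not None:
        return list(bootstrap_seeds)
    rng = np.random.default_rng()
    return rng.integers(0, 2**32 - 1, size=n_bootstrap).tolist()
```

Because the function is seeded, each bootstrap task reproduces the same resample deterministically, which is what lets the same seed double as a fold index for cross-validation-style splits.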

Webdisco utils.

compute_summary_function(final_params, variance_matrix, alpha=0.05)

Compute summary function.

Parameters:
  • final_params (np.ndarray) – The estimated values of the Cox model coefficients.

  • variance_matrix (np.ndarray) – Computed variance matrix whether using robust estimation or not.

  • alpha (float, optional) – The significance level used for the confidence intervals, by default 0.05.

Returns:

Summary of IPTW analysis as in lifelines.

Return type:

pd.DataFrame
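The summary computation follows the usual lifelines conventions. A sketch (column names are illustrative; fedeca's exact layout may differ):

```python
import numpy as np
import pandas as pd
from scipy import stats

def compute_summary(final_params, variance_matrix, alpha=0.05):
    """Lifelines-style summary table: standard errors, z, p, CI bounds."""
    se = np.sqrt(np.diag(variance_matrix))          # standard errors from the variance matrix
    z = final_params / se                           # Wald z-statistics
    p = 2.0 * stats.norm.sf(np.abs(z))              # two-sided p-values
    q = stats.norm.ppf(1.0 - alpha / 2.0)           # normal quantile for the CI
    return pd.DataFrame({
        "coef": final_params,
        "se(coef)": se,
        "z": z,
        "p": p,
        f"coef lower {100 * (1 - alpha):g}%": final_params - q * se,
        f"coef upper {100 * (1 - alpha):g}%": final_params + q * se,
    })
```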

get_final_cox_model_function(client, compute_plan_key, num_rounds, standardize_data, duration_col, event_col, simu_mode=False, robust=False)

Retrieve first converged Cox model and corresponding hessian.

In case of bootstrapping retrieves the first converged Cox models for each seed.

Parameters:
  • client (Client) – The substrafl Client that registered the CP.

  • compute_plan_key (Union[str, Algo]) – The key of the CP.

  • num_rounds (int) – The number of rounds of the CP.

  • standardize_data (bool, optional) – Whether or not the data was standardized.

  • duration_col (str) – The name of the duration column.

  • event_col (str) – The name of the event column.

  • simu_mode (bool) – Whether or not we are using simu mode. Note this could be inferred from the Client.

  • robust (bool, optional) – Whether to retrieve the global statistics needed for robust variance estimation.

Returns:

Returns hessian, log-likelihood, Cox model’s weights, global moments

Return type:

tuple

get_last_algo_from_round_count(num_rounds, standardize_data=True, simu_mode=False)

Get true number of rounds.

Parameters:
  • num_rounds (int) – The nominal number of rounds of the compute plan.

  • standardize_data (bool, optional) – Whether standardization rounds were included, by default True.

  • simu_mode (bool) – Whether or not we are in simu mode.

Returns:

The true number of rounds performed.

Return type:

int