Utilities
Module containing utilities to speed up pentesting.
pepr.utilities.assign_record_ids_to_target_models(target_knowledge_size, number_target_models, target_training_set_size, offset=0)
Create training datasets (index sets) for the target models.
Each target training dataset contains exactly 50 percent of the given data, and each record appears in half of the datasets.
Parameters: - target_knowledge_size (int) – Number of data samples to be distributed to the target data sets.
- number_target_models (int) – Number of target models for which datasets should be created.
- target_training_set_size (int) – Number of records each target training set should contain.
- offset (int) – If offset is zero, the lowest index in the resulting datasets is zero. If the offset is o, all indices i are shifted by o: i + o.
Returns: Index array that describes which record is used to train which model.
Return type: numpy.ndarray
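The balancing property described above (every dataset holds half of the records, every record appears in half of the datasets) can be illustrated with a small NumPy sketch. This is not pepr's implementation: the helper name `sketch_balanced_assignment`, the pairwise-complement construction, and the returned shape (one row of record indices per model) are assumptions made only for illustration.

```python
import numpy as np


def sketch_balanced_assignment(knowledge_size, number_models, offset=0, seed=0):
    """Hypothetical helper, not part of pepr.

    Builds index sets in complementary pairs: one random permutation is
    split in half, the first model gets one half and the second model the
    other. Each set then holds exactly 50 percent of the records and each
    record lands in exactly half of the sets.
    """
    if knowledge_size % 2 or number_models % 2:
        raise ValueError("this sketch needs an even record and model count")
    rng = np.random.default_rng(seed)
    half = knowledge_size // 2
    rows = []
    for _ in range(number_models // 2):
        # Shift all indices by the offset, as the offset parameter describes.
        perm = rng.permutation(knowledge_size) + offset
        rows.append(np.sort(perm[:half]))
        rows.append(np.sort(perm[half:]))
    # Assumed layout: row m lists the record indices used to train model m.
    return np.stack(rows)


index_sets = sketch_balanced_assignment(knowledge_size=10, number_models=4, offset=100)
```

With `offset=100` the lowest index in the result is 100, which matches the shift i + o described for the offset parameter.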
pepr.utilities.filter_outlier(data, labels, filter_pars, save_path=None, load_pars=None)
Filter out potentially vulnerable samples.
The general idea for finding potentially vulnerable records is similar to pepr.privacy.gmia.DirectGmia().
Steps:
- Create mapping of records to reference models.
- Train the reference models.
- Generate intermediate models.
- Extract high-level features.
- Compute pairwise distances between high-level features.
- Determine outlier records.
- Remove outlier records from the dataset.
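The last three steps above can be sketched as follows. This is a minimal illustration, not pepr's code: it assumes, following Long et al., that two records are neighbors when their distance is below `distance_neighbor_threshold` and that a record is an outlier when it has fewer than `number_neighbor_threshold` neighbors. The helper names are hypothetical, and the cosine distance is computed directly with NumPy instead of sklearn.metrics.pairwise_distances to keep the example dependency-light.

```python
import numpy as np


def cosine_distance_matrix(features):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / norms
    return 1.0 - unit @ unit.T


def sketch_filter_outliers(features, distance_neighbor_threshold,
                           number_neighbor_threshold):
    """Hypothetical helper: return indices of records that are not outliers."""
    distances = cosine_distance_matrix(features)
    # A neighbor is any other record closer than the distance threshold.
    close = distances < distance_neighbor_threshold
    np.fill_diagonal(close, False)  # a record is not its own neighbor
    neighbor_counts = close.sum(axis=1)
    # Records with too few neighbors are treated as outliers and removed.
    is_outlier = neighbor_counts < number_neighbor_threshold
    return np.flatnonzero(~is_outlier)


# Toy high-level features: three nearly parallel vectors and one isolated one.
features = np.array([[1.0, 0.0], [0.99, 0.1], [0.98, 0.05], [0.0, 1.0]])
kept = sketch_filter_outliers(features, distance_neighbor_threshold=0.1,
                              number_neighbor_threshold=1)
```

In this toy example the fourth record has no neighbor within a cosine distance of 0.1, so it is the only one dropped.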
Parameters: - data (numpy.ndarray) – Dataset with all training samples used in the given pentesting setting.
- labels (numpy.ndarray) – Array of all labels used in the given pentesting setting.
- filter_pars (dict) – Dictionary containing the needed filter parameters:
- number_classes (int): Number of different classes in the dataset.
- number_reference_models (int): Number of reference models to be trained.
- reference_training_set_size (int): Size of the training set for each reference model.
- create_compile_model (function): Function that returns a compiled TensorFlow model (in gmia this is typically identical to the target model) used in the training of the reference models.
- reference_epochs (int): Number of training epochs of the reference models.
- reference_batch_size (int): Batch size used in the training of the reference models.
- hlf_metric (str): Metric (typically ‘cosine’) used for the distance calculations in the high-level feature space. For valid metrics see documentation of sklearn.metrics.pairwise_distances.
- hlf_layer_number (int): If value is n, the n-th layer of the model returned by create_compile_model is used to extract the high-level feature vectors.
- distance_neighbor_threshold (float): If distance is smaller than the neighbor threshold the record is selected as target record.
- number_neighbor_threshold (float): If number of neighbors of a record is smaller than this, it is considered a vulnerable example.
- number_outlier (int): If set, the selection algorithm performs up to max_search_rounds search rounds to find a distance_neighbor_threshold that yields number_outlier outlier records. These records are the most vulnerable with respect to the selection criterion.
- max_search_rounds (int): If number_outlier is given, at most max_search_rounds search rounds are performed to find number_outlier potentially vulnerable records.
- save_path (str) – If a path is given, the following (partly computationally expensive) intermediate results are saved to disk:
- The mapping of training-records to reference models
- The trained reference models
- The high-level features
- The matrix containing all pairwise distances between the high-level features.
- load_pars (dict) – If this dictionary is given, the following precomputed intermediate results can be loaded from disk:
- records_per_reference_model (str) : Path to the mapping.
- reference_models (list) : List of paths to the reference models.
- pairwise_distance_hlf_<hlf_metric> (str) : Path to the pairwise distance matrix between the high-level features using a hlf_metric (e.g. cosine).
Returns: - numpy.ndarray – Dataset indices without outliers.
- float – Calculated neighbor distance threshold of the result.
- float – Calculated neighbor count threshold of the result.
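A filter_pars dictionary for this function might look like the sketch below. The keys are the ones documented above; every value, the toy model architecture, and the input width of 32 are assumptions chosen only for illustration, and the layer-index convention for hlf_layer_number should be checked against the library. The call itself is left commented out because it needs pepr, TensorFlow, and a real dataset.

```python
def create_compile_model():
    """Hypothetical model factory; returns a compiled TensorFlow model."""
    import tensorflow as tf  # imported lazily so the dict can be built without TF

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),  # assumed feature width
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# All values below are illustrative assumptions, not recommended settings.
filter_pars = {
    "number_classes": 10,
    "number_reference_models": 4,
    "reference_training_set_size": 1000,
    "create_compile_model": create_compile_model,
    "reference_epochs": 5,
    "reference_batch_size": 32,
    "hlf_metric": "cosine",
    "hlf_layer_number": 0,  # which layer this selects is an assumption
    "distance_neighbor_threshold": 0.2,
    "number_neighbor_threshold": 5,
    "number_outlier": 50,
    "max_search_rounds": 100,
}

# Usage, following the documented signature and return order:
# from pepr import utilities
# kept_indices, distance_threshold, count_threshold = utilities.filter_outlier(
#     data, labels, filter_pars)
```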
References
Partial implementation of the direct gmia from Long, Y., Bindschaedler, V., Wang, L., Bu, D., Wang, X., Tang, H., Gunter, C. A., and Chen, K. (2018). Understanding membership inferences on well-generalized learning models. arXiv preprint arXiv:1802.04889.