raiss.imputation_R2

Module for imputation performance evaluation

Functions to generate a test dataset with missing SNPs, impute the missing SNPs and compare the imputed values with the original dataset.

In other words, this module provides functions to empirically measure imputation R2 on an independent dataset.

Functions

generated_test_data(zscore[, N_to_mask, ...])

Mask N_to_mask SNPs in the dataframe zscore and return the dataframe with missing SNPs.
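
The masking step described above can be illustrated with a minimal sketch. This is not the raiss implementation: the helper name `mask_random_snps`, the `seed` argument, the "Z" column and the use of SNP ids as the index are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def mask_random_snps(zscore, n_to_mask, seed=0):
    """Hypothetical helper (not part of raiss): drop n_to_mask random SNPs.

    Returns the dataframe without the masked rows and the ids of the masked SNPs.
    """
    rng = np.random.default_rng(seed)
    # Draw SNP ids without replacement so that no SNP is masked twice.
    masked_ids = rng.choice(zscore.index.to_numpy(), size=n_to_mask, replace=False)
    return zscore.drop(index=masked_ids), pd.Index(masked_ids)


# Toy Z-score table; the layout (SNP ids as index, one "Z" column) is assumed.
zscore = pd.DataFrame(
    {"Z": [0.5, -1.2, 2.3, 0.1, -0.7]},
    index=["rs1", "rs2", "rs3", "rs4", "rs5"],
)
masked_df, masked_ids = mask_random_snps(zscore, n_to_mask=2)
```

Keeping the masked ids alongside the reduced dataframe makes it possible to compare imputed and true values for exactly those SNPs later on.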

grid_search(zscore_folder, masked_folder, ...)

Compute the imputation performance for several eigen ratios. The procedure is the following:

* Mask N_to_mask SNPs in the input dataset (chosen at random)
* For each eigen ratio in eigen_ratio_grid:
  * Impute SNPs
  * Compute performance on masked SNPs (R-square and mean absolute error)

:param zscore_folder: path toward the input data folder
:type zscore_folder: str
:param masked_folder: path toward the folder to save the dataframe with masked SNPs
:type masked_folder: str
:param output_folder: path toward the folder to save the imputed dataframe
:type output_folder: str
:param ref_folder: path toward the reference panel folder
:type ref_folder: str
:param ld_folder: path toward the linkage disequilibrium matrices
:type ld_folder: str
:param gwas: GWAS identifier in the following format: 'CONSORTIA_TRAIT'
:type gwas: str
:param chrom: chromosome in the format "chr.."
:type chrom: str
:param eigen_ratio_grid: list of eigen_ratio values to test (must be between 0 and 1)
:type eigen_ratio_grid: list
:param ld_threshold_grid: list of minimum-ld values to test (must be > 0)
:type ld_threshold_grid: list
:param window_size: imputation parameter (see raiss command line documentation)
:param buffer_size: imputation parameter (see raiss command line documentation)
:param l2_regularization: imputation parameter (see raiss command line documentation)
:param R2_threshold: imputation parameter (see raiss command line documentation)
:param N_to_mask: number of SNPs masked in the initial dataset to compute the correlation between true value and imputed value
:type N_to_mask: int
:param ref_panel_suffix: reference panel suffix
:type ref_panel_suffix: str
:param ld_type: type of file where the LD is stored; should be 'plink' or 'scipy'
:type ld_type: str
:param stratifying_vector: a continuous vector containing one value per SNP, used to stratify the sampling of SNPs to mask
:type stratifying_vector: pd.Series
:param stratifying_bins: a vector specifying the boundary values used to form the bins
:type stratifying_bins: list
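
The grid-search procedure (mask, then impute and score once per eigen ratio) can be sketched as follows. This is only an illustration of the loop, not the raiss code: `grid_search_sketch`, `toy_impute`, the "pos" and "Z" columns, and the choice of the squared Pearson correlation as R-square are all assumptions. In particular, `toy_impute` is a stand-in that interpolates between neighbouring SNPs and shrinks the result; it does not use LD and is unrelated to the RAISS imputation model.

```python
import numpy as np
import pandas as pd


def toy_impute(observed, masked_positions, eigen_ratio):
    """Stand-in imputation (NOT RAISS): interpolate Z over position, then shrink."""
    observed = observed.sort_values("pos")
    interpolated = np.interp(
        masked_positions.to_numpy(),
        observed["pos"].to_numpy(),
        observed["Z"].to_numpy(),
    )
    return pd.Series((1.0 - eigen_ratio) * interpolated, index=masked_positions.index)


def grid_search_sketch(zscore, masked_ids, eigen_ratio_grid, impute):
    """Hypothetical driver: score the `impute` callable for each eigen ratio."""
    observed = zscore.drop(index=masked_ids)      # dataset with masked SNPs removed
    true_z = zscore.loc[masked_ids, "Z"]          # held-out true values
    masked_positions = zscore.loc[masked_ids, "pos"]
    rows = []
    for eigen_ratio in eigen_ratio_grid:
        imputed_z = impute(observed, masked_positions, eigen_ratio)
        r = np.corrcoef(true_z, imputed_z)[0, 1]
        rows.append({
            "eigen_ratio": eigen_ratio,
            "R2": r ** 2,  # assumption: R-square taken as squared Pearson correlation
            "mean_absolute_error": float(np.mean(np.abs(true_z - imputed_z))),
        })
    return pd.DataFrame(rows).set_index("eigen_ratio")


# Toy data in which Z is exactly linear in position.
zscore = pd.DataFrame(
    {"pos": [1, 2, 3, 4, 5, 6, 7, 8],
     "Z": [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]},
    index=[f"rs{i}" for i in range(1, 9)],
)
results = grid_search_sketch(zscore, ["rs2", "rs4", "rs6"], [0.0, 0.1, 0.5], toy_impute)
```

In this toy setting the correlation is unaffected by the shrinkage while the mean absolute error grows with it, which is one reason to report both metrics rather than only one.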

imputation_performance(zscore_initial, ...)

Compute imputation performance on real data. The performance is computed as the correlation between imputed and real values and as the mean absolute deviation between imputed and real values.

:param zscore_initial:
:type zscore_initial: pandas DataFrame
:param zscore_imputed:
:type zscore_imputed: pandas DataFrame
:param masked: ids of the SNPs which have been masked
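
The two metrics named above can be computed on the masked SNPs as in the following sketch. The function name `performance_on_masked`, the "Z" column, the SNP-id index and the returned dictionary are assumptions for illustration; the actual return type of imputation_performance is not specified here.

```python
import numpy as np
import pandas as pd


def performance_on_masked(zscore_initial, zscore_imputed, masked):
    """Hypothetical helper: correlation and mean absolute deviation on masked SNPs."""
    # Select the same SNP ids, in the same order, from both tables.
    true_z = zscore_initial.loc[masked, "Z"]
    imputed_z = zscore_imputed.loc[masked, "Z"]
    correlation = float(np.corrcoef(true_z, imputed_z)[0, 1])
    mean_abs_dev = float(np.mean(np.abs(true_z - imputed_z)))
    return {"correlation": correlation, "mean_absolute_deviation": mean_abs_dev}


ids = ["rs1", "rs2", "rs3", "rs4"]
zscore_initial = pd.DataFrame({"Z": [1.0, 2.0, 3.0, 4.0]}, index=ids)
# Imputed values offset by a constant 0.5 from the true ones.
zscore_imputed = pd.DataFrame({"Z": [1.5, 2.5, 3.5, 4.5]}, index=ids)

perf = performance_on_masked(zscore_initial, zscore_imputed, ["rs1", "rs2", "rs3"])
```

A constant offset leaves the correlation at 1 but gives a mean absolute deviation of 0.5, so the two numbers capture different kinds of imputation error.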

z_amplitude_effect(zscore_folder, ...[, ...])

Compute the imputation performance on SNPs with different amplitudes. The procedure is the following:

* Mask SNPs in the input dataset as a function of the amplitude
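
One way to mask SNPs as a function of amplitude is to bin them by |Z| and sample within each bin. The sketch below assumes this reading; `mask_by_amplitude`, the bin boundaries, `n_per_bin` and the "Z" column are hypothetical and not taken from raiss.

```python
import numpy as np
import pandas as pd


def mask_by_amplitude(zscore, bins, n_per_bin, seed=0):
    """Hypothetical helper: pick up to n_per_bin SNPs to mask in each |Z| bin."""
    # Assign every SNP to an amplitude interval based on its absolute Z-score.
    amplitude_bin = pd.cut(zscore["Z"].abs(), bins=bins, include_lowest=True)
    masked = {}
    for interval, group in zscore.groupby(amplitude_bin, observed=True):
        n = min(n_per_bin, len(group))  # a bin may hold fewer SNPs than requested
        masked[interval] = group.sample(n=n, random_state=seed).index
    return masked


zscore = pd.DataFrame(
    {"Z": [0.2, -0.8, 1.5, -1.9, 2.5, 3.1, -4.0, 0.4]},
    index=[f"rs{i}" for i in range(1, 9)],
)
masked_by_bin = mask_by_amplitude(zscore, bins=[0, 1, 2, np.inf], n_per_bin=2)
```

Performance can then be computed separately for each bin, which shows whether imputation quality changes between weak and strong signals.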