Welcome to the Robust and Accurate Imputation from Summary Statistics (RAISS) documentation!
Table of Contents
- Welcome to the Robust and Accurate Imputation from Summary Statistics (RAISS) documentation!
- What is RAISS ?
- Dependencies
- Installation
- Precomputation of LD-correlation
- Input format:
- Launching imputation on one chromosome
- Output
- Optimizing RAISS parameters for your data
- Generating report and sanity ckecks
- Command Line Usage
- Indices and tables
What is RAISS ?
RAISS is a python package to impute missing SNP summary statistics from neighboring SNPs in linkage desiquilibrium.
The statistical model used to make the imputation is described in [], [] and []. The implementation and performances of RAISS are described in [].
The imputation execution time is optimized by precomputing Linkage desiquilibrium between SNPs.
Dependencies
If you need to compute LD matrices from your reference panel, RAISS requires plink version 1.9 for the precomputation of LD: https://www.cog-genomics.org/plink2
Installation
pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/raiss.git
Alternatively, RAISS is available as a docker container:
Precomputation of LD-correlation
The imputation is based the Linkage desiquilibrium between SNPs.
To save computation time, the LD is computed before imputation and saved as tabular format. To limit the number of SNP pairs, the LD is computed between pairs of SNPs in a approximately LD-independent regions. Regions defined by [] or computed using bigsnpr ([]) are provided in the package data folder for all several genome build and ancestries.
To compute the LD you need to specify a reference panel splitted by chromosomes (bed, fam and bim formats of plink, see PLINK formats )
# path to the Region file
region_berisa = "/mnt/atlas/PCMA/WKD_Hanna/cleaned_jass_input/Region_LD.csv"
# Path to the reference panel
ref_folder="/mnt/atlas/PCMA/1._DATA/ImpG_refpanel"
# path to the folder to store the results
ld_folder_out = "/mnt/atlas/PCMA/WKD_Hanna/impute_for_jass/berisa_ld_block"
raiss.LD.generate_genome_matrices(, ...)
Since 2021, LD can also be specified as scipy sparse matrix (.npz), the index must be provided in a separate file (one id by line)
Input format:
GWAS results files must be provided in the tabular format by chromosome (tab separated) all in the same folder with the following columns with the same header:
rsID |
pos |
A0 |
A1 |
Z |
---|---|---|---|---|
rs6548219 |
30762 |
A |
G |
-1.133 |
The files names must follow this pattern “z_{GWAS_TAG}_chr{1..22}.txt” This format can be obtained with the JASS PreProcessing package.
Launching imputation on one chromosome
RAISS has an interface with the command line.
Here is an example for imputing the file z_FINNS_IL-6_chr22.txt
raiss --chrom chr22 --gwas FINNS_IL-6 --ld-folder ${path-to-ld-matrices} --ref-folder ${path-to-the-reference-panel} --zscore-folder ${path-to-input data} --output-folder ${folder to store imputed files} --eigen-threshold 0.000001
see details and all parameters in Command Line Usage bellow and note that the eigen_threshold parameter should be adapted to your reference panel (see the Optimizing RAISS parameters for your data section)
If you have access to a cluster, an efficient way to use RAISS is to launch the imputation of each chromosome on a separate cluster node. The script launch_imputation_all_gwas.sh contain an example of raiss usage with a SLURM scheduler.
Output
The raiss package outputs imputed GWAS files in the tabular format:
rsID |
pos |
A0 |
A1 |
Z |
Var |
ld_score |
imputation_R2 |
---|---|---|---|---|---|---|---|
rs3802985 |
198510 |
T |
C |
0.334 |
-1.0 |
inf |
2.0 |
rs111876722 |
201922 |
C |
T |
0.297 |
0.09 |
5.412 |
0.91578 |
Variance is set to -1 for variants present in the input dataset
Optimizing RAISS parameters for your data
Raiss package contains a subcommand raiss performance-grid-search to assess its performance on your data and fine tune RAISS parameter.
Test procedure : 1. Mask N SNPs on a chromosome 2. Impute masked files for different values of the –eigen-threshold and the –minimum-ld parameters 3. Compute correlation (and other statistics) between genotype Z-values to imputed Z-values
To perform this test follow this procedure :
Create a folder to store masked z-score files
Create a folder to store z-score files imputed with different parameter
Adapt the following command snippet to apply the command to your data:
Here the command example run for the harmonized GWAS files z_FINNS-IL-6.txt for eigen_threshold values in [0.000001, 0.001,0.1] and ld-threshold in [0,10,20]. Note that if you allocate more than 1 cpu to the job, different parameter combination will be tested in parallel.
raiss --ld-folder ${path-to-ld-matrices} --ref-folder ${path-to-the-reference-panel} --gwas FINNS_IL-6 --chrom chr22 --ld-type ${'scipy' or 'plink'} performance-grid-search --harmonized-folder ${path-to-harmonized data} --masked-folder ${folder to store partially masked input data} --imputed-folder $ --output-path ${path-to-save the performance report} --eigen-ratio-grid '[0.000001, 0.001,0.1]' --ld-threshold-grid '[0,10,20]' --n-cpu ${number of cpu to be allocated to the job}
The file Perf_GWAS_TAG (Perf_FINNS_IL-6 here) ressembles the following output:
eigen_ratio |
min_ld |
N_SNP |
fraction_imputed |
cor |
mean_absolute_error |
median_absolute_error |
min_absolute_error |
max_absolute_error |
SNP_max_error |
---|---|---|---|---|---|---|---|---|---|
0.1 |
0 |
2970 |
0.594 |
0.978 |
0.277 |
0.171 |
1.47e-05 |
6.92 |
rs5756504 |
0.1 |
2 |
2970 |
0.594 |
0.978 |
0.277 |
0.171 |
1.47e-05 |
6.92 |
rs5756504 |
0.1 |
5 |
2840 |
0.568 |
0.978 |
0.277 |
0.169 |
1.47e-05 |
6.92 |
rs5756504 |
0.1 |
7 |
2550 |
0.51 |
0.978 |
0.275 |
0.164 |
0.000285 |
6.92 |
rs5756504 |
0.15 |
0 |
2470 |
0.494 |
0.976 |
0.282 |
0.172 |
2.43e-05 |
4.22 |
rs59411032 |
0.15 |
2 |
2470 |
0.494 |
0.976 |
0.282 |
0.172 |
2.43e-05 |
4.22 |
rs59411032 |
0.15 |
5 |
2450 |
0.49 |
0.976 |
0.281 |
0.172 |
2.43e-05 |
4.22 |
rs59411032 |
0.15 |
7 |
2320 |
0.465 |
0.976 |
0.282 |
0.172 |
0.00044 |
4.22 |
rs59411032 |
0.2 |
0 |
2040 |
0.409 |
0.973 |
0.291 |
0.168 |
6.61e-05 |
4.37 |
rs5752798 |
0.2 |
2 |
2040 |
0.409 |
0.973 |
0.291 |
0.168 |
6.61e-05 |
4.37 |
rs5752798 |
0.2 |
5 |
2040 |
0.408 |
0.973 |
0.291 |
0.168 |
6.61e-05 |
4.37 |
rs5752798 |
0.2 |
7 |
2020 |
0.403 |
0.973 |
0.291 |
0.168 |
6.61e-05 |
4.37 |
rs5752798 |
eigen_ratio : eigen ratio parameter that was tested.
min_ld : eigen ratio parameter that was tested.
N_SNP : number of the masked SNPs that were successfully imputed (i.e. not filtered out by the R2 criteria and/or min_ld criteria)
fraction_imputed : fraction of the masked SNPs that were successfully imputed (N_SNP / total_number_of_masked_SNP)
cor : the correlation between imputed and genotyped Z-scores.
mean_absolute_error : \(\mathbb{E}|Z_{imputed} - Z_{masked}|\)
median_absolute_error : \(median|Z_{imputed} - Z_{masked}|\)
min_absolute_error : \(min|Z_{imputed} - Z_{masked}|\)
max_absolute_error : \(max|Z_{imputed} - Z_{masked}|\)
SNP_max_error : \(argmax|Z_{imputed} - Z_{masked}|\)
To pick the best parameters, we recommend to search for a compromise between low imputation error and an high fraction of masked SNPs imputed (a trade-off between fraction_imputed and mean_absolute_error).
The optimal eigen_ratio and min_ld can vary depending on the density of your reference panel and input data. Hence, we recommend to run a grid search to pick the best parameter for your data. However, so far, we never observed a difference of performance from one chromosome to another. We suggest testing on the chr22 for computational efficiency. Note that this performance report evaluate RAISS on limited number of SNPs. Hence, we recommend to complement this report with our sanity check report_snps that check if the number of new genome wide significant hits is coherent with the number of imputes SNPs.
Generating report and sanity ckecks
Once you have imputed your dataset, you might want to know how many SNPs have been imputed and with which imputation quality. More importantly, you might want to make sure that the distribution of imputed signal is coherent with what is expected.
Indeed, if the population of your reference panel is too different from the one of your study, the Linkage desequilibrium matrices derived from your panel might not be adapted and might generate false positive
To run a sanity check on your data use the raiss sanity-check command
raiss sanity-check --trait z_{trait} --harmonized-folder {folder_input_data_files} --imputed-folder {folder_imputed_files}
Here is a sanity check log example :
Number of SNPs
in harmonized file: 7245111
in imputed file: 10665361
Proportion imputed : 0.32068769167775946
Number of SNPs by level of significance
harmonized harmonized_by_1Mbregion imputed imputed_by_1MBregion imputed_proportion imputed_hit_OR
5.00e-02-1.00e+00 6506492 16 9628715 14 0.324262 0.932818
5.00e-06-5.00e-02 738573 2682 1036583 2684 0.287493 0.999821
5.00e-08-5.00e-06 43 17 60 20 0.283333 1.080485
0.00e+00-5.00e-08 3 1 3 1 0.000000 0.999448
Number of imputed SNPs by level of imputation quality
(0.8, 0.9] 2086501
(0.6, 0.8] 1333749
(0.0, 0.6] 0
(0.9, 0.95] 0
(0.95, 1.0] 0
The first block of the log gives you information about total number of SNPs contained in the harmonized file and imputed file.
The second block reports the number of SNPs by significance strata in the harmonized file and imputed file. the _by_1MBregion columns reports the number of 1Mb region reaching significance (minimal p-value of snps contained in the region). This provide you with a rough estimate of the number of LD independent hits before and after imputation. the imputed_hit_OR compares the imputed_by_1MBregion and harmonized_by_1Mbregion and reports if the number of new hit in imputed data is reasonable knowing the increased coverage.
The third block provides the number of imputed SNPs by imputation quality strata.
Command Line Usage
raiss command launch imputation for one trait on one chromosome
usage: raiss [-h] [--chrom CHROM] [--gwas GWAS] [--ref-folder REF_FOLDER]
[--ld-folder LD_FOLDER] [--zscore-folder ZSCORE_FOLDER]
[--output-folder OUTPUT_FOLDER] [--window-size WINDOW_SIZE]
[--buffer-size BUFFER_SIZE]
[--l2-regularization L2_REGULARIZATION]
[--eigen-threshold EIGEN_THRESHOLD] [--R2-threshold R2_THRESHOLD]
[--ld-type LD_TYPE] [--ref-panel-prefix REF_PANEL_PREFIX]
[--ref-panel-suffix REF_PANEL_SUFFIX] [--minimum-ld MINIMUM_LD]
{sanity-check,performance-grid-search} ...
Named Arguments
- --chrom
chromosome to impute to the chrd+ format
- --gwas
GWAS to impute to the consortia_trait format
- --ref-folder
reference panel location (used to determine which snp to impute)
- --ld-folder
Location LD correlation matrices
- --zscore-folder
Location of the zscore files of the gwases to impute
- --output-folder
Location of the impute zscore files
- --window-size
Size of the non overlapping window
Default: 500000
- --buffer-size
Size of the buffer around the imputation window
Default: 125000
- --l2-regularization
Size of the small value added to the diagonal of the covariance matrix before inversion
Default: 0.1
- --eigen-threshold
threshold under which eigen vectors are removed for the computation of the pseudo inverse
Default: 0.1
- --R2-threshold
R square (imputation quality) threshold bellow which SNPs are filtered from the output
Default: 0.6
- --ld-type
‘Two options are available : ‘plink’ or ‘scipy’. Ld can be supplied as plink command –ld-snp-list output files (see raiss.ld_matrix.launch_plink_ld to compute these data using plink) or as a couple of a scipy sparse matrix (.npz )and an .csv containing SNPs index
Default: “plink”
- --ref-panel-prefix
prefix for the reference panel files
Default: “”
- --ref-panel-suffix
suffix for the reference panel files
Default: “.bim”
- --minimum-ld
this parameter ensure that their is enough typed SNPs around the imputed loci to perform a high accuracy imputation
Default: 4
Sub-commands
sanity-check
count imputed SNPs and significant hit before and after imputation
raiss sanity-check [-h] --trait TRAIT --harmonized-folder HARMONIZED_FOLDER
--imputed-folder IMPUTED_FOLDER [--output-path OUTPUT_PATH]
[--chr-list CHR_LIST]
Named Arguments
- --trait
trait on which to run the sanity check
- --harmonized-folder
folder containing data before imputation
- --imputed-folder
folder containing imputed files
- --output-path
path+prefix of the report file, the file will be suffixed by the trait name and .txt
Default: “./raiss_report”
- --chr-list
Default: “range(1,23)”
performance-grid-search
assess RAISS performances for a range of parameter by masking SNPs, re-imputing them and comparing imputed value to initial ones
raiss performance-grid-search [-h] --harmonized-folder HARMONIZED_FOLDER
--masked-folder MASKED_FOLDER --imputed-folder
IMPUTED_FOLDER --output-path OUTPUT_PATH
--eigen-ratio-grid EIGEN_RATIO_GRID
[--ld-threshold-grid LD_THRESHOLD_GRID]
[--N-to-mask N_TO_MASK] [--n-cpu N_CPU]
Named Arguments
- --harmonized-folder
folder containing data before imputation
- --masked-folder
folder to store zscore with masked SNPs (raiss performance-grid-search will generate them)
- --imputed-folder
folder to store imputed files
- --output-path
path+prefix of the report file, the file will be suffixed by the trait name and .txt
Default: “./raiss_report”
- --eigen-ratio-grid
main parameter to tune. Adjust the regularization when computing the pseudo inverse of LD matrices
Default: “[0.5, 0.1, 0.01]”
- --ld-threshold-grid
optional parameter to tune. raising the threshold will augment the number neighboring of SNPs required for the imputation to be consider valid (i.e. missing SNPs with only few neighbor won’t be imputed). Empirically this parameter seems to impact more the maximum error than the mean/median error
Default: “[0,4, 10,20]”
- --N-to-mask
Number of SNPs to mask. Should only be a relatively small fraction (e.g. ~1/10)of the number of your initial data
Default: 5000
- --n-cpu
Number of cpu to use to run imputation in Parallel for each parameter value
Default: “unspecified”
Indices and tables
Imputation launcher |
|
Function set to compute LD correlation from a reference panel in predefined Region |
|
This module contain the statistical library for imputation. |
|
implement the imputation window is sliding along the genome: |
|
Module to filter SNPs on imputation quality and format output for JASS |
|
Module for imputation performance evaluation |