.. RAISS documentation master file, created by
   sphinx-quickstart on Mon Aug 20 16:17:59 2018.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to the Robust and Accurate Imputation from Summary Statistics (RAISS) documentation!
============================================================================================

.. toctree::
   :caption: Table of Contents
   :maxdepth: 2

   index

What is RAISS ?
===============

RAISS is a python package to impute missing SNP summary statistics from
neighboring SNPs in linkage desiquilibrium.

The statistical model used to make the imputation is described in :cite:`Pasaniuc2014`, :cite:`lee2013dist` and :cite:`Julienne2019`.
The implementation and performances of RAISS are described in :cite:`Julienne2019`.

The imputation execution time is optimized by precomputing Linkage desiquilibrium between SNPs.


Dependencies
============
If you need to compute LD matrices from your reference panel,
RAISS requires plink version 1.9 for the precomputation of LD: `<https://www.cog-genomics.org/plink2>`_

Installation
============

.. code-block:: shell

  pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/raiss.git
  pip3 install -r https://gitlab.pasteur.fr/statistical-genetics/raiss/-/raw/master/requirements.txt


Alternatively, RAISS is available as a docker container:

https://quay.io/repository/biocontainers/raiss?tab=tags


Precomputation of LD-correlation
=================================

The imputation is based the Linkage desiquilibrium
between SNPs.

To save computation time, the LD is computed before imputation and saved as tabular format.
To limit the number of SNP pairs, the LD is computed between pairs of
SNPs in a approximately LD-independent regions. Regions defined by :cite:`Berisa2015` or  computed using
`bigsnpr <https://privefl.github.io/bigsnpr/reference/snp_ldsplit.html>`_ (:cite:`privee2022`) are provided in the package
data folder for all several genome build and ancestries.

To compute the LD you need to specify a reference panel splitted by chromosomes
(bed, fam and bim formats of plink, see `PLINK formats <https://www.cog-genomics.org/plink2/formats>`_ )


.. code-block:: python

  # path to the Region file
  region_berisa = "/mnt/atlas/PCMA/WKD_Hanna/cleaned_jass_input/Region_LD.csv"
  # Path to the reference panel
  ref_folder="/mnt/atlas/PCMA/1._DATA/ImpG_refpanel"
  # path to the folder to store the results
  ld_folder_out = "/mnt/atlas/PCMA/WKD_Hanna/impute_for_jass/berisa_ld_block"
  raiss.LD.generate_genome_matrices(, ...)

Since 2021, LD can also be specified as scipy sparse matrix (.npz), the index must be provided in a separate file (one id by line)

Input format:
=============

GWAS results files must be provided in the tabular format by chromosome (tab separated)
all in the same folder with the following columns with the same header:

+----------+-------+------+-----+--------+
| rsID     | pos   | A0   | A1  |  Z     |
+==========+=======+======+=====+========+
| rs6548219| 30762 | A	  | G   | -1.133 |
+----------+-------+------+-----+--------+

The files names must follow this pattern "z_{GWAS_TAG}_chr{1..22}.txt"
This format can be obtained with the `JASS PreProcessing package <https://gitlab.pasteur.fr/statistical-genetics/JASS_Pre-processing>`_.


Launching imputation on one chromosome
======================================

RAISS has an interface with the command line.

Here is an example for imputing the file z_FINNS_IL-6_chr22.txt

.. code-block:: shell

  raiss --chrom chr22 --gwas FINNS_IL-6 --ld-folder ${path-to-ld-matrices} --ref-folder ${path-to-the-reference-panel}  --zscore-folder ${path-to-input data} --output-folder ${folder to store imputed files} --eigen-threshold 0.000001

see details and all parameters in Command Line Usage bellow and note that the eigen_threshold parameter should be adapted to your reference panel (see the **Optimizing RAISS parameters for your data** section)

If you have access to a cluster, an efficient way to use RAISS is to launch
the imputation of each chromosome on a separate cluster node. The script
`launch_imputation_all_gwas.sh <https://gitlab.pasteur.fr/statistical-genetics/raiss/blob/master/scripts/launch_imputation_all_gwas.sh>`_
contain an example of raiss usage with a SLURM scheduler.

Output
======

The raiss package outputs imputed GWAS files in the tabular format:

+-------------+----------+------------+------------+---------+-------+----------+---------------+
|     rsID    |    pos   |     A0     |      A1    |    Z    |  Var  | ld_score | imputation_R2 |
+=============+==========+============+============+=========+=======+==========+===============+
|  rs3802985  |  198510  |      T     |      C     |  0.334  | -1.0  |   inf    |      2.0      |
+-------------+----------+------------+------------+---------+-------+----------+---------------+
| rs111876722 |  201922  |      C     |      T     |  0.297  |	0.09 |  5.412   |    0.91578    |
+-------------+----------+------------+------------+---------+-------+----------+---------------+

*Variance is set to -1 for variants present in the input dataset*

Optimizing RAISS parameters for your data
========================================

Raiss package contains a subcommand **raiss performance-grid-search**
to assess its performance on your data and fine tune RAISS parameter.

Test procedure :
1. Mask N SNPs on a chromosome
2. Impute masked files for different values of the --eigen-threshold
and the --minimum-ld parameters
3. Compute correlation (and other statistics) between genotype Z-values to imputed Z-values

To perform this test follow this procedure :

1. Create a folder to store masked z-score files
2. Create a folder to store z-score files imputed with different parameter
3. Adapt the following command snippet to apply the command to your data:

Here the command example run for the harmonized GWAS files z_FINNS-IL-6.txt for eigen_threshold values in [0.000001, 0.001,0.1] and ld-threshold in [0,10,20].
Note that if you allocate more than 1 cpu to the job, different parameter combination will be tested in parallel.

.. code-block:: shell

  raiss --ld-folder ${path-to-ld-matrices} --ref-folder ${path-to-the-reference-panel} --gwas FINNS_IL-6 --chrom chr22 --ld-type ${'scipy' or 'plink'} performance-grid-search --harmonized-folder ${path-to-harmonized data} --masked-folder ${folder to store partially masked input data} --imputed-folder $ --output-path ${path-to-save the performance report} --eigen-ratio-grid '[0.000001, 0.001,0.1]' --ld-threshold-grid '[0,10,20]' --n-cpu ${number of cpu to be allocated to the job}


The file Perf_GWAS_TAG (Perf_FINNS_IL-6 here) ressembles the following output:

.. csv-table:: Performance Report
  :widths: 10, 10, 10,10, 10, 10,10,10,10,10
  :header-rows: 1

  "eigen_ratio","min_ld","N_SNP","fraction_imputed","cor","mean_absolute_error","median_absolute_error","min_absolute_error","max_absolute_error","SNP_max_error"
  0.1,0,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
  0.1,2,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
  0.1,5,2840,0.568,0.978,0.277,0.169,1.47e-05,6.92,"rs5756504"
  0.1,7,2550,0.51,0.978,0.275,0.164,0.000285,6.92,"rs5756504"
  0.15,0,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
  0.15,2,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
  0.15,5,2450,0.49,0.976,0.281,0.172,2.43e-05,4.22,"rs59411032"
  0.15,7,2320,0.465,0.976,0.282,0.172,0.00044,4.22,"rs59411032"
  0.2,0,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
  0.2,2,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
  0.2,5,2040,0.408,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
  0.2,7,2020,0.403,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"


* **eigen_ratio** : eigen ratio parameter that was tested.
* **min_ld** : eigen ratio parameter that was tested.
* **N_SNP** : number of the masked SNPs that were successfully imputed (i.e. not filtered out by the R2 criteria and/or min_ld criteria)
* **fraction_imputed** : fraction of the masked SNPs that were successfully imputed (N_SNP / total_number_of_masked_SNP)
* **cor** : the correlation between imputed and genotyped Z-scores.
* **mean_absolute_error** : :math:`\mathbb{E}|Z_{imputed} - Z_{masked}|`
* **median_absolute_error** : :math:`median|Z_{imputed} - Z_{masked}|`
* **min_absolute_error** : :math:`min|Z_{imputed} - Z_{masked}|`
* **max_absolute_error** : :math:`max|Z_{imputed} - Z_{masked}|`
* **SNP_max_error** : :math:`argmax|Z_{imputed} - Z_{masked}|`

To pick the best parameters, we recommend to search for a compromise between low imputation error and an high fraction of masked SNPs imputed
(a trade-off between **fraction_imputed** and **mean_absolute_error**).

The optimal eigen_ratio and min_ld can vary depending on the density of your reference panel and input data.
Hence, we recommend to run a grid search to pick the best parameter for your data.
However, so far, we never observed a difference of performance from one chromosome to another.
We suggest testing on the chr22 for computational efficiency. Note that this performance report
evaluate RAISS on limited number of SNPs. Hence, we recommend to complement this report with our sanity check report_snps
that check if the number of new genome wide significant hits is coherent with the number of imputes SNPs.

Generating report and sanity ckecks
=======================================================

Once you have imputed your dataset, you might want to know how many SNPs have been imputed and
with which imputation quality. More importantly, you might want to make sure that the distribution
of imputed signal is coherent with what is expected.


Indeed, if the population of your reference panel is too  different from the one of your study, the Linkage
desequilibrium matrices derived from your panel might not be adapted and might generate false positive

To run a sanity check on your data use the **raiss sanity-check** command

.. code-block:: shell

  raiss sanity-check --trait z_{trait} --harmonized-folder {folder_input_data_files} --imputed-folder {folder_imputed_files}

Here is a sanity check log example :

.. code-block:: text

  Number of SNPs
  in harmonized file: 7245111
  in imputed file: 10665361
  Proportion imputed : 0.32068769167775946


  Number of SNPs by level of significance
                   harmonized  harmonized_by_1Mbregion  imputed  imputed_by_1MBregion  imputed_proportion  imputed_hit_OR
  5.00e-02-1.00e+00     6506492                       16  9628715                    14            0.324262        0.932818
  5.00e-06-5.00e-02      738573                     2682  1036583                  2684            0.287493        0.999821
  5.00e-08-5.00e-06          43                       17       60                    20            0.283333        1.080485
  0.00e+00-5.00e-08           3                        1        3                     1            0.000000        0.999448

  Number of imputed SNPs by level of imputation quality
  (0.8, 0.9]     2086501
  (0.6, 0.8]     1333749
  (0.0, 0.6]           0
  (0.9, 0.95]          0
  (0.95, 1.0]          0

The first block of the log gives you information about total number of SNPs
contained in the harmonized file and imputed file.

The second block reports the number of SNPs by significance strata in the harmonized file and imputed file.
the _by_1MBregion columns reports the number of 1Mb region reaching significance (minimal p-value of snps contained in the region).
This provide you with a rough estimate of the number of LD independent hits before and after imputation.
the imputed_hit_OR compares the imputed_by_1MBregion and harmonized_by_1Mbregion and reports if the number of new hit in imputed data
is reasonable knowing the increased coverage.

The third block provides the number of imputed SNPs by imputation quality strata.


Command Line Usage
==================

.. argparse::
  :ref: raiss.__main__.add_chromosome_imputation_argument
  :prog: raiss

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
.. automodule:: raiss
   :members:
* :ref:`search`

.. autosummary::
   :toctree: _autosummary

.. bibliography:: reference.bib