.. jass_preprocessing documentation master file, created by
   sphinx-quickstart on Wed Nov  7 11:03:55 2018.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to jass_preprocessing's documentation!
==============================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   index

What is JASS preprocessing?
===========================
JASS preprocessing is a command-line tool that takes heterogeneous GWAS summary
statistics as input, then standardizes and quality-checks them. The standardized
summary statistic files it produces can be used as input for the JASS Python
package and the RAISS imputation package.


Overview
========
The QC and preprocessing steps proceed as follows:

#. Map the columns of each heterogeneous GWAS input file to standardized names
#. Select the GWAS SNPs that are present in the input reference panel
#. Align the coded allele of the GWAS data with the reference panel
#. Infer the per-SNP sample size if it is absent from the input data (from the MAF, standard deviation and genetic effect)
#. Filter out SNPs with a heterogeneous sample size (the JASS and RAISS packages assume that the sample size is constant across SNPs)
#. Normalize the effect sizes to Z-scores
#. Save the output by chromosome, as in the following example:

+-----------+-------+----+----+--------+
| rsID      | pos   | A0 | A1 | Z      |
+===========+=======+====+====+========+
| rs6548219 | 30762 | A  | G  | -1.133 |
+-----------+-------+----+----+--------+
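
As an illustration of steps 3 and 6, here is a minimal Python sketch. It is not the package's implementation. The function names ``z_from_pvalue`` and ``align_z`` are hypothetical, the Z-score is assumed to be derived from a two-sided p-value and the sign of the effect, and the output is assumed to be expressed relative to the alternative allele (A1) of the reference panel. Strand flips are ignored.

.. code-block:: python

  from statistics import NormalDist


  def z_from_pvalue(pvalue, beta):
      """Signed Z-score from a two-sided p-value (0 < p <= 1) and the effect sign."""
      magnitude = abs(NormalDist().inv_cdf(pvalue / 2))
      return magnitude if beta >= 0 else -magnitude


  def align_z(z, effect_allele, other_allele, ref_a0, ref_a1):
      """Express Z relative to the panel's A1, or return None if alleles mismatch."""
      if (effect_allele, other_allele) == (ref_a1, ref_a0):
          return z       # same coding as the reference panel
      if (effect_allele, other_allele) == (ref_a0, ref_a1):
          return -z      # alleles are swapped: flip the sign
      return None        # alleles do not match the panel: discard the SNP


  z = z_from_pvalue(0.76, beta=0.02)     # beta value is made up for the example
  print(round(z, 1))                     # 0.3
  print(align_z(z, "C", "T", "C", "T"))  # effect allele is the panel's A0: sign flipped

A SNP for which ``align_z`` returns ``None`` would simply be dropped in this sketch.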

* (Optional step) Save the output to a single file with a chromosome column
  (the input format needed to perform LD-score regression). The additional output columns are
  P, the p-value, and N_effective, the effective sample size estimated or retrieved from the data.
  The effective sample size is the total sample size for continuous traits and
  1 / ( 1/Ncases + 1/Ncontrol ) for binary traits.

+-------+-----------+---------+----+----+-----+------+-------------+
| chrom | rsID      | pos     | A0 | A1 | Z   | P    | N_effective |
+=======+===========+=========+====+====+=====+======+=============+
| 1     | rs4075116 | 1003629 | C  | T  | 0.3 | 0.76 | 10220.98    |
+-------+-----------+---------+----+----+-----+------+-------------+
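
The effective sample size definition given above can be written as a short Python function. This is a sketch only: the function name ``effective_sample_size`` and the case/control counts are hypothetical, while 253288 is the ``Nsample`` value of the GIANT height example used in this documentation.

.. code-block:: python

  def effective_sample_size(n_total=None, n_cases=None, n_controls=None):
      """Total N for continuous traits, 1 / (1/Ncases + 1/Ncontrol) for binary traits."""
      if n_cases is not None and n_controls is not None:
          # Algebraically equal to 1 / (1/n_cases + 1/n_controls).
          return n_cases * n_controls / (n_cases + n_controls)
      if n_total is None:
          raise ValueError("provide n_total, or both n_cases and n_controls")
      return float(n_total)


  print(effective_sample_size(n_total=253288))                   # 253288.0
  print(effective_sample_size(n_cases=5000, n_controls=15000))   # 3750.0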


Installation
============

In a terminal, execute the following lines:

.. code-block:: shell

  pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/JASS_Pre-processing

Input
======

* **A reference panel** in the format below. The user is expected to provide a reference panel
  in TSV format with the following columns, in this order: chromosome, rsID, minor allele
  frequency, position, reference allele, alternative allele. The file must have **no header**.

+-----+------------+-----------+-------+-----+-----+
|  1  | rs62635286 | 0.0970447 | 13116 |  T  |  G  |
+-----+------------+-----------+-------+-----+-----+
|  1  | rs63125786 | 0.0970447 | 15116 |  T  |  A  |
+-----+------------+-----------+-------+-----+-----+
|  1  | rs5686     | 0.1970447 | 17116 |  A  |  G  |
+-----+------------+-----------+-------+-----+-----+
|  1  | rs892586   | 0.7670447 | 23116 |  C  |  G  |
+-----+------------+-----------+-------+-----+-----+
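
Because the reference panel has no header, column names have to be supplied when the file is read. The following Python sketch parses such a file with the standard library. The column labels in ``REF_COLUMNS`` and the function ``read_reference_panel`` are hypothetical names chosen for this example, not names used by the package.

.. code-block:: python

  import csv
  import io

  # Hypothetical labels, in the column order described above.
  REF_COLUMNS = ["chr", "snp_id", "MAF", "pos", "ref", "alt"]

  # Two rows copied from the example table, tab-separated, no header.
  example_panel = (
      "1\trs62635286\t0.0970447\t13116\tT\tG\n"
      "1\trs63125786\t0.0970447\t15116\tT\tA\n"
  )


  def read_reference_panel(handle):
      """Parse a headerless, tab-separated reference panel into a list of dicts."""
      rows = []
      for record in csv.reader(handle, delimiter="\t"):
          if len(record) != len(REF_COLUMNS):
              raise ValueError(f"expected {len(REF_COLUMNS)} columns, got {len(record)}")
          row = dict(zip(REF_COLUMNS, record))
          row["MAF"] = float(row["MAF"])
          row["pos"] = int(row["pos"])
          rows.append(row)
      return rows


  panel = read_reference_panel(io.StringIO(example_panel))
  print(panel[0]["snp_id"], panel[0]["pos"])   # rs62635286 13116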


* The **GWAS folder** containing all the raw GWAS data (corresponds to the ``--input-folder`` command-line parameter). Each file holds all chromosomes and may be compressed or uncompressed
* A descriptor CSV file (see the example below and `here <https://gitlab.pasteur.fr/statistical-genetics/jass_suite_pipeline/-/blob/master/input_files/Data_test_EAS.csv>`_) that describes each GWAS summary statistic file (corresponds to the ``--gwas-info`` command-line parameter). It contains:

  * a header
  * one line per study
  * the field categories listed below:

+-------------------------------------------+---------------------------------------------------------------+
|                     category              |                            field name                         |
+===========================================+===============================================================+
|             path to the data              |                            filename                           |
+-------------------------------------------+---------------------------------------------------------------+
|            study info fields              | Consortium,Outcome,FullName,Nsample,Ncase,Ncontrol,Nsnp       |
+-------------------------------------------+---------------------------------------------------------------+
|    names of the header in the GWAS file   |      snpid,a1,a2,freq,pval,n,z,OR,se,code,imp,ncas,ncont      |
+-------------------------------------------+---------------------------------------------------------------+

**Study field definitions**:

* filename: name of the GWAS summary statistic file as it appears in the **GWAS folder**
* Consortium: the consortium of the study (can also be the category of the trait), in **upper case and without _ characters**
* Outcome: a short tag for the outcome of the study, in **upper case and without _ characters**
* FullName: full description of the trait (for your own information; not used in the cleaning process)
* Nsample: number of samples in the study
* Ncase: number of cases in the study (leave empty if the trait is continuous)
* Ncontrol: number of controls in the study (leave empty if the trait is continuous)

**Fields corresponding to column names in the summary statistic file**:

* snpid: name of the column storing the rsID in the GWAS file
* POS: name of the column storing the position in the GWAS file
* CHR: name of the column storing the chromosome in the GWAS file
* a1: name of the column storing the effect allele
* a2: name of the column storing the other allele
* freq: name of the column storing the minor allele frequency in the GWAS file
* pval: name of the column storing the p-value in the GWAS file
* n: name of the column storing the per-variant sample size (optional; inferred from the MAF, genetic effect and standard deviation if absent)
* ncas: for binary traits, name of the column storing the per-variant number of cases (optional)
* ncont: for binary traits, name of the column storing the per-variant number of controls (optional)
* z: name of the column storing the genetic effect (beta) in the GWAS file
* OR: for binary traits, name of the column storing the odds ratio, when available. Not to be confused with the genetic effect size or 'beta'.
* index_type: specifies the type of index (``rs-number`` in the example descriptor below)
* imputation_quality: (optional) name of the column storing the individual-based imputation quality. It is used to filter low-quality imputed data from the GWASs if the option ``--imputation-quality-treshold`` is used
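
To make the role of these fields concrete, the Python sketch below renames the columns of a GWAS file header using one descriptor entry. The descriptor values are taken from the GIANT example row shown later in this section. The helper names ``build_rename_map`` and ``standardize_header`` are hypothetical and only illustrate the idea of the column-mapping step.

.. code-block:: python

  # Descriptor entry for one study: standardized name -> column name in the GWAS file.
  descriptor = {
      "snpid": "MarkerName",
      "a1": "Allele1",
      "a2": "Allele2",
      "pval": "p",
      "n": "N",
      "z": "b",
      "se": "SE",
      "OR": "na",   # irrelevant for a continuous trait
  }


  def build_rename_map(descriptor):
      """Map GWAS-file column names to standardized names, skipping na/empty fields."""
      return {
          gwas_name: std_name
          for std_name, gwas_name in descriptor.items()
          if gwas_name and gwas_name.lower() != "na"
      }


  def standardize_header(header, rename_map):
      """Rename known columns and leave unknown ones untouched."""
      return [rename_map.get(column, column) for column in header]


  gwas_header = ["MarkerName", "Allele1", "Allele2", "p", "N", "b", "SE"]
  print(standardize_header(gwas_header, build_rename_map(descriptor)))
  # ['snpid', 'a1', 'a2', 'pval', 'n', 'z', 'se']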

.. warning::
  The concatenation of Consortium and Outcome must be **unique**, because it is used as an index in the cleaning process.
  Both Outcome and Consortium must be in **upper case and contain no _ characters**.
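
These constraints can be checked before running the tool. The Python sketch below is one possible check, not part of the package: the function ``check_descriptor`` is hypothetical, the ``_`` separator used to build the key is an assumption, and the second and third file names in the example are made up.

.. code-block:: python

  import csv
  import io


  def check_descriptor(rows):
      """Return error messages for the Consortium/Outcome naming rules."""
      errors = []
      seen = set()
      for line_no, row in enumerate(rows, start=2):   # line 1 is the header
          for field in ("Consortium", "Outcome"):
              value = row[field]
              if value != value.upper() or "_" in value:
                  errors.append(f"line {line_no}: {field} {value!r} must be upper case without '_'")
          key = row["Consortium"] + "_" + row["Outcome"]   # separator is an assumption
          if key in seen:
              errors.append(f"line {line_no}: duplicate study index {key!r}")
          seen.add(key)
      return errors


  example = io.StringIO(
      "filename,Consortium,Outcome\n"
      "GIANT_HEIGHT_Wood_et_al.txt,GIANT,HEIGHT\n"
      "bmi_study.txt,Giant_2,BMI\n"
      "height_copy.txt,GIANT,HEIGHT\n"
  )
  for message in check_descriptor(list(csv.DictReader(example))):
      print(message)

In this example the second study is reported for its Consortium value and the third for reusing the GIANT/HEIGHT pair.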

Here is an example of a descriptor file (downloadable example `here <https://gitlab.pasteur.fr/statistical-genetics/jass_suite_pipeline/-/blob/master/input_files/Data_test_EAS.csv>`_). Fields that are irrelevant for a study (for example, the odds ratio for a continuous trait) must be filled with na or left empty.
Some fields, such as imputation_quality, are optional. If unused, they can be filled with na.

.. csv-table:: GWAS information table
  :header-rows: 1

  "filename","Consortium","Outcome","FullName","Type","Nsample","Ncase","Ncontrol","Nsnp","snpid","POS","a1","a2","freq","pval","n","z","OR","se","code","imp","ncas","ncont","imputation_quality","index_type"
  "GIANT_HEIGHT_Wood_et_al.txt","GIANT","HEIGHT","Height","Anthropometry",253288,na,na,2550858,"MarkerName","position","Allele1","Allele2","Freq.Allele1.HapMapCEU","p","N","b",na,"SE",na,na,na,na,"imputationInfo","rs-number"


Command line usage example:
============================

The complete preprocessing (all the steps described in the Overview section) can be launched from the command line:

.. argparse::
  :ref: jass_preprocessing.__main__.add_preprocessing_argument
  :prog: jass_preprocessing

Indices and tables
==================


* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

.. automodule:: jass_preprocessing
   :members: