Welcome to jass_preprocessing’s documentation!

What is JASS preprocessing?

JASS preprocessing is a command line tool that takes heterogeneous GWAS summary statistics as input and performs standardization and quality checks. It outputs standardized summary statistic files that can be used as input to the JASS python package and the RAISS imputation package.

Overview

The QC and preprocessing steps are as follows:

  1. Map the columns of a heterogeneous GWAS input file to standardized names

  2. Select the GWAS SNPs that are in the input reference panel

  3. Align the coded allele of the GWAS data with the reference panel

  4. Infer the per-SNP sample size if it is not present in the input data (from the MAF, standard deviation and genetic effect)

  5. Filter out SNPs with a heterogeneous sample size (the JASS and RAISS packages assume the sample size to be constant across SNPs)

  6. Normalize the effect sizes to Z-scores

  7. Save the output by chromosome, as in the following example:

    rsID        pos     A0   A1   Z
    rs6548219   30762   A    G    -1.133

  • (Optional step) Save the output to one file with a chromosome column (the input format needed to perform LD score regression). The additional output columns are: P, the p-value; and N_effective, the effective sample size estimated or retrieved from the data. The effective sample size is the total sample size for continuous traits and 1 / ( 1/Ncases + 1/Ncontrol ) for binary traits.

    chrom   rsID        pos       A0   A1   Z     P      N_effective
    1       rs4075116   1003629   C    T    0.3   0.76   10220.98
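The P and N_effective columns can be illustrated with a short Python sketch. The function names below are hypothetical and are not part of jass_preprocessing. The sketch assumes that the Z-score is the effect size divided by its standard error and that the p-value is two-sided under a standard normal distribution; the effective sample size follows the definition given above.

```python
import math


def effective_sample_size(n_sample=None, n_case=None, n_control=None):
    """Effective sample size as defined above (hypothetical helper)."""
    if n_case is not None and n_control is not None:
        # Binary trait: 1 / (1/Ncases + 1/Ncontrol)
        return 1.0 / (1.0 / n_case + 1.0 / n_control)
    # Continuous trait: total sample size
    return float(n_sample)


def z_score(beta, se):
    # Assumption of this sketch: Z = beta / standard error
    return beta / se


def two_sided_p_value(z):
    # P = 2 * (1 - Phi(|z|)), written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))


print(round(two_sided_p_value(0.3), 2))
```

With Z = 0.3 this prints 0.76, which agrees with the example row above.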

Installation

In a terminal, execute the following command:

pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/JASS_Pre-processing

Input

  • A reference panel in the format below. The user is expected to provide a reference panel in tsv format, without header, with the following columns in this order: chromosome, rsID, minor allele frequency, position, reference allele, alternative allele.

    1   rs62635286   0.0970447   13116   T   G
    1   rs63125786   0.0970447   15116   T   A
    1   rs5686       0.1970447   17116   A   G
    1   rs892586     0.7670447   23116   C   G
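As an illustration of this format, the sketch below reads the example rows with the Python standard library and keeps only the GWAS SNPs found in the panel (step 2 of the Overview). The column labels, the function names and the SNP ID rs0000 are assumptions made for this example; they do not reflect how jass_preprocessing is implemented internally.

```python
import csv
import io

# Rows copied from the reference panel example above (tab-separated, no header).
REF_PANEL_TSV = (
    "1\trs62635286\t0.0970447\t13116\tT\tG\n"
    "1\trs63125786\t0.0970447\t15116\tT\tA\n"
    "1\trs5686\t0.1970447\t17116\tA\tG\n"
    "1\trs892586\t0.7670447\t23116\tC\tG\n"
)

# Labels chosen for this sketch; the file itself has no header.
COLUMNS = ["chr", "rsID", "MAF", "pos", "ref", "alt"]


def read_reference_panel(handle):
    """Return a dict mapping each rsID to a dict of its panel columns."""
    panel = {}
    for row in csv.reader(handle, delimiter="\t"):
        record = dict(zip(COLUMNS, row))
        record["MAF"] = float(record["MAF"])
        record["pos"] = int(record["pos"])
        panel[record["rsID"]] = record
    return panel


def select_snps_in_panel(gwas_rsids, panel):
    """Keep only the GWAS SNPs present in the panel, preserving their order."""
    return [rsid for rsid in gwas_rsids if rsid in panel]


panel = read_reference_panel(io.StringIO(REF_PANEL_TSV))
# "rs0000" is a made-up ID that is absent from the panel.
kept = select_snps_in_panel(["rs5686", "rs0000", "rs892586"], panel)
print(kept)
```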

  • The GWAS folder containing all raw GWAS data (corresponds to the --input-folder command line parameter): all chromosomes in one file, compressed or uncompressed

  • A descriptor csv file (see the example below) that describes each GWAS summary statistic file (corresponds to the --gwas-info command line parameter). It contains:
      * a header
      * 1 line per study
      * the following field categories:

    category                               field name
    path to the data                       filename
    study info fields                      Consortium,Outcome,fullName,Nsample,Ncase,Ncontrol,Nsnp
    names of the header in the GWAS file   snpid,a1,a2,freq,pval,n,z,OR,se,code,imp,ncas,ncont

Study field definition:

  • filename: name of the GWAS summary statistic file as it appears in the GWAS folder

  • Consortium: the consortium of the study (can also be the category of the trait), in upper case and without _ characters

  • Outcome: a short tag for the outcome of the study, in upper case and without _ characters

  • FullName: full description of the trait (for your own information; not used in the cleaning process)

  • Nsample: number of samples in the study

  • Ncase: number of cases in the study (leave empty if the trait is continuous)

  • Ncontrol: number of controls in the study (leave empty if the trait is continuous)

Fields corresponding to column names in the summary statistic file

  • snpid: name of the column storing rsid in the gwas file

  • POS: name of the column storing the position in the gwas file

  • CHR: name of the column storing the chromosome in the gwas file

  • a1: effect allele

  • a2: Other allele

  • freq: name of the column storing the minor allele frequency in the gwas file

  • pval: name of the column storing the p-value in the gwas file

  • n: name of the column storing the per-variant sample size (optional; inferred from the MAF, genetic effect and standard deviation if absent)

  • ncas: for binary traits, name of the column storing the per-variant number of cases (optional)

  • ncont: for binary traits, name of the column storing the per-variant number of controls (optional)

  • z: name of the column storing the genetic effect (beta) in the gwas file

  • OR: for binary traits, the odds ratio when available. Not to be confused with the genetic effect size or ‘beta’.

  • index_type: specifies the type of index

  • imputation_quality: (optional) column containing the individual-based imputation quality. Used to filter out low-quality imputed data from the GWASs if the option --imputation-quality-treshold is used

Warning

Note that the concatenation of Consortium and Outcome must be unique because it is used as an index in the cleaning process. Both Outcome and Consortium must be in upper case and contain no _ characters.
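These constraints can be checked before launching the preprocessing. The following Python sketch is only an illustration: the function name is hypothetical, and joining Consortium and Outcome with an underscore to form the index is an assumption, not something this documentation specifies.

```python
from collections import Counter


def check_study_tags(studies):
    """Return a list of problems found in (Consortium, Outcome) pairs.

    Hypothetical helper: it mirrors the rules stated in the warning above.
    """
    problems = []
    for consortium, outcome in studies:
        for name, value in (("Consortium", consortium), ("Outcome", outcome)):
            if value != value.upper() or "_" in value:
                problems.append(f"{name} {value!r} must be upper case without '_'")
    # Assumption: the index is built as CONSORTIUM_OUTCOME.
    counts = Counter(f"{consortium}_{outcome}" for consortium, outcome in studies)
    for index, count in counts.items():
        if count > 1:
            problems.append(f"index {index!r} is used by {count} studies")
    return problems


print(check_study_tags([("GIANT", "HEIGHT"), ("GIANT", "HEIGHT"), ("Giant", "BMI")]))
```

An empty list means that no problem was detected.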

Here is an example of a descriptor file (downloadable example here). Fields that are irrelevant for a study (for example, the odds ratio for a continuous trait) must be filled with na or left empty. Some fields, such as imputation_quality, are optional; if not used, they can be filled with na.

GWAS information table (one descriptor field per line, with its value for the example study)

    field                value
    filename             GIANT_HEIGHT_Wood_et_al.txt
    Consortium           GIANT
    Outcome              HEIGHT
    FullName             Height
    Type                 Anthropometry
    Nsample              253288
    Ncase                na
    Ncontrol             na
    Nsnp                 2550858
    snpid                MarkerName
    POS                  position
    a1                   Allele1
    a2                   Allele2
    freq                 Freq.Allele1.HapMapCEU
    pval                 p
    n                    N
    z                    b
    OR                   na
    se                   SE
    code                 na
    imp                  na
    ncas                 na
    ncont                na
    imputation_quality   imputationInfo
    index_type           rs-number
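The descriptor acts as a mapping between the standardized field names and the column names of each raw GWAS file (step 1 of the Overview). The Python sketch below illustrates the idea with the example study above. The helper names and the GWAS record are made up for illustration and do not reflect the actual implementation.

```python
# Descriptor entries copied from the example above:
# standardized name -> column name in the raw GWAS file.
DESCRIPTOR_ROW = {
    "snpid": "MarkerName",
    "POS": "position",
    "a1": "Allele1",
    "a2": "Allele2",
    "freq": "Freq.Allele1.HapMapCEU",
    "pval": "p",
    "n": "N",
    "z": "b",
    "OR": "na",
    "se": "SE",
}


def build_rename_map(descriptor_row):
    """Invert the descriptor (raw name -> standardized name), skipping 'na' and empty fields."""
    return {
        raw: std
        for std, raw in descriptor_row.items()
        if raw and raw.lower() != "na"
    }


def standardize_record(record, rename_map):
    """Rename the keys of one GWAS record; columns absent from the map are dropped."""
    return {rename_map[key]: value for key, value in record.items() if key in rename_map}


# A made-up GWAS record, for illustration only.
raw_record = {
    "MarkerName": "rs0000001",
    "position": "12345",
    "Allele1": "A",
    "Allele2": "G",
    "b": "0.01",
    "SE": "0.005",
    "extra_column": "ignored",
}
clean_record = standardize_record(raw_record, build_rename_map(DESCRIPTOR_ROW))
print(clean_record)
```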

Command line usage example:

The complete preprocessing (all the steps described in the Overview section) can be launched from the command line:

usage: jass_preprocessing [-h] --gwas-info GWAS_INFO --ref-path REF_PATH
                          --input-folder INPUT_FOLDER --diagnostic-folder
                          DIAGNOSTIC_FOLDER --output-folder OUTPUT_FOLDER
                          [--output-folder-1-file OUTPUT_FOLDER_1_FILE]
                          [--percent-sample-size PERCENT_SAMPLE_SIZE]
                          [--minimum-MAF MINIMUM_MAF] [--mask-MHC MASK_MHC]
                          [--additional-masked-region ADDITIONAL_MASKED_REGION]
                          [--imputation-quality-treshold IMPUTATION_QUALITY_TRESHOLD]
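As a sketch, a complete call can be assembled from Python. The flags come from the usage message above, while every path is a hypothetical placeholder to be replaced with your own files and folders.

```python
import shlex

# Hypothetical paths: replace them with your own files and folders.
args = [
    "jass_preprocessing",
    "--gwas-info", "descriptor.csv",
    "--ref-path", "ref_panel.tsv",
    "--input-folder", "raw_gwas/",
    "--diagnostic-folder", "diagnostic/",
    "--output-folder", "clean_gwas/",
    "--percent-sample-size", "0.7",
    "--minimum-MAF", "0.01",
]

# Build a shell-safe command string that can be pasted in a terminal.
command = shlex.join(args)
print(command)
```

To launch the tool directly from Python instead of printing the command, the same list can be passed to subprocess.run(args, check=True).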

Named Arguments

--gwas-info

Path to the file describing the format of the individual GWAS files, with a correct header

--ref-path

Reference panel location (used to determine which SNPs to impute)

--input-folder

Path to the folder containing the raw GWAS summary statistic files, must end with ‘/’

--diagnostic-folder

Path to the folder where diagnostic information on the preprocessing is reported, such as the distribution of the SNP sample sizes

--output-folder

Location of the main output folder for the preprocessed GWAS files (split by chromosome)

--output-folder-1-file

Optional location to store the preprocessed data in one tabular file with a chromosome column (useful to compute LDSC correlations, for instance)

--percent-sample-size

the proportion (between 0 and 1) of the 90th percentile of the sample size used to filter the SNPs

Default: 0.7

--minimum-MAF

Filter the reference panel by a minimum minor allele frequency (MAF)

Default: “0.01”

--mask-MHC

Whether the MHC region should be masked or not. default is False

Default: “False”

--additional-masked-region

List of dictionaries containing the coordinates of the regions to mask. For example: [{'chr':6, 'start':50000000, 'end': 70000000}, {'chr':6, 'start':100000000, 'end': 120000000}]

Default: “None”

--imputation-quality-treshold

minimum imputation quality in summary statistics

Default: “None”
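To make the --percent-sample-size and --additional-masked-region options more concrete, here is a minimal Python sketch of the two filters. It rests on several assumptions: SNPs are kept when their sample size is at least the given proportion of the 90th percentile, the region boundaries are inclusive, and the region list can be parsed as a Python literal. The function names and the SNP IDs are hypothetical.

```python
import ast
import statistics


def sample_size_threshold(sample_sizes, percent_sample_size=0.7):
    """Proportion of the 90th percentile of the per-SNP sample sizes."""
    # quantiles(..., n=10) returns the 9 deciles; the last one is the 90th percentile.
    p90 = statistics.quantiles(sample_sizes, n=10, method="inclusive")[-1]
    return percent_sample_size * p90


def filter_by_sample_size(snps, percent_sample_size=0.7):
    """snps is a list of (rsid, n) pairs; keep those with n >= threshold (assumption)."""
    threshold = sample_size_threshold([n for _, n in snps], percent_sample_size)
    return [(rsid, n) for rsid, n in snps if n >= threshold]


def is_masked(chrom, pos, regions):
    """True if the position falls in one of the masked regions (inclusive bounds assumed)."""
    return any(
        r["chr"] == chrom and r["start"] <= pos <= r["end"] for r in regions
    )


# Made-up SNPs: nine with n = 1000 and one with n = 100.
snps = [(f"rs{i}", 1000) for i in range(9)] + [("rs_low", 100)]
kept = filter_by_sample_size(snps)

# Region list taken from the example given for --additional-masked-region.
regions = ast.literal_eval(
    "[{'chr':6, 'start':50000000, 'end': 70000000}, "
    "{'chr':6, 'start':100000000, 'end': 120000000}]"
)
print(len(kept), is_masked(6, 60000000, regions))
```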

Indices and tables

map_gwas

Map GWAS

dna_utils

A few functions to compute the DNA complement

map_reference

Module of function

compute_score

save_output
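To give an idea of how a DNA complement can be used when aligning the coded allele with the reference panel (step 3 of the Overview), here is a minimal Python sketch. It is not the code of the dna_utils or map_reference modules. The function names are hypothetical, and the sketch assumes that a1 is the effect allele and that the aligned Z-score is expressed for the alternative allele of the panel.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def dna_complement(allele):
    """Complement of an allele, base by base (the sequence is not reversed)."""
    return "".join(COMPLEMENT[base] for base in allele.upper())


def align_z(z, a1, a2, ref, alt):
    """Return Z expressed for the panel's alternative allele, or None if the alleles do not match.

    Assumptions of this sketch: a1 is the effect allele, and the alleles may be
    reported on the opposite strand (hence the second pass on the complements).
    """
    candidates = ((a1, a2), (dna_complement(a1), dna_complement(a2)))
    for effect, other in candidates:
        if (effect, other) == (alt, ref):
            return z       # same coded allele as the panel
        if (effect, other) == (ref, alt):
            return -z      # swapped alleles: flip the sign
    return None            # alleles are inconsistent with the panel


print(align_z(1.5, "T", "G", "T", "G"))
```

Strand-ambiguous SNPs (A/T or C/G) cannot be resolved from the alleles alone; this sketch simply takes the first match and does not treat them specially.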