"Aligning a GWAS to a genetic reference with gwas2cojo.py"

28 dec 2019

Dealing with a gazillion different undocumented notations and conventions for writing down variant information is a common problem in bioinformatics. Different effect/other alleles, variant names, reference genome build, chromosome name conventions or just different column names all prevent easy comparison between genetic variants. gwas2cojo.py is there to automatically convert a GWAS to a certain genetic reference.

For example, in my bachelor end project I had to convert the UKBioBank CAD GWAS dataset to a format readable by SMR. Our genetic dataset and UKBioBank had different ideas about the allele notation for a given variant. The effect allele and other allele were be swapped, or the variant might be recorded as lying on the other strand (translated). Together with frequency information, this is mostly reversible. However, in the case of ambiguous ambivalent alleles (A/T or G/C variant with frequencies close to 50%), this is impossible (there is just not enough information) and the variant needs to be discarded. Dealing with this for multiple versions of the GWAS, and different other similar problems led me to write a tool to automate all of this.


Thus gwas2cojo.py was born, which takes as input a GWAS and genetic reference and outputs the GWAS in the notation of the genetic reference in the COJO format. After testing against other GWAS datasets online, it ended up being quite long (700 lines including CLI help and comments). gwas2cojo.py is included by default in the SCRIPTS directory of QTLToolkit/QTLToolkit2 or available here.

The COJO format is the most simple format possible that contains all information in a GWAS. The format is easily moldable with standard *nix tools. Here is an example:

SNP A1 A2 freq b se p n
rs2073813 A G 0.16054 0.00184 0.01385 0.8945 253503
rs3131969 G A 0.81933 -0.00361 0.01369 0.792062 251851
rs3131968 G A 0.8191 -0.00373 0.01369 0.78532 251852
rs3131967 C T 0.81446 -0.005 0.01384 0.717932 251852
rs3115858 T A 0.82673 -0.00441 0.01393 0.75168 255643
rs3131962 G A 0.81798 -0.00162 0.01379 0.906485 252504

The genetic reference is read from a genetic summary statistics file as outputed by PLINK or reference GWAS file. GWAS and genetic input files are assumed to be text files, with one row per line and fields separated by any number of whitespace characters. The first line is assumed to be the header. This was found to be the case for most GWAS datasets encountered online.

The feature list ended up a bit long for a small script:

  • Autodetection of columns from the first 100 GWAS rows. Nearly every GWAS file found online has different column names.
  • Running gwas2cojo.py with insufficient column information will print copy-paste ready command-line arguments to fill in this information.
  • Report file generation with an explanation of all failed and discarded variants, and a script in QTLToolkit to verify the report file.
  • Dry-run by default.
  • Live progress and straight line R^2 statistic of the GWAS vs. genetic effect allele frequency. This should be close to 1 (only shown if numpy is installed).
  • Stopping the reading of the GWAS or genetic file halfway with Ctrl-C, without stopping the complete problem so that small test files can be created or statistics for a small sample can be gathered.
  • Python version 2 and 3 support
  • Support for a single chr_bp column vs. separate chr and bp columns, or a single nsamples vs. separate ncases and ncontrol.
  • Autodetection of the most likely genome build from header
  • When reference genome build conversion is needed, this is done through pyliftover if installed, or an error message with installation instructions is show if it is not.
  • Of courese, allele conversion to genetic reference. When two alleles are swapped, the beta value is negated. When the frequency difference between the genetic reference and GWAS is too large, the variant is discarded.
  • Correct handling of the various indel notations.
  • Variant equality is done by comparision of chromosome and bp position. Variant name is taken from the genetic reference (GWAS data can often have different variant names).
  • Chromosome name comparison is implemented for many different chromosome name conventions, including with or without leading chr, leading zeroes, using numbers for the X, Y and MT chromosomes and different notations for the MT chromosome.

As I found more and more GWAS files to test the program on, more and more edge-cases were solved in gwas2cojo.py. I also did some optimization based on klines/sec. The code is not pretty :) However, the verification script that checks whether the output is correct is however short and simple, and hasn't found an error until now.

I hope I can help you prevent you from redoing all this work yourselves, as discovering all the differences in notation was a long and painfull project. Let me know if you found edge-cases where it dit not work, or if you have other improvements!

Update 2020: as this tool recently gained some interest, I've deciced to upload one of the genetics references I'm using in my current research as a starter for others (as is, without waranty) here. It's a summary statistics of the 1000Genomes project after filtering on 1% MAF and removing non-bivalent alleles. If, for example, you're doing Mendelian Randomization's between eQTLs and GWAS's, this wouldn't work and you'd need to use the genetic dataset used for the eQTLs as reference (use something like qctool to generate a summary statistics file).

First ten rows of the genetics reference (needed: chrom/pos/id/ref/alt/frequency):

CHROM  POS    ID           REF                     ALT  AF         EUR_AF  CHROM:POS:REF:ALT
1      10177  rs367896724  A                       AC   0.425319   0.4056  1:10177:A:AC
1      10352  rs555500075  T                       TA   0.4375     0.4264  1:10352:T:TA
1      10616  rs376342519  CCGCCGTTGCAAAGGCGCGCCG  C    0.993011   0.994   1:10616:CCGCCGTTGCAAAGGCGCGCCG:C
1      11008  rs575272151  C                       G    0.0880591  0.0885  1:11008:C:G
1      11012  rs544419019  C                       G    0.0880591  0.0885  1:11012:C:G
1      13110  rs540538026  G                       A    0.0267572  0.0567  1:13110:G:A
1      13116  rs62635286   T                       G    0.0970447  0.1869  1:13116:T:G
1      13118  rs200579949  A                       G    0.0970447  0.1869  1:13118:A:G
1      13273  rs531730856  G                       C    0.0950479  0.1471  1:13273:G:C

Here is an example of how I use this to convert ~40 GWASs. While column autodetection works for a great deal of GWASs, you still need to do manual work in ambiguous cases.

python3 ./QTLToolKit2/SCRIPTS/gwas2cojo.py \
        --gen:build     hg19 \
        --gen           1kGp3.ref.1maf.nonbia.sumstats.gz \
        --gwas          "${gwasfilename}" \
        --report        "out/${name}.report" \
        --gen:ident     ID \
        --gen:chr       CHROM \
        --gen:other     REF \
        --gen:effect    ALT \
        --gen:eaf       AF \
        --out           "out/${name}.cojo"