KGGSEE: A biological Knowledge-based mining platform for Genomic and Genetic association Summary statistics using gEne Expression
User manual 1.0
Miaoxin Li, Lin Jiang, Xiangyi Li
1 Introduction
KGGSEE is a standalone Java tool for knowledge-based analyses and can perform association analysis and causative inference analysis based on the genomic and genetic association summary statistics of complex phenotypes, gene expression profiles and related data. It can perform four major integrative analyses:
Gene-based association analysis;
Estimate phenotype-associated tissues or cell-type based on gene expression of single-cell or bulk cells in different tissues;
Conditional gene-based association analysis based on an improved effective chi-squared statistic (ECS) and multiple variant-gene mapping strategies;
Causal gene inference for complex diseases and traits based on multiple eQTLs.
More integrative analysis functions will be added to KGGSEE in the future.
2 Installation
2.1 kggsee.jar
The main library of KGGSEE, kggsee.jar, does not need an installation procedure as long as its Java Runtime Environment(JRE) v1.8 (or up) is pre-installed in your machine. The kggsee.jar can be directly executed given a file path.
2.2 Resource data
In the KGGSEE working folder, a folder named resources contains the running resource data, e.g., gene boundary and gene expression. KGGSEE will automatically download the required resource files to this folder. Users can also manually download all the files and put them into the corresponding folders.
3 Tutorials
3.1 Gene-based association analysis
- Purpose: Detect the phenotype-associated genes of a phenotype using the GWAS summary statistics.
Input data:
GWAS summary statistics compressed in a text file (a fabled data set for education purposes): examples/gwas.sum.stat.gz
Genotypes in VCF format to approximate the correction between summary statistics: resources/hg19/gty/1kg.phase3.v5.shapeit2.eur.hg19.chr1.vcf.gz
java -Xmx10g -jar kggsee.jar \ --nt 10 \ --gene-assoc \ --sum-file examples/gwas.sum.stat.gz \ --vcf-ref resources/hg19/gty/1kg.phase3.v5.shapeit2.eur.hg19.chr*.vcf.gz \ --keep-ref \ --out examples/out/geneAssoc
Important Note: The parameter of "--vcf-ref" is the genotypes in VCF format and can be downloaded from https://pmglab.top/genotypes/#/. The users should download the proper vcf files according to their research subjects' ancestry. The combination of "--vcf-ref" and "--keep-ref" can keep the parsed vcf data (KGGSEE object format) in a local folder named like "VCFRefhg19" under the output folder (here is the "examples/out/geneAssoc"). When the users want to run another analysis, the users can use the option "--saved-ref" instead of the "--vcf-ref" and "keep-ref" to save time. The parameter of "--saved-ref" is the full path of folder "VCFRefhg19" (see example in the following part).
3.2 Estimate the phenotype-associated cell-types of a phenotype (DESE)
- Purpose: Estimate the relevant cell-types of a phenotype and finely map associated genes according to selective expression.
Input data:
GWAS summary statistics compressed in a text file(a fabled data set for education purposes): examples/gwas.sum.stat.gz
Genotypes in KGGSEE objects(generated in Gene-based association analysis) to approximate the correction between summary statistics: examples/out/geneAssoc
Gene expression data compressed in a text file: resources/hs_scRNA_cluster_mean.tsv.gz
java -Xmx10g -jar kggsee.jar \ --nt 10 \ --gene-finemapping \ --expression-file resources/hs_scRNA_cluster_mean.tsv.gz \ --only-hgnc-gene \ --sum-file examples/gwas.sum.stat.gz \ --saved-ref examples/out/geneAssoc \ --out examples/out/spa \ --excel
3.3 Conditional gene-based association analysis with eDESE
Purpose: Perform conditional gene-based association analysis using different strategies to map variants to genes, i.e., physically nearby variants, gene-level and isoform-level eQTLs (also are variants). Three strategies correspond to three models, i.e., eDESE:dist, eDESE:gene and eDESE:isoform, respectively.
eDESE:dist
Input data:
GWAS summary statistics compressed in a text file (a fabled data set for education purposes): examples/gwas.sum.stat.gz;
Genotypes in KGGSEE objects (generated in Gene-based association analysis) to approximate the correction between summary statistics: examples/out/geneAssoc;
Gene expression data compressed in a text file: resources/gtex.v8.gene.mean.tsv.gz. We have provided the dataset based on the gene-expression profiles of ~50 tissues in GTEX(v8) and has been packaged this file in the download of "KGGSEE+Resources" at http://pmglab.top/kggsee/#/download. Users can also use their own gene expression profiles. The row index is gene name, and the column name is tissue name and tissue name +".SE". Each tissue has two columns, one representing the average expression value of all samples of the tissue and the other representing the standard error of the mean (SE).
java -Xmx20g \ -jar kggsee.jar \ --nt 10 \ --sum-file examples/gwas.sum.stat.gz \ --chrom-col CHR \ --pos-col BP \ --p-col P \ --gene-finemapping \ --saved-ref examples/out/geneAssoc \ --expression-file resources/gtex.v8.gene.mean.tsv.gz \ --only-hgnc-gene \ --p-value-cutoff 0.05 \ --multiple-testing bonf \ --regions-out chr6:27477797-34448354 \ --out examples/out/geneAssoceQTL
eDESE:gene
Input data:
The GWAS summary statistics compressed in a text file(a fabled data set for education purposes): examples/gwas.sum.stat.gz;
Genotypes in KGGSEE objects to approximate the correction between summary statistics (generated in Gene-based association analysis): examples/out/geneAssoc;
Gene-level expression data compressed in a .gz or text file: resources/gtex.v8.gene.mean.tsv.gz. We have provided the dataset based on the gene-expression profiles of ~50 tissues in GTEX(v8) and has been packaged this file in the download of "KGGSEE+Resources" at http://pmglab.top/kggsee/#/download. Users can also use their own gene expression profiles.
eQTLs data compressed in a gz or text file: resources/hg19/eqtl/Brain-FrontalCortex_BA9_.gene.maf05.p01.gz.eqtl.txt.gz. Our pre-calculated gene-level eQTLs based on GTEx(v8) can be downloaded from here.
java -Xmx20g \ -jar kggsee.jar \ --nt 10 \ --chrom-col CHR \ --pos-col BP \ --p-col P \ --gene-finemapping \ --sum-file examples/gwas.sum.stat.gz \ --saved-ref examples/out/geneAssoc \ --expression-file resources/gtex.v8.gene.mean.tsv.gz \ --eqtl-file resources/hg19/eqtl/Brain-FrontalCortex_BA9.gene.maf05.p01.gz.eqtl.txt.gz \ --filter-eqtl-p 0.01 \ --only-hgnc-gene \ --p-value-cutoff 0.05 \ --multiple-testing bonf \ --regions-out chr6:27477797-34448354 \ --out examples/out/geneAssoceQTL
eDESE:isoform
Input data:
The GWAS summary statistics compressed in a text file(a fabled data set for education purposes): examples/gwas.sum.stat.gz;
Genotypes in KGGSEE objects to approximate the correction between summary statistics(generated in Gene-based association analysis): examples/out/geneAssoc;
Isoform-level expression data compressed in a .gz or text file: resources/gtex.v8.transcript.mean.tsv.gz. We have provided the dataset based on the isoform/transcript-expression profiles of ~ 50 tissues in GTEX(v8) and has been packaged this file in the download of "KGGSEE+Resources" at http://pmglab.top/kggsee/#/download. Users can also use their own isoform-level expression profiles.
IsoQTL data compressed in a gz or text file: resources/hg19/eqtl/Brain-FrontalCortex_BA9_.transcript.maf05.p01.gz.eqtl.txt.gz.Our pre-calculated isoform-level eQTLs based on GTEx(v8) can be downloaded from here.
java -Xmx20g \ -jar kggsee.jar \ --nt 10 \ --sum-file examples/gwas.sum.stat.gz \ --chrom-col CHR \ --pos-col BP \ --p-col P \ --gene-finemapping \ --saved-ref examples/out/geneAssoc \ --expression-file resources/gtex.v8.transcript.mean.tsv.gz \ --eqtl-file resources/hg19/eqtl/Brain-FrontalCortex_BA9_.transcript.maf05.p01.gz.eqtl.txt.gz \ --filter-eqtl-p 0.01 \ --only-hgnc-gene \ --p-value-cutoff 0.05 \ --multiple-testing bonf \ --regions-out chr6:27477797-34448354 \ --out examples/out/geneAssoceQTL
3.4 Gene-based causality analysis
- Purpose: Detect the causal genes of a phenotype using the GWAS summary statistics and eQTL.
Input data:
GWAS summary statistics compressed in a text file(a fabled data set for education purpose): examples/gwas.sum.stat.gz
Genotypes in KGGSEE objects(generated in Gene-based association analysis) to approximate the correction between summary statistics: examples/out/geneAssoc
eQTL summary statistics compressed in a text file: resources/hg19/eqtl/Brain-FrontalCortex_BA9_.transcript.maf05.p05.gz.eqtl.txt.gz
java -Xmx10g -jar kggsee.jar \ --nt 10 \ --emic \ --eqtl-file resources/hg19/eqtl/Brain-FrontalCortex_BA9_.transcript.maf05.p05.gz.eqtl.txt.gz \ --sum-file examples/gwas.sum.stat.gz \ --beta-type 2 \ --saved-ref examples/out/geneAssoc \ --out examples/out/emic \ --excel
3.5 Drug repositioning based on the drug selective perturbation analysis(SelDP)
Purpose: Estimate the drug selective perturbation effect on the phenotype-associated genes' expression to aid the drug repositioning for complex diseases.
Input data:
GWAS summary statistics compressed in a text file (a fabled data set for education purpose): examples/gwas.sum.stat.gz;
Genotypes in KGGSEE objects (generated in Gene-based association analysis) to approximate the correction between summary statistics: examples/out/geneAssoc;
Drug/compound-induced gene expression fold changes (log2(Treated/Control)) profiles. We provided an example of drug-induced fold changes based on the Level 3 data of LINCS (2017 release) at raw perturbation profile or reverse perturbation profile;
eQTL data of phenotype-associated tissues. The eQTL-guided SelDP needs the eQTL data as one of the input datasets. We have procalculated ~50 tissues' gene-level and transcript-level eQTL data based on the GTEx (v8), and the eQTL data can be accessed at gene-level eQTLs and transcript-level eQTLs.
The command for original SelDP:
java -Xmx20g \ -jar kggsee.jar \ --nt 10 \ --sum-file examples/gwas.sum.stat.gz \ --chrom-col CHR \ --pos-col BP \ --p-col P \ --gene-finemapping \ --saved-ref examples/out/geneAssoc \ --expression-file drug-induced.gene.expression.fold.change.profile \ --only-hgnc-gene \ --p-value-cutoff 0.05 \ --multiple-testing bonf \ --regions-out chr6:27477797-34448354 \ --out examples/out/Selective_Perturbed_DrugsThe command for eQTL-guided SelDP:
java -Xmx20g \ -jar kggsee.jar \ --nt 10 \ --chrom-col CHR \ --pos-col BP \ --p-col P \ --gene-finemapping \ --sum-file examples/gwas.sum.stat.gz \ --saved-ref examples/out/geneAssoc \ --expression-file drug-induced.gene.expression.fold.change.profile \ --eqtl-file resources/hg19/eqtl/Brain-FrontalCortex_BA9_.gene.maf05.p01.gz.eqtl.txt.gz \ --filter-eqtl-p 0.01 \ --only-hgnc-gene \ --p-value-cutoff 0.05 \ --multiple-testing bonf \ --regions-out chr6:27477797-34448354 \ --out examples/out/Selective_Perturbed_Drugs
4 Functions
4.1 Gene-based association analysis by an effective chi-square statistics(ECS)
One can perform the gene-based association analysis by an effective chi-square statistics (ECS) with the GWAS p-values of variants. The p-values are converted to chi-square statistics(degree of freedom = 1). The ECS merges all chi-square statistics of a gene after correcting the redundancy of the statistics due to LD. The LD is calculated from genotypes of an ancestrally matched sample in VCF format, e.g., a panel of 1000 Genomes Project. The method of ECS is described in our paper(Paper Link).
Required options
--gene-assoc
--sum-file [/path/to/summary/file]
--vcf-ref [/path/to/vcf/file]
--keep-ref
[--saved-ref "previous/output/path/prefix"]
--out [output/path/prefix]
See an analysis example at: Gene-based association analysis
Explanations and Optional options
--gene-assoc
: The main function option.--sum-file
: The file containing GWAS summary statistics.Three columns of the GWAS summary statistic file, i.e., chromosome, physical position and p-value are a minimal requirement. The default column names are CHR, BP and P, respectively. Otherwise, users should specify the name by using
--chrom-col
,--pos-col
and--p-col
, respectively.CHR
BP
P
1
751756
0.979957
1
752566
0.863844
1
752894
0.55814
1
753405
0.968401
1
755890
0.918246
--vcf-ref
: The file containing the genotypes to calculate the genotypic correlations. For the data separated in multiple files by chromosomes, one can use the asterisk wildcard (e.g., hg19.chr*.vcf.gz) to denote the chromosome names.--keep-ref
: If used, the option will enable to save the encoded genotypes in VCF for future usage, which will speed up the next analysis.--saved-ref
: Instead of using--vcf-ref
, one can directly specify the path of the encoded genotypes generated last time by specifying the last output path.--filter-maf-le
: Filter out the variants with minor allele frequency less or equal than the specified value.--out
: Specify the path and prefix name of the output files. The main output file of the gene-based analysis is *.gene.pvalue.txt or *.gene.pvalue.xls. The following is an example:Gene
#Var
ECSP
Chrom
Pos
VarP
Ingored
RefGene
GENCODE
RERE-AS1
21
7.83E-10
1
8490320
1.17E-10
0
…
…
RNU6-2
11
0.442684
1
10355834
0.146355
0
…
…
PTPRF
188
4.59E-11
1
44083015
4.09E-11
0
…
…
Columns in the output file are gene symbol, the number of variants in the gene, p-values of gene-based association test, and the detailed information of the top variant within the gene(i.e., the variant with the smallest p-value). These columns include chromosome, physical position, p-value, whether the top variant was ignored in the gene-based association analysis, and gene feature annotations according to RefGene and GENCODE.
4.2 Finely map genes and estimate relevant cell types of a phenotype by the single-cell (or bulk-cell) type and phenotype cross annotation framework(DESE)
One can simultaneously prioritize phenotype-associated genes and cell types with GWAS p-values and gene/transcript expression profile. The GWAS p-values types and expression were analyzed by an iterative prioritization procedure. In the procedure, phenotype-associated genes were prioritized by a conditional gene-based association(using the ECS again) according to the genes’ selective expression in disease related cell-types while the phenotype related cell-types were prioritized by an enrichment analysis of Wilcoxon rank-sum test for phenotype-associated genes’ selective expression. The phenotype-associated gene list and phenotype related cell-type list were updated by turns until the two list were unchanged. The detailed method is described in our paper(Paper Link).
Required options
--gene-finemapping
--expression-file [path/to/expression/file]
--only-hgnc-gene
--sum-file [/path/to/summary/file]
--saved-ref [previous/output/path/prefix]
--out [output/path/prefix]
See an analysis example at: Estimate relevant cell-types of a phenotype
Explanations and Optional options
--gene-finemapping
: The main function option.--multiple-testing
: The multiple testing method to select significant genes for the conditional analysis. There are three settings. bonf: Standard Bonferroni correction given a family-wise error rate specified by--p-value-cutoff
. benfdr: Benjamini-Hochberg method to control the false discovery rate. fixed: Filtering by a fixed p-value cutoff.--p-value-cutoff
: The cutoff for the multiple testing.--only-hgnc-gene
: Only consider genes with hgnc gene symbols.--expression-file
: The path of gene expression file.The expression file contains gene symbols(the first column), expression mean and standard errors of the gene or transcript in a cell types or clusters. One can include the Ensembl transcript ID of a gene in the first column. When a gene has multiple transcripts, each row can only contain the data of transcript. The standard error is not pre-requisite.
Gene
Cell_cluster1
Cell_cluster1.SE
Cell_cluster2
Cell_cluster1.SE
…
PRPF19:ENST00000541371
2.5
0.234
12.5
0.564
…
PRPF19:ENST00000541391
3.2343
0.4638
10.23
0.58
…
C19orf47:ENST00000580977
9.2343
0.8638
1.23
0.548
…
…
…
…
…
…
…
--sum-file
: See above description.--filter-maf-le
: See above description.--out
: Specify the path and prefix name of the output files. One of main output files is the conditional gene-based analysis results, named *.finemapping.gene.ecs.txt or *. finemapping.gene.ecs.xls. The followingGene
Chrom
StartPos
EndPos
#Var
Group
ECSP
CondiECSP
GeneScore
C1orf162
1
112011621
112025081
21
0
2.73E-04
2.73E-04
3443
LINC01703
1
226266440
226282822
33
0
2.09E-04
2.09E-04
233.1
…
…
…
…
…
…
…
…
…
columns in the output file are gene symbol, chromosome, transcription start position, transcription end position, number of variants in the gene, the LD group ID of genes, p-values of gene-based association test, p-values of conditional gene-based association test, and the selective expression score in enriched tissue or cell-types.
Another main output files is the selective expression enrichment analysis results at different tissues or cell types, named *.celltype.txt or *. celltype.xls. The following
TissueName
RobustRegressionZ
AveragedLog(p)
Kidney-Cortex
1.70E-03
2.765004628
Brain-Cortex
2.30E-02
1.639680865
Bladder
2.50E-02
1.597056329
Brain-Hippocampus
2.70E-02
1.561281239
Colon-Transverse
3.00E-02
1.527765274
…
…
…
columns in the output file are tissue or cell-type names, the p-value of enrichment according to the selective expression derived from the robust regression z-score, the logarithm of p-value.
4.3 Multi-strategy Conditional Gene-based Association framework mainly guided by eQTLs (eDESE)
eDESE can be used to perform the conditional gene-based association analysis using different variant sets, i.e., physically nearby variants, gene-level and isoform-level eQTLs. The statistical method is the improved effective chi-square statistics (ECS). The pre-calculated gene-level and isoform-level eQTLs of 50 tissues or cell types from GTEx (v8) have been integrated into the KGGSEE resource (hg19 and hg38).
Required options
--gene-finemapping
--eqtl-file [path/to/eQTL/file of genes or transcripts]
--filter-eqtl-p
--expression-file [path/to/expression/file]
--sum-file [/path/to/summary/file]
--filter-maf-le
--saved-ref [previous/output/path]
--out [output/path/prefix]
--nt
--chrom-col
--pos-col
--p-col
--only-hgnc-gene
--p-value-cutoff
--multiple-testing
--regions-out
See analysis examples at: Conditional gene-based association analysis based on the improved ECS and multiple variant-gene mapping strategies
Explanations and Optional options
--nt
: Specify the number of CPU cores used for the analysis.--gene-finemapping
: No parameters required. The main function option.--multiple-testing
: Specify the multiple testing methods to select significant genes for the conditional analysis. There are three settings, i.e., bonf: Standard Bonferroni correction, benfdr: Benjamini-Hochberg method to control the false discovery rate, fixed: Filtering by a fixed p-value cutoff.--p-value-cutoff
: Specify the family-wise cutoff for the multiple testing.--only-hgnc-gene
: No parameters required. If used, KGGSEE only considers the genes with HGNC gene symbols.--expression-file
: Specify the path of the preproceeded gene expression file. The index column of the preprocessed expression file was gene/isoform symbol name, and each of 50 tissues or cell types had two columns: one representing averaged expression value (i.e., mean) of all samples and the other representing the standard error of the mean (SE).--filter-eqtl-p
: Specify the filter to select the significant gene/isoform-level eQTLs to enter the following gene-based association analysis.--sum-file
: Specify the full path of the GWAS summary statistics. Three columns of the GWAS summary statistic file, i.e., chromosome, physical position and p-value are minimally required. The default column names are CHR, BP and P, respectively. Users can also specify these names by using--chrom-col
,--pos-col
and--p-col
, respectively.--filter-maf-le
: Specify the filter used to select the variants with MAF > the cutoff to enter the following gene-based association analysis.--regions-out
: Location section, such as chr6:27477797-34448354. Specify the variants in the specified regions to be excluded in the following gene-based association analysis.--eqtl-file
: Specify the full path of gene-level and isoform-level eQTL file. The format of the eQTL file is similar to the fasta file. The first row starting with "#" is the column name. Then the eQTL data of a gene or transcript starts with the symbol “>”, and the following are the gene symbol, Ensembl gene/transcript ID and chromosome name, which are delimited by tab characters. The subsequent rows contain the summary statistics of the eQTL-gene/isoform association. The tab-delimited columns are physical position, reference allele, alternative allele, frequency of alternative allele, estimated effect size, standard error of the estimation, p-value, effective sample sizes and determination coefficient in linear regression, respectively. In the regression, the number of alternative alleles is used as an independent variable. Based on KGGSEE, we have pre-calculated the eQTL data using the GTEx data(v8). Variants within 1MB upstream and downstream of a gene or a transcript boundary are included. The commands to compute eQTLs can be seen in Compute the gene-level and isoform-level eQTLs of each tissue.An example of gene-level eQTLs file is as follows:
#symbol id chr pos ref alt altfreq beta se p neff r2 >WASH7P ENSG00000227232 1 52238 T G 0.942 -1.771 0.285 5.16E-10 65 0.38 74681 G T 0.95 -1.457 0.333 1.19E-5 63 0.239 92638 A T 0.241 0.547 0.206 7.93E-3 53 0.121 >MIR1302-10 ENSG00000284557 1 52238 T G 0.942 -1.771 0.285 5.16E-10 65 0.38 74681 G T 0.95 -1.457 0.333 1.19E-5 63 0.239 … … … … … … … … …
An example of isoform-level eQTLs file is as follows:
#symbol id chr pos ref alt altfreq beta se p neff r2 >DDX11L1 ENST000456328 1 13418 G A 0.161 -0.03 0.013 0.027 62 0.076 19391 G A 0.11 0.065 0.027 0.017 63 0.085 107970 G A 0.285 -0.024 0.01 0.018 86 0.063 >MIR6859 ENST0000612080 1 13418 G A 0.161 -0.03 0.013 0.027 62 0.076 19391 G A 0.11 0.065 0.027 0.017 63 0.085 62578 G A 0.081 0.062 0.024 7.98E-03 67 0.098 99334 A G 0.088 0.071 0.035 0.043 56 0.07 … … … … … … … … …
--out
: Specify the path and prefix name of the output files.For eDESE, the six output files are as follows:
The first output file is the conditional gene-based analysis results, named *.finemapping.gene.ecs.txt or *.finemapping.gene.ecs.xls (We get the susceptible genes based on this file).
Gene
Chrom
StartPos
EndPos
#Var
Group
ECSP
CondiECSP
GeneScore
PLCH2
1
2394074
2441392
90
0
3.32E-07
3.32E-07
1749.814941
LINC01672
1
6779717
6794311
28
0
1.09E-07
1.09E-07
1459.64502
RERE-AS1
1
8481016
8498680
22
0
1.60E-07
1
276.8022156
GMEB1
1
28990270
29050710
61
1
1.21E-07
1
190.6436615
OPRD1
1
29136686
29202162
165
1
4.88E-08
4.88E-08
1776.521118
LINC01648
1
30481836
30513725
84
1
4.24E-07
8.76E-07
1310.996948
C1orf216
1
36174770
36186112
12
2
9.41E-07
9.41E-07
1644.873169
AGO3
1
36398368
36540415
67
2
1.02E-06
0.019723344
411.5656128
MAP7D1
1
36616941
36650961
23
2
4.86E-07
4.86E-07
1217.979736
MPL
1
43800001
43822665
17
3
5.19E-07
1
1179.628052
Gene: gene name;
Chrom: chromosome position of the gene;
StartPos: gene start position (refGene hg19);
EndPos: gene end postion (refGene hg19);
#Var: the number of variants assigned to the gene according to different strategies (physically nearby variants for eDESE:dist, gene-level eQTLs (also are variants) for eDESE:gene, isoform-level eQTLs (also are variants) for eDESE:isoform);
Group: the identifier of LD block to which the gene belongs;
ECSP: the p-value of effective chi-square test (without conditioning on gene expression profiles);
CondiECSP: the p-value of the gene by performing the conditional effective chi-square test;
GeneScore: the tissue-selective score of the gene by the end of the iterative procedure;
The second output file is the gene-based association result file ("gene-top variant" ECS result), named *.gene.pvalue.txt or *.gene.pvalue.xls.
Gene
#Var
ECSP
Chrom
Pos
GWAS_Var_P
RNU6-1199P
1
0.9428
1
693731
0.9428
FAM87B
16
0.90109217
1
752566
0.7723
LINC01128
48
0.842458397
1
776546
0.3348
LINC00115
11
0.926980956
1
760912
0.7847
FAM41C
19
0.91792081
1
810286
0.3687
TUBB8P11
8
0.816456433
1
810286
0.3687
LINC02593
38
0.110157913
1
852133
0.01991
SAMD11
30
0.217041605
1
879676
0.0197
NOC2L
26
0.397219233
1
891021
0.0129
KLHL17
21
0.17778506
1
891021
0.0129
Gene: gene name;
#Var: the number of variants assigned to the gene according to different strategies (physically nearby SNPs for eDESE:dist, gene-level eQTLs (also are variants) for eDESE:gene, isoform-level eQTLs (also are variants) for eDESE:isoform);
ECSP: the p-value of effective chi-square test (without conditioning on gene expression profiles);
Chrom: chromosome position of the gene;
Pos: the position of top-variant belonging to the gene;
GWAS_Var_P: the p-value of the top-variant in GWAS summary statistics.
Especially, additional three columns are appended to the *.gene.pvalue.txt or *.gene.pvalue.xls generated by eDESE:gene and eDESE:isoform (see example below).
Gene
#Var
ECSP
Chrom
Pos
GWAS_Var_P
eQTL_P
eQTL_Beta
eQTL_SE
CDK11A
46
0.291482878
1
2254910
0.009276
0.00335
0.419
0.143
TMEM240
22
0.025449014
1
1885055
0.003213
0.004
0.467
0.162
FNDC10
22
1.56E-04
1
2373089
4.85E-07
0.00152
0.412
0.13
UBE2J2
52
0.061003298
1
2114271
0.001515
8.32E-04
-0.878
0.263
C1orf159
164
0.013052155
1
1882949
4.82E-05
0.00461
0.486
0.172
CCNL2
7
0.1832933
1
1373335
0.05406
0.00548
-0.376
0.135
MIB2
95
0.001354836
1
1882949
4.82E-05
0.00419
0.481
0.168
WASH7P
1
0.3348
1
776546
0.3348
0.00603
0.39
0.142
LINC01786
20
0.222666606
1
1373335
0.05406
3.54E-04
-0.509
0.143
PUSL1
30
0.241421866
1
2122723
0.003085
0.005
-0.767
0.273
eQTL_P: the variant-gene expression association p-value of the top-variant with the gene;
eQTL_Beta: the variant-gene expression association beta value of the top-variant with the gene;
eQTL_SE: the standard error of the mean of beta value.
The third output file is the p-value of all variants belonging to a gene (raw ECS results), named *.gene.var.pvalue.txt.gz. Their file formats are the same as above. The meaning of the column names in *.gene.var.pvalue.txt.gz is the same as that in *.gene.pvalue.txt.
Gene
#Var
ECSP
Chrom
Pos
GWAS_Var_P
RNU6-1199P
1
0.9428
1
693731
0.9428
FAM87B
16
0.90109217
1
752566
0.7723
FAM87B
16
0.90109217
1
754964
0.7767
FAM87B
16
0.90109217
1
752721
0.7878
FAM87B
16
0.90109217
1
754503
0.807
FAM87B
16
0.90109217
1
758626
0.8899
FAM87B
16
0.90109217
1
758144
0.9102
FAM87B
16
0.90109217
1
757936
0.9151
FAM87B
16
0.90109217
1
757734
0.9156
FAM87B
16
0.90109217
1
756604
0.916
The fourth output file is the significance of phenotype-associated tissues, named *.celltype.txt.
TissueName
RobustRegressionZ
AveragedLog(p)
Brain-FrontalCortexBA9
1.30E-17
16.90010754
Brain-AnteriorcingulatecortexBA24
2.40E-17
16.61350308
Brain-Cortex
1.60E-16
15.79927424
Brain-Hippocampus
1.90E-13
12.72137284
Brain-Caudatebasalganglia
5.80E-13
12.23746327
Brain-Putamenbasalganglia
6.50E-13
12.18921402
Brain-Amygdala
1.40E-12
11.85870189
Brain-Nucleusaccumbensbasalganglia
5.60E-12
11.24812294
Brain-Substantianigra
2.40E-10
9.614009191
Brain-Hypothalamus
4.30E-10
9.370307784
TissueName: tissue names;
RobustRegressionZ: The p-value generated by the Wilcoxon rank-sum test based on the robust-regression z-score of the potential susceptibility and non-susceptibility genes.
AveragedLog(p): the negative log10 of the p-values generated by averaging the p-values of four selective-expression measures (robust-regression z-score, conventional z-score, MAD robust z-score, and ratio of vector-scalar projection).
The fifth output file is a Q-Q plot, named *.qq.png, representing the p-value of the ECS test based on gene, variants insides the gene and variants outside the gene, respectively.
The sixth output file is the log file in which the detailed parameter settings and computation procedures can be found.
4.4 Infer the causal genes based on GWAS summary statistics and eQTLs by Mendelian randomization analysis framework for causal gene estimation(EMIC)
One can perform multiple IVs-based MR analyses to infer the causal genes or transcripts by using the integrative framework named EMIC. EMIC adopted two multiple IVs-based MR methods for causality test and casual effect estimation of a gene’s expression to a phenotype, median-based MR and ML-based MR. EMIC needs two major inputs, GWAS and eQTL summary statistics, respectively. The GWAS summary statistics refer to the logarithm of odds ratio or regression coefficients and the corresponding standard errors(SEs) from a large-scale GWAS study, indicating the association between IVs and a phenotype. The eQTL summary statistics are similar to that of the GWAS, indicating the association between IVs and the expression of genes or transcripts in a tissue or cell type. EMIC has integrated the pre-calculated cis-eQTLs in 50 tissues or cell-types with gene-level and isoform-level expression from GTEx(version 8).
Required options
--emic
--eqtl-file [path/to/eQTL/file of genes or transcripts]
--sum-file [/path/to/summary/file]
--beta-type [0/1/2]
--saved-ref [previous/output/path]
--out [output/path/prefix]
See an analysis example at: Gene-based causality analysis
Explanations and Optional options
--emic
: No parameters required. The main function option.--eqtl-file
: See above description.--sum-file
: See above description.--beta-type
: The altenative values are 0, 1 or 2. Specially, 0 represents the coefficients of linear regression for a quantitative phenotype; 1 represents the coefficients of logistic regression or the logarithms of odds ratio for a qualitative phenotype; 2 represents the odds ratio for a qualitative phenotype.--saved-ref
: See above description.--out
: Specify the path and prefix name of the output files. The main output file is the Mendelian randomization analysis results for causal gene estimation, named *.mr.gene.txt or *. gene.mr.gene.xls. The followingGene
#Var
minP_MedianMR
DetailsMedianMR
minP_MLMR
DetailsMLMR
Chrom
Pos
GWAS_Var_P
GWAS_Var_Beta
GWAS_Var_SE
eQTL_P
eQTL_Beta
eQTL_SE
AMPD1
3
.
ENST00000639274[2,NaN,NaN,NaN]
9.38E-06
ENST00000639274[2,-0.07947323864224412,0.017935971853054997,9.382262412244995E-6]
1
115309590
8.83E-08
1.057384
0.0104
4.65E-05
-0.613
0.151
RN7SL432P
3
.
ENST00000484057[2,NaN,NaN,NaN]
9.38E-06
ENST00000484057[2,-0.07947323864224412,0.017935971853054997,9.382262412244995E-6]
1
115309590
8.83E-08
1.057384
0.0104
4.65E-05
-0.613
0.151
…
…
…
…
…
…
…
…
…
…
…
…
…
…
Columns in the output file are gene symbol, the number of variants in the gene, p-values of causality tests by Median-based MR, detailed causality estimation by Median-based MR, p-values of causality tests by maximal likelihood-based MR, detailed causality estimation by maximal likelihood-based MR, chromosome, top GWAS variant position, p-value, beta and SE of the top GWAS variant, p-value, beta and SE of the top GWAS variant as an eQTL. When a gene has multiple transcripts, the detailed MR results will show MR analysis of all transcripts. Each MR analysis result has four components, the number IVs for the estimation, the estimated causal effect, the standard error of the estimation, and the p-values.
4.5 Compute the gene/isoform-level eQTLs of each tissue
Purpose: compute the gene/isoform-level eQTLs based on the gene-level expression and isoform-level expression profiles of the target tissue.
Input data:
Genotypes in KGGSEE objects (generated in Gene-based association analysis). Here genotypes in GTEx v8 were used as example input. When computing the gene/isoform-level eQTLs of certain tissue, only subjects simultaneously containing the genotype data and expression data were used;
Gene expression data of certain tissues corresponding to genotype data from the same subjects;
Subject information, such as subject ID and subject-related covariates.
java -Xmx10g \ -jar kggsee.jar \ --nt 10 \ --calc-eqtl \ --expression-gty-vcf path/to/vcf/file/of/subjects/with/expression \ --gene-expression resources/Adipose-Subcutaneous.expression.subjectid.gene.fmt.gz \ --expression-subjects path/to/subjectID/covariates.txt \ --filter-eqtl-p 0.01 \ --hwe-all 0.001 \ --filter-maf-le 0.05 \ --neargene 1000000 \ --out /path/Adipose-Subcutaneous.gene.maf05.p01 \
Details of the options can be seen in Options Index.
5 Options Index
5.1 Inputs/outputs
Option
Description
Examples with default value
--a1-col
column name of alternative or minor allele
--a1-col A1
--a2-col
column name of reference or major allele
--a2-col A2
--beta-col
column name of coefficients
--beta-col OR
--beta-type
whether the coefficients are odds ratio. 0 means coefficients of linear regression for a quantitative phenotype beta; 1 means coefficients of logistic regression or the logarithms of odds ratio for a qualitative phenotype; 2 means the odds ratio for a qualitative phenotype
--beta-type 0
--chrom-col
column name of chromosome ID
--chrom-col CHR
--eqtl-file
path of eQTL file
--eqtl-file /path/to/eqtl_file
--excel
Output results in Excel format
--expression-gty-vcf
path to vcf file of subjects with expression
--expression-gty-vcf /path/to/vcf_file
--gene-expression
path to file with gene expression
--gene-expression /path/to/gene_expression_file
--freq-a1-col
column name of frequencies of A1 allele
--freq-a1-col FRQ_U
--keep-ref
save the encoded genotypes in VCF for future usage
--nmiss-col
column name of sample size with non-missing genotypes
--nmiss-col Neff
--out
output path and prefix name
--out ./test/test1
--pos-col
column name of positions
--pos-col BP
--p-col
column name of p-values
--p-col P
--saved-ref
the path of encoded genotypes by last time
--saved-ref 、test/test1
--se-col
column name of standard errors of the coefficients
--se-col SE
--sum-file
path of the file storing GWAS summary statistics
--sum-ref /path/to/gwas_summary_file
--vcf-ref
the path of vcf file and name
--vcf-ref /path/to/vcf_file/hg19.chr*.vcf.gz
5.2 Quality control
Option
Description
Examples with default value
--filter-eqtl-p
filter eQTLs with p-value > the cutoff
--filter-eqtl-p 1E-2
--filter-maf-le
filer variants with MAF ≤ the cutoff
--filter-maf-le 0.05
--hwe-all
filter variants with HWD test p-value < the cutoff
--filter-eqtl-p 1E-5
--only-hgnc-gene
only consider genes with HGNC gene symbols
--neargene
set the extension window size of a gene to cover more potential regulatory variants.
--neargene 5000
--regions-out
exclude variants in the specified regions
5.3 Functions
Option
Description
Examples with default value
--gene-assoc
Perform gene-based association analysis
--emic
Perform Mendelian randomization analysis for causal gene estimation
--calcu-selectivity
An inner function for DESE and eDESE to calculate selective expression of gene or transcripts in a tissue or a cell type. Default parameter is "true".
--calcu-selectivity true
--ld-pruning-mr
Prune a variant with genotypic correlation to another variant> the cutoff for causal gene estimation by MR
--ld-pruning-mr 0.5
--multiple-testing [bonf/benfdr/ fixed]
The multiple testing method used to select significant genes for the conditional gene-based analysis. There are three settings. bonf: Standard Bonferroni correction given a family-wise error rate specified by --p-value-cutoff. benfdr: Benjamini-Hochberg method to control the false discovery rate. fixed: Filtering by a fixed p-value cutoff.
--multiple-testing bonf
--p-value-cutoff
The family-wise cutoff for the multiple testing.
--p-value-cutoff 0.05
--top-gene
Only select the N top significant genes for subsequent gene-based conditional analysis by ECS
--top-gene N
5.4 Utilities
Option
Description
Examples with default value
--nt
CPU cores for the analysis
--nt 12
--buildver
reference genome version of the coordinates. The supported versions are hg19 and hg38.
--buildver hg19
--resource
the path and folder of kggsee.