coR-ge by Chris1221

Quick-start Guide.

Welcome to the landing page for coR-ge (corection of genomes in R). This software is in active development, and pull requests are welcome on any of the branches. Please find below instructions for use cases. For more complex use, please contact the maintainer by raising an issue on the project repository.

This project was presented at the Compute Canada's High Powered Computer Symposium in June 2015. The abstract and poster PDF are also found in the repository.

Program Structure

Heuristically, coR-ge is structured as follows.

Logic Flow

In reality, these three steps are divided into different files in the repository. The quick start guide will nonetheless follow this logic flow in order to simplify matters.

Requirements

coR-ge requires the following:

This program requires a cluster system set up on either SGE or PBS architecture, running qsub. It can be easily modified to use bsub as well with minimal effort.
Any base unix system to run on. The program has been tested on CentOS and OSX. Dependency issues should be reported as an issue on the main repository.
R version > 3.0.0 able to run dplyr and data.frame
Rscript in $PATH or located at /usr/bin/Rscript
Python v. 2.7.*
A temporary storage drive with n*560G free space, where n is the number of runs you would like to perform simultaneously.
156G RAM recommended. Be aware that improper allocation of jobs can easily cause swapping of nodes on the cluster. We have included a delayed submission script to ameliorate this problem.
Latest build of 1000 Genomes in IMPUTE2 format (*.legend, *.haps, *.sample) in any directory that has sufficient storage room. Name this directory; it will hereafter be named $REF

Installation

Clone the repository to your working directory (hereafter known as $WD):

$ cd $WD
$ git clone https://github.com/Chris1221/sFDR
$ git checkout master
$ cd sFDR/

Ensure that you are on the master branch for now.

Several files will have to be edited, and these will be addressed in turn.

Disease Model

The file rand.R contains the information for the disease model. We will edit a few things here to specify individual parameters for the run. At the top of the script, you must edit the following parameters. Defaults are displayed, and must be edited:

heritability <- 0.45
noise_var <- 0.55
sim_file_path <- getwd()
wd <- getwd()
snp_summary <- "snp_summary.out"

To filter based on minor allele frequency, you must generate a file named snp_summary.out (or whatever you set the file name as in rand.R) using SNPTEST2. See here for an example of this.

In this script, you can manually or randomly generate causal loci (~ line 160). If you would like to manually assign variants, choose your SNPs and assign them a column named "causal" equal to one, and pass only causal SNPs to the snps data frame. snps must contain four columns, an example is given below:

rsid chromosomes bp all_maf
rs_1 1           1  0.30
rs_2 1           2  0.35
rs_3 1           3  0.05

If the selection of the SNPs is based on some parameter, simply subset the summary object to select only causal loci. In the following simple example, we select 50 random SNPs with minor allele frequency between 0.05 and 0.5 using the dplyr library.

library(dplyr)
summary %>%
    filter(all_maf > 0.05) %>% 
        filter(all_maf < 0.5) %>% 
            sample_n(50) %>% 
                select(rsid, chromosomes, bp, all_maf) -> snps

In a more complex situation, one may wish to select a group of genes, selecting one causal variant from each. In the following example, we import a table of 10 genes with start and end positions, and randomly select one causal variant from each of the genes. Additional information is retained so that the start and end positions can be used to decide a priority strata for sFDR.

#Read in list of genes.  First column is gene name, second is physical start position in bp, third is end position in bp.
gene_list <- read.table("gene_list.txt", h = T, sep = "\t")

#select one random SNP from each gene
for(i in 1:nrow(gene_list)){

 name <- paste0("gene_", i)

 #grab only SNPs within gene boundaries 
 summary %>% filter(position > gene_list[[i,2]]) %>% filter(position < gene_list[[i,3]]) %>% select(rsid, chromosome,     position, all_maf) -> gene
 gene$select <- 1
 gene$causal <- 0

 #grab random snp
 gene[ceiling(runif(1,1,nrow(gene))),"causal"] <- 1
 assign(name, gene)
}

genes <- rbind(gene_1, gene_2, gene_3, gene_4, gene_5, gene_6, gene_7, gene_8, gene_9, gene_10)

#or
#genes <- rbind(ls()[grepl("gene_[0-9]", ls())])

genes %>% filter(causal == 1) -> snps

#clear extra columns
snps$causal <- NULL
snps$select <- NULL

The previous examples illustrate the concept behind SNP selection, but any selection can be made as long as the correct columns are passed to snps prior to the renaming step.

Resampling and Model Propagation

In i_array.sh, j_array.sh, and k_array.sh, set the $DIR variable with the directory where you are storing the reference genotypes. Set the $OUT variable for where you would like to house the output. Edit the directives to fit your cluster set-up.

Start the generation of resampled genotypes with hapgen2. For 30 trials (approximately 1.7 TB of free space required):

for i in {1..3}
do
  for j in {1..10}
  do
    bash i_array $i $j
  done
done

or bash sub_script.sh $i_limit $j_limit for ease of use.

This will begin the submission of many small jobs. If any jobs fail on your system, please post the full log file in an issue.

Analysis and Comparison of MTC Methodologies.

In correct_and_report.R, rename the path variable to the location of your simulated genotypes. Again specify the location of the snp_summary.out file in the summary object. This script inherits properties from correct.R, and will run as long as the snps object was passed correctly.

Reported by default are:

Aggregated FDR True Positive
Aggregated FDR False Positive
Aggregated FDR True Negative
Aggregated FDR False Negative
Bonferroni True Positive
Bonferroni False Positive
Bonferroni True Negative
Bonferroni False Negative
Stratified FDR True Positive
Stratified FDR False Positive
Stratified FDR True Negative
Stratified FDR False Negative

In that order, written to a file named results.txt. The strata for sFDR are by default randomly selected in a stepwise manner, but this behaviour can be easily modified.

After all of the k_array.sh jobs have finished, run the clean_comb.sh script in the same manner as i_array.sh with input arguments $i and $j.

Example use cases

coR-ge has been used successfully to demonstrate an increase in true positives and a decrease in false negatives when tests are stratified with prior information.

We present the following as a case study. Figures a and b display the differential true positive and false positive rates respectively between stratified FDR and FDR correction grouped by number of “nonsense” variants in priority strata. Displayed are the differences between the two methodologies with kernel density estimate. 50 causal loci simulated on 1000 Genomes CEU reference panel with 0.45 disease heritability assumed on chromosome 1 for 100 permutations.

Improvement of true discovery rate between sFDR and FDR correction methodologies.

Decrease in false positive detection between sFDR and FDR correction methodologies.

Continuing the case study, differential effects of causal variant minor allele frequencies are examined in stratified FDR and reported grouped by number of “nonsense” variants in the priority strata. Five causal loci with effect sizes following a uniform distribution were simulated in each of the MAF groups in order to further elucidate the heuristic qualities of stratified FDR correction in genomic environments. Five causal variants simulated in each category with equal effect sizes (Type II ANOVA P = 0.2026). Simulated on 1000 Genomes CEU population chromosome 1 using a disease model with 0.45 heritability for 100 permutations.

MAF

coR-ge can be used for any methodology, for any grouping of SNPs, and any grouping of reporting. If you have used coR-ge in any publication, kindly cite this page. Code for the previously mentioned examples can be found in the master and MAF branches respectively.

Development Goals

More user-friendly interface
Better documentation with comprehensive guide
Compiled binary with .config file

Help / Guidance

Please post an issue on the repository, or email the maintainer with the email mentioned on his profile. We welcome assistance from individuals wishing to contribute to the development of coR-ge. Any questions will be answered quickly and as completely as possible.

Acknowledgements

The maintainer would like to thank Joanne Knight of CAMH, who has provided guidance throughout the development of this project. He would also like to thank the statistical genetics group at CAMH for guidance and assistance during development.

We would like to thank the R core team for the creation of an environment in which such projects are possible, as well as specifically Hadley Wickham for his tidyr-dplyr-ggplot/ggvis pipeline, which has proved invaluable.

We acknowledge funding from The Institute of Medical Science at the University of Toronto, NSERC, the Centre for Addiction and Mental Health (CAMH) and the University of Ottawa.

References

Sun, L., Craiu, R. V, Paterson, A. D., & Bull, S. B. (2006). Stratified false discovery control for large-scale hypothesis testing with application to genome-wide association studies. Genetic Epidemiology, 30(6), 519–30.
Su, Z., Marchini, J., & Donnelly, P. (2011). HAPGEN2: simulation of multiple disease SNPs. Bioinformatics, 27(16), 2304–2305.
Tune H. Pers, Juha M. Karjalainen, Yingleong Chan, Harm-Jan Westra, Andrew R. Wood, Jian Yang, Julian C. Lui, Sailaja Svedantam, Stefan Gustafsson, Tonu Esko, Tim Frayling, Elizabeth K. Speliotes, Genetic Investigation of ANthropometric Traits (GIANT) Con, L. F. (2015). Biological interpretation of genome-wide association studies using predicted gene functions.