Software for the Examination of Multiple Correction Methodologies in Accurate Genomic Environments
Welcome to the landing page for coR-ge (corection of genomes in R). This software is in active development, and pull requests are welcome on any of the branches. Please find below instructions for use cases. For more complex use, please contact the maintainer by raising an issue on the project repository.
This project was presented at the Compute Canada's High Powered Computer Symposium in June 2015. The abstract and poster PDF are also found in the repository.
Heuristically, coR-ge is structured as follows.
In reality, these three steps are divided into different files in the repository. The quick start guide will nonetheless follow this logic flow in order to simplify matters.
coR-ge requires the following:
SGE
or PBS
architecture, running qsub
. It can be easily modified to use bsub
as well with minimal effort.CentOS
and OSX
. Dependency issues should be reported as an issue on the main repository. R
version > 3.0.0 able to run dplyr
and data.frame
Rscript
in $PATH
or located at /usr/bin/Rscript
Python
v. 2.7.**.legend
, *.haps
, *.sample
) in any directory that has sufficient storage room. Name this directory; it will hereafter be named $REF
Clone the repository to your working directory (hereafter known as $WD
):
$ cd $WD
$ git clone https://github.com/Chris1221/sFDR
$ git checkout master
$ cd sFDR/
Ensure that you are on the master branch for now.
Several files will have to be edited, and these will be addressed in turn.
The file rand.R
contains the information for the disease model. We will edit a few things here to specify individual parameters for the run. At the top of the script, you must edit the following parameters. Defaults are displayed, and must be edited:
heritability <- 0.45
noise_var <- 0.55
sim_file_path <- getwd()
wd <- getwd()
snp_summary <- "snp_summary.out"
To filter based on minor allele frequency, you must generate a file named snp_summary.out
(or whatever you set the file name as in rand.R
) using SNPTEST2. See here for an example of this.
In this script, you can manually or randomly generate causal loci (~ line 160). If you would like to manually assign variants, choose your SNPs and assign them a column named "causal" equal to one, and pass only causal SNPs to the snps
data frame. snps
must contain four columns, an example is given below:
rsid chromosomes bp all_maf
rs_1 1 1 0.30
rs_2 1 2 0.35
rs_3 1 3 0.05
If the selection of the SNPs is based on some parameter, simply subset the summary
object to select only causal loci. In the following simple example, we select 50 random SNPs with minor allele frequency between 0.05 and 0.5 using the dplyr
library.
library(dplyr)
summary %>%
filter(all_maf > 0.05) %>%
filter(all_maf < 0.5) %>%
sample_n(50) %>%
select(rsid, chromosomes, bp, all_maf) -> snps
In a more complex situation, one may wish to select a group of genes, selecting one causal variant from each. In the following example, we import a table of 10 genes with start and end positions, and randomly select one causal variant from each of the genes. Additional information is retained so that the start and end positions can be used to decide a priority strata for sFDR.
#Read in list of genes. First column is gene name, second is physical start position in bp, third is end position in bp.
gene_list <- read.table("gene_list.txt", h = T, sep = "\t")
#select one random SNP from each gene
for(i in 1:nrow(gene_list)){
name <- paste0("gene_", i)
#grab only SNPs within gene boundaries
summary %>% filter(position > gene_list[[i,2]]) %>% filter(position < gene_list[[i,3]]) %>% select(rsid, chromosome, position, all_maf) -> gene
gene$select <- 1
gene$causal <- 0
#grab random snp
gene[ceiling(runif(1,1,nrow(gene))),"causal"] <- 1
assign(name, gene)
}
genes <- rbind(gene_1, gene_2, gene_3, gene_4, gene_5, gene_6, gene_7, gene_8, gene_9, gene_10)
#or
#genes <- rbind(ls()[grepl("gene_[0-9]", ls())])
genes %>% filter(causal == 1) -> snps
#clear extra columns
snps$causal <- NULL
snps$select <- NULL
The previous examples illustrate the concept behind SNP selection, but any selection can be made as long as the correct columns are passed to snps
prior to the renaming step.
In i_array.sh
, j_array.sh
, and k_array.sh
, set the $DIR
variable with the directory where you are storing the reference genotypes. Set the $OUT
variable for where you would like to house the output. Edit the directives to fit your cluster set-up.
Start the generation of resampled genotypes with hapgen2
. For 30 trials (approximately 1.7 TB of free space required):
for i in {1..3}
do
for j in {1..10}
do
bash i_array $i $j
done
done
or bash sub_script.sh $i_limit $j_limit for ease of use.
This will begin the submission of many small jobs. If any jobs fail on your system, please post the full log file in an issue.
In correct_and_report.R
, rename the path
variable to the location of your simulated genotypes. Again specify the location of the snp_summary.out
file in the summary
object. This script inherits properties from correct.R
, and will run as long as the snps
object was passed correctly.
Reported by default are:
In that order, written to a file named results.txt
. The strata for sFDR are by default randomly selected in a stepwise manner, but this behaviour can be easily modified.
After all of the k_array.sh
jobs have finished, run the clean_comb.sh
script in the same manner as i_array.sh
with input arguments $i and $j.
coR-ge has been used successfully to demonstrate an increase in true positives and a decrease in false negatives when tests are stratified with prior information.
We present the following as a case study. Figures a and b display the differential true positive and false positive rates respectively between stratified FDR and FDR correction grouped by number of “nonsense” variants in priority strata. Displayed are the differences between the two methodologies with kernel density estimate. 50 causal loci simulated on 1000 Genomes CEU reference panel with 0.45 disease heritability assumed on chromosome 1 for 100 permutations.
Improvement of true discovery rate between sFDR and FDR correction methodologies.
Decrease in false positive detection between sFDR and FDR correction methodologies.
Continuing the case study, differential effects of causal variant minor allele frequencies are examined in stratified FDR and reported grouped by number of “nonsense” variants in the priority strata. Five causal loci with effect sizes following a uniform distribution were simulated in each of the MAF groups in order to further elucidate the heuristic qualities of stratified FDR correction in genomic environments. Five causal variants simulated in each category with equal effect sizes (Type II ANOVA P = 0.2026). Simulated on 1000 Genomes CEU population chromosome 1 using a disease model with 0.45 heritability for 100 permutations.
coR-ge can be used for any methodology, for any grouping of SNPs, and any grouping of reporting. If you have used coR-ge in any publication, kindly cite this page. Code for the previously mentioned examples can be found in the master
and MAF
branches respectively.
.config
filePlease post an issue on the repository, or email the maintainer with the email mentioned on his profile. We welcome assistance from individuals wishing to contribute to the development of coR-ge. Any questions will be answered quickly and as completely as possible.
The maintainer would like to thank Joanne Knight of CAMH, who has provided guidance throughout the development of this project. He would also like to thank the statistical genetics group at CAMH for guidance and assistance during development.
We would like to thank the R core team for the creation of an environment in which such projects are possible, as well as specifically Hadley Wickham for his tidyr
-dplyr
-ggplot
/ggvis
pipeline, which has proved invaluable.
We acknowledge funding from The Institute of Medical Science at the University of Toronto, NSERC, the Centre for Addiction and Mental Health (CAMH) and the University of Ottawa.
Sun, L., Craiu, R. V, Paterson, A. D., & Bull, S. B. (2006). Stratified false discovery control for large-scale hypothesis testing with application to genome-wide association studies. Genetic Epidemiology, 30(6), 519–30.
Su, Z., Marchini, J., & Donnelly, P. (2011). HAPGEN2: simulation of multiple disease SNPs. Bioinformatics, 27(16), 2304–2305.
Tune H. Pers, Juha M. Karjalainen, Yingleong Chan, Harm-Jan Westra, Andrew R. Wood, Jian Yang, Julian C. Lui, Sailaja Svedantam, Stefan Gustafsson, Tonu Esko, Tim Frayling, Elizabeth K. Speliotes, Genetic Investigation of ANthropometric Traits (GIANT) Con, L. F. (2015). Biological interpretation of genome-wide association studies using predicted gene functions.