LanceOTron Command Line Interface
A bare-bones interface to the trained LanceOTron (LoT) model from the command line.
LoT is an all-in-one peak caller that identifies peak regions from a coverage track and uses a convolutional neural network to classify them based on their shape. This algorithm has a web client available at https://lanceotron.molbiol.ox.ac.uk/ where users can upload coverage tracks, call peaks, and visualize them using multi locus view, a powerful visualization engine. This web client will do most of the heavy lifting for most of the people that want to use the tool, but for those who need to call peaks in batch mode we provide a command line interface in this package. This document details the installation of LoT as well as typical use cases. See the left side of this page for tutorials on the three main modules.
Installation
- Clone the repository.
- Install dependencies with pip.
- Install the package.
- Run tests to ensure that everything is working.
git clone git@github.com:Chris1221/lanceotron.git; cd lanceotron # Step 1
pip install -r requirements.txt # Step 2
pip install -e . # Step 3
python -m unittest
Usage
To see available commands, use the --help
flag.
lanceotron --help
Call Peaks
To call peaks from a bigWig track, use the callPeaks
command.
Call peaks from a coverage track and score them using the LanceOTron model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file |
str |
Path to bigwig track. |
required |
threshold |
int |
Initial threshold used for selecting candidate peaks. Defaults to 4. |
4 |
window |
int |
Window size for rolling mean to use for selecting candidate peaks. Defaults to 400. |
400 |
folder |
str |
Output folder. Defaults to "./". |
'./' |
skipheader |
bool |
Skip writing header. Defaults to False. |
False |
Source code in lanceotron/modules.py
def find_and_score_peaks(
file: str,
threshold: int = 4,
window: int = 400,
folder: str = "./",
skipheader: bool = False) -> None:
"""Call peaks from a coverage track and score them using the LanceOTron model.
Args:
file (str): Path to bigwig track.
threshold (int, optional): Initial threshold used for selecting candidate peaks. Defaults to 4.
window (int, optional): Window size for rolling mean to use for selecting candidate peaks. Defaults to 400.
folder (str, optional): Output folder. Defaults to "./".
skipheader (bool, optional): Skip writing header. Defaults to False.
"""
bigwig_file = file
out_folder = make_directory_name(folder)
initial_threshold = threshold
min_peak_width = 50
max_peak_width = 2000
read_coverage_factor = 10**9
out_file_name = bigwig_file.split('/')[-1].split('.')[0]+'_L-tron.bed'
pyBigWig_object = pyBigWig.open(bigwig_file)
read_coverage_total = pyBigWig_object.header()['sumData']
read_coverage_rphm = read_coverage_total/read_coverage_factor
pyBigWig_object.close()
bigwig_data = Ltron.Bigwig_data(bigwig_file)
genome_stats_dict = bigwig_data.get_genome_info()
bed_file_out = []
for chrom in genome_stats_dict:
print(chrom)
coverage_array_smooth = bigwig_data.make_chrom_coverage_map(genome_stats_dict[chrom], smoothing=window)
enriched_region_coord_list = Ltron.label_enriched_regions_dynamic_threshold_width(coverage_array_smooth, genome_stats_dict[chrom]['chrom_mean']*initial_threshold, genome_stats_dict[chrom]['chrom_mean'], max_peak_width, min_region_size=min_peak_width)
chrom_file_out = []
if enriched_region_coord_list:
wide_path = pkg_resources.resource_filename('lanceotron.static', 'standard_scaler_wide_v5_03.p')
deep_path = pkg_resources.resource_filename('lanceotron.static', 'standard_scaler_deep_v5_03.p')
coverage_array = bigwig_data.make_chrom_coverage_map(genome_stats_dict[chrom])/read_coverage_rphm
X_wide_array, X_deep_array = Ltron.extract_signal_wide_and_deep_chrom(coverage_array, enriched_region_coord_list, read_coverage_rphm)
standard_scaler_wide = pickle.load(open(wide_path, 'rb'))
X_wide_array_norm = standard_scaler_wide.transform(X_wide_array)
X_wide_array_norm = np.expand_dims(X_wide_array_norm, axis=2)
standard_scaler = StandardScaler()
X_deep_array_norm_T = standard_scaler.fit_transform(X_deep_array.T)
standard_scaler_deep = pickle.load(open(deep_path, 'rb'))
X_deep_array_norm = standard_scaler_deep.transform(X_deep_array_norm_T.T)
X_deep_array_norm = np.expand_dims(X_deep_array_norm, axis=2)
model = build_model()
model_classifications = model.predict([X_deep_array_norm, X_wide_array_norm], verbose=1)
K.clear_session()
for i, coord_pair in enumerate(enriched_region_coord_list):
out_list = [chrom, coord_pair[0], coord_pair[1], model_classifications[0][i][0], model_classifications[1][i][0], model_classifications[2][i][0]]
X_wide_list = X_wide_array[i][:-1].tolist()
X_wide_list = [100. if x>10 else x for x in X_wide_list]
out_list+=X_wide_list
chrom_file_out.append(out_list)
bed_file_out+=chrom_file_out
with open(out_folder+out_file_name, 'w', newline='') as f:
if not skipheader:
f.write('chrom\tstart\tend\toverall_peak_score\tshape_score\tenrichment_score\tpvalue_chrom\tpvalue_10kb\tpvalue_20kb\tpvalue_30kb\tpvalue_40kb\tpvalue_50kb\tpvalue_60kb\tpvalue_70kb\tpvalue_80kb\tpvalue_90kb\tpvalue_100kb\n')
bed_writer = csv.writer(f, delimiter='\t')
bed_writer.writerows(bed_file_out)
Call Peaks with Input
To call peaks from a bigWig track with an input file, use the callPeaks_Input
command.
Call peaks from a coverage track with an input and score them using the LanceOTron model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file |
str |
Path to bigwig track. |
required |
input |
str |
Control input track used to calculate Poisson-based significance of peaks. |
required |
threshold |
int |
Initial threshold used for selecting candidate peaks. Defaults to 4. |
4 |
window |
int |
Window size for rolling mean to use for selecting candidate peaks. Defaults to 400. |
400 |
folder |
str |
Output folder. Defaults to "./". |
'./' |
skipheader |
bool |
Skip writing header. Defaults to False. |
False |
Source code in lanceotron/modules.py
def call_peaks_with_input(
file: str,
input: str,
threshold: int = 4,
window: int = 400,
folder: str = "./",
skipheader: bool = False) -> None:
"""Call peaks from a coverage track with an input and score them using the LanceOTron model.
Args:
file (str): Path to bigwig track.
input (str): Control input track used to calculate Poisson-based significance of peaks.
threshold (int, optional): Initial threshold used for selecting candidate peaks. Defaults to 4.
window (int, optional): Window size for rolling mean to use for selecting candidate peaks. Defaults to 400.
folder (str, optional): Output folder. Defaults to "./".
skipheader (bool, optional): Skip writing header. Defaults to False.
"""
bigwig_file = file
control_file = input
out_folder = make_directory_name(folder)
initial_threshold = threshold
min_peak_width = 50
max_peak_width = 2000
read_coverage_factor = 10**9
out_file_name = bigwig_file.split('/')[-1].split('.')[0]+'_L-tron.bed'
pyBigWig_object = pyBigWig.open(bigwig_file)
read_coverage_total = pyBigWig_object.header()['sumData']
read_coverage_rphm = read_coverage_total/read_coverage_factor
pyBigWig_object.close()
bigwig_data = Ltron.Bigwig_data(bigwig_file)
genome_stats_dict = bigwig_data.get_genome_info()
bed_file_out = []
for chrom in genome_stats_dict:
print(chrom)
coverage_array_smooth = bigwig_data.make_chrom_coverage_map(genome_stats_dict[chrom], smoothing=window)
enriched_region_coord_list = Ltron.label_enriched_regions_dynamic_threshold_width(coverage_array_smooth, genome_stats_dict[chrom]['chrom_mean']*initial_threshold, genome_stats_dict[chrom]['chrom_mean'], max_peak_width, min_region_size=min_peak_width)
chrom_file_out = []
if enriched_region_coord_list:
wide_path = pkg_resources.resource_filename('lanceotron.static', 'standard_scaler_wide_v5_03.p')
deep_path = pkg_resources.resource_filename('lanceotron.static', 'standard_scaler_deep_v5_03.p')
coverage_array = bigwig_data.make_chrom_coverage_map(genome_stats_dict[chrom])/read_coverage_rphm
X_wide_array, X_deep_array = Ltron.extract_signal_wide_and_deep_chrom(coverage_array, enriched_region_coord_list, read_coverage_rphm)
standard_scaler_wide = pickle.load(open(wide_path, 'rb'))
X_wide_array_norm = standard_scaler_wide.transform(X_wide_array)
X_wide_array_norm = np.expand_dims(X_wide_array_norm, axis=2)
standard_scaler = StandardScaler()
X_deep_array_norm_T = standard_scaler.fit_transform(X_deep_array.T)
standard_scaler_deep = pickle.load(open(deep_path, 'rb'))
X_deep_array_norm = standard_scaler_deep.transform(X_deep_array_norm_T.T)
X_deep_array_norm = np.expand_dims(X_deep_array_norm, axis=2)
model = build_model()
model_classifications = model.predict([X_deep_array_norm, X_wide_array_norm], verbose=1)
pyBigWig_input = pyBigWig.open(control_file)
read_coverage_total_input = pyBigWig_input.header()['sumData']
read_coverage_rphm_input = read_coverage_total_input/read_coverage_factor
K.clear_session()
for i, coord_pair in enumerate(enriched_region_coord_list):
average_cov = coverage_array[coord_pair[0]:coord_pair[1]].mean()*read_coverage_rphm
pvalue_input = calculate_pvalue_from_input(chrom, coord_pair[0], coord_pair[1], read_coverage_total, read_coverage_total_input, pyBigWig_input, average_cov)
out_list = [chrom, coord_pair[0], coord_pair[1], model_classifications[0][i][0], model_classifications[1][i][0], model_classifications[2][i][0], pvalue_input]
X_wide_list = X_wide_array[i][:-1].tolist()
X_wide_list = [100. if x>10 else x for x in X_wide_list]
out_list+=X_wide_list
chrom_file_out.append(out_list)
pyBigWig_input.close()
bed_file_out+=chrom_file_out
with open(out_folder+out_file_name, 'w', newline='') as f:
if not skipheader:
f.write('chrom\tstart\tend\toverall_peak_score\tshape_score\tenrichment_score\tpvalue_input\tpvalue_chrom\tpvalue_10kb\tpvalue_20kb\tpvalue_30kb\tpvalue_40kb\tpvalue_50kb\tpvalue_60kb\tpvalue_70kb\tpvalue_80kb\tpvalue_90kb\tpvalue_100kb\n')
bed_writer = csv.writer(f, delimiter='\t')
bed_writer.writerows(bed_file_out)
Score a Bed file
To score the peaks in an existing Bed file, use the scoreBed
command.
Score an existing bed file using LanceOTron's model and a coverage track.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file |
str |
Input bigWig track. |
required |
bed |
str |
Bed file of regions to score. |
required |
folder |
str |
Output folder. Defaults to "./". |
'./' |
skipheader |
bool |
Skip writing header. Defaults to False. |
False |
Source code in lanceotron/modules.py
def score_bed(
file: str,
bed: str,
folder: str = "./",
skipheader: bool = False) -> None:
"""Score an existing bed file using LanceOTron's model and a coverage track.
Args:
file (str): Input bigWig track.
bed (str): Bed file of regions to score.
folder (str): Output folder. Defaults to "./".
skipheader (bool): Skip writing header. Defaults to False.
"""
bigwig_file = file
out_folder = make_directory_name(folder)
bed_file = bed
read_coverage_factor = 10**9
out_file_name = bigwig_file.split('/')[-1].split('.')[0]+'_L-tron.bed'
pyBigWig_object = pyBigWig.open(bigwig_file)
read_coverage_total = pyBigWig_object.header()['sumData']
read_coverage_rphm = read_coverage_total/read_coverage_factor
pyBigWig_object.close()
bed_list = Ltron.bed_file_to_list(bed_file)
chroms_in_bed = []
for bed_entry in bed_list:
if bed_entry[0] not in chroms_in_bed:
chroms_in_bed.append(bed_entry[0])
bigwig_data = Ltron.Bigwig_data(bigwig_file)
genome_stats_dict = bigwig_data.get_genome_info(include_special_chromosomes=True)
bed_file_out = []
for chrom in chroms_in_bed:
print(chrom)
enriched_region_coord_list = []
for bed_entry in bed_list:
if bed_entry[0]==chrom:
enriched_region_coord_list.append([bed_entry[1], bed_entry[2]])
chrom_file_out = []
if enriched_region_coord_list:
wide_path = pkg_resources.resource_filename('lanceotron.static', 'standard_scaler_wide_v5_03.p')
deep_path = pkg_resources.resource_filename('lanceotron.static', 'standard_scaler_deep_v5_03.p')
coverage_array = bigwig_data.make_chrom_coverage_map(genome_stats_dict[chrom])/read_coverage_rphm
X_wide_array, X_deep_array = Ltron.extract_signal_wide_and_deep_chrom(coverage_array, enriched_region_coord_list, read_coverage_rphm)
standard_scaler_wide = pickle.load(open(wide_path, 'rb'))
X_wide_array_norm = standard_scaler_wide.transform(X_wide_array)
X_wide_array_norm = np.expand_dims(X_wide_array_norm, axis=2)
standard_scaler = StandardScaler()
X_deep_array_norm_T = standard_scaler.fit_transform(X_deep_array.T)
standard_scaler_deep = pickle.load(open(deep_path, 'rb'))
X_deep_array_norm = standard_scaler_deep.transform(X_deep_array_norm_T.T)
X_deep_array_norm = np.expand_dims(X_deep_array_norm, axis=2)
model = build_model()
model_classifications = model.predict([X_deep_array_norm, X_wide_array_norm], verbose=1)
K.clear_session()
for i, coord_pair in enumerate(enriched_region_coord_list):
out_list = [chrom, coord_pair[0], coord_pair[1], model_classifications[0][i][0], model_classifications[1][i][0], model_classifications[2][i][0]]
X_wide_list = X_wide_array[i][:-1].tolist()
X_wide_list = [100. if x>10 else x for x in X_wide_list]
out_list+=X_wide_list
chrom_file_out.append(out_list)
bed_file_out+=chrom_file_out
with open(out_folder+out_file_name, 'w', newline='') as f:
if not skipheader:
f.write('chrom\tstart\tend\toverall_peak_score\tshape_score\tenrichment_score\tpvalue_chrom\tpvalue_10kb\tpvalue_20kb\tpvalue_30kb\tpvalue_40kb\tpvalue_50kb\tpvalue_60kb\tpvalue_70kb\tpvalue_80kb\tpvalue_90kb\tpvalue_100kb\n')
bed_writer = csv.writer(f, delimiter='\t')
bed_writer.writerows(bed_file_out)
Examples
There is a basic bigWig file included in the test
subdirectory. To try out the caller, execute it on this file.
lanceotron callPeaks test/chr22.bw -f output_folder
Citation
@article {Hentges2021.01.25.428108,
author = {Hentges, Lance D. and Sergeant, Martin J. and Downes, Damien J. and Hughes, Jim R. and Taylor, Stephen},
title = {LanceOtron: a deep learning peak caller for ATAC-seq, ChIP-seq, and DNase-seq},
year = {2021},
doi = {10.1101/2021.01.25.428108},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2021/01/27/2021.01.25.428108},
journal = {bioRxiv}
}
Bug Reports and Improvement Suggestions
Please raise an issue if there is anything you wish to ask or contribute.