Guide on SHOREmap v3.x

SHOREmap v3.x can analyze resequencing data from a classical mapping population (generated by crossing a natural strain to a diverged accession, i.e. outcrossing) or an isogenic mapping population (generated by crossing a mutagenized mutant to the non-mutagenized progenitor, i.e. backcrossing). Depending on the mapping population, SHOREmap outcross or backcross performs a sliding-window-based analysis of allele frequencies.

Both SHOREmap outcross and backcross require consensus base call information of the pooled mutant recombinants. This is the common output of any resequencing pipeline and is typically stored in large files. Before analyzing the allele frequencies, SHOREmap extract can be used to extract the consensus information that corresponds to the markers. This extraction saves a lot of time if SHOREmap needs to be rerun multiple times. Resequencing information of the parental lines (or background) helps to define high-quality SNP markers, which in turn allows for a more accurate mapping interval. We recommend providing this information even though SHOREmap (outcross) can be run on the pooled recombinants alone.

Brief overview of the workflow of SHOREmap outcross, assuming resequencing of the mapping population (as well as the parental lines) has been performed:

Basic steps of outcross analysis Description (a more detailed example has been added in FAQ)
SHOREmap extract Extracts resequencing consensus calls corresponding to SNP mutations (candidate markers)
SHOREmap create Creates a SNP marker list according to the resequencing quality of the parental/background lines
SHOREmap outcross Performs allele frequency analysis and defines a mapping interval (if any)
SHOREmap annotate Annotates the effect of mutations on genes in the mapping interval

Brief overview of the workflow of SHOREmap backcross:

Basic steps of backcross analysis Description
SHOREmap extract Extracts resequencing consensus calls corresponding to SNP mutations
SHOREmap backcross Performs allele frequency analysis
SHOREmap annotate Annotates the effect of mutations on genes

SHOREmap outcross

SHOREmap outcross predicts a mapping interval by analyzing the local mean and coefficient of variation of allele frequencies (AFs) within a pool of mutant recombinants. In the following, we describe the parameters that can be adjusted to tune this prediction.

The most basic usage of SHOREmap outcross is:

SHOREmap outcross --chrsizes chrsizes.txt --marker quality_variant.txt --consen consensus_summary.txt --folder path/to/output_folder/

The option '--chrsizes' provides a file containing the IDs of the chromosomes (or scaffolds) and their lengths in bp (example with A. thaliana). '--marker' provides a file (format: quality_variant.txt from SHORE consensus) listing the SNP markers for AF analysis. '--consen' provides a file (format: consensus_summary.txt from SHORE consensus) describing the allele counts at each marker. '--folder' tells SHOREmap where to write its results and log files.

By setting only mandatory options, SHOREmap outcross will consider all SNPs given in the quality_variant.txt (determined by resequencing of the mapping population), including false positive SNPs. SHOREmap outcross provides various options for filtering SNP markers. For example, options '--min-coverage' and '--max-coverage' can be used to exclude SNP markers with extreme coverage values, while option '--marker-score' can be used to exclude markers based on the base call quality. These options can decrease the number of false positive SNPs to be used as markers. However, the accuracy of a mapping interval may still not be satisfactory.

To achieve better accuracy of the mapping interval, SHOREmap create can be used to create markers for SHOREmap outcross from the resequencing information of the pool and the background (parental) line(s). The idea is that a SNP identified in one of the parental lines (against a reference genome) can be used as a marker if there is strong support for the reference allele in the other parental line. Several options of SHOREmap create can be tuned to control the quality of markers:

Parental line    Option              Description
Parental line 1  --pmarker-min-cov   Minimum coverage of a base call
                 --pmarker-max-cov   Maximum coverage of a base call
                 --pmarker-min-freq  Minimum AF of a base call
                 --pmarker-score     Minimum quality of a base call
Parental line 2  --bg-ref-cov        Minimum coverage of a base call
                 --bg-ref-cov-max    Maximum coverage of a base call
                 --bg-ref-freq       Minimum AF of a base call
                 --bg-ref-score      Minimum quality of a base call

SHOREmap outcross works with sliding windows, which can be tuned by using '--window-size' and '--window-step'. It calculates the local mean and coefficient of variation of AFs within each sliding window. The coefficient of variation is calculated as cv=δ/θ, where θ is the mean value of AFs at all markers within a sliding window, and δ is the corresponding standard deviation. In a mapping population, around the site where the causal mutation is fixed, cv should be a small value (~0.1), while θ approaches 1. Typically, AFs fluctuate around the theoretically expected values. To account for this, one can set cutoffs by using '--interval-max-mean' and '--interval-max-cvar' to control the interval calculation if needed.
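The window statistics described above can be illustrated with a small Python sketch. It is not SHOREmap's implementation: the function name window_stats, the window boundaries (starting at position 1, inclusive ends) and the use of the population standard deviation are assumptions made for illustration only.

```python
from statistics import mean, pstdev

def window_stats(markers, window_size, window_step, chr_len):
    """Return (window_start, n_markers, theta, cv) for each non-empty window.

    markers: list of (position, allele_frequency) tuples on one chromosome.
    theta is the mean AF in the window, cv = delta / theta.
    """
    results = []
    start = 1
    while start <= chr_len:
        end = start + window_size - 1
        afs = [af for pos, af in markers if start <= pos <= end]
        if afs:
            theta = mean(afs)
            delta = pstdev(afs)  # assumption: population standard deviation
            cv = delta / theta if theta > 0 else float("inf")
            results.append((start, len(afs), theta, cv))
        start += window_step
    return results

# Hypothetical markers: AFs near 0.5 in the first window, near 1 in the second.
demo = [(100, 0.5), (200, 0.5), (1100, 1.0), (1200, 0.9)]
for row in window_stats(demo, window_size=1000, window_step=1000, chr_len=2000):
    print(row)
```

A window with θ close to 1 and a small cv is the kind of region that the mean and cv cutoffs are meant to select.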

Two more options are available: '--background2', which visualizes the frequency of the other allele, and '-verbose', which prints processing information on screen during the run.

General outputs of SHOREmap outcross are summarized in the table below. SHOREmap outcross visualizes the analysis of AFs in a pdf output file, which can be found in the folder path/to/output_folder. Each page contains a graph of AFs for one chromosome/scaffold. In each graph, AFs are plotted as dots, and the window-averaged AFs and boost-values as lines. If a mapping interval exists, it is indicated by a light-blue block with the coordinates given below the figure. Information on the markers used for analyzing AFs and on the peaks of boosted values is written to the files SHOREmap_stat_single_marker.txt, SHROEmap_stat_window_markers.txt and SHOREmap_boosted_peaks.txt.

Output of SHOREmap outcross Description
OC_AF_visualization_1_boost.pdf Visualization of AFs
SHOREmap_stat_single_marker.txt AF of single markers used in analysis
SHROEmap_stat_window_markers.txt AF of sliding-window based markers used in analysis
SHOREmap_boosted_peaks.txt Peaks of boost-values which can be used to rank mutations in SHOREmap annotate
SHOREmap_marker_filtered_.WINx.ICx.ACx.xx_bgREF.txt File of filtered markers with base quality and coverage information (corresponding to SHOREmap_stat_single_marker.txt)
Outcross.log Log file of processing

In addition, SHOREmap outcross can zoom in on particular regions and allele frequency ranges by setting '--chromosome', '--begin' and '--end' (genomic positions), and '--minfreq' and '--maxfreq' as the lower and upper bounds of the AF.

SHOREmap backcross

SHOREmap backcross is used to analyze resequencing data of pooled recombinants of a mapping population generated by backcrossing. In contrast to a conventional mapping population, only mutagen-induced changes segregate, and only those can be used for mutation mapping.

SHOREmap backcross tries to filter out all differences between the reference sequence and the sequenced pool that are shared with the background/parental line, in order to focus on mutant-specific mutations only. It then checks whether the retained SNPs can be used as markers, based on the quality, coverage and AFs of base calls in the foreground and/or background. After proper filtering (for A. thaliana, several hundred markers should be left), SHOREmap backcross is capable of identifying a rough peak region by analyzing the AFs of the markers. By further annotating the variations, we can select candidate gene(s) for a trait of interest.

To check all the options of SHOREmap backcross, type the command

SHOREmap backcross

Here is a simple example of calling SHOREmap backcross:

SHOREmap backcross --chrsizes chrsizes.txt --marker quality_variant.txt --consen consensus_summary.txt --folder path/to/output_folder

The options of SHOREmap backcross given above have the same meanings as in SHOREmap outcross.

Again, giving only the resequencing data of the mapping population may not be powerful enough to identify candidate causal mutations. The SNP markers need to be examined using foreground/background information before the AFs are analyzed further. First, resequencing information of the parental line(s), or of another mutant of the same strain, can be used to correct the SNPs of the mapping population. This is referred to as background correction. Background correction can be done by turning on option '--bg' and/or '--bg-ref-filter'. '--bg' tells SHOREmap backcross to check the SNPs (marker candidates) of the targeted mutant against the SNPs of the background; '--bg-score' (base quality), '--bg-freq' (concordance/AF), and '--bg-cov' (coverage) can be set with threshold values to select the background SNPs that are used to examine the SNPs in the foreground. The option '--bg-ref-filter' tells SHOREmap backcross to validate SNPs determined in the pool by using the reference alleles called in the background genome; '--bg-ref-score', '--bg-ref-freq', and '--bg-ref-cov' can be set with specific values to control which reference alleles in the background are used to filter the SNPs in the foreground.

General outputs of SHOREmap backcross are summarized in the table below.

Output of SHOREmap backcross Description
BC_AF_visualization_*.pdf Visualization of AFs
SHOREmap_marker.bg_corrected File of markers corrected based on a given background
SHOREmap_marker.bg_corrected_mh1.0000_ic10_ac80_q40_f0.0_EMS File of markers further filtered according to average number of hits, minimum coverage, maximum coverage, base quality, allele frequency and type of mutation at the marker loci
BackCross.log Log file of processing

Output files are written into path/to/output_folder after each correction step (quality, concordance/AF and EMS mutations, the last controlled by '-non-EMS'). The file names (such as SHOREmap_marker.bg_corrected_mh1.0000_ic10_ac80_q40_f0.0_EMS) indicate the criteria used for filtering. If '-plot-bc' is on, with a suitable window size/step, θ, and δ controlled by '--window-size', '--window-step', '--interval-min-mean' and '--interval-min-cvar', the analysis of AFs is visualized in a pdf file in the same format as the one produced by the outcross function.

SHOREmap annotate

SHOREmap outcross/backcross results in a mapping interval containing a list of mutations. We would like to check whether one of these mutations causes a striking change that could explain the phenotype. This can be done with SHOREmap annotate, which outputs a list of mutations (prioritized by their distance to the highest peak) in a user-defined interval. The interval can be provided with '--chrom', '--start', and '--end'. By default, SHOREmap annotate uses position 1 as the peak (for ranking a variation); alternatively, the peak can be provided with '--peaks', which points to a file. For annotating SNPs, gene annotations (in GFF) are required so that the effect of the SNPs can be added to the output.

Here is an example of annotating SNPs in the region [1600000, 1700000] of chromosome 4:

SHOREmap annotate --chrsizes chrsizes.txt --snp markers.txt --chrom 4 --start 1600000 --end 1700000 --genome reference_seq.fa --gff gene_annotations.gff --peaks SHOREmap_boosted_peaks.txt --folder path/to/output_folder

'--snp' gives a list of SNPs/indels.

'--peaks' is the output file of SHOREmap outcross/backcross that contains all peaks predicted for all the chromosomes. It is a two-column file, with the first column giving the chromosome and the second column giving a peak position. The columns are tab-separated.
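For checking a peak file before passing it to '--peaks', the two-column layout described above can be read with a few lines of Python. This is a minimal sketch: the function name read_peaks is hypothetical, and the file is assumed to have no header line.

```python
from collections import defaultdict

def read_peaks(lines):
    """Map chromosome ID -> list of peak positions from tab-separated lines."""
    peaks = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chrom, pos = line.split("\t")[:2]
        peaks[chrom].append(int(pos))
    return dict(peaks)

# Hypothetical content: two peaks on chromosome 4.
example = ["4\t1680000\n", "4\t1690000\n"]
print(read_peaks(example))
```

With a real file, the same function can be called as read_peaks(open("SHOREmap_boosted_peaks.txt")).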

'--chrom' indicates the identity of the chromosome to annotate. It must be consistent with one of the chromosome identities in marker.txt, gene_annotations.gff, and consensus_summary.txt.

'--start' and '--end' indicate the interval location.

'--genome' is the reference sequence in fasta format.

'--gff' is a GFF-formatted annotation of gene features. For Arabidopsis thaliana, the GFF file can be downloaded from arabidopsis.org.

Results can be found in path/to/output_folder. There is a file named prioritized_snp_4_1600000_-1700000_peak1680000.txt (supposing the peak file lists a peak at position 1680000 on chromosome 4). If there is another peak, say at 1690000, there will be one more annotated file, prioritized_snp_4_-1600000_1700000_peak1690000.txt. Additionally, there is a file ref_and_eco_coding_seq.txt with the following lines:

Line Description
1 Name of the gene
2 Coding region extracted from the reference genome
3 Coding region extracted from the mutant genome (a position whose base equals the reference base is shown as '-'; a differing position shows the mutant DNA base)
4 Protein sequence translated from the reference coding region
5 Protein sequence translated from the mutant coding region (a position whose amino acid equals that of the reference is shown as '-'; a differing position shows the mutant amino acid or '*')
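Because the mutant lines in ref_and_eco_coding_seq.txt mask unchanged positions with '-', it can help to rebuild the full mutant sequence or to list only the changes. The following Python sketch shows one way to do this; the function names are hypothetical and the two lines are assumed to have equal length.

```python
def unmask(reference, masked):
    """Rebuild the full mutant sequence from a reference line and a masked line."""
    if len(reference) != len(masked):
        raise ValueError("reference and masked lines must have equal length")
    return "".join(r if m == "-" else m for r, m in zip(reference, masked))

def differences(reference, masked):
    """List (1-based position, reference symbol, mutant symbol) for each change."""
    return [(i, r, m)
            for i, (r, m) in enumerate(zip(reference, masked), start=1)
            if m != "-"]

# Hypothetical coding sequence with a single C->T change at position 5.
ref = "ATGGCATGA"
mut = "----T----"
print(unmask(ref, mut))
print(differences(ref, mut))
```

The same two functions work on the protein lines, where a '*' in the masked line marks a stop codon.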

SHOREmap convert

SHOREmap convert is used to convert VCF files generated by SAMtools into the file formats accepted by SHOREmap. An example is given below:

SHOREmap convert --marker samtools.vcf --folder path/to/folder -runid 3

This produces the converted files 3_converted_consen.txt, 3_converted_variant.txt and 3_converted_reference.txt.

Format of input files

Chromosome file

'--chrsizes' requires a file containing a table with two columns. (This file can be constructed from the chromosome information in a GFF file.)

Column Description
1 Identity of reference chromosome or scaffold
2 Size of the reference chromosome or scaffold (in bp)
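As a minimal sketch of how such a file might be built from a GFF file, the Python snippet below collects the end coordinate of every row whose feature type is "chromosome". The function names are hypothetical, and two assumptions are made: the GFF contains one such row per sequence spanning its full length, and the two output columns are tab-separated.

```python
def chrsizes_from_gff(gff_lines, feature_type="chromosome"):
    """Return [(seqid, size_bp)] from GFF rows whose type column equals feature_type."""
    sizes = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 5 and cols[2] == feature_type:
            sizes.append((cols[0], int(cols[4])))  # column 5 = end coordinate
    return sizes

def format_chrsizes(sizes):
    """Format the table as two tab-separated columns (assumed layout)."""
    return "".join(f"{seqid}\t{size}\n" for seqid, size in sizes)

gff = [
    "##gff-version 3\n",
    "Chr1\tTAIR10\tchromosome\t1\t30427671\t.\t.\t.\tID=Chr1\n",
    "Chr1\tTAIR10\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010\n",
]
print(format_chrsizes(chrsizes_from_gff(gff)), end="")
```

If the GFF has no chromosome rows, the sizes have to come from another source, for example the lengths of the sequences in the reference FASTA file.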

File of candidate markers

If SHORE is used, the variations to be analyzed can be found under Consensus/ConsensusAnalysis/. There are a number of files containing SNPs predicted by the SHORE analysis, such as heterozygous_position.txt, homozygous_snp.txt, and quality_variant.txt. (See the annotations of SHORE outputs.)

SHOREmap analyzes the SNPs in quality_variant.txt. Each row of this file represents one SNP used for analysis; the columns are

Column Description
1 Project name
2 Identity of chromosome
3 Position of the SNP-marker
4 Reference base
5 Alternative base (or mutant base)
6 Quality of the alternative base (ranging from 0 to 40)
7 Number of reads supporting the predicted base
8 Ratio of reads supporting the predicted base to total coverage
9 Average number of hits of alignments covering the position of the SNP
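To inspect or pre-filter such a file outside SHOREmap, the nine columns can be mapped to named fields. The sketch below is an illustration only: the names Variant and parse_variant are hypothetical, the columns are assumed to be tab-separated, and the quality is assumed to be written as an integer.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    project: str      # column 1
    chrom: str        # column 2
    pos: int          # column 3
    ref: str          # column 4
    alt: str          # column 5
    quality: int      # column 6 (0 to 40)
    support: int      # column 7
    frequency: float  # column 8
    avg_hits: float   # column 9

def parse_variant(line):
    """Parse one tab-separated row of a quality_variant.txt-style file."""
    f = line.rstrip("\n").split("\t")
    return Variant(f[0], f[1], int(f[2]), f[3], f[4],
                   int(f[5]), int(f[6]), float(f[7]), float(f[8]))

# Hypothetical row: a G->A SNP on chromosome 4.
v = parse_variant("proj\t4\t1650000\tG\tA\t40\t25\t0.96\t1.0\n")
print(v.chrom, v.pos, v.ref, v.alt, v.frequency)
```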

(The high-quality reference base call information in quality_reference.txt has the same format as quality_variant.txt. This information can substantially increase the quality of markers for the subsequent allele frequency analysis. However, the file can be tens of gigabytes in size, so parsing it takes a lot of time. To save time in future analyses with SHOREmap outcross/backcross, the information of interest can be extracted once and for all by using

SHOREmap extract --chrsizes chrsize.txt --marker marker.txt --consen quality_reference.txt --folder /path/to/folder/ --extract-bg-ref

Extracted consensus information will be recorded in a new file named extracted_quality_ref_base.txt with the same format as the original file. This file is required when '-bg-ref-filter' is turned on in SHOREmap backcross).

File of consensus base call

If SHORE is used, consensus_summary.txt can be found in folder Consensus/ConsensusAnalysis/supplementary_data. In its original form, consensus_summary.txt might be compressed as consensus_summary.txt.gz. In this case, we should unzip it with the command

gunzip consensus_summary.txt.gz

Generally, consensus_summary.txt is a huge table of 65 columns; each row holds the consensus base call information of one genomic position. For details on each column of this table, please see the annotations of SHORE consensus call. Here we only introduce the first 12 columns, which are used in SHOREmap:

Column Description
1 Identity of reference sequence chromosome/scaffold
2 Position of a reference base
3 Base call, including A/C/G/T or N, or -
4 Total coverage at the position, including Ns and deletions
5 Number of observed A
6 Number of observed C
7 Number of observed G
8 Number of observed T
9 Number of observed -
10 Number of observed N
11 Average number of hits of alignments covering the position
12 Average number of mismatches of alignments covering the position

Columns 13 to 65 in consensus_summary.txt are ignored by SHOREmap.
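The allele counts in columns 5 to 10 are what an allele frequency at a marker is derived from. The following Python sketch reads the first ten columns of a row and computes one possible AF. It is not SHOREmap's own calculation: the function names are hypothetical, the columns are assumed to be tab-separated, and the AF is assumed here to be the count of one allele divided by the summed counts of the two alleles under comparison.

```python
BASES = ("A", "C", "G", "T", "-", "N")  # order of columns 5 to 10

def allele_counts(line):
    """Return (chrom, pos, call, coverage, counts) from one consensus row."""
    f = line.rstrip("\n").split("\t")
    chrom, pos, call, coverage = f[0], int(f[1]), f[2], float(f[3])
    counts = dict(zip(BASES, (float(x) for x in f[4:10])))
    return chrom, pos, call, coverage, counts

def allele_frequency(counts, allele, other):
    """AF of `allele` relative to `allele` + `other` (assumed definition)."""
    total = counts[allele] + counts[other]
    return counts[allele] / total if total else None

# Hypothetical row: 27 reads with A and 3 reads with G at position 1650000.
row = "4\t1650000\tA\t30\t27\t0\t3\t0\t0\t0\t1.0\t0.2\n"
chrom, pos, call, coverage, counts = allele_counts(row)
print(chrom, pos, allele_frequency(counts, "A", "G"))
```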

The file consensus_summary.txt is also large, ranging from tens to hundreds of gigabytes, while we only need the information corresponding to the candidate markers. Therefore, it is recommended to extract the corresponding information before further processing with SHOREmap outcross/backcross. This extraction can be done with the command

SHOREmap extract --chrsizes chrsize.txt --marker quality_variant.txt --consen consensus_summary.txt --folder path/to/folder

The resulting file, named extracted*.txt, is written into '--folder' and has the same format as the original file.
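Conceptually, the extraction keeps only those consensus rows whose chromosome and position match a marker. The Python sketch below illustrates that idea on in-memory lines; it is not the SHOREmap extract implementation, the function name is hypothetical, and both files are assumed to be tab-separated with the column layouts described in this guide.

```python
def extract_marker_rows(consensus_lines, marker_lines):
    """Yield consensus rows whose (chromosome, position) matches a marker."""
    wanted = set()
    for line in marker_lines:
        f = line.rstrip("\n").split("\t")
        wanted.add((f[1], f[2]))   # marker file: chromosome = col 2, position = col 3
    for line in consensus_lines:
        f = line.rstrip("\n").split("\t", 2)
        if (f[0], f[1]) in wanted:  # consensus file: chromosome = col 1, position = col 2
            yield line

# Hypothetical one-marker example.
markers = ["proj\t4\t100\tG\tA\t40\t25\t0.96\t1.0\n"]
consensus = ["4\t99\tA\t30\n", "4\t100\tG\t28\n", "5\t100\tC\t31\n"]
print(list(extract_marker_rows(consensus, markers)))
```

Because the consensus rows are streamed one at a time, only the set of marker positions has to fit in memory.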

File of reference genome

FASTA file of the reference sequence (chromosomes/scaffolds), i.e., each record has a header line starting with '>' followed by the id of a chromosome, and then the sequence, for example
>1
ACGTACGT
>2
TGCATGCATGCA
...

File of SNPs for annotation

The SNPs annotated by SHOREmap annotate can be those in quality_variant.txt, heterozygous_position.txt, or homozygous_snp.txt (if SHORE is used). The latter two have the same format as quality_variant.txt. For detailed descriptions, please see the annotations of SHORE consensus call.

File of variations in VCF

SAMtools v0.1.18 (r982:295) can deliver variant calls in VCF4.1, usually named samtools.vcf. For more information, please refer to Manual Reference Pages of SAMtools.

Conversion leads to a raw-marker file and a consensus file, which are in the formats that can be accepted by SHOREmap analysis.

File of annotation of gene feature

See the description of General Feature Format (GFF) files. For A. thaliana, see arabidopsis.org.

Outputs

SHOREmap outcross

The main output is a file visualizing the analysis of AFs, including AFs at single markers, window-based average AFs and the boosted analysis (optional). Each chromosome/scaffold is plotted on its own page, where the key parameters provided on the command line can be checked. Importantly, the mapping interval (if any) is reported with its start and end positions.

If -plot-boost is switched on, a file SHOREmap_boosted_peaks.txt is generated to record the boosted peaks. The peak positions can be used in annotation for ranking variations. The file has the same format as the file of names and sizes of chromosomes, except that the second column holds a genomic position.

Moreover, the final markers used for deriving the mapping interval will be recorded in a file like SHOREmap_marker_filtered_.off_bgREF.txt ('off_bgREF' in the file name indicates no filtering is carried out with information of background reference base calls). This file has the same format as the above-mentioned file of candidate markers.

SHOREmap backcross

This function can also output a visualization pdf file containing the same information as outcross. Intermediate information on the markers used for analyzing AFs is also recorded in files like SHOREmap_marker.bg_corrected_q40_f20_EMS. 'bg_corrected_q40_f20_EMS' means that the markers are corrected with information of background SNPs, and that only EMS-type SNPs with a minimum quality of 40 and an AF of 0.2 are used as markers.

An intermediate file SHOREmap_marker_filtered_on_bgREF_ccd0.990_q40.txt contains markers that are validated by using information of background reference base call. A background reference base can be used to validate a foreground marker if its concordance is at least 0.99 and quality is 40.

The files recording intermediate markers have the same format as the above-mentioned file of candidate markers.

SHOREmap annotate

The file of annotated mutations includes the following columns of information:
Column Description
1 Chromosome identifier
2 Position of a base in the reference sequence
3 Base in the reference sequence
4 Alternative/mutant base
5 Number of reads supporting the base change
6 AF of the short reads overlapping the position
7 Base quality supporting the base change
8 Either NEWSNP or REFERR
9 Region of DNA affected by the change (CDS, splice_site, 3'/5'-UTR etc.)
10 Gene identifier
11 Physical distance from the peak (if position of the peak is not provided, distance is from position 1)
12 Ordinal number of the codon carrying the mutation
13 Position of the mutation within the codon
14 Either Synonymous or Nonsynonymous (only when CDS is hit)
15 Amino acid of the reference (only when CDS is hit)
16 Amino acid after base change (only when CDS is hit)
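A typical follow-up is to keep only the nonsynonymous changes and rank them by their distance to the peak. The sketch below shows one way to do this in Python. The function name and the gene labels in the example are hypothetical, and several assumptions are made: the columns are tab-separated, column 14 contains exactly the strings "Synonymous" or "Nonsynonymous", the distance in column 11 is an integer, and rows that do not hit a CDS have fewer than 14 columns.

```python
def candidate_mutations(lines):
    """Return nonsynonymous rows sorted by distance to the peak (closest first)."""
    hits = []
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) >= 14 and f[13] == "Nonsynonymous":
            hits.append({
                "chrom": f[0], "pos": int(f[1]), "ref": f[2], "alt": f[3],
                "gene": f[9], "distance": abs(int(f[10])),
            })
    return sorted(hits, key=lambda h: h["distance"])

# Hypothetical rows; geneA/geneB/geneC are placeholders, not real identifiers.
rows = [
    "4\t1650000\tG\tA\t25\t0.96\t40\tNEWSNP\tCDS\tgeneA\t30000\t12\t2\tNonsynonymous\tR\tK",
    "4\t1679000\tC\tT\t30\t1.0\t40\tNEWSNP\tCDS\tgeneB\t1000\t5\t1\tNonsynonymous\tP\tS",
    "4\t1660000\tC\tT\t30\t1.0\t40\tNEWSNP\tCDS\tgeneC\t20000\t7\t3\tSynonymous\tL\tL",
    "4\t1670000\tG\tA\t28\t0.9\t40\tNEWSNP\tintergenic\tNA\t10000",
]
for hit in candidate_mutations(rows):
    print(hit["gene"], hit["pos"], hit["distance"])
```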