SHOREmap v3.0: fast and accurate identification of causal mutations from forward genetic screens

Hequan Sun and Korbinian Schneeberger*

Department of Plant Developmental Biology, Max Planck Institute for Plant Breeding Research, Carl-von-Linné-Weg 10, 50829 Cologne, Germany


*Correspondence

Korbinian Schneeberger

Department of Plant Developmental Biology

Max Planck Institute for Plant Breeding Research

schneeberger@mpipz.mpg.de

Running head

Mapping-by-sequencing with SHOREmap v3.0



Summary

Whole genome resequencing of pools of recombinant mutant genomes allows directly linking phenotypic traits to causal mutations. Such an analysis, called mapping-by-sequencing, combines classical genetic mapping and next generation sequencing by relying on selection-induced patterns within genome-wide allele frequency (AF) in pooled genomes. Mapping-by-sequencing can be performed with computational tools such as SHOREmap. Previous versions of SHOREmap, however, did not implement standardized analyses, but were specifically designed for particular experimental settings. Here, we introduce the usage of a novel and advanced implementation of SHOREmap (version 3.0) including several new features like file readers for commonly used file formats, SNP marker selection and a stable calculation of mapping intervals. SHOREmap can be downloaded at shoremap.org.


Keywords: forward genetics, bulk segregant analysis, next generation sequencing, mapping-by-sequencing, SNP marker, allele frequency analysis




1. Introduction

Next generation sequencing facilitates genome-wide identification of SNPs, indels, and structural variations [1], which enable whole genome comparisons, even if the genomes under investigation are only distinguished by a few differences. Though mutagen-induced mutations can be identified by directly comparing the mutant and wild-type genomes [2], identification of causal mutations, or a small set of candidate mutations, within the genomes of mutants from forward genetic screens is hampered by the sheer number of mutations in the genomes. To reduce the list of potentially causal mutations, sequencing-based analysis of pools of recombinant mutant genomes can make use of genetic linkage to distinguish between causal and non-causal mutations [3] in one process, which is commonly referred to as mapping-by-sequencing [4].

For this, the mutant is crossed to a non-mutant plant followed by one generation of inbreeding or inter-crossing of the heterozygous F1 samples. The offspring of this second cross gives rise to a recombinant population, which will segregate for the mutant phenotype. These individuals can be isolated, pooled and sequenced (either genome-wide, or with complexity reducing methods including RNA-seq, RAD-seq or whole-exome sequencing). Mapping-by-sequencing then estimates allele frequencies (AFs) of the different parental alleles throughout the genome (measured with read counts at marker loci) to identify local skews, which were introduced by the selection for the mutant phenotype and thereby reveal the region harbouring the causal mutation [5]. A comprehensive review on mapping-by-sequencing is given in [6].

Mapping-by-sequencing as implemented in SHOREmap (SHOREmapping) relies on a marker list, including the two parental alleles. Other methods, which can work independently of prior knowledge on the markers, have been presented (e.g. [3,7]), but are currently not implemented in SHOREmap.

SHOREmap can handle different types of segregating mapping populations, either derived through outcrossing of the mutant to a diverged accession [3,7-10], or through backcrossing of the mutant to its progenitor [11-15]. Outcrossing the mutant to a wild-type strain with a different genetic background introduces a large number of natural polymorphisms into the recombinant genomes, which provides a powerful basis for allele frequency estimations [3]. In a mapping population generated through backcrossing the mutant to the non-mutagenized progenitor, only mutagen-induced mutations segregate [16], so only these mutations can be used as markers.

SHOREmapping is performed on the outcome of a standard resequencing analysis of the recombinant pool. Resequencing relies on alignments of short reads against a reference sequence, performed with tools like BWA [17], GenomeMapper [18], or Bowtie2 [19]. Most of the short read alignment tools store their results in Sequence/Binary Alignment/Map (SAM/BAM) format [20], which can be further processed by tools such as SHORE [1], SAMtools [20] or GATK [21]. These tools use overlapping alignments to reveal genomic variations, e.g., SNPs, which are finally recorded in VCF formatted files [22].

Here we introduce a C/C++-based implementation of SHOREmap (version 3.0, http://shoremap.org under GPL license), which is independent of any particular resequencing tool. The SHOREmap function convert pre-processes the resequencing results stored in VCF files for further analysis. Users can provide customized SNP markers or use the SHOREmap function create for marker selection. Once markers are selected, the SHOREmap functions outcross and backcross estimate AFs throughout the bulk of mutant genomes. The function outcross predicts a mapping interval, while backcross identifies candidate causal mutations directly. Both functions visualize mutant AFs at the marker loci in a PDF file. Finally, the SHOREmap function annotate predicts the functional impact of the candidate mutations on gene integrity.

2. Materials

2.1 SHOREmapping with outcross populations

The first step of SHOREmapping based on outcross population data is marker selection if they are not provided from additional sources. For this, SHOREmap distinguishes between two different cases. First, if only the resequencing data of the mapping population is available, the function outcross can be used for de novo marker identification. Single nucleotide mismatches between the short reads and the reference sequence can be considered as markers if the non-reference alleles fulfill several quality criteria, such as the number of short reads aligned to them. Users can adjust these thresholds if needed. Second, marker selection can be performed on the outcome of resequencing analyses of the parental lines. For each high-quality SNP identified in one of the parental lines, create checks if the resequencing data of the other parental line supports a different allele with the above-mentioned quality criteria. Only those markers, which pass all quality controls in the resequencing analyses of both parental samples, will be kept for mapping interval calculation. If one of the parental lines is the reference line, resequencing information of the other parental line is sufficient for generating a list of markers. Alternatively, SHOREmap can accept marker lists from public databases or any other sources.

Using the marker list and the resequencing data of the pooled recombinants, the function outcross will then identify genomic regions (or mapping intervals) that segregate at a given target AF within the bulked population. Within mapping populations established for the identification of recessive mutations, the target AF of the causal mutations is 1.0. However, mapping populations of dominant mutations or more complex crossing schemes might not fix the causal mutations. Moreover, due to the effects of random sampling, mis-aligned reads and sequencing errors, observed AFs can fluctuate around their real AF [16].
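As a worked illustration of why the target AF depends on the crossing scheme, consider an F2 population under simplifying assumptions that are ours and are not stated in the protocol: Mendelian segregation, full penetrance, and no selection other than for the mutant phenotype. Writing m for the mutant allele and + for the wild-type allele, the expected mutant AFs in the selected pool would be:

```latex
\begin{align*}
% Recessive mutation: only m/m plants show the phenotype.
\mathrm{AF}_{\text{recessive}} &= 1.0 \\
% Dominant mutation: selected plants are m/m (1/3) or m/+ (2/3).
\mathrm{AF}_{\text{dominant}} &= \tfrac{1}{3}\cdot 1 + \tfrac{2}{3}\cdot\tfrac{1}{2} = \tfrac{2}{3} \\
% Marker unlinked to the causal mutation: no selection acts on it.
\mathrm{AF}_{\text{unlinked}} &= \tfrac{1}{2}
\end{align*}
```

Under these assumptions a dominant mutation would be expected near an AF of 0.67 rather than 1.0, which is one example of a target AF that is not fixed in the pool.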

SHOREmap outcross estimates the average AF (denoted by θ) within sliding windows of adjustable size along the chromosomes. Users can set minimum and maximum thresholds θmin and θmax to define an acceptable range around the target AF. This estimation can be fine-tuned by the coefficient of variation of the AFs (denoted by Cv). Assuming σ is the standard deviation corresponding to θ, Cv = σ/θ. Users can set a maximum threshold Cvmax for Cv. A continuous mapping interval is defined by outcross by connecting all windows with θmin ≤ θ ≤ θmax and Cv ≤ Cvmax.
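The window test can be sketched in a few lines of shell. This is a minimal illustration and not SHOREmap's implementation: the function name window_stats is hypothetical, σ is assumed to be the population standard deviation of the AFs in the window, and a window is kept when θmin ≤ θ ≤ θmax and Cv ≤ Cvmax.

```shell
# Hypothetical helper (not a SHOREmap command): summarise the AFs of one window.
# Reads one AF per line from stdin and prints "mean cv verdict".
# Usage: window_stats <theta_min> <theta_max> <cv_max>
# Assumption: sigma is the population standard deviation of the AFs.
window_stats() {
  awk -v tmin="$1" -v tmax="$2" -v cvmax="$3" '
    { af[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1                      # no markers in this window
      mean = sum / NR
      for (i = 1; i <= NR; i++) ss += (af[i] - mean) ^ 2
      cv = (mean > 0) ? sqrt(ss / NR) / mean : 0
      verdict = (mean >= tmin && mean <= tmax && cv <= cvmax) ? "keep" : "drop"
      printf "%.4f %.4f %s\n", mean, cv, verdict
    }'
}

# Four markers in one window, tested against theta_min=0.99, theta_max=1.0, Cv_max=0.01.
printf '1.0\n1.0\n0.98\n1.0\n' | window_stats 0.99 1.0 0.01
# prints: 0.9950 0.0087 keep
```

With the stricter θmin of 0.997 the same window would be dropped, which shows how sensitive the interval can be to a single marker with a slightly lower AF.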

Default parameters for fixed and segregating mutations are provided, but adjustments could be required if the initial run remains unsuccessful. Such adjustments might also involve recreating the initial marker list.

To determine different thresholds of the parameters, which can be used to refine the marker list, SHOREmap outcross implements a k-means-based clustering to classify markers using the above-mentioned quality criteria as their attributes. Excluding sets (clusters) of markers with low quality may lead to a more accurate mapping interval.

The function outcross additionally calculates a simple metric called boost-value (denoted by Bv). If θobs is the observed mean of AFs within a window and θtar is the target AF, then Bv = 1/|1 - max(θtar, 1-θtar)/max(θobs, 1-θobs)|. Similar to the r-value [3], which was introduced with the original version of SHOREmap, the peak in the distribution of boost-values along the chromosome (or boost-peak) is likely to co-localize with the causal mutation.
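A small numerical sketch may help to see how the boost-value behaves. It reads the formula as Bv = 1/|1 - max(θtar, 1-θtar)/max(θobs, 1-θobs)|; the function name boost_value is hypothetical, and reporting "inf" when the observed AF equals the target exactly is our choice, not documented SHOREmap behaviour.

```shell
# Hypothetical helper (not a SHOREmap command): boost-value of one window.
# Usage: boost_value <theta_obs> <theta_tar>
boost_value() {
  awk -v obs="$1" -v tar="$2" 'BEGIN {
    mo = (obs > 1 - obs) ? obs : 1 - obs      # max(theta_obs, 1 - theta_obs)
    mt = (tar > 1 - tar) ? tar : 1 - tar      # max(theta_tar, 1 - theta_tar)
    d = 1 - mt / mo
    if (d < 0) d = -d                         # absolute value
    if (d == 0) print "inf"                   # observed AF matches the target exactly
    else printf "%.2f\n", 1 / d
  }'
}

# Target AF 1.0 (recessive mutation), windows with decreasing observed AF.
boost_value 0.98 1.0
boost_value 0.75 1.0
boost_value 0.50 1.0
```

For these three windows the sketch prints 49.00, 3.00 and 1.00. The value rises steeply as the observed AF approaches the target, which is why a peak is expected close to the causal mutation.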

2.2 SHOREmapping with backcross populations

Markers segregating in backcross populations are mutagen-induced changes that distinguish the mutant genome from the non-mutagenized progenitor. However, as resequencing tends to bring up false positive variations and, more importantly, as the mutant line might be in a different background than the line used to establish the reference sequence, it is important to distinguish the mutagen-induced mutations from natural variations and resequencing artifacts. This requires comparison to the non-mutagenized progenitor (or background correction). For this, the function backcross identifies novel, mutagen-induced mutations by checking for the presence of reference alleles in the resequencing data of the non-mutagenized parental line. Once identified, the mutant AFs at these mutation-markers can highlight regions under selection and finally reveal candidate mutations (e.g. recessive mutations, which are fixed in the pool). Instead of comparing to the non-mutagenized progenitor, other mutants or pools of mutant genomes of the same screen can be used to discriminate between natural variation and novel mutations.
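The idea of background correction can be illustrated with a deliberately simplified sketch. It is not what SHOREmap backcross does internally: instead of checking for well-supported reference alleles in the progenitor, it only drops variants whose position was also called as a variant in the background sample. The function name and the assumed file layout (tab-separated, chromosome in column 2, position in column 3) are hypothetical.

```shell
# Hypothetical helper (not a SHOREmap command). Assumed layout for both files:
# tab-separated, column 2 = chromosome, column 3 = position.
# Usage: subtract_background <mutant_variants> <background_variants>
subtract_background() {
  awk -F '\t' -v bg="$2" '
    BEGIN {
      # Remember every chromosome/position pair seen in the background file.
      while ((getline line < bg) > 0) {
        split(line, f, "\t")
        seen[f[2] "\t" f[3]] = 1
      }
      close(bg)
    }
    # Print mutant variants whose position is absent from the background.
    !(($2 "\t" $3) in seen)
  ' "$1"
}
```

Variants shared between the mutant pool and the progenitor are likely natural variation or recurrent artifacts, so removing them leaves a shorter list of candidate mutagen-induced changes.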


2.3 SHOREmap annotate and additional functions

The function annotate predicts putative effects of mutations on gene integrity given the reference sequence and gene annotations, including premature stops or mutations that affect the coding of amino acids. In addition to prioritizing them according to their putative impact on genes, the mutations can be ranked according to their physical distance to the boost-peak.

Moreover, SHOREmap implements the two supporting functions called convert and extract. SHOREmap generally relies on proprietary file formats, but as resequencing analyses performed by SAMtools or GATK record their result in standard VCF, SHOREmap convert can translate such data into the correct format. SHOREmap extract can parse the files that store the resequencing results, which can be as huge as tens of gigabytes in size, and reduce the files with respect to (candidate) markers. This reduces execution time if multiple runs of SHOREmapping are performed on the same data.

3. Methods

Here we used the data of two recent mapping-by-sequencing experiments in Arabidopsis thaliana to illustrate the usage of SHOREmap (see Table 1). This data can be downloaded from shoremap.org (with commands provided below). Within the first study, a recombinant mapping population was generated by outcrossing a recessive mutant in the Col (reference) background to the diverged accession Ler, followed by one round of inbreeding of the F1. A pool of 119 F2 mutant individuals was sequenced with Illumina paired-end reads to a sequencing depth of 60x (in the following we refer to this data set as OCF2) [5]. The two parental lines Col and Ler were also sequenced at 42x depth [23] and 48x depth [14], respectively. The sequencing data of the parental lines was not generated from the actual parents of the map cross, but was taken from two different sequencing projects on the same (homozygous) lines and therefore did not include the actual mutations. Although they do not include the mutations, these data provide natural variations, which can be used as markers.

The second study included sequencing of the mutants of a recombinant backcross population. These were generated by backcrossing a mutant to the non-mutagenized progenitor followed by one round of inbreeding. A mutant pool of 110 individuals was sequenced with Illumina paired-end reads at 50x sequencing depth (called BCF2 in the following). The non-mutagenized parental line, called mir159a, was sequenced at 48x depth [14].

The reference sequence of Arabidopsis thaliana (TAIR10_chr_all.fas) and gene annotation (TAIR10_GFF3_genes.gff) from The Arabidopsis Information Resource (www.arabidopsis.org) were used for short read alignment (and SNP calling) and mutation annotation. SHOREmap requires a tab-separated file chrSizes.txt, of which the first column lists the identities of the chromosomes/scaffolds and the second column lists their sizes. Resequencing by SHORE also requires a scoring matrix to call SNPs, which can be found under its installation folder share/shore/. Note that the chromosome identifiers in files TAIR10_chr_all.fas, TAIR10_GFF3_genes.gff and chrSizes.txt must be the same.
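If chrSizes.txt is not at hand, a table in the described two-column layout can be derived from the reference FASTA file. The following is a minimal sketch, not a SHOREmap function: the name fasta_to_chrsizes is hypothetical, and it assumes that the chromosome identifier is the first whitespace-delimited token of each FASTA header line.

```shell
# Hypothetical helper (not a SHOREmap command): print "<id><TAB><length>" per sequence.
# Assumption: the identifier is the first token after ">" in each header line.
fasta_to_chrsizes() {
  awk '
    { sub(/\r$/, "") }                        # tolerate Windows line endings
    /^>/ {
      if (id != "") printf "%s\t%d\n", id, len
      id = substr($1, 2); len = 0; next
    }
    { len += length($0) }                     # add up sequence characters
    END { if (id != "") printf "%s\t%d\n", id, len }
  ' "$1"
}

# Usage with the file names of this chapter:
# fasta_to_chrsizes TAIR10_chr_all.fas > chrSizes.txt
```

The identifiers written this way are taken directly from the FASTA headers, so they still need to be compared against the identifiers used in the GFF file.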

Here we will demonstrate usage of SHOREmap based on different resequencing tools. In particular we will perform resequencing with SHORE [1] (release 0.7.1) and SAMtools (version 0.1.19) [20]. For these analyses we will use GenomeMapper [18] (release 0.4.4) and Bowtie2 (version 2.2.2.2) [19] as short read alignment tools, respectively.

The SHOREmap analysis as well as the resequencing will be performed on the command line of a Linux operating system.

3.1 SHOREmapping of a recessive mutation within an outcrossing population (resequencing performed with SHORE)

  1. Create a folder example/ and download the data (including backcrossing data) listed in a file data_list.txt.

mkdir example

cd example

wget --no-check-certificate http://bioinfo.mpipz.mpg.de/shoremap/data_list.txt

while read file; do wget --no-check-certificate ${file}; done < data_list.txt

        #Additional: md5 check -- if all files are downloaded completely
          wget --no-check-certificate http://bioinfo.mpipz.mpg.de/shoremap/md5sum.txt
          md5sum -c md5sum.txt

  2. Pre-process the reference sequence of A. thaliana.

shore preprocess -f TAIR10_chr_all.fas -i indexs

  3. Import short reads of the mutant pool (or the parental line) in fastq format into SHORE analysis folders.

tpop=OC

samp=fg

shore import -v fastq -e shore -a genomic -Q sanger -x ${tpop}.${samp}.reads1.fq.gz -y ${tpop}.${samp}.reads2.fq.gz -o ${tpop}/${samp}/flowcell --rplot

  4. Align reads to the reference sequence (parallelized using 40 cores; adjust according to your computational resources).

shore mapflowcell -f ${tpop}/${samp}/flowcell -i indexs/TAIR10_chr_all.fas.shore -n 10% -g 7% -c 40 -p --rplot

  5. Correct alignments with paired-end information.

shore correct4pe -l ${tpop}/${samp}/flowcell/4 -x 250 -e 1

  6. Merge alignments.

shore merge -m ${tpop}/${samp}/flowcell -o ${tpop}/${samp}/alignment -p

  7. Call differences between sample and reference sequence (including natural variation and novel mutations).

shore consensus -n ${tpop}.${samp} -f indexs/TAIR10_chr_all.fas.shore -o ${tpop}/${samp}/consensus -i ${tpop}/${samp}/alignment/map.list.gz -a scoring_matrix_het.txt -g 5 -v -r

  8. Perform resequencing for the Ler parent by repeating steps 3 to 7 with 'tpop=OC, samp=bg' in step 3 and changing 'flowcell/4' to 'flowcell/1' in step 5.

  9. Perform resequencing for the parent mir159a by repeating steps 3 to 7 with 'tpop=BC, samp=bg' in step 3 and changing 'flowcell/4' to 'flowcell/1' in step 5 (see Note 1).

  10. Create an additional folder for collecting markers.

mkdir OC/marker_creation

  11. Combine all the candidate markers according to the parental lines.

cat OC/bg/consensus/ConsensusAnalysis/quality_variant.txt BC/bg/consensus/ConsensusAnalysis/quality_variant.txt > OC/marker_creation/ler_col_combined_quality_variant.txt

  12. Decompress the consensus information of the pooled mapping population.

gunzip OC/fg/consensus/ConsensusAnalysis/supplementary_data/consensus_summary.txt.gz

  13. Extract the consensus base calls for all the candidate markers. The result will be recorded in file extracted_consensus_0.txt.

SHOREmap extract --chrsizes chrSizes.txt --folder OC/marker_creation --marker OC/marker_creation/ler_col_combined_quality_variant.txt --consen OC/fg/consensus/ConsensusAnalysis/supplementary_data/consensus_summary.txt -verbose

  14. Compress the consensus call information of the pooled mapping population.

gzip OC/fg/consensus/ConsensusAnalysis/supplementary_data/consensus_summary.txt

  15. Decompress the quality-reference calls of the parental lines.

gunzip BC/bg/consensus/ConsensusAnalysis/quality_reference.txt.gz

gunzip OC/bg/consensus/ConsensusAnalysis/quality_reference.txt.gz

  16. Extract quality-reference bases of one parent at the positions of quality-variants that have been called in the other background.

SHOREmap extract --chrsizes chrSizes.txt --folder OC/marker_creation --marker OC/bg/consensus/ConsensusAnalysis/quality_variant.txt --extract-bg-ref --consen BC/bg/consensus/ConsensusAnalysis/quality_reference.txt --row-first 15 -verbose

SHOREmap extract --chrsizes chrSizes.txt --folder OC/marker_creation --marker BC/bg/consensus/ConsensusAnalysis/quality_variant.txt --extract-bg-ref --consen OC/bg/consensus/ConsensusAnalysis/quality_reference.txt --row-first 51 -verbose

  17. Compress the quality-reference calls of the parental lines.

gzip BC/bg/consensus/ConsensusAnalysis/quality_reference.txt

gzip OC/bg/consensus/ConsensusAnalysis/quality_reference.txt

  18. Create markers with resequencing information of the parental lines. Markers will be collected in file SHOREmap_created_F2Pab_specific.txt (see Note 2).

SHOREmap create --chrsizes chrSizes.txt --folder OC/marker_creation --marker OC/fg/consensus/ConsensusAnalysis/quality_variant.txt --marker-pa OC/bg/consensus/ConsensusAnalysis/quality_variant.txt --marker-pb BC/bg/consensus/ConsensusAnalysis/quality_variant.txt --bg-ref-base-pb OC/marker_creation/extracted_quality_ref_base_15.txt --bg-ref-base-pa OC/marker_creation/extracted_quality_ref_base_51.txt --pmarker-score 40 --pmarker-min-cov 30 --pmarker-max-cov 75 --pmarker-min-freq 0.9 --bg-ref-score 40 --bg-ref-cov 33 --bg-ref-cov-max 70 --bg-ref-freq 0.9 -verbose

  19. Analyze AFs within the OCF2 pool (see Notes 3 and 4). Figure 1 summarizes the visualization of AFs at markers along each chromosome. The analysis predicts a 215 kb interval located at 18,595,000 to 18,809,999 on chromosome 2.

SHOREmap outcross --chrsizes chrSizes.txt --folder OC/SHOREmap_analysis --marker OC/marker_creation/SHOREmap_created_F2Pab_specific.txt --consen OC/marker_creation/extracted_consensus_0.txt --min-marker 5 -plot-boost -plot-scale --window-step 5000 --window-size 200000 --interval-min-mean 0.997 --interval-max-cvar 0.01 --min-coverage 20 --max-coverage 80 --marker-score 25 --fg-INDEL-cov 0 --marker-hit 1 --fg-N-cov 0 -plot-win --cluster 1 -rab -background2 -verbose

  20. Annotate mutations within the mapping interval (see Note 5). This will reveal two mutations with effects on genes. The first one is a C→T mutation on position 18,774,111, which results in a premature stop codon in AT2G45550. The other mutation is also a C→T mutation on position 18,808,927 affecting a splice site of AT2G45660 (SOC1), which was verified as the causal gene [5].

SHOREmap annotate --chrsizes chrSizes.txt --folder OC/SHOREmap_analysis/annotation --snp OC/fg/consensus/ConsensusAnalysis/quality_variant.txt --chrom 2 --start 18595000 --end 18809999 --genome indexs/TAIR10_chr_all.fas.shore --gff TAIR10_GFF3_genes.gff
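Because the import, alignment, correction, merge and consensus commands (steps 3 to 7) are repeated unchanged for every sample, they can be wrapped in a small shell function. The sketch below only re-uses the commands and options shown in this section; the names run, reseq_with_shore, DRY_RUN and CORES are hypothetical conveniences and not part of SHORE or SHOREmap.

```shell
# Hypothetical wrapper around steps 3 to 7 of this section.
# With DRY_RUN set, commands are printed instead of executed.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Usage: reseq_with_shore <tpop> <samp> <lane>
# <lane> is the flowcell sub-folder used by correct4pe (4 for OC/fg, 1 otherwise).
reseq_with_shore() {
  tpop=$1; samp=$2; lane=$3
  dir=${tpop}/${samp}
  run shore import -v fastq -e shore -a genomic -Q sanger \
      -x "${tpop}.${samp}.reads1.fq.gz" -y "${tpop}.${samp}.reads2.fq.gz" \
      -o "${dir}/flowcell" --rplot &&
  run shore mapflowcell -f "${dir}/flowcell" -i indexs/TAIR10_chr_all.fas.shore \
      -n 10% -g 7% -c "${CORES:-40}" -p --rplot &&
  run shore correct4pe -l "${dir}/flowcell/${lane}" -x 250 -e 1 &&
  run shore merge -m "${dir}/flowcell" -o "${dir}/alignment" -p &&
  run shore consensus -n "${tpop}.${samp}" -f indexs/TAIR10_chr_all.fas.shore \
      -o "${dir}/consensus" -i "${dir}/alignment/map.list.gz" \
      -a scoring_matrix_het.txt -g 5 -v -r
}

# Preview the commands for the two parental samples before running them:
# DRY_RUN=1 reseq_with_shore OC bg 1
# DRY_RUN=1 reseq_with_shore BC bg 1
```

Chaining the commands with && stops the run at the first failing step, so a failed alignment is not silently followed by an empty consensus call.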

3.2 SHOREmapping of a recessive mutation within a backcrossing population (resequencing performed with SHORE)

Assume the current working directory is example/.

  1. Perform resequencing for BCF2 pool by repeating steps 3 to 7 of Section 3.1 with 'tpop=BC, samp=fg' in step 3 and changing 'flowcell/4' to 'flowcell/1' in step 5.

  2. Decompress the consensus call information of the mapping population.

gunzip BC/fg/consensus/ConsensusAnalysis/supplementary_data/consensus_summary.txt.gz

  3. Extract the consensus information for candidate markers.

SHOREmap extract --chrsizes chrSizes.txt --folder BC/SHOREmap_analysis --marker BC/fg/consensus/ConsensusAnalysis/quality_variant.txt --consen BC/fg/consensus/ConsensusAnalysis/supplementary_data/consensus_summary.txt -verbose

  4. Compress the consensus call information of the mapping population.

gzip BC/fg/consensus/ConsensusAnalysis/supplementary_data/consensus_summary.txt

  5. Analyze AFs in the BCF2 population. Figure 2 summarizes the visualization of AFs at the markers. A mapping interval is estimated from position 1 to 4,000,000 on chromosome 3 (see Note 3).

SHOREmap backcross --chrsizes chrSizes.txt --marker BC/fg/consensus/ConsensusAnalysis/quality_variant.txt --consen BC/SHOREmap_analysis/extracted_consensus_0.txt --folder BC/SHOREmap_analysis -plot-bc --marker-score 40 --marker-freq 0.0 --min-coverage 10 --max-coverage 80 --bg BC/bg/consensus/ConsensusAnalysis/quality_variant.txt --bg-cov 1 --bg-freq 0.0 --bg-score 1 -non-EMS --cluster 1 --marker-hit 1 -verbose

  6. Annotate mutations (see Note 6). This reveals three mutations with effects on genes. The first one is a C→T mutation on position 82,825 of AT3G01270, which changes the 3’UTR. The second C→T mutation on position 1,405,085 of AT3G05040 results in a premature stop codon. The third mutation is also a C→T mutation on position 3,057,628 of AT3G09940, which results in an E→K amino acid change. The mutation in AT3G05040 was validated as the causal mutation [14].

SHOREmap annotate --chrsizes chrSizes.txt --folder BC/SHOREmap_analysis/ann --snp BC/SHOREmap_analysis/SHOREmap_marker.bg_corrected --chrom 3 --start 1 --end 4000000 --genome indexs/TAIR10_chr_all.fas.shore --gff TAIR10_GFF3_genes.gff

3.3 SHOREmap analysis with resequencing by Bowtie2 and SAMtools

The following example focuses on the outcrossing data only. The backcrossing data can be processed similarly. Suppose the current working directory is example/.

  1. Create a folder, in which the reference sequence will be indexed by Bowtie2.

mkdir indexb

  2. Index reference sequence.

bowtie2-build -f TAIR10_chr_all.fas indexb/TAIR10_chr_all.fas.bowtie2

  3. Set parameters about the sample and create a folder for recording results.

tpop=OC

samp=fg

mkdir -p ${tpop}/${samp}/bowtie2SAMtools

  4. Map the reads to the reference genome with Bowtie2.

bowtie2 -x indexb/TAIR10_chr_all.fas.bowtie2 -1 ${tpop}.${samp}.reads1.fq.gz -2 ${tpop}.${samp}.reads2.fq.gz -S ${tpop}/${samp}/bowtie2SAMtools/bt2_${tpop}${samp}_PE.sam

  5. Change the working directory and convert the respective SAM file to a BAM file.

cd ${tpop}/${samp}/bowtie2SAMtools

samtools view -bS -o bt2_${tpop}${samp}_PE.bam bt2_${tpop}${samp}_PE.sam

  6. Sort the BAM file.

samtools sort bt2_${tpop}${samp}_PE.bam bt2_${tpop}${samp}_PE.sorted

  7. Call the consensus bases and record them in a VCF4.1 file.

samtools mpileup -uD -f ../../../TAIR10_chr_all.fas bt2_${tpop}${samp}_PE.sorted.bam | bcftools view -cg - > bt2_${tpop}${samp}_PE.raw.all.vcf

  8. Convert the VCF4.1 file into the file format required for SHOREmap analysis and change the working directory to example/. This function converts a VCF4.1 file into three files, namely 11_converted_consen.txt, 11_converted_variant.txt, and 11_converted_reference.txt, which respectively contain information of consensus bases, SNP variations, and high-quality reference bases.

SHOREmap convert --marker bt2_${tpop}${samp}_PE.raw.all.vcf --folder convert -runid 11

cd ../../../

  9. Set 'tpop=OC, samp=bg' in step 3, and repeat steps 3 to 8.

  10. Set 'tpop=BC, samp=bg' in step 3, and repeat steps 3 to 8.

  11. Create an additional folder for collecting markers.

mkdir OC/marker_creation_bt2

  12. Combine all the candidate markers.

cat OC/bg/bowtie2SAMtools/convert/11_converted_variant.txt BC/bg/bowtie2SAMtools/convert/11_converted_variant.txt > OC/marker_creation_bt2/ler_col_combined_quality_variant.txt

  13. Extract the information of the consensus base calls of the mapping population for all the candidate markers.

SHOREmap extract --chrsizes chrSizes.txt --folder OC/marker_creation_bt2 --marker OC/marker_creation_bt2/ler_col_combined_quality_variant.txt --consen OC/fg/bowtie2SAMtools/convert/11_converted_consen.txt -verbose

  14. Extract quality-reference bases of one parent at the positions of quality-variants that have been called in the other parental genome.

SHOREmap extract --chrsizes chrSizes.txt --folder OC/marker_creation_bt2 --marker OC/bg/bowtie2SAMtools/convert/11_converted_variant.txt --extract-bg-ref --consen BC/bg/bowtie2SAMtools/convert/11_converted_reference.txt --row-first 15 -verbose

SHOREmap extract --chrsizes chrSizes.txt --folder OC/marker_creation_bt2 --marker BC/bg/bowtie2SAMtools/convert/11_converted_variant.txt --extract-bg-ref --consen OC/bg/bowtie2SAMtools/convert/11_converted_reference.txt --row-first 51 -verbose

  15. Create markers with resequencing information of the parental lines.

SHOREmap create --chrsizes chrSizes.txt --folder OC/marker_creation_bt2 --marker OC/fg/bowtie2SAMtools/convert/11_converted_variant.txt --marker-pa OC/bg/bowtie2SAMtools/convert/11_converted_variant.txt --marker-pb BC/bg/bowtie2SAMtools/convert/11_converted_variant.txt --bg-ref-base-pb OC/marker_creation_bt2/extracted_quality_ref_base_15.txt --bg-ref-base-pa OC/marker_creation_bt2/extracted_quality_ref_base_51.txt --pmarker-score 130 --pmarker-min-cov 10 --pmarker-max-cov 80 --pmarker-min-freq 0.88 --bg-ref-score 130 --bg-ref-cov 10 --bg-ref-cov-max 80 --bg-ref-freq 0.88 -verbose

  16. Analyze AFs of the OCF2 population. Visualization of AFs is similar to the one shown in Figure 1. The analysis should predict a 305 kb mapping interval located from 18,505,000 to 18,809,999 on chromosome 2.

SHOREmap outcross --chrsizes chrSizes.txt --folder OC/SHOREmap_analysis_bt2 --marker OC/marker_creation_bt2/SHOREmap_created_F2Pab_specific.txt --consen OC/marker_creation_bt2/extracted_consensus_0.txt --min-marker 5 -plot-boost -plot-scale --window-step 5000 --window-size 200000 --interval-min-mean 0.976 --interval-max-cvar 0.04 --min-coverage 20 --max-coverage 80 --marker-score 25 --fg-N-cov 4 -plot-win --cluster 1 -rab -background2 -verbose

  17. Annotate mutations of the OCF2 population within the mapping interval. The annotation should give the same striking mutations as the final (annotation) step of Section 3.1.

SHOREmap annotate --chrsizes chrSizes.txt --folder OC/SHOREmap_analysis_bt2/annotation --snp OC/fg/bowtie2SAMtools/convert/11_converted_variant.txt --chrom 2 --start 18505000 --end 18809999 --genome indexs/TAIR10_chr_all.fas.shore --gff TAIR10_GFF3_genes.gff

4. Notes

  1. Even if one of the parental genotypes of the map cross is the reference line (either the mutant is induced in the reference strain or a non-reference strain mutant is crossed to the reference line) and ideally all its alleles are represented within the reference sequence, we find it advantageous to sequence this reference line again. This provides information on the quality of the reference bases, which need to be as accessible (i.e. unique in the genome) as the diverged alleles, and on other short read analysis artifacts.

  2. The number of markers affects the accuracy of the mapping interval. A more accurate mapping interval can be identified if more markers are provided. However, including more markers typically increases the fraction of wrong markers (i.e. SNPs where the two different alleles cannot be aligned with the same quality), which can decrease the accuracy of the mapping interval. Tuning marker selection can thus affect the mapping interval (by adjusting the options --pmarker-score, --pmarker-min-cov, --pmarker-max-cov, --pmarker-min-freq, --bg-ref-score, --bg-ref-cov, --bg-ref-cov-max, and --bg-ref-freq). A very effective way of removing wrong markers is to exclude those with extreme coverage values.

  3. There are typically only a few mutations even in large mapping intervals, and these can be further prioritized according to their effects on genes. Therefore, we tend to work with a mapping interval that is larger than the one suggested by the allele frequency pattern or mapping interval calculation, in order to minimize the risk of excluding the causal mutation.

  4. In case there is a skew in the allele frequency pattern, but no mapping interval is predicted, adjusting the parameters can resolve this. Decreasing the value of θmin while increasing the value of Cvmax loosens the constraints on defining a mapping interval. Parameter tuning can be performed in addition to creating a new marker list as discussed in Note 2.

  5. It is possible that there is no mutation (or no good candidate mutation) within a given mapping interval. In this case, more mutations can be included by gradually extending the mapping interval suggested by the pattern of AFs for annotation, or by loosening the criteria for marker creation (for SHOREmapping of outcrossing data) or background correction (for SHOREmapping of backcrossing data).

  6. For SHOREmapping of backcrossing data, the final list of mutations provided for annotation is background corrected, which means that only mutant-specific mutations will be considered. However, as background correction removes mutations from the list of putative candidate mutations, it risks excluding the causal mutation. To be safe, mutations with low read support, base quality, and AFs (in file SHOREmap_marker.bg_corrected) should also be annotated, in particular if the original list of background-corrected mutations did not reveal good candidate mutations.


References

1. Ossowski, S. et al. (2008) Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res 18, 2024-33.

2. Nordström, K. J. et al. (2013) Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nat Biotechnol 31, 325-30.

3. Schneeberger, K. et al. (2009) SHOREmap: simultaneous mapping and mutation identification by deep sequencing. Nat Methods 6, 550-1.

4. Schneeberger, K. and Weigel, D. (2011) Fast-forward genetics enabled by new sequencing technologies. Trends Plant Sci 16, 282-8.

5. Galvão, V. C. et al. (2012) Synteny-based Mapping-by-Sequencing enabled by Targeted Enrichment. Plant J 71, 517-26.

6. Schneeberger, K. (2014) Using next-generation sequencing to isolate mutant genes from forward genetic screens. Nature Rev. Genet. In Press.

7. Austin, R. S. et al. (2011) Next-Generation Mapping of Arabidopsis Genes. Plant J 67, 715-25.

8. Cuperus, J. T. et al. (2010) Identification of MIR390a precursor processing-defective mutants in Arabidopsis by direct genome sequencing. Proc Natl Acad Sci U S A 107, 466-71.

9. Lindner, H. et al. (2012) SNP-Ratio Mapping (SRM): identifying lethal alleles and mutations in complex genetic backgrounds by next-generation sequencing. Genetics 191, 1381-6.

10. Minevich, G., Park, D. S., Blankenberg, D., Poole, R. J., and Hobert, O. (2012) CloudMap: A Cloud-based Pipeline for Analysis of Mutant Genome Sequences. Genetics 192, 1249-69.

11. Leshchiner, I. et al. (2012) Mutation mapping and identification by whole genome sequencing. Genome Res 22, 1541-8.

12. Abe, A. et al. (2012) Genome sequencing reveals agronomically important loci in rice using MutMap. Nat Biotechnol 30, 174-8.

13. Hartwig, B., James, G. V., Konrad, K., Schneeberger, K., and Turck, F. (2012) Fast Isogenic Mapping-by-Sequencing of Ethyl Methanesulfonate-Induced Mutant Bulks. Plant Physiol 160, 591-600.

14. Allen, R. S., Nakasugi, K., Doran, R. L., Millar, A. A., and Waterhouse, P. M. (2013) Facile mutant identification via a single parental backcross method and application of whole genome sequencing based mapping pipelines. Frontiers in Plant Science 4,

15. Fekih, R. et al. (2013) MutMap+: Genetic Mapping and Mutant Identification without Crossing in Rice. PLoS One 8, e68529.

16. Velikkakam James, G. et al. (2013) User guide for mapping-by-sequencing in Arabidopsis. Genome Biol 14, R61.

17. Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-60.

18. Schneeberger, K. et al. (2009) Simultaneous alignment of short reads against multiple genomes. Genome Biol 10, R98.

19. Langmead, B. and Salzberg, S. L. (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357-9.

20. Li, H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-9.

21. DePristo, M. A. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491-8.

22. Danecek, P. et al. (2011) The variant call format and VCFtools. Bioinformatics 27, 2156-8.

23. Schneeberger, K. et al. (2011) Reference-guided assembly of four diverse Arabidopsis thaliana genomes. Proc Natl Acad Sci U S A 108, 10249-54.

Figure Captions


Figure 1 | Visualization of allele frequency estimations in a classical mapping population. Gray points indicate AFs as estimated on individual markers. The red line shows the average AFs within 200kb windows (with a 50kb step size). The black line follows the boost-values as calculated for the same sliding windows. The orange rectangle is the region predicted as mapping interval.

Figure 2 | Visualization of allele frequency estimations in a backcross mapping population. The points in red indicate AFs as estimated at individual mutation-markers. The region between position 1 and 4,000,000 of chromosome 3 is expected to harbour the causal mutation.

Tables


Table 1. Data used to illustrate SHOREmap analyses

Description                                    Data

Outcross analysis
  OCF2                                         OC.fg.reads1.fq.gz, OC.fg.reads2.fq.gz
  Ler                                          OC.bg.reads1.fq.gz, OC.bg.reads2.fq.gz

Backcross analysis
  BCF2                                         BC.fg.reads1.fq.gz, BC.fg.reads2.fq.gz
  mir159a (Col)                                BC.bg.reads1.fq.gz, BC.bg.reads2.fq.gz

Others
  Col reference sequence                       TAIR10_chr_all.fas
  Gene annotation                              TAIR10_GFF3_genes.gff
  Chromosome sizes                             chrSizes.txt
  Scoring matrix for base calling with SHORE   scoring_matrix_het.txt