Examples

Here we provide examples (performed with synthetic data as well as real data) which help one to get familiar with the procedure of a SHOREmap analysis in a few minutes. For a test on real data (listed in Table 1), please go to SHOREmap_v3.x with real data (* you may need to check md5 of the data to be sure that data are downloaded completely).

Table 1. Data used to illustrate SHOREmap analyses (for SHOREmap_v3.x with real data only; click file name to download manually)

Description

Data

Outcross analysis

OCF2

OC.fg.reads1.fq.gz
OC.fg.reads2.fq.gz

Ler

OC.bg.reads1.fq.gz

OC.bg.reads2.fq.gz

Backcross analysis

BCF2

BC.fg.reads1.fq.gz

BC.fg.reads2.fq.gz

mir159a (Col)

BC.bg.reads1.fq.gz

BC.bg.reads2.fq.gz

Others

Col reference sequence

TAIR10_chr_all.fas

Gene annotation

TAIR10_GFF3_genes.gff

Chromosomes sizes

chrSizes.txt

Scoring matrix for base calling with SHORE

scoring_matrix_het.txt

MD5 for *fastq.gz files

md5sum.txt

The examples below are based on an imaginary plant (S.imulated), from which we isolated a single mutant plant. This mutant plant was backcrossed to a wildtype. In the F2 population we selected 100 mutant plants and sent their pooled DNA for whole-genome sequencing.

After downloading SHROEmap_v3.x (need to replace x with sub-version number; latest v3.2), we can find the sequencing data in folder examples/. Get into the folder with the command

cd examples/
ls -lahF

Sub-folder examples/data contains necessary information, including sizes of chromosomes: chrsizes.txt (supposing we are interested in chrosomes 1 and 2 for this mutant); GFF annotation of genes: S.imulated.gff ; and the reference chromosomes in fasta format: S.imulated.ref.fa . Folder examples/genome_center gives the file containing unpaired sequencing reads: read_data.fq.gz . Based on all the above data, the next-step analysis consists of two main steps, namely, resequencing by using SHORE or bowtie2/SAMtools and mapping-by-sequencing by using SHOREmap v3.x.

Working with SHORE

SHORE is a pipeline that can achieve short read alignment as well as SNPS and indels calling. In the following exmaple, SHORE v0.8unstable (rev. 3254-2108) is used.

Getting started with the sequencing data

Open a new terminal, get into the folder SHROEmap_v3.x/examples with the command

cd path/to/SHROEmap_v3.x/examples/

Now we create a new folder named withSHORE to store the corresponding results

mkdir withSHORE

Change to the folder genome_center with the command

cd genome_center/

Uncompress the read file with the command

gunzip read_data.fq.gz

Now we change into the folder ../data with the command

cd ../data/
ls -lahF

In order to allow short read alignment tools to be quick, we have to index the reference sequence S.imulated.ref.fa. We can do index with the command (please type shore preprocess to check the options beforehand)

shore preprocess -f S.imulated.ref.fa -i ../withSHORE/index

Then there is a folder index resulted in folder withSHORE, where S.imulated.ref.fa.shore is used in the following analysis

Resequencing

Now we change into the folder genome_center with the command

cd ../genome_center/

SHORE requires a special format on the short reads for processing, for which we can prepare with the command

shore import -v fastq -e Shore -a genomic -x read_data.fq -o ../withSHORE/flowcell

We can check the statistics of import with the command

less ../withSHORE/flowcell/import_statistic.txt

Based on all the above, we can map the reads in ../withSHORE/flowcell/1/single/reads_0.fl.gz to the indexed reference sequence with the command (remember to check options of shore mapflowcell)

shore mapflowcell -f ../withSHORE/flowcell -i ../withSHORE/index/S.imulated.ref.fa.shore -n 10% -g 7%

Option '-n' tells the program that the maximum number of edit distances is L*10%; '-g' tells the program that the maximum number of mismatches is L*7%, where L is the length of reads.

The result file map.list.gz is included in folder ../withSHORE/flowcell/1/single. We can check the mapping log to see what we have used and obtained with the command

less ../withSHORE/flowcell/mapping.log

Finally, we can run the consensus-calling program to predict the mutations (SNPs, indels, and SV) with the command

shore consensus -n example1 -f ../withSHORE/index/S.imulated.ref.fa.shore -o ../withSHORE/consensus_het -i ../withSHORE/flowcell/1/single/map.list.gz -g 5 -a /projects/dep_coupland/grp_nordstrom/bin/shore/shore_current/shore/Analysis/scoring_matrices/scoring_matrix_het.txt -v

Option '-g' tells the program to trust only core of reads. In the example, the first 5 positions and the last 5 positions of each read are not trusted when calling variations. More importantly, ensure that '-v' is turned on, which tells shore consensus to create additional files containing all intermediate data that are required for subsequent SHOREmap analysis. We can also check the log file to see what we have used and obtained with the command

less ../withSHORE/consensus_het/config.log

Now we have finished resequencing, and all the necessary information for SHOREmap analysis is avaibale in folder ../withSHORE/consensus_het/ConsensusAnalysis.

Mapping-by-sequencing

SHOREmap accepts files resulted from SHORE wihtout any further modification. However, files like consensus_summary.txt.gz can be as large as tens to hundreds of gygabytes, which might be repeatedly parsed during SHOREmap analysis. To save file parsing time in future analysis, we extract the consensus information only related to the candidate SNP markers. For convenience of manipulation, we change to folder withSHORE with the command

cd ../withSHORE

Firstly, we have to unzip consensus_summary.txt.gz for further processing, with the command

gunzip consensus_het/ConsensusAnalysis/supplementary_data/consensus_summary.txt.gz

Check the content in consensus_summary.txt with the command

less consensus_het/ConsensusAnalysis/supplementary_data/consensus_summary.txt

We extract the necessary consensus information according to consensus_het/ConsensusAnalysis/quality_variant.txt (= candidate markers) with the command,

SHOREmap extract --chrsizes ../data/chrsizes.txt --marker consensus_het/ConsensusAnalysis/quality_variant.txt --consen consensus_het/ConsensusAnalysis/supplementary_data/consensus_summary.txt --folder consensus_het/ConsensusAnalysis

The result is consensus_het/ConsensusAnalysis/extracted_consensus_0.txt.

Now we can use SHOREmap backcross to do background correction according to AF and base call quality, with the command

SHOREmap backcross --chrsizes ../data/chrsizes.txt --marker consensus_het/ConsensusAnalysis/quality_variant.txt --consen consensus_het/ConsensusAnalysis/extracted_consensus\_0.txt --folder MBS --marker-score 2 --marker-freq 0.2 -plot-bc

Now we can list and check files the folder MBS and list results with the commands,

ls MBS -lahF

The visualization of AFs is in BC_AF_visualization_1_boost.pdf, open it with the command,

xpdf MBS/BC_AF_visualization_1_boost.pdf

Now we can see that the region in orange which can contain the causal mutation of the phenotype. Meanwhile, there are predcited peaks (positions of candidate mutations). We can check the peaks with the command

less MBS/SHOREmap_boosted_peaks.txt

For one chromosome, there can be more than one peak. However, these peaks should not be far away from each other. We can annotate the mutations in the candidate region mapped according to their effects on genes. To avoid missing any SNP information, we annotate the initially converted marker file quality_variant.txt. We can also tune the size of the predicted region to do annotations. Annotation can be done like this:

SHOREmap annotate --chrsizes ../data/chrsizes.txt --snp consensus_het/ConsensusAnalysis/quality_variant.txt --chrom 1 --start 2500 --end 6500 --folder MBS_annotation --genome ../data/S.imulated.ref.fa --gff ../data/S.imulated.gff -verbose

The result file is in folder MBS_annotation which is named prioritized_snp_1_2500_6500_peak1.txt (as no peak information is provided, position 1 plays as the 'peak'). Check the file with the command

less MBS_annotation/prioritized_snp_1_2500_6500_peak1.txt

It describes mutations and their effect annotated according to gene features. For example non-synonymous changes will be more likely to cause the phenotype than synonymous ones. Meanwhile, there is an additional file ref_and_eco_coding_seq.txt, which records 5 lines for each targeted mutation of a gene. Open it with the command

less MBS_annotation/ref_and_eco_coding_seq.txt

and type in -S (capitalized) to chop long lines for a comfortable view.

Working with SAMtools

Note that SAMtools itself does not do read alignment. It works with other alignment tools giving alignment in the SAM format. SAMtools then parses and manipulates alignments in the SAM format. In this example, we will use bowtie2 version 2.1.0 for linux x_86_64 for read alignment. After read alignment, we will use SAMtools to call variations recorded in VCF4.1, which then will be analyzed by SHOREmap.

Getting started with the sequencing data

Open a terminal. Get into folder SHROEmap_v3.x/examples/, we create a new folder withSAMtools to store results

mkdir withSAMtools

Get into the result folder

cd withSAMtools

Uncompress the read file with the command (if it is compressed)

gunzip ..genome_center/read_data.fq.gz

Smilarly, in order to allow short read alignment tools to be quick. The reference sequence needs to be indexed by using bowtie2-build. We can check command options of bowtie2 with the command

path/to/bowtie2-2.1.0/bowtie2-build

Please REPLACE 'path/to/' with your own specification of installation folder. The same operations should be carried out for the following commands, which will not be reminded. We need to create an index folder in ../withSAMtools/ to store indexed files with the command

mkdir index/

Now we can do the index with the command

path/to/bowtie2-2.1.0/bowtie2-build -f ../data/S.imulated.ref.fa index/S.imulated.index

If the index is successfully finished, there are 6 *.bt2 files generated, which you can list by the command

ls -lahF index/

Resequencing

Now, we need to align the short reads, which can be done with the bowtie2,

path/to/bowtie2-2.1.0/bowtie2 -x index/S.imulated.index -U ../genome_center/read_data.fq -S S.imulated.sam

If successfully finished, the following information will be print on screen,

24000 reads; of these:
24000 (100.00\%) were unpaired; of these:
440 (1.83\%) aligned 0 times
17041 (71.00\%) aligned exactly 1 time
6519 (27.16\%) aligned >1 times
98.17\% overall alignment rate

The result file is S.imulated.sam. Look at it with cmd,

less S.imulated.sam

In order to be able to continue working with it we need to compress it (important when working with real data) with the command

/path/to/samtools-0.1.19/samtools view -bS -o S.imulated.bam S.imulated.sam

The result file is S.imulated.bam. Now, we need to sort this file, in order to make it easy to find overlapping alignments,

/path/to/samtools-0.1.19/samtools sort S.imulated.bam S.imulated.sorted

The result file is S.imulated.sorted.bam.

Finally, we can run the consensus-calling program that predicts the mutations ('-v': only variants will be output),

path/to/samtools-0.1.19/samtools mpileup -uD -f ../data/S.imulated.ref.fa S.imulated.sorted.bam | bcftools view -vcg - > S.imulated.vcf

The result file is S.imulated.vcf. Check it with the command

less S.imulated.vcf

Mapping-by-sequencing

Let us stay in folder withSAMtools. First, we need to modify the file format of the .vcf into one that SHOREmap can understand. Specifically, SAMtools generates variant call files in VCF4.1. If your VCF format is different from what is generated by SAMtools, please be careful to check if the following convert is done as expeted. We would try to consider more general format, so thanks for your feedback (if any).

Now let us convert VCF4.1 to files that can be recognized by SHOREmap,

SHOREmap convert --marker S.imulated.vcf --folder MBS -runid 1

The result files are under the folder MBS, including 1_converted_consen_info.txt and 1_converted_marker_info.txt.

Enter folder MBS with the command

cd MBS

Then the following steps are similar to what we have done with SHORE. Do background correction,

SHOREmap backcross --chrsizes ../../data/chrsizes.txt --marker 1_converted_marker_info.txt --consen 1_converted_consen_info.txt --folder Mapping --marker-score 2 --marker-freq 0.2 --window-size 500 --window-step 100 -plot-bc -plot-boost

Check the result in pdf files with the command

xpdf Mapping/BC_AF_visualization_1_boost.pdf

Check the peaks with the command

less Mapping/SHOREmap_boosted_peaks.txt

Annotate the mutations,

SHOREmap annotate --chrsizes ../../data/chrsizes.txt --snp 1_converted_marker_info.txt --chrom 1 --start 2500 --end 6500 --folder \Mapping_annotation --genome ../../data/S.imulated.ref.fa --gff ../../data/S.imulated.gff -verbose

Check the prioritized SNPs with the command

less Mapping_annotation/prioritized_snp_1_2500_6500_peak1.txt

Check the coding DNA sequences and translated proteins with the command

less Mapping_annotation/ref_and_eco_coding_seq.txt

----RETURN TO BEGINNING----