Examples
Here we provide examples (performed with synthetic data as well as real data) which help one to get familiar with the procedure of a SHOREmap analysis in a few minutes. For a test on real data (listed in Table 1), please go to SHOREmap_v3.x with real data (* you may need to check md5 of the data to be sure that data are downloaded completely).
Table 1. Data used to illustrate SHOREmap analyses (for SHOREmap_v3.x with real data only; click file name to download manually)
Description |
Data |
|
Outcross analysis |
OCF2 |
|
Ler |
||
Backcross analysis |
BCF2 |
|
mir159a (Col) |
||
Others |
Col reference sequence |
|
Gene annotation |
||
Chromosomes sizes |
||
Scoring matrix for base calling with SHORE |
||
MD5 for *fastq.gz files |
The examples below are based on an imaginary plant (S.imulated), from which we isolated a single mutant plant. This mutant plant was backcrossed to a wildtype. In the F2 population we selected 100 mutant plants and sent their pooled DNA for whole-genome sequencing.
After downloading SHROEmap_v3.x (need to replace x with sub-version number; latest v3.2), we can find the sequencing data in folder examples/. Get into the folder with the command
cd examples/
ls -lahF
Sub-folder examples/data contains necessary information, including sizes of chromosomes: chrsizes.txt (supposing we are interested in chrosomes 1 and 2 for this mutant); GFF annotation of genes: S.imulated.gff ; and the reference chromosomes in fasta format: S.imulated.ref.fa . Folder examples/genome_center gives the file containing unpaired sequencing reads: read_data.fq.gz . Based on all the above data, the next-step analysis consists of two main steps, namely, resequencing by using SHORE or bowtie2/SAMtools and mapping-by-sequencing by using SHOREmap v3.x.
Working with SHORE
SHORE is a pipeline that can achieve short read alignment as well as SNPS and indels calling. In the following exmaple, SHORE v0.8unstable (rev. 3254-2108) is used.
Getting started with the sequencing data
Open a new terminal, get into the folder SHROEmap_v3.x/examples with the command
cd path/to/SHROEmap_v3.x/examples/
Now we create a new folder named withSHORE to store the corresponding results
mkdir withSHORE
Change to the folder genome_center with the command
cd genome_center/
Uncompress the read file with the command
gunzip read_data.fq.gz
Now we change into the folder ../data with the command
cd ../data/
ls -lahF
In order to allow short read alignment tools to be quick, we have to index the reference sequence S.imulated.ref.fa. We can do index with the command (please type shore preprocess to check the options beforehand)
shore preprocess -f S.imulated.ref.fa -i ../withSHORE/index
Then there is a folder index resulted in folder withSHORE, where S.imulated.ref.fa.shore is used in the following analysis
Resequencing
Now we change into the folder genome_center with the command
cd ../genome_center/
SHORE requires a special format on the short reads for processing, for which we can prepare with the command
shore import -v fastq -e Shore -a genomic -x read_data.fq -o ../withSHORE/flowcell
We can check the statistics of import with the command
less ../withSHORE/flowcell/import_statistic.txt
Based on all the above, we can map the reads in ../withSHORE/flowcell/1/single/reads_0.fl.gz to the indexed reference sequence with the command (remember to check options of shore mapflowcell)
shore mapflowcell -f ../withSHORE/flowcell -i ../withSHORE/index/S.imulated.ref.fa.shore -n 10% -g 7%
Option '-n' tells the program that the maximum number of edit distances is L*10%; '-g' tells the program that the maximum number of mismatches is L*7%, where L is the length of reads.
The result file map.list.gz is included in folder ../withSHORE/flowcell/1/single. We can check the mapping log to see what we have used and obtained with the command
less ../withSHORE/flowcell/mapping.log
Finally, we can run the consensus-calling program to predict the mutations (SNPs, indels, and SV) with the command
shore consensus -n example1 -f ../withSHORE/index/S.imulated.ref.fa.shore -o ../withSHORE/consensus_het -i ../withSHORE/flowcell/1/single/map.list.gz -g 5 -a /projects/dep_coupland/grp_nordstrom/bin/shore/shore_current/shore/Analysis/scoring_matrices/scoring_matrix_het.txt -v
Option '-g' tells the program to trust only core of reads. In the example, the first 5 positions and the last 5 positions of each read are not trusted when calling variations. More importantly, ensure that '-v' is turned on, which tells shore consensus to create additional files containing all intermediate data that are required for subsequent SHOREmap analysis. We can also check the log file to see what we have used and obtained with the command
less ../withSHORE/consensus_het/config.log
Now we have finished resequencing, and all the necessary information for SHOREmap analysis is avaibale in folder ../withSHORE/consensus_het/ConsensusAnalysis.
Mapping-by-sequencing
SHOREmap accepts files resulted from SHORE wihtout any further modification. However, files like consensus_summary.txt.gz can be as large as tens to hundreds of gygabytes, which might be repeatedly parsed during SHOREmap analysis. To save file parsing time in future analysis, we extract the consensus information only related to the candidate SNP markers. For convenience of manipulation, we change to folder withSHORE with the command
cd ../withSHORE
Firstly, we have to unzip consensus_summary.txt.gz for further processing, with the command
gunzip consensus_het/ConsensusAnalysis/supplementary_data/consensus_summary.txt.gz
Check the content in consensus_summary.txt with the command
less consensus_het/ConsensusAnalysis/supplementary_data/consensus_summary.txt
We extract the necessary consensus information according to consensus_het/ConsensusAnalysis/quality_variant.txt (= candidate markers) with the command,
SHOREmap extract --chrsizes ../data/chrsizes.txt --marker consensus_het/ConsensusAnalysis/quality_variant.txt --consen consensus_het/ConsensusAnalysis/supplementary_data/consensus_summary.txt --folder consensus_het/ConsensusAnalysis
The result is consensus_het/ConsensusAnalysis/extracted_consensus_0.txt.
Now we can use SHOREmap backcross to do background correction according to AF and base call quality, with the command
SHOREmap backcross --chrsizes ../data/chrsizes.txt --marker consensus_het/ConsensusAnalysis/quality_variant.txt --consen consensus_het/ConsensusAnalysis/extracted_consensus\_0.txt --folder MBS --marker-score 2 --marker-freq 0.2 -plot-bc
Now we can list and check files the folder MBS and list results with the commands,
ls MBS -lahF
The visualization of AFs is in BC_AF_visualization_1_boost.pdf, open it with the command,
xpdf MBS/BC_AF_visualization_1_boost.pdf
Now we can see that the region in orange which can contain the causal mutation of the phenotype. Meanwhile, there are predcited peaks (positions of candidate mutations). We can check the peaks with the command
less MBS/SHOREmap_boosted_peaks.txt
For one chromosome, there can be more than one peak. However, these peaks should not be far away from each other. We can annotate the mutations in the candidate region mapped according to their effects on genes. To avoid missing any SNP information, we annotate the initially converted marker file quality_variant.txt. We can also tune the size of the predicted region to do annotations. Annotation can be done like this:
SHOREmap annotate --chrsizes ../data/chrsizes.txt --snp consensus_het/ConsensusAnalysis/quality_variant.txt --chrom 1 --start 2500 --end 6500 --folder MBS_annotation --genome ../data/S.imulated.ref.fa --gff ../data/S.imulated.gff -verbose
The result file is in folder MBS_annotation which is named prioritized_snp_1_2500_6500_peak1.txt (as no peak information is provided, position 1 plays as the 'peak'). Check the file with the command
less MBS_annotation/prioritized_snp_1_2500_6500_peak1.txt
It describes mutations and their effect annotated according to gene features. For example non-synonymous changes will be more likely to cause the phenotype than synonymous ones. Meanwhile, there is an additional file ref_and_eco_coding_seq.txt, which records 5 lines for each targeted mutation of a gene. Open it with the command
less MBS_annotation/ref_and_eco_coding_seq.txt
and type in -S (capitalized) to chop long lines for a comfortable view.
Working with SAMtools
Note that SAMtools itself does not do read alignment. It works with other alignment tools giving alignment in the SAM format. SAMtools then parses and manipulates alignments in the SAM format. In this example, we will use bowtie2 version 2.1.0 for linux x_86_64 for read alignment. After read alignment, we will use SAMtools to call variations recorded in VCF4.1, which then will be analyzed by SHOREmap.
Getting started with the sequencing data
Open a terminal. Get into folder SHROEmap_v3.x/examples/, we create a new folder withSAMtools to store results
mkdir withSAMtools
Get into the result folder
cd withSAMtools
Uncompress the read file with the command (if it is compressed)
gunzip ..genome_center/read_data.fq.gz
Smilarly, in order to allow short read alignment tools to be quick. The reference sequence needs to be indexed by using bowtie2-build. We can check command options of bowtie2 with the command
path/to/bowtie2-2.1.0/bowtie2-build
Please REPLACE 'path/to/' with your own specification of installation folder. The same operations should be carried out for the following commands, which will not be reminded. We need to create an index folder in ../withSAMtools/ to store indexed files with the command
mkdir index/
Now we can do the index with the command
path/to/bowtie2-2.1.0/bowtie2-build -f ../data/S.imulated.ref.fa index/S.imulated.index
If the index is successfully finished, there are 6 *.bt2 files generated, which you can list by the command
ls -lahF index/
Resequencing
Now, we need to align the short reads, which can be done with the bowtie2,
path/to/bowtie2-2.1.0/bowtie2 -x index/S.imulated.index -U ../genome_center/read_data.fq -S S.imulated.sam
If successfully finished, the following information will be print on screen,
24000 reads; of these:24000 (100.00\%) were unpaired; of these:
440 (1.83\%) aligned 0 times
17041 (71.00\%) aligned exactly 1 time
6519 (27.16\%) aligned >1 times
98.17\% overall alignment rate
The result file is S.imulated.sam. Look at it with cmd,
less S.imulated.sam
In order to be able to continue working with it we need to compress it (important when working with real data) with the command
/path/to/samtools-0.1.19/samtools view -bS -o S.imulated.bam S.imulated.sam
The result file is S.imulated.bam. Now, we need to sort this file, in order to make it easy to find overlapping alignments,
/path/to/samtools-0.1.19/samtools sort S.imulated.bam S.imulated.sorted
The result file is S.imulated.sorted.bam.
Finally, we can run the consensus-calling program that predicts the mutations ('-v': only variants will be output),
path/to/samtools-0.1.19/samtools mpileup -uD -f ../data/S.imulated.ref.fa S.imulated.sorted.bam | bcftools view -vcg - > S.imulated.vcfThe result file is S.imulated.vcf. Check it with the command
less S.imulated.vcf
Mapping-by-sequencing
Let us stay in folder withSAMtools. First, we need to modify the file format of the .vcf into one that SHOREmap can understand. Specifically, SAMtools generates variant call files in VCF4.1. If your VCF format is different from what is generated by SAMtools, please be careful to check if the following convert is done as expeted. We would try to consider more general format, so thanks for your feedback (if any).
Now let us convert VCF4.1 to files that can be recognized by SHOREmap,
SHOREmap convert --marker S.imulated.vcf --folder MBS -runid 1
The result files are under the folder MBS, including 1_converted_consen_info.txt and 1_converted_marker_info.txt.
Enter folder MBS with the command
cd MBS
Then the following steps are similar to what we have done with SHORE. Do background correction,
SHOREmap backcross --chrsizes ../../data/chrsizes.txt --marker 1_converted_marker_info.txt --consen 1_converted_consen_info.txt --folder Mapping --marker-score 2 --marker-freq 0.2 --window-size 500 --window-step 100 -plot-bc -plot-boost
Check the result in pdf files with the command
xpdf Mapping/BC_AF_visualization_1_boost.pdf
Check the peaks with the command
less Mapping/SHOREmap_boosted_peaks.txt
Annotate the mutations,
SHOREmap annotate --chrsizes ../../data/chrsizes.txt --snp 1_converted_marker_info.txt --chrom 1 --start 2500 --end 6500 --folder \Mapping_annotation --genome ../../data/S.imulated.ref.fa --gff ../../data/S.imulated.gff -verbose
Check the prioritized SNPs with the command
less Mapping_annotation/prioritized_snp_1_2500_6500_peak1.txt
Check the coding DNA sequences and translated proteins with the command
less Mapping_annotation/ref_and_eco_coding_seq.txt
----RETURN TO BEGINNING----