ssaha_pileup is a pipeline to detect SNPs and short indels under various sequencing platforms

ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/
http://www.sanger.ac.uk/Software/analysis/SSAHA2

Please contact:

Zemin Ning ( zn1@sanger.ac.uk ) for related questions.

Wrap file pileup_v0.1.tgz released - this includes all the files you need to run ssaha2 and ssaha_pileup: 
ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/pileup_v0.1.tgz

New features for ssaha2 v2.0.0:
(1) You need to add "-save" before the hash table after ssaha2 version v1.0.9.
(2) You can use "-rtype {454/solexa}" for ssaha2 version v2.0.0, but "-454" and "-solexa" still work.
(3) Pair-end reads 

./ssaha_reads reads-pairend.fastq reads-raw.fastq

This splits the long ~70 base into single end reads of ~35 bps, but read names are changed so that pair information can be traced in later process.
(4) If you have reasonable RAM, the fastest way to run ssaha2 is 

./ssaha2 "options" reference.fa reads.fastq

See SSAHA2 web site for more information.
(5) Data visualization tool included, see below.

================
Read file format
================
The ssaha_pileup pipeline is based upon the Sanger fastq file format,
where quality values are correlated with Phred calculations. Quality 
values in Solexa and Sanger formats are different, but the file format
is very similar. The reads from Solexa machines are Solexa format. This 
might tell you if your data is in Solexa format:

"If the average quality value of the first 10 bases is more than, say
60 or 70, this file must be in the Solexa/Illumina format."

To convert Solexa fastq to Sanger fastq, please go to MAQ page and download 
the maq

http://maq.sourceforge.net/

maq sol2sanger in.sol.fastq out.sanger.fastq

We will develop a file convert code for ssaha_pileup soon.

======================
Sanger capillary reads
======================
(1) Buildup hash table:
./ssaha2Build -save genome_hash -skip 12 -kmer 12 refe.fa 
This step produces five hash table files
genome_hash.base 
genome_hash.body
genome_hash.head
genome_hash.name
genome_hash.size

(2) Place reads to the genome
Input files: read file - "reads-raw.fastq"; reference file - "refe.fa"

./ssaha2 -seeds 15 -score 250 -tags 1 -diff 15 -output cigar -memory 300 -cut 5000 -disk 1 genome_hash reads-raw.fastq > ssaha.out

v1.0.9:
./ssaha2 -seeds 15 -score 250 -tags 1 -diff 15 -output cigar -memory 300 -cut 5000 -disk 1 -save genome_hash reads-raw.fastq > ssaha.out

If the dataset of "reads-raw.fastq" is very big and you may cut the read file into small files each, say, with 5000 reads:

./ssaha_split -split 5000 split reads-raw.fastq

You will have files like:
split_0000
split_0001
...

./ssaha2 -seeds 15 -score 250 -tags 1 -diff 15 -output cigar -memory 300 -cut 5000 -disk 1 genome_hash split_0000 > ssaha_0000.out
v1.0.9:
./ssaha2 -seeds 15 -score 250 -tags 1 -diff 15 -output cigar -memory 300 -cut 5000 -disk 1 -save genome_hash split_0000 > ssaha_0000.out


(3) Get the cigar line file

cat ssaha.out | egrep cigar > genome-cigar-raw.dat
or
cat ssaha_*.out | egrep cigar > genome-cigar-raw.dat

(4)Process cigar line file to sort pair reads:
(a) No pair-end data:
./ssaha_cigar genome-cigar-raw.dat genome-cigar.dat 

(b) Pair-ending reads:
(i) If the pair name format is like *.p1k, *.q1k 
./ssaha_pairs -insert 4000 genome-cigar-raw.dat genome-cigar.dat 

here 4000 is the mean insert size.

(ii) Other formats:
./ssaha_mates -insert 4000 genome-cigar-raw.dat mates genome-cigar.dat

here mates is a mate-information file like
------------------------------------------------
B9_676_FC4742_R1_1_1_195_859.p1k M000000000
B9_676_FC4742_R1_1_1_246_325.p1k M000000001
B9_676_FC4742_R1_1_1_678_287.p1k M000000002
B9_676_FC4742_R1_1_1_466_228.p1k
B9_676_FC4742_R1_1_1_370_428.p1k M000000004
B9_676_FC4742_R1_1_1_294_641.p1k M000000005
B9_676_FC4742_R1_1_1_747_279.p1k M000000006
...
B9_676_FC4742_R1_1_1_195_859.q1k M000000000
B9_676_FC4742_R1_1_1_246_325.q1k M000000001
B9_676_FC4742_R1_1_1_678_287.q1k M000000002
B9_676_FC4742_R1_1_1_370_428.q1k M000000004
B9_676_FC4742_R1_1_1_294_641.q1k M000000005
.....
------------------------------------------------
First column - read name;
Second column - mate name. 
Also B9_676_FC4742_R1_1_1_195_859.p1k and B9_676_FC4742_R1_1_1_195_859.q1k are paired reads.
B9_676_FC4742_R1_1_1_466_228.p1k is a single end read.

It should be noted that all the read names should be unique. If a read does not have a paired read, leave the mate name empty.

If your read set has different insert size, you may make a file like 
------------------------------------------------
B9_676_FC4742_R1_1_1_195_859.p1k M000000000 3000 600
B9_676_FC4742_R1_1_1_246_325.p1k M000000001 3000 600
B9_676_FC4742_R1_1_1_678_287.p1k M000000002 3000 600
B9_676_FC4742_R1_1_1_466_228.p1k
B9_676_FC4742_R1_1_1_370_428.p1k M000000004 500 100
B9_676_FC4742_R1_1_1_294_641.p1k M000000005 500 100
B9_676_FC4742_R1_1_1_747_279.p1k M000000006 500 100
...
B9_676_FC4742_R1_1_1_195_859.q1k M000000000 3000 600 
B9_676_FC4742_R1_1_1_246_325.q1k M000000001 3000 600 
B9_676_FC4742_R1_1_1_678_287.q1k M000000002 3000 600 
B9_676_FC4742_R1_1_1_370_428.q1k M000000004 500 100 
B9_676_FC4742_R1_1_1_294_641.q1k M000000005 500 100
.....
------------------------------------------------

Here "3000" and "500" are two insert sizes and "600" and "100" are the standard derivation, which
is normally 20% of the insert size.

(5) Get the name list of uniquely placed reads: 
cat genome-cigar.dat | awk '{print $2}' > name.dat

(6) Get the uniquely placed reads, in the order of "name.dat" or "genome-cigar.dat"  
./get_seqreads name.dat reads-raw.fastq genome_reads.fastq

(7) For SNP output file:
./ssaha_pileup genome-cigar.dat refe.fa genome_reads.fastq > genome_SNP.out 

(8) For pileup output file:
./ssaha_pileup -cons 1 genome-cigar.dat refe.fa genome_reads.fastq > genome_pileup.out 

  
=========
454 reads
=========
Input files: read file - "reads-raw.fastq"; reference file - "refe.fa"

If your machine has >4 GB RAM or the genome size < 1 GB

./ssaha2 -454 -seeds 5 -score 30 -kmer 13 -skip 4 -diff 0 -output cigar refe.fa reads-raw.fastq > ssaha.out

egrep cigar > genome-cigar-raw.dat

Go to (4).

Otherwise following these steps:

(1) Buildup hash table:
./ssaha2Build -save genome_hash -skip 4 -kmer 13 refe.fa 
This step produces five hash table files
genome_hash.base 
genome_hash.body
genome_hash.head
genome_hash.name
genome_hash.size

(2) Place reads to the genome

Cut the read file into small files each, say, with 50000 reads:

./ssaha_split -split 50000 split reads-raw.fastq

You will have files like:
split_0000
split_0001
...
v1.0.9:
./ssaha2 -solexa -seeds 5 -score 30 -tags 1 -skip 4 -kmer 13 -diff 0 -output cigar -memory 200 -cut 5000 -disk 1 -save genome_hash split_0000 > ssaha_0000.out
v2.0.0:
./ssaha2 -454 -kmer 13 -skip 4 -tags 1 -diff 0 -output cigar -memory 200 -disk 1 genome_hash split_0000 > ssaha_0000.out

(3) Get the cigar line file

cat ssaha_*.out | egrep cigar > genome-cigar-raw.dat

(4)Process cigar line file to sort pair reads:
(a) No pair-end data:
./ssaha_cigar genome-cigar-raw.dat genome-cigar.dat 

(b) Pair-ending reads:
(i) If the pair name format is like *.p1k, *.q1k 
./ssaha_pairs -insert 1000 genome-cigar-raw.dat genome-cigar.dat 

here 1000 is the mean insert size.

(ii) Other formats:
./ssaha_mates -insert 1000 genome-cigar-raw.dat mates genome-cigar.dat

here mates is a mate-information file like
------------------------------------------------
228467_3207_2705 M000000000
117817_1884_0909 M000000001
088082_2563_0053
240978_3485_3099 M000000004
189044_2675_2464 M000000005
...
144670_2837_1818 M000000000
160362_1296_0270 M000000001
...
------------------------------------------------
First column - read name;
Second column - mate name. 
As indicated, 228467_3207_2705 and 144670_2837_1818  are paired reads. 088082_2563_0053 is a single end read.

It should be noted that all the read names should be unique. If a read does not have a paired read, leave the
mate name empty.
  
(5) Get the name list of uniquely placed reads: 
cat genome-cigar.dat | awk '{print $2}' > name.dat

(6) Get the uniquely placed reads, in the order of "name.dat" or "genome-cigar.dat"  
./get_seqreads name.dat reads-raw.fastq genome_reads.fastq

(7) For SNP output file:
./ssaha_pileup genome-cigar.dat refe.fa genome_reads.fastq > genome_SNP.out 

(8) For pileup output file:
./ssaha_pileup -cons 1 genome-cigar.dat refe.fa genome_reads.fastq > genome_pileup.out 

============
Solexa reads
============
Input files: read file - "reads-raw.fastq"; reference file - "refe.fa"

If your data file contains pair-end reads in the file "reads-pairend.fastq", i.e. 70 bases with each pair of 35 bases:

./ssaha_reads reads-pairend.fastq reads-raw.fastq

Note: this step splits the long 70 base into single end reads of 35 bps, but the 
file names are changed so that the codes in the late stage can recognize the read
pairs.

If your machine has >4 GB RAM or the genome size < 1 GB

./ssaha2 -solexa -score 20 -skip 2 -diff 0 -output cigar refe.fa reads-raw.fastq > ssaha.out

egrep cigar > genome-cigar-raw.dat

Note: The file "reads-raw.fastq" can be one lane of Solexa data, but no more larger than that. 

Go to (4).

Otherwise following these steps:

(1) Buildup hash table:
./ssaha2Build -save genome_hash -skip 2 -kmer 12 refe.fa 
This step produces five hash table files
genome_hash.base 
genome_hash.body
genome_hash.head
genome_hash.name
genome_hash.size

(2) Place reads to the genome
If the genome is very big and your data is multiple runs, you need to cut the read file into small files each, say, with 50000 to 500000 reads:

./ssaha_split -split 50000 split reads-raw.fastq

You will have files like:
split_0000
split_0001
...

./ssaha2 -solexa -score 20 -skip 2 -output cigar -memory 200 -diff 0 -disk 1 -save genome_hash split_0000 > ssaha_0000.out

Note: if you don't have many CPUs, you can increase the size of split_*, even for one lane of Solexa data.

(3) Get the cigar line file

cat ssaha_????.out | egrep cigar > genome-cigar-raw.dat

(4)Process cigar line file to sort pair reads:
(a) No pair-end data:
./ssaha_cigar genome-cigar-raw.dat genome-cigar.dat 

(b) Pair-ending reads:
(i) If the pair name format is like *.p1k, *.q1k 
./ssaha_pairs -insert 200 genome-cigar-raw.dat genome-cigar-unclean.dat 
./ssaha_clean -insert 200 genome-cigar-unclean.dat genome-cigar.dat
here 200 is the mean insert size and ssaha_clean removes or replaces some
wrongly mapped paired reads.

(ii) Other formats:
./ssaha_mates -insert 200 genome-cigar-raw.dat mates genome-cigar.dat

here mates is a mate-information file like
------------------------------------------------
SLXA-B9_676_FC4742_R1_1_1_195_859.p1k M000000000
SLXA-B9_676_FC4742_R1_1_1_246_325.p1k M000000001
SLXA-B9_676_FC4742_R1_1_1_678_287.p1k M000000002
SLXA-B9_676_FC4742_R1_1_1_466_228.p1k
SLXA-B9_676_FC4742_R1_1_1_370_428.p1k M000000004
SLXA-B9_676_FC4742_R1_1_1_294_641.p1k M000000005
SLXA-B9_676_FC4742_R1_1_1_747_279.p1k M000000006
...
SLXA-B9_676_FC4742_R1_1_1_195_859.q1k M000000000
SLXA-B9_676_FC4742_R1_1_1_246_325.q1k M000000001
SLXA-B9_676_FC4742_R1_1_1_678_287.q1k M000000002
SLXA-B9_676_FC4742_R1_1_1_370_428.q1k M000000004
SLXA-B9_676_FC4742_R1_1_1_294_641.q1k M000000005
.....
------------------------------------------------
First column - read name;
Second column - mate name. 
Also SLXA-B9_676_FC4742_R1_1_1_195_859.p1k and SLXA-B9_676_FC4742_R1_1_1_195_859.q1k are paired reads. SLXA-B9_676_FC4742_R1_1_1_466_228.p1k is a single end read.

It should be noted that all the read names should be unique. If a read does not have a paired read, leave the mate name empty.
  
Like previous read formats, if you have different insert sizes, you may make a file like 
------------------------------------------------
B9_676_FC4742_R1_1_1_195_859.p1k M000000000 3000 600
B9_676_FC4742_R1_1_1_246_325.p1k M000000001 3000 600
B9_676_FC4742_R1_1_1_678_287.p1k M000000002 3000 600
B9_676_FC4742_R1_1_1_466_228.p1k
B9_676_FC4742_R1_1_1_370_428.p1k M000000004 500 100
B9_676_FC4742_R1_1_1_294_641.p1k M000000005 500 100
B9_676_FC4742_R1_1_1_747_279.p1k M000000006 500 100
...
B9_676_FC4742_R1_1_1_195_859.q1k M000000000 3000 600 
B9_676_FC4742_R1_1_1_246_325.q1k M000000001 3000 600 
B9_676_FC4742_R1_1_1_678_287.q1k M000000002 3000 600 
B9_676_FC4742_R1_1_1_370_428.q1k M000000004 500 100 
B9_676_FC4742_R1_1_1_294_641.q1k M000000005 500 100
.....
------------------------------------------------

Here "3000" and "500" are two insert sizes and "600" and "100" are the standard derivation, which
is normally 20% of the insert size.

(5) Get the name list of uniquely placed reads: 
cat genome-cigar.dat | awk '{print $2}' > name.dat

(6) Get the uniquely placed reads, in the order of "name.dat" or "genome-cigar.dat"  
./get_seqreads name.dat reads-raw.fastq genome_reads.fastq

(7) For SNP output file:
./ssaha_pileup -solexa 1 genome-cigar.dat refe.fa genome_reads.fastq > genome_SNP.out 

(8) For pileup output file:
./ssaha_pileup -solexa 1 -cons 1 genome-cigar.dat refe.fa genome_reads.fastq > genome_pileup.out 

===================
Solexa reads indels 
===================
Ssaha2 is a gapped alignment program and it carries smith-waterman alignment for all the reads. Given the read length of ~35 bps, it can detect indels upto 3 bps.

Deletions
./ssaha_indel -deletion 1 -allele 0.15 genome-cigar.dat refe.fa genome_pileup.out > genome_deletion.out

Insertions
./ssaha_indel -insertion 1 -allele 0.15 genome-cigar.dat refe.fa genome_pileup.out > genome_insertion.out

------------------
Output file format
------------------
Deletion: 38 reference_seq 531992 1 31 50 IL5_178:3:208:410:940 23 0 174 35
Deletion: 38 reference_seq 531992 1 31 50 IL5_178:3:196:127:666 22 0 174 35
Deletion: 38 reference_seq 531992 1 31 50 IL5_178:3:28:97:846 14 1 174 35
Deletion: 38 reference_seq 531992 1 31 50 IL5_178:3:306:255:978 14 1 174 35
Deletion: 38 reference_seq 531992 1 31 50 IL5_178:3:34:938:186 20 0 174 35
Deletion: 38 reference_seq 531992 1 31 50 IL5_178:3:193:177:517 20 0 174 35
Deletion: 38 reference_seq 531992 1 31 50 IL5_178:3:27:196:216 28 0 174 35
Deletion: 38 reference_seq 531992 1 31 50 IL5_178:3:201:644:630 23 0 174 35
Deletion: 38 reference_seq 531992 1 31 50 IL5_178:3:297:700:650 26 0 174 35
Deletion: 38 reference_seq 531992 1 31 50 IL5_178:3:160:363:52 25 0 174 35
Deletion: 38 reference_seq 531992 1 31 50 IL5_178:3:210:65:261 21 1 174 35

Deletion: 39 reference_seq 612399 2 11 25 IL5_178:3:87:963:820 27 0 116 0
Deletion: 39 reference_seq 612400 1 11 50 IL5_178:3:310:890:374 20 1 112 20
Deletion: 39 reference_seq 612400 1 11 50 IL5_178:3:162:285:672 15 0 112 20
Deletion: 39 reference_seq 612400 1 11 50 IL5_178:3:230:864:435 19 1 112 20
Deletion: 39 reference_seq 612400 1 11 50 IL5_178:3:267:906:111 17 0 112 20
Deletion: 39 reference_seq 612400 1 11 34 IL5_178:3:124:875:75 23 0 112 20
Deletion: 39 reference_seq 612400 1 11 50 IL5_178:3:81:405:586 15 0 112 20
Deletion: 39 reference_seq 612400 1 11 25 IL5_178:3:292:641:664 10 1 112 20
Deletion: 39 reference_seq 612400 1 11 50 IL5_178:3:260:688:827 22 1 112 20
Deletion: 39 reference_seq 612400 1 11 42 IL5_178:3:247:910:680 22 0 112 20
Deletion: 39 reference_seq 612400 1 11 25 IL5_178:3:154:433:352 24 0 112 20


[2] Deletion index;
[3] Chromosome name;
[4] Reference offset;
[5] Deletion length;
[6] Number of reads covering the deletion position;
[7] Read mapping score; 
[8] Read name;
[9] Read offset;
[10] Alignment direction: 0 - forward; 1 - reverse compliment.
[11] Read coverage on the reference base;
[12] Number of "-"s from the reads on the reference base.

Same file format for insertions.

If the file genome-cigar.dat is very big, you might try this

egrep " D " genome-cigar.dat > genome-deletion.dat
egrep " I " genome-cigar.dat > genome-insertion.dat

./ssaha_indel -deletion 1 genome-deletion.dat refe.fa > genome_deletion.out
./ssaha_indel -insertion 1 genome-insertion.dat refe.fa > genome_insertion.out

As indicated, you can change the allele frequencies for your own dataset. For example, use "-allele 0.0" to output all the potential indels.

==========================
Solexa transcriptome reads
==========================
As Solexa transcriptome reads can straddle two exons, it is necessary to
reduce the minimum match length. Also we need to remove the global alignment
criterion: if mismatch bases at read ends are all high quality (>=30), the
read will not be used.

(1) Set a lower alignment score so that it picks up the short alignment pieces:
./ssaha2 -solexa -score 12 -skip 2 -diff 0 -output cigar refe.fa reads-raw.fastq > ssaha.out

(2) For SNP output file:
./ssaha_pileup -solexa 1 -trans 1 genome-cigar.dat refe.fa genome_reads.fastq > genome_SNP.out

(3) For pileup output file:
./ssaha_pileup -solexa 1 -cons 1 -trans 1 genome-cigar.dat refe.fa genome_reads.fastq > genome_pileup.out

================================================
Solexa reads - randomly mapping repetitive reads
================================================
It is not recommended to assign repetitive reads to a location by random. But if you do wish to do so in single end reads, ssaha_pileup offers this facility as well. One condition is set: those reads must be 100% match with the reference, i.e. no mismatches or indels in the matching region.

To map repetitive reads:

./ssaha_cigar -uniq 0 genome-cigar-raw.dat genome-cigar.dat

Now exactly repetitive reads have been assigned randomly to ONE location. You then continue the rest steps.

===========================
Solexa reads - pileup depth 
===========================
The latest version of ssaha_pileup reports all the output results regardless of read depth.

==========================
Running ssaha2 on parallel 
==========================
It is possible to speed up the alignment jobs if you have multiprocessor 
machines. Ideally with big number of farm nodes as this version of ssaha2 
was mainly designed for low memory machines. Using "-disk 1", you can align 
reads against the whole unmasked human genome under 1.0 GB of memory. We 
suggest that you cut the read file into small files each, say, from 5000-500000 
reads (depending upon the CPU nodes you have) and then use multiprocessor 
machines to run the small jobs one by one. At the end, you need to cat all the 
files together. If your machine has good RAM memory, say >4 GB, you can use 
"-disk 0" to speed alignment up as this loads the hash table files directly to 
RAM. If you have reads in hundreds of millions, the file genome-cigar-raw.dat 
can be very big as we collect up to 200  pieces of alignment for each Solexa 
read. This consequently lead to big memory requirement in of processing cigar 
line files. We suggest that you process cigar line file and read file for every 
Solexa run: 

./ssaha_cigar runxx-cigar-raw.dat runxx-cigar.dat
cat runxx-cigar.dat | awk '{print $2}' > runxx-name.dat
/get_seqreads runxx-name.dat runxx-reads-raw.fastq runxx-reads.fastq

cat run??-cigar.dat > genome-cigar.dat
cat run??-reads.fastq > genome-reads.fastq

And finally run ssaha_pileup.

=======================
Source or binary codes:
=======================
ssaha2

http://www.sanger.ac.uk/Software/analysis/SSAHAA2

ssaha_pileup ssaha_cigar ssaha_pairs ssaha_mates

ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/ssaha_pileup/

ssaha_split get_seqreads:

ftp://ftp.sanger.ac.uk/pub/zn1/ssaha_pileup/other_codes/

===============
SNP file format
===============
=================================================================================================
refe_name SNP_score offset N_reads ref_bp SNP_bp N_'A' N_'C' N_'G' N_'T' N_'-' N_'N' N_'a' N_'c' N_'g' N_'t'
=================================================================================================
SNP_hom: reference 24 465711    11      C G   0      0      10     1      0      0      0   0   8   1  
SNP_hom: reference 20 466338    6       G A   6      0      0      0      0      0      4   0   0   0  
SNP_hom: reference 27 466341    6       G A   6      0      0      0      0      0      1   0   0   0  
SNP_hez: reference 27 466344    7       C T/C 0      1      0      6      0      0      0   0   0   1  
SNP_hez: reference 15 466352    5       C T/C 0      1      0      4      0      0      0   0   0   2  
SNP_hez: reference 17 469255    8       C T/C 0      3      0      5      0      0      0   2   0   2  
SNP_hez: reference 17 469263    8       T A/T 5      0      0      3      0      0      2   0   0   3  
SNP_hez: reference 55 469413    17      G A/G 16     0      1      0      0      0      10  0   1   0  
SNP_hom: reference 60 469414    17      C T   0      0      0      17     0      0      0   0   0   10 
SNP_hez: reference 60 469415    17      A G/A 1      0      16     0      0      0      1   0   8   0  
SNP_hom: reference 50 469422    16      A C   0      16     0      0      0      0      0   12  0   0  
SNP_hez: reference 57 469424    16      T G/T 0      0      15     1      0      0      0   0   7   1  
SNP_hez: reference 99 472834    395     G A/G 394    0      1      0      0      0      239 0   1   0  
SNP_hez: reference 99 480623    331     T C/T 1      138    0      192    0      0      1   105 0   109
SNP_hez: reference 99 485669    209     T G/T 1      0      206    2      0      0      1   0   154 2  
SNP_hez: reference 99 506718    772     A C/A 433    333    5      0      0      1      324 251 5   0  
SNP_hez: reference 99 544004    207     G T/G 1      4      148    54     0      0      1   4   140 53 

(2) Name of chromosome/reference;
(3) Overall SNP confidence score: 0-99;
(4) SNP offset;
(5) Read coverage;
(6) Reference base;
(7) SNP base;
(8-11) Number of A,C,G,T (base quality Q>=0);
(12) Number of '-'s;
(13) Number of Ns;
(14-17) Number of a,c,g,t (base quality Q<25);

=======================
Pileup/cons file format
=======================
=================================================================================================
  refe_name offset N_reads refe_base N_'A' N_'C' N_'G' N_'T' N_'N' N_'-' N_'a' N_'c' N_'g' N_'t'
=================================================================================================
cons: reference 1 0 T Zero coverage
cons: reference 2         273     T     0      0      0      273    0      0      0   0   0   141
cons: reference 3         301     A     301    0      0      0      0      0      182 0   0   0  
cons: reference 4         324     G     0      0      324    0      0      0      0   0   207 0  
cons: reference 5         348     C     0      347    1      0      0      0      0   197 1   0  
cons: reference 6         386     A     385    0      1      0      0      0      245 0   1   0  
cons: reference 7         412     G     0      0      411    1      0      0      0   0   257 1  
cons: reference 8         440     C     5      433    2      0      0      0      5   267 2   0  
cons: reference 9         468     T     0      0      1      467    0      0      0   0   1   252
cons: reference 10        480     A     479    1      0      0      0      0      313 1   0   0  
cons: reference 11        507     G     1      0      505    0      0      1      1   0   333 0  
cons: reference 12        533     T     1      3      3      525    0      1      1   3   3   394
cons: reference 13        558     G     1      2      553    1      0      1      0   2   359 1  
cons: reference 14        577     G     1      1      574    0      0      1      1   1   369 0  
cons: reference 15        595     G     1      1      591    0      1      1      0   1   391 0  
cons: reference 16        617     A     605    2      6      3      0      1      453 2   5   3  

(2)     Name of chromosome/reference;
(3)     Base location;
(4)     Read coverage;
(5)     Reference base;
(6-9)   Number of A,C,G,T (base quality Q>=0);
(10)    Number of '-'s;
(11)    Number of Ns;
(12-15) Number of a,c,g,t (base quality Q<25);

===================
Read Mapping Score
==================
Read mapping is used to assess the repetitive feature of the read in the genome. In the cigar file
cigar::50

Smap = 50 is the mapping score:

Smap = (30/R)(Smax-Smax2)*10    if(Smap <50)
     = 50                       if(Smap >=50)
R = read length;
Smax - maximum alignment score (smith-waterman) of the hits on genome;
Smax2 - second best alignment score of the hits on genome;

Say you have one read of 30 bases which has two hits on the genome:
Best hit: exact match with Smax 30; Second best hit: one base mismatch with Smax2 29.
The mapping score for this read is Mscore = 10;

====================
SNP Confidence Score
====================
SNP score is calculated as the sum of weighted read mapping score,
combined with base quality.

For Solexa reads:

SNP_score = SUM(Smap*Qf)/10  (0 - 99)

Smap - read mapping score, from 0 (repeat) to 50 (unique);
Qf   - base quality factor:
       Qf = 1 if Q>=30
       Qf = 0.5 if Q<30

===============
32Bit Platforms
===============
It is recommended to use ssaha_pileup under 64Bit systems under which the pipeline.
was developed. We have tested some applications under 32Bit x86 systems. However,
problem might occur. If you only have x86 machines, edit the file "fasta.h" and
change 

//#define B64_long long int
#define B64_long long long int

In the above, a 64 bit integer is defined as "long long int" for x86 systems. 
Compile the code and run your applications.


=======================
Alignment visualization
=======================
The Sanger Institute is developing tools for data visualization and James Bonfield is working on gap5, which should be able to handle next-gen sequencing data. Before gap5's release, our group has some visualization tools and we are happy to share this with our users. The tool view_pileup is mainly for the QCs of SNP/indel detections of ssaha_pileup and may NOT have the highest standard in it. The alignment is pairwise and is not a fully optimized mutiple alignment. The idea is based upon client/server connections. You build up a server first and fire a query through a client. The client has to be a new window or even a new computer. If you have problems in running it, simply ignore it.

(1) Build up server
./view_pileup -port 777777 genome-cigar.dat refe.fa genome_reads.fastq

When you see "Start SSAHA2 server", this means that the server is set up and ready to take queries.
Note: "-port 777777" is the default port number. You can use any number if you want to.

(2) Open a new window
Connect to server and get alignment information for a give location: chromosome N at location M:
./viewClient.pl -server my_machine -port 777777 -contig N -pos M 

Here "my_machine" is the computer name you are running on view_pileup. This means that you can get access to server via network from any computers. Contig/chromosome are related to the reference sequence and it starts from "1", not "0".

Note: If you are using 454 or Sanger reads, the character "|" points to the exact base for the given location in the following line: 
 
View position:          0    954250    282923                              |

However, if you are using Solexa reads, the position is not accurate and it may have been shifted a few bases. But the read alignments and SNP information in that window should be correct.

(3) The depth of the reads is set to 5000. The code will have problems if your depth is higher than that.   

(4) Exit view_pileup
Ctrl-C or Ctrl-Z
If you don't need to view the data on a working day, please exit view_pileup. If leave there for too long, might cause problems for the system.

(5) For the same reference, if you have data from mutiple platforms:
Cigar files:   my_solexa-cigar.dat, my_454-cigar.dat, my_sanger-cigar.dat
Reference:     my_reference.fa
Read files:    reads-solexa.fastq, reads-454.fastq, reads-sanger.fastq

Do these steps:
cat my_solexa-cigar.dat my_454-cigar.dat my_sanger-cigar.dat > my-cigar.dat
cat reads-solexa.fastq reads-454.fastq reads-sanger.fastq > my-reads.fastq
./view_pileup -port 777777 my-cigar.dat my_reference.fa my-reads.fastq
Open a new window:
./viewClient.pl -server my_machine -port 777777 -contig N -pos M

(6) For the same reference, if you have data from mutiple cell lines:
Cigar files:   cell-1-cigar.dat, cell-2-cigar.dat
Reference:     my_reference.fa
Read files:    reads-cell-1.fastq, reads-cell-2.fastq

Change the FIRST column of the cigar file cell-1-cigar.dat from
================================================================
cigar::50 IL5_178:1:1:121:419 35 1 - csp12 10247 10281 + 35 M 35
cigar::50 IL5_178:1:1:120:336 35 1 - csp12 4759 4793 + 35 M 35
.....
cigar::50 IL5_178:1:1:114:152 35 1 - csp12 3213 3247 + 35 M 35
cigar::50 IL5_178:1:1:99:386 1 35 + csp12 2677 2711 + 35 M 35

to 

cell1::50 IL5_178:1:1:121:419 35 1 - csp12 10247 10281 + 35 M 35
cell1::50 IL5_178:1:1:120:336 35 1 - csp12 4759 4793 + 35 M 35
.....
cell1::50 IL5_178:1:1:114:152 35 1 - csp12 3213 3247 + 35 M 35
cell1::50 IL5_178:1:1:99:386 1 35 + csp12 2677 2711 + 35 M 35
===============================================================

Do the same for cell-2-cigar.dat:
=================================================================
cigar::50 IL5_178:1:330:171:867 1 35 + csp12 1727 1761 + 35 M 35
cigar::50 IL5_178:1:330:139:138 1 35 + csp12 9760 9795 + 28 M 24 D 1 M 11
...
cigar::50 IL5_178:1:330:100:981 1 35 + csp12 929 963 + 35 M 35
cigar::50 IL5_178:1:330:597:424 35 1 - csp12 12320 12354 + 29 M 35

to

cell2::50 IL5_178:1:330:171:867 1 35 + csp12 1727 1761 + 35 M 35
cell2::50 IL5_178:1:330:139:138 1 35 + csp12 9760 9795 + 28 M 24 D 1 M 11
...
cell2::50 IL5_178:1:330:100:981 1 35 + csp12 929 963 + 35 M 35
cell2::50 IL5_178:1:330:597:424 35 1 - csp12 12320 12354 + 29 M 35
=================================================================

Do these steps:
cat cell-1-cigar.dat cell-2-cigar.dat > my-cigar.dat
cat reads-cell-1.fastq reads-cell-2.fastq > my-reads.fastq
./view_pileup -port 777777 my-cigar.dat my_reference.fa my-reads.fastq
Open a new window:
./viewClient.pl -server my_machine -port 777777 -contig N -pos M

======================
Flag options of ssaha2
======================
      ssaha2Build [OPTIONS] -save hash_name subject_file

      ssaha2 [OPTIONS] -save hash_name query_file
      ssaha2 [OPTIONS] subject_file query_file

      ssahaSNP [OPTIONS] -save hash_name query_file
      ssahaSNP [OPTIONS] subject_file query_file

      ssaha2Server [OPTIONS] hash_name

   where subject_file and query_file are fasta or fastq files with the
   reference (subject) and query sequence files. hash_name is the root
   name of the hash table files for the reference sequence created
   using ssaha2Build.

OPTIONS:

 -h, -help     Print this page.
 -v, -version  Print version information.
 -c, -cookbook Print some example parameter sets, suitable
               for common tasks.

All other options, except the -solexa flag and the -output ssaha2 cigar option, are
key-value pairs, values are described for the keys below (defaults in brackets):

 -save <FILENAME>
            ssaha2Build: Root name of the files to which the
                         hash table is saved. The set of files
                         have the extensions FILENAME.head FILENAME.body
                         FILENAME.name FILENAME.base FILENAME.size
            ssaha2, ssahaSNP: Read hash table from files created by ssaha2Build.
	                   If -save is not specified then the first file name
                         specifies the reference sequence for which a hash
                         table is constructed on the fly
 -kmer      Word size for ssaha hashing (12).
 -skip      Step size for ssaha hashing (12).
 -ckmer     Word size for cross_match matching (10).
 -cmatch    Minimum match length for cross_match matching (14).
 -cut       Number of repeats allowed before this kmer is ignored (10000).
 -seeds     Number of kmer matches required to flag a hit (5).
 -depth     Number of hits to consider for alignment (50).
 -memory:   Memory assigned in MBs for the alignment matrix (200).
 -score     Minimum score for match to be reported (30).
 -identity  Minimum identity for match to be reported (50.000000).
 -port      Port number for server (60000).
 -align     If set to > 0, output graphical alignment (0).
            If set to 2 and -solexa flag is set: output also quality score.
 -edge      Augment hit by this many bases before alignment (200).
 -array:    Memory assigned in bytes for frequency arrays (4000000).
 -start:    first sequence to process in query (0).
 -end:      last sequence to process in query, 0 means process all (0).
 -sense     Allow really patchy hits to go for alignment (0).
 -best      If set to 1, only report the best alignment for each
            match, if multiple best scores report all (0).
 -454:      (see -rtype) Tune for 454 reads (0).
 -NQS:      Use NQS to filter SNPs if set to 1, otherwise output all candidates (1).
 -quality:  Quality value to use for variation base in NQS (23).
 -tags:     If set to 1, prefix added to output summary lines to
            aid parsing, the prefix depends upon the chosen
            output format, e.g. if output is ssaha2 then the
            prefix is ALIGNMENT (1).
 -output:   ssaha2       - original ssaha2 line only (default)
            sugar        - Simple UnGapped Alignment Report
            cigar        - Compact Idiosyncratic Gapped Alignment Report
            vulgar       - Verbose Useful Labelled Gapped Alignment Report
            psl          - Tab separated format similar to BLT
                         - http://genome.ucsc.edu/goldenPath/help/customTrack.html
            pslx         - Tab separated format with sequence
            gff          - http://www.sanger.ac.uk/Software/formats/GFF/
            ssaha2 cigar - alternate between ssaha2 and cigar format lines
            for a full description of output formats see:
                 http://www.sanger.ac.uk/Software/analysis/SSAHA2/formats.shtml
 -name      Flag that modifies option '-output cigar' such that read name
            and length are also reported when there was no hit found.
 -diff:     Output all hits within diff of the best (-1).
 -udiff:    Ignore best hit if second best score within udiff (0).
 -fix:      If set to 1, fix -edge, -seeds, -score so that they
            are not updated according to read length in ssahaSNP (0).
 -disk:     If set to 1, read hashtable from disk rather than
            loading to memory (0)
 -weight:   If >0, apply this much weighting to rare kmers (0).
 -solexa:   (see -rtype) implies (ssaha2 and ssahaSNP only):
            -seeds 2 -score 12 -sense 1 -cmatch 10 -ckmer 6 -skip 1.
            Top scoring hits with lower quality at the mismatch positions have
            their Smith-Waterman score incremented by 1. Mapping scores are
            changed accordingly. SsahaSNP reports in such cases only the top
            scoring hit (no Repeat lines).
 -rtype:    solexa          - like -solexa flag
            454             - like -454 flag
            abi             - tunes for ABI reads (default)
