Trace formatting

Revision as of 16:03, 23 September 2008

Articles

Technologies

Sanger

454 (single reads)

Anomalies:

 * homopolymer lengths can be called shorter than they really are
 * substitutions are less likely than in traditional methods
 * single base insertions
 * carry forward events usually occur near, but not adjacent to, homopolymers

GS20

 Plate information
 * 1.6M total wells
 * 450K detectable wells
 * 200K usable wells
 Accuracy:
 * published per-base accuracy of a Roche GS20 is only 96%.
 * Mitch Sogin paper
   * 99.5% accuracy rate in unassembled sequences
   * identified several factors that can be used to remove a small percentage of low-quality reads, improving the accuracy to 99.75% or better => better quality than Sanger sequencing
   * The error rate, defined as the number of errors (miscalled bases plus inserted and deleted bases) divided by the total number of expected bases, was 0.49%
  * 36% insertions, 27% deletions, 21% N's, 16% substitutions
  * A to G and T to C, were more frequent than other mismatches
  * reverse transitions, G to A and C to T, were not that frequent 
  * Nearly 70% of the homopolymer extensions were A/T
  * Although errors were evenly distributed along the length of the reference sequences, they were not evenly distributed among reads: 82% had no errors, 93% had no more than a single error, and 96% had no more than 2 errors.
  * A small number of reads, fewer than 2%, contained a disproportionate number of errors that account for nearly 50% of the miscalls for the entire dataset  
  * Average quality is 25; in homopolymers it can drop as low as 5
  * Reads much longer than the average length had more errors
  * The presence of even a single ambiguous base call in a read correlates strongly with the presence of other errors in that read
  * Errors in the primer sequence also correlated with other errors in the read
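
The error-rate definition quoted in the notes above (miscalled plus inserted plus deleted bases, divided by the total number of expected bases) can be written out directly. This is only an illustrative sketch: the function names are hypothetical and no counts from the paper are used.

```python
def error_rate(miscalled, inserted, deleted, expected_bases):
    """Errors (miscalls + insertions + deletions) per expected base."""
    if expected_bases <= 0:
        raise ValueError("expected_bases must be positive")
    return (miscalled + inserted + deleted) / expected_bases


def accuracy(rate):
    """Per-base accuracy is the complement of the error rate."""
    return 1.0 - rate
```

With made-up counts of 49 errors over 10,000 expected bases, the rate is 0.49%, which corresponds to roughly 99.5% accuracy. That is how the two figures in the list relate to each other.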

Get info from .sff files:

 $ sffinfo -h
 Usage:  sffinfo [options...] [- | sfffile] [accno...]
 Options:
      -a or -accno      Output just the accessions
      -s or -seq      Output just the sequences
      -q or -qual     Output just the quality scores
      -f or -flow     Output just the flowgrams
      -t or -tab      Output the seq/qual/flow as tab-delimited lines
      -n or -notrim   Output the untrimmed sequence or quality scores
      -m or -mft      Output the manifest text

454 (paired ends)

Features:

 * approximately 84-nucleotide DNA fragments 
 * have a ~ 44-mer linker sequence in the middle 
 * flanked by a ~ 20-mer sequence on each side. 
 * The two flanking 20-mers are segments of DNA that were originally located approximately 2.5 kb apart in the genome of interest.  
 * The ordering and orienting of contigs generates scaffolds which provide a high-quality draft sequence of the genome.
 Linker(palindrome) : GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
 Check for linker   : sffinfo -s *.sff | ~/bin/fasta2tab.pl | grep GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC

Anomalies:

 * the linker can appear (tandem,completely/partially) more than once
 * 80-90% of reads contain an identical copy of the linker
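
As a rough sketch of what the grep check does per read, the function below looks for exact copies of the linker and returns the two flanking tags. The function names are hypothetical. It assumes exact matches only, so partial linker copies (one of the anomalies listed above) are not detected. Because the linker is a palindrome, it equals its own reverse complement, so a single search covers both strands.

```python
LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T, N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def split_on_linker(read, linker=LINKER):
    """Return (left_tag, right_tag, n_copies), or None if no exact copy.

    With tandem copies, the left tag ends at the first copy and the
    right tag starts after the last copy.
    """
    n_copies = read.count(linker)  # non-overlapping exact matches
    if n_copies == 0:
        return None
    first = read.find(linker)
    last = read.rfind(linker)
    return read[:first], read[last + len(linker):], n_copies
```

Reads for which this returns None may still carry a partial linker and would need an approximate search.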

Links:

 1_paired_end.pdf

Solexa/Illumina

Data sets:

 Strep suis Solexa data set for download at Sanger
 Staphylococcus aureus strain MW2 (edena paper)
 NCBI Solexa example data set
 Pseudomonas aeruginosa
 Pseudomonas syringae
 human HapMap individual NA12878  SRR000921..SRR001306

Articles:

 ismb2007Poster.pdf
 Smith_Rennes_2007.pdf

Software:

 Staden & Io_lib
 * IO_LIB package /fs/sz-user-supported/common/packages/io_lib-1.11-x86_64/bin/
 * STADEN package /fs/sz-user-supported/common/packages/staden-src-1-7-0/distrib/unix-rel-1-7-0/linux-bin
 
 MAQ Sanger assembler
 FASTQ sequence format

Illumina 1G :

 * ~40 Million DNA sequencing reactions
 * about 36 hours for a run
 * each sequence is up to 36 bases long

SRA: set of 4 files

 *_seq.txt  : lane,run, well(x,y) sequence
 *_prb.txt  : max quality from each group of 4 values is taken as quality
 *_sig2.txt : lane,run, well(x,y); max signal from each group of 4 values corresponds to max quality
 *_qhg.txt  : lane,run, well(x,y); some encoded info?
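
The rule noted for `_prb.txt` (the maximum of each group of 4 values is taken as the quality) can be sketched as follows. The layout is an assumption: the function expects one read per line, given as whitespace-separated integers, four per sequencing cycle. The function name is hypothetical.

```python
def prb_line_to_qualities(line):
    """Return one quality per cycle: the max of each group of 4 values."""
    values = [int(v) for v in line.split()]
    if len(values) % 4 != 0:
        raise ValueError("expected a multiple of 4 values per line")
    return [max(values[i:i + 4]) for i in range(0, len(values), 4)]
```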

Qualities:

 Range : -5..40
 Avg   : ~25, depending on the data set
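
A range that extends below zero suggests Solexa-scaled (log-odds) scores rather than Phred scores. Assuming that is the case here, the standard conversion is Q_phred = 10 * log10(10^(Q_solexa/10) + 1). A minimal sketch, with hypothetical function names:

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled (log-odds) quality to the Phred scale."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def phred_to_error_probability(q_phred):
    """Error probability implied by a Phred-scaled quality."""
    return 10.0 ** (-q_phred / 10.0)
```

The two scales nearly coincide for high qualities and diverge for low ones: a Solexa score of 0 corresponds to a Phred score of about 3.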

Fastq format

Example:

 1 lane of Solexa reads: 10,959 READS; all are 36 bp
 $ /fs/sz-user-supported/common/packages/io_lib-x86_64/bin/solexa2srf s_8_0100_seq.txt  ; mv traces.srf  s_8_0100.srf
 $ /fs/sz-user-supported/common/packages/io_lib-x86_64/bin/srf2fastq s_8_0100.srf > s_8_0100.fastq

   @s_8_100_293_551
   CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACC
   +
   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
   @s_8_100_35_698
   TATATGATTGACAATATAAAAATATGAGTATAAAAT
   +
   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII4/:I
   @s_8_100_880_947
   TTATTATCTTTATTGACGTACCTCTAGAAGACCCAA
   +
   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;>1
   ...
 Edge effect: 
 N's have quality -14
 $ cat s_8_0100_seq.txt | sort -nk3 -nk4 
 8       100     0       37      ......AT.AT...TAATCAATA..GA.GAAG....
 ...
 8       100     1003    959     AGTC.......T.C.........GT.........AA
 $ more traces.qual
 ...
 >s_8_100_0_37
 -14 -14 -14 -14 -14 -14 25 13 -14 25 25 -14 -14 -14 25 25 25 25 22 25 25 25 25 -14 -14 25 25 -14 25 -11 25 14 -14 -14 -14 -14 
 ...
 >s_8_100_1003_959
 25 25 25 25 -14 -14 -14 -14 -14 -14 -14 25 -14 25 -14 -14 -14 -14 -14 -14 -14 -14 -14 25 -10 -14 -14 -14 -14 -14 -14 -14 -14 -14 8 25
 ...
 # bioperl script to convert seq formats
 $ seqconvert.PLS --from fastq --to fasta < s_8_0100.fastq
 
 # get fastq qualities
 $ cat *fastq | grep -A 1 "^+" | grep -v ^+ | grep -v -- ^-- | perl -ane '@F=split //,$F[0]; foreach (@F) { $n=ord($_)-33; print $n," ";} print "\n";'
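
The same quality decoding (character code minus 33) can be sketched with a record-based parser. It assumes strict 4-line FASTQ records, as in the example above, and the function name is hypothetical. Reading records by position rather than by pattern avoids misreading quality strings that happen to start with `+` or `@`.

```python
def parse_fastq(lines):
    """Yield (name, sequence, qualities) from 4-line FASTQ records.

    Qualities are decoded as ord(char) - 33, as in the one-liner above.
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield header[1:], seq, [ord(c) - 33 for c in qual]
```

It can be fed an open file handle, for example `for name, seq, quals in parse_fastq(open("s_8_0100.fastq")): ...`.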

SOLiD

color space (0123) => base space (ACGT)
.csfasta file : in color space; starts with a known base (usually T)

Example:

 >1_88_1830_R3
 G32113123201300232320
 >1_89_1562_R3
 G23133131233333101320
 ..

Encoding matrix (row: first base, column: second base)

   A C G T
 A 0 1 2 3
 C 1 0 3 2
 G 2 3 0 1
 T 3 2 1 0

Examples:

  AA is encoded as 0
  CG is encoded as 3
  AACG is encoded as 0 1 3
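
The matrix above is equivalent to XOR-ing 2-bit codes A=0, C=1, G=2, T=3, which makes encoding and decoding easy to sketch. The function names are hypothetical. The decoder assumes the leading base of a .csfasta read is a known anchor that is not part of the read itself, and it does not handle missing colors (`.`).

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(bases):
    """Encode each adjacent pair of bases as one color (0-3)."""
    return [BASE_TO_CODE[a] ^ BASE_TO_CODE[b] for a, b in zip(bases, bases[1:])]


def decode_csfasta(read):
    """Decode e.g. 'G32113123201300232320' into base space.

    The first character is the known starting base; every color then
    determines the next base from the previous one.
    """
    current = BASE_TO_CODE[read[0]]
    decoded = []
    for color in read[1:]:
        current ^= int(color)
        decoded.append(CODE_TO_BASE[current])
    return "".join(decoded)
```

Because each base depends on all preceding colors, a single wrong color changes every base decoded after it. This is one reason to compare reads in color space rather than decoding first.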

Features of Color space:

 * Color space data are self-complementary
   Example:
       Base    A G C T C G T C G T G C A G
       Color space 2 3 2 2 3 1 2 3 1 1 3 1 2
   
       Complemented
       Base    T C G A G C A G C A C G T C
       Color space 2 3 2 2 3 1 2 3 1 1 3 1 2
 * Two-Base Encoding and Error Recognition
   1 isolated color change: measurement error
   2 adjacent color changes (consistent with a single base change): SNP
   Example (one isolated change at position 5, i.e. a measurement error):
      Reference 2 3 2 2 3 1 2 3 1 1 3 1 2
      Observed  2 3 2 2 0 1 2 3 1 1 3 1 2
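
The error-recognition idea can be sketched as a comparison of two color sequences. The labels and the function name are hypothetical, and the sketch assumes the XOR form of the encoding matrix (A=0, C=1, G=2, T=3). Under that encoding, a run of changed colors leaves the flanking bases untouched only if the XOR of the colors in the run is the same in both sequences. An isolated change can never satisfy this, so it is reported as a measurement error.

```python
def classify_color_mismatches(reference, observed):
    """Return (start, length, label) for each run of mismatched colors."""
    if len(reference) != len(observed):
        raise ValueError("sequences must have the same length")
    mismatches = [i for i, (r, o) in enumerate(zip(reference, observed)) if r != o]
    calls = []
    i = 0
    while i < len(mismatches):
        # Extend the run while mismatch positions stay adjacent.
        j = i
        while j + 1 < len(mismatches) and mismatches[j + 1] == mismatches[j] + 1:
            j += 1
        run = mismatches[i:j + 1]
        if len(run) == 1:
            label = "measurement error"
        else:
            ref_xor = 0
            obs_xor = 0
            for k in run:
                ref_xor ^= reference[k]
                obs_xor ^= observed[k]
            label = "candidate SNP" if ref_xor == obs_xor else "inconsistent"
        calls.append((run[0], len(run), label))
        i = j + 1
    return calls
```

Positions are 0-based, so the example above (a change at position 5) is reported with start 4.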

Download

From online database

Example:

 >gi|45439865|ref|NC_005810.1| Yersinia pestis biovar Microtus str. 91001, complete genome
 TCGCGCGATCTTTGAGCTAATTAGAGTAAATTAATCCAATCTTTGACCCAAATCTCTGCTGGATCCTCTG
 GTATTTCATGTTGGATGACGTCAATTTCTAATATTTCACCCAACCGTTGAGCACCTTGTGCGATCAATTG
 ...

Bioperl scripts:

 /fs/sz-user-supported/common/bin/
 bp_fetch.pl net::genbank:NC_005810.1 > NC_005810.1
 bp_fetch.pl net::genbank:NC_005810 > NC_005810
 bp_fetch.pl net::genbank:45439865 > 45439865

Format

Example:

 ~/bin/tarchive2amos -o Ba Ba.seq                                              # TA FTP
 ~/bin/tarchive2amos -o Ba -tracedir traces/                                   # TA querytrace_db 
 ~/bin/tarchive2amos -o Ba -assembly assembly/ASSEMBLY.xml -tracedir traces/   # AA

Alignments

Whole genomes alignments