Trace formatting

Revision as of 16:18, 21 January 2008

Sanger

454 (single reads)

Anomalies:

 * homopolymer lengths can be reported shorter than the true length
 * substitutions are less likely than in traditional methods
 * single base insertions
 * carry forward events usually occur near, but not adjacent to, homopolymers

GS20

 Plate information
 * 1.6M total wells
 * 450K detectable wells
 * 200K usable wells
 Accuracy:
 * published per-base accuracy of a Roche GS20 is only 96%.
 * Mitch Sogin paper
   * 99.5% accuracy rate in unassembled sequences
   * identified several factors that can be used to remove a small percentage of low-quality reads, improving the accuracy to 99.75% or better => better quality than Sanger sequencing
   * The error rate, defined as the number of errors (miscalled bases plus inserted and deleted bases) divided by the total number of expected bases, was 0.49%
   * errors by type: 36% insertions, 27% deletions, 21% N's, 16% substitutions
   * the transitions A to G and T to C were more frequent than other mismatches
   * the reverse transitions, G to A and C to T, were less frequent
   * nearly 70% of the homopolymer extensions were A/T
   * errors were evenly distributed along the length of the reference sequences, but not among reads: 82% had no errors, 93% had no more than a single error, and 96% had no more than 2 errors
   * a small number of reads, fewer than 2%, contained a disproportionate number of errors, accounting for nearly 50% of the miscalls in the entire dataset
   * avg quality is 25; in homopolymers it can drop as low as 5
   * reads much longer than the avg length had more errors
   * the presence of even a single ambiguous base call (N) in a read correlates strongly with the presence of other errors in that read
   * primer errors also correlated with other errors in the read
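
The error-rate definition quoted above (miscalled plus inserted plus deleted bases, divided by the expected bases) can be sketched in a few lines of Python. The function names and the counts in the usage lines are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the paper. The second helper shows the simplest of the read filters hinted at in the notes, dropping any read that contains an N, and is likewise an assumption about how such a filter might look.

```python
def error_rate(miscalled, inserted, deleted, expected_bases):
    """Return (miscalled + inserted + deleted) / expected_bases."""
    if expected_bases <= 0:
        raise ValueError("expected_bases must be positive")
    return (miscalled + inserted + deleted) / expected_bases


def drop_ambiguous(reads):
    """Keep only reads without an ambiguous base call (N)."""
    return [read for read in reads if "N" not in read.upper()]


# Invented counts, for illustration only.
rate = error_rate(miscalled=12, inserted=9, deleted=4, expected_bases=5000)
print(f"error rate: {rate:.2%}, accuracy: {1 - rate:.2%}")

# Invented reads; only the first one survives the filter.
print(drop_ambiguous(["ACGTACGT", "ACGNACGT", "nACGT"]))
```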

454 (paired ends)

Features:

 * approximately 84-nucleotide DNA fragments 
 * have a ~ 44-mer linker sequence in the middle 
 * flanked by a ~ 20-mer sequence on each side. 
 * The two flanking 20-mers are segments of DNA that were originally located approximately 2.5 kb apart in the genome of interest.  
 * The ordering and orienting of contigs generates scaffolds which provide a high-quality draft sequence of the genome.

Anomalies:

 * the linker can appear more than once in a read (in tandem, as complete or partial copies)
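
A minimal sketch of how a paired-end read might be split at the linker, and how repeated linker copies could be detected. The linker sequence is not given in these notes, so the `linker` value below is a short made-up placeholder, and both function names are hypothetical. Only exact matches are handled; partial linker copies would need approximate matching, which this sketch does not attempt.

```python
def split_paired_end(read, linker):
    """Return (left_tag, right_tag) around the linker, or None if it is absent.

    When the linker occurs several times, everything between the first
    and the last copy is treated as linker and discarded.
    """
    first = read.find(linker)
    if first == -1:
        return None
    last = read.rfind(linker)
    return read[:first], read[last + len(linker):]


def count_linker(read, linker):
    """Count non-overlapping exact copies of the linker in the read."""
    return read.count(linker)


# Placeholder linker, NOT the real ~44-mer.
linker = "GTTGGAACC"
read = "ACGTACGTAC" + linker + linker + "TTGCATTGCA"
print(split_paired_end(read, linker))
print(count_linker(read, linker))
```

Reads for which `count_linker` returns more than 1 match the tandem-linker anomaly listed above and can be flagged for closer inspection.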

Links:

 1_paired_end.pdf

Solexa/Illumina

Links:

 Strep suis Solexa data set for download at Sanger
 NCBI Solexa example data set
 ismb2007Poster.pdf
 Smith_Rennes_2007.pdf

Software:

 Staden & Io_lib
 * IO_LIB package /fs/sz-user-supported/common/packages/io_lib-1.11-x86_64/bin/
 * STADEN package /fs/sz-user-supported/common/packages/staden-src-1-7-0/distrib/unix-rel-1-7-0/linux-bin

Example:

 $ solexa2srf s_8_0100_seq.txt  -o s_8_0100_seq.srf
 $ srf2fastq s_8_0100_seq.srf

   @s_8_100_293_551
   CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACC
   +
   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
   @s_8_100_35_698
   TATATGATTGACAATATAAAAATATGAGTATAAAAT
   +
   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII4/:I
   @s_8_100_880_947
   TTATTATCTTTATTGACGTACCTCTAGAAGACCCAA
   +
   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;>1
   ...
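
The FASTQ records above can be read back with a small parser such as the sketch below. It assumes plain 4-line records and a quality offset of 33; the offset is an assumption about what the converter emits (some Solexa pipelines used offset 64), so it is left as a parameter. The function name `read_fastq` and the demo record are hypothetical.

```python
import io


def read_fastq(handle, offset=33):
    """Yield (name, sequence, qualities) for each 4-line FASTQ record.

    offset=33 is an assumption about the quality encoding.
    Reading stops at end of input or at the first empty header line.
    """
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # separator line starting with '+'
        qual = handle.readline().strip()
        if not header.startswith("@"):
            raise ValueError(f"malformed header: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"length mismatch in record {header!r}")
        yield header[1:], seq, [ord(c) - offset for c in qual]


# Made-up 4-base record reusing quality characters seen in the example above.
example = io.StringIO("@demo_read\nACGT\n+\nI4/:\n")
for name, seq, quals in read_fastq(example):
    print(name, seq, quals)
```

Under the offset-33 assumption, `I` decodes to 40, and `4`, `/`, `:` decode to 19, 14 and 25, so the lower values at the 3' end of the second and third reads above would indicate less confident base calls.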