Trace formatting: Difference between revisions
Revision as of 19:26, 21 January 2008
Sanger
454 (single reads)
Anomalies:
* homopolymer lengths can be called shorter than they really are
* substitutions are less likely than in traditional methods
* single base insertions
* carry forward events, usually near but not adjacent to homopolymers
GS20
Plate information:
* 1.6M total wells
* 450K detectable wells
* 200K usable wells
Accuracy:
* The published per-base accuracy of a Roche GS20 is only 96%.
* Mitch Sogin paper:
* 99.5% accuracy rate in unassembled sequences
* identified several factors that can be used to remove a small percentage of low-quality reads, improving the accuracy to 99.75% or better => better quality than Sanger sequencing
* The error rate, defined as the number of errors (miscalled bases plus inserted and deleted bases) divided by the total number of expected bases, was 0.49%
* 36% insertions, 27% deletions, 21% N's, 16% substitutions
* The transitions A to G and T to C were more frequent than other mismatches
* The reverse transitions, G to A and C to T, were not that frequent
* Nearly 70% of the homopolymer extensions were A/T
* Errors were evenly distributed along the length of the reference sequences, but they were not evenly distributed among reads: 82% had no errors, 93% had no more than a single error, and 96% had no more than 2 errors.
* A small number of reads, fewer than 2%, contained a disproportionate number of errors that account for nearly 50% of the miscalls for the entire dataset
* Avg quality is 25; in homopolymers it can drop as low as 5
* Reads much longer than the avg length had more errors
* The presence of even a single ambiguous base call in a read correlates strongly with the presence of other errors in that read
* Primer errors also correlated with other errors
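The read-level observations above suggest a simple pre-filter. The following is a minimal sketch, not the procedure used in the paper: it drops reads that contain an ambiguous base (N) or that are much longer than the average read. The function name `filter_reads` and the `max_length_ratio` cutoff of 1.5 are illustrative assumptions, not values from the source.

```python
from statistics import mean


def filter_reads(reads, max_length_ratio=1.5):
    """Split reads into (kept, dropped) lists of (name, sequence) pairs.

    reads: list of (name, sequence) tuples.
    max_length_ratio: hypothetical cutoff; a read longer than this multiple
    of the mean read length is dropped.
    """
    if not reads:
        return [], []
    avg_len = mean(len(seq) for _, seq in reads)
    kept, dropped = [], []
    for name, seq in reads:
        has_ambiguous = "N" in seq.upper()               # any ambiguous base call
        too_long = len(seq) > max_length_ratio * avg_len  # much longer than average
        if has_ambiguous or too_long:
            dropped.append((name, seq))
        else:
            kept.append((name, seq))
    return kept, dropped
```

A primer check could be added the same way, as one more boolean test per read.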
454 (paired ends)
Features:
* approximately 84-nucleotide DNA fragments
* have a ~44-mer linker sequence in the middle
* flanked by a ~20-mer sequence on each side
* The two flanking 20-mers are segments of DNA that were originally located approximately 2.5 kb apart in the genome of interest.
* The ordering and orienting of contigs generates scaffolds which provide a high-quality draft sequence of the genome.
Anomalies:
* the linker can appear more than once (in tandem, completely or partially)
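The tag/linker/tag layout can be illustrated with a short sketch that splits a read on its linker. The linker sequence is not given here, so it is passed in as a parameter, and the name `split_paired_end` is hypothetical. Tandem copies are handled by cutting at the first and last occurrence; partial linker copies are not detected by this exact-match approach and would need approximate matching.

```python
def split_paired_end(read, linker):
    """Split a paired-end read into its two flanking tags.

    Returns (left_tag, right_tag), or None if the linker is not found.
    The left tag ends at the first linker occurrence and the right tag
    starts after the last one, so tandem linker copies are skipped.
    Partial linker copies are NOT recognized (exact matching only).
    """
    first = read.find(linker)
    if first == -1:
        return None
    last = read.rfind(linker)
    left_tag = read[:first]
    right_tag = read[last + len(linker):]
    return left_tag, right_tag
```

For an ~84-nucleotide read with a ~44-mer linker, both returned tags would be expected to be about 20 bases long; tags that are much shorter or longer may point to one of the anomalies listed above.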
Links:
1_paired_end.pdf
Solexa/Illumina
Links:
* Strep suis Solexa data set for download at Sanger
* NCBI Solexa example data set
* ismb2007Poster.pdf
* Smith_Rennes_2007.pdf
Software:
Staden & Io_lib
* IO_LIB package /fs/sz-user-supported/common/packages/io_lib-1.11-x86_64/bin/
* STADEN package /fs/sz-user-supported/common/packages/staden-src-1-7-0/distrib/unix-rel-1-7-0/linux-bin
[http://maq.sourceforge.net/maq-man.shtml#download MAQ Sanger assembler]
[http://www.bioperl.org/wiki/FASTQ_sequence_format FASTQ sequence format]
Illumina 1G :
* ~40 million DNA sequencing reactions
* about 36 hours for a run
* each sequence is up to 36 bases long
Example:
1 lane of Solexa reads: 10,959 READS; all are 36 bp
$ solexa2srf s_8_0100_seq.txt ; mv traces.srf s_8_0100.srf
$ srf2fastq s_8_0100.srf > s_8_0100.fastq
@s_8_100_293_551
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@s_8_100_35_698
TATATGATTGACAATATAAAAATATGAGTATAAAAT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII4/:I
@s_8_100_880_947
TTATTATCTTTATTGACGTACCTCTAGAAGACCCAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;>1
...
Edge effect: N's have quality -14
$ cat s_8_0100_seq.txt | sort -nk3 -nk4
8 100 0 37 ......AT.AT...TAATCAATA..GA.GAAG....
...
8 100 1003 959 AGTC.......T.C.........GT.........AA
$ more traces.qual
...
>s_8_100_0_37
-14 -14 -14 -14 -14 -14 25 13 -14 25 25 -14 -14 -14 25 25 25 25 22 25 25 25 25 -14 -14 25 25 -14 25 -11 25 14 -14 -14 -14 -14
...
>s_8_100_1003_959
25 25 25 25 -14 -14 -14 -14 -14 -14 -14 25 -14 25 -14 -14 -14 -14 -14 -14 -14 -14 -14 25 -10 -14 -14 -14 -14 -14 -14 -14 -14 -14 8 25
...
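To quantify the edge effect, one could compute the fraction of N-quality positions per read and compare it with the read's position on the tile. The sketch below makes two assumptions: traces.qual is a FASTA-style file with a `>name` line followed by whitespace-separated integer scores, and read names follow the pattern s_<lane>_<tile>_<x>_<y>, as the matching `_seq.txt` columns suggest. All function names are illustrative.

```python
def parse_qual(lines):
    """Yield (read_name, [int scores]) from FASTA-style quality lines."""
    name, scores = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, scores
            name, scores = line[1:], []
        else:
            scores.extend(int(tok) for tok in line.split())
    if name is not None:
        yield name, scores


def n_fraction(scores, n_quality=-14):
    """Fraction of positions carrying the quality assigned to N calls."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s == n_quality) / len(scores)


def tile_position(read_name):
    """Return (x, y), assuming names look like s_<lane>_<tile>_<x>_<y>."""
    parts = read_name.split("_")
    return int(parts[-2]), int(parts[-1])
```

Usage, assuming the file is in the current directory:

```python
with open("traces.qual") as fh:
    for name, scores in parse_qual(fh):
        x, y = tile_position(name)
        print(name, x, y, round(n_fraction(scores), 2))
```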
# bioperl script to convert seq formats
$ seqconvert.PLS --from fastq --to fasta < s_8_0100.fastq
# get fastq qualities
$ more *fastq | grep -A 1 "^+" | grep -v ^+ | grep -v -- ^-- | perl -ane '@F=split //,$F[0]; foreach (@F) { $n=ord($_)-33; print $n," ";} print "\n";'
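The grep-based one-liner selects lines by their prefix, which can misfire when a quality string itself begins with `+` or `-`. A sketch of the same decoding that reads records positionally instead, assuming four-line FASTQ records and the same offset of 33 used in the one-liner (the function name `fastq_qualities` is illustrative):

```python
def fastq_qualities(lines, offset=33):
    """Yield (read_name, [scores]) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)   # sequence line (unused here)
        plus = next(it, None)  # separator line
        qual = next(it, None)  # quality line
        if qual is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], [ord(c) - offset for c in qual.rstrip("\r\n")]
```

Usage, assuming the file produced by srf2fastq in the example above:

```python
with open("s_8_0100.fastq") as fh:
    for name, scores in fastq_qualities(fh):
        print(name, " ".join(str(s) for s in scores))
```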