Revision as of 15:35, 4 December 2008

Articles

Sequencing of natural strains of Arabidopsis thaliana with short reads (Illumina)

Technologies

Latest Technology Summary

 Technology            454                         Illumina                            SOLiD

 Method                seq-by-synthesis            seq-by-synthesis
 Company               454 (Roche)                 Illumina                            ABI
 Location              Branford, CT                San Diego, CA
 Latest                GS FLX, Titanium reagents   Genome Analyzer II                  SOLiD 3

 Throughput            500M/run                    1G/run                              20G/run
 RunTime               10 hr                       3 days
 ReadLen               500 bp                      36 bp                               35 bp
 InsertLen             3 kb                        100-200 bp                          600 bp-10 kb

 Accuracy                                                                              99.94%
 Q20 (99% accuracy)    400 bp                      34 bp
 Cost                                              $3K/run                             $60K/3G
                                                   $400/4M bacterial genome (25-30X)
 Problems              homopolymers

 DataSets              Watson's genome

 AlignmentProg                                     BFAST                               BFAST
                                                   MAQ                                 MAQ
                                                                                       CORONA

Sanger

454 : Pyrosequencing

454_Life_Sciences wikipedia

Anomalies:

 * homopolymer lengths can be called shorter than the true length
 * substitutions are less likely than in traditional methods
 * single base insertions
 * carry forward events usually occur near but not adjacent to homopolymers

GS20

 * 1.6M total wells
 * 450K detectable wells
 * 200K usable wells

 Accuracy:
 * published per-base accuracy of a Roche GS20 is only 96%.
 * Mitch Sogin paper
   * 99.5% accuracy rate in unassembled sequences
   * identified several factors that can be used to remove a small percentage of low-quality reads, improving the accuracy to 99.75% or better => better quality than Sanger sequencing
   * the error rate, defined as the number of errors (miscalled bases plus inserted and deleted bases) divided by the total number of expected bases, was 0.49%
   * 36% insertions, 27% deletions, 21% N's, 16% substitutions
   * the transitions A to G and T to C were more frequent than other mismatches
   * the reverse transitions, G to A and C to T, were less frequent
   * nearly 70% of the homopolymer extensions were A/T
   * errors were evenly distributed along the length of the reference sequences, but not evenly distributed among reads: 82% had no errors, 93% had no more than a single error, and 96% had no more than 2 errors
   * a small number of reads, fewer than 2%, contained a disproportionate number of errors that account for nearly 50% of the miscalls for the entire dataset
   * avg quality is 25; in homopolymers it can drop as low as 5
   * reads much longer than the avg length had more errors
   * the presence of even a single ambiguous base call (N) in a read correlates strongly with the presence of other errors in that read
   * primer errors also correlated with other errors in the read
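
The error-rate definition quoted above (miscalled plus inserted plus deleted bases, divided by the total number of expected bases) can be written as a small function. This is only a sketch: the function names are made up for this page, and the counts in the usage line are hypothetical numbers chosen to land on 0.49%, not values taken from the paper.

```python
def error_rate(miscalled, inserted, deleted, expected_bases):
    """Errors per expected base: (miscalls + insertions + deletions) / expected."""
    if expected_bases <= 0:
        raise ValueError("expected_bases must be positive")
    return (miscalled + inserted + deleted) / expected_bases


def accuracy(rate):
    """Per-base accuracy is the complement of the error rate."""
    return 1.0 - rate


# Hypothetical counts, for illustration only.
rate = error_rate(miscalled=160, inserted=200, deleted=130, expected_bases=100_000)
print(f"error rate = {rate:.2%}, accuracy = {accuracy(rate):.2%}")
```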

GS FLX

GS FLX with Titanium reagents

 * up to 500M/run
 * reads up to 500bp

Get info from .sff files:

 $ sffinfo -h
 Usage:  sffinfo [options...] [- | sfffile] [accno...]
 Options:
      -a or -accno      Output just the accessions
      -s or -seq      Output just the sequences
      -q or -qual     Output just the quality scores
      -f or -flow     Output just the flowgrams
      -t or -tab      Output the seq/qual/flow as tab-delimited lines
      -n or -notrim   Output the untrimmed sequence or quality scores
      -m or -mft      Output the manifest text

un-paired reads

paired ends

Features:

 * approximately 84-nucleotide DNA fragments 
 * have a ~ 44-mer linker sequence in the middle 
 * flanked by a ~ 20-mer sequence on each side. 
 * The two flanking 20-mers are segments of DNA that were originally located approximately 2.5 (3?) kb apart in the genome of interest.  
 * The ordering and orienting of contigs generates scaffolds which provide a high-quality draft sequence of the genome.

 Linker(palindrome) : GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
 Check for linker   : sffinfo -s *.sff | ~/bin/fasta2tab.pl | grep GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
 
 12345678901234567890123456789012345678901234
 GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC

 GTTGGAACCGA
 AAGGGTTTGAA
 TTCAAACCCTT
 TCGGTTCCAAC

Anomalies:

 * the linker can appear more than once (in tandem, complete or partial)
 * some reads end inside the linker (partial linker)
 * some reads don't contain the linker at all
 * some reads are cloning vector
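
A minimal sketch of how a paired-end read could be split on the linker, and how the palindrome property can be checked. It assumes exact linker matches only; the function names and the `min_flank` parameter are invented for this example. Reads with no linker or with several linker copies are rejected, and partial linkers are not detected at all (that would need approximate matching).

```python
LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_paired_end(read, linker=LINKER, min_flank=1):
    """Return (left, right) flanks around exactly one linker copy, else None.

    None is returned when the linker is absent, occurs more than once,
    or when either flank is shorter than min_flank bases.
    """
    if read.count(linker) != 1:
        return None
    left, right = read.split(linker)
    if len(left) < min_flank or len(right) < min_flank:
        return None
    return left, right


# The linker equals its own reverse complement, so its orientation in a read
# does not matter when searching for it.
print(reverse_complement(LINKER) == LINKER)

# Hypothetical read: two 20-mers around one linker copy.
read = "ACGTACGTACGTACGTACGT" + LINKER + "TTGACCATGGCATCGATCGA"
print(split_paired_end(read))
```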

Links:

 1_paired_end.pdf

File location:

 /fs/szdata/454p/

Solexa/Illumina : Sequencing by Synthesis

Platforms:

 * Genome Analyzer  (GA)
 * Genome Analyzer II : faster, higher throughput
 * Future: 10GB/run  50bp reads
 * Future: 20GB/run 100bp reads

Data sets:

 Strep suis Solexa data set for download at Sanger
 Staphylococcus aureus strain MW2 (edena paper)
 NCBI Solexa example data set
 Pseudomonas aeruginosa
 Pseudomonas syringae

 human HapMap individual NA12878  SRR000921..SRR001306

Applications:

 * Gene Expression
 * ChIPSeq (high throughput)
 * Re-sequencing
 * mRNA sequencing

Software:

 Staden & Io_lib
 * IO_LIB package /fs/sz-user-supported/common/packages/io_lib-1.11-x86_64/bin/
 * STADEN package /fs/sz-user-supported/common/packages/staden-src-1-7-0/distrib/unix-rel-1-7-0/linux-bin
 
 MAQ Sanger assembler
 FASTQ sequence format

Illumina 1G :

 * ~40 Million DNA sequencing reactions
 * about 36 hours for a run
 * each sequence is up to 36 bases long
 * insert len=~200bp

Illumina Genome Analyzer II:

 * up to 51 bp
 * mate-pairs: opposite directions, slight overlap (the insert size is less than the "advertised" 200bp)
 * on the SRA, mate-pairs are joined; when downloaded, only one read is shown. What about the mate pair?

SRA: set of 4 files

 *_seq.txt  : lane, tile, well(x,y), sequence
 *_prb.txt  : max quality from each group of 4 values is taken as quality
 *_sig2.txt : lane, tile, well(x,y); max signal from each group of 4 values corresponds to max quality
 *_qhg.txt  : lane, tile, well(x,y); some encoded info?

 # *_seq.txt
 5       1       1269    1795    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
 # *_prb.txt
 40  -40  -40  -40       40  -40  -40  -40  ...
 # *_sig2.txt
 5       1       1269    1795    2594.0 2367.0  -10.0  -96.0 ...
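
The rule "max quality from each group of 4 values is taken as quality" can be sketched as below. The code assumes that the four values of each group are in A, C, G, T order and that the line is whitespace-separated; both are assumptions of this example, as is the function name.

```python
BASES = "ACGT"  # assumed order of the four values within each group


def call_from_prb(line):
    """Turn one *_prb.txt line into (sequence, per-base qualities).

    For every cycle the base with the highest value is called and that
    value is kept as the quality; ties go to the first base in BASES.
    """
    values = [int(v) for v in line.split()]
    if len(values) % 4 != 0:
        raise ValueError("expected a multiple of 4 values per line")
    seq, quals = [], []
    for i in range(0, len(values), 4):
        group = values[i:i + 4]
        best = max(range(4), key=group.__getitem__)
        seq.append(BASES[best])
        quals.append(group[best])
    return "".join(seq), quals


print(call_from_prb("40  -40  -40  -40       40  -40  -40  -40"))
```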

Qualities:

 Range : -5..40
 Avg   : ~25, depending on the data set
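
Negative values are possible because the Solexa scale is a log-odds score rather than a Phred score. The usual conversion between the two is Q_phred = 10 * log10(10^(Q_solexa/10) + 1); a short sketch follows. The function names are made up here, and rounding to the nearest integer before adding the FASTQ offset of 33 is an assumption of this example.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa log-odds quality to the Phred scale."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def phred_to_fastq_char(q_phred):
    """Encode a Phred quality as a Sanger-style FASTQ character (offset 33)."""
    return chr(int(round(q_phred)) + 33)


for q in (-5, 0, 10, 25, 40):
    p = solexa_to_phred(q)
    print(q, round(p, 2), phred_to_fastq_char(p))
```

The two scales differ mostly at the low end and are practically identical for high qualities.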

Fastq format

Maq help

Example:

 1 lane of Solexa reads: 10,959 READS; all are 36 bp
 $ /fs/sz-user-supported/common/packages/io_lib-x86_64/bin/solexa2srf s_8_0100_seq.txt  ; mv traces.srf  s_8_0100.srf
 $ /fs/sz-user-supported/common/packages/io_lib-x86_64/bin/srf2fastq s_8_0100.srf > s_8_0100.fastq

   @s_8_100_293_551
   CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACC
   +
   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
   @s_8_100_35_698
   TATATGATTGACAATATAAAAATATGAGTATAAAAT
   +
   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII4/:I
   @s_8_100_880_947
   TTATTATCTTTATTGACGTACCTCTAGAAGACCCAA
   +
   IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII;>1
   ...

 Edge effect: 
 N's have quality -14

 $ sort -k3,3n -k4,4n s_8_0100_seq.txt
 8       100     0       37      ......AT.AT...TAATCAATA..GA.GAAG....
 ...
 8       100     1003    959     AGTC.......T.C.........GT.........AA

 $ more traces.qual
 ...
 >s_8_100_0_37
 -14 -14 -14 -14 -14 -14 25 13 -14 25 25 -14 -14 -14 25 25 25 25 22 25 25 25 25 -14 -14 25 25 -14 25 -11 25 14 -14 -14 -14 -14 
 ...
 >s_8_100_1003_959
 25 25 25 25 -14 -14 -14 -14 -14 -14 -14 25 -14 25 -14 -14 -14 -14 -14 -14 -14 -14 -14 25 -10 -14 -14 -14 -14 -14 -14 -14 -14 -14 8 25
 ...

 # bioperl script to convert seq formats
 $ seqconvert.PLS --from fastq --to fasta < s_8_0100.fastq
 
 # get fastq qualities
 $ more *fastq | grep -A 1 "^+" | grep -v ^+ | grep -v -- ^-- | perl -ane '@F=split //,$F[0]; foreach (@F) { $n=ord($_)-33; print $n," ";} print "\n";'

 # convert Solexa format (maq fq_all2std.pl script)
 $ fq_all2std.pl seqprb2std s_5_0001_seq.txt s_5_0001_prb.txt > s_5_001.fastq
 $ fq_all2std.pl fq2fa s_5_001.fastq > s_5_001.seq
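
As an alternative to the grep/perl pipeline above, the qualities can be decoded record by record. This sketch assumes strict 4-line FASTQ records with an offset of 33, as in the example output shown earlier; the function name is made up. Reading whole records avoids misreading a quality line that happens to start with "+" or "@".

```python
def read_fastq(lines):
    """Yield (name, sequence, qualities) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        # A truncated record yields empty strings and fails the checks below.
        seq, plus, qual = (next(it, "").rstrip("\n") for _ in range(3))
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Hypothetical 4-base record, for illustration only.
record = ["@r1\n", "ACGT\n", "+\n", "I4/:\n"]
for name, seq, quals in read_fastq(record):
    print(name, seq, " ".join(str(q) for q in quals))
```

For a real file, pass the open file handle instead of the list, e.g. `read_fastq(open("s_8_0100.fastq"))`.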

SOLiD

ABI SOLiD
article
Tools & Data Sets
color space (0123) => base space (ACGT)
.csfasta file : in color space; starts with a known base (usually T)
low error rate (higher accuracy than Illumina)
2008: 4G run, read_len=35bp; insert=3Kbp (old)
2009: 9G run, read_len=50bp;

SOLiD™ 3 System generates (Oct 1 2008)
- over 20 gigabases
- mate-paired libraries with insert sizes ranging from 600 bp up to 10 kbp
- human genome for less than $60,000.

uniform bases quality
accuracy greater than 99.94%
because of double base interrogation & high coverage, qualities can be "discarded"

Example:

 >1_88_1830_R3
 G32113123201300232320
 >1_89_1562_R3
 G23133131233333101320
 ...

Alignment matrix

Examples:

  AA is encoded as 0
  CG is encoded as 3
  AACG is encoded as 0 1 3
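
The encodings above follow from numbering the bases A=0, C=1, G=2, T=3 and taking the XOR of each pair of adjacent bases. A minimal sketch with made-up function names:

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def encode_colors(bases):
    """Color of each adjacent base pair: XOR of the two 2-bit base codes."""
    codes = [_CODE[b] for b in bases.upper()]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    code = _CODE[first_base.upper()]
    out = [_BASE[code]]
    for c in colors:
        code ^= c  # the next base is the current base XOR the color
        out.append(_BASE[code])
    return "".join(out)


print(encode_colors("AA"), encode_colors("CG"), encode_colors("AACG"))
print(decode_colors("A", [0, 1, 3]))
```

Note that `decode_colors` returns the known first base as part of the result; for a .csfasta read, the caller may want to drop that leading base.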

Features of Color space:

 * Color space data are self-complementary

   Example:
       Base    A G C T C G T C G T G C A G
       Color space 2 3 2 2 3 1 2 3 1 1 3 1 2
   
       Complemented
       Base    T C G A G C A G C A C G T C
       Color space 2 3 2 2 3 1 2 3 1 1 3 1 2

 * Two-Base Encoding and Error Recognition
   1 isolated color change: measurement error
   2 adjacent color changes consistent with a single base change: SNP

   Example (one isolated color change, i.e. a measurement error):
      Reference 2 3 2 2 3 1 2 3 1 1 3 1 2
      Observed  2 3 2 2 0 1 2 3 1 1 3 1 2
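
A sketch of how color mismatches might be sorted into likely measurement errors and SNP candidates. It relies on the fact that a single base substitution changes exactly the two colors that overlap that base while leaving their XOR unchanged. Colors are passed as digit strings; the function names are invented for this example, and real callers also use qualities and coverage.

```python
def color_mismatches(reference, observed):
    """Positions (0-based) where two equal-length color strings differ."""
    if len(reference) != len(observed):
        raise ValueError("color strings must have the same length")
    return [i for i, (r, o) in enumerate(zip(reference, observed)) if r != o]


def consistent_with_snp(reference, observed, i):
    """True if the changes at colors i and i+1 fit one substituted base."""
    r = int(reference[i]) ^ int(reference[i + 1])
    o = int(observed[i]) ^ int(observed[i + 1])
    return r == o


def classify(reference, observed):
    """Return (isolated mismatch positions, SNP candidate position pairs)."""
    positions = color_mismatches(reference, observed)
    isolated, snp_candidates = [], []
    i = 0
    while i < len(positions):
        p = positions[i]
        is_pair = i + 1 < len(positions) and positions[i + 1] == p + 1
        if is_pair and consistent_with_snp(reference, observed, p):
            snp_candidates.append((p, p + 1))
            i += 2
        else:
            isolated.append(p)
            i += 1
    return isolated, snp_candidates


# The example above: one isolated change, no SNP candidate.
print(classify("2322312311312", "2322012311312"))
# AGCTCG vs AGATCG (C -> A): two adjacent, consistent color changes.
print(classify("23223", "22323"))
```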

Helicos

Pacific Biosciences

Visigen

Download

From online database

Example:

 >gi|45439865|ref|NC_005810.1| Yersinia pestis biovar Microtus str. 91001, complete genome
 TCGCGCGATCTTTGAGCTAATTAGAGTAAATTAATCCAATCTTTGACCCAAATCTCTGCTGGATCCTCTG
 GTATTTCATGTTGGATGACGTCAATTTCTAATATTTCACCCAACCGTTGAGCACCTTGTGCGATCAATTG
 ...

Bioperl scripts:

 /fs/sz-user-supported/common/bin/

 bp_fetch.pl net::genbank:NC_005810.1 > NC_005810.1
 bp_fetch.pl net::genbank:NC_005810 > NC_005810
 bp_fetch.pl net::genbank:45439865 > 45439865

Format

Traces

 Example:
   ~/bin/tarchive2amos -o Ba Ba.seq                                              # TA FTP
   ~/bin/tarchive2amos -o Ba -tracedir traces/                                   # TA querytrace_db 
   ~/bin/tarchive2amos -o Ba -assembly assembly/ASSEMBLY.xml -tracedir traces/   # AA

Conversion

 Example: EMBL->FASTA
   ~/bin/readseq.sh -f Fasta -o prefix.fasta prefix.embl
   bp_sreformat.pl -i prefix.embl -o prefix.fasta -if EMBL -of Fasta

Alignments

Whole genomes alignments

BLAT
BLASTZ, Post-processing long pairwise alignments article. decom program
LAGAN
MUMMER
AVID

Consensus calling and Structural variation

Consensus generation and variant detection by Celera Assembler
Detecting Structural Variations, Brudno et al.

Read Mapping Software

Short-Read_Sequence_Alignment programs

BFAST

Need to e-mail the author to get the code

BLAT

BLAT—The BLAST-Like Alignment Tool, Genome Research 2002
FAQ
Can align any type of reads
Can do nt:aa translation
Command: blat

 blat -noHead -t=dna  -q=dna  -tileSize=10 -stepSize=3 Pa.1con    Pa.seq    Pa.blat

MAQ *

Maq Sourceforge
Maq Poster from Sanger
Supports Illumina-Solexa and AB-SOLiD reads, not 454 or capillary reads
Uses FASTQ format
Command: maq map ...
does ungapped alignment on unpaired reads

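 # Solexa: map the reads and call the consensus in one step
 maq.pl easyrun -d . ref.1con reads.fastq

 # SOLiD: convert reads and reference, map in color space, convert back
 solid2fastq.pl reads_ shortname                             # SOLiD reads -> FASTQ
 maq fastq2bfq shortname.fastq shortname.bfq                 # FASTQ -> binary BFQ
 maq fasta2csfa ref.fasta > ref.csfa                         # reference -> color space
 maq fasta2bfa ref.csfa ref.csbfa                            # color-space reference -> binary
 maq fasta2bfa ref.fasta ref.bfa                             # nucleotide reference -> binary
 maq map -c aln.cs.map ref.csbfa shortname.bfq 2> aln.log    # map in color space
 maq csmap2nt aln.nt.map ref.bfa aln.cs.map                  # color alignments -> nucleotide space
 maq assemble cns.cns ref.bfa aln.nt.map 2> cns.log          # build the consensus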

RMAP

RMAP : designed for Illumina-Solexa
Command: rmap

 rmap         -m 3 -w 33                            -c Pa.1con    Pa.seq -o Pa.rmap

SHRiMP

Web site
Commands: rmapper-cs , rmapper-ls, ...

SeqMap

SeqMap developed at Stanford
allows up to five mixed substitutions and inserted/deleted nucleotides in the mapping
allows sequences to contain N’s, and to have unequal lengths

 ./seqmap
 Usage: seqmap <number of mismatches> <probe FASTA file name> <transcript FASTA file name> <output file name> [options]
 
 Parameters:
 <number of mismatches>                          maximum edit distance allowed
 <probe FASTA file name>                         probe/tag/read sequences
 <transcript FASTA file name>                    reference sequences
 <output file name>                              name of the output file
 ...

SHORE

SHORE

SOAP *

Web site (China)
Format of output
SOAP: short oligonucleotide alignment program, Bioinformatics Jan 2008
Commands: soap, soap.contig, soap_dealign, soap.huge, soap.short
can use qualities, do read trimming, use paired ends, RNA alignments

 soap         -v 5                                  -d Pa.1con -a Pa.seq -o Pa.soap

SOCS

ABI color space

 $ socs socs.pref
 $ more socs.pref
 Req.fa
 Seq_F3.csfasta
 Seq_F3_QV.qual
 out_prefix
 2
 1000
 2
 false
 true
 0

SOLiD

SOLID System Analysis Pipeline Tool (Corona Lite)

SSAHA

Web site(Sanger)
Focused on exact, nearly exact matches
Does not find all the exact matches???
Example: Solexa 33bp ~30% of reads are not found

ZOOM

ZOOM

Genome Resequencing

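The complete genome of an individual by massively parallel DNA sequencing (J.Watson's genome) Nature April 2008
J.Watson's genome (supplementary info)
Whole-genome sequencing and variant discovery in C. elegans Nature Methods Jan 2008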

Links

1000 genomes
Ben's web site 1 (http://www.cbcb.umd.edu/~langmead/solexa_1000genomes.html)
Ben's web site 2 (http://www.cbcb.umd.edu/~langmead/solexa_format.html)

Chip-Seq @ Wikipedia

SRA
SRA FTP

Data

Solexa

Strep suis Solexa at Sanger 36bp, ~49X coverage
Staphylococcus aureus strain MW2 (edena paper) 35bp, ~47X coverage
Pseudomonas aeruginosa: 33bp, ~43X coverage
Pseudomonas syringae: 32bp, ~31X coverage
1000 Genomes (June 14th 2008): 47bp

 Accession       #Runs   Instrument                      Center  Study                          [Individual]
 SRA000303       41      Solexa 1G Genome Analyzer       BI      1000Genomes Project Pilot 2     NA12878
 SRA000304       49      Solexa 1G Genome Analyzer       BI      1000Genomes Project Pilot 2     NA12891
 SRA000305       56      Solexa 1G Genome Analyzer       BI      1000Genomes Project Pilot 2     NA12892
 SRA000307       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA10851
 SRA000308       2       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA11993
 SRA000309       3       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA11995
 SRA000310       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12006
 SRA000311       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12044
 SRA000312       2       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12156
 SRA000313       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12414
 SRA000314       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12776
 SRA000315       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12828
 SRA000316       12      Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 2     NA12878
 SRA000317       8       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 2     NA12891
 SRA000318       14      Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 2     NA12892
 SRA000319       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12004

June 14th 2008 - Sept 19th 2008

 SRA001100       23      Illumina Genome Analyzer        BGI     1000Genomes Project Pilot 2     NA19240
 ...
 SRA002029       1       Illumina Genome Analyzer II     WUGSC   1000Genomes Project Pilot 2     NA19239

 /fs/szdata/Solexa/1000genomes

Example SRR001113.seq :

 7,058,926 47 bp sequences
 2,402,398 contain at least 1 '.'
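
The two counts above mean that about a third of the reads contain at least one uncalled base. A small sketch of the counting and of the resulting fraction; it assumes the reads are available as plain strings (one read per string) with '.' marking a no-call, and the function name is made up.

```python
def count_no_call_reads(reads, no_call="."):
    """Return (reads containing a no-call, total reads)."""
    total = with_no_call = 0
    for read in reads:
        total += 1
        if no_call in read:
            with_no_call += 1
    return with_no_call, total


# Hypothetical reads, for illustration only.
print(count_no_call_reads(["ACGT", "AC.T", "....", "TTGA"]))

# Fraction for the SRR001113.seq counts quoted above.
print(f"{2_402_398 / 7_058_926:.1%}")
```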

454

1000 Genomes

June 14th 2008

 Accession       #Runs   Instrument      Center  Study                           [Individual]
 SRA000302       121     454 GS FLX      BCM     1000Genomes Project Pilot 2     NA12878
 SRA001032       2       454 GS FLX      BCM     1000Genomes Project Pilot 2     NA12878
 SRA001036       1       454 GS FLX      BCM     1000Genomes Project Pilot 1     NA12812
 SRA001094       1       454 GS FLX      BCM     1000Genomes Project Pilot 2     NA12878

June 14th 2008 - Sept 19th 2008

 SRA001037       2       454 GS FLX      BCM     1000Genomes Project Pilot 1     NA12812
 ...
 SRA001819       1       454 GS FLX      BCM     1000Genomes Project Pilot 2     NA12878

Refseq

/fs/szdata/genomes/human_ncbi_build36/ NCBI build36.1 May 2006 (Current build is 36.3 March 2008)
/fs/szdata/genomes/human_celera_2001_Orig/


Trace formatting: Difference between revisions
