Short read sequencing: Difference between revisions

(53 intermediate revisions by 2 users not shown)
= Articles =
DELETED !!!


== 06/13/2008 ==
= Consensus calling and Structural variation =


* [http://www.nature.com/nature/journal/v452/n7189/pdf/nature06884.pdf The complete genome of an individual by massively parallel DNA sequencing (J.Watson's genome) Nature April 2008]
* [http://www.ncbi.nlm.nih.gov/pubmed/18321888 Consensus generation and variant detection by Celera Assembler.]
* [http://www.nature.com/nature/journal/v452/n7189/extref/nature06884-s1.pdf J.Watson's genome (supplementary info) ]
* [http://compbio.cs.toronto.edu/structvar/ Detecting Structural Variations, Brudno et al. ]
 
= Read Mapping Software =


* [http://www.nature.com/nmeth/journal/v5/n2/full/nmeth.1179.html;jsessionid=DC518BCD8B2CACAE8AFFF7F70DD46902 Whole-genome sequencing and variant discovery in C. elegans Nature Jan 2008]
* [http://en.wikipedia.org/wiki/Sequence_alignment_software#Short-Read_Sequence_Alignment Short-Read_Sequence_Alignment programs]


== 06/20/2008 ==
== BFAST ==
* Need to e-mail the author to get the code


== BLAT ==
* [http://www.genome.org/cgi/reprint/GR-2292Rv1 BLAT—The BLAST-Like Alignment Tool, Genome Research 2002]
* [http://genome.ucsc.edu/FAQ/FAQblat FAQ]
* Can align any type of reads
* Can do nt:aa translation
* Command: blat
  blat -noHead -t=dna  -q=dna  -tileSize=10 -stepSize=3 Pa.1con    Pa.seq    Pa.blat
 
== MAQ * ==
* [http://maq.sourceforge.net/ Maq Sourceforge]
* [http://www.sanger.ac.uk/Users/lh3/maq-poster.pdf Maq Poster from Sanger]
* Supports Illumina-Solexa and AB SOLiD reads, but not 454 or capillary reads
* Uses FASTQ format
* Command: maq map ...
* Does ungapped alignment on unpaired reads
 
  SOLEXA
  maq.pl easyrun -d . ref.1con reads.fastq
 
  SOLID
  solid2fastq.pl reads_ shortname
  maq fastq2bfq shortname.fastq shortname.bfq
  maq fasta2csfa ref.fasta > ref.csfa
  maq fasta2bfa ref.csfa ref.csbfa
  maq fasta2bfa ref.fasta ref.bfa
  maq map -c aln.cs.map ref.csbfa shortname.bfq 2> aln.log
  maq csmap2nt aln.nt.map ref.bfa aln.cs.map
  maq assemble cns.cns ref.bfa aln.nt.map 2> cns.log
 
== RMAP ==
* [http://rulai.cshl.edu/rmap/ RMAP] : designed for Illumina-Solexa
* Command: rmap
  rmap        -m 3 -w 33                            -c Pa.1con    Pa.seq -o Pa.rmap
 
== SHRiMP ==
* [http://compbio.cs.toronto.edu/shrimp/ Web site]
* Commands: rmapper-cs , rmapper-ls, ...


== SeqMap ==
* [http://biogibbs.stanford.edu/~jiangh/SeqMap/ SeqMap] developed at Stanford
* allows up to five mixed substitutions and inserted/deleted nucleotides in the mapping
* allows sequences to contain N’s, and to have unequal lengths


  ./seqmap
  Usage: seqmap <number of mismatches> <probe FASTA file name> <transcript FASTA file name> <output file name> [options]
 
  Parameters:
  <number of mismatches>                          maximum edit distance allowed
  <probe FASTA file name>                        probe/tag/read sequences
  <transcript FASTA file name>                    reference sequences
  <output file name>                              name of the output file
  ...


== SHORE ==
* [http://1001genomes.org/downloads/ SHORE]  


== SOAP * ==
* [http://soap.genomics.org.cn/ Web site (China)]
* [http://soap.genomics.org.cn/#Formatofoutput Format of output]
* [http://soap.genomics.org.cn/SOAP_paper.pdf SOAP: short oligonucleotide alignment program, Bioinformatics Jan 2008]
* Commands: soap, soap.contig, soap_dealign, soap.huge, soap.short
* Can use quality values, trim reads, use paired ends, and do RNA alignments
  soap        -v 5                                  -d Pa.1con -a Pa.seq -o Pa.soap
== SOCS ==
* ABI color space
  socs socs.pref
  more socs.pref
  Req.fa
  Seq_F3.csfasta
  Seq_F3_QV.qual
  out_prefix
  2
  1000
  2
  false
  true
  0
== SOLiD ==
* [http://solidsoftwaretools.com/gf/project/corona/ SOLID System Analysis Pipeline Tool (Corona Lite)]
== SSAHA ==
* [http://www.sanger.ac.uk/Software/analysis/SSAHA/ Web site (Sanger)]
* Focused on exact and nearly exact matches
* Apparently does not find all exact matches (to be confirmed)
* Example: for Solexa 33 bp reads, ~30% of reads are not found
== ZOOM  ==
* [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/21/2431 ZOOM]
= Genome Resequencing =
* [http://www.nature.com/nature/journal/v452/n7189/pdf/nature06884.pdf The complete genome of an individual by massively parallel DNA sequencing (J.Watson's genome) Nature April 2008]
* [http://www.nature.com/nature/journal/v452/n7189/extref/nature06884-s1.pdf J.Watson's genome (supplementary info) ]
* [http://www.nature.com/nmeth/journal/v5/n2/full/nmeth.1179.html;jsessionid=DC518BCD8B2CACAE8AFFF7F70DD46902 Whole-genome sequencing and variant discovery in C. elegans Nature Jan 2008]


= Links =
* [http://www.cbcb.umd.edu/~langmead/solexa_1000genomes.html Ben's web site 1]
* [http://www.cbcb.umd.edu/~langmead/solexa_format.html Ben's web site 2]
* [http://en.wikipedia.org/wiki/Chip-Sequencing ChIP-Seq @ Wikipedia]


* [http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=table&f=run&m=data&s=run SRA]
* Pseudomonas aeruginosa: 33bp, ~43X coverage
* Pseudomonas syringae:  32bp, ~31X coverage
* 1000 Genomes (June 14th 2008): 47bp


   Accession      #Runs  Instrument                      Center  Study                          [Individual]
   SRA000319      1      Solexa 1G Genome Analyzer      SC      1000Genomes Project Pilot 1    NA12004


June 14th 2008: Sept 19th 2008
  SRA001100      23      Illumina Genome Analyzer        BGI    1000Genomes Project Pilot 2    NA19240
  ...
  SRA002029      1      Illumina Genome Analyzer II    WUGSC  1000Genomes Project Pilot 2    NA19239
 
  /fs/szdata/Solexa/1000genomes
 
* Example SRR001113.seq :
  7,058,926 47 bp sequences
  2,402,398 contain at least 1 '.'
 
== 454 ==
* 1000 Genomes

June 14th 2008
  Accession      #Runs  Instrument      Center  Study                          [Individual]
  SRA000302      121    454 GS FLX      BCM    1000Genomes Project Pilot 2    NA12878
  SRA001032      2      454 GS FLX      BCM    1000Genomes Project Pilot 2    NA12878
  SRA001036      1      454 GS FLX      BCM    1000Genomes Project Pilot 1    NA12812
  SRA001094      1      454 GS FLX      BCM    1000Genomes Project Pilot 2    NA12878

June 14th 2008: Sept 19th 2008
  SRA001037      2      454 GS FLX      BCM    1000Genomes Project Pilot 1    NA12812
  ...
  SRA001819      1      454 GS FLX      BCM    1000Genomes Project Pilot 2    NA12878

== Refseq ==
* /fs/szdata/genomes/human_ncbi_build36/ NCBI build36.1 May 2006 (Current build is 36.3 March 2008)
* /fs/szdata/genomes/human_celera_2001_Orig/

= Software @ CBCB =
Under /fs/sz-user-supported/Linux-x86_64/bin/

== Denovo assembly ==
* edena
* ssake
* velveth,velvetg

== Read mapping ==
* blat
* maq
* soap

Latest revision as of 15:38, 4 December 2008

DELETED !!!

Consensus calling and Structural variation

Read Mapping Software

BFAST

  • Need to e-mail the author to get the code

BLAT

 blat -noHead -t=dna  -q=dna  -tileSize=10 -stepSize=3 Pa.1con    Pa.seq    Pa.blat

MAQ *

 SOLEXA
 maq.pl easyrun -d . ref.1con reads.fastq
 SOLID
 solid2fastq.pl reads_ shortname
 maq fastq2bfq shortname.fastq shortname.bfq
 maq fasta2csfa ref.fasta > ref.csfa
 maq fasta2bfa ref.csfa ref.csbfa
 maq fasta2bfa ref.fasta ref.bfa
 maq map -c aln.cs.map ref.csbfa shortname.bfq 2> aln.log
 maq csmap2nt aln.nt.map ref.bfa aln.cs.map
 maq assemble cns.cns ref.bfa aln.nt.map 2> cns.log
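
The SOLiD steps above convert the reference to color space (maq fasta2csfa) before mapping and bring the alignments back to nucleotide space afterwards (maq csmap2nt). As a rough illustration of what color space means, the sketch below applies the standard SOLiD two-base encoding, where each pair of adjacent bases maps to one of four colors (0-3). The function name nt2color is hypothetical, and the digit output is for illustration only; it is not claimed to match the files MAQ itself writes.

```shell
# nt2color: illustrative helper (not part of MAQ).
# Encodes a nucleotide string with the SOLiD two-base scheme:
# A=0, C=1, G=2, T=3 and color = XOR of the two adjacent base codes.
nt2color() {
  printf '%s\n' "$1" | awk '
    BEGIN {
      code["A"] = 0; code["C"] = 1; code["G"] = 2; code["T"] = 3
      # Row x of the XOR table, indexed by y (POSIX awk has no xor()).
      row[0] = "0123"; row[1] = "1032"; row[2] = "2301"; row[3] = "3210"
    }
    {
      seq = toupper($0); out = ""
      for (i = 1; i < length(seq); i++) {
        a = substr(seq, i, 1); b = substr(seq, i + 1, 1)
        if ((a in code) && (b in code))
          out = out substr(row[code[a]], code[b] + 1, 1)
        else
          out = out "."   # ambiguous base: color is undefined
      }
      print out
    }'
}

nt2color ACGT    # prints 131
nt2color ATGC    # prints 313
```

A sequence of n bases yields n-1 colors, which is why color-space reads also carry a known leading base.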

RMAP

  • RMAP : designed for Illumina-Solexa
  • Command: rmap
 rmap         -m 3 -w 33                            -c Pa.1con    Pa.seq -o Pa.rmap

SHRiMP

  • Web site
  • Commands: rmapper-cs , rmapper-ls, ...

SeqMap

  • SeqMap developed at Stanford
  • allows up to five mixed substitutions and inserted/deleted nucleotides in the mapping
  • allows sequences to contain N’s, and to have unequal lengths
 ./seqmap
 Usage: seqmap <number of mismatches> <probe FASTA file name> <transcript FASTA file name> <output file name> [options]
 
 Parameters:
 <number of mismatches>                          maximum edit distance allowed
 <probe FASTA file name>                         probe/tag/read sequences
 <transcript FASTA file name>                    reference sequences
 <output file name>                              name of the output file
 ...
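
The usage text describes the first argument as the maximum edit distance allowed, i.e. substitutions, insertions and deletions counted together. The sketch below computes that metric (Levenshtein distance) for two short strings. It only illustrates the distance itself and is not SeqMap's own algorithm; the function name edit_distance is hypothetical.

```shell
# edit_distance: illustrative helper (not part of SeqMap).
# Classic dynamic-programming Levenshtein distance, keeping two rows.
edit_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = length(a); m = length(b)
    for (j = 0; j <= m; j++) prev[j] = j          # distance from empty prefix
    for (i = 1; i <= n; i++) {
      cur[0] = i
      for (j = 1; j <= m; j++) {
        cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
        best = prev[j - 1] + cost                 # match or substitution
        if (prev[j] + 1 < best) best = prev[j] + 1      # deletion from a
        if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1  # insertion into a
        cur[j] = best
      }
      for (j = 0; j <= m; j++) prev[j] = cur[j]
    }
    print prev[m]
  }'
}

edit_distance ACGT AGGT            # prints 1 (one substitution)
edit_distance ACGTACGT ACTTACGGT   # prints 2 (one substitution, one insertion)
```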

SHORE

SOAP *

 soap         -v 5                                  -d Pa.1con -a Pa.seq -o Pa.soap

SOCS

  • ABI color space
 socs socs.pref
 more socs.pref
 Req.fa
 Seq_F3.csfasta
 Seq_F3_QV.qual
 out_prefix
 2
 1000
 2
 false
 true
 0

SOLiD

SSAHA

  • Web site (Sanger)
  • Focused on exact and nearly exact matches
  • Apparently does not find all exact matches (to be confirmed)
  • Example: for Solexa 33 bp reads, ~30% of reads are not found

ZOOM

Genome Resequencing

Links

Data

Solexa

 Accession       #Runs   Instrument                      Center  Study                          [Individual]
 SRA000303       41      Solexa 1G Genome Analyzer       BI      1000Genomes Project Pilot 2     NA12878
 SRA000304       49      Solexa 1G Genome Analyzer       BI      1000Genomes Project Pilot 2     NA12891
 SRA000305       56      Solexa 1G Genome Analyzer       BI      1000Genomes Project Pilot 2     NA12892
 SRA000307       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA10851
 SRA000308       2       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA11993
 SRA000309       3       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA11995
 SRA000310       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12006
 SRA000311       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12044
 SRA000312       2       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12156
 SRA000313       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12414
 SRA000314       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12776
 SRA000315       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12828
 SRA000316       12      Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 2     NA12878
 SRA000317       8       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 2     NA12891
 SRA000318       14      Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 2     NA12892
 SRA000319       1       Solexa 1G Genome Analyzer       SC      1000Genomes Project Pilot 1     NA12004
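
To total the #Runs column per pilot, a small awk sketch is enough. It assumes the table has been saved as plain whitespace-separated text (the file name accessions.txt is hypothetical) and that the pilot number is always the second-to-last field, as in the rows above. The function name runs_per_pilot is also hypothetical.

```shell
# runs_per_pilot: illustrative helper; sums column 2 (#Runs) per pilot.
# Only lines whose first field looks like an SRA accession are counted,
# so the header row and "..." placeholder lines are skipped.
runs_per_pilot() {
  awk '$1 ~ /^SRA[0-9]+$/ && $2 ~ /^[0-9]+$/ {
         runs["Pilot " $(NF - 1)] += $2
       }
       END { for (k in runs) print k, runs[k] }' "$1" | sort
}

# Usage (hypothetical file name):
#   runs_per_pilot accessions.txt
```

For the 16 rows listed above this gives 14 runs for Pilot 1 and 180 runs for Pilot 2.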

June 14th 2008: Sept 19th 2008

 SRA001100       23      Illumina Genome Analyzer        BGI     1000Genomes Project Pilot 2     NA19240
 ...
 SRA002029       1       Illumina Genome Analyzer II     WUGSC   1000Genomes Project Pilot 2     NA19239
 /fs/szdata/Solexa/1000genomes
  • Example SRR001113.seq :
 7,058,926 47 bp sequences
 2,402,398 contain at least 1 '.'
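
With these numbers, about 34% of the reads (2,402,398 of 7,058,926) contain at least one '.' (no-call). Counts like these can be reproduced with a sketch along the following lines. The layout of the .seq file is not described here, so the sketch assumes a FASTA-like file with one sequence per line and header lines starting with '>'; the function name count_nocalls is hypothetical.

```shell
# count_nocalls: illustrative helper.
# Prints "<total reads> <reads containing at least one '.'>".
# Assumes header lines start with ">" and each read sits on one line.
count_nocalls() {
  awk '!/^>/ && NF {
         total++
         if (index($0, ".") > 0) bad++
       }
       END { printf "%d %d\n", total, bad }' "$1"
}

# Usage:
#   count_nocalls SRR001113.seq
```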

454

  • 1000 Genomes

June 14th 2008

 Accession       #Runs   Instrument      Center  Study                           [Individual]
 SRA000302       121     454 GS FLX      BCM     1000Genomes Project Pilot 2     NA12878
 SRA001032       2       454 GS FLX      BCM     1000Genomes Project Pilot 2     NA12878
 SRA001036       1       454 GS FLX      BCM     1000Genomes Project Pilot 1     NA12812
 SRA001094       1       454 GS FLX      BCM     1000Genomes Project Pilot 2     NA12878

June 14th 2008: Sept 19th 2008

 SRA001037       2       454 GS FLX      BCM     1000Genomes Project Pilot 1     NA12812
 ...
 SRA001819       1       454 GS FLX      BCM     1000Genomes Project Pilot 2     NA12878

Refseq

  • /fs/szdata/genomes/human_ncbi_build36/ NCBI build36.1 May 2006 (Current build is 36.3 March 2008)
  • /fs/szdata/genomes/human_celera_2001_Orig/