Bumblebee: Difference between revisions

From Cbcb

Revision as of 15:41, 5 March 2010

Data

  • ~500 Mbp genome

Traces

  • 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
 Lane   Insert            ReadLen    #Reads       Coverage    Comments
 1      3K(2..6,avg 4K)   124        34,944,099   14X
 2      8K(7..9,avg 8K)   124        32,540,640   13X

 3      500(450..600)     124        34,745,750               # gDNA
 5      500                          34,601,239
 6      500                          34,553,857
 7      500                          34,682,612
 8      500                          12,975,839
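
The Coverage column for lanes 1 and 2 can be reproduced with a short calculation. This is only a sketch: it assumes a genome of about 500 Mbp, that #Reads counts read pairs, and that 100 bp of each read is usable. None of these is stated in the table; they are chosen because they give back the listed 14X and 13X.

```python
# Assumptions (not stated in the table): ~500 Mbp genome, #Reads counts
# read pairs, and 100 bp of each read is usable.
GENOME_SIZE_BP = 500_000_000
READS_PER_PAIR = 2
USABLE_READ_LEN_BP = 100


def coverage(num_pairs: int) -> float:
    """Return the fold coverage contributed by one lane."""
    return num_pairs * READS_PER_PAIR * USABLE_READ_LEN_BP / GENOME_SIZE_BP


# Pair counts for lanes 1 and 2, copied from the table.
lane_pairs = {1: 34_944_099, 2: 32_540_640}
lane_coverage = {lane: coverage(n) for lane, n in lane_pairs.items()}

for lane, cov in lane_coverage.items():
    print(f"lane {lane}: {cov:.1f}X")
```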
  • Adaptors
 >circularization
 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 >circularization.revcomp
 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
 >5
 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
 >3
 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
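
A quick sanity check on the adaptor file is to confirm that the two circularization entries really are reverse complements of each other. The sketch below does that for the two sequences listed above; the helper name `revcomp` is just for illustration.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# The two circularization adaptor entries, copied from the list above.
circ = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"
circ_rc = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"

is_consistent = revcomp(circ) == circ_rc
print("entries are reverse complements:", is_consistent)
```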

Tasks to figure out

  1. Erroneous reads/bases, which we need to correct or discard
  2. GC bias, so we can compute a-stats properly
  3. Redundancy in the long paired ends, which are lane 1 and lane 2.
  • The DNA was circularized with the 454 protocol and then sequenced on the Illumina instrument, so:
    • Some reads begin inside the circularization adaptor, leaving only one usable read in the pair
    • Some reads have only a few bases of DNA sequence before hitting the circularization adaptor
    • Most reads have at least 36bp from each end before hitting the adaptor
    • Many reads have no adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
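
The read categories above can be illustrated with a small classifier. This is a simplified sketch, not the method used here: it only finds full-length, exact copies of the circularization adaptor (either orientation), so it would miss partial adaptors at read ends and adaptors with sequencing errors. The function name, the category labels and the use of 36 bp as a cutoff are assumptions for illustration.

```python
ADAPTOR = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"
ADAPTOR_RC = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"
MIN_USABLE_BP = 36  # assumed cutoff, taken from the "at least 36bp" note


def classify_read(read: str) -> str:
    """Classify a read by where an exact adaptor copy starts."""
    # Collect start positions of exact matches in either orientation.
    hits = [p for p in (read.find(ADAPTOR), read.find(ADAPTOR_RC)) if p >= 0]
    if not hits:
        return "no_adaptor"         # nothing to trim
    start = min(hits)
    if start == 0:
        return "starts_in_adaptor"  # no usable sequence before the adaptor
    if start < MIN_USABLE_BP:
        return "short_prefix"       # only a few bases before the adaptor
    return "usable_prefix"          # at least MIN_USABLE_BP before the adaptor


print(classify_read("A" * 40 + ADAPTOR + "C" * 18))
```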

Trimming

  • Keep only the first 100bp of each read (the last 24bp are low quality anyway); otherwise gatekeeper fails with a "Seg fault"
  • Adaptor trimming:
    • Split data set in 1M read subsets
    • Align all subsets to adaptors
 nucmer -l 8 -c 16 -b 8 -g 8 adaptors.seq subset.seq -p subset
 delta-filter -l 16 -q subset.delta > subset.filter-q.delta
    • Identify CLV (vector clear range)
 ~/bin/delta2clr53.pl -5 5,3 -minLen 64 -maxLen 100 < subset.filter-q.delta > subset.clv
  • Location:
 /fs/szattic-asmg4/Bees/Bombus_impatiens
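
The first two trimming steps (keep the first 100bp, split into 1M read subsets) can be sketched as below. This is an illustration only, not the script used here: reads are assumed to be already parsed into (name, sequence) pairs, and the function names are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (read name, sequence); parsing is assumed done


def trim_reads(records: Iterable[Record], keep_bp: int = 100) -> Iterator[Record]:
    """Keep only the first keep_bp bases of every read."""
    for name, seq in records:
        yield name, seq[:keep_bp]


def split_into_subsets(records: Iterable[Record],
                       subset_size: int = 1_000_000) -> Iterator[List[Record]]:
    """Yield consecutive batches of at most subset_size reads."""
    it = iter(records)
    while True:
        batch = list(islice(it, subset_size))
        if not batch:
            return
        yield batch


# Toy input: five 124bp reads, split into subsets of two for the demo.
reads = [(f"read{i}", "A" * 124) for i in range(5)]
subsets = list(split_into_subsets(trim_reads(reads), subset_size=2))
print([len(s) for s in subsets])
```

Each subset would then be written to its own file and aligned to the adaptors with the nucmer command shown above.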

Assembly

  • Trimming
 No OBT
 adaptors in the seqs
  • Kmers
 meryl -Dh -s 0-mercounts/asm-C-ms22-cm1 >! 22mers.hist
 Found 3136399464 mers.
 Found 379123530 distinct mers.
 Found 201257394 unique mers.
 Largest mercount is 12006651; 90 mers are too big for histogram.
 countKmers.pl
 most frequent 42mer : CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA ~ 20% of the seqs : circularization adaptor
  • Overlapper
 #overlaps/read
 reads      min    q1     q2     q3     max        mean       n50        sum            
 62164168   0      0      4      12     324        11.42      38         709902310      


  • Unitigger : max utg len=852bp
  • Consensus after unitigger : 3 out of 129 jobs failed
  • Location
 /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA
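
The check in the Kmers step, finding the most frequent k-mer and the share of sequences that contain it, can be sketched as follows. What countKmers.pl actually does is not documented here, so this is an assumed equivalent on toy data with k=8 rather than k=42; the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def count_kmers(seqs: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurrence over all sequences."""
    counts: Counter = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def fraction_containing(seqs: List[str], kmer: str) -> float:
    """Return the fraction of sequences that contain kmer at least once."""
    return sum(kmer in seq for seq in seqs) / len(seqs)


# Toy reads: the first two share the 8-mer CGTAATAA, the third does not.
toy_reads = ["AAACGTAATAA", "CCCGTAATAAG", "TTTTTGGGGGC"]
top_kmer, top_count = count_kmers(toy_reads, k=8).most_common(1)[0]
print(top_kmer, top_count, fraction_containing(toy_reads, top_kmer))
```

A memory-bound counter like this only suits small samples; for the full data set a dedicated tool such as meryl is the practical choice.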