Bumblebee

Revision as of 05:38, 11 March 2010

Data

  • ~500 Mbp genome
  • Complete mitochondrion genomes:
 NC_011923.1    15468  14.67  Bombus hypocrita sapporoensis mitochondrion, complete genome
 NC_010967.1    16434  13.22  Bombus ignitus mitochondrion, complete genome
 The two share only 88% identity; no rearrangements, only SNPs and short indels

Traces

  • 7 pairs of data files (paired ends) : lanes 1..3,5..8 (lane 4 wasn't used)
 Lane   Insert            ReadLen    #Mates       Coverage    Comments
 1      3K(2..6,avg 4K)   124        34,944,099   14X         865,687(1.2%) reads have qual==0
 2      8K(7..9,avg 8K)   124        32,540,640   13X

 3      500(450..600)     124        34,745,750               # gDNA
 5      500                          34,601,239
 6      500                          34,553,857
 7      500                          34,682,612
 8      500                          12,975,839
  • Adaptors
 >circularization
 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 >circularization.revcomp
 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
 >5
 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
 >3
 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
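
The circularization adaptor is listed above in both orientations. A minimal Python sketch for checking such pairs follows; the function name revcomp is just an illustration and is not one of the scripts used on this page.

```python
# Reverse-complement helper (illustrative; not part of the pipeline on this page).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]

# Circularization adaptor as listed in the Adaptors block.
circ = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"
print(revcomp(circ))
```

Applied to the circularization adaptor, this gives the sequence listed under circularization.revcomp.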

Tasks to figure out

  1. Erroneous reads/bases, which we need to correct or discard
  2. GC bias, so we can compute a-stats properly
  3. Redundancy in the long paired ends, which are lane 1 and lane 2.
  • The DNA was circularized with the 454 protocol and then sequenced on the Illumina instrument.
    • Some reads begin in the circularization adaptor, so the pair has only one usable read
    • Some reads have only a few bases of DNA sequence before hitting the circularization adaptor
    • Most reads have at least 36bp from each end before hitting the adaptor
    • Many reads have no adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)

Trimming

  • Keep only the first 100bp (the last 24bp are low quality anyway); otherwise gatekeeper seg faults (?). Drawback: too much sequence is discarded
  • Quality trimming
  cat s_1_*_sequence.*.txt | ~/bin/fastq2clb.pl >  s_1_sequence.clb
  • Adaptor trimming: Align all subsets to adaptors
 >C
 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 >3
 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
 >5
 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
 id  len    gc%
 C   42     30.95  
 3   52     55.77  
 5   67     59.70  
 nucmer -l 8 -c 16 -b 8 -g 8 adaptors.seq s_1_1_sequence.00.seq -p s_1_1_sequence.00
 delta-filter -l 16 -q  s_1_1_sequence.00.delta >  s_1_1_sequence.00.filter-q.delta
 ...
 cat  s_1_*_sequence.*.filter-q.delta | ~/bin/delta2clr53.pl -5 5,3 -minLen 64 >  s_1_sequence.clv
 Adaptor positions
 .                    elem       min    q1     q2     q3     max        mean   %elem
 C.5                  25181056   0      2      34     68     108        38     36%
 C.3                  25181056   16     43     75     108    124        75

 5.5                  374742     0      0      0      0      108        3      0.53%
 5.3                  374742     17     36     36     67     124        46
 
 3.5                  143332     0      0      0      11     108        10     0.20%
 3.3                  143332     16     18     19     28     124        30
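
The id/len/gc% table and the %elem column above can be rechecked with a few lines of Python. This is a sketch only: it assumes %elem is the number of reads with an adaptor hit divided by the 69888198 reads in the "orig" row of the Stats table, which is a guess that happens to match the 36% figure.

```python
# Adaptor sequences as listed under "Adaptor trimming".
ADAPTORS = {
    "C": "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA",
    "3": "CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
    "5": "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG",
}

def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in seq."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

# Assumption: the denominator for %elem is the "orig" read count.
TOTAL_READS = 69888198
HITS = {"C": 25181056, "5": 374742, "3": 143332}

def percent_of_reads(n_hits: int, total: int = TOTAL_READS) -> float:
    """Share of reads with an adaptor hit, in percent."""
    return 100.0 * n_hits / total

for name, seq in ADAPTORS.items():
    print(f"{name}  {len(seq)}  {gc_percent(seq):.2f}")
for name, n in HITS.items():
    print(f"{name}  {percent_of_reads(n):.2f}%")
```

Under that assumption the percentages come out close to the table values (the table appears to truncate rather than round).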
  • Stats
 .                    elem       <=64       >64        min    q1     q2     q3     max        mean       n50        sum
 orig                 69888198   0          69888198   124    124    124    124    124        124        124        8666136552
 clq                  69888198   7724022    62164176   0      89     111    124    124        96.76      117        6762346722
 clv                  69888198   18607136   51281062   0      0      124    124    124        86.96      124        6077231064
 clr                  69888198   24677952   45210246   0      0      88     115    124        67.31      113        4704368689
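
For reference, a sketch of how the columns of the Stats table could be computed from a list of clear-range lengths. The exact definitions used by the original scripts are not shown on this page; here n50 is assumed to be the length L such that reads of length >= L hold at least half of all bases, and only the median (q2) is computed among the quartiles.

```python
from statistics import median

def n50(lengths):
    """Length L such that items of length >= L hold at least half the total (assumed definition)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def summarize(lengths, cutoff=64):
    """Summary columns for a list of read lengths; cutoff mirrors the <=64 / >64 split."""
    total = sum(lengths)
    return {
        "elem": len(lengths),
        "<=cutoff": sum(1 for x in lengths if x <= cutoff),
        ">cutoff": sum(1 for x in lengths if x > cutoff),
        "min": min(lengths),
        "q2": median(lengths),
        "max": max(lengths),
        "mean": round(total / len(lengths), 2),
        "n50": n50(lengths),
        "sum": total,
    }

# Toy input, not real data: six clear-range lengths.
toy = [0, 0, 88, 115, 124, 124]
print(summarize(toy))
```

As a sanity check on the "orig" row: 69888198 untrimmed reads of 124bp give 69888198 x 124 = 8666136552 bases.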
  • Other frequent kmers
 ACGTTATAACGTATTACGTTATATGG -> revcomp -> CCATATAACGTAATACGTTATAACGT
  • Location:
 /fs/szattic-asmg4/Bees/Bombus_impatiens

D Kelly's trimming

 438088072 total reads
 109166398 reads were thrown away
 148886138 reads were corrected and/or trimmed (to a min length of 30 bp)
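
A quick arithmetic check of these counts against the lane table under Traces, as a Python sketch. It assumes each mate pair contributes two reads and that all seven lanes went into this trimming run.

```python
# #Mates per lane, copied from the Traces table.
MATES = {
    1: 34944099, 2: 32540640, 3: 34745750, 5: 34601239,
    6: 34553857, 7: 34682612, 8: 12975839,
}

# Assumption: two reads per mate pair, all lanes included.
total_reads = 2 * sum(MATES.values())

discarded = 109166398   # reads thrown away
corrected = 148886138   # reads corrected and/or trimmed

print(total_reads)
print(f"discarded: {100 * discarded / total_reads:.1f}%")
print(f"corrected/trimmed: {100 * corrected / total_reads:.1f}%")
```

The total reproduces the 438088072 reads reported above; roughly a quarter of the reads were thrown away and about a third were corrected and/or trimmed.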

Assembly

  • Trimming
 No OBT (overlap-based trimming)
 adaptors in the seqs
  • Kmers
 meryl -Dh -s 0-mercounts/asm-C-ms22-cm1 >! 22mers.hist
 Found 3136399464 mers.
 Found 379123530 distinct mers.
 Found 201257394 unique mers.
 Largest mercount is 12006651; 90 mers are too big for histogram.
 countKmers.pl
 most frequent 42mer: CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA (circularization adaptor), in ~20% of the seqs
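
How countKmers.pl works is not shown here. The following is only a toy Python sketch of the idea (count every k-mer across the reads, then report the most frequent one); at the scale of this dataset a dedicated counter such as meryl is the practical choice. The reads below are made up for illustration.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (forward strand only) over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def fraction_containing(reads, kmer):
    """Fraction of reads that contain kmer at least once."""
    return sum(kmer in read for read in reads) / len(reads)

ADAPTOR = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"

# Hypothetical toy reads: two carry the adaptor, one does not.
reads = [
    "ACGTACGTAC" + ADAPTOR,
    ADAPTOR + "GGGTTTCCCA",
    "GATTACAGGCTCCATGAAGTCTTGCCAAGTGCTAGGACTTCAGA",
]

counts = count_kmers(reads, k=42)
top_kmer, top_count = counts.most_common(1)[0]
print(top_kmer, top_count, fraction_containing(reads, top_kmer))
```

In this toy set the adaptor is the only 42mer seen more than once, which is the same signature as the ~20% figure above, just at a much smaller scale.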
  • Overlapper
 #overlaps/read
 reads      0count   min    q1     q2     q3     max        mean       n50        sum            
 62164168   21589472 0      0      4      12     324        11.42      38         709902310      


  • Unitigger : max utg len=852bp
  • Consensus after unitigger : 3 out of 129 jobs failed
  • Location
 /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA