Brugia malayi: Difference between revisions

* [http://www.sciencemag.org/cgi/content/full/317/5845/1756 Science 2007]
* [http://www.sciencemag.org/cgi/data/317/5845/1756/DC1/1 Science 2007 Supplementary data]
* [[Media:Pnas00307-0278.pdf|HhaI repeat paper]] PNAS 1985
  ''the copy number of the repeat in B. malayi was found to be about 30,000. The 320-base-pair Hha I repeated sequences are arranged in direct tandem arrays and comprise about 12% of the genome.''
  Brugia malayi Hha I-repeat family element
  >gi|156092|gb|M12691.1|BRPRSHA Brugia malayi Hha I-repeat family element
  GCGCATAAATTCATCAGCAAAATTAATAAAACTTTCAATTAATCATGATTTTAATTGAATGTAAGAATTT
  AAATTAAATTTAAATTCAAATTTAAATTTTTAATTTTTTAAAAATTTTAAAATTTGTTATAGTTTTCCTT
  CATTAGACAAGGATATTGGTTCTAATTTATCAATTTTAATTCTAATTAAGTGCCAAAACTACTAAAAAAA
  GCTTATTTTGAAATTAATTGACTACGTTAGCTGCATTGTACCAGTGCTGGTCGTGTATTGTGTTGTCATT
  TTATAGTTTAAATATTAAAATACGCTTTTGTAATTAAGTTTT
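
The figures quoted from the PNAS 1985 paper can be cross-checked with simple arithmetic: about 30,000 copies of a 320 bp unit is about 9.6 Mb of repeat, and if that is about 12% of the genome, the implied genome size is about 80 Mb. A minimal sketch of that check follows; the function names are illustrative and not part of any pipeline used on this page.

```python
# Figures quoted from the PNAS 1985 abstract above (all approximate).
COPY_NUMBER = 30_000      # HhaI repeat copies
UNIT_LEN_BP = 320         # repeat unit length in bp
GENOME_FRACTION = 0.12    # "about 12% of the genome"


def repeat_total_bp(copy_number, unit_len_bp):
    """Total bases occupied by a repeat family."""
    return copy_number * unit_len_bp


def implied_genome_bp(total_repeat_bp, fraction):
    """Genome size implied if the repeats make up `fraction` of it."""
    return total_repeat_bp / fraction


total = repeat_total_bp(COPY_NUMBER, UNIT_LEN_BP)
print(f"repeat bases: {total:,}")
print(f"implied genome: {implied_genome_bp(total, GENOME_FRACTION) / 1e6:.0f} Mb")
```

The implied size is a rough consistency check only; both inputs are estimates from the 1985 paper.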


= Genome Info =


* 6 chromosomes: 1-5, XY
* 6 chromosomes: 1-5, XY ; diploid genome ~ 110M bp
* ~ 90M, 30% GC, 32% coding, 15% repeats
* 30% GC,
 
* 32% coding, 15% repeats
= Other sequences =
 
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=Retrieve&dopt=Overview&list_uids=16713 mitochondrion] finished: 13,657 bp; 24% GC
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=Retrieve&dopt=Overview&list_uids=630 Wolbachia endosymbiont strain TRS from Brugia malayi strain wMel] complete:  1,080,084 bp; 34%GC (New England Biolabs)
* Wolbachia endosymbiont strain wMel progress (TIGR)
* Rodent: some trace contamination; Example: Mus musculus is ~40%GC


= Genome Project =


The B. malayi genome project has been completed by The Institute for Genomic Research. Whole Genome Shotgun sequencing was used to obtain more than eight-fold coverage of the genome. The complete genome was assembled into approximately 8200 scaffolds and deposited in GenBank. The accession for the WGS project is AAQA00000000 and consists of sequences AAQA01000001-AAQA01029808.  
File location:
  /fs/szasmg3/dpuiu/Brugia_malayi/Data/Bm.fasta
  ctgs              min    q1    q2    q3    max        mean      n50        sum           
  <span style="background:yellow">26,879            200    836    1005  1495  611,244    3241.17    18986      87,119,350</span> 
* [http://www.tigr.org/tdb/e2k1/bma1/intro.shtml TIGR Genome project] (TRS strain)
* [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=154235816 NCBI AAQA00000000] AAQA01000001-AAQA01029808
  * 26,879 good ctgs
  * 2,929 jird contaminants (Example: AAQA01001321 : mouse 99%id hits)


  good ctg len
        #elem   min     max      mean    median  n50     sum
  all   26879   200     611244   3241    1005    19005   '''87119350'''
  10K+  1224    10036   611244   41018   23135   60727   50206329

  good ctg GC%
        #elem   min     max      mean         median  n50
  all   26878   0.00    72.30    '''28.86'''  28.56   29.46
  10K+  1224    24.38   38.44    '''30.38'''  30.43   30.62

  contaminant ctg len
        #elem   min     max      mean    median  n50     sum
  all   2929    200     8994     740     675     763     '''2167588'''

  contaminant ctg GC%
        #elem   min     max      mean         median  n50
  all   2929    18.09   75.96    '''44.1'''   43.59   44.80

* [http://ghedin.lab.dept-med.pitt.edu/GhedinLab/Parasite%20Genomics Univ. of Pittsburgh]

= Contamination =

* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=Retrieve&dopt=Overview&list_uids=16713 mitochondrion] finished: 13,657 bp; <span style="background:yellow">24% GC</span>
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=Retrieve&dopt=Overview&list_uids=630 Wolbachia endosymbiont strain TRS from Brugia malayi strain wMel] 1.08 Mbp <span style="background:yellow">34% GC </span>
** complete:  New England Biolabs
** progress: TIGR
* Rodent: some trace contamination; <span style="background:yellow"> 44%GC </span>
** [ftp://ftp.ncbi.nih.gov/genomes/M_musculus/Assembled_chromosomes/ M_musculus]
   /fs/szasmg3/dpuiu/Brugia_malayi/Data/contam.fasta
   contaminants      min   q1    q2    q3     max       mean       n50       sum          
   2,929              200    527    675    820   8994      740.04    762        2,167,588
* pUC19c vector: 2686bp, <span style="background:yellow">50.63% GC</span>

== Data ==

   1.26M Sanger reads (original TA) :      medLen=773bp; medGC=32.57%
   1.26M Sanger reads (contamination free): medLen=771bp; medGC=32.36%
   3.21M 454 reads (original sff) :        medLen=274bp
   3.29M 454 reads (linker free) :         medLen=247bp; medGC=36.39%


== Traces ==
=== Original Traces ===


  Libraries:
* 1.26M Sanger reads & 15 Libraries:
    * 2K : bulk of the sequence @TIGR
    * 15-20 K @TIGR
    * 8,000 BAC clones @Children's Hospital Oakland Research Institute.  (!!! no NCBI TA submission)
* [http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?&cmd=retrieve&val=SPECIES_CODE%20%3D%20%22BRUGIA%20MALAYI%22&retrieve=Submit NCBI TA]
* [ftp://ftp.ncbi.nih.gov/pub/TraceDB/brugia_malayi/ NCBI TA FTP]


Trace summary:
  SEQ_LIB_ID      INSERT_SIZE  INSERT_STDEV  TRACE_TYPE_CODE  #TRACES
  1047113828118   1000         300           WGS              13,500
  1047113856575   1000         300           PRIMERWALK       325
  1047111632737   1258         377           PRIMERWALK       3
  1047111632737   1258         377           WGS              305,906
  1047111540304   1415         424           WGS              51,772
  1047112577106   1415         424           WGS              337,789
  1047111718946   3123         936           WGS              47,597
  1047113358719   3123         936           PRIMERWALK       173
  1047113358719   3123         936           WGS              246,185
  1047174912885   3123         936           TRANSPOSON       1,437
  1047113570927   6000         1800          WGS              3,193
  1047111814561   7158         2147          WGS              219,306
  1047111480027   17168        5150          WGS              4,087
  1047111488095   17168        5150          WGS              3,434
  1047111495007   17168        5150          WGS              3,716
  1047111501919   17168        5150          WGS              3,697
  1047111480605   22419        6725          WGS              4,638
  1047111516154   22419        6725          WGS              4,004
  1047111523212   22419        6725          WGS              3,766
  1047111530126   22419        6725          WGS              5,686
  1047113855421   23000        6900          WGS              1
  total                                                       <span style="background:yellow">1,260,215</span>

   * all:          1,260,215
   * TRACE_TYPE_CODE
     * WGS:        1,258,277
     * TRANSPOSON:     1,437
     * PRIMER_WALK:      501
   * CENTER NAME
     * TIGR:         856,624
     * JCVI:         403,591
   * NO BACS !!!; max INSERT_SIZE=23K

* TI's: 1172642810, ... ,1174845185
* SEQ_LIB_ID's : 1047111480027, ... , 1047174912885


FRG file:  
FRG file: (contaminant free)
* FRG.src : same as TI's above
* FRG.src : TI's  
* FRG.acc: 2 ..
* DST.acc: 1260217, ... , 1260234
* Location
  /fs/szasmg3/dpuiu/Brugia_malayi/Data/nucmer_seq/Bm-all.frg
  DST    15
  FRG    1178192
  LKG    530930
 
          seqs        min    q1    q2    q3    max        mean      n50        sum           
  len    1,178,192    65    645    <span style="background:yellow">771</span>    850    1214      724        800        853,847,771  => 8X
  gc%    1,178,192    0.00  29    <span style="background:yellow">32.36</span>  35    100        32.41      33        .
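
The "=> 8X" annotation is fold coverage: total read bases divided by genome size. The genome size behind that figure is not stated here, so the sketch below takes it as a parameter; the 110 Mb value is the diploid estimate from Genome Info and is used only as an assumption.

```python
def fold_coverage(total_read_bp, genome_size_bp):
    """Average depth: total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_read_bp / genome_size_bp


# Sum of read lengths from the 'len' row of the table above.
SANGER_READ_BP = 853_847_771

# Assumed genome size (hypothetical choice; the page lists ~90M and ~110M).
ASSUMED_GENOME_BP = 110_000_000

print(f"{fold_coverage(SANGER_READ_BP, ASSUMED_GENOME_BP):.1f}X")
```

A smaller assumed genome size gives a proportionally higher coverage, which is why the estimate should always be quoted together with the size used.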


Problems:
* All library insert sizes are underestimated
* All library insert sizes are underestimated ???
* The contaminant reads align at ~91-93% id to the contaminant ctgs, while the Mt/We reads align at 99% id to the Mt/We finished sequences. What %id threshold should be used for contaminants?
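
One way to explore the threshold question is to flag reads by both percent identity and the fraction of the read that aligns, then compare the flagged sets at different cutoffs. The sketch below is illustrative only: the `Hit` record is a hypothetical in-memory format (not nucmer's own output), and the default cutoffs of 90% identity and 90% of read length are assumptions, not values chosen on this page.

```python
from typing import Iterable, NamedTuple, Set


class Hit(NamedTuple):
    """One read-vs-reference alignment (hypothetical record layout)."""
    read_id: str
    pct_identity: float   # percent identity of the alignment
    aligned_len: int      # aligned bases on the read
    read_len: int         # full read length


def contaminated_reads(hits: Iterable[Hit],
                       min_identity: float = 90.0,
                       min_read_fraction: float = 0.9) -> Set[str]:
    """Return ids of reads with at least one hit passing both cutoffs."""
    flagged = set()
    for hit in hits:
        if hit.read_len <= 0:
            continue  # skip malformed records
        covered = hit.aligned_len / hit.read_len
        if hit.pct_identity >= min_identity and covered >= min_read_fraction:
            flagged.add(hit.read_id)
    return flagged


example_hits = [
    Hit("r1", 99.0, 760, 771),   # Mt/We-like: high identity, nearly full length
    Hit("r2", 92.0, 700, 771),   # contaminant-like: ~92% identity
    Hit("r3", 92.0, 300, 771),   # partial alignment only
    Hit("r4", 85.0, 771, 771),   # full length but low identity
]
print(sorted(contaminated_reads(example_hits)))
print(sorted(contaminated_reads(example_hits, min_identity=98.0)))
```

Comparing the sets returned at, say, 90% and 98% identity shows how many reads the stricter cutoff would leave in the assembly input.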
=== BACS ===
8,000 BAC clones @Children's Hospital Oakland Research Institute.  (!!! no NCBI TA submission)
===  PITT FTP data ===
* 3.21M 454 reads ; about 13% are mated
* 3K  insert flx libraries (estimated to 2K based on alignment to the existing assembly)
* 20K insert tit libraries (estimated to 28K ...)
CBCB Location:
  /fs/szattic-asmg4/brugia_malayi/Data/
  /fs/szattic-asmg4/brugia_malayi/Data/Sff/  # Sff files
  /fs/szattic-asmg4/brugia_malayi/Data/Frg/  # Frg files
  /fs/szattic-asmg4/brugia_malayi/Data/Seq/  # Seq files
FTP access:
  lftp -u bma 136.142.191.201
  pass: 6279
  user: bma
  # empty as of --[[User:Dpuiu|Dpuiu]] 12:04, 8 January 2010 (EST)
Elodie's table:
  /scratch1/brugia_malayi/brugia-sequencing-summary.txt.csv
  #    elodie's date  protocol  platform  type            description                                run_name                                                                                                  Reads    Mates
  1    01/17/2008    WGS      Standard  Full run (2/2)  Mix of worms (calibration of the machine)  R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample                                              534822  0
  2    07/01/2008    3Kb      Standard  Full            single worm (pUC contamination)            R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1                                      492575  84341
  3    09/11/2008    3kb      Standard  4/8 wells      single worm (pUC contamination)            R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1                          263421  49258
  4    10/01/2008    3Kb      Standard  Full            Mix of worms (still pUC contamination)    R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest                            59711    5096 
  5    02/01/2009    WGS      Standard  1/4 wells      Mix of worms; regions 2 & 3 were myxoma    R_2009_02_27_16_11_34_FLX10070260_adminrig_022709_GHEDIN                                                  18025    0
  6    04/06/2009    WGS      Standard  1/4 wells      Mix of worms; with comp. bio run          R_2009_04_15_14_46_56_FLX10070260_adminrig_041509_GHEDIN_r1-WGS1_r2-LMW4_r3-pool2compbio_r4-pool3compbio  118490  0
  7    05/01/2009    20Kb      Titanium  7/8 wells      Mix of worms                              R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1                              631287  213524
  8    10/28/2009    20Kb      Titanium  Full            Mix of worms                              R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2                                      1095713  377547
  9    11/17/2009    3 Kb      Titanium  Full            Mix of worms                              111209_Brugia_3kb.zip                                                                                    868928  ?
  .    Total                                                                                                                                                                                                  3411635  ?
* 22 Sff files:
      run                                                                                                                                                                    sffReads  linker
  1    R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample/D_2008_01_31_18_01_35_FLX10070260_adminrig_FullAnalysis/sff/E4RA0X101.sff                                    272923  .
  1    R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample/D_2008_01_31_18_01_35_FLX10070260_adminrig_FullAnalysis/sff/E4RA0X102.sff                                    261899  .
 
  2    R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1/D_2009_02_12_22_12_04_j_SignalProcessing/sff/FEZH5RS01.sff                                          228204  flx
  2    R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1/D_2009_02_12_22_12_04_j_SignalProcessing/sff/FEZH5RS02.sff                                          264371  flx
  3    R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T02.sff                                                                            86862  flx
  3    R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T03.sff                                                                            87488  flx
  3    R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T04.sff                                                                            89071  flx
  4    R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM01.sff                                                                              13695  flx
  4    R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM02.sff                                                                              14197  flx
  4    R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM03.sff                                                                              15515  flx
  4    R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM04.sff                                                                              16304  flx
  5    R_2009_02_27_16_11_34_FLX10070260_adminrig_022709_GHEDIN/FRLDXKV01.sff                                                                                                    18025  .
  6    R_2009_04_15_14_46_56_FLX10070260_adminrig_041509_GHEDIN_r1-WGS1_r2-LMW4_r3-pool2compbio_r4-pool3compbio/D_2009_04_16_14_19_21_morty_fullProcessing/FT9KOI001.sff        118490  .
  7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY01.sff                          73807  tit
  7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY02.sff                          91698  tit
  7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY03.sff                          93878  tit
  7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY04.sff                          90232  tit
  7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY05.sff                          97065  tit
  7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY06.sff                          94326  tit
  7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY07.sff                          90281  tit
  8    R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2/F4H5CMB01.sff                                                                                        551263  tit
  8    R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2/F4H5CMB02.sff                                                                                        544450  tit
  9    111209_Brugia_3kb_1.sff                                                                                                                                                  365339  tit
  9    111209_Brugia_3kb_2.sff                                                                                                                                                  503589  tit
 
      total                                                                                                                                                                    <span style="background:yellow">3,214,044</span>  #without run 9
      total                                                                                                                                                                    <span style="background:yellow">4,082,972</span>  #all runs
* 22 Frg Libraries
. lib      meanIns(orig)  meanIns(est)    #reads      #mates  linker medLen  medGC
1 E4RA0X101    0          0              271066      0      .      250    37.04
1 E4RA0X102    0          0              260166      0      .      249    37.06
2 FEZH5RS01    3000        2000            181035      18676  flx    228    37.89
2 FEZH5RS02    3000        2000            211064      22270  flx    227    37.79
3 FHAVB5T02    3000        2000            68708        7850    flx    244    38.35
3 FHAVB5T03    3000        2000            69306        8227    flx    243    37.98
3 FHAVB5T04    3000        2000            70028        8353    flx    243    38.10
4 FIOXLOM01    3000        0              10921        0      flx    102    43.93  # no mates , shorter read length, highest GC !!!
4 FIOXLOM02    3000        0              11157        0      flx    103    43.75  # no mates , shorter read length, highest GC !!!
4 FIOXLOM03    3000        0              12197        0      flx    103    43.51  # no mates , shorter read length, highest GC !!!
4 FIOXLOM04    3000        0              12727        0      flx    103    43.56  # no mates , shorter read length, highest GC !!!
5 FRLDXKV01    0          0              17349        0      .      255    36.02
6 FT9KOI001    0          0              108826      0      .      256    36.10
7 FW1OXFY01    20000      28000          86127        15825  tit    276    35.40
7 FW1OXFY02    20000      28000          106911      19668  tit    275    35.37
7 FW1OXFY03    20000      28000          109874      20396  tit    271    35.23
7 FW1OXFY04    20000      28000          104797      18933  tit    269    35.23
7 FW1OXFY05    20000      28000          113716      20649  tit    265    35.09
7 FW1OXFY06    20000      28000          110693      20326  tit    270    35.05
7 FW1OXFY07    20000      28000          105931      19176  tit    271    35.18
8 F4H5CMB01    20000      28000          626046      109918  tit    241    36.08
8 F4H5CMB02    20000      28000          628432      118903  tit    256    35.87
9 111209_Brugia_3kb_1 3000 .              453449      100097  tit    199    31.19
9 111209_Brugia_3kb_2 3000 .              655457      165137  tit    214    30.93
. total        .          .              <span style="background:yellow">3,297,077    429,170</span>  #without run 9
. total        .          .              <span style="background:yellow">4,405,983    694,404</span>  #all runs
* Sff seqs clr (good qual)
  .        seqs        min    q1    q2    q3    max        mean      n50        sum           
  all      3,214,044    0      240    274    383    2042      294        326        947,254,956 => 9.4X   
* Frg seqs clr  (good qual , no linker)
            seqs        min    q1    q2    q3    max        mean      n50        sum           
  all      3,297,077    3      156    248    301    2043      244        275        806,091,347 => 8X
  mated      858,340    64    107    156    223    612        171        201        147,070,298     
  unmated  2,438,737    2      207    261    335    2042      268        286        655,723,972
* Frg seqs GC%
            seqs        min    q1    q2    q3    max        mean      n50        sum           
  all      3,297,077    0.00  29.25  36.39  44.54  86.76      36.86      39        .
  mated      858,340    0.00  28.48  34.35  40.65  78.75      34.74      36        .
  unmated  2,438,737    0.00  29.61  37.36  45.86  86.76      37.60      41        .
* Locations:
  /fs/szattic-asmg4/brugia_malayi/Data/Sff/
  /fs/szattic-asmg4/brugia_malayi/Data/Frg/
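
The medGC values in the tables above are medians of per-read GC percentages. A minimal sketch of that calculation follows, assuming GC% is computed over unambiguous A/C/G/T bases only (the page does not say how N's were handled); `gc_percent` and `median_gc` are illustrative names.

```python
from statistics import median


def gc_percent(seq):
    """GC percentage over A/C/G/T bases; 0.0 if the read has none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        return 0.0
    return 100.0 * gc / acgt


def median_gc(reads):
    """Median of the per-read GC percentages."""
    return median(gc_percent(r) for r in reads)


reads = ["GCGCATAT", "AAAATTTT", "GGCC"]
print([gc_percent(r) for r in reads])
print(median_gc(reads))
```

Taking the median per read, rather than pooling all bases, keeps a few very long or very GC-rich reads from dominating the summary.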
=== Contaminant & high copy repeats ===
* nucmer -maxmatch, run with either "-l 20 -c 65" or "-l 12 -c 24"
                        Sanger    454
  jird(2,929 ctgs)     31,501    197,420    # we'd probably find more contaminated reads if we align all the reads to the whole mouse genome ??
  Mt                    1,507    2,634    # 98% avg identity, 92% of read length
  We                    49,014    23,249    # 98% avg identity, 92% of read length
 
  UniVec                ?        661,586
  pUC19                134      562,107  # 99% avg identity, 99% of read length
  HhaI(~320bp)          16,336    69,400    # 90% avg identity, 63% of read length ; 
                                            # 29,021 out of 69,400  454 reads align 2+ times => tandem repeat
                                            # 11,882 out of 16,272  454 mated reads that align have both mates aligned => 30K+ repeats         
  mRNA(264bp)          20,504    59,706    # 80% avg identity, 65% of read length
  total                ~119,000  ~910,000
* List of contaminated reads:
  /nfshomes/dpuiu/Brugia_malayi/Data/nucmer_sanger/problems.qry_hits  #  118,996  Sanger reads
  /nfshomes/dpuiu/Brugia_malayi/Data/nucmer_454/problems.qry_hits    #  914,525  454 reads
  /nfshomes/dpuiu/Brugia_malayi/Data/problems.qry_hits                # 1,033,521  Sanger+454 reads
* Other possible contaminants: Schistosoma
''In my latest Brugia assembly, I looked for contigs/degenerates that were exclusively Sanger reads, thinking they might be jird contaminants. I came across a degenerate, deg1596341, with 417 reads, all Sanger, and only 1235bp long.  When I BLAST it against NCBI, the best hit (entire length, 99% identity) is to Schistosoma--then poorer hits to 28s rRNAs.  It has lots of mate pairs to another degenerate, which matches Schistosoma just as well and in the right position, but that degenerate has some 454 reads.'' (Art)
  >deg1596337                                                             
  ATTAGACAGTCGGATTCCCCGAGTCCGTGCCAGTTCTAAGTTGACTGTTTAACGCCGGCCGAAATATCAA   
  ATAAAACATTTACTTTTTTAAAAAAAAAAATAAAAAAATAAATGTTGATATGCAGCTATAACGGTCCATA   
  AGACAGTTCGAACACTAGCCGAGTTTCATCAAAATGAATACATTTTTTTTTTTTAATGTTTTCATTTTAA   
  TGTTACACTGCATGGATCAAACCGTACTCACTTCACATTACAGCCCGACCGGCCCAGTCCTTAGAGCCAA   
  TCCTTATCCCGAAGTTACGGATCTAATTTGCCGACTTCCCTTACCTACATTATTCTATCGACTAGAGGCT   
  GTTCACCTTGGAGACCTGCTGCGGATATGGGTACGATCTGGCACGAAATTCAAATAGCTTCCCTCGGATT   
  TTCATGGATCGAACAAAGCGCACGAGACACCACAGGAACCGTGGCGCTTTACGGAAACAACATCCCTATC   
  TCCGGCTGAACCGATTCCAGGGAGTCCGTTCCTTAACCAGAAAAGAGAACTCTGGCTCGGGCTTTCCTCA   
  ATGTTTCCGAGTTCATTTGCGTTACCGCGCTAAATTCTCACGATGAGCATTTATCTCCGTGTCCAGGTAC   
  GGGAATATTAACCCGTTTCCCTTTCGATTTATCAGATGGATTACACCTCCATTCCTCTATTTTATTTTAA   
  AAAACGGCACTAGCCAATATCTTAGGATCGACTGACCCACATTCAACTGCTGTTCACGTGGAACCCTTCT   
  CCACTTCAGTCTTCAAGGATCTCACTTGAATATTTGCTACTACCACCAAGATCTGCACCAATGGAAGCTT   
  CAACCGGGCCTACGCCCAAAGTCTTCAACGCTAACCATTGCGACCCTCTTACTCGTTGCGGCCAGATTTC   
  CCAAAAAAAAAAAAACACAAGCCATGCAACGGTTGAGTATAAGTCTCCCGCTCAAGCGCCATCCATTTTC   
  AGGGCTAGTTGATTTGGCAGGTGAGTTGTTACACACTCCTTAGCGGTTTCCAACTTCCATGGCCACCGTC   
  CTGCTGTCTATATCAACCAACGCCTTTCATGGGGTCTCATGAGCGGAAAGTTTGGCACTTTAACTCAACG   
  TTTGGTTCATCCCACAGCGCCAGTTCTGCTTACCAAAAATGGCCCACTTGGAGCACACATTCAATGTCTA   
  TGCTTCATAAAAAATTTAAGCAAGCAAGACGTCATACTCATTGAAAGTTTGAGAATAGGTTGAAGAC     
  >deg1596341                                                             
  CCAATTATACCAAAGATAATCTTTACTTTCATTATGCTTTTTATCTTTTAAATTAGGTTTACTACCCAAT   
  AACTTGCGTATATGCTAGACTCCTTGGTCCGTGTTTCAAGACGGGTCAGATAGGTGATTAACGTTCACAT   
  CGAGATGTAACTTTATTGCATACAATATTATAATATTACCAATTATTTTTACCGATAAAGTCGCATGCGA   
  CCACATGTAAAATAATAATAAGCAAAATTATAATCGATACATGTCACTATTATTTCAAGTGAAAGTTACA   
  TATATGGGAAAAAAAAAAAAAACTTCATCTAAGACATATTTCAACATAATTTAGGATTCCAATTATCAAT   
  TGAAATAATTGGTCCACTAAATTAACTTGTATTAATATGCTAAAATGAAGTTCTCGATGCATACCATCGG   
  TAAATACACCAATCTATGCATATACTGCTAATTTAGCATTAATATCATTTTATTCATTAATAAAAAAAAA   
  AAAATTATTAATGAATAATGAAATGAATTATGATTGCTAAATTGATTGGTTGAATACCGATAAGTTTTGT   
  TAACTCTATCCGTTTCCATCTCAGCGGTTTCACGCCCTCTTGAACTCTCTCTTCAAAGTTCTTTGCAACT   
  TTCCCTCACGGTACTTGTTTGCTATCGGTCTCATGGTCGTATTTAGCCTTAGATGAGGTTTACCACCCTC   
  TTTGGGCTGCAATCTCAAACAACCCGACTCCAAGGAATAACCTACCGTAACTTTTTTCACCCGTACAGGT   
  CTAGCACCTTCTATGGACTGTAGCCCCGCTCAAGGGGACTTTGGGTGTAAAAATATGTTACGGATAGTTA   
  TACCTATACGCTACATTTCCATATAGCCATATAATGTCTATTGGATTCAGCGTTGGGCTTTTTCCTTTTC   
  ACTCGCCGTTACTAGGGAAATCCTCGTTAGTTTCTTTTCCTCCGCTTAGTTATATGCTTAAATTCAGCGG   
  GTAATCACGACTGAGTTGAGGTCAAAAAAAAAAAAAATGATATAAAACATATTGAAATTATCATTCATAT   
  ATATATGCTAATTTTTTACCTTATTTATTTGTTTATTTTAATGTTTCAAATAACTTGCATTTTAATTTGA   
  AACATTTAACAACAAAACAAACAAACAATAAAGTAAATCAATGCATAATAAATAAATAATTGTAATCTTT   
  CTTTATTATTTATTCATGAAAGATTACTTTTTAATATATATATAT
''... possible contamination at the library construction level. Schisto was being sequenced at the same time as Brugia at TIGR. Does this mean we should first filter all the Sanger reads against Schisto now that the Schisto genome is available?'' (Elodie)


== Assemblies ==


=== TIGR  ===
=== TIGR/NCBI ===


* 9X coverage, 856K Sanger traces =>  8,200 scaff & 29,808 ctg (avg. scaff=~10K & avg ctg=~3K)
* "scaffolds totaling ~71 Mb of data with a further ~17.5 Mb of contigs not integrated into any scaffold (orphan contigs)" (Science 2007)
* [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=154235816 NCBI AAQA00000000] AAQA01000001-AAQA01029808
  * 26,879 good ctgs
  * 2,929 jird contaminants (Example: AAQA01001321 : mouse 99%id hits)
* Stats
  .                      elem      min    q1    q2    q3    max        mean      n50        sum           
  ctg.len(good)          26879      200    836    1005  1495  611244*    3241.17    18986      87,119,350     
  ctg.len(contaminants)  2929      200    527    675    820    8994      740.04    762        2,167,588     
  .                      elem      min    q1    q2    q3    max        mean      n50        sum           
  ctg.gc%(good)          26878      0.00  24.77  28.56  32.27  72.30      28.86      29        .
  ctg.gc%(contaminants)  2929      18.09  39.16  <span style="background:yellow">43.59</span>  48.35  75.96      44.10      44        .
* Location
  /fs/szasmg3/dpuiu/Brugia_malayi/Assembly/TIGR/ <-> NCBI
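
The n50 column in the stats tables summarizes contiguity. A minimal sketch follows, assuming the common definition: sort the lengths longest first and report the length at which the running sum reaches half of the total. The tool that produced the tables on this page may use a slightly different convention, so small differences are possible.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


contig_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(contig_lengths))
```

Because N50 depends on the total, removing short contaminant contigs or applying a minimum-length cutoff (for example the 2K+ and 10K+ rows on this page) changes the value even when the long contigs are identical.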
=== PITT ===
* Date: 11/05/08
* Stats:
                      elem      min    q1    q2    q3    max        mean      n50        sum
  scf.len              3170      2000  2917  4483  14471  6534162*  22916      112914*    72,643,770 (66,051,795bp without gaps)
  scf.gc%              3170      15.70  25.53  28.17  30.95  66.60      28.46      28        .
* Location:
  /fs/szasmg3/dpuiu/Brugia_malayi/Assembly/PITT/
=== CBCB CA 5.1 Sanger ===
* Assembler: wgs 5.1
* Date: 2008/08/26
* Input: filtered Sanger reads
* better assembly than the published one
* repeat Hha appears in a few dozen contigs but not in tandem
* Stats: 
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  scf                  10317      935    1215  1538  3462  3890532    8018.85    41716      82,730,474     
  scf2K+              3656      2001  3181  5733  18904  3890532    20189.57  50293      73,813,083
  ctg                  12753      273    1245  1632  3873  376744    6113.39    24748      77,964,006     
  deg                  9661      65    858    949    1023  72494      1240.97    1008      11,988,997     
  singl                134119 (11.43%)
  reads                1178192(100%)
* Location:
  /fs/szasmg3/dpuiu/Brugia_malayi/Assembly/CBCB/2008_0826_CA/
=== CBCB CA 6.0 454 (failed) ===
* Assembler: wgs 6.0-beta
* Input: 3,297,077 454 sffToCA processed reads
* Locations:
  ginkgo:/scratch1/brugia_malayi/Assembly/454/CA.failed/ 
  /scratch1/ -> umiacsfs01:/xraid03
  ginkgo: 32 proc, 128G mem
  genome6.umd.edu:/genome6/raid/dpuiu/Brugia_malayi/Assembly/CA.bog/
  genome6: 32 proc, 256G mem
* Problem: high frequency contamination & repeats
  obtMerThreshold, ovlMerThreshold set on auto (default) !!!
  runCA estimated them to: (see runCA.log)
    Reset OBT mer threshold from auto to 37235.
    Reset OVL mer threshold from auto to 43186.
  => olap-from-seeds very memory/cpu intensive!!!
  Example: 6 jobs: each is 2 thread, ~ 20G mem
  merOverlapperSeedConcurrency=6 => 6 jobs
  merOverlapperExtendBatchSize=20000
  $ ps -C olap-from-seeds
  PID %MEM  RSZ(KB) %CPU STIME TIME    CMD
  13158  0.0 1132    0.0 10:21 00:00:00 /bin/sh 1-overlapper/olap-from-seeds.sh 90
  ...
  13163  0.0 1136    0.0 10:21 00:00:00 /bin/sh 1-overlapper/olap-from-seeds.sh 95
  13199 15.6 20675720 133 10:21 02:46:39 olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0092.frgcorr.WORKING -o 1-overlapper/olaps/0092.ovb.WORKING.gz asm.gkpStore 1820001 1840000
  ...
  13205 15.2 20139808 138 10:21 02:52:35 olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0091.frgcorr.WORKING -o 1-overlapper/olaps/0091.ovb.WORKING.gz asm.gkpStore 1800001 1820000
  $ vmstat
  procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
  r  b  swpd  free  buff  cache  si  so    bi    bo  in  cs us sy id wa st
  9  0 9534576 121356  10124 2743568    5  23    7    23  36  69 16  0 83  1  0
  $ free
              total      used      free    shared    buffers    cached
  Mem:    132168632  132043920    124712          0      10064    2934076
  -/+ buffers/cache:  129099780    3068852
  Swap:    67108856    8842720  58266136
* 1-overlapper             
  #overlaps/read
  0+              3297077
  0                1282138 !!! more than 1/3 of reads have no overlaps
  1+              2014939
  100+            561195
                  #seqs    #maxOvl
  100+            561196    99376
  UniVec          359340    99376
  pUC19            355392    99376
  contam          116857    97757
  Hha1            50474    29186
  mRNA            36296    12935
  other            46214    4720
* 8-consensus failed on utg
* 9-terminator =>  asm.asm.FAILED
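
The overlaps/read bins in the 1-overlapper table are cumulative counts over the per-read overlap totals ("0+" is all reads, "1+" is reads with at least one overlap). A minimal sketch of that binning follows, with the no-overlap fraction computed from the counts in the table; `overlap_bins` is an illustrative helper, not part of runCA.

```python
def overlap_bins(overlaps_per_read, high=100):
    """Summarize per-read overlap counts into the bins used in the table."""
    n_total = len(overlaps_per_read)
    n_zero = sum(1 for n in overlaps_per_read if n == 0)
    n_high = sum(1 for n in overlaps_per_read if n >= high)
    return {
        "0+": n_total,             # all reads
        "0": n_zero,               # reads with no overlaps
        "1+": n_total - n_zero,    # reads with at least one overlap
        f"{high}+": n_high,        # reads with very many overlaps
    }


# Counts copied from the 1-overlapper table.
ALL_READS = 3_297_077
NO_OVERLAP = 1_282_138

print(overlap_bins([0, 0, 1, 5, 150]))
print(f"{100 * NO_OVERLAP / ALL_READS:.1f}% of reads have no overlaps")
```

The reads in the high bin are the ones dominated by pUC19/UniVec, contaminant and HhaI hits in the table, which is consistent with the very large auto-estimated mer thresholds.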
=== CBCB CA 6.0 454 ===
* Input: 3,297,077 454 sffToCA processed reads
* obtMerThreshold=200, ovlMerThreshold=60
  .                        elem    min    q1    q2    q3    max      mean    n50    sum       
  scf.len                  1406    585    1077  1178  1416  23313    1583.79  1354    2226813   
  ctg.len                  1513    280    1074  1177  1404  4952    1309.21  1257    1980831   
  deg.len                  192840  63    173    262    368    5688    290.91  339    56098824   
  utg.len                  194394  63    174    263    373    5691    298.80  348    58084744   
  singl.len                1039561  64    162    249    277    1519    244.45  265    254119265 
  seq.len                  2618840  64    140    239    287    1519    234.61  268    614416134 
 
  .                        elem    min    q1    q2    q3    max      mean    n50    sum       
  scf.gc%                  1406    17.45  29.91  32.75  35.47  69.26    33.21    33      .         
  ctg.gc%                  1513    17.45  30.01  32.54  34.93  69.26    32.94    33      .         
  deg.gc%                  192840  4.08  27.62  33.33  40.25  80.99    34.17    35      .         
  utg.gc%                  194394  4.08  27.65  33.33  40.20  80.99    34.16    35      .         
  singl.gc%                1039561  0.00  34.62  <span style="background:yellow">43.11</span>  49.35  86.76    41.58    45      .         
  seq.gc%                  2618840  0.00  29.44  <span style="background:yellow">37.01</span>  45.48  86.76    37.36    40      .
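The length and GC% tables on this page all share the columns elem, min, q1, q2, q3, max, mean, n50, sum. The tool that produced them is not named here, so the following is only a minimal sketch of how such a row could be computed. The function name <code>summarize</code> and the "inclusive" quartile method are assumptions and may not match the original script exactly.

```python
from statistics import mean, quantiles


def summarize(values):
    """Return one table row: elem, min, q1, q2, q3, max, mean, n50, sum.

    Hypothetical helper; the quartile convention is an assumption.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to compute quartiles")
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    total = sum(ordered)

    # N50: walk from the largest value down until at least half of the
    # total is covered; the value reached at that point is the N50.
    running = 0
    n50 = None
    for value in reversed(ordered):
        running += value
        if 2 * running >= total:
            n50 = value
            break

    return {
        "elem": len(ordered),
        "min": ordered[0],
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "max": ordered[-1],
        "mean": mean(ordered),
        "n50": n50,
        "sum": total,
    }


if __name__ == "__main__":
    print(summarize([2, 3, 4, 5, 6]))
```

For GC% rows the "sum" column is not meaningful, which is presumably why the tables show "." there.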
=== CBCB CA 6.0 Sanger ===
* Input:  1,178,192 Sanger "clean" reads
* obtMerThreshold=200, ovlMerThreshold=60
  .                        elem    min    q1    q2    q3    max      mean    n50    sum       
  scf.len                  10148    904    1235  1578  3564  2324103  8619.28  45146  87468465 
  scf.len2K+              3701    2001  3123  6500  20644  2324103  21272.70 52830  78730278 
  ctg.len                  12659    274    1265  1658  3811  565900  6478.56  27474  82012118   
  deg.len                  8497    65    820    914    987    82776    1211.94  980    10297846   
  utg.len                  40545    64    943    1247  1751  306650  2459.44  4231    99718105   
  singl.len                89048    64    545    702    812    1181    661.17  753    58875872   
  seq.len                  1173341  64    716    827    915    1222    790.88  850    927974911 
 
  .                        elem    min    q1    q2    q3    max      mean    n50    sum       
  scf.gc%                  10148    13.21  25.02  28.34  32.07  65.75    29.11    29      .         
  ctg.gc%                  12659    13.21  24.86  28.22  31.70  65.75    28.78    28      .         
  deg.gc%                  8497    7.87  23.87  28.90  33.86  77.13    29.60    30      .         
  utg.gc%                  40545    1.39  25.26  29.02  32.72  77.13    29.29    30      .         
  singl.gc%                89048    0.00  31.87  <span style="background:yellow">38.97</span>  45.31  99.15    38.52    41      .         
  seq.gc%                  1173341  0.00  29.41  <span style="background:yellow">32.40</span>  35.18  99.15    32.46    33      .
* Rerun using
** all Sanger reads : did not improve the stats
** isNotRandom=1 for all libs : did not improve the stats
=== CBCB CA 6.0 Sanger+454 (Best so far) ===
* Input: 1,178,192 Sanger "clean" reads ; 3,297,077 454 sffToCA processed reads
* obtMerThreshold=400, ovlMerThreshold=120
  .                        elem    min    q1    q2    q3    max      mean    n50    sum       
  scf.len                  10254    283    1149  1472  2552  1789338  9181.77  108218  94149853
  scf.len2K+              3193    2000  2728  4513  14385  1789338  26585*  150310* 84,886,293*  (75,044,213bp without gaps) 
  ctg.len                  13607    66    1181  1578  3187  522243  6196.05  30726  84309668   
  deg.len                  114780  63    195    279    478    25405    367.61  478    42193843   
  utg.len                  157710  63    231    383    885    131968  901.05  1602    142104793 
  singl.len                904272  64    199    255    306    1690    282.94  278    255858016    # 831,547 454 + 72,725 Sanger
  seq.len                  3849841  64    202    281    662    1690    410.15  699    1579010968 
 
  .                        elem    min    q1    q2    q3    max      mean    n50    sum       
  scf.gc%                  10254    13.80  24.59  28.34  32.95  69.26    29.44    29      .         
  ctg.gc%                  13607    13.80  24.54  28.17  32.02  69.26    28.95    28      .         
  deg.gc%                  114780  3.75  29.39  <span style="background:yellow">38.22</span>  45.45  80.99    37.40    41      .         
  utg.gc%                  157710  0.00  27.63  <span style="background:yellow">34.59</span>  43.08  80.99    35.28    38      .         
  singl.gc%                904272  0.00  36.60  <span style="background:yellow">43.75</span>  49.49  99.33    42.44    46      .         
  seq.gc%                  3849841  0.00  29.34  <span style="background:yellow">34.26</span>  41.98  99.33    35.73    36      .         
* Location:
  ginkgo:/scratch1/brugia_malayi/Assembly/hybrid/CA/


* Rerun using bogBadMateDepth = 4 (default is 7) at Aleksey's advice; utgs are smaller; failed in cgw "scaffolder failed" message
* scaffolds:
  10254 : total
  904   : begin in surrogates
  1029  : end in surrogates


=== CBCB ===
* [[Bm.qc_combine]]
* [[Bm.lib_estimate]]


  [Scaffolds]
  TotalScaffolds=10317
  TotalContigsInScaffolds=12753
  MeanContigsPerScaffold=1.24
  MinContigsPerScaffold=1
  MaxContigsPerScaffold=53


  [Contigs]
  TotalContigsInScaffolds=12753
  TotalBasesInScaffolds=77964006
  TotalVarRecords=87058
  MeanContigLength=6113
  MinContigLength=273
  MaxContigLength=376744
  N50ContigBases=24748


  [Reads]
  TotalReadsInput=1178192
  TotalUsableReads=1173016
  AvgClearRange=791
  ContigReads=663383(56.55%)
  BigContigReads=544689(46.43%)
  SmallContigReads=118694(10.12%)
  DegenContigReads=124230(10.59%)
  SurrogateReads=295861(25.22%)
  PlacedSurrogateReads=44577(3.80%)
  SingletonReads=134119(11.43%)
  ChaffReads=134119(11.43%)


  [Coverage]
  ContigsOnly=6.86
  Contigs_Surrogates=9.47
  Contigs_Degens_Surrogates=9.33
  AllReads=11.91


=== CBCB newbler deNovo 454 (failed) ===
* Still running after 7 days (killed)
   Detangling alignments...
   -> Level 2, Phase 8, Round 1...
   PID  %MEM  RSZ %CPU STIME    TIME CMD
   4576  2.1 1427100 94.4 Feb16 7-03:26:12 /fs/szdevel/dpuiu/454/bin/runProject .
 
=== CBCB newbler deNovo 454 ===
 
* Filtered contaminants, high copy repeats
  deleted 1524401
  kept    1772677
 
* Input: 454 CA gkp dump
        elem      min    q1    q2    q3    max        mean      n50        sum           
  len    1772677    64    216    270    357    2044      294.13    312        521393273     
  gc%    1772677    0.00  28.60  34.15  40.70  83.90      34.83      36        .
 
* Output
  # ctg stats
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  Len                  38315      100    217    319    437    5560      346.60    402        13,280,032     
  GC%                  38315      0.00  27.69  32.27  37.33  73.73      32.83      33        .
 
  # scf stats
  .                    elem      min    q1    q2    q3    max        mean      n50        sum           
  Len                  69        2006  2157  2422  2795  9476      2932.87    2629      202368   
 
  # read counts
                        count    %
  All                  1772677 100
  Singleton            884948  50.01 
  Assembled            780126  44.08 
  PartiallyAssembled    52199  2.95 
  Outlier              33009  1.87 
  TooShort              11372  0.64 
  Repeat                8028    0.45 
 
  # read GC%
                        elem      min    q1    q2    q3    max        mean      n50        sum           
  Assembled            780126    0.00  26.47  31.74  37.20  79.35      32.12      33     
  Singleton            884948    0.00  30.17  36.76  43.87  86.76      36.96      39
  Singleton.Mapped      434267    0.00  27.22  32.69  38.91  81.43      33.31      34       
  Singleton.Unmapped    450681    3.37  34.65  40.75  46.74  86.76      40.47      42       
 
  # mate pair counts
                        count    %
  All                  178718 100
  Link                  72421  40.52 
  OneUnmapped          62534  34.99 
  BothUnmapped          42982  24.05 
  FalsePair            344    0.19 
   SameContig            344    0.19 
  MultiplyMapped        92    0.05
 
* Most assembled contigs and unmapped singletons appear to be contaminants (BLAST hits to human/mouse/rat) => more contamination remains after filtering
* Location
  ginkgo:/scratch1/brugia_malayi/Assembly/454/newbler.deNovo/
 
=== CBCB newbler refMapper 454 ===
 
*  Assembler: newbler 2.3
* Host: CBCB walnut server
* Input
  # NCBI ref assembly
                                      ctgs      min  q1  q2  q3  max    mean  n50  sum
  Len                                  26,879    200  836  1005 1495 611244 3241  18986 87,119,350
 
  #Sff reads
  .                                    seqs      min  q1  q2  q3  max  mean    n50  sum       
  Len                                  3,214,044  0    240  274  383  2042  294.72  326  947,254,956
 
* Output
  #Ctg stats
  .                                    ctgs      min  q1  q2  q3  max  mean    n50  sum       
  Len                                  101,286    100  236  323  530  7013  433.36  535  43,893,507
 
  #Trimmed read stats
  .                                    seqs      min  q1  q2  q3  max  mean    n50  sum       
  All                                  3,898,373  1    45  163  264  1995  167.84  265  654319111
  Full|Partial                        1,085,167  20  119  216  285  706  214.13  271  232364804
  Chimeric|Repeat|Unmapped|TooShort    2,015,920  20  111  221  276  1995  208.85  263  421032444
  Deleted                              797,286    1    1    1    1    19    1.16    1    921863
 
  #Trimmed read counts
              count    %
  All        3898373  100
  Chimeric    25460    0.65 
  Deleted    797286  20.45  !!!
  Full        1001745  25.7 
  Partial    83422    2.14 
  Repeat      406119  10.42  !!!
  TooShort    14031    0.36 
  Unmapped    1570310  40.28  !!!
 
  #Mate pair counts
                  count  %
  BothUnmapped    301390  42.86
  OneUnmapped    110922  15.77
  MultiplyMapped  108641  15.45
  FalsePair      106249  15.11
  TruePair        75992  10.81
 
* Ref ctgs partially assembled
  # len
                  ctgs      min    q1    q2    q3    max        mean      n50        sum
  all              26879      200    836    1005  1495  611244    3241      18986      87119350
  assembled        14627      206    881    1265  2778  611244    5081.46    27414      74326497     
  not_assembled    12252      200    812    920    1075  32555      1044.14    988        12792853     
 
  # gc%
  .                elem      min    q1    q2    q3    max        mean      n50        sum           
  all              26878      0.00  24.77  28.56  32.27  72.30      28.86      29        .
  assembled        14626      0.76  23.74  27.47 30.26  60.37      27.24      28        .
   not_assembled    12252      0.00  26.87  30.27  34.39  72.30      30.80      31        .
 
=== CBCB CA Sanger (Art's) ===
 
''My redo of the assembly using just original Sanger reads (after removing jird contaminant and doing some extra vector trimming) got the following:''
 
  TotalBasesInScaffolds 81,379,515
  N50ScaffoldBases          80,913  <<** wrt TBS=70676234
  MaxBasesInScaffolds    6,446,756
  IntraScaffoldGaps          2,758
  TotalContigsInScaffolds  12,564
  MaxContigSize            565,900
  N50ContigBases            36,160  <<** wrt TBS=70676234
 
''The read coverage of unitigs was very biased by GC content.  E.g., for unitigs with 23% GC, there averaged one read every 134bp, while for unitigs with 40% GC, there averaged one read every 23bp.  So I used these values to recompute the unitig astats (the astats indicate whether a unitig is likely a repeat or not).  This is a more principled way of doing the "boosting" that we did on the original assembly. The assembly changed to:''
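The GC-adjusted astat idea described above can be illustrated roughly as follows. This is not Art's actual code: the linear interpolation between the two quoted data points (23% GC: one read per 134bp; 40% GC: one read per 23bp), the clamping outside that range, and the function names are all assumptions. The A-statistic is written in its commonly cited form (expected reads minus ln(2) times observed reads), with unitig length standing in for rho.

```python
import math


def arrival_rate(gc_percent, anchors=((23.0, 1 / 134), (40.0, 1 / 23))):
    """Expected reads per bp for a unitig of the given GC%.

    Assumption: linear interpolation between the two observations quoted
    on this page, clamped outside the 23-40% range.
    """
    (gc_lo, rate_lo), (gc_hi, rate_hi) = anchors
    t = (gc_percent - gc_lo) / (gc_hi - gc_lo)
    t = min(1.0, max(0.0, t))
    return rate_lo + t * (rate_hi - rate_lo)


def a_stat(unitig_len, n_reads, rate):
    """A-statistic: expected read count minus ln(2) * observed read count.

    High values suggest unique sequence, strongly negative values suggest
    a collapsed repeat. unitig_len approximates rho here (an assumption).
    """
    return rate * unitig_len - math.log(2) * n_reads


if __name__ == "__main__":
    # The same 1340bp unitig with 10 reads scores very differently
    # depending on which arrival rate is assumed for its GC content.
    for gc in (23.0, 40.0):
        print(gc, round(a_stat(1340, 10, arrival_rate(gc)), 2))
```

With a single global arrival rate, low-GC unitigs look under-covered and high-GC unitigs look like repeats; making the rate depend on GC is one way to correct for that.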
 
  TotalBasesInScaffolds 79,851,223
  N50ScaffoldBases        100,938  <<** wrt TBS=70676234
  MaxBasesInScaffolds    6,435,383
  IntraScaffoldGaps          2,616
  TotalContigsInScaffolds  12,232
  MaxContigSize          1,356,278
   N50ContigBases            39,235  <<** wrt TBS=70676234
 
''Note that all assemblies above used only the original Sanger reads.  I next added the 3Kb and 20Kb paired 454 reads (with some extra linker trimming and removing duplicate mate-pairs). This reduced the size of unitigs (N50 fell from 9142 to 5297) indicating there are still some trimming issues.  The coverage bias is also less with the 454 reads and the astat-adjustments are less effective. The best assembly (based on N50 sizes) I have of these data calculated astats assuming a genome size of 70Mb:''
  TotalBasesInScaffolds 80,688,005
  N50ScaffoldBases        358,475  <<** wrt TBS=70676234
  MaxBasesInScaffolds    3,020,329
  IntraScaffoldGaps          4,039
  TotalContigsInScaffolds  13,894
  MaxContigSize            602,785
  N50ContigBases            43,796  <<** wrt TBS=70676234
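The "wrt TBS=70676234" annotations indicate that the N50 values above are measured against a fixed reference total rather than against each assembly's own size, which keeps the three assemblies comparable. A minimal sketch of that calculation, assuming a hypothetical helper <code>n50_wrt</code> (not the script that was actually used):

```python
def n50_wrt(lengths, reference_total):
    """N50 relative to a fixed total.

    Returns the length at which the cumulative sum, taken from the longest
    sequence down, first reaches half of reference_total, or None if the
    sequences never cover that much.
    """
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= reference_total:
            return length
    return None


if __name__ == "__main__":
    scaffolds = [6, 5, 4, 3, 2]  # toy lengths, not real data
    print(n50_wrt(scaffolds, sum(scaffolds)))  # ordinary N50
    print(n50_wrt(scaffolds, 30))              # against a larger fixed total
```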


== Files ==  
 
   * /fs/szattic/asmg1/adelcher/Genomes/Brugia            : Art's files
   * /fs/sztmpscratch/cole/tarchive_download/brugia_malay  : Cole's files
   * /fs/szasmg3/dpuiu/Brugia_malayi/                      : Daniela's files
  /scratch1/brugia_malayi/Data/                        : ftp PITT data
  /fs/szattic-asmg4/brugia_malayi                      : ftp PITT data (as well)

Latest revision as of 14:03, 30 March 2010

Articles

 the copy number of the repeat in B. malayi was found to be about 30,000. The 320-base-pair Hha I repeated sequences are arranged in direct tandem arrays and comprise about 12% of the genome.
 Brugia malayi Hha I-repeat family element
 >gi|156092|gb|M12691.1|BRPRSHA Brugia malayi Hha I-repeat family element
 GCGCATAAATTCATCAGCAAAATTAATAAAACTTTCAATTAATCATGATTTTAATTGAATGTAAGAATTT
 AAATTAAATTTAAATTCAAATTTAAATTTTTAATTTTTTAAAAATTTTAAAATTTGTTATAGTTTTCCTT
 CATTAGACAAGGATATTGGTTCTAATTTATCAATTTTAATTCTAATTAAGTGCCAAAACTACTAAAAAAA
 GCTTATTTTGAAATTAATTGACTACGTTAGCTGCATTGTACCAGTGCTGGTCGTGTATTGTGTTGTCATT
 TTATAGTTTAAATATTAAAATACGCTTTTGTAATTAAGTTTT
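As a quick consistency check on the figures quoted from the PNAS 1985 paper (about 30,000 copies, 320bp per copy, about 12% of the genome), the arithmetic can be worked through as below. The variable names are illustrative only; the implied genome size is simply what those three numbers give together and is not a value stated in the paper excerpt.

```python
copy_number = 30_000       # HhaI repeat copies, from the quoted text
repeat_len_bp = 320        # length of one repeat unit
genome_percent = 12        # share of the genome attributed to the repeat

repeat_bases = copy_number * repeat_len_bp
implied_genome_bp = repeat_bases * 100 // genome_percent

print(repeat_bases)        # total bases in HhaI repeats
print(implied_genome_bp)   # genome size implied by the 12% figure
```

This gives 9.6 Mb of repeat sequence and an implied genome of about 80 Mb, which is in the same range as the assembly totals reported elsewhere on this page.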

Genome Info

  • 6 chromosomes: 1-5, XY ; diploid genome ~ 110M bp
  • 30% GC,
  • 32% coding, 15% repeats

Genome Project

Brugia malayi has a diploid genome of approximately 110 Mb, organized in 6 pairs of chromosomes (five pairs of autosomes and one pair of sex chromosomes). In addition to the nuclear genome, B. malayi has a mitochondrial genome of about 14 kb, and the genome of the harbored bacterial endosymbiont Wolbachia sp. (1-2 Mb).

The B. malayi genome project has been completed by The Institute for Genomic Research. Whole Genome Shotgun sequencing was used to obtain more than eight-fold coverage of the genome. The complete genome was assembled into approximately 8200 scaffolds and deposited in GenBank. The accession for the WGS project is AAQA00000000 and consists of sequences AAQA01000001-AAQA01029808. File location:

 /fs/szasmg3/dpuiu/Brugia_malayi/Data/Bm.fasta 
 ctgs               min    q1     q2     q3     max        mean       n50        sum            
 26,879             200    836    1005   1495   611,244    3241.17    18986      87,119,350   

Contamination

 /fs/szasmg3/dpuiu/Brugia_malayi/Data/contam.fasta
 contaminants       min    q1     q2     q3     max        mean       n50        sum            
 2,929              200    527    675    820    8994       740.04     762        2,167,588

Data

 1.26M Sanger reads (original TA) :       medLen=773bp; medGC=32.57%
 1.18M Sanger reads (contamination free): medLen=771bp; medGC=32.36%
 3.21M 454 reads (original sff) :         medLen=274bp
 3.29M 454 reads (linker free)  :         medLen=247bp; medGC=36.39%

Original Traces

 SEQ_LIB_ID       INSERT_SIZE  INSERT_STDEV  TRACE_TYPE_CODE         
 1047113828118    1000         300           WGS              13500   
 1047113856575    1000         300           PRIMERWALK       325     
 1047111632737    1258         377           PRIMERWALK       3       
 1047111632737    1258         377           WGS              305,906  
 1047111540304    1415         424           WGS              51772   
 1047112577106    1415         424           WGS              337,789  
 1047111718946    3123         936           WGS              47597   
 1047113358719    3123         936           PRIMERWALK       173     
 1047113358719    3123         936           WGS              246,185  
 1047174912885    3123         936           TRANSPOSON       1437    
 1047113570927    6000         1800          WGS              3193    
 1047111814561    7158         2147          WGS              219,306  
 1047111480027    17168        5150          WGS              4087    
 1047111488095    17168        5150          WGS              3434    
 1047111495007    17168        5150          WGS              3716    
 1047111501919    17168        5150          WGS              3697    
 1047111480605    22419        6725          WGS              4638    
 1047111516154    22419        6725          WGS              4004    
 1047111523212    22419        6725          WGS              3766    
 1047111530126    22419        6725          WGS              5686    
 1047113855421    23000        6900          WGS              1       
 total                                                        1,260,215

FRG file: (contaminant free)

  • FRG.src : TI's
  • FRG.acc: 2 ..
  • DST.acc: 1260217, ... , 1260234
  • Location
 /fs/szasmg3/dpuiu/Brugia_malayi/Data/nucmer_seq/Bm-all.frg 
 DST     15
 FRG     1178192
 LKG     530930
 
         seqs         min    q1     q2     q3     max        mean       n50        sum            
 len     1,178,192    65     645    771    850    1214       724        800        853,847,771  => 8X
 gc%     1,178,192    0.00   29     32.36  35     100        32.41      33         .
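The gc% row and the "=> 8X" annotation can be reproduced with two small calculations. The sketch below assumes GC% is computed over the full read length and that fold coverage is total bases divided by genome size; the page does not say which genome size was used, so the 100 Mb value in the example is an assumption. Function names are illustrative.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a sequence (case-insensitive)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def fold_coverage(total_bases, genome_size):
    """Average sequencing depth: total read bases divided by genome size."""
    return total_bases / genome_size


if __name__ == "__main__":
    print(gc_percent("GCGCATAA"))
    # Sum of clear-range bases from the table above, assumed 100 Mb genome.
    print(round(fold_coverage(853_847_771, 100_000_000), 1))
```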

Problems:

  • All library insert sizes are underestimated ???
  • The contaminant reads align at ~91-93% id to the contaminant ctgs while the Mt/We reads align at 99% id to Mt/We finished seq. What %id threshold to use for contaminant?
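One way to experiment with the open threshold question is to flag reads from a table of alignment hits and vary the cutoffs. The sketch below is hypothetical throughout: the <code>Hit</code> record, the function name, and the default cutoffs (90% identity, half of the read aligned) are assumptions chosen only because 90% sits just below the ~91-93% identity observed for contaminant reads. It does not parse nucmer output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One read-to-reference alignment (hypothetical record layout)."""
    read_id: str
    pct_identity: float
    aligned_len: int
    read_len: int


def flag_reads(hits, min_identity=90.0, min_read_fraction=0.5):
    """Return the set of read ids with at least one hit passing both cutoffs."""
    flagged = set()
    for hit in hits:
        long_enough = hit.aligned_len >= min_read_fraction * hit.read_len
        if hit.pct_identity >= min_identity and long_enough:
            flagged.add(hit.read_id)
    return flagged


if __name__ == "__main__":
    hits = [
        Hit("r1", 92.0, 700, 770),  # typical contaminant-like hit
        Hit("r2", 99.0, 100, 770),  # high identity but short alignment
        Hit("r3", 85.0, 760, 770),  # long but below the identity cutoff
    ]
    print(sorted(flag_reads(hits)))
```

Rerunning with a few different values of <code>min_identity</code> and comparing the number of flagged reads is a simple way to see how sensitive the contaminant list is to the cutoff.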

BACS

8,000 BAC clones @Children's Hospital Oakland Research Institute. (!!! no NCBI TA submission)

PITT FTP data

  • 3.21M 454 reads ; about 13% are mated
  • 3K insert flx libraries (estimated to 2K based on alignment to the existing assembly)
  • 20K insert tit libraries (estimated to 28K ...)
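The estimated insert sizes above come from aligning mates to the existing assembly. The exact procedure is not described, so the following is only a sketch of one common approach: take mate pairs whose two reads land on the same contig and use the median outer distance. The input layout and the function name are assumptions.

```python
from statistics import median


def estimate_insert_size(mate_hits):
    """Median outer distance of mate pairs placed on the same contig.

    mate_hits: iterable of ((contig, start, end), (contig, start, end)),
    one tuple per mate pair (hypothetical layout). Pairs whose mates hit
    different contigs are skipped. Returns None if no pair qualifies.
    """
    spans = []
    for (contig1, start1, end1), (contig2, start2, end2) in mate_hits:
        if contig1 != contig2:
            continue
        spans.append(max(end1, end2) - min(start1, start2))
    return median(spans) if spans else None


if __name__ == "__main__":
    pairs = [
        (("c1", 100, 350), ("c1", 1900, 2100)),
        (("c1", 0, 200), ("c2", 10, 200)),      # different contigs, skipped
        (("c3", 500, 700), ("c3", 2300, 2500)),
    ]
    print(estimate_insert_size(pairs))
```

Because only pairs that fit inside a single contig are counted, this kind of estimate is biased low when contigs are short relative to the library size, which may be relevant to the underestimated insert sizes noted for the Sanger libraries.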

CBCB Location:

 /fs/szattic-asmg4/brugia_malayi/Data/
 /fs/szattic-asmg4/brugia_malayi/Data/Sff/  # Sff files
 /fs/szattic-asmg4/brugia_malayi/Data/Frg/  # Frg files
 /fs/szattic-asmg4/brugia_malayi/Data/Seq/  # Seq files

FTP access:

 lftp -u bma 136.142.191.201
 pass: 6279
 user: bma
 # empty as of --Dpuiu 12:04, 8 January 2010 (EST)

Elodie's table:

 /scratch1/brugia_malayi/brugia-sequencing-summary.txt.csv
 #    elodie's date  protocol  platform  type            description                                run_name                                                                                                  Reads    Mates
 1    01/17/2008     WGS       Standard  Full run (2/2)  Mix of worms (calibration of the machine)  R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample                                               534822   0
 2    07/01/2008     3Kb       Standard  Full            single worm (pUC contamination)            R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1                                      492575   84341 
 3    09/11/2008     3kb       Standard  4/8 wells       single worm (pUC contamination)            R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1                          263421   49258 
 4    10/01/2008     3Kb       Standard  Full            Mix of worms (still pUC contamination)     R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest                             59711    5096  
 5    02/01/2009     WGS       Standard  1/4 wells       Mix of worms; regions 2 & 3 were myxoma    R_2009_02_27_16_11_34_FLX10070260_adminrig_022709_GHEDIN                                                  18025    0
 6    04/06/2009     WGS       Standard  1/4 wells       Mix of worms; with comp. bio run           R_2009_04_15_14_46_56_FLX10070260_adminrig_041509_GHEDIN_r1-WGS1_r2-LMW4_r3-pool2compbio_r4-pool3compbio  118490   0
 7    05/01/2009     20Kb      Titanium  7/8 wells       Mix of worms                               R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1                              631287   213524
 8    10/28/2009     20Kb      Titanium  Full            Mix of worms                               R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2                                      1095713  377547
 9    11/17/2009     3 Kb      Titanium  Full            Mix of worms                               111209_Brugia_3kb.zip                                                                                     868928   ?
 .    Total                                                                                                                                                                                                   3411635  ?
  • 22 Sff files:
      run                                                                                                                                                                     sffReads  linker
 1    R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample/D_2008_01_31_18_01_35_FLX10070260_adminrig_FullAnalysis/sff/E4RA0X101.sff                                     272923  .
 1    R_2008_01_31_18_01_35_FLX10070260_adminrig_ghedintestsample/D_2008_01_31_18_01_35_FLX10070260_adminrig_FullAnalysis/sff/E4RA0X102.sff                                     261899  .
 
 2    R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1/D_2009_02_12_22_12_04_j_SignalProcessing/sff/FEZH5RS01.sff                                           228204  flx
 2    R_2008_08_06_13_52_29_FLX10070260_adminrig_080608_Ghedin-BrugiaLTPE1/D_2009_02_12_22_12_04_j_SignalProcessing/sff/FEZH5RS02.sff                                           264371  flx

 3    R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T02.sff                                                                            86862   flx
 3    R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T03.sff                                                                            87488   flx
 3    R_2008_09_19_14_17_55_FLX10070260_adminrig_091908_HATFULL-MIDrepeat_GHEDIN-LTPE1/FHAVB5T04.sff                                                                            89071   flx

 4    R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM01.sff                                                                               13695   flx
 4    R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM02.sff                                                                               14197   flx
 4    R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM03.sff                                                                               15515   flx
 4    R_2008_10_14_15_06_50_FLX10070260_adminrig_101408_GHEDIN-Brugia-pool_LTPEtest/FIOXLOM04.sff                                                                               16304   flx

 5    R_2009_02_27_16_11_34_FLX10070260_adminrig_022709_GHEDIN/FRLDXKV01.sff                                                                                                    18025   .

 6    R_2009_04_15_14_46_56_FLX10070260_adminrig_041509_GHEDIN_r1-WGS1_r2-LMW4_r3-pool2compbio_r4-pool3compbio/D_2009_04_16_14_19_21_morty_fullProcessing/FT9KOI001.sff         118490  .

 7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY01.sff                           73807   tit
 7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY02.sff                           91698   tit
 7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY03.sff                           93878   tit
 7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY04.sff                           90232   tit
 7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY05.sff                           97065   tit
 7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY06.sff                           94326   tit
 7    R_2009_06_05_16_18_41_FLX10070260_adminrig_060509_GHEDIN_Brugia-gDNA-TI20kb1/D_2009_06_08_15_32_36_compute-0-2_fullProcessing/sff/FW1OXFY07.sff                           90281   tit

 8    R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2/F4H5CMB01.sff                                                                                        551263  tit
 8    R_2009_10_22_15_30_12_FLX10070260_adminrig_102209_GHEDIN_Brugia20kb2/F4H5CMB02.sff                                                                                        544450  tit

 9    111209_Brugia_3kb_1.sff                                                                                                                                                   365339  tit
 9    111209_Brugia_3kb_2.sff                                                                                                                                                   503589  tit
 
      total                                                                                                                                                                     3,214,044  #without run 9
      total                                                                                                                                                                     4,082,972  #all runs
  • 22 Frg Libraries
. lib       meanIns(orig)  meanIns(est)    #reads       #mates  linker medLen  medGC
1 E4RA0X101    0           0               271066       0       .      250     37.04
1 E4RA0X102    0           0               260166       0       .      249     37.06

2 FEZH5RS01    3000        2000            181035       18676   flx    228     37.89
2 FEZH5RS02    3000        2000            211064       22270   flx    227     37.79

3 FHAVB5T02    3000        2000            68708        7850    flx    244     38.35
3 FHAVB5T03    3000        2000            69306        8227    flx    243     37.98
3 FHAVB5T04    3000        2000            70028        8353    flx    243     38.10

4 FIOXLOM01    3000        0               10921        0       flx    102     43.93   # no mates , shorter read length, highest GC !!!
4 FIOXLOM02    3000        0               11157        0       flx    103     43.75   # no mates , shorter read length, highest GC !!!
4 FIOXLOM03    3000        0               12197        0       flx    103     43.51   # no mates , shorter read length, highest GC !!!
4 FIOXLOM04    3000        0               12727        0       flx    103     43.56   # no mates , shorter read length, highest GC !!!

5 FRLDXKV01    0           0               17349        0       .      255     36.02

6 FT9KOI001    0           0               108826       0       .      256     36.10

7 FW1OXFY01    20000       28000           86127        15825   tit    276     35.40
7 FW1OXFY02    20000       28000           106911       19668   tit    275     35.37
7 FW1OXFY03    20000       28000           109874       20396   tit    271     35.23
7 FW1OXFY04    20000       28000           104797       18933   tit    269     35.23
7 FW1OXFY05    20000       28000           113716       20649   tit    265     35.09
7 FW1OXFY06    20000       28000           110693       20326   tit    270     35.05
7 FW1OXFY07    20000       28000           105931       19176   tit    271     35.18

8 F4H5CMB01    20000       28000           626046       109918  tit    241     36.08
8 F4H5CMB02    20000       28000           628432       118903  tit    256     35.87

9 111209_Brugia_3kb_1 3000 .               453449       100097  tit    199     31.19
9 111209_Brugia_3kb_2 3000 .               655457       165137  tit    214     30.93

. total        .           .               3,297,077    429,170  #without run 9
. total        .           .               4,405,983    694,404  #all runs
  • Sff seqs clr (good qual)
 .         seqs         min    q1     q2     q3     max        mean       n50        sum            
 all       3,214,044    0      240    274    383    2042       294        326        947,254,956 => 9.4X    
  • Frg seqs clr (good qual , no linker)
           seqs         min    q1     q2     q3     max        mean       n50        sum            
 all       3,297,077    3      156    248    301    2043       244        275        806,091,347 => 8X
 mated       858,340    64     107    156    223    612        171        201        147,070,298      
 unmated   2,438,737    2      207    261    335    2042       268        286        655,723,972
  • Frg seqs GC%
           seqs         min    q1     q2     q3     max        mean       n50        sum            
 all       3,297,077    0.00   29.25  36.39  44.54  86.76      36.86      39         .
 mated       858,340    0.00   28.48  34.35  40.65  78.75      34.74      36         .
 unmated   2,438,737    0.00   29.61  37.36  45.86  86.76      37.60      41         .
  • Locations:
 /fs/szattic-asmg4/brugia_malayi/Data/Sff/ 
 /fs/szattic-asmg4/brugia_malayi/Data/Frg/

Contaminant & high copy repeats

  • nucmer -maxmatch "-l 20 -c 65" or "-l 12 -c 24"
                       Sanger    454
 jird(26,879 ctgs)     31,501    197,420    # we'd probably find more contaminated reads if we align all the reads to the whole mouse genome ??

 Mt                    1,507     2,634     # 98% avg identity, 92% of read length

 We                    49,014    23,249    # 98% avg identity, 92% of read length
 
 UniVec                ?         661,586 
 pUC19                 134       562,107   # 99% avg identity, 99% of read length

 HhaI(~320bp)          16,336    69,400    # 90% avg identity, 63% of read length ;  
                                           # 29,021 out of 69,400  454 reads align 2+ times => tandem repeat
                                           # 11,882 out of 16,272  454 mated reads that align have both mates aligned => 30K+ repeats          
 mRNA(264bp)           20,504    59,706    # 80% avg identity, 65% of read length

 total                 ~119,000  ~910,000
  • List of contaminated reads:
 /nfshomes/dpuiu/Brugia_malayi/Data/nucmer_sanger/problems.qry_hits  #   118,996  Sanger reads
 /nfshomes/dpuiu/Brugia_malayi/Data/nucmer_454/problems.qry_hits     #   914,525  454 reads
 /nfshomes/dpuiu/Brugia_malayi/Data/problems.qry_hits                # 1,033,521  Sanger+454 reads
  • Other possible contaminants: Schistosoma

In my latest Brugia assembly, I looked for contigs/degenerates that were exclusively Sanger reads, thinking they might be jird contaminants. I came across a degenerate, deg1596341, with 417 reads, all Sanger, and only 1235bp long. When I BLAST it against NCBI, the best hit (entire length, 99% identity) is to Schistosoma--then poorer hits to 28s rRNAs. It has lots of mate pairs to another degenerate, which matches Schistosoma just as well and in the right position, but that degenerate has some 454 reads. (Art)

 >deg1596337                                                               
 ATTAGACAGTCGGATTCCCCGAGTCCGTGCCAGTTCTAAGTTGACTGTTTAACGCCGGCCGAAATATCAA    
 ATAAAACATTTACTTTTTTAAAAAAAAAAATAAAAAAATAAATGTTGATATGCAGCTATAACGGTCCATA    
 AGACAGTTCGAACACTAGCCGAGTTTCATCAAAATGAATACATTTTTTTTTTTTAATGTTTTCATTTTAA    
 TGTTACACTGCATGGATCAAACCGTACTCACTTCACATTACAGCCCGACCGGCCCAGTCCTTAGAGCCAA    
 TCCTTATCCCGAAGTTACGGATCTAATTTGCCGACTTCCCTTACCTACATTATTCTATCGACTAGAGGCT    
 GTTCACCTTGGAGACCTGCTGCGGATATGGGTACGATCTGGCACGAAATTCAAATAGCTTCCCTCGGATT    
 TTCATGGATCGAACAAAGCGCACGAGACACCACAGGAACCGTGGCGCTTTACGGAAACAACATCCCTATC    
 TCCGGCTGAACCGATTCCAGGGAGTCCGTTCCTTAACCAGAAAAGAGAACTCTGGCTCGGGCTTTCCTCA    
 ATGTTTCCGAGTTCATTTGCGTTACCGCGCTAAATTCTCACGATGAGCATTTATCTCCGTGTCCAGGTAC    
 GGGAATATTAACCCGTTTCCCTTTCGATTTATCAGATGGATTACACCTCCATTCCTCTATTTTATTTTAA    
 AAAACGGCACTAGCCAATATCTTAGGATCGACTGACCCACATTCAACTGCTGTTCACGTGGAACCCTTCT    
 CCACTTCAGTCTTCAAGGATCTCACTTGAATATTTGCTACTACCACCAAGATCTGCACCAATGGAAGCTT    
 CAACCGGGCCTACGCCCAAAGTCTTCAACGCTAACCATTGCGACCCTCTTACTCGTTGCGGCCAGATTTC    
 CCAAAAAAAAAAAAACACAAGCCATGCAACGGTTGAGTATAAGTCTCCCGCTCAAGCGCCATCCATTTTC    
 AGGGCTAGTTGATTTGGCAGGTGAGTTGTTACACACTCCTTAGCGGTTTCCAACTTCCATGGCCACCGTC    
 CTGCTGTCTATATCAACCAACGCCTTTCATGGGGTCTCATGAGCGGAAAGTTTGGCACTTTAACTCAACG    
 TTTGGTTCATCCCACAGCGCCAGTTCTGCTTACCAAAAATGGCCCACTTGGAGCACACATTCAATGTCTA    
 TGCTTCATAAAAAATTTAAGCAAGCAAGACGTCATACTCATTGAAAGTTTGAGAATAGGTTGAAGAC       
 >deg1596341                                                               
 CCAATTATACCAAAGATAATCTTTACTTTCATTATGCTTTTTATCTTTTAAATTAGGTTTACTACCCAAT    
 AACTTGCGTATATGCTAGACTCCTTGGTCCGTGTTTCAAGACGGGTCAGATAGGTGATTAACGTTCACAT    
 CGAGATGTAACTTTATTGCATACAATATTATAATATTACCAATTATTTTTACCGATAAAGTCGCATGCGA    
 CCACATGTAAAATAATAATAAGCAAAATTATAATCGATACATGTCACTATTATTTCAAGTGAAAGTTACA    
 TATATGGGAAAAAAAAAAAAAACTTCATCTAAGACATATTTCAACATAATTTAGGATTCCAATTATCAAT    
 TGAAATAATTGGTCCACTAAATTAACTTGTATTAATATGCTAAAATGAAGTTCTCGATGCATACCATCGG    
 TAAATACACCAATCTATGCATATACTGCTAATTTAGCATTAATATCATTTTATTCATTAATAAAAAAAAA    
 AAAATTATTAATGAATAATGAAATGAATTATGATTGCTAAATTGATTGGTTGAATACCGATAAGTTTTGT    
 TAACTCTATCCGTTTCCATCTCAGCGGTTTCACGCCCTCTTGAACTCTCTCTTCAAAGTTCTTTGCAACT    
 TTCCCTCACGGTACTTGTTTGCTATCGGTCTCATGGTCGTATTTAGCCTTAGATGAGGTTTACCACCCTC    
 TTTGGGCTGCAATCTCAAACAACCCGACTCCAAGGAATAACCTACCGTAACTTTTTTCACCCGTACAGGT    
 CTAGCACCTTCTATGGACTGTAGCCCCGCTCAAGGGGACTTTGGGTGTAAAAATATGTTACGGATAGTTA    
 TACCTATACGCTACATTTCCATATAGCCATATAATGTCTATTGGATTCAGCGTTGGGCTTTTTCCTTTTC    
 ACTCGCCGTTACTAGGGAAATCCTCGTTAGTTTCTTTTCCTCCGCTTAGTTATATGCTTAAATTCAGCGG    
 GTAATCACGACTGAGTTGAGGTCAAAAAAAAAAAAAATGATATAAAACATATTGAAATTATCATTCATAT    
 ATATATGCTAATTTTTTACCTTATTTATTTGTTTATTTTAATGTTTCAAATAACTTGCATTTTAATTTGA    
 AACATTTAACAACAAAACAAACAAACAATAAAGTAAATCAATGCATAATAAATAAATAATTGTAATCTTT    
 CTTTATTATTTATTCATGAAAGATTACTTTTTAATATATATATAT

... possible contamination at the library construction level. Schisto was being sequenced at the same time as Brugia at TIGR. Does this mean we should first filter all the Sanger reads against Schisto, now that the Schisto genome is available? (Elodie)

Assemblies

TIGR/NCBI

  • 9X coverage, 856K Sanger traces => 8,200 scaff & 29,808 ctg (avg. scaff=~10K & avg ctg=~3K)
  • "scaffolds totaling ~71 Mb of data with a further ~17.5 Mb of contigs not integrated into any scaffold (orphan contigs)" (Science 2007)
  • NCBI AAQA00000000 AAQA01000001-AAQA01029808
 * 26,879 good ctgs
 * 2,929 jird contaminants (Example: AAQA01001321 : mouse 99%id hits)
  • Stats
 .                      elem       min    q1     q2     q3     max        mean       n50        sum            
 ctg.len(good)          26879      200    836    1005   1495   611244*    3241.17    18986      87,119,350       
 ctg.len(contaminants)  2929       200    527    675    820    8994       740.04     762        2,167,588       
 .                      elem       min    q1     q2     q3     max        mean       n50        sum            
 ctg.gc%(good)          26878      0.00   24.77  28.56  32.27  72.30      28.86      29         .
 ctg.gc%(contaminants)  2929       18.09  39.16  43.59  48.35  75.96      44.10      44         .
  • Location
 /fs/szasmg3/dpuiu/Brugia_malayi/Assembly/TIGR/ <-> NCBI
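
The gc% rows above show little overlap between the two groups (good contigs: q3 = 32.27; jird contaminants: q1 = 39.16), so GC content could serve as a cheap pre-screen before alignment. A minimal sketch; the 39.0 cutoff is an assumption picked from those quartiles, not a value used in the project, and since good contigs reach 72.30% GC, flagged contigs are only candidates for alignment against the contaminant genome:

```python
def gc_percent(seq: str) -> float:
    """GC% over unambiguous bases (A, C, G, T); returns 0.0 if there are none."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def flag_gc_outliers(contigs, cutoff=39.0):
    """Return names of contigs whose GC% is at or above the cutoff.

    `contigs` maps contig name -> sequence. The default cutoff is a
    hypothetical value near q1 of the contaminant gc% distribution.
    """
    return [name for name, seq in contigs.items() if gc_percent(seq) >= cutoff]
```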

PITT

  • Date: 11/05/08
  • Stats:
                      elem       min    q1     q2     q3     max        mean       n50        sum
 scf.len              3170       2000   2917   4483   14471  6534162*   22916      112914*    72,643,770 (66,051,795bp without gaps)
 scf.gc%              3170       15.70  25.53  28.17  30.95  66.60      28.46      28         .
  • Location:
 /fs/szasmg3/dpuiu/Brugia_malayi/Assembly/PITT/
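
The columns of the stats tables on this page (elem, min, q1, q2, q3, max, mean, n50, sum) can be recomputed from a list of sequence lengths roughly as follows. This is a sketch: the tool that produced the tables is not named here, so the quartile method (inclusive interpolation) is an assumption and small differences in q1/q2/q3 are possible. It needs Python 3.8+ and at least two lengths.

```python
from statistics import mean, quantiles


def n50(lengths):
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Return one stats row as a dict keyed by the table's column names."""
    q1, q2, q3 = quantiles(lengths, n=4, method="inclusive")
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "max": max(lengths),
        "mean": mean(lengths),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }
```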

CBCB CA 5.1 Sanger

  • Assembler: wgs 5.1
  • Date: 2008/08/26
  • Input: filtered Sanger reads
  • better assembly than the published one (ctg n50 24,748 vs. 18,986 for the good TIGR/NCBI contigs)
  • the HhaI repeat appears in a few dozen contigs but not in tandem
  • Stats:
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  10317      935    1215   1538   3462   3890532    8018.85    41716      82,730,474       
 scf2K+               3656       2001   3181   5733   18904  3890532    20189.57   50293      73,813,083
 ctg                  12753      273    1245   1632   3873   376744     6113.39    24748      77,964,006       
 deg                  9661       65     858    949    1023   72494      1240.97    1008       11,988,997       
 singl                134119 (11.43%)
 reads                1178192(100%)
  • Location:
 /fs/szasmg3/dpuiu/Brugia_malayi/Assembly/CBCB/2008_0826_CA/

CBCB CA 6.0 454 (failed)

  • Assembler: wgs 6.0-beta
  • Input: 3,297,077 454 sffToCA processed reads
  • Locations:
 ginkgo:/scratch1/brugia_malayi/Assembly/454/CA.failed/  
 /scratch1/ -> umiacsfs01:/xraid03
 ginkgo: 32 proc, 128G mem
 genome6.umd.edu:/genome6/raid/dpuiu/Brugia_malayi/Assembly/CA.bog/
 genome6: 32 proc, 256G mem
  • Problem: high frequency contamination & repeats
 obtMerThreshold, ovlMerThreshold set on auto (default) !!!
 runCA estimated them to: (see runCA.log)
   Reset OBT mer threshold from auto to 37235.
   Reset OVL mer threshold from auto to 43186.
 => olap-from-seeds very memory/cpu intensive!!! 
 Example: 6 jobs: each is 2 thread, ~ 20G mem
 merOverlapperSeedConcurrency=6 => 6 jobs
 merOverlapperExtendBatchSize=20000
 $ ps -C olap-from-seeds 
 PID %MEM   RSZ(KB) %CPU STIME TIME     CMD
 13158  0.0 1132     0.0 10:21 00:00:00 /bin/sh 1-overlapper/olap-from-seeds.sh 90
 ...
 13163  0.0 1136     0.0 10:21 00:00:00 /bin/sh 1-overlapper/olap-from-seeds.sh 95
 13199 15.6 20675720 133 10:21 02:46:39 olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0092.frgcorr.WORKING -o 1-overlapper/olaps/0092.ovb.WORKING.gz asm.gkpStore 1820001 1840000
 ...
 13205 15.2 20139808 138 10:21 02:52:35 olap-from-seeds -a -b -t 2 -S 1-overlapper/asm.merStore -c 3-overlapcorrection/0091.frgcorr.WORKING -o 1-overlapper/olaps/0091.ovb.WORKING.gz asm.gkpStore 1800001 1820000
 $ vmstat
 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0 9534576 121356  10124 2743568    5   23     7    23   36   69 16  0 83  1  0
 $ free 
              total       used       free     shared    buffers     cached
 Mem:     132168632  132043920     124712          0      10064    2934076
 -/+ buffers/cache:  129099780    3068852
 Swap:     67108856    8842720   58266136
  • 1-overlapper
 #overlaps/read
 0+               3297077
 0                1282138 !!! more than 1/3 of reads have no overlaps
 1+               2014939
 100+             561195

                  #seqs     #maxOvl
 100+             561196    99376
 UniVec           359340    99376
 pUC19            355392    99376 
 contam           116857    97757
 HhaI             50474     29186
 mRNA             36296     12935
 other            46214     4720
  • 8-consensus failed on utg
  • 9-terminator => asm.asm.FAILED
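
The memory figures above (6 olap-from-seeds jobs at about 20G resident each on the 128G host) leave very little headroom, which may be part of why swap was in use. A minimal sketch of the arithmetic, using the RSZ and `free` totals from the listings; the reserve percentage is a hypothetical safety margin, not a runCA setting:

```python
def max_concurrent_jobs(total_kb: int, per_job_kb: int, reserve_percent: int = 10) -> int:
    """Largest job count whose combined resident memory fits in RAM minus a reserve.

    total_kb        : physical memory in KB (the 'total' column of `free`)
    per_job_kb      : resident size of one job in KB (the RSZ column of `ps`)
    reserve_percent : share of RAM kept free for the OS and caches (assumption)
    """
    if per_job_kb <= 0:
        raise ValueError("per_job_kb must be positive")
    usable_kb = total_kb * (100 - reserve_percent) // 100
    return usable_kb // per_job_kb


# Values taken from the ps/free output above.
TOTAL_KB = 132168632
JOB_KB = 20675720

jobs_no_reserve = max_concurrent_jobs(TOTAL_KB, JOB_KB, reserve_percent=0)
jobs_with_reserve = max_concurrent_jobs(TOTAL_KB, JOB_KB, reserve_percent=10)
left_over_kb = TOTAL_KB - 6 * JOB_KB  # RAM left with 6 jobs running
```

With no reserve, 6 jobs fit nominally, leaving 8,114,312 KB; with a 10% reserve only 5 fit. A lower merOverlapperSeedConcurrency might therefore be worth trying if this run is repeated on the same host.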

CBCB CA 6.0 454

  • Input: 3,297,077 454 sffToCA processed reads
  • obtMerThreshold=200, ovlMerThreshold=60
 .                        elem     min    q1     q2     q3     max      mean     n50     sum         
 scf.len                  1406     585    1077   1178   1416   23313    1583.79  1354    2226813     
 ctg.len                  1513     280    1074   1177   1404   4952     1309.21  1257    1980831     
 deg.len                  192840   63     173    262    368    5688     290.91   339     56098824    
 utg.len                  194394   63     174    263    373    5691     298.80   348     58084744    
 singl.len                1039561  64     162    249    277    1519     244.45   265     254119265   
 seq.len                  2618840  64     140    239    287    1519     234.61   268     614416134   
 
 .                        elem     min    q1     q2     q3     max      mean     n50     sum         
 scf.gc%                  1406     17.45  29.91  32.75  35.47  69.26    33.21    33      .           
 ctg.gc%                  1513     17.45  30.01  32.54  34.93  69.26    32.94    33      .           
 deg.gc%                  192840   4.08   27.62  33.33  40.25  80.99    34.17    35      .           
 utg.gc%                  194394   4.08   27.65  33.33  40.20  80.99    34.16    35      .           
 singl.gc%                1039561  0.00   34.62  43.11  49.35  86.76    41.58    45      .           
 seq.gc%                  2618840  0.00   29.44  37.01  45.48  86.76    37.36    40      .

CBCB CA 6.0 Sanger

  • Input: 1,178,192 Sanger "clean" reads
  • obtMerThreshold=200, ovlMerThreshold=60
 .                        elem     min    q1     q2     q3     max      mean     n50     sum         
 scf.len                  10148    904    1235   1578   3564   2324103  8619.28  45146   87468465   
 scf.len2K+               3701     2001   3123   6500   20644  2324103  21272.70 52830   78730278  
 ctg.len                  12659    274    1265   1658   3811   565900   6478.56  27474   82012118    
 deg.len                  8497     65     820    914    987    82776    1211.94  980     10297846    
 utg.len                  40545    64     943    1247   1751   306650   2459.44  4231    99718105    
 singl.len                89048    64     545    702    812    1181     661.17   753     58875872    
 seq.len                  1173341  64     716    827    915    1222     790.88   850     927974911   
 
 .                        elem     min    q1     q2     q3     max      mean     n50     sum         
 scf.gc%                  10148    13.21  25.02  28.34  32.07  65.75    29.11    29      .           
 ctg.gc%                  12659    13.21  24.86  28.22  31.70  65.75    28.78    28      .           
 deg.gc%                  8497     7.87   23.87  28.90  33.86  77.13    29.60    30      .           
 utg.gc%                  40545    1.39   25.26  29.02  32.72  77.13    29.29    30      .           
 singl.gc%                89048    0.00   31.87  38.97  45.31  99.15    38.52    41      .           
 seq.gc%                  1173341  0.00   29.41  32.40  35.18  99.15    32.46    33      .
  • Rerun using
    • all Sanger reads : did not improve the stats
    • isNotRandom=1 for all libs : did not improve the stats

CBCB CA 6.0 Sanger+454 (Best so far)

  • Input: 1,178,192 Sanger "clean" reads ; 3,297,077 454 sffToCA processed reads
  • obtMerThreshold=400, ovlMerThreshold=120
 .                        elem     min    q1     q2     q3     max      mean     n50     sum         
 scf.len                  10254    283    1149   1472   2552   1789338  9181.77  108218  94149853
 scf.len2K+               3193     2000   2728   4513   14385  1789338  26585*   150310* 84,886,293*  (75,044,213bp without gaps)  
 ctg.len                  13607    66     1181   1578   3187   522243   6196.05  30726   84309668    
 deg.len                  114780   63     195    279    478    25405    367.61   478     42193843    
 utg.len                  157710   63     231    383    885    131968   901.05   1602    142104793   
 singl.len                904272   64     199    255    306    1690     282.94   278     255858016     # 831,547 454 + 72,725 Sanger
 seq.len                  3849841  64     202    281    662    1690     410.15   699     1579010968  
 
 .                        elem     min    q1     q2     q3     max      mean     n50     sum         
 scf.gc%                  10254    13.80  24.59  28.34  32.95  69.26    29.44    29      .           
 ctg.gc%                  13607    13.80  24.54  28.17  32.02  69.26    28.95    28      .           
 deg.gc%                  114780   3.75   29.39  38.22  45.45  80.99    37.40    41      .           
 utg.gc%                  157710   0.00   27.63  34.59  43.08  80.99    35.28    38      .           
 singl.gc%                904272   0.00   36.60  43.75  49.49  99.33    42.44    46      .           
 seq.gc%                  3849841  0.00   29.34  34.26  41.98  99.33    35.73    36      .           
  • Location:
 ginkgo:/scratch1/brugia_malayi/Assembly/hybrid/CA/ 
  • Rerun using bogBadMateDepth = 4 (default is 7) at Aleksey's advice: utgs are smaller; the run failed in cgw with a "scaffolder failed" message
  • scaffolds:
10254 : total
904   : begin in surrogates
1029  : end in surrogates

CBCB newbler deNovo 454 (failed)

  • Still running after 7 days (killed)
 Detangling alignments...
  -> Level 2, Phase 8, Round 1...
 PID  %MEM   RSZ %CPU STIME     TIME CMD
 4576  2.1 1427100 94.4 Feb16 7-03:26:12 /fs/szdevel/dpuiu/454/bin/runProject .

CBCB newbler deNovo 454

  • Filtered contaminants, high copy repeats
 deleted 1524401 
 kept    1772677
  • Input: 454 CA gkp dump
        elem       min    q1     q2     q3     max        mean       n50        sum            
 len    1772677    64     216    270    357    2044       294.13     312        521393273      
 gc%    1772677    0.00   28.60  34.15  40.70  83.90      34.83      36         .
  • Output
 # ctg stats
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 Len                  38315      100    217    319    437    5560       346.60     402        13,280,032       
 GC%                  38315      0.00   27.69  32.27  37.33  73.73      32.83      33         .
 # scf stats
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 Len                  69         2006   2157   2422   2795   9476       2932.87    2629       202368     
 # read counts
                       count    %
 All                   1772677 100
 Singleton             884948  50.01  
 Assembled             780126  44.08  
 PartiallyAssembled    52199   2.95   
 Outlier               33009   1.87   
 TooShort              11372   0.64   
 Repeat                8028    0.45   
 # read GC%
                       elem       min    q1     q2     q3     max        mean       n50        sum            
 Assembled             780126     0.00   26.47  31.74  37.20  79.35      32.12      33       
 Singleton             884948     0.00   30.17  36.76  43.87  86.76      36.96      39
 Singleton.Mapped      434267     0.00   27.22  32.69  38.91  81.43      33.31      34         
 Singleton.Unmapped    450681     3.37   34.65  40.75  46.74  86.76      40.47      42         
 # mate pair counts
                       count    %
 All                   178718 100
 Link                  72421  40.52  
 OneUnmapped           62534  34.99  
 BothUnmapped          42982  24.05  
 FalsePair             344    0.19   
 SameContig            344    0.19   
 MultiplyMapped        92     0.05
  • Most assembled contigs or unmapped singletons seem to be contaminants (aligned by blast to human/mouse/rat) => more contamination
  • Location
 ginkgo:/scratch1/brugia_malayi/Assembly/454/newbler.deNovo/

CBCB newbler refMapper 454

  • Assembler: newbler 2.3
  • Host: CBCB walnut server
  • Input
 # NCBI ref assembly
                                      ctgs       min   q1   q2   q3   max    mean   n50   sum
 Len                                  26,879     200   836  1005 1495 611244 3241   18986 87,119,350 
 #Sff reads
 .                                    seqs       min  q1   q2   q3   max   mean    n50  sum        
 Len                                  3,214,044  0    240  274  383  2042  294.72  326  947,254,956
  • Output
 #Ctg stats
 .                                    ctgs       min   q1   q2   q3   max   mean    n50  sum        
 Len                                  101,286    100   236  323  530  7013  433.36  535  43,893,507
 #Trimmed read stats
 .                                    seqs       min  q1   q2   q3   max   mean    n50  sum        
 All                                  3,898,373  1    45   163  264  1995  167.84  265  654319111
 Full|Partial                         1,085,167  20   119  216  285  706   214.13  271  232364804
 Chimeric|Repeat|Unmapped|TooShort    2,015,920  20   111  221  276  1995  208.85  263  421032444
 Deleted                              797,286    1    1    1    1    19    1.16    1    921863
 #Trimmed read counts
             count    %
 All         3898373  100
 Chimeric    25460    0.65   
 Deleted     797286   20.45  !!!
 Full        1001745  25.7   
 Partial     83422    2.14   
 Repeat      406119   10.42  !!!
 TooShort    14031    0.36   
 Unmapped    1570310  40.28  !!!
 #Mate pair counts
                 count   %
 BothUnmapped    301390  42.86
 OneUnmapped     110922  15.77
 MultiplyMapped  108641  15.45
 FalsePair       106249  15.11
 TruePair        75992   10.81
  • Ref ctgs partially assembled
 # len
                  ctgs       min    q1     q2     q3     max        mean       n50        sum
 all              26879      200    836    1005   1495   611244     3241       18986      87119350
 assembled        14627      206    881    1265   2778   611244     5081.46    27414      74326497       
 not_assembled    12252      200    812    920    1075   32555      1044.14    988        12792853       
 # gc%
 .                 elem       min    q1     q2     q3     max        mean       n50        sum            
 all               26878      0.00   24.77  28.56  32.27  72.30      28.86      29         .
 assembled         14626      0.76   23.74  27.47  30.26  60.37      27.24      28         .
 not_assembled     12252      0.00   26.87  30.27  34.39  72.30      30.80      31         .

CBCB CA Sanger (Art's)

My redo of the assembly using just the original Sanger reads (after removing jird contaminants and doing some extra vector trimming) gave the following:

 TotalBasesInScaffolds 81,379,515
 N50ScaffoldBases          80,913   <<** wrt TBS=70676234
 MaxBasesInScaffolds    6,446,756
 IntraScaffoldGaps          2,758

 TotalContigsInScaffolds   12,564
 MaxContigSize            565,900
 N50ContigBases            36,160   <<** wrt TBS=70676234

The read coverage of unitigs was strongly biased by GC content. E.g., unitigs with 23% GC averaged one read every 134bp, while unitigs with 40% GC averaged one read every 23bp. So I used these values to recompute the unitig astats (the astat indicates whether a unitig is likely a repeat or not). This is a more principled way of doing the "boosting" that we did on the original assembly. The assembly changed to:

 TotalBasesInScaffolds 79,851,223
 N50ScaffoldBases         100,938   <<** wrt TBS=70676234
 MaxBasesInScaffolds    6,435,383
 IntraScaffoldGaps          2,616

 TotalContigsInScaffolds   12,232
 MaxContigSize          1,356,278
 N50ContigBases            39,235   <<** wrt TBS=70676234
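
The GC-based astat recomputation described above can be sketched as follows. This is not Art's code. It assumes the commonly cited form of the A-statistic (expected read arrivals over the unitig minus ln 2 per read) and replaces the single global arrival rate with a GC-dependent one. Only the two (GC%, reads-per-bp) points quoted above come from this page; the linear interpolation between them, the clamping outside that range, and the global rate used in the comparison are assumptions for illustration.

```python
import math

# (GC%, reads per bp) points quoted in the text: one read per 134bp at 23% GC,
# one read per 23bp at 40% GC.
GC_RATE_POINTS = [(23.0, 1.0 / 134.0), (40.0, 1.0 / 23.0)]


def arrival_rate_for_gc(gc, points=GC_RATE_POINTS):
    """Expected reads per bp at a given GC%.

    Linear interpolation between known points, clamped at both ends
    (the interpolation scheme is an assumption).
    """
    pts = sorted(points)
    if gc <= pts[0][0]:
        return pts[0][1]
    if gc >= pts[-1][0]:
        return pts[-1][1]
    for (g0, r0), (g1, r1) in zip(pts, pts[1:]):
        if g0 <= gc <= g1:
            t = (gc - g0) / (g1 - g0)
            return r0 + t * (r1 - r0)
    raise AssertionError("unreachable: gc lies within the point range")


def astat(rho, n_reads, arrival_rate):
    """A-statistic: positive values favor a unique unitig, negative a repeat.

    rho is the unitig length in bp, n_reads the number of reads in it.
    """
    return rho * arrival_rate - math.log(2) * n_reads


# A 5000bp unitig at 40% GC holding 217 reads (about one read per 23bp).
adjusted = astat(5000, 217, arrival_rate_for_gc(40.0))
# Same unitig scored with a hypothetical global rate of one read per 60bp.
unadjusted = astat(5000, 217, 1.0 / 60.0)
```

In this illustration the GC-adjusted score is positive and the unadjusted one negative, i.e. a high-GC unique unitig would be mistaken for a repeat without the adjustment.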

Note that all assemblies above used only the original Sanger reads. I next added the 3Kb and 20Kb paired 454 reads (with some extra linker trimming and removal of duplicate mate-pairs). This reduced the size of unitigs (N50 fell from 9142 to 5297), indicating there are still some trimming issues. The coverage bias is also smaller with the 454 reads, and the astat adjustments are less effective. The best assembly of these data I have so far (based on N50 sizes) calculated astats assuming a genome size of 70Mb:

 TotalBasesInScaffolds 80,688,005
 N50ScaffoldBases         358,475   <<** wrt TBS=70676234
 MaxBasesInScaffolds    3,020,329
 IntraScaffoldGaps          4,039

 TotalContigsInScaffolds   13,894
 MaxContigSize            602,785
 N50ContigBases            43,796   <<** wrt TBS=70676234
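
The "wrt TBS=70676234" annotations suggest that the N50 values in these three blocks were computed against one fixed base total rather than against each assembly's own sum, which keeps them comparable across assemblies of different total size. A minimal sketch under that reading (an assumption; the function name is hypothetical):

```python
def n50_wrt_total(lengths, total_bases):
    """N50 measured against a fixed base total.

    Walks the lengths from largest to smallest and returns the length at
    which the running sum first reaches half of total_bases. Returns None
    if the sequences together never reach that point.
    """
    half = total_bases / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return None
```

For example, `n50_wrt_total(scaffold_lengths, 70676234)` would score every assembly against the same 70,676,234bp total, where `scaffold_lengths` is a list of that assembly's scaffold lengths.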

Files

 /fs/szattic/asmg1/adelcher/Genomes/Brugia             : Art's files
 /fs/sztmpscratch/cole/tarchive_download/brugia_malay  : Cole's files
 /fs/szasmg3/dpuiu/Brugia_malayi/                      : Daniela's files

 /scratch1/brugia_malayi/Data/                         : ftp PITT data
 /fs/szattic-asmg4/brugia_malayi                       : ftp PITT data (as well)