Megachile rotundata

From Cbcb
Revision as of 20:08, 10 September 2010 by Dpuiu

Data

Original Traces

  • 8 pairs of data files (paired ends)
 cat trace.count | grep _1_ | sed 's/_sequence.txt//' | perl -ane 'print "  ",$F[1],"\t",$F[0]/4,"\t",$F[0]/2,"\n";'
 lib        insert   mates           reads        readLen   ~coverage(500M genome)
 s_2_3kbp   3000     21,563,283      43,126,566   124       11
 s_2_8kbp   8000     198,377         396,754      124       0.1
 s_3        475      35,548,153      71,096,306   124       18
 s_4        475      35,471,044      70,942,088   124       18
 s_5        475      35,616,846      71,233,692   124       18
 s_6        475      35,303,840      70,607,680   124       18
 s_7        475      34,893,313      69,786,626   124       18
 total      .        198,594,856     397,189,712  124       98*
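The ~coverage column follows from reads × readLen / genome size. A minimal sketch, assuming the 500 Mbp genome size from the table header; the `coverage` helper name is made up for illustration:

```shell
# coverage READS READLEN GENOME_SIZE
# Prints the approximate fold coverage with one decimal place.
coverage() {
  awk -v reads="$1" -v len="$2" -v genome="$3" \
    'BEGIN { printf "%.1f\n", reads * len / genome }'
}

coverage 43126566 124 500000000    # s_2_3kbp
coverage 397189712 124 500000000   # all libraries together
```

For s_2_3kbp this prints 10.7, which the table rounds to 11; for all reads it prints 98.5.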

Corrected Traces

  • Mated ones
 lib        insert   mates           reads          repeatReads
 s_2_3kb    3000     4,823,235       9,646,470      4,349,208  (45%)
 s_2_8kb    8000     111,267         222,534        167,246    (75%)
 s_3        475      33,024,597      66,049,194     35,777,342 (54%)
 s_4        475      33,237,593      66,475,186     
 s_5        475      33,150,790      66,301,580  
 s_6        475      33,223,371      66,446,742  
 s_7        475      32,647,890      65,295,780
 total      .        170,218,743     340,437,486
  • repeatReads:
    • at least one of the mates contains a perfect match to one of the 15 frequent 22mers listed below
    • 32.5% GC in repeatReads vs ~35.5% GC in uniqueReads
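The repeatReads definition amounts to a fixed-string search over both mates of a pair. A rough sketch, assuming a hypothetical `kmers.txt` that lists every frequent 22mer and its reverse complement one per line (no empty lines), and 4-line FASTQ records in the same order in both mate files. This is an illustration only, not necessarily how the counts in the table were produced:

```shell
# count_repeat_pairs KMERS MATE1_FASTQ MATE2_FASTQ
# Prints the number of pairs in which at least one mate contains a listed kmer.
count_repeat_pairs() {
  # paste puts line i of both files side by side (tab separated);
  # every 4th line starting at line 2 is the sequence line of a record.
  paste "$2" "$3" | awk 'NR % 4 == 2' | grep -c -F -f "$1"
}
```

Usage would look like `count_repeat_pairs kmers.txt s_2_1_3kb_sequence.txt s_2_2_3kb_sequence.txt`; doubling the result gives a read count comparable to the repeatReads column.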

Adaptors

 >circularization
 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 >circularization.revcomp
 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
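The second adaptor entry is the reverse complement of the first, which can be checked with a small helper. A minimal sketch; the `revcomp` function name is an assumption, and it handles plain A/C/G/T only:

```shell
# revcomp SEQ
# Prints the reverse complement of a DNA string (A/C/G/T, either case).
revcomp() {
  printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' |
    awk '{ out = ""
           for (i = length($0); i > 0; i--) out = out substr($0, i, 1)
           print out }'
}

revcomp CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
```

The call prints TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG, matching the .revcomp entry. The same helper reproduces the right-hand column of a kmer|revcomp pair.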

Frequent kmers

  • 22mers which seem to appear in tandem
                                                            ~ %reads    
                                                       s_2_3kb s_2_8kb s_[4567] 
                                                       -------------------------
    1  AATCATACAATCACAATCATAC|GTATGATTGTGATTGTATGATT   12.04   20.1    9.99        # 14mer tandem repeat : AATCATACAATCAC|GTGATTGTATGATT
    2  CAATCACAATCATACAATCACA|TGTGATTGTATGATTGTGATTG   10.5    17.87   8.3
    3  AATAATATGAGTTAGATTGATA|TATCAATCTAACTCATATTATT   7.94    11.77   21.47
    4  AGTAATTGTCGTTCTATCGATC|GATCGATAGAACGACAATTACT   5.08    7.47    13.04
    5  ATATAAGCATAATATGGCTAAT|ATTAGCCATATTATGCTTATAT   5.01    7.55    15.15
    6  CACACAATCACACAATCACACA|TGTGTGATTGTGTGATTGTGTG   4.72    8.57    2.32
    7  ATTACTCTTATTATTATCAATC|GATTGATAATAATAAGAGTAAT   4.62    6.67    11.8
    8  TCACACAATCACAATCACACAA|TTGTGTGATTGTGATTGTGTGA   3.76    7.01    1.54
    9  ACAATTACTATACTTATTACTC|GAGTAATAAGTATAGTAATTGT   2.94    4.39    8.46
   10  AGACAGAGACAGAGACAGAGAC|GTCTCTGTCTCTGTCTCTGTCT   2.17    5.66    1.03
   11  CACAATCACGATCACACAATCA|TGATTGTGTGATCGTGATTGTG   1.43    2.25    0.5
   12  CTGTCTCTGTCTGTCTCTGTCT|AGACAGAGACAGACAGAGACAG   1.34    3.77    0.68
   13  CAGCGGATATGTGCGAATTAGA|TCTAATTCGCACATATCCGCTG   0.8     0.54    0.73
   14  CTGAGCACAATTCAACACCACA|TGTGGTGTTGAATTGTGCTCAG   0.58    0.35    0.68
   15  AACCTAACCTAACCTAACCTAA|TTAGGTTAGGTTAGGTTAGGTT   0.06    0.15    0.03
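Whether a 22mer looks like a piece of a tandem repeat can be checked by finding its smallest period, i.e. the smallest shift p at which the kmer matches itself. A minimal sketch; `min_period` is a hypothetical helper, and a 22mer is only weak evidence for long periods because few positions overlap:

```shell
# min_period SEQ
# Prints the smallest p such that SEQ[i] == SEQ[i+p] for every valid i.
min_period() {
  awk -v s="$1" 'BEGIN {
    n = length(s)
    for (p = 1; p <= n; p++) {
      ok = 1
      for (i = 1; i + p <= n; i++)
        if (substr(s, i, 1) != substr(s, i + p, 1)) { ok = 0; break }
      if (ok) { print p; exit }
    }
  }'
}

min_period AATCATACAATCACAATCATAC   # kmer 1
min_period AGACAGAGACAGAGACAGAGAC   # kmer 10
min_period AACCTAACCTAACCTAACCTAA   # kmer 15
```

These print 14, 6 and 5: kmer 1 is consistent with the 14mer tandem repeat noted in the table, and kmer 15 with a 5 bp unit (AACCT).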

Location

 /fs/szattic-asmg5/Bees/Megachile_rotundata/error_correction/large_libs/s_?_?_?kb.sequence.cor.all.txt
 /fs/szattic-asmg5/Bees/Megachile_rotundata/error_free/s_?_?_sequence.cor.txt
 /fs/szattic-asmg5/Bees/Megachile_rotundata/frg/  # frg files to assemble

Assemblies

  • CA Version: 6.1 (09/01/2010) /fs/szdevel/dpuiu/SourceForge/wgs-6.1/Linux-amd64/bin/runCA
  • SOAP version 1.04: /nfshomes/dpuiu/szdevel/SOAPdenovo_Release1.04/

CA noOBT ; partial s_2_3kb, s_2_8kb, s_3

  • Data : 3 libs : ~ 16X cvg

Gatekeeper

 LibraryName           numActiveFRG    numDeletedFRG  numMatedFRG  readLength  clearLength  
 GLOBAL                72,995,448      0              70632632     8307194830  8278360381   
 LegacyUnmatedReads    0               0              0            0           0            
 s_2_3kb               9,166,343       0              8736228      942501164   914798596    
 s_2_8kb               210,266         0              199620       21669112    20742291     
 s_3                   63,618,839      0              61696784     7343024554  7342819494
 UID             IID     mateUID         mateIID libUID  libIID  isDel   isNonRandom     Orient  Length  clrBeginLATEST  clrEndLATEST
 110000000001    1       120000000001    2       s_2_3kb 1       0       0               I       75      0               75
 120000000001    2       110000000001    1       s_2_3kb 1       0       0               I       123     0               123
 110000000003    3       120000000003    4       s_2_3kb 1       0       0               I       90      0               90
 120000000003    4       110000000003    3       s_2_3kb 1       0       0               I       123     40              123
 ...
 110009166343    9166343 0               0       s_2_3kb 1       0       0               U       76      11              76

 210009166344    9166344 220009166344    9166345 s_2_8kb 2       0       0               I       123     21              123
 ...
 210009376609    9376609 0               0       s_2_8kb 2       0       0               U       88      0               88

 320009376610    9376610 0               0       s_3     3       0       0               U       72      0               72
 ...
 310072995448    72995448 0              0       s_3     3       0       0               U       68      0               68

Stats

 .                  elem       min    q1     q2     q3     max        mean       n50        sum             #repeats    comments          
 scf                20,827     122    3228   6374   13700  202495*    11508      20462*     239696810                   SOAPdenovo: max=1102803 , N50=26876  
 ctg                37,494     65     2185   3998   7706   191323*    6380       10151*     239226293       206         SOAPdenovo: max=121554  , N50=3138
 deg                1,136,469  64     123    143    184    5031       160        164        181954480       807132
 utg                1,437,146  64     123    143    195    67048      308        870        443759899      

 readsTotal         72,995,448
 readsInContigs     27,837,956
 readsInDegenerates  9,627,122
 singletons         34,881,692 (47%)                                                             

 readsWithOuttieMate 3,028,956(4.15%) ???
 Placed reads
 .          badLong  badOuttie   badSame  bothDegen  bothSurrogate  diffScaffold  good      notMated  oneChaff  oneDegen  oneSurrogate  
 s_2_3kb    534      2,998,286   458      1614846    9892           21872         27308     267328    979980    760044    65268         
 s_2_8kb    4        26,864      10       38636      114            294           178       5044      35465     7848      1022          
 s_3        11072    3,806       1104     2369982    61236          53370         23058022  1208112   3967689   371538    87260
 Chaff reads
 .          bothChaff    notMated   oneChaff  
 s_2_3kb    1,277,760    162,787    979,980    
 s_2_8kb    53,588       5,602      35,465     
 s_3        27,684,878   713,943    3,967,689
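One sanity check on these tables: the unmated reads per library (numActiveFRG minus numMatedFRG from Gatekeeper) should equal the notMated counts of the placed and chaff tables added together. A small sketch, assuming that reading of the columns; `unmated_balance` is a made-up helper name:

```shell
# unmated_balance NUM_ACTIVE NUM_MATED PLACED_NOTMATED CHAFF_NOTMATED
# Prints "ok N" when the unmated reads are fully accounted for.
unmated_balance() {
  expected=$(( $1 - $2 ))
  observed=$(( $3 + $4 ))
  if [ "$expected" -eq "$observed" ]; then
    echo "ok $expected"
  else
    echo "mismatch $expected $observed"
  fi
}

unmated_balance 9166343 8736228 267328 162787     # s_2_3kb
unmated_balance 210266 199620 5044 5602           # s_2_8kb
unmated_balance 63618839 61696784 1208112 713943  # s_3
```

All three libraries balance (430115, 10646 and 1922055 unmated reads), so the two tables together cover every unmated read.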

Issues

  • reads are renamed : HWI-EAS385_0062:2:1:1036:15608#GCCAAT/1 => UID:110000000001 => IID:1
  • reads < 64bp are deleted from the beginning : ID mapping ???
  • lib s_2 orientation ??? Too many badOuttie mates (2,998,286 reads for s_2_3kb)

Location

 ginkgo:/scratch1/dpuiu/Megachile_rotundata/Assembly/wgs-noOBT-partial.1/

CA noOBT ; partial : s_2_3kb, s_2_8kb, s_3 ; no repeats

  • Reads that contain at least one of the 15 most frequent 22mers are deleted from the input set

Gatekeeper

 LibraryName           numActiveFRG  numDeletedFRG  numMatedFRG  readLength  clearLength  
 GLOBAL                33811292      0              32385550     3781368484  3772837304   
 LegacyUnmatedReads    0             0              0            0           0            
 s_2_3kb               5103461       0              4928188      518415011   510187775    
 s_2_8kb               53215         0              51436        5395700     5289866      
 s_3                   28654616      0              27405926     3257557773  3257359663   

Overlapper

  • Dirty 3' ends for the s_2_* reads
               totalOvl  avgOvl
 s_2_3kb  5'  4955294   9   
 s_2_3kb  3'  4955294   7   
 s_2_8kb  5'  51050     10  
 s_2_8kb  3'  51050     7   
 s_3      5'  27721948  9   
 s_3      3'  27721948  9   

Stats

  • Larger max scf & ctg than in the "CA noOBT partial" run, which assembled the repeats as well !!!
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  21041      65     3174   6334   13482  337719*    11376      20153*     239366537      
 ctg                  37668      65     2181   3963   7687   191376*    6343       10083*     238928665      
 deg                  380596     64     107    126    170    4688       163        160        62151395       
 utg                  652051     64     115    133    225    67870      491        2469       320381694    

 readsTotal           33,811,292
 readsInContigs       27,753,101 (82.08%)
 readsInDegenerates   4,004,853  (11.84%)
 singletons           1,276,811  (3.78%)
 Placed reads
 .    badLong  badOuttie  badSame  bothDegen  bothSurrogate  diffScaffold  good      notMated  oneChaff  oneDegen  oneSurrogate  
 1    582      2992742    410      773006     13486          20146         26870     159957    107838    753916    78412         
 2    .        1228       2        9486       84             124           62        1550      12940     7750      860
 3    11354    3884       1074     2266364    90824          56066         23001316  1165421   346169    416116    156218        
 Chaff reads
 .    bothChaff  notMated  oneChaff  
 1    52942      15316     107838    
 2    5932       229       12940     
 3    652176     83269     346169
 ~/bin/asm2mdi.pl < asm.asm
 s_2_3kb 16      87  ???
 s_2_8kb 8000    800
 s_3     337     27

Location

 ginkgo:/scratch1/dpuiu/Megachile_rotundata/Assembly/wgs-noOBT-partial.1.noRepeats/

CA noOBT ; partial s_3 , s_4 ; no repeats

  • Reads that contain at least one of the 15 most frequent 22mers are deleted from the input set

Gatekeeper

 LibraryName           numActiveFRG  numDeletedFRG  numMatedFRG  readLength  clearLength  
 GLOBAL                56839966      0              54032986     6425669702  6425397526   
 LegacyUnmatedReads    0             0              0            0           0            
 s_3                   28654616      0              27405926     3257557773  3257359663   
 s_4                   28185350      0              26627060     3168111929  3168037863

Stats

 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 scf                  12908      148    4003   8811   22202  511831**   19416      42207**    250623581      
 ctg                  23116      64     2888   5959   13088  255301**   10828      20109**    250302752      
 deg                  274961     64     124    139    182    4652       172        164        47499725       
 utg                  574873     64     123    132    202    128547     563        4478       324059992

CA noOBT

  • Data : 7 libs : ~ 74X cvg

Gatekeeper

 LibraryName           numActiveFRG  numDelFRG  numMatedFRG  readLength   clearLength    #repeats                                                                                                                                                              
 GLOBAL                326,236,387   0          315518526    37451489553  37418130441  
 LegacyUnmatedReads    0             0          0            0            0            
 s_2_3kb               9107424       0          9107424      942165284    910444046      #
 s_2_8kb               209336        0          209336       21814418     20787384       #
 s_3                   63618839      0          61696784     7343024554   7342819494     #
 s_4                   63544688      0          61255960     7291557748   7291478152     #
 s_5                   63370860      0          61084368     7271218123   7271051639     #
 s_6                   63780887      0          61685156     7359094156   7359012512     #
 s_7                   62604353      0          60479498     7222615270   7222537214     #

Meryl

 meryl -Dh -s 0-mercounts/asm-C-ms22-cm0 
 Found 30570218845 mers.
 Found 271464470 distinct mers.
 Found 11164787 unique mers.
 Largest mercount is 87984949; 1896 mers are too big for histogram.
 1       11164787        0.0411  0.0004
 2       9376915         0.0757  0.0010
 3       3714582         0.0894  0.0013
 ...
 54      5344148         0.6573  0.1788
 ... 
 fasta2tab.pl 0-mercounts/asm.nmers.ovl.fasta | sort -n -r | head -5
 87,908,217        AATCATACAATCACAATCATAC
 84,450,288        CAATCATACAATCACAATCATA
 ...
 74,975,282        AATAATATGAGTTAGATTGATA
 egrep -c 'AATCATACAATCACAATCATAC|GTATGATTGTGATTGTATGATT' *fastq *txt > egrep.count
 mulberry:/scratch2/dpuiu/Megachile_rotundata/Data/error_free/egrep.count
 meryl -Dh -s 0-mercounts/asm-C-ms15-cm0 | head
 Found 32850820919 mers.
 Found 142500876 distinct mers.
 Found 2381895 unique mers.
 Largest mercount is 125816941; 2023 mers are too big for histogram.
 1       2381895 0.0167  0.0001
 2       2325770 0.0330  0.0002
 3       708786  0.0380  0.0003
 ...
 54      1851586 0.4894  0.0671
...
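The histogram columns appear to be: mer count, number of distinct mers with that count, cumulative fraction of distinct mers, and cumulative fraction of all mers. A sketch that recomputes the last two columns from the first two, under that reading of the output; `meryl_cumulative` is a hypothetical helper, not part of meryl:

```shell
# meryl_cumulative DISTINCT_MERS TOTAL_MERS < "count numMers" lines
# Re-derives the two cumulative-fraction columns of the histogram.
meryl_cumulative() {
  awk -v distinct="$1" -v total="$2" '{
    d += $2        # distinct mers seen so far
    t += $1 * $2   # mer occurrences seen so far
    printf "%d\t%d\t%.4f\t%.4f\n", $1, $2, d / distinct, t / total
  }'
}

printf '1 11164787\n2 9376915\n3 3714582\n' |
  meryl_cumulative 271464470 30570218845
```

With the 22mer totals this reproduces the first three histogram rows (0.0411/0.0004, 0.0757/0.0010, 0.0894/0.0013).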

22mer.png 22mer.cumulative.png

Overlap

  • job count :
 cat 1-overlapper/ovlopts.pl | grep ^\"h | wc -l
 924
  • Failures: 709 jobs failed; runCA 6.1 could not restart overlap properly !!!
 cat 1-overlapper/overlap*out | grep "^Could not" | sort -u
 Could not malloc memory (1305184948 bytes)
  • As of 09/09/2010 at 3pm : 623 out of 924 jobs completed

Location

 mulberry:/scratch2/dpuiu/Megachile_rotundata/Assembly/wgs-noOBT

SOAPdenovo (Tanja)

 cat *.ContigIndex | grep -v ^E | grep -v ^i | count.pl -i 1 | getSummary.pl -j 1 -t "contigs"
 cat *.ContigIndex | grep -v ^E | grep -v ^i | count.pl -i 1 | getSummary.pl -j 1 -min 100 -t "contigs(>100bp)"
 grep "^>" *.scaf | getSummary.pl -i 2 -t scaf
  • Stats
 .                    elem       min    q1     q2     q3     max        mean       n50        sum            
 contigs              9742349    31     32     33     37     114832     60         44         585430821      
 contigs(>100bp)      177327     100    131    261    1398   114832     1333       3897       236496823     # N50 for Bee was 7K
 scaf                 7863       102    903    3272   17692  2338728    37825      240706     297423517     # N50 for Bee was 1.17M

  • Location
 /fs/szattic-asmg5/Bees/Megachile_rotundata/Assembly/assembly5kbForAll

SOAPdenovo (Daniela)

Stats

 cat asm.K31.contig | grep "^>" | awk '{print $3}' | sort -n | uniq -c | awk '{print $2,$1}'  > asm.K31.contigLen.count
 .                    elem       min    q1     q2     q3     max          mean       n50        sum            
 scaff                25,119     351    1896   4444   10914  1,102,803    11041      26876      277,338,897
 contigs(all)         6,917,796  31     32     34     40     121,554      70         73         487,401,812      
 contigs(>100bp)      210,666    100    124    222    1174   121,554*     1108       3138*      233,563,401
 reads                340,437,486
 readsOnContigs       171,212,613 
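The contigLen.count file holds one "length count" pair per line, which is enough to recompute an N50. A minimal sketch using the common definition (the length at which the running sum, taken from the longest contig down, first reaches half of the total); `n50_from_counts` is a made-up helper and getSummary.pl may define n50 slightly differently:

```shell
# n50_from_counts FILE
# FILE has "length count" per line; repeated lengths are allowed.
n50_from_counts() {
  sort -k1,1nr "$1" | awk '
    { len[NR] = $1; cnt[NR] = $2; total += $1 * $2 }
    END {
      for (i = 1; i <= NR; i++) {
        run += len[i] * cnt[i]
        if (run * 2 >= total) { print len[i]; exit }
      }
    }'
}
```

Usage would be `n50_from_counts asm.K31.contigLen.count`; filtering the file with `awk '$1 >= 100'` first gives the figure for contigs of 100 bp and longer.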

Alignments

  • Align the 3kb & 8kb libs to the scaffolds
 soap2-index asm.K31.scafSeq
 mkdir soap2-index
 mv asm.K31.scafSeq.index.* soap2-index/
  
 soap2 -D soap2-index/asm.K31.scafSeq.index -a s_2_1_8kb_sequence.txt -b s_2_2_8kb_sequence.txt -l 32 -p 8 -v 2 -m 6000 -x 10000 -o s_2_8kb.mated.soap2 -2 s_2_8kb.single.soap2 -R
 Total Pairs: 198377 PE
 Paired:      1049* ( 0.53%) PE
 Singled:     31250 ( 7.88%) SE
 soap2 -D soap2-index/asm.K31.scafSeq.index -a s_2_1_3kb_sequence.txt -b s_2_2_3kb_sequence.txt -l 32 -p 16 -v 2 -m 2000 -x 4000 -o s_2_3kb.mated.soap2 -2 s_2_3kb.single.soap2 -R >& s_2_3kb.soap2.log
 Total Pairs: 21563283 PE
 Paired:      772557* ( 3.58%) PE
 Singled:     8449321 (19.59%) SE
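The two percentages in each soap2 summary use different denominators: Paired is relative to the number of pairs, Singled to the number of reads (twice the pairs). A short sketch that recomputes both from the logged counts; `soap2_rates` is a hypothetical helper:

```shell
# soap2_rates TOTAL_PAIRS PAIRED SINGLED
# Paired is a fraction of pairs; Singled is a fraction of reads (2 x pairs).
soap2_rates() {
  awk -v tp="$1" -v pe="$2" -v se="$3" 'BEGIN {
    printf "paired %.2f%% singled %.2f%%\n", 100 * pe / tp, 100 * se / (2 * tp)
  }'
}

soap2_rates 198377 1049 31250         # s_2_8kb
soap2_rates 21563283 772557 8449321   # s_2_3kb
```

This prints 0.53% / 7.88% for the 8kb library and 3.58% / 19.59% for the 3kb library, matching the logs above.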

Location

 mulberry:/scratch2/dpuiu/Megachile_rotundata/Assembly/SOAPdenovo-redo

SOAPdenovo ; partial : s_[34567] ; no repeats

  • Results are similar to the full SOAPdenovo run: the wrong insert sizes and the repeats do not affect the assembly much

Stats

 .                 elem       min    q1     q2     q3     max         mean       n50        sum
 scf               24602      333    1724   4380   11049  1,103,462   11135      27887      273963709
 contigs(all)      2515516    31     33     36     52     148,198     131        1880       330512911
 contigs(100bp+)   184395     100    127    235    1316   148,198     1263       3730       232932308