Pseudomonas aeruginosa: Difference between revisions

From Cbcb
Jump to navigation Jump to search
No edit summary
 
(485 intermediate revisions by the same user not shown)
Line 2: Line 2:


== Data ==
== Data ==
=== CBCB ===
<pre style="background:yellow">
Data: 33 bp Solexa reads from 2 libraries
  lib    #reads      #reads(1+N's)  #reads(2+N's)
  s_1    4,105,993    37,933          18,771
  s_7    4,521,907    69,669          59,589
  total  8,627,900
  ~ 43.55X cvg
</pre>
Files:
  s_1_sequence.txt  # contains seq & qual; seq names: HWI% ; all reads are 33 bp
  s_1_0001_prb.txt  # 27679 lines of 4*36 values in the -40..40 range 
  s_1_tag.txt      # unique seq count in s_1_sequence.txt
 
  s_7_sequence.txt  # contains seq & qual; seq names HWU%  ; all reads are 33 bp
  s_7_0300_prb.txt
  s_7_tag.txt   
 
  s_1 reads have significantly fewer N's than s_7 reads !!!
  s_1 reads align better to ref than s_2 reads !!!
 
  Conflicts with avg qualities: 28.26 for s_1, 31.05 for s_2 !!!
 
  Base qualities:
  .      #elem          #elem0  #elem<0 min    mean                    max    sum
  s_1    135497769      1863585 3231899 -5      28.2624918053079        40      3829504586
  s_7    149222931      1357384 3113806 -5      31.0508663175903        40      4633501282
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/


=== NCBI ===
=== NCBI ===


Related strains
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=Pseudomonas Pseudomonas genome projects]
  [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=Pseudomonas%20aeruginosa%20PAO1 Pseudomonas aeruginosa PAO1] [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=AE004091 AE004091] complete : 1 chromosome, 6,264,404 bp, 66.56 %GC
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=Pseudomonas%20aeruginosa Pseudomonas aeruginosa genome projects]
   [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=search&term=Pseudomonas%20aeruginosa%20PA14 Pseudomonas aeruginosa PA14] [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CP000438 CP000438] complete : 1 chromosome, 6,537,648 bp, 66.29 %GC
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Link&LinkName=genomeprj_sra&from_uid=28809 PAb1 SRA (uploaded)]
 
'''Same Strains'''
 
Ranging from  6.2M to  6.9M
  PA14        CP000438      6,537,648 bp, 66.29 %GC : 5,892 protein coding genes  most similar to PAb1       
  PAO1        AE004091      6,264,404 bp, 66.56 %GC : 5,568                genes  rearrangement vs PA14       
  PACS2      AAQW01000001.1 6,492,423 bp, 66.33% GC : 5,676                genes  rearrangement vs PA14
  PA2192      .              6,905,121 bp, 65.99% GC : 6,157                genes  rearrangement vs PA14
  PAC3719    .              6,222,097 bp, 66.30% GC : 5,650                genes  no rearrangement vs PA14
  PA7        CP000744      6,588,339 bp            : 6,286                genes  no rearrangement vs PA14
  PALESB58    FM209186      6,601,757 bp  66% GC    : 5,925                                          (new)
 
 
All vs all nucmer alignments:
  #ref-qry                count  sum    avg%id  comment
  PA14-PAO1              404    6296632 98.66  ~510,582 bp (filter-q) ; ~ 921,919bp (filter-r) in PA14 not present in PAO1
  PA14-PACS2              462    6303773 98.56
  PA14-PA2192            532    6248297 98.47
  PA14-PAC3719            604    6129432 98.44
  PA14-PA7                822    5899920 93.72
 
Incomplete strains details(Broad):
  Contig length stats:
  desc  #contigs  min    max    mean    stdev  sum
  C3719 124        2079    242903  49572  53770  6146998
  2192  82        2087    398738  83246  88681  682625
 
5.7Kbp repeat: contains 16S rRNA gene; identical in all the strains
  Coordinates:
 
  '''PA14:'''
  732540  738302  +
  4951956  4957714 - 
  5535975  5541728 -
  6312434  6318199 -
 
  '''PAO1:'''
  721550  727325  +
  4788516  4794273 -
  5264042  5269801 -
  6039515  6045289 -
 
  '''PA7:'''
  806558  812299  +
  4982874  4988600 -
  5566182  5571924 -
  6353418  6359148 -
 
  [[Media:All-all.png| Pa's vs Pa's (nucmer) ]]
 
'''Same Genus, same Group, different Species: Pseudomonas mendocina ymp (Pm)'''
 
  Name          Length    %GC    Info
  NC_009439.1    5072807  64.68  4594 genes
 
  [[Media:PA14-Pm.png|PA14 vs Pm (nucmer)]]
 
                                  #elem  min    max    mean    median  n50    sum      avg%id
  Alignments(1con-1con  : nucmer)  1281    75      19618  1256    1011    1548    1609296  85.26
  Alignments(genes-genes: nucmer)  ...                                1326 PA14 genes align to 1306 Pm genes (85.36% ID)                                                       
  Alignments(genes-genes: promer)  ...                                2924 PA14 genes align to 2957 Pm genes (68% ID,75% SYM)                                                           
 
'''Same Genus, different Species: Pseudomonas syringae pv. tomato str. DC3000 (Ps) : chomosome+2plasmids'''
 
  Name          Length    %GC    Info
  NC_004578.1    6,397,126 58.40  chromosome      5478 genes
  NC_004633.1    73,661    55.15  plasmid pDC3000A
  NC_004632.1    67,473    56.17  plasmid pDC3000B
  total          6,538,260       
 
   [[Media:PA14-Ps.png|PA14 vs Ps (nucmer)]]
 
                                    #elem  min    max    mean    median  n50    sum      avg%id
  Alignments(1con-1con  : nucmer)  499    75      9435    1296    1161    1619    646614  84.79
  Alignments(genes-genes: nucmer)  438    161    4446    1090    996    1281    477229  83.38      425  PA14 genes align to 418  Ps genes (83.38% ID)
  Alignments(genes-genes: promer)  6193    114    13026  1159    1011    1359    7180527  67.36      2556 PA14 genes align to 2644 Ps genes (67% ID, 75% SYM)
 
== Publications ==
* [http://www.pnas.org/cgi/content/full/105/8/3100 Dynamics of Pseudomonas aeruginosa genome evolution] PNAS Feb 2008
  * 5,021 genes that are conserved across all five genomes analyzed, with at least 70% sequence identity (core genome)
  * Of these, >90% of the genes share at least 98% sequence identity
 
  * The genome for each strain carries a relatively modest number of unique sequences, 10% or less of the number in the core genome,
    ranging from 79 in C3719, to 507 in PA2192.
  * Less than 100 genes are unique to pairs of genomes, whereas the number decreases significantly when comparison is carried out among
    sets of three or four genomes.
 
  * The overall architecture of the pangenome of P. aeruginosa an be represented as a circular chromosome with dispersed polymorphic
    strain-specific segments, flanked by conserved genes referred to as anchors
  * These strain-specific segments are regions of genomic plasticity (RGP) and include any region of at least four contiguous ORFs
    that are missing in at least one of the genomes analyzed.
  * The number of RGPs in individual genomes varies from 27 to 37. A total of 52 RGPs were identified in the five genomes analyzed
 
* [http://biorg.cis.fiu.edu/genomics/PA/supplemental/SI-Figure_05.pdf Phylogenetic tree]
* [http://biorg.cis.fiu.edu/genomics/PA/supplemental/SI-Figure_06.pdf Common genes]
* [http://biorg.cis.fiu.edu/genomics/PA/supplemental/Figure_2.pdf Pangenome with RGPs]
* [http://www.pseudomonas.com/ Pseudomonas database]
 
* [http://www.ploscompbiol.org/doi/pcbi.1000186 Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads] Steven L. Salzberg1*, Daniel D. Sommer1, Daniela Puiu1, Vincent T. Lee2, PLoS Sept 2008
 
== Read trimming ==


=== CBCB ===
=== No trimming ===
 
If no trimming is done and the default AMOScmp pipeline is used (ALIGNWIGGLE=15), the contigs end up containing reference duplications
Make-consensus will also take a very long time to run
 
Duplication causes:
 
# Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate
# reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3)
# The Pseudomonas_aeruginosa reference contains multiple 2 copy 5-10 bp kmers, adjacent within a few dozen bp of each other;  reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNP's, the 2nd (shorter) alignment to the greedyly built consensus is chosen over the correct one and a duplication is introduced
 
=== Quality trimming ===
 
Art's script: qual-trim.awk
 
Min_Qual=15
  .              #elem  #elem0  #elem<0 min    median  max    sum            mean    stdev  n50
  qualTrim        8627900 1266    0      0      23      33      193999902      22.49  7.43    26
 
!!! Avg clr drops from 33 to 22 bp
!!! Duplications still exist though at a lower rate
 
=== Alignment based trimming (nucmer) ===
 
PA14 is the most similar strain: more PAb1 reads aligned to PA14 than to PAO1 or PA7
 
  grep -c "^>" PA14-Pa.l-17_c-17.delta PAO1-Pa.l-17_c-17.delta PA7-Pa.l-17_c-17.delta
  PA14-Pa.l-17_c-17.delta:7261629
  PAO1-Pa.l-17_c-17.delta:7036243
  PA7-Pa.l-17_c-17.delta:5393144
 
== Read mapping  ==
 
=== PA14 reference ===
 
Commands:
blat -noHead -t=dna  -q=dna  -tileSize=10 -stepSize=3 Pa.1con    Pa.seq    Pa.blat
nucmer.pl    -l 20,16,12 -c 20,16,14                  Pa.1con    Pa.seq -p Pa.nucmer
rmap        -m 3 -w 33                            -c Pa.1con    Pa.seq -o Pa.rmap
soap        -v 5                                  -d Pa.1con -a Pa.seq -o Pa.soap
#soap        -v 2 -s 12 -g 3                      -d Pa.1con -a Pa.seq -o Pa.soap
 
Results:
          runTime                                hits(all)      hits(>=30bp)
  blat:  1038.28u 3.696s 17:49.00 97.4%          7004925        6353293
  nucmer: .                                      7553734        6832230        # 3 itterations were run for nucmer -l 20,16,12 -c 20,16,14; delta-filter -l 20                       
  rmap:  7318.77u 5.770s 2:02:11.29 99.9%        6958043        6958043
  soap:  685.208u 3.971s 12:45.68 90.0%          7310025        7310025        # fastest, most hits
  all:    .                                      7638420        7201727
 
Soap seems the "best" read mapping program at current moment.
 
  #number of hits/read
  cat Pa.soap.5 | awk '{print $4}' | count.pl
  #      total
  1      7185054  # most reads align in only one place
  2      83418
  3      8706
  4      32277
  5      291
  ...
  Total  7310025
 
  #number of misamtches
  awk '{print $10}' Pa.soap.5 | count.pl
  #      total
  0      3708583  # that many reads align exactly to the reference
  1      2025346
  2      908998
  3      395002
  4      181326
  5      90770
 
  # mismatch % (Pa's,PA14,Staphylococcus aureus,Strep suis)
  [[Media:Soap.mismatch.percent.png]]
  [[Media:Soap.mismatch_total.percent.png]] Pa's & Ss profiles are very similar; Sa has the best mappings
 
=== Pm reference ===
 
  Reads aligned: 1182734
 
  cat Pm.soap | awk '{print $10}' | count.pl | sort -n
  #number of misamatches
  0      48396
  1      117192
  2      207369
  3      267275
  4      279223
  5      263279
  Total  1182734
 
=== Ps reference ===
 
  Reads aligned: 632459
 
  cat Pm.soap | awk '{print $10}' | count.pl | sort -n
  #number of misamatches
  0      16442
  1      36440
  2      81147
  3      136472
  4      176779
  5      185179
 
== Sequence Assemblies ==
 
=== Comparative ===
 
==== PA14 ref assembly ====
 
  Location: 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2
  Command:  AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2
 
#a align all reads to the reference using nucmer. I initially used  minmatch=17, mincluster=17 (nucmer -c 17 -l 17)
    8.62M reads, 4.10M HWI 5.32M HWU
    8.30M (...) total reads align to the reference (nucmer -c 14 -l 14)
    7.59M (...) total reads align to the reference (nucmer -c 16 -l 16)
    7.25M (...) total reads align to the reference (nucmer -c 16 -l 16; delta-filter -l 20)
    7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (nucmer -c 17 -l 17)
    6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (nucmer -c 20 -l 20)
    5.17M (...) total reads align to the reference (nucmer -c 28 -l 14)
    5.32M (73.27%) reads aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads)  (nucmer -c 33 -l 17)
    3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)
 
    7.00M align using blat -t=dna -q=dna -stepSize=5 -tileSize=10 (20bp alignments, max 3 mismatches)
    6.35M align using blat -t=dna -q=dna -stepSize=5 -tileSize=10 (30bp alignments filtered from above results)
 
    7.11M align using blat -t=dna -q=dna -stepSize=3 -tileSize=10 (20bp alignments, 3 mismatches)
    500K the unaligned ones were aligned by blat -t=dnax -q=dnax -stepSize=2 -tileSize=4 (20bp alignments, 6 mismatches; only 15% of the aligned reads are unique)
    => 7.61M aligned by blat/blatx
 
    6.95M aligned by rmap (-m 3)
 
    7.30M align by soap (-v 5 -s 12 -g 3) to PA14 chromosome
    6.64M align by soap (-v 2 -s 12 -g 3) to PA14 chromosome
 
    7.24M align by soap (-v 2 -s 12 -g 3 -c 42) to PA14 chromosome (iterative trimming)
    6.72M align by soap (-v 1 -s 12 -g 3 -c 41) to PA14 chromosome (iterative trimming)
 
    6.36M align by soap (-v 5 -s 12 -g 3) to PA14 genes
 
    7.43M align by maq
 
# b align all unaligned reads to the reference using nucmer. minmatch=14, mincluster=14 (-c 14 -l 14)
# Combine 1a & 1b
# trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions
# assemble them using the AMOScmp pipeline(ALIGN_WIGGLE=2 instead of 15)
 
Contigs stats:
  desc    #elem  min    max    mean            stdev          sum  singletons
  all    2053    17      170485  3011.84        11917.53        6183320  1127399
  200    428    203    170485  14262.09        22852.74        6104175
  10K    157    10240  170485  35468.89        26531.33        5568616
 
  '''Data accuracy'''
  Get all assembled bases with coverage>=20
  count_bases=5926977
  sum_bases=235619001
  sum_errors=2455670
 
  sum_errors/sum_bases=2455670/235619001=0.01042=1.042% error
 
SNP's in the 157 >10K contigs: T<->C & A<-G> the most common
 
  PA14 PAb1    count
    T  C      6153
    C  T      6100
    G  A      6075
    A  G      6021
    G  C      1357
    C  G      1309
    G  T      859
    C  A      846
    T  G      840
    A  C      836
    A  T      338
    T  A      308
    G  .      241
    C  .      196
    .  C      182
    .  G      152
    A  .      101
    T  .      88
    .  T      74
    .  A      70
  Total        32146
 
Problems:
# Some of the short ctgs are not "real"
# Some misassemblies due to loose assembly params (MINOVL=5 MAJORITY=50 ):  Example: pos 6700
 
==== PA14 ref assembly (2) * ====
 
run 3 nucmer iterations
 
Parameters:
  MINCLUSTER=20,16,14  # nucmer -c
  MINMATCH=  20,16,12  # nucmer -l
  MINLEN=20      # delta-filter -l
  MINOVL=10      # casm-layout -o ; previously 5, allowed for missasemblies
  MAJORITY=70    # casm-layout -m ; previously 50 to reduce the number of negative gaps
 
Read CLR's:
 
  33      5510573
  32      471289
  31      456959
  30      412738
  29      136423
  28      120148
  27      104213
  26      60338
  25      54828
  24      44854
  23      38574 # smallest value !!!
  22      43302
  21      49923
  20      49572
  Total  7553734
 
<pre style="background:yellow">
Contig stats:
  desc    #elem  min    max    mean    median  n50    sum    singl  breaks
  all      1843    20      127984  3350    63      30995  6174765 1107200 9
  200bp+  542    200    127984  11245  3263    31106  6094904
</pre>
 
Alignment of contigs >=33bp to PA14 reference; qry_hits
  count  totalLen  %id
  940    6130875  99.33      !!! 99.33%id to PA14 vs Pab1
 
Alignments of PA14 genes to the contigs
  show-coords ctg-genes.filter-1.delta | awk '{print $20}' | count.pl
  #              total
  [CONTAINS]      5318
  [END]          198
  [BEGIN]        191
  [CONTAINED]    190
  [PARTIAL]      88
  [IDENTITY]      3
 
Location:
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0424_AMOSCmp-PA14
 
==== PA14 ref assembly (used blat instead of nucmer for read alignment) ====
 
run 2 blat iteration: 
  blat -t=dna  -q=dna  -stepSize=3 -tileSize=10 Pa.1con Pa.seq
  blat -t=dnax -q=dnax -stepSize=2 -tileSize=4  Pa.1con Pa.singletons.seq
 
Contig stats:
          #ctgs  min    max    mean    median  n50    sum    singletons
  all    1372    20      176323  4517    64      47563  6197771 1049302
  200bp+  370    201    176323  16584  3929    47563  6136095
  10Kbp+  146    10097  176323  39176  30457  51554  5719752
 
Location:
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0613_AMOSCmp-PA14-blat
 
  !!! Best AMOScmp assembly so far :
  * fewer contig than when using nucmer;
  * largest contig
  * largest contig sum
 
==== PA14 ref assembly (used soap instead of nucmer for read alignment) ====
 
* soap parameters: -v 5 -g 3 -s 12
 
  Contig stats:
          #elem  min    max    mean    median  n50    sum
  all    1228    33      132817  5022    106    33779  6167051  !!! smaller assembly  6.16M vs 6.19M with blat
  200bp+  478    200    132817  12789  3824    33988  6113109
  10Kbp+  303    1000    132817  19927  12470  34129  6037852
 
Location: /fs/szasmg3/dpuiu/Pseudomonas_aeruginosa/Assembly/2008_0625_AMOSCmp-PA14-soap/
 
* soap parameters: -v 2 -g 3 -s 12 -c 42 :
* iterative trimming of 2 bp from read ends
 
  Contig stats:
          #elem  min    max    mean    median  n50    sum
  all      1932    33      79489  3190    153    16122  6162793
  200bp+  874    200    79489  6960    3195    16145  6082694
  10Kbp+  202    10048  79489  20745  17422  21640  4190588
 
----
 
  params                #ctgs  #breaks
  v2.g3.s12.OVL16      2655  34
  v2.g3.s12.OVL20      3446  27
  v2.s12.g3.c42.OVL10  1932  64
  v5.g3.s12.OVL10      1228  83
  v5.g3.s12.OVL3        953    89
 
----
 
v5.g3.s12.OVL10
 
  all
          #elem  min    max    mean    median  n50    sum      breaks
  PA14    1228    33      132817  5022    106    33779  6167051  83
  PA2192  1784    33      127510  3477    191    16665  6203818
  PAC3719 1335    33      46247  4424    968    14667  5905499
  PACS2  1285    33      87356  4746    165    25837  6098413
  PAO1    969    33      103523  6210    293    25135  6017855
 
  200bp+
          #elem  min    max    mean    median  n50    sum
  PA14    478    200    132817  12789  3824    33988  6113109
  PA2192  880    200    127510  6974    3011    16816  6137484
  PAC3719 838    200    46247  7002    3789    14875  5868046
  PACS2  606    200    87356  9976    3842    25961  6045393
  PAO1    520    200    103523  11510  5601    25135  5985190
 
==== PAO1 ref assembly ====
 
Same method except that 1.b was not used
 
  Contigs stats:
 
  desc    #elem  min    max    mean            stdev  sum
  all    2797    17      75626  2161.19        5812.2  6044851
  200    865    200    75626  6893.96        8766.63 5963278
  10K    204    10016  75626  19016.22        10368  3879309
 
  Singletons: 1592525
 
Location:
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0124_AMOSCmp-PAO1-relaxed-17-nucmer-redo2
 
==== PAO1 ref assembly (2) ====
 
Parameters:
  MINCLUSTER=16  # nucmer -l
  MINMATCH=16    # nucmer -c
  MINLEN=20      # delta-filter -l
  MINOVL=10      # casm-layout -o ; previously 5, allowed for missasemblies
  MAJORITY=70    # casm-layout -m ; previously 50 to reduce the number of negative gaps
 
Contig stats:
 
  desc    #elem  min    max    mean    stdev  sum
  all    1961    20      46414  3065    5251    6010800
  200bp+  1166    203    46414  5108    6007    5955960
  1758777 singletons
 
Alignment of contigs >=33bp to PAO1 reference; qry_hits
  count  totalLen  %id
  1471    5984905  99.06      !!! 99.06%id PAO1 vs Pab1
 
Location:
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0424_AMOSCmp-PAO1
 
==== PA14 & PAO1 merge ====
 
Use minimus to merge all contigs in the PA14 & PAO1 reference assemblies
 
Filter contigs that contain at least 2 adjacent PA14 merged by PAO1
 
  desc      #elem  min    max    mean            stdev          sum
  all        1850    17      236472  3400.31        16863.79        6290586
  200        306    204    236472  20318.3        37147.91        6217401
  10K        113    10520  236472  52647.45        45602.76        5949162
 
  Singletons: 1066226
 
De novo assembly of singletons using SSAKE
Input: 1066226 singletons
 
  Ssake version 3.0 run with default parameters:
  -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
  -o  Minimum number of reads needed to call a base during an extension (default -o 2)
  -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.6)
 
  Contigs stats:
 
  desc    #elem  min    max    mean    stdev  sum
  all    19795  34      2866    67.78  102.73  1341825
  200    879    200    2866    416.72  308.92  366304
 
  200 bp contigs stats
 
  desc.          #elem  min    mean    max    sum
  contig_len      879    200    416.72  2866    366304
  contig_reads    879    52      414.21  3157    364091
  contig_cvg      879    7.91    31.58  146.52  27764.51
 
  Singletons:  702007
 
  Find contigs overlapping 12 bp or more and merge them using EMBOSS merger program:
 
  desc            #elem  min    max    mean    stdev  sum
  original        879    200    2866    416.72  308.92  366304
  new            670    200    4826    539.32  579.26  361350
 
==== Pa-b1 AMOScmp reference assembly ====
 
  Contig stats:
 
  desc                          #elem  min    max    mean      stdev          sum
  input(reference|final)        770    200    512638  8484.21  37210.58        6532844
  output(AMOScmp contigs)        936    21      260827  6988.98  22288.49        6541689
 
  Singletons: 1257963
  Snp's:      1941
 
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0215_AMOScmp-PAb1
 
==== Use ALL strains as reference ====
 
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0228_AMOSCmp-ALL
 
Steps:
  * Identify related strains (completed & draft assemblies): PA14, PAO1, PA7, PACS2, PA2192, PAC3719
  * Download & merge related strains into a multi-FASTA file
  * Align PAb1 reads to multi-FASTA file using nucmer (Ex: -l 17 -c 17)
  * Count the number of PAb1 reads aligned to each strain
 
    strain  #reads_aligned
    ======================
    PA14    7261629
    PA2192  7193960
    PACS2  7089455
    PAO1    7036243
    PAC3719 6928744
    PA7    5393144
    ======================
 
  * Sort the strains in descending order according to the number of reads aligned to each one
  * The strain with most hits is PA14 => closest relative
 
  * Assuming the order is PA14, PAO1, PA7, PACS2, PA2192, PAC3719 ...
  * Align strains to one other; identify unique regions in each genome (regions which are not present in the previously considered ones)
 
    Unique regions length summary:
 
    Table 1:
    reference                            #regions  min    max    mean    stdev  sum
    =======================================================================================
    PA14                                  1      6537648 6537648 6537648 0      6537648
 
    PAO1-PA14                            115    1      19263  2038.58 3843.93 234437
    PA7-PA14                              511    1      46020  1876.96 4872.51 959129
    PACS2-PA14                            147    1      90591  3054.19 8766.3  448967
    PA2192-PA14                          231    1      92553  3746.3  11475  865396
    PAC3719-PA14                          272    1      14058  1164.78 2089.85 316822
    =======================================================================================
    ALL                                  1277                                    9362399
 
    Table 2:
    reference                            #regions  min    max    mean    stdev  sum
    =======================================================================================
    PA14                                    1      6537648 6537648 6537648 0      6537648
 
    PAO1-PA14                              115    1      19263  2038.58 3843.93 234437
    PA7-PAO1-PA14                          487    1      46020  1830.82 4910.86 891614
    PACS2-PA7-PAO1-PA14                    71      2      79031  4112.94 10618.2 292019
    PA2192-PACS2-PA7-PAO1-PA14              181    1      92066  3439.9  9500.07 622622 #some N's maybe?
    PAC3719-PA2192-PACS2-PA7-PAO1-PA14      173    1      11425  998.16  1888.94 172682 #some N's maybe?
    =======================================================================================
    ALL                                    1028    2      6537648 8513.66 203929  8752049
 
    Table 3:
    reference                            #regions  min    max    mean    stdev  sum
    =======================================================================================
    PA14                                    1      6537648 6537648 6537648 0      6537648
   
    PA14-PAO1                              123    1      43690  4009.28 7903.16 493142
    PA14-PA7                                516    1      31464  1768.16 3534.88 912375
    PA14-PACS2                              146    2      43690  3259.15 6129.12 475836
    PA14-PA2192                            222    1      34854  2189.76 4338.6  486127
    PA14-PAC3719                            277    1      43690  2206.38 5214.55 611170
    =======================================================================================
 
  * Extract unique regions(Table 2) and add them to the most similar genome
 
  * Assemble using all these regions as reference => 10,201 contigs; 1,052,178 singletons
 
    reference                        #ctgs  min    max    mean    stdev          sum
    =======================================================================================
    PA14                            2056    17      57476  2992.27 7078.25        6152122
    PAO1                            877    17      8159    95.16  482.74          83456
    PA7                              4723    17      6822    41.7    128.87          196996
    PACS2                            914    17      3346    66.96  148.47          61207
    PA2192                          1528    17      49826  134.88  1734.19        206109
    PAC3719                          103    17      693    42.11  85.89          4338
    =======================================================================================
    ALL                              10201  17      57476  657.21  3457.1          6704228
 
  * Compare ALL assembly contigs to PA14 ref assembly :
 
    reference                        #ctgs  min    max    mean    stdev          sum        #singletons
    ========================================================================================================
    ALL                              10201  17      57476  657.21  3457.1          6704228    1052178
    ALL(200bp+)                      1020    200    57476  6235.17 9220.76        6359881
    ALL(PA14)                        2056    17      57476  2992.27 7078.25        6152122
    PA14                            1984    17      82196  3103.21 8274.53        6156788    1375004
    ========================================================================================================
    ALL-PA14                        8217    ......                                547440                # 0.5M extra bp
    ALL(PA14)-PA14                  72      ......                                -4666                  # 4666 bp which were originally mapped to PA14 are now mapped to other strains
 
  * Compare ALL assembly contigs to ssake 200bp+ contigs :
 
    ssake 200bp+ contigs:
      strain    #ctgs  min    max    mean    stdev  sum
      PAb1      879    200    2866    416.72  308.92  366304
 
    ALL vs ssake alignments:
    desc    #elem  min    max    mean    stdev  sum
    ALL    265    23      49826  912.9  4187.17 241919
    ssake  306    200    2358    379.85  257.04  116235
 
  ALL vs ssake CONTAINED alignments:
    desc    #elem  min    max    mean    stdev  sum
    ALL    164    23      49826  1264.71 5268.16 207413
    ssake  260    200    2358    368.09  259.02  95704
 
  ~300 out of 879 ssake contigs were assembled using this method.
  The ALL assembly contigs ssake contigs align to are longer than the ssake contigs!
 
  * Compare ALL assembly contigs to final de novo contigs :
 
        id                          len        #de novo ctgs it CONTAINS
  ======================================================================
    1  2361_PA2192_2249428_2341494 49826      52
    2  2360_PA2192_2249428_2341494 42258      49
    3  2364_PA2192_2346558_2361947 15388      17
    4  10190_PAO1_747565_761293    8159        7
    5  2429_PA2192_2405271_2412668 6103        7
    6  3259_PA2192_6656583_6693732 2594        6
    7  2367_PA2192_2374235_2379148 4914        5
    8  6917_PA7_4563267_4570077    6822        5
    9  2958_PA2192_4496915_4500004 3090        4
    10  2375_PA2192_2383706_2386485 2789        3
    11  2366_PA2192_2366515_2369415 2901        2
    12  6919_PA7_4579629_4582081    1085        2
    13  5364_PA7_3138605_3164438    1931        2
    14  8663_PACS2_3494057_3573088  449        2
    15  9322_PACS2_982458_983976    1519        2
    16  2368_PA2192_2379432_2380661 1230        2
    17  2716_PA2192_3268239_3292117 567        2
    18  9587_PAO1_2439014_2458239  3513        1
    ...
    49  8384_PAC3719_5517906_5518496 591        1
  ======================================================================
      Total                                    201
 
==== PA14 Sanger maq on all reads ====
 
  Contig stats:
  min_len    #contigs  min    max    mean    stdev          sum
  0          991        33      155551  6199.79  17445.05        6143996  !!!Smaller assembly size than with AMOScmp
  200        368        200    155551  16581.7  25475.92        6102067
  10K        149        10048  155551  38164.35 28513.51        5686489
 
1197385 singletons;
SNP's in the contigs: T<->C & A<-G> the most common
 
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0206_PA14_maq
 
==== PA14 gene ref assembly ====
 
* Use the 5892 PA14 genes as reference
* Use "soap -v 5 -s 12 -g 3" for alignments
=> 5671 scaffolds (that many genes are shared by the 2 strains)
 
contig stats:
          #elem  min    max    mean    median  n50    sum 
  all    6487    33      13070  854    752    1164    5541032
  200bp+  5666    200    13070  965    840    1174    5470492
 
=== Comparative-ABBA ===
 
==== ABBA ====
 
Steps
# Annotate draft assemby
# Find fragmented genes at ends of contigs.  Example shown is of putative secretion protein from PAO1
# Fill in missing amino acid sequence using reference protein
# Merge contigs together with output sequence from ABBA
 
Pipeline
# Input reference amino acid sequence and database of short reads
# Search read database (tblastn)‏
# Create layout using alignment coordinates
# Validate good read coverage across entire sequence
# Make consensus from a multiple alignment
 
PA14:PO1
* 5892 genes in PA14 vs 5568 in PAO1
* 5244 genes in PA14 IDENTITY (99.03%) to 5239 genes in PAO1
* 5379 genes in PA14 IDENTITY|CONTAIN|BEGIN|END (99.003%) to 5367 genes in PAO1
* 5389 genes in PA14 align to 5372 genes in PAO1 (98.69%id)
* 501 genes in PA14 & 196 genes in PAO1 don't align to opposite strand genes
 
!!! The genes that align are very conserved
 
==== PA14 gene ref assembly ====
 
  desc            #elem  min    max    mean    stdev  sum
  genes(nt)      5892    72      15639  992    752    5846175  # out of a total of 6537648bp genome => 89.42 coding
  genes(aa)      5892    23      5212    330    251    1942833
 
==== PAO1 gene ref assembly ====
 
  desc            #elem  min    max    mean    stdev  sum
  genes(nt)      5568    72      16884  1006    751    5601468  # out of a total of 6264404bp genome => 89.41 coding
  genes(aa)      5568    23      5627    334    250    1861588
 
=== De-novo ===
 
==== SSAKE ====
 
version 3.0 with default parameters
 
  Contigs stats:
 
  desc    #elem  min    max    mean    stdev  sum
  length  185030  34      5490    77.21  141.23  14287079
  reads  185030  2      13352  29.52  127.65  5463405
 
  Singletons: 3,164,495  (singletons file + ambiguous reads)
 
Run times:
  18459.472u 31.849s 5:08:59.82 99.7% 0+0k 0+0io 0pf+0w
 
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0214_ssake/
 
Latest version is 3.2: run on the same dataset generated fewer contigs and more singletons; also took longer to run
 
==== SSAKE (redo) ====
 
version 3.2.1 with default parameters
 
  Contigs stats:
 
  desc    #elem  min    max    mean    median  n50    sum
  length  107046  34      3994    96.39  42      201    10317932
  reads  ...
 
  Singlets: 3,309,974
 
==== Velvet * ====
 
Version 0.5.05
Ovl len=23
 
Commands:
  host: sycamore (Opteron)
  velveth . 23 Pa.seq
  velvetg . -read_trkg yes
 
Run times:
  velveth : 88.614u  22.865s 2:16.35  81.7%  0+0k 0+0io 0pf+0w
  velvetg : 439.939u 19.258s 8:19.77  91.8%  0+0k 0+0io 0pf+0w
  total  : 528.553u 42.123s 10:36.12
 
Output files (used)
  contigs.fa : contig FASTA file
 
<pre style="background:yellow">
Contigs stats:
                #elem  min    max    mean    median  n50    sum
  all          10684  45      16239  640    363    1184    6841458
  200bp+        7382    200    16239  877    586    1252    6474426
  breaks        257  (-b 20)
  breaks(200+)  251  (-b 20) coming from 170 ctgs
  1,273,164 singletons; 879,840 of them aligned by soap (-v5,-s12,-g3) to PA14
</pre>
 
Distance between breakpoints:
  #elem  min    max    mean    median  n50    sum
  251    0      314620  24946  3776    99406  6261404
 
Align velvet ctgs to PA14 ref (nucmer -c 40;  delta-filter -q):
  alignemnts  length  %id
  9927        6164312 99.20  (slightly lower %id than the AMOS PA14 ref assembly: 99.33%)
 
PA14 0cvg regions
                      #elem  min    max    mean    median  n50    sum
original              2235    1      31503  197    22      7529    439633  # 6098014 of PA14 covered
delta-filter -q      2378    1      31676  220    24      6434    522246  # 6015401 of PA14 covered
 
 
rRNA repeat is collapsed; Contigs aligned to the 4 copy 5.7Kbp PA14 rRNA repeat:
 
      [S1]    [E1]  |    [S2]    [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
  ===============================================================================================================================
        23      409  |        1      388  |      387      388  |    97.94  |    5763      388  |    6.72  100.00  | rRNA        856    [CONTAINS]
      388    2132  |        1    1745  |    1745    1745  |    99.94  |    5763    1745  |    30.28  100.00  | rRNA        642    [CONTAINS]
      2111    2279  |      169        1  |      169      169  |  100.00  |    5763      169  |    2.93  100.00  | rRNA        3433    [CONTAINS]
      2258    5763  |        1    3506  |    3506    3506  |    99.89  |    5763    3506  |    60.84  100.00  | rRNA        260    [CONTAINS]
 
Question: Are there some velvet contigs contained in each other?
Answer:  Very few: 302 (out of 10K) ; avg %id is 86 and avg lenght is 106 bp; nucmer -c 20 has been used for alignments
 
Many contigs seem to overlap by ~ 20bp
 
Some contigs share end reads labeled as D(uplicates)
 
  listReadPlacedStatus Pa.bnk | awk '{print $3}' | count.pl
  #      total
  P      7322684
  S      1273164
  D      32052    # 33 of them belong to 3 contigs
  Total  8627900
 
About ~4607 contig pairs could be merged:
  grep D Pa.status.detailed | awk '{print $5,$6}' | sort -u | wc -l
  4607
Most of the overlaps are shorter than 10bp; if "nucmer -c 20 -l 20" is run on the contig ends; only 627 of them are found
 
Example:
  Contigs 1915 & 3551 share 15 Duplicate reads
  cat Pa.status | grep 1915 | grep 3551$ | wc -l
  15
They align to adjacent pos in PA14 :
  3743092 3745646 |  2555  1  | 2555  2555  |    99.49  |  6537648    2555  |    0.04  100.00  | 1      1915    [CONTAINS]
  3745625 3746382 |  758  1  |  758  758  |    99.60  |  6537648      758  |    0.01  100.00  | 1      3551    [CONTAINS]
 
411870 out of 1273164 singletons (~33%) align to PA14 (nucmer -maxmatch -c 20)
304162 out of 1273164 singletons (~25%) are also singletons in the AMOScmp-PA14 assembly; probably the common ones are low quality sequences that just don't assemble
 
'''PA14 reference alignments:'''
nucmer -maxmatch -c 40):
 
Aligned:
  #elem  min    max    mean    stdev  sum
  9859    45      15342  642    785    6331357
 
Completely aligned:
  #elem  min    max    mean    stdev  sum
  9681    45      11120  617    712    5969126
 
  9208 are aligned 1 time
  408  are aligned 2 times  #repeats
  39  are aligned 3 times  #repeats
  25  are aligned 4 times  #repeats
  1    is aligned  6 times  #repeat
 
Partially aligned:
  #elem  min    max    mean    stdev  sum
  178    78      15342  2035    2139    362231
 
  120 are aligned once:          [[Media:Pa.1con-velvet.uniq.not_contained.filter-q.1.coords|Pa.1con-velvet.uniq.not_contained.filter-q.1.coords]]
  58  are aligned more than once: [[Media:Pa.1con-velvet.uniq.not_contained.filter-q.2.coords|Pa.1con-velvet.uniq.not_contained.filter-q.2.coords]]
 
  All these contigs have AMOScmp assembly singletons aligned to them!!!
  These velvet contigs seem correct.
 
  Example: There is a large insertion in PA14 at position 6700
 
  PA-b1 reads vs PA14 ref 
  find-query-breaks -B Pa.filter-q.delta
  => 3 reads end at base 6700
   
  velvet ctgs vs PA14 ref:
  ...
    6177    6700  |    1  524  |  524    524  | 99.43  |  6537648  875  | 0.01  59.89  | 1  900    # 431 out of 1054 seqs in this contigs are singletons in AMOScmp assembly
    6694    8045  |  743  2094 |  1352  1352 | 99.63  |  6537648  2097 | 0.02  64.47  | 1  1970    # 895 out of 2354 seqs in this contigs are singletons in AMOScmp assembly; unaligned regions aligns at low %id to a Pseudomonas fluorescens hypothetical protein
 
  Nucmer/Soap alignments of the reads adjacent to position 6700
  [[Media:PA14.coord.6700]] : ~20 reads end at pos 6700, ~12 reads start at pos 6694 (6bp ovl only)
  [[Media:PA14.soap.v5.s12.g3.6700]]
  [[Media:PA14.soap.v2.s12.g3.6700]] 10 bp ovl between reads that start at pos 6671(2 mismatches) & 6694(0 mismatches)
  [[Media:PA14.soap.v2.s12.g3.c42.6700]] 11 bp ovl between reads that start at pos 6676(2 mismatches) & 6694(0 mismatches)
 
  Other breaks:
  ...
  66764    67073  |    1  310  |  310    310  | 100.00 |  6537648  1282 | 0.00  24.18  | 1 320
  67181    67322  | 1136  1277  |  142    142  |  98.59 |  6537648  1282 | 0.00  11.08  | 1 320
  ...
  116429  117249  |    1  821  |  821    821  |  99.03 |  6537648  1457 | 0.01  56.35  | 1 1955
  120490  120573  |  84    1  |  84    84  | 100.00 |  6537648  1693 | 0.00  4.96  | 1 1827
...
  234641  235980  | 3294  1955  | 1340  1340  |  99.40 |  6537648  3294 | 0.02  40.68  | 1 139
  235981  237442  | 1466    5  | 1462  1462  |  99.32 |  6537648  3294 | 0.02  44.38  | 1 139
...
  286003  287915  | 1216  3128  | 1913  1913  |  99.22 |  6537648  3128 | 0.03  61.16  | 1 3068
...
 
Not aligned:
  #elem  min    max    mean    stdev  sum
  825    45      16239  618.3  1209.37 510101        # the longest velvet ctg does not align
 
Not CONTAINED:
  #elem  min    max    mean    stdev  sum
  1002    45      16239  868.82  1518.56 870566
 
Velvet vs all Pseudomonas strains:
  strain          aligned        CONTAINED      partially-aligned   
  PA14            9859            9681            178                # smallest number
  PA2192          9940            9607            333
  PACS2          9688            9469            219
  PAO1            9587            9402            185
  PAC3719        9427            9057            370
  PA7            8514            7651            863
  all            10378          10243          135
 
  not aligned stats:
  #elem  min    max    mean    median  n50    sum
  306    45      9681    549    119    2310    168094
 
  strain(s)                              CONTAINED
  PA14                                    9681
  PA14|PA2192                            10081
  PA14|PA2192|PACS2                      10186
  PA14|PA2192|PACS2|PAO1                  10198
  PA14|PA2192|PACS2|PAO1|PAC3719          10210
  PA14|PA2192|PACS2|PAO1|PAC3719|PA7      10243
 
Location:  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24
 
Velvet version 0.5.07 : does not "seem" to be an improvement over Version 0.5.05
Ovl len=23
  desc              #elem  min    max    mean    stdev  sum
  contig            11230  45      16239  611.64  814.09  6868784
  contig(200bp+)    7353    200    16239  878.61  897.06  6460467
 
Some of the 306 contigs should probably assemble together:
* Assembled with minimus2 REFCOUNT=0 OVL=20 => 54 ctgs that contain 132 reads & 174 singlets
* => reduced the number from 306 to 54+174= 228
 
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/no_hits
 
 
!!! If AMOScmp is run only on the velvet singlets (mostly low quality), some of the resulting contigs will have "artificial" gaps
          #elem  min    max    mean    median  n50    sum
  all    43926  33      963    122    93      157    5367766
  200bp+  6757    200    963    293    263    289    1980273
Location:
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14.singletons/1con-contigs.filter-q.na.coords
 
==== Edena ====
 
Version 2.1.1 , default parameters
 
  Contigs stats: 
                #elem  min    max    mean    median  n50    sum
  all          11180  100    11300  552    356    837    6175460
  200bp+        8316    200    11300  693    488    902    5759209
  breaks(PA14)        245 (-b 20)  # 78 of them also present in velvet
  breaks(velvet)      210 (-b 20)
 
  3955865 singletons
 
Contig don't seem to overlap much
 
Run times:
  ovl step:      1655.561u 7.560s  27:54.62 99.3%  0+0k 0+0io 0pf+0w
  assembly step: 56.141u  3.791s  1:03.82 93.9%  0+0k 0+0io 0pf+0w
  total:        1711.702u 11.351s 28:58.44   
 
A significant number of edena contigs CONTAIN one(several) velvet contigs:
 
  $ show-coords -I 94 velvet-edena.delta | egrep 'CONTAINED|BEGIN|END' | awk '{print $19}' | count.pl -m 1 | wc -l # 2269
  $ show-coords -I 94 velvet-edena.delta | egrep 'CONTAINED|BEGIN|END' | awk '{print $19}' | count.pl -m 2 | wc -l # 1455
 
Align edena ctgs to PA14 ref (nucmer -c 40; delta-filter -q):
        #elem  min    max    mean    median  n50    sum
  0cvg  8809    1      34854  119    38      438    1050135
 
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0312_edena/
 
==== SSAKE/Edena/Velvet summary ====
 
host: sycamore.umiacs.umd.edu (Linux 64)
 
  program  version  run-time
  SSAKE    3.0      8459.472u 31.849s 5:08:59.82 
  Velvet  0.5.05    528.553u  42.123s  10:36.12
  Edena    2.11      1711.702u 11.351s  28:58.44   
 
Input:  8627900 33 bp Solexa reads(2 lanes)
 
Output: contigs+singletons
 
Contig statistics:
  program #seqs  min  max    mean  med  n50  sum
  SSAKE  185030  34  5490  77    44  87    14287079
  Velvet  10684  45  16239  640  363  1184  6841458
  Edena  11180  100  11300  552  356  837  6175460
  Euler  7090    22  22671  953  312  2622  6759979
 
200bp+ contig statistics:
  program #seqs  min  max    mean  med  n50  sum
  SSAKE  12532  200  5490  486  385  549  6090567
  Edena  8316    200  11300  692  488  902  5759209
  Velvet  7382    200  16239  877  586  1252  6474426
 
Singleton statistics:
  program #seqs
  SSAKE  3164495
  Velvet  1273164
  Edena  3955865
 
==== Euler-SR ====
 
  cmd:    Assemble.pl Pa.seq vertexSize
  version: 12/4/08
 
Ctg len stats:
  vertexSize      #elem  min    max    mean    median  n50    sum
  21              7090    22      22671  953    312    2622    6759979
  25              11130  26      9619    564    359    912    6276548
 
=== Gene assemblies ===
 
Dan Sommer reduced number of contigs >=200bp from 306 to 120
/fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/PAb1.fasta
 
  Contig stats:
  desc    #elem  min    max    mean    stdev  sum
  contigs 120    212    512638  51438  81999  6172675
 
  Average Gap: 105 nt bases
  Median Gap: 14 nt bases
  Largest Gap: 1095 nt bases
 
927 singletons assembled
 
== Contig Assemblies ==
 
=== Comparative ===
 
==== AMOScmp on the velvet contigs (PA14 ref) * ====
 
* Use only velvet ctgs
* This assembly will have some of the repeats collapsed
 
All:
        #elem  min    max    mean    median  n50    sum
  ctg  3046    45      27957  1928    1101    3744    5873032
  singl 989    45      16239  836    292    2221    826462
  all  4035    45      27957  1660    850    3573    6699494
  breaks 270
 
200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  2632    200    27957  2213    1390    3798    5824679
  singl 586    200    16239  1344    721    2352    787686
  all  3218    200    27957  2055    1243    3631    6612365
  breaks 265
 
Reduced the number of contigs from 10684 to 4035 !!!
 
Location:  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14
 
==== AMOScmp on the velvet contigs & velvet singletons (PA14 ref) * ====
 
* Use singletones from velvet ctgs as well
* This assembly will have some of the repeats collapsed
 
<pre style="background:yellow">
                  #elem  min    max    mean    median  n50    sum
  all              2383    33      177630  2592    72      45071  6177155
  200bp+          513    200    177630  11778  738    45351  6042047
  all-velvet      334    59      177630  17920  7295    45351  5985407  # ctgs that contain at least 1 velvet ctg
  singl-velvet    984    45      16239  837    292    2221    823862  # velvet ctgs left as singletons
  singl-read      393416  33      45      33      33      33      12983724 # original Solexa reads not assembled
</pre>
 
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14.singletons
 
==== AMOScmp on the velvet contigs (PAO1 ref) ====
 
All:
        #elem  min    max    mean    median  n50    sum
  ctg  2982    45      27957  1928    1124    3652    5749909
  singl 1255    45      16239  764    247    2275    958885
  all  4237    45      27957  1583    773    3440    6708794
  breaks(PAO1) 271
 
200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  2610    200    27957  2186    1380    3675    5705769
  singl 707    200    16239  1283    674    2398    906818
  all  3317    200    27957  1994    1205    3494    6612587
  breaks(PAO1) 266
 
Location:  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PAO1
 
==== AMOScmp on the velvet contigs (PA14 & PA2192 ref) ====
 
Input: above contigs & singletons
 
  All:
  desc    #elem  min    max    mean    stdev  sum
  ctg    2841    45      34374  2022    2570    5744186
  singl  751    45      17475  1251    2264    939443  # 593 are velvet original ctgs
  all    3592    45      34374  1861    2529    6683629
 
  200bp+:
  desc    #elem  min    max    mean    stdev  sum
  ctg    2489    200    34374  2291    2638    5701588
  singl  456    200    17475  2001    2648    912503
  all    2945    200    34374  2246    2641    6614091
 
Location:  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14-PA2192
 
==== AMOScmp on the velvet contigs (PA14, PA2192 & PACS2 ref) ====
 
Input: above contigs & singletons
 
  All:
  desc    #elem  min    max    mean    stdev  sum
  ctg    2759    45      28263  2029    2554    5597067
  singl  725    45      34374  1495    2744    1084142  # 493 are velvet original ctgs
  all    3484    45      34374  1918    2603    6681209
 
  200bp+:
  desc    #elem  min    max    mean    stdev  sum
  ctg    2419    200    28263  2297    2618    5556096
  singl  463    200    34374  2290    3170    1060442
  all    2882    200    34374  2296    2714    6616538
 
Location:  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14-PA2192-PACS2
 
==== AMOScmp on the velvet contigs (PA14, PA2192, PACS2 & PAO1 ref) ====
 
Input: above contigs & singletons
 
  All:
  desc    #elem  min    max    mean    stdev  sum
  ctg    2696    45      24046  2028    2510    5468683
  singl  753    45      34374  1609    3050    1211809  # 483 are velvet original ctgs
  all    3449    45      34374  1937    2643    6680492
 
  200bp+:
  desc    #elem  min    max    mean    stdev  sum
  ctg    2369    200    24046  2292    2569    5429268
  singl  487    200    34374  2439    3527    1187586
  all    2856    200    34374  2317    2756    6616854
 
Location:  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14-PA2192-PACS2-PAO1
 
==== AMOScmp/minimus on the AMOScmp (PA14 ref) & velvet "unique" contigs ====
 
Input:
  AMOScmp PA14 ref assembly (2) contigs
  velvet "unique: contigs
  !!! No singletons
 
AMOScmp2 steps:
# Align AMOScmp & velvet contigs (nucmer -c 40) to each other
# Identify velvet contigs CONTAINED in AMOScmp contigs(max trimming of 20)
# Remove CONTAINED contigs from the velvet set
# Create alignment file based to the AMOS.scaff file (if nucmer is repeat contigs are collapsed/no placed exactly)
# Align velvet contigs to reference using nucmer ; filter only the CONTAINED ones
# Merge AMOS & velvet alignments
# Run "casm-layout -S -r ..." ; if -S is not used, many AMOS contigs are not placed 
# Run "make-consensus ..."
 
* All AMOS input contigs should be part of the new assembly
* If the velvet CONTAINED  alignments are not filtered, some end conflicts might cause some of the AMOS contigs end up as singletons
 
Input:
 
  All:
  desc            #elem  min    max    mean    stdev  sum
  AMOS            1843    20      127984  3350.38 10776  6174765
  velvet-uniq    1189    45      16239  997.85  1485.34 1186452  # 462 aligned to PA14 ref (nucmer -c 40);
                                                                  # 284 CONTAIEND in the ref
                                                                  # 178 (462-284) partially aligned :
                                                                    # 58  have 2+ alignments
                                                                    # 120 have 1  alignment
                                                                  # 727 (1189-462) not aligned to the ref
  all            3032    20      127984  2427.84 8530.33 7361217
 
  200bp+:
  desc            #elem  min    max    mean    stdev  sum
  AMOS            542    200    127984  11245  17520  6094904
  velvet-uniq    848    200    16239  1354.36 1627.86 1148499  # 422 aligned to PA14 ref (nucmer -c 40); 249 CONTAIEND in the ref; 426 not aligned to the ref
  all            1390    200    127984  5211.08 12019.6 7243403
 
Output:
 
  All:
  desc            #elem  min    max    mean    stdev  sum
  ctg            1327    20      221554  4644.9  19079.  6163795    # two of them(530,531) contain only velvet seqs
                                                                    # 1210 contain only AMOS seqs ; 11 contain 2 AMOS seqs (negative gap got merged)
                                                                    # 115 contain both AMOS & velvet seqs       
  singl          817    45      16274  1053.94 1719.12 861076    # all but one(807) are velvet : see .conflict file
                                                                    # 170 out of 178 partially aligned (see above) is a singleton
  all            2144    20      221554  3276.52 15146.  7024871
 
  200bp+:
  desc            #elem  min    max    mean    stdev  sum
  ctg            282    203    221554  21651  36731  6105733
  singl          571    200    16274  1461.55 1917.62 834546
  all            853    200    221554  8136.31 23189  6940279
 
minimus steps:
# Run minimus2 pipeline on the above contigs(1327) and singletons(817) ; Only uses contig:singletion overlaps (not all:all)
 
  $ ~dpuiu/bin/minimus2 -D BREAKID=1327 -D OVERLAP=40 Pa
 
min OVL=40 : use contig:singleton alignments
 
Output:
 
  All:
  desc            #elem  min    max    mean    stdev  sum
  ctg            180    135    253489  22817.2 45339  4107096  # 218 of the PA14 unaligned velvet contigs assembled
  singl          1377    20      153036  1996.79 10534  2749587
  all            1557    20      253489  4403.77 19467  6856683
  200bp+:
  desc            #elem  min    max    mean    stdev  sum
  ctg            174    210    253489  23598.3 45919  4106116
  singl          488    200    153036  5519.51 17154  2693524
  all            662    200    253489  10271.3 28846  6799640
 
min OVL=20 : use contig:singleton alignments (not significantlty better than ,on OVL=40)
 
Output:
 
  All:
  desc    #elem  min    max    mean    stdev  sum
  ctg    279    63      253489  15557  37925  4340305 # 321 of the PA14 unaligned velvet contigs assembled
  singl  972    20      153036  2576    12252  2504000
  all    1251    20      253489  5471    21580  6844305
 
  200bp+:
  desc    #elem  min    max    mean    stdev  sum
  ctg    263    214    253489  16495  38868  4338061
  singl  393    200    153036  6261    18680  2460755
  all    656    200    253489  10364  28954  6798816
 
 
min OVL=20 : use contig:singleton & singleton:singleton alignments
 
  All:
  desc    #elem  min    max    mean    stdev  sum
  ctg    277    103    253489  15983  37991  4427292  # 556 of the PA14 unaligned velvet contigs assembled
  singl  808    20      153036  2986    13439  2412439  # 260 of them are velvet original contigs
  all    1085    20      253489  6304    23111  6839731
 
  200bp+:
  desc    #elem  min    max    mean    stdev  sum
  ctg    263    222    253489  16826  38812  4425275
  singl  298    201    153036  7971    21242  2375456  # 165 of them are velvet original contigs
  all    561    201    253489  12123  31042  6800731
 
min OVL=20 : use contig:singleton & singleton:singleton alignments & adjacent contig ovelaps (best)
 
  All:
  desc    #elem  min    max    mean    stdev  sum
  ctg    317    47      253489  16450  40806  5214579
  singl  703    20      112908  2324    11332  1633528  # 335 are original velvet contigs
                                                          # 29 are velvet contigs partially aligned
  all    1020    20      253489  6714    25448  6848107
 
  200bp+:
  desc    #elem  min    max    mean    stdev  sum
  ctg    291    200    253489  17909  42289  5211547
  singl  247    201    112908  6480    18431  1600539  # 186 are original velvet contigs
  all    538    200    253489  12662  33969  6812086
 
Locations:
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet/AMOScmp2
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet/AMOScmp2.not_CONTAINED
 
==== AMOScmp/minimus on the AMOScmp (PA14 ref) & velvet contigs * ====
 
* Use all velvet contigs
* Very similar with the previous case (the velvet CONTAINED contigs were discarded at the beginning)
 
AMOScmp2


File location:
Input:
   /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/
   All:
  desc    #elem  min    max    mean    stdev  sum
  AMOS    1843    20      127984  3350    10777  6174765
  velvet  10684  45      16239  640    825    6841458
  all    12527  20      127984  1039    4311    13016223
    
    
Files:
  200bp+:  
   s_1_0001_prb.txt
   desc    #elem  min    max    mean    stdev  sum
   s_1_sequence.txt # contains seq & qual
   AMOS    542    200    127984 11245  17521  6094904
   s_1_tag.txt          
  velvet  7382    200    16239  877    896    6474426
   s_7_0300_prb.txt
  all    7924    200    127984  1586    5344    12569330
   s_7_sequence.txt # contains seq & qual
 
   s_7_tag.txt    
Output:
 
  All:
  desc    #elem  min    max    mean    stdev  sum
  ctg    1361    20      221563  4536    19428  6173665
  singl  838    45      22884  1064    1868    891463
  all    2199    20      221563  3213    15418  7065128
 
  200bp+:
  desc    #elem  min    max    mean    stdev  sum
  ctg    299**  203    221563  20448  37378  6113919
  singl  576    200    22884  1499    2114    863431
  all    875    200    221563  7974    23668  6977350
 
minimus2 on AMOScmp2 output
 
  All:
  desc    #elem  min    max    mean    stdev  sum
  ctg    284    47      274913  18279  46193  5191268
  singl  1001    20      133259  1666    9754    1667778
  all    1285    20      274913  5338    24330  6859046
 
  200bp+:
  desc    #elem  min    max    mean    stdev  sum
  ctg    246    210    274913  21088  49046  5187663
  singl  285    201    133259  5696    17670  1623459
  all    531    201    274913  12827  36583  6811122
 
Locations:
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet/AMOScmp2
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet/AMOScmp2/minimus2/
 
==== AMOScmp/minimus on the AMOScmp PA14 & PAO1 contigs  ====
 
AMOScmp
 
Input:
  All:
  desc    #elem  min    max    mean    stdev  sum
  PA14    1843    20      127984  3350    10777  6174765
  PAO1    1961    20      46414  3065    5251    6010800  # 1655 aligned 
                                                            # 1312 CONTAINED in PA14 contigs
                                                            # 343 aligned but not CONTAINED in PA14 ctgs; 36 have 2+ alignments
                                                            # 306 not aligned
  all    3804    20      127984  3203    8396    12185565
 
   200bp+:
  desc    #elem  min    max    mean    stdev  sum
  PA14    542    200    127984  11245  17521  6094904
  PAO1    1166    203    46414  5108    6007    5955960
  all    1708    200    127984  7056    11405  12050864
 
Output:
  All:
  desc    #elem  min    max    mean    stdev  sum
  ctg    1585    20      159598  3896    15496  6174745
  singl  266    20      26709  1900    4486    505395      # all singletons are PAO1
  all    1851    20      159598  3609    14456  6680140
 
  200bp+:
  desc    #elem  min    max    mean    stdev  sum
  ctg    356    200    159598  17137  29063  6100717
  singl  75      215    26709  6631    6361    497316
  all    431    200    159598  15309  26837  6598033
 
minimus2 on the above contigs and singletons:
 
Output:
 
  All:
  desc    #elem  min    max    mean    stdev  sum
  ctg    246    30      266798  18632  44344  4583528
  singl  1138    19      126780  1569    9194    1785442  # 161 are original PAO1 contigs
  all    1384    19      266798  4602    21458  6368970
 
  200bp+:
  desc    #elem  min    max    mean    stdev  sum
  ctg    139    202    266798  32900  54951  4573157
  singl  152    201    126780  11405  22894  1733533 # 23 are original PAO1 contigs
  all    291    201    266798  21672  42726  6306690
 
 
==== AMOScmp2 on the AMOScmp PA14 contigs  & PA14 genes * ====
 
INPUT:
                  #elem  min    max    mean    median  n50    sum
  AMOScmp         1843    20      127984  3350    63      30995  6174765
   AMOScmp(200bp+) 542    200    127984  11245  3263    31106  6094904
   genes          5892    72      15639  992    861    1206    5846175
 
OUTPUT:
          #elem  min    max    mean    median n50    sum
  all    650    20      187557  9954    846    60335  6470391
  200bp+  462    201    187557  13977  1578    60335  6457375
  0 singletons !!!
 
Contigs that contain at least 1 AMOScmp contig & 1 gene: 356
                  #elem  min    max    mean    median  n50    sum      "singletons"
  ctg-all        356**  111    187557  17861  2692    63102  6358519
  ctg-200bp+      353    217    187557  18012  2723    63102  6358095
  ctg-AMOS        356    1      85      5      2      9      1653      190 
  ctg-genes      356    1      160    16      2      58      5781      111  **  that many genes from PA14 seem to miss from PAb1
 
Contigs that contain at least 1 AMOScmp contig(>100bp) & 1 gene: 218
                  #elem  min    max    mean    median  n50    sum
  ctg-all        218**  158    187557  28371  13008  66709  6184790
  ctg-200bp+      217    305    187557  28501  13008  66709  6184632
  ctg-AMOS        218    1      19      3      2      4      685
  ctg-genes      218    1      160    26      12      59      5632
 
Contigs that contain at least 1 AMOScmp contig(>200bp) & 1 gene: 206
 
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-gene/AMOScmp2
 
==== AMOScmp2 on the AMOScmp PA14 contigs , PA14 genes * & velvet contigs ====
 
INPUT:
                  #elem   min    max    mean   median  n50    sum
  AMOScmp        1843    20      127984  3350    63      30995  6174765
  AMOScmp(200bp+) 542    200    127984  11245  3263    31106  6094904
  genes          5892    72      15639  992    861    1206    5846175
  velvet          10684  45      16239  640    363    1184    6841458
 
OUTPUT:
                  #elem  min    max    mean    median  n50    sum
  all            552    20      337673  11718  663    133802  6468490
  200bp+          377    201    337673  17126  1155    133802  6456628
  singl          1006    45      16239  869    306    2308    874312      # all but 2 are velvet ctgs;
                                                                              # 232 velvet singletons have annotated alignments to the AMOScmp ctgs
             
 
About half of the contigs end up in genes
 
Contigs that contain at least 1 gene & (1 AMOS ctg >=100bp or 1 velvet ctg): 135
                  #elem  min    max    mean    median  n50    sum
  ctg-all        135**  163    337673  45819  14232  141346  6185574
  ctg-200bp+      134    305    337673  46160  16317  141346  6185411
  ctg-AMOS100-vel 135    1      508    77      28      215    10356
  ctg-genes      135    1      287    42      12      126    5631
 
Contigs that contain at least 1 gene & 1 velvet ctg: 104
Contigs that contain at least 1 gene , 1 velvet ctg & 1 AMOScmp ctg : 104 (same number)
 
                  #elem  min    max    mean    median  n50    sum
  ctg-all        104**  163    337673  58949  32961  141346  6130650
  ctg-200bp+      103    747    337673  59519  32961  141346  6130487
  ctg-AMOS                                                        1085
  ctg-velvet                                                      9667
  ctg-genes                                                      5572
 
 
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet-gene/AMOScmp2
 
==== AMOScmp on velvet & edena contigs (PA14 ref) ====
 
Input:
                #elem  min    max    mean    median  n50    sum
  velvet        10684  45      16239  640    363    1184    6841458
  edena        11180  100    11300  552    356    837    6175460
  all          21864  45      16239  595    359    992    13016918
 
PA14 0cvg regions (nucmer -c 40; delta-filter -q):
              #elem  min    max    mean    median  n50    sum
  velvet        2388    1      31832  219    24      6434    522048
  edena        8809    1      34854  119    38      438    1050135
  all          2129    1      31832  234    22      7529    498077
 
Output:
  All:
        #elem  min    max    mean    median  n50    sum
  ctg  2782    45      27957  2141    1207    4246    5955924
  singl 1874    45      16239  839    343    1972    1572938
  all  4656    45      27957  1617    727    3749    7528862
  breaks 492
    
    
   wc -l s_1_sequence.txt s_7_sequence.txt
   200 bp+:
   4105993 s_1_sequence.txt
        #elem  min    max    mean    median  n50    sum
   4521907 s_7_sequence.txt
  ctg  2424    200    27957  2439    1505    4288    5912995
   8627900 total
  singl 1233    200    16239  1217    677    2121    1500822
  all  3657    200    27957  2027    1122    3816    7413817
  breaks 478
 
==== AMOScmp on velvet contigs (PA14 gene ref) * ====
 
Contig stats:
                              #elem  min    max    mean    median  n50    sum
  nucmer (-l 20,-c 65)        3137    65      3615    490    370    705    1535656
  promer (-l 6, -c 10)        3622    61      6125    623    448    922    2255378
 
Only the contigs CONTAINED in the genes get assembled; BEGIN/END are discarded
 
Locations:
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/genes/AMOScmp-nucmer
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/genes/AMOScmp-promer
 
=== De-novo ===
 
==== minimus on velvet contigs (OVL=40) ====
 
All:
        #elem  min    max    mean    median  n50    sum
  ctg  179    45      1495    75      45      45      13336
  singl 10324  45      16239  661    384    1188    6819550
  all  10503  45      16239  651    374    1185    6832886
  breaks 257
 
200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  6      227    1495    836    1129    1495    5015
  singl 7375    200    16239  877    586    1252    6469371
  all  7381    200    16239  877    586    1252    6474386
  breaks 251
 
==== minimus on velvet contigs (OVL=20) ====
 
All:
        #elem  min    max    mean    median  n50    sum
  ctg  1812    47      23446  1621    1164    2366    2936650
  singl 6135    45      15342  626    357    1136    3839779
  all  7947    45      23446  853    471    1574    6776429
  breaks(PA14) 299
 
200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  1746    201    23446  1677    1211    2374    2927255
  singl 4246    200    15342  851    568    1216    3615358
  all  5992    200    23446  1092    702    1651    6542613
  breaks(PA14) 293
 
==== minimus1 on velvet contigs (OVL=20) ====
 
All:
        #elem  min    max    mean    median  n50    sum
  ctg  2046    73      23446  1628    1157    2396    3330276
  singl 5658    45      15342  609    344    1144    3445004
  all  7704    45      23446  879    489    1660    6775280
  breaks(PA14) 568
 
200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  1978    201    23446  1678    1205    2396    3320032
  singl 3786    200    15342  857    575    1222    3243138
  all  5764    200    23446  1139    737    1729    6563170
  breaks(PA14) 559
 
If OVL is dropped to 16, fewer contigs get merged => fewer breakpoints
 
==== minimus2 on velvet & edena ctgs (OVL=40) ====
 
Input: velvet & edene ctgs
          #elem  min    max    mean    median  n50    sum      breaks
  velvet  10684  45      16239  640    363    1184    6841458  240
  edena    11180  100    11300  552    356    837    6175460  226
  v+e      21864  45      16239  595    359    992    13016918 
  v+e(200+)15698  200    16239  779    528    1069    12233635
 
Output (ctg+singletons):
          #elem  min    max    mean    median  n50    sum        breaks
  all      8835    45      16296  774    424    1448    6841222    253     
  200bp+  6446    200    16296  1018    663    1514    6563221
 
Comment: for OVL=20 we get 475 breaks (too many?)
 
==== minimus2 on the AMOScmp PA14 & PAO1 contigs ====
 
* Use PA14-PA14 (5 bp, adj ctg ends) , PA14-PAO1 (20 bp, contig ends) , PAO1-PAO1 (5 bp, adj ctg ends) overlaps
 
All:
        #elem  min    max    mean    median  n50    sum
  ctg  441    43      223777  13972  294    86090  6161798
  singl 1227    20      18974  237    42      6391    290505
  all  1668    20      223777  3868    52      78110  6452303
  breaks(PA14) 130
 
200 bp+:
        #elem  min    max    mean    median  n50    sum
  ctg  245    202    223777  25080  5100    86090  6144514
  singl 107    210    18974  2180    547    8193    233281
  all  352    202    223777  18119  1678    78110  6377795
  breaks(PA14) 121
 
Location:
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-PA14-PAO1/minimus2
 
==== minimus2 on AMOScmp PA14 contigs and PA14 genes (OVL=20) ====
 
INPUT:
          #elem  min    max    mean    median  n50    sum
  AMOScmp 1843    20      127984  3350    63      30995  6174765
  genes  5892    72      15639  992    861    1206    5846175
 
OUTPUT:
                  #elem  min    max    mean    median  n50    sum      singl
  ctg-all        278**  158    170499  22136  10332  52297  6153935
  ctg-200bp+      277    315    170499  22216  10332  52297  6153777
  ctg-AMOS        278    1      23      2      1      3      604      1239
  ctg-genes      278    1      157    20      10      49      5582      310
 
==== minimus2 on velvet contigs and PA14 genes (OVL=20) ====
 
* min %id = 94
 
INPUT
                  #elem  min    max    mean    median  n50    sum
  gene            5892    72      15639  992    861    1206    5846175
  velvet          10684  45      16239  640    363    1184    6841458
 
OUTPUT (nucmer,OVL=20)
                  #elem  min    max    mean    median  n50    sum 
  ctg            923**  181    66497  6728    4737    10446  6210244
  singl-gene      352    93      7347    979    777    1272    344478
  singl-velvet    1144    45      16239  580    214    1603    663940
  all            2419    45      66497  2984    901    9049    7218662
 
Best alignments:
  show-coords Pa.filter-1.delta | ~/bin/AMOS/mummerAnnotate.pl -all | awk '{print $20}' | count.pl
  [CONTAINED]    5306
  [END]          3482
  [BEGIN]        3453
  [CONTAINS]      1977  # ??? why only 1977 velvet ctgs contain full genes; genes are not repetitive; even worse for edena
  [PARTIAL]      168
  [IDENTITY]      1
  Total          14387
 
0cvg regions in the genes
  #genes      #elem  min    max    mean    median  n50    sum
  1391        1940    1      1762   36      18      67      69336
 
=>4166 out of 5557 assembled genes are fully covered by velvet ctgs
 
Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/genes-PA14/minimus2-nucmer.OVL20
 
==== minimus2 on velvet contigs and Ps genes (OVL20) ====
 
* min %id: = 74 (nucmer default)
* avg %id: = 84
 
  min%id  overlaps
  74      1162      # all considered
  85      415
  90      110
 
INPUT
          #elem  min    max    mean   median  n50    sum
  gene    5478    75      18825  996    852    1227    5456880
  velvet  10684  45      16239  640    363    1184    6841458
 
OUTPUT
                #elem  min    max    mean    median  n50    sum        "singl"
  ctg          652**  323    18825  2150    1768    2451    1401627                !!! 1.4M of genome gets covered
  ctg-gene      757    96      18825  1271    1122    1455   962199      4721
  ctg-velvet    998    45      11120  938    558    1767    935636      9686
 
==== minimus2 on velvet contigs and Ps genes (OVL36: promer -l 6 -c 12) ====
 
  min%id  overlaps  genes  velvet
  22      8496      2985  4827        #default
  50      7209      2813  4495
  60      5948      2571  4087
  70      4361      2122  3269
  74      3700      1880  2827
  80      2615      1407  2036
  85      1714      956    1311
  90      976      516    659
 
min%id=74  =>
  * 3700 overlaps
  * 2169 Ps genes
  * 3206 velvet contigs
 
make-consensus error
                #elem  min    max    mean    median  n50    sum      "singl"
ctg            929**  206    16291  2012    1669    2335    1869539            !!! 1.8M of genome gets covered
ctg-gene        991    111    9522    1118    990    1272    1107648  4487    # 2169-991 genes had overlaps but were not assembled
ctg-velvet      1032    45      16239  1199    910    1742    1237453  9652    # 3206-1032  velvet contigs had overlaps but were not assembled
 
!!! Slightly more genes/velvet contigs get assembled with promer than nucmer
 
 
==== minimusN on PA14,PA2192,PAC3719,PACS2,PAO1 genes ====


   grep -c N s_*seq
Output:
   s_1_sequence.seq:37933
  strains        #elem  min    max    mean    median  n50    sum
   s_7_sequence.seq:69669
  2+            5702    72      15450  1015    870    1227    5785616
  3+            5469    153    15450   1007    876    1212    5505737
  4+            5292   153    13067  1008    883    1209    5334219
  5              4962    153    13067  1020   891    1212    5061120  (5021 in the article)


== Assemblies ==
Total and strain specific genes:
  strain          total  singletons
  PA14            5892    519
  PA2192          6157    387  (507 in the article)
  PAC3719        5650    132  (79  in the article)
  PACS2          5676    127
  PAO1            5568    68
  total          28943  1233


=== CBCB ===
all vs all genes: from the aligned ones ~ 1556 are CONTAINED|BEGIN|END (426 are PA14)


  1. 2007_1203_AMOSCmp-PAO1
=== Final assembly (v3) ===
  2. 2007_1203_AMOSCmp-PAO1-relaxed
  3. 2007_1204_AMOSCmp-PA14
  4. '''2007_1204_AMOSCmp-PA14-relaxed -> best'''
  5. 2007_1204_AMOSCmp-PA14-relaxed-noN : removed all reads that contain N's; same parameters as 2007_1204_AMOSCmp-PA14-relaxed => no big differences compared to 2007_1204_AMOSCmp-PA14-relaxed


  QC stats:
Merged some of the 120 AMOScmp contigs from v2 using velvet contigs.
Grouped overlapping contigs and used minimus to merge them.


   Assembly                              1              2              3              4
   Contig length stats:
  ======================================================================================
  [Info]
  REF                                PAO1          PAO1          PA14          PA14
  MINCLUSTER                            20            20            20            20
  MINOVL                                10              3            10              3
  MAJORITY                              70            50            70            50
    
    
   [Scaffolds]
  desc            #ctgs   min    max    mean             stdev           sum
  TotalScaffolds                        1              1              1              1
  final3          512    200    512638  13099.41          53526.8        6706902
  TotalContigsInScaffolds            4412          2197          3226          1828
  AMOScmp3       76      212    512638  82763.22          117199.28      6290005
  MeanContigsPerScaffold             4412          2197          3226           1828
  novo3           436    200    10493   956.18           1336.44        416897
  MinContigsPerScaffold              4412          2197          3226          1828
  MaxContigsPerScaffold              4412          2197          3226          1828
  TotalBasesInScaffolds            6000280        5985597        6137031        6127761
  MeanBasesInScaffolds            6000280        5985597        6137031        6127761
  MaxBasesInScaffolds              6000280        5985597        6137031        6127761
  N50ScaffoldBases                6000280        5985597        6137031        6127761
  TotalSpanOfScaffolds            6264430        6264430        6537674        6537674
  MeanSpanOfScaffolds              6264430        6264430        6537674        6537674
  MinScaffoldSpan                  6264430        6264430        6537674        6537674
  MaxScaffoldSpan                  6264430       6264430        6537674        6537674
  IntraScaffoldGaps                  4411          2196          3225           1827
   2KbScaffolds                          1              1              1              1
  2KbScaffoldSpan                  6264430        6264430        6537674        6537674
  2KbScaffoldPercent                  100            100            100            100
  MeanSequenceGapSize                  59            126            123           223
    
    
   [Contigs]
  final3-10KB+    49      10493   512638  127086.12        125757.98      6227220
  TotalContigs                        4412          2197          3226          1828
 
  TotalBasesInContigs              6620035        6602214        6791916        6782045
Comment:
  MeanContigSize                      1500          3005          2105          3710
   ~ 53-74 (out of 436) the velvet denovo that contigs align to PA14.
  MinContigSize                        33            33            33            33
  MaxContigSize                      21576          46268          61425          86811
   N50ContigBases                      3548          8478          6480          14280
    
    
   [BigContigs_greater_10000]
   Length statistics:
   TotalBigContigs                      44           173           142            229
                    #ctgs   min    max    mean    stdev  sum
   BigContigLength                  547699       2709766       2127747       4558923
    CONTAINED        53      204    7228    514.81  978.25  27285
   MeanBigContigSize                  12448          15663          14984          19908
    ALL              74      204    7228    622.85  926.13  46091
   MinBigContig                      10018          10045          10022          10040
 
   MaxBigContig                      21576          46268          61425          86811
Locations:
   BigContigsPercentBases                8            41            31            67
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/final3/
  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/nucmer/merge/
 
== Simulations ==
 
=== First ~ 100KB from PA14, PAO1 ===
 
 
=== All PA14, PAO1 ===
 
Uniform read error : 1%
 
Repetitive regions in PA14 (100 bp found my repeat-match)
          #rep.  min    max    mean    median  n50    sum
  PA14    282    100    5571    340    171    520    95919
 
AMOScmp
  PA14=sample
  PAO1=reference
----
 
'''Conclusions:''' (results are below)
 
* more 33bp reads align with soap
* more 50bp reads align with nucmer
* slightly more reads align to the reference if the errors have exponential than uniform distribution ; no much impact on the assembly
 
* AMOScmp 50bp read assemblies have fewer ctgs than the 33bp read assemblies
 
* 33bp reads: more errors(snps & breaks) in the AMOScmp soap than nucmer assembly
* 50bp reads: more errors(snps & breaks) in the AMOScmp nucmer than soap assembly
 
* more snps & breaks in the non-unique regions => many introduced by repeats
* more indels (than substitutions) in the unique regions
* more deletions from the reference
* the indels are actually longer than 1bp; usually 5-20bp; occur where there are larger differences in the 2 genomes; they seem to be assembly errors
 
* there are fewer longer insertions in the qry than in the ref (for all PA14 33bp, PAO1 33bp, PA14 50bp)
* decreasing the make-consensus alignment wiggle -w to 1 makes no much difference than -w 2
 
* if we incease the min% id for the alignments, fewer reads will align , more 1X cvg regions => more snps; these snps are mostly substitutions, not indels
 
==== Read alignments to PAO1 ====
 
Uniform read error : 1%
 
Alignment programs: nucmer & soap
* 33bp: min alignment=28bp both nucmer & soap
* 50bp: min alignment=45bp both nucmer & soap
 
 
        error    total    aligned(nucmer)  aligned(soap)      aligned(nucmer&soap)
  33bp  uniform  7528200  6631195(88.08%)  6851948(91.01%)    6602210(87.69%)
  33bp  exp.    7528200  6713822          6879256
  50bp  uniform  4968612  4524844(91.06%)  4497468(90.51%)
 
'''0cvg regions in PAO1:'''
 
                  #elem  min    max    mean    median  n50    sum
  33bp-nucmer    969    1      13696  245    22      1470    237643
  33bp-soap      543    1      18331  434    26      3262    235621
  50bp-nucmer    326    1      18663  716    96      3178    233556
  50bp-soap      453    1      19256  546    29      4840    247205  !!! largest values
 
==== PA14 33 & 50bp simulated reads ====
 
Uniform read error : 1%
 
Legend:
* uniq:contigs that align uniquely to the reference
* snps-I: substitutions only
 
33bp
                                #ctg    min    max    mean    median  n50    sum    snps    snps-I  snps/1KB  snps-I/1KB    breaks
  AMOScmp-PAO1.nucmer           1298    33      60570  4641    1342    13211  6023652 1655    915      0.274    0.151        57  *
  AMOScmp-PAO1.nucmer(uniq)      851    40      46537  3482    1167    9474    2963269 425    57      0.143    0.019        2
 
  AMOScmp-PAO1.soap              774    33      214591  7791    527    30892  6030156 2037    1209    0.337    0.200        85
  AMOScmp-PAO1.soap(uniq)        395    40      42671  3496    194    14876  1380824 189    29      0.136    0.021        2
 
50bp
                                #ctg    min    max    mean    median  n50    sum    snps    snps-I  snps/1KB  snps-I/1KB    breaks
  AMOScmp-PAO1.nucmer           366    50      214838  16466  1998    59036  6026608 1966    1190    0.326    0.197        101  *
  AMOScmp-PAO1.nucmer(uniq)      158    50      91686  4954    233    23389  782667  77      32      0.098    0.040        0
 
  AMOScmp-PAO1.soap              634    50      144624  9492    2302    28330  6018087 1927    1112    0.320    0.184        60
   AMOScmp-PAO1.soap(uniq)       349    50      62962  4705    1149    15406  1641990 133    13      0.080    0.007        0      !!! fewest snps
 
                snps    snps-I  breaks
  *common       838    441    29
 
----
  PA14: 6537648   
 
                                                                    all              common
                                                        Coverage  snps  snps-I      snps  snps-I 
  --------------------------------------------------------------------------------------------------
  sim.PA14/AMOScmp-PAO1/PA14-ctgs.filter-q              6003782    1655  915        1281  761
  sim.PA14.50bp//AMOScmp-PAO1.redo//PA14-ctgs.filter-q  6017813    1966  1190       1198  734
   common(last 2)                                        5996786
 
* The larger number of snps in the 50bp assemblies is an artifact of gap closing in 33bp assembly
* The number of snps in the 33bp covered regions goes slightly down if read length is increased
 
==== PA14 33bp simulated reads ====
 
Uniform read error : 1%
 
Legend:
* uniq : reads that seem to be unique have been filtered out
 
                          #elem   min    max    mean    median  n50    sum    snps    snps-I  breaks
   AMOScmp-PAO1            1298    33      60570  4641    1342    13211  6023652 1655    915      57*
   AMOScmp-PAO1.uniq      1383    33      60570  4297    1026    13192  5943384 1207    57      48
 
  AMOScmp-PAO1.soap      774    33      214591  7791    527    30892  6030156 2037    1209    85*
    
    
   [SmallContigs]
   velvet                  4145    45      25746   1577    529    4317    6536518 151    148      0
   TotalSmallContigs                  4368          2024          3084          1599
  SmallContigLength                6072336        3892448        4664169        2223122
  MeanSmallContigSize                1390          1923          1512          1390
  MinSmallContig                        33            33            33            33
  MaxSmallContig                      9964          9996          9977          9995
  SmallContigsPercentBases              92            59            69            33
    
    
   [Reads]
   minimus2                582    33      275737  11306  261    91069  6579905 1610    1056    69  # AMOS & velvet ctg phred score=30
   TotalReads                      8627900        8627900        8627900        8627900
  minimus2.qual          582    33      275730  11305  261    91069  6579772 1240    928      67  # AMOS ctg phred score=20; velvet ctg phred score=30
   ReadsInContigs                  6218126        6218255        6541160        6541364
  minimus2.pre            575    23      275737  11371  238    84812  6538289 1569    1131    34  # some of the AMOS ctg are broken by velvet ctg
   BigContigReads                    549658        2601986        2115038        4460608
 
   SmallContigReads                5668468        3616269        4426122        2080756
  AMOScmp-PACS2          1713    33      55026  3521    423    13041  6031518 1344    831    56    # more ctgs than in the AMOScmp-PAO1 assembly
   SingletonReads                  2409774        2409645        2086740        2086536
 
          snps    snps-I  breaks
  *common  1039    598      34
 
==== PA14 33bp simulated reads ====
 
Exponential read error : for a 33 bp read avg=~1%
 
Base 0 error: 1/(1.2**48-1)
Multiplied by 1.2 for next base ...
Base 48+ error: 1
 
  $ ~/bin/mutateSeq.pl -exp 48 -test
  0    0.000158259181534141
  1    0.000189911017840969
  2    0.000227893221409163
  ...
  30    0.0375669811375429
  31    0.0450803773650515
  32    0.0540964528380618
  ----------------------------------
  total 0.323
 
   47    0.833465215984612
   48    1.00015825918153
 
 
AMOScmp alignment
                                  #ctgs   min    max    mean    median  n50    sum    snps    snps-I  breaks  #reads-aligned
   AMOScmp-PAO1.exp                1396    33      55370  4315    1188    12277  6023140 1773    877    56      6713822(89.18%)
   AMOScmp-PAO1.exp.uniq          908    40      46548  3394    1054    9533    3081827 481    22      3      .
 
== Annotation ==
 
* ( 5769 genes /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/merge/gene_list.ptt ) OLD ?
*  5602 genes  /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/final/gene_list.ptt
*  512 contigs /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/final3/PAb1.final3.fasta
 
= Submission =
* [http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=5984A383-B4D9-4CDB-9E81-B2A6EF8EFE82 PAb1 genome project]
* [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=509633 Taxonomy]
* Genbank accession: ABKZ01000000
* [ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA001120/  SRA001120]
* [http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000186 PLoS aricle]
** ''Our final assembly contains one large scaffold with 76 contigs whose total length is 6,290,005 bp, with the longest contig at 512,638 bp. The 436 unplaced contigs, which should fit into the remaining gaps, represent sequence that is unique to PAb1 ... The final assembly thus consists of 512 contigs covering 6,706,902 bp ...''
** ''Our annotation of the PAb1 genome identified 5,602 protein-coding genes, ...''

Latest revision as of 15:07, 14 October 2010

Strain PA-b1

Data

CBCB

Data: 33 bp Solexa reads from 2 libraries
  lib     #reads      #reads(1+N's)   #reads(2+N's)
  s_1    4,105,993    37,933          18,771
  s_7    4,521,907    69,669          59,589
  total  8,627,900 
  ~ 43.55X cvg

Files:

 s_1_sequence.txt  # contains seq & qual; seq names: HWI% ; all reads are 33 bp
 s_1_0001_prb.txt  # 27679 lines of 4*36 values in the -40..40 range  
 s_1_tag.txt       # unique seq count in s_1_sequence.txt
 
 s_7_sequence.txt  # contains seq & qual; seq names HWU%  ; all reads are 33 bp
 s_7_0300_prb.txt
 s_7_tag.txt    
 
 s_1 reads have significantly fewer N's than s_7 reads !!!
 s_1 reads align better to ref than s_2 reads !!!
  
 Conflicts with avg qualities: 28.26 for s_1, 31.05 for s_2 !!!
 
 Base qualities:
 .       #elem           #elem0  #elem<0 min     mean                    max     sum
 s_1     135497769       1863585 3231899 -5      28.2624918053079        40      3829504586
 s_7     149222931       1357384 3113806 -5      31.0508663175903        40      4633501282
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/

NCBI

Same Strains

Ranging from 6.2M to 6.9M

 PA14        CP000438       6,537,648 bp, 66.29 %GC : 5,892 protein coding genes  most similar to PAb1         
 PAO1        AE004091       6,264,404 bp, 66.56 %GC : 5,568                genes  rearrangement vs PA14        
 PACS2       AAQW01000001.1 6,492,423 bp, 66.33% GC : 5,676                genes  rearrangement vs PA14
 PA2192      .              6,905,121 bp, 65.99% GC : 6,157                genes  rearrangement vs PA14
 PAC3719     .              6,222,097 bp, 66.30% GC : 5,650                genes  no rearrangement vs PA14
 PA7         CP000744       6,588,339 bp            : 6,286                genes  no rearrangement vs PA14
 PALESB58    FM209186       6,601,757 bp  66% GC    : 5,925                                           (new)


All vs all nucmer alignments:

 #ref-qry                count   sum     avg%id  comment
 PA14-PAO1               404     6296632 98.66   ~510,582 bp (filter-q) ; ~ 921,919bp (filter-r) in PA14 not present in PAO1
 PA14-PACS2              462     6303773 98.56
 PA14-PA2192             532     6248297 98.47
 PA14-PAC3719            604     6129432 98.44
 PA14-PA7                822     5899920 93.72

Incomplete strains details(Broad):

 Contig length stats:
 desc  #contigs   min     max     mean    stdev   sum
 C3719 124        2079    242903  49572   53770   6146998
 2192  82         2087    398738  83246   88681   682625

5.7Kbp repeat: contains 16S rRNA gene; identical in all the strains

 Coordinates:
 
 PA14:
 732540   738302  +
 4951956  4957714 -  
 5535975  5541728 -
 6312434  6318199 -
 PAO1:
 721550   727325  +
 4788516  4794273 -
 5264042  5269801 -
 6039515  6045289 -
 PA7:
 806558   812299  +
 4982874  4988600 -
 5566182  5571924 -
 6353418  6359148 -
  Pa's vs Pa's (nucmer) 

Same Genus, same Group, different Species: Pseudomonas mendocina ymp (Pm)

 Name           Length    %GC    Info
 NC_009439.1    5072807   64.68  4594 genes
 
 PA14 vs Pm (nucmer)
 
                                  #elem   min     max     mean    median  n50     sum      avg%id
 Alignments(1con-1con  : nucmer)  1281    75      19618   1256    1011    1548    1609296  85.26
 Alignments(genes-genes: nucmer)  ...                                 1326 PA14 genes align to 1306 Pm genes (85.36% ID)                                                        
 Alignments(genes-genes: promer)  ...                                 2924 PA14 genes align to 2957 Pm genes (68% ID,75% SYM)                                                             

Same Genus, different Species: Pseudomonas syringae pv. tomato str. DC3000 (Ps) : chomosome+2plasmids

 Name           Length    %GC    Info
 NC_004578.1    6,397,126 58.40  chromosome       5478 genes
 NC_004633.1    73,661    55.15  plasmid pDC3000A
 NC_004632.1    67,473    56.17  plasmid pDC3000B
 total          6,538,260        
 
 PA14 vs Ps (nucmer) 
 
                                   #elem   min     max     mean    median  n50     sum      avg%id
 Alignments(1con-1con  : nucmer)   499     75      9435    1296    1161    1619    646614   84.79
 Alignments(genes-genes: nucmer)   438     161     4446    1090    996     1281    477229   83.38      425  PA14 genes align to 418  Ps genes (83.38% ID)
 Alignments(genes-genes: promer)   6193    114     13026   1159    1011    1359    7180527  67.36      2556 PA14 genes align to 2644 Ps genes (67% ID, 75% SYM)

Publications

 * 5,021 genes that are conserved across all five genomes analyzed, with at least 70% sequence identity (core genome)
 * Of these, >90% of the genes share at least 98% sequence identity
 
 * The genome for each strain carries a relatively modest number of unique sequences, 10% or less of the number in the core genome, 
   ranging from 79 in C3719, to 507 in PA2192. 
 * Less than 100 genes are unique to pairs of genomes, whereas the number decreases significantly when comparison is carried out among 
   sets of three or four genomes.
 
 * The overall architecture of the pangenome of P. aeruginosa an be represented as a circular chromosome with dispersed polymorphic 
   strain-specific segments, flanked by conserved genes referred to as anchors 
 * These strain-specific segments are regions of genomic plasticity (RGP) and include any region of at least four contiguous ORFs 
   that are missing in at least one of the genomes analyzed.
 * The number of RGPs in individual genomes varies from 27 to 37. A total of 52 RGPs were identified in the five genomes analyzed 

Read trimming

No trimming

If no trimming is done and the default AMOScmp pipeline is used (ALIGNWIGGLE=15), the contigs end up containing reference duplications Make-consensus will also take a very long time to run

Duplication causes:

  1. Only the reference nucmer alignment coordinates are stored in the bank and used by casm-layout and make-consensus; the read alignment coordinates are not stored, just the read clear ranges (the ones given in the .afg file); read positions in the assembly layout are approximate
  2. reads can be shifted from their original alignment position by make-consensus by up to 2^3*15=120 bp (constants ALIGN_WIGGLE=15; MAX_ALIGN_ATTEMPTS=3)
  3. The Pseudomonas_aeruginosa reference contains multiple 2 copy 5-10 bp kmers, adjacent within a few dozen bp of each other; reads starting with these kmers align in 2 ways to the reference; if the reads contain errors or SNP's, the 2nd (shorter) alignment to the greedyly built consensus is chosen over the correct one and a duplication is introduced

Quality trimming

Art's script: qual-trim.awk

Min_Qual=15

 .               #elem   #elem0  #elem<0 min     median  max     sum             mean    stdev   n50
 qualTrim        8627900 1266    0       0       23      33      193999902       22.49   7.43    26

!!! Avg clr drops from 33 to 22 bp !!! Duplications still exist though at a lower rate

Alignment based trimming (nucmer)

PA14 is the most similar strain: more PAb1 reads aligned to PA14 than to PAO1 or PA7

 grep -c "^>" PA14-Pa.l-17_c-17.delta PAO1-Pa.l-17_c-17.delta PA7-Pa.l-17_c-17.delta
 PA14-Pa.l-17_c-17.delta:7261629
 PAO1-Pa.l-17_c-17.delta:7036243
 PA7-Pa.l-17_c-17.delta:5393144

Read mapping

PA14 reference

Commands:

blat -noHead -t=dna  -q=dna  -tileSize=10 -stepSize=3 Pa.1con    Pa.seq    Pa.blat
nucmer.pl    -l 20,16,12 -c 20,16,14                  Pa.1con    Pa.seq -p Pa.nucmer
rmap         -m 3 -w 33                            -c Pa.1con    Pa.seq -o Pa.rmap
soap         -v 5                                  -d Pa.1con -a Pa.seq -o Pa.soap
#soap        -v 2 -s 12 -g 3                       -d Pa.1con -a Pa.seq -o Pa.soap

Results:

         runTime                                 hits(all)       hits(>=30bp)
 blat:   1038.28u 3.696s 17:49.00 97.4%          7004925         6353293
 nucmer: .                                       7553734         6832230         # 3 itterations were run for nucmer -l 20,16,12 -c 20,16,14; delta-filter -l 20                        
 rmap:   7318.77u 5.770s 2:02:11.29 99.9%        6958043         6958043
 soap:   685.208u 3.971s 12:45.68 90.0%          7310025         7310025         # fastest, most hits
 all:    .                                       7638420         7201727

Soap seems the "best" read mapping program at current moment.

 #number of hits/read
 cat Pa.soap.5 | awk '{print $4}' | count.pl
 #       total
 1       7185054  # most reads align in only one place
 2       83418
 3       8706
 4       32277
 5       291
 ...
 Total   7310025
 #number of misamtches
 awk '{print $10}' Pa.soap.5 | count.pl
 #       total
 0       3708583   # that many reads align exactly to the reference
 1       2025346
 2       908998
 3       395002
 4       181326
 5       90770
 # mismatch % (Pa's,PA14,Staphylococcus aureus,Strep suis)
 Media:Soap.mismatch.percent.png
 Media:Soap.mismatch_total.percent.png Pa's & Ss profiles are very similar; Sa has the best mappings

Pm reference

 Reads aligned: 1182734
 cat Pm.soap | awk '{print $10}' | count.pl | sort -n
 #number of misamatches
 0       48396
 1       117192
 2       207369
 3       267275
 4       279223
 5       263279
 Total   1182734

Ps reference

 Reads aligned: 632459
 cat Pm.soap | awk '{print $10}' | count.pl | sort -n
 #number of misamatches
 0       16442
 1       36440
 2       81147
 3       136472
 4       176779
 5       185179

Sequence Assemblies

Comparative

PA14 ref assembly

 Location: 2008_0109_AMOSCmp-PA14-relaxed-17-nucmer-redo2
 Command:   AMOScmp Pa -D MINCLUSTER=17 -D MINMATCH=17 -D MINOVL=5 -D MAJORITY=50 -D ALIGNWIGGLE=2
  1. a align all reads to the reference using nucmer. I initially used minmatch=17, mincluster=17 (nucmer -c 17 -l 17)
   8.62M reads, 4.10M HWI 5.32M HWU
   8.30M (...) total reads align to the reference (nucmer -c 14 -l 14)
   7.59M (...) total reads align to the reference (nucmer -c 16 -l 16)
   7.25M (...) total reads align to the reference (nucmer -c 16 -l 16; delta-filter -l 20)
   7.26M (84.16%, 83.50% HWI, 84.76% HWU) total reads align to the reference (nucmer -c 17 -l 17)
   6.54M (75.82%, 74.49% HWI, 77.02% HWU) total reads align to the reference (nucmer -c 20 -l 20)
   5.17M (...) total reads align to the reference (nucmer -c 28 -l 14)
   5.32M (73.27%) reads aligned on their full length (as opposed to 0.94M 33bp quality untrimmed reads)  (nucmer -c 33 -l 17)
   3.69M (42.79%, 38.24% HWI, 46.92% HWU) align exactly (33 bp, 100%id)
   7.00M align using blat -t=dna -q=dna -stepSize=5 -tileSize=10 (20bp alignments, max 3 mismatches)
   6.35M align using blat -t=dna -q=dna -stepSize=5 -tileSize=10 (30bp alignments filtered from above results)
  
   7.11M align using blat -t=dna -q=dna -stepSize=3 -tileSize=10 (20bp alignments, 3 mismatches) 
   500K the unaligned ones were aligned by blat -t=dnax -q=dnax -stepSize=2 -tileSize=4 (20bp alignments, 6 mismatches; only 15% of the aligned reads are unique) 
   => 7.61M aligned by blat/blatx
   6.95M aligned by rmap (-m 3)
   7.30M align by soap (-v 5 -s 12 -g 3) to PA14 chromosome
   6.64M align by soap (-v 2 -s 12 -g 3) to PA14 chromosome
 
   7.24M align by soap (-v 2 -s 12 -g 3 -c 42) to PA14 chromosome (iterative trimming)
   6.72M align by soap (-v 1 -s 12 -g 3 -c 41) to PA14 chromosome (iterative trimming)
   6.36M align by soap (-v 5 -s 12 -g 3) to PA14 genes
   7.43M align by maq
  1. b align all unaligned reads to the reference using nucmer. minmatch=14, mincluster=14 (-c 14 -l 14)
  2. Combine 1a & 1b
  3. trim reads according to their nucmer alignment coordinates; don't trim the ones adjacent to zero cvg regions
  4. assemble them using the AMOScmp pipeline(ALIGN_WIGGLE=2 instead of 15)
Contigs stats:
 desc    #elem   min     max     mean            stdev           sum   singletons
 all     2053    17      170485  3011.84         11917.53        6183320   1127399
 200     428     203     170485  14262.09        22852.74        6104175
 10K     157     10240   170485  35468.89        26531.33        5568616
 
 Data accuracy
 Get all assembled bases with coverage>=20
 count_bases=5926977
 sum_bases=235619001
 sum_errors=2455670
 sum_errors/sum_bases=2455670/235619001=0.01042=1.042% error

SNP's in the 157 >10K contigs: T<->C & A<-G> the most common

  PA14 PAb1    count
   T   C       6153
   C   T       6100
   G   A       6075
   A   G       6021
   G   C       1357
   C   G       1309
   G   T       859
   C   A       846
   T   G       840
   A   C       836
   A   T       338
   T   A       308
   G   .       241
   C   .       196
   .   C       182
   .   G       152
   A   .       101
   T   .       88
   .   T       74
   .   A       70
  Total        32146

Problems:

  1. Some of the short ctgs are not "real"
  2. Some misassemblies due to loose assembly params (MINOVL=5 MAJORITY=50 ): Example: pos 6700

PA14 ref assembly (2) *

run 3 nucmer iterations

Parameters:

 MINCLUSTER=20,16,14   # nucmer -c
 MINMATCH=  20,16,12   # nucmer -l
 MINLEN=20       # delta-filter -l
 MINOVL=10       # casm-layout -o ; previously 5, allowed for missasemblies
 MAJORITY=70     # casm-layout -m ; previously 50 to reduce the number of negative gaps

Read CLR's:

 33      5510573
 32      471289
 31      456959
 30      412738
 29      136423
 28      120148
 27      104213
 26      60338
 25      54828
 24      44854
 23      38574 # smallest value !!!
 22      43302
 21      49923
 20      49572
 Total   7553734
Contig stats:
  desc     #elem   min     max     mean    median  n50     sum     singl   breaks
  all      1843    20      127984  3350    63      30995   6174765 1107200 9
  200bp+   542     200     127984  11245   3263    31106   6094904

Alignment of contigs >=33bp to PA14 reference; qry_hits

 count   totalLen  %id
 940     6130875   99.33       !!! 99.33%id to PA14 vs Pab1

Alignments of PA14 genes to the contigs

 show-coords ctg-genes.filter-1.delta | awk '{print $20}' | count.pl
 #               total
 [CONTAINS]      5318
 [END]           198
 [BEGIN]         191
 [CONTAINED]     190
 [PARTIAL]       88
 [IDENTITY]      3

Location:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0424_AMOSCmp-PA14

PA14 ref assembly (used blat instead of nucmer for read alignment)

run 2 blat iteration:

 blat -t=dna  -q=dna  -stepSize=3 -tileSize=10 Pa.1con Pa.seq 
 blat -t=dnax -q=dnax -stepSize=2 -tileSize=4  Pa.1con Pa.singletons.seq 

Contig stats:

         #ctgs   min     max     mean    median  n50     sum     singletons
 all     1372    20      176323  4517    64      47563   6197771 1049302
 200bp+  370     201     176323  16584   3929    47563   6136095
 10Kbp+  146     10097   176323  39176   30457   51554   5719752

Location:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0613_AMOSCmp-PA14-blat
 !!! Best AMOScmp assembly so far : 
 * fewer contig than when using nucmer; 
 * largest contig
 * largest contig sum

PA14 ref assembly (used soap instead of nucmer for read alignment)

  • soap parameters: -v 5 -g 3 -s 12
 Contig stats:
         #elem   min     max     mean    median  n50     sum
 all     1228    33      132817  5022    106     33779   6167051   !!! smaller assembly  6.16M vs 6.19M with blat
 200bp+  478     200     132817  12789   3824    33988   6113109
 10Kbp+  303     1000    132817  19927   12470   34129   6037852

Location: /fs/szasmg3/dpuiu/Pseudomonas_aeruginosa/Assembly/2008_0625_AMOSCmp-PA14-soap/

  • soap parameters: -v 2 -g 3 -s 12 -c 42 :
  • iterative trimming of 2 bp from read ends
 Contig stats:
          #elem   min     max     mean    median  n50     sum
 all      1932    33      79489   3190    153     16122   6162793
 200bp+   874     200     79489   6960    3195    16145   6082694
 10Kbp+   202     10048   79489   20745   17422   21640   4190588

  params                #ctgs  #breaks
  v2.g3.s12.OVL16       2655   34
  v2.g3.s12.OVL20       3446   27
  v2.s12.g3.c42.OVL10   1932   64
  v5.g3.s12.OVL10       1228   83
  v5.g3.s12.OVL3        953    89

v5.g3.s12.OVL10

 all
         #elem   min     max     mean    median  n50     sum       breaks
 PA14    1228    33      132817  5022    106     33779   6167051   83
 PA2192  1784    33      127510  3477    191     16665   6203818
 PAC3719 1335    33      46247   4424    968     14667   5905499
 PACS2   1285    33      87356   4746    165     25837   6098413
 PAO1    969     33      103523  6210    293     25135   6017855
 
 200bp+
         #elem   min     max     mean    median  n50     sum
 PA14    478     200     132817  12789   3824    33988   6113109
 PA2192  880     200     127510  6974    3011    16816   6137484
 PAC3719 838     200     46247   7002    3789    14875   5868046
 PACS2   606     200     87356   9976    3842    25961   6045393
 PAO1    520     200     103523  11510   5601    25135   5985190

PAO1 ref assembly

Same method except that 1.b was not used

 Contigs stats:
 
 desc    #elem   min     max     mean            stdev   sum
 all     2797    17      75626   2161.19         5812.2  6044851
 200     865     200     75626   6893.96         8766.63 5963278
 10K     204     10016   75626   19016.22        10368   3879309
 
 Singletons: 1592525

Location:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0124_AMOSCmp-PAO1-relaxed-17-nucmer-redo2

PAO1 ref assembly (2)

Parameters:

 MINCLUSTER=16   # nucmer -l
 MINMATCH=16     # nucmer -c
 MINLEN=20       # delta-filter -l
 MINOVL=10       # casm-layout -o ; previously 5, allowed for missasemblies
 MAJORITY=70     # casm-layout -m ; previously 50 to reduce the number of negative gaps

Contig stats:

 desc    #elem   min     max     mean    stdev   sum
 all     1961    20      46414   3065    5251    6010800
 200bp+  1166    203     46414   5108    6007    5955960
 1758777 singletons

Alignment of contigs >=33bp to PAO1 reference; qry_hits

 count   totalLen  %id
 1471    5984905   99.06       !!! 99.06%id PAO1 vs Pab1

Location:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0424_AMOSCmp-PAO1

PA14 & PAO1 merge

Use minimus to merge all contigs in the PA14 & PAO1 reference assemblies

Filter contigs that contain at least 2 adjacent PA14 merged by PAO1

 desc       #elem   min     max     mean            stdev           sum
 all        1850    17      236472  3400.31         16863.79        6290586
 200        306     204     236472  20318.3         37147.91        6217401
 10K        113     10520   236472  52647.45        45602.76        5949162
 
 Singletons: 1066226

De novo assembly of singletons using SSAKE Input: 1066226 singletons

 Ssake version 3.0 run with default parameters:
 -m  Minimum number of reads needed to call a base during overhang consensus build up (default -m 16)
 -o  Minimum number of reads needed to call a base during an extension (default -o 2)
 -r  Minimum base ratio used to accept a overhang consensus base (default -r 0.6) 
 Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 all     19795   34      2866    67.78   102.73  1341825
 200     879     200     2866    416.72  308.92  366304
 200 bp contigs stats
 
 desc.           #elem   min     mean    max     sum
 contig_len      879     200     416.72  2866    366304
 contig_reads    879     52      414.21  3157    364091
 contig_cvg      879     7.91    31.58   146.52  27764.51 
 
 Singletons:  702007
 Find contigs overlapping 12 bp or more and merge them using EMBOSS merger program:
 
 desc            #elem   min     max     mean    stdev   sum
 original        879     200     2866    416.72  308.92  366304
 new             670     200     4826    539.32  579.26  361350

Pa-b1 AMOScmp reference assembly

 Contig stats:
  
 desc                           #elem   min     max     mean      stdev           sum
 input(reference|final)         770     200     512638  8484.21   37210.58        6532844
 output(AMOScmp contigs)        936     21      260827  6988.98   22288.49        6541689
 
 Singletons: 1257963
 Snp's:      1941

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0215_AMOScmp-PAb1

Use ALL strains as reference

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0228_AMOSCmp-ALL

Steps:

 * Identify related strains (completed & draft assemblies): PA14, PAO1, PA7, PACS2, PA2192, PAC3719
 * Download & merge related strains into a multi-FASTA file
 * Align PAb1 reads to multi-FASTA file using nucmer (Ex: -l 17 -c 17)
 * Count the number of PAb1 reads aligned to each strain
    strain   #reads_aligned
    ======================
    PA14    7261629
    PA2192  7193960
    PACS2   7089455
    PAO1    7036243
    PAC3719 6928744
    PA7     5393144
    ======================
 * Sort the strains in descending order according to the number of reads aligned to each one
 * The strain with most hits is PA14 => closest relative
 * Assuming the order is PA14, PAO1, PA7, PACS2, PA2192, PAC3719 ...
 * Align strains to one other; identify unique regions in each genome (regions which are not present in the previously considered ones)
 
   Unique regions length summary: 
   Table 1:
   reference                             #regions  min     max     mean    stdev   sum
   =======================================================================================
   PA14                                  1       6537648 6537648 6537648 0       6537648
 
   PAO1-PA14                             115     1       19263   2038.58 3843.93 234437
   PA7-PA14                              511     1       46020   1876.96 4872.51 959129
   PACS2-PA14                            147     1       90591   3054.19 8766.3  448967
   PA2192-PA14                           231     1       92553   3746.3  11475   865396
   PAC3719-PA14                          272     1       14058   1164.78 2089.85 316822
   =======================================================================================
   ALL                                   1277                                     9362399
   Table 2:
   reference                             #regions  min     max     mean    stdev   sum
   =======================================================================================
   PA14                                    1       6537648 6537648 6537648 0       6537648
 
   PAO1-PA14                               115     1       19263   2038.58 3843.93 234437
   PA7-PAO1-PA14                           487     1       46020   1830.82 4910.86 891614
   PACS2-PA7-PAO1-PA14                     71      2       79031   4112.94 10618.2 292019
   PA2192-PACS2-PA7-PAO1-PA14              181     1       92066   3439.9  9500.07 622622 #some N's maybe?
   PAC3719-PA2192-PACS2-PA7-PAO1-PA14      173     1       11425   998.16  1888.94 172682 #some N's maybe?
   =======================================================================================
   ALL                                     1028    2       6537648 8513.66 203929  8752049
   Table 3:
   reference                             #regions  min     max     mean    stdev   sum
   =======================================================================================
   PA14                                    1       6537648 6537648 6537648 0       6537648
   
   PA14-PAO1                               123     1       43690   4009.28 7903.16 493142
   PA14-PA7                                516     1       31464   1768.16 3534.88 912375
   PA14-PACS2                              146     2       43690   3259.15 6129.12 475836
   PA14-PA2192                             222     1       34854   2189.76 4338.6  486127
   PA14-PAC3719                            277     1       43690   2206.38 5214.55 611170
   =======================================================================================
 * Extract unique regions(Table 2) and add them to the most similar genome
 * Assemble using all these regions as reference => 10,201 contigs; 1,052,178 singletons
   reference                        #ctgs   min     max     mean    stdev           sum
   =======================================================================================
   PA14                             2056    17      57476   2992.27 7078.25         6152122 
   PAO1                             877     17      8159    95.16   482.74          83456
   PA7                              4723    17      6822    41.7    128.87          196996
   PACS2                            914     17      3346    66.96   148.47          61207
   PA2192                           1528    17      49826   134.88  1734.19         206109
   PAC3719                          103     17      693     42.11   85.89           4338
   =======================================================================================
   ALL                              10201   17      57476   657.21  3457.1          6704228
 
 * Compare ALL assembly contigs to PA14 ref assembly :
   reference                        #ctgs   min     max     mean    stdev           sum        #singletons
   ========================================================================================================
   ALL                              10201   17      57476   657.21  3457.1          6704228    1052178 
   ALL(200bp+)                      1020    200     57476   6235.17 9220.76         6359881
   ALL(PA14)                        2056    17      57476   2992.27 7078.25         6152122 
   PA14                             1984    17      82196   3103.21 8274.53         6156788    1375004
   ========================================================================================================
   ALL-PA14                         8217     ......                                 547440                 # 0.5M extra bp
   ALL(PA14)-PA14                   72       ......                                 -4666                  # 4666 bp which were originally mapped to PA14 are now mapped to other strains
  * Compare ALL assembly contigs to ssake 200bp+ contigs :
 
   ssake 200bp+ contigs:
     strain    #ctgs   min     max     mean    stdev   sum
     PAb1      879     200     2866    416.72  308.92  366304
   ALL vs ssake alignments:
    desc    #elem   min     max     mean    stdev   sum
    ALL     265     23      49826   912.9   4187.17 241919
    ssake   306     200     2358    379.85  257.04  116235
  ALL vs ssake CONTAINED alignments:
    desc    #elem   min     max     mean    stdev   sum
    ALL     164     23      49826   1264.71 5268.16 207413
    ssake   260     200     2358    368.09  259.02  95704
  ~300 out of 879 ssake contigs were assembled using this method.
  The ALL assembly contigs ssake contigs align to are longer than the ssake contigs!
  * Compare ALL assembly contigs to final de novo contigs :
       id                          len         #de novo ctgs it CONTAINS
  ======================================================================
    1  2361_PA2192_2249428_2341494 49826       52
    2  2360_PA2192_2249428_2341494 42258       49
    3  2364_PA2192_2346558_2361947 15388       17
    4  10190_PAO1_747565_761293    8159        7
    5  2429_PA2192_2405271_2412668 6103        7
    6  3259_PA2192_6656583_6693732 2594        6
    7  2367_PA2192_2374235_2379148 4914        5
    8  6917_PA7_4563267_4570077    6822        5
    9  2958_PA2192_4496915_4500004 3090        4
   10  2375_PA2192_2383706_2386485 2789        3
   11  2366_PA2192_2366515_2369415 2901        2
   12  6919_PA7_4579629_4582081    1085        2
   13  5364_PA7_3138605_3164438    1931        2
   14  8663_PACS2_3494057_3573088  449         2
   15  9322_PACS2_982458_983976    1519        2
   16  2368_PA2192_2379432_2380661 1230        2
   17  2716_PA2192_3268239_3292117 567         2
   18  9587_PAO1_2439014_2458239   3513        1
   ...
   49  8384_PAC3719_5517906_5518496 591        1
  ======================================================================
      Total                                    201

PA14 Sanger maq on all reads

 Contig stats:
 min_len    #contigs   min     max     mean     stdev           sum
 0          991        33      155551  6199.79  17445.05        6143996  !!!Smaller assembly size than with AMOScmp
 200        368        200     155551  16581.7  25475.92        6102067
 10K        149        10048   155551  38164.35 28513.51        5686489

1197385 singletons; SNP's in the contigs: T<->C & A<-G> the most common

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0206_PA14_maq

PA14 gene ref assembly

  • Use the 5892 PA14 genes as reference
  • Use "soap -v 5 -s 12 -g 3" for alignments

=> 5671 scaffolds (that many genes are shared by the 2 strains)

contig stats:

         #elem   min     max     mean    median  n50     sum   
 all     6487    33      13070   854     752     1164    5541032
 200bp+  5666    200     13070   965     840     1174    5470492

Comparative-ABBA

ABBA

Steps

  1. Annotate draft assemby
  2. Find fragmented genes at ends of contigs. Example shown is of putative secretion protein from PAO1
  3. Fill in missing amino acid sequence using reference protein
  4. Merge contigs together with output sequence from ABBA

Pipeline

  1. Input reference amino acid sequence and database of short reads
  2. Search read database (tblastn)‏
  3. Create layout using alignment coordinates
  4. Validate good read coverage across entire sequence
  5. Make consensus from a multiple alignment

PA14:PO1

  • 5892 genes in PA14 vs 5568 in PAO1
  • 5244 genes in PA14 IDENTITY (99.03%) to 5239 genes in PAO1
  • 5379 genes in PA14 IDENTITY|CONTAIN|BEGIN|END (99.003%) to 5367 genes in PAO1
  • 5389 genes in PA14 align to 5372 genes in PAO1 (98.69%id)
  • 501 genes in PA14 & 196 genes in PAO1 don't align to opposite strand genes

!!! The genes that align are very conserved

PA14 gene ref assembly

 desc            #elem   min     max     mean    stdev   sum
 genes(nt)       5892    72      15639   992     752     5846175  # out of a total of 6537648bp genome => 89.42 coding
 genes(aa)       5892    23      5212    330     251     1942833

PAO1 gene ref assembly

 desc            #elem   min     max     mean    stdev   sum
 genes(nt)       5568    72      16884   1006    751     5601468  # out of a total of 6264404bp genome => 89.41 coding
 genes(aa)       5568    23      5627    334     250     1861588

De-novo

SSAKE

version 3.0 with default parameters

 Contigs stats:
 
 desc    #elem   min     max     mean    stdev   sum
 length  185030  34      5490    77.21   141.23  14287079
 reads   185030  2       13352   29.52   127.65  5463405 
 
 Singletons: 3,164,495  (singletons file + ambiguous reads)

Run times:

 18459.472u 31.849s 5:08:59.82 99.7%	0+0k 0+0io 0pf+0w

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0214_ssake/

Latest version is 3.2: run on the same dataset generated fewer contigs and more singletons; also took longer to run

SSAKE (redo)

version 3.2.1 with default parameters

 Contigs stats:
 
 desc    #elem   min     max     mean    median  n50     sum
 length  107046  34      3994    96.39   42      201     10317932
 reads   ...
 
 Singlets: 3,309,974

Velvet *

Version 0.5.05 Ovl len=23

Commands:

 host: sycamore (Opteron)
 velveth . 23 Pa.seq 
 velvetg . -read_trkg yes

Run times:

 velveth : 88.614u  22.865s 2:16.35  81.7%  0+0k 0+0io 0pf+0w
 velvetg : 439.939u 19.258s 8:19.77  91.8%  0+0k 0+0io 0pf+0w
 total   : 528.553u 42.123s 10:36.12

Output files (used)

 contigs.fa : contig FASTA file
Contigs stats:
                #elem   min     max     mean    median  n50     sum
  all           10684   45      16239   640     363     1184    6841458
  200bp+        7382    200     16239   877     586     1252    6474426
  breaks        257  (-b 20) 
  breaks(200+)  251  (-b 20) coming from 170 ctgs
  1,273,164 singletons; 879,840 of them aligned by soap (-v5,-s12,-g3) to PA14

Distance between breakpoints:

 #elem   min     max     mean    median  n50     sum
 251     0       314620  24946   3776    99406   6261404

Align velvet ctgs to PA14 ref (nucmer -c 40; delta-filter -q):

 alignemnts  length  %id
 9927        6164312 99.20  (slightly lower %id than the AMOS PA14 ref assembly: 99.33%)

PA14 0cvg regions

                      #elem   min     max     mean    median  n50     sum
original              2235    1       31503   197     22      7529    439633  # 6098014 of PA14 covered
delta-filter -q       2378    1       31676   220     24      6434    522246  # 6015401 of PA14 covered


rRNA repeat is collapsed; Contigs aligned to the 4 copy 5.7Kbp PA14 rRNA repeat:

     [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
 ===============================================================================================================================
       23      409  |        1      388  |      387      388  |    97.94  |     5763      388  |     6.72   100.00  | rRNA        856     [CONTAINS]
      388     2132  |        1     1745  |     1745     1745  |    99.94  |     5763     1745  |    30.28   100.00  | rRNA        642     [CONTAINS]
     2111     2279  |      169        1  |      169      169  |   100.00  |     5763      169  |     2.93   100.00  | rRNA        3433    [CONTAINS]
     2258     5763  |        1     3506  |     3506     3506  |    99.89  |     5763     3506  |    60.84   100.00  | rRNA        260     [CONTAINS]

Question: Are there some velvet contigs contained in each other? Answer: Very few: 302 (out of 10K) ; avg %id is 86 and avg lenght is 106 bp; nucmer -c 20 has been used for alignments

Many contigs seem to overlap by ~ 20bp

Some contigs share end reads labeled as D(uplicates)

 listReadPlacedStatus Pa.bnk | awk '{print $3}' | count.pl
 #       total
 P       7322684
 S       1273164
 D       32052     # 33 of them belong to 3 contigs
 Total   8627900

About ~4607 contig pairs could be merged:

 grep D Pa.status.detailed | awk '{print $5,$6}' | sort -u | wc -l
 4607

Most of the overlaps are shorter than 10bp; if "nucmer -c 20 -l 20" is run on the contig ends; only 627 of them are found

Example:

 Contigs 1915 & 3551 share 15 Duplicate reads
 cat Pa.status | grep 1915 | grep 3551$ | wc -l
 15 

They align to adjacent pos in PA14 :
 3743092 3745646 |  2555  1  | 2555  2555  |    99.49  |  6537648     2555  |     0.04   100.00  | 1      1915    [CONTAINS]
 3745625 3746382 |   758  1  |  758   758  |    99.60  |  6537648      758  |     0.01   100.00  | 1      3551    [CONTAINS]

411870 out of 1273164 singletons (~33%) align to PA14 (nucmer -maxmatch -c 20) 304162 out of 1273164 singletons (~25%) are also singletons in the AMOScmp-PA14 assembly; probably the common ones are low quality sequences that just don't assemble

PA14 reference alignments: nucmer -maxmatch -c 40):

Aligned:

 #elem   min     max     mean    stdev   sum
 9859    45      15342   642     785     6331357

Completely aligned:

 #elem   min     max     mean    stdev   sum
 9681    45      11120   617     712     5969126
 9208 are aligned 1 time
 408  are aligned 2 times  #repeats
 39   are aligned 3 times  #repeats
 25   are aligned 4 times  #repeats
 1    is aligned  6 times  #repeat

Partially aligned:

 #elem   min     max     mean    stdev   sum
 178     78      15342   2035    2139    362231 
 
 120 are aligned once:           Pa.1con-velvet.uniq.not_contained.filter-q.1.coords 
 58  are aligned more than once: Pa.1con-velvet.uniq.not_contained.filter-q.2.coords
 All these contigs have AMOScmp assembly singletons aligned to them!!! 
 These velvet contigs seem correct.
 Example: There is a large insertion in PA14 at position 6700
 
 PA-b1 reads vs PA14 ref  
  find-query-breaks -B Pa.filter-q.delta
  => 3 reads end at base 6700
   
 velvet ctgs vs PA14 ref:
 ...
   6177     6700  |    1   524  |  524    524  | 99.43  |  6537648  875  | 0.01  59.89  | 1  900     # 431 out of 1054 seqs in this contigs are singletons in AMOScmp assembly
   6694     8045  |  743   2094 |  1352   1352 | 99.63  |  6537648  2097 | 0.02  64.47  | 1  1970    # 895 out of 2354 seqs in this contigs are singletons in AMOScmp assembly; unaligned regions aligns at low %id to a Pseudomonas fluorescens hypothetical protein
 Nucmer/Soap alignments of the reads adjacent to position 6700
 Media:PA14.coord.6700 : ~20 reads end at pos 6700, ~12 reads start at pos 6694 (6bp ovl only)
 Media:PA14.soap.v5.s12.g3.6700
 Media:PA14.soap.v2.s12.g3.6700 10 bp ovl between reads that start at pos 6671(2 mismatches) & 6694(0 mismatches)
 Media:PA14.soap.v2.s12.g3.c42.6700 11 bp ovl between reads that start at pos 6676(2 mismatches) & 6694(0 mismatches)
 Other breaks: 
 ...
  66764    67073  |    1   310  |  310    310  | 100.00 |  6537648  1282 | 0.00  24.18  | 1 320
  67181    67322  | 1136  1277  |  142    142  |  98.59 |  6537648  1282 | 0.00  11.08  | 1 320
 ...
 116429   117249  |    1   821  |  821    821  |  99.03 |  6537648  1457 | 0.01  56.35  | 1 1955
 120490   120573  |   84     1  |   84     84  | 100.00 |  6537648  1693 | 0.00  4.96   | 1 1827
...
 234641   235980  | 3294  1955  | 1340   1340  |  99.40 |  6537648  3294 | 0.02  40.68  | 1 139
 235981   237442  | 1466     5  | 1462   1462  |  99.32 |  6537648  3294 | 0.02  44.38  | 1 139
...
 286003   287915  | 1216  3128  | 1913   1913  |  99.22 |  6537648  3128 | 0.03  61.16  | 1 3068
...

Not aligned:

 #elem   min     max     mean    stdev   sum
 825     45      16239   618.3   1209.37 510101        # the longest velvet ctg does not align

Not CONTAINED:

 #elem   min     max     mean    stdev   sum
 1002    45      16239   868.82  1518.56 870566

Velvet vs all Pseudomonas strains:

 strain          aligned         CONTAINED       partially-aligned    
 PA14            9859            9681            178                # smallest number
 PA2192          9940            9607            333
 PACS2           9688            9469            219
 PAO1            9587            9402            185
 PAC3719         9427            9057            370
 PA7             8514            7651            863
 all             10378           10243           135
 not aligned stats: 
 #elem   min     max     mean    median  n50     sum
 306     45      9681    549     119     2310    168094
 strain(s)                               CONTAINED
 PA14                                    9681
 PA14|PA2192                             10081
 PA14|PA2192|PACS2                       10186
 PA14|PA2192|PACS2|PAO1                  10198
 PA14|PA2192|PACS2|PAO1|PAC3719          10210
 PA14|PA2192|PACS2|PAO1|PAC3719|PA7      10243

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24

Velvet version 0.5.07 : does not "seem" to be an improvement over Version 0.5.05 Ovl len=23

 desc              #elem   min     max     mean    stdev   sum
 contig            11230   45      16239   611.64  814.09  6868784
 contig(200bp+)    7353    200     16239   878.61  897.06  6460467

Some of the 306 contigs should probably assemble together:

  • Assembled with minimus2 REFCOUNT=0 OVL=20 => 54 ctgs that contain 132 reads & 174 singlets
  • => reduced the number from 306 to 54+174= 228

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/no_hits


!!! If AMOScmp is run only on the velvet singlets (mostly low quality), some of the resulting contigs will have "artificial" gaps

         #elem   min     max     mean    median  n50     sum
 all     43926   33      963     122     93      157     5367766
 200bp+  6757    200     963     293     263     289     1980273

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14.singletons/1con-contigs.filter-q.na.coords

Edena

Version 2.1.1 , default parameters

 Contigs stats:  
               #elem   min     max     mean    median  n50     sum
 all           11180   100     11300   552     356     837     6175460
 200bp+        8316    200     11300   693     488     902     5759209
 breaks(PA14)        245 (-b 20)   # 78 of them also present in velvet
 breaks(velvet)      210 (-b 20)
 
 3955865 singletons

Contig don't seem to overlap much

Run times:

 ovl step:      1655.561u 7.560s  27:54.62 99.3%   0+0k 0+0io 0pf+0w
 assembly step: 56.141u   3.791s   1:03.82 93.9%   0+0k 0+0io 0pf+0w
 total:         1711.702u 11.351s 28:58.44     

A significant number of edena contigs CONTAIN one(several) velvet contigs:

 $ show-coords -I 94 velvet-edena.delta | egrep 'CONTAINED|BEGIN|END' | awk '{print $19}' | count.pl -m 1 | wc -l # 2269
 $ show-coords -I 94 velvet-edena.delta | egrep 'CONTAINED|BEGIN|END' | awk '{print $19}' | count.pl -m 2 | wc -l # 1455

Align edena ctgs to PA14 ref (nucmer -c 40; delta-filter -q):

       #elem   min     max     mean    median  n50     sum
 0cvg  8809    1       34854   119     38      438     1050135

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0312_edena/

SSAKE/Edena/Velvet summary

host: sycamore.umiacs.umd.edu (Linux 64)

 program  version   run-time
 SSAKE    3.0       8459.472u 31.849s 5:08:59.82  
 Velvet   0.5.05    528.553u  42.123s   10:36.12
 Edena    2.11      1711.702u 11.351s   28:58.44     

Input: 8627900 33 bp Solexa reads(2 lanes)

Output: contigs+singletons

Contig statistics:

 program #seqs   min  max    mean  med  n50   sum
 SSAKE   185030  34   5490   77    44   87    14287079
 Velvet  10684   45   16239  640   363  1184  6841458
 Edena   11180   100  11300  552   356  837   6175460
 Euler   7090    22   22671  953   312  2622  6759979

200bp+ contig statistics:

 program #seqs   min  max    mean  med  n50   sum
 SSAKE   12532   200  5490   486   385  549   6090567
 Edena   8316    200  11300  692   488  902   5759209
 Velvet  7382    200  16239  877   586  1252  6474426

Singleton statistics:

 program #seqs
 SSAKE   3164495
 Velvet  1273164 
 Edena   3955865

Euler-SR

 cmd:     Assemble.pl Pa.seq vertexSize
 version: 12/4/08

Ctg len stats:

 vertexSize       #elem   min     max     mean    median  n50     sum
 21               7090    22      22671   953     312     2622    6759979
 25               11130   26      9619    564     359     912     6276548

Gene assemblies

Dan Sommer reduced number of contigs >=200bp from 306 to 120 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/PAb1.fasta

 Contig stats:
 desc    #elem   min     max     mean    stdev   sum
 contigs 120     212     512638  51438   81999   6172675
 Average Gap: 105 nt bases
 Median Gap: 14 nt bases
 Largest Gap: 1095 nt bases

927 singletons assembled

Contig Assemblies

Comparative

AMOScmp on the velvet contigs (PA14 ref) *

  • Use only velvet ctgs
  • This assembly will have some of the repeats collapsed

All:

       #elem   min     max     mean    median  n50     sum
 ctg   3046    45      27957   1928    1101    3744    5873032
 singl 989     45      16239   836     292     2221    826462
 all   4035    45      27957   1660    850     3573    6699494
 breaks 270

200 bp+:

       #elem   min     max     mean    median  n50     sum
 ctg   2632    200     27957   2213    1390    3798    5824679
 singl 586     200     16239   1344    721     2352    787686
 all   3218    200     27957   2055    1243    3631    6612365
 breaks 265

Reduced the number of contigs from 10684 to 4035 !!!

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14

AMOScmp on the velvet contigs & velvet singletons (PA14 ref) *

  • Use singletones from velvet ctgs as well
  • This assembly will have some of the repeats collapsed
                   #elem   min     max     mean    median  n50     sum
  all              2383    33      177630  2592    72      45071   6177155
  200bp+           513     200     177630  11778   738     45351   6042047
  all-velvet       334     59      177630  17920   7295    45351   5985407  # ctgs that contain at least 1 velvet ctg
  singl-velvet     984     45      16239   837     292     2221    823862   # velvet ctgs left as singletons
  singl-read       393416  33      45      33      33      33      12983724 # original Solexa reads not assembled 

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14.singletons

AMOScmp on the velvet contigs (PAO1 ref)

All:

       #elem   min     max     mean    median  n50     sum
 ctg   2982    45      27957   1928    1124    3652    5749909
 singl 1255    45      16239   764     247     2275    958885
 all   4237    45      27957   1583    773     3440    6708794
 breaks(PAO1) 271

200 bp+:

       #elem   min     max     mean    median  n50     sum
 ctg   2610    200     27957   2186    1380    3675    5705769
 singl 707     200     16239   1283    674     2398    906818
 all   3317    200     27957   1994    1205    3494    6612587
 breaks(PAO1) 266

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PAO1

AMOScmp on the velvet contigs (PA14 & PA2192 ref)

Input: above contigs & singletons

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     2841    45      34374   2022    2570    5744186
 singl   751     45      17475   1251    2264    939443   # 593 are velvet original ctgs
 all     3592    45      34374   1861    2529    6683629
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     2489    200     34374   2291    2638    5701588
 singl   456     200     17475   2001    2648    912503
 all     2945    200     34374   2246    2641    6614091

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14-PA2192

AMOScmp on the velvet contigs (PA14, PA2192 & PACS2 ref)

Input: above contigs & singletons

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     2759    45      28263   2029    2554    5597067
 singl   725     45      34374   1495    2744    1084142  # 493 are velvet original ctgs
 all     3484    45      34374   1918    2603    6681209
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     2419    200     28263   2297    2618    5556096
 singl   463     200     34374   2290    3170    1060442
 all     2882    200     34374   2296    2714    6616538

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14-PA2192-PACS2

AMOScmp on the velvet contigs (PA14, PA2192, PACS2 & PAO1 ref)

Input: above contigs & singletons

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     2696    45      24046   2028    2510    5468683
 singl   753     45      34374   1609    3050    1211809  # 483 are velvet original ctgs
 all     3449    45      34374   1937    2643    6680492
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     2369    200     24046   2292    2569    5429268
 singl   487     200     34374   2439    3527    1187586
 all     2856    200     34374   2317    2756    6616854

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/AMOScmp-PA14-PA2192-PACS2-PAO1

AMOScmp/minimus on the AMOScmp (PA14 ref) & velvet "unique" contigs

Input:

 AMOScmp PA14 ref assembly (2) contigs
 velvet "unique: contigs
 !!! No singletons

AMOScmp2 steps:

  1. Align AMOScmp & velvet contigs (nucmer -c 40) to each other
  2. Identify velvet contigs CONTAINED in AMOScmp contigs(max trimming of 20)
  3. Remove CONTAINED contigs from the velvet set
  4. Create alignment file based to the AMOS.scaff file (if nucmer is repeat contigs are collapsed/no placed exactly)
  5. Align velvet contigs to reference using nucmer ; filter only the CONTAINED ones
  6. Merge AMOS & velvet alignments
  7. Run "casm-layout -S -r ..." ; if -S is not used, many AMOS contigs are not placed
  8. Run "make-consensus ..."
  • All AMOS input contigs should be part of the new assembly
  • If the velvet CONTAINED alignments are not filtered, some end conflicts might cause some of the AMOS contigs end up as singletons

Input:

 All:
 desc            #elem   min     max     mean    stdev   sum
 AMOS            1843    20      127984  3350.38 10776   6174765
 velvet-uniq     1189    45      16239   997.85  1485.34 1186452  # 462 aligned to PA14 ref (nucmer -c 40); 
                                                                  # 284 CONTAIEND in the ref
                                                                  # 178 (462-284) partially aligned : 
                                                                    # 58  have 2+ alignments 
                                                                    # 120 have 1  alignment
                                                                  # 727 (1189-462) not aligned to the ref
 all             3032    20      127984  2427.84 8530.33 7361217
 200bp+:
 desc            #elem   min     max     mean    stdev   sum
 AMOS            542     200     127984  11245   17520   6094904
 velvet-uniq     848     200     16239   1354.36 1627.86 1148499  # 422 aligned to PA14 ref (nucmer -c 40); 249 CONTAIEND in the ref; 426 not aligned to the ref
 all             1390    200     127984  5211.08 12019.6 7243403

Output:

 All:
 desc            #elem   min     max     mean    stdev   sum
 ctg             1327    20      221554  4644.9  19079.  6163795    # two of them(530,531) contain only velvet seqs
                                                                    # 1210 contain only AMOS seqs ; 11 contain 2 AMOS seqs (negative gap got merged)
                                                                    # 115 contain both AMOS & velvet seqs        
 singl           817     45      16274   1053.94 1719.12 861076     # all but one(807) are velvet : see .conflict file
                                                                    # 170 out of 178 partially aligned (see above) is a singleton
 all             2144    20      221554  3276.52 15146.  7024871
 200bp+:
 desc            #elem   min     max     mean    stdev   sum
 ctg             282     203     221554  21651   36731   6105733
 singl           571     200     16274   1461.55 1917.62 834546
 all             853     200     221554  8136.31 23189   6940279

minimus steps:

  1. Run minimus2 pipeline on the above contigs(1327) and singletons(817) ; Only uses contig:singletion overlaps (not all:all)
 $ ~dpuiu/bin/minimus2 -D BREAKID=1327 -D OVERLAP=40 Pa 

min OVL=40 : use contig:singleton alignments

Output:

 All:
 desc            #elem   min     max     mean    stdev   sum
 ctg             180     135     253489  22817.2 45339   4107096   # 218 of the PA14 unaligned velvet contigs assembled
 singl           1377    20      153036  1996.79 10534   2749587
 all             1557    20      253489  4403.77 19467   6856683

 200bp+:
 desc            #elem   min     max     mean    stdev   sum
 ctg             174     210     253489  23598.3 45919   4106116
 singl           488     200     153036  5519.51 17154   2693524
 all             662     200     253489  10271.3 28846   6799640

min OVL=20 : use contig:singleton alignments (not significantlty better than ,on OVL=40)

Output:

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     279     63      253489  15557   37925   4340305 # 321 of the PA14 unaligned velvet contigs assembled
 singl   972     20      153036  2576    12252   2504000
 all     1251    20      253489  5471    21580   6844305
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     263     214     253489  16495   38868   4338061
 singl   393     200     153036  6261    18680   2460755
 all     656     200     253489  10364   28954   6798816


min OVL=20 : use contig:singleton & singleton:singleton alignments

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     277     103     253489  15983   37991   4427292  # 556 of the PA14 unaligned velvet contigs assembled
 singl   808     20      153036  2986    13439   2412439  # 260 of them are velvet original contigs
 all     1085    20      253489  6304    23111   6839731
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     263     222     253489  16826   38812   4425275
 singl   298     201     153036  7971    21242   2375456  # 165 of them are velvet original contigs
 all     561     201     253489  12123   31042   6800731

min OVL=20 : use contig:singleton & singleton:singleton alignments & adjacent contig ovelaps (best)

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     317     47      253489  16450   40806   5214579
 singl   703     20      112908  2324    11332   1633528  # 335 are original velvet contigs
                                                          # 29 are velvet contigs partially aligned
 all     1020    20      253489  6714    25448   6848107
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     291     200     253489  17909   42289   5211547
 singl   247     201     112908  6480    18431   1600539  # 186 are original velvet contigs
 all     538     200     253489  12662   33969   6812086

Locations:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet/AMOScmp2
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet/AMOScmp2.not_CONTAINED

AMOScmp/minimus on the AMOScmp (PA14 ref) & velvet contigs *

  • Use all velvet contigs
  • Very similar with the previous case (the velvet CONTAINED contigs were discarded at the beginning)

AMOScmp2

Input:

 All:
 desc    #elem   min     max     mean    stdev   sum
 AMOS    1843    20      127984  3350    10777   6174765
 velvet  10684   45      16239   640     825     6841458
 all     12527   20      127984  1039    4311    13016223
 
 200bp+: 
 desc    #elem   min     max     mean    stdev   sum
 AMOS    542     200     127984  11245   17521   6094904
 velvet  7382    200     16239   877     896     6474426
 all     7924    200     127984  1586    5344    12569330

Output:

 All: 
 desc    #elem   min     max     mean    stdev   sum
 ctg     1361    20      221563  4536    19428   6173665
 singl   838     45      22884   1064    1868    891463
 all     2199    20      221563  3213    15418   7065128
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     299**   203     221563  20448   37378   6113919
 singl   576     200     22884   1499    2114    863431
 all     875     200     221563  7974    23668   6977350

minimus2 on AMOScmp2 output

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     284     47      274913  18279   46193   5191268
 singl   1001    20      133259  1666    9754    1667778
 all     1285    20      274913  5338    24330   6859046
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     246     210     274913  21088   49046   5187663
 singl   285     201     133259  5696    17670   1623459
 all     531     201     274913  12827   36583   6811122

Locations:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet/AMOScmp2
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet/AMOScmp2/minimus2/

AMOScmp/minimus on the AMOScmp PA14 & PAO1 contigs

AMOScmp

Input:

 All:
 desc    #elem   min     max     mean    stdev   sum
 PA14    1843    20      127984  3350    10777   6174765
 PAO1    1961    20      46414   3065    5251    6010800   # 1655 aligned   
                                                           # 1312 CONTAINED in PA14 contigs
                                                           # 343 aligned but not CONTAINED in PA14 ctgs; 36 have 2+ alignments
                                                           # 306 not aligned
 all     3804    20      127984  3203    8396    12185565
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 PA14    542     200     127984  11245   17521   6094904
 PAO1    1166    203     46414   5108    6007    5955960
 all     1708    200     127984  7056    11405   12050864

Output:

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     1585    20      159598  3896    15496   6174745
 singl   266     20      26709   1900    4486    505395      # all singletons are PAO1
 all     1851    20      159598  3609    14456   6680140
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     356     200     159598  17137   29063   6100717
 singl   75      215     26709   6631    6361    497316
 all     431     200     159598  15309   26837   6598033

minimus2 on the above contigs and singletons:

Output:

 All:
 desc    #elem   min     max     mean    stdev   sum
 ctg     246     30      266798  18632   44344   4583528
 singl   1138    19      126780  1569    9194    1785442  # 161 are original PAO1 contigs
 all     1384    19      266798  4602    21458   6368970
 200bp+:
 desc    #elem   min     max     mean    stdev   sum
 ctg     139     202     266798  32900   54951   4573157
 singl   152     201     126780  11405   22894   1733533 # 23 are original PAO1 contigs
 all     291     201     266798  21672   42726   6306690


AMOScmp2 on the AMOScmp PA14 contigs & PA14 genes *

INPUT:

                 #elem   min     max     mean    median  n50     sum
 AMOScmp         1843    20      127984  3350    63      30995   6174765
 AMOScmp(200bp+) 542     200     127984  11245   3263    31106   6094904
 genes           5892    72      15639   992     861     1206    5846175

OUTPUT:

         #elem   min     max     mean    median  n50     sum
 all     650     20      187557  9954    846     60335   6470391
 200bp+  462     201     187557  13977   1578    60335   6457375
 0 singletons !!!

Contigs that contain at least 1 AMOScmp contig & 1 gene: 356

                 #elem   min     max     mean    median  n50     sum       "singletons"
 ctg-all         356**   111     187557  17861   2692    63102   6358519
 ctg-200bp+      353     217     187557  18012   2723    63102   6358095
 ctg-AMOS        356     1       85      5       2       9       1653       190  
 ctg-genes       356     1       160     16      2       58      5781       111  **  that many genes from PA14 seem to miss from PAb1

Contigs that contain at least 1 AMOScmp contig(>100bp) & 1 gene: 218

                 #elem   min     max     mean    median  n50     sum
 ctg-all         218**   158     187557  28371   13008   66709   6184790
 ctg-200bp+      217     305     187557  28501   13008   66709   6184632
 ctg-AMOS        218     1       19      3       2       4       685
 ctg-genes       218     1       160     26      12      59      5632

Contigs that contain at least 1 AMOScmp contig(>200bp) & 1 gene: 206

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-gene/AMOScmp2

AMOScmp2 on the AMOScmp PA14 contigs , PA14 genes * & velvet contigs

INPUT:

                 #elem   min     max     mean    median  n50     sum
 AMOScmp         1843    20      127984  3350    63      30995   6174765
 AMOScmp(200bp+) 542     200     127984  11245   3263    31106   6094904
 genes           5892    72      15639   992     861     1206    5846175
 velvet          10684   45      16239   640     363     1184    6841458

OUTPUT:

                 #elem   min     max     mean    median  n50     sum
 all             552     20      337673  11718   663     133802  6468490
 200bp+          377     201     337673  17126   1155    133802  6456628
 singl           1006    45      16239   869     306     2308    874312       # all but 2 are velvet ctgs; 
                                                                              # 232 velvet singletons have annotated alignments to the AMOScmp ctgs
              

About half of the contigs end up in genes

Contigs that contain at least 1 gene & (1 AMOS ctg >=100bp or 1 velvet ctg): 135

                 #elem   min     max     mean    median  n50     sum
 ctg-all         135**   163     337673  45819   14232   141346  6185574
 ctg-200bp+      134     305     337673  46160   16317   141346  6185411
 ctg-AMOS100-vel 135     1       508     77      28      215     10356
 ctg-genes       135     1       287     42      12      126     5631

Contigs that contain at least 1 gene & 1 velvet ctg: 104 Contigs that contain at least 1 gene , 1 velvet ctg & 1 AMOScmp ctg : 104 (same number)

                 #elem   min     max     mean    median  n50     sum
 ctg-all         104**   163     337673  58949   32961   141346  6130650
 ctg-200bp+      103     747     337673  59519   32961   141346  6130487
 ctg-AMOS                                                        1085
 ctg-velvet                                                      9667
 ctg-genes                                                       5572


Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-velvet-gene/AMOScmp2

AMOScmp on velvet & edena contigs (PA14 ref)

Input:

               #elem   min     max     mean    median  n50     sum
 velvet        10684   45      16239   640     363     1184    6841458
 edena         11180   100     11300   552     356     837     6175460
 all           21864   45      16239   595     359     992     13016918

PA14 0cvg regions (nucmer -c 40; delta-filter -q):

              #elem   min     max     mean    median  n50     sum
 velvet        2388    1       31832   219     24      6434    522048
 edena         8809    1       34854   119     38      438     1050135
 all           2129    1       31832   234     22      7529    498077

Output:

 All:
       #elem   min     max     mean    median  n50     sum
 ctg   2782    45      27957   2141    1207    4246    5955924
 singl 1874    45      16239   839     343     1972    1572938
 all   4656    45      27957   1617    727     3749    7528862
 breaks 492
 
 200 bp+:
       #elem   min     max     mean    median  n50     sum
 ctg   2424    200     27957   2439    1505    4288    5912995
 singl 1233    200     16239   1217    677     2121    1500822
 all   3657    200     27957   2027    1122    3816    7413817
 breaks 478

AMOScmp on velvet contigs (PA14 gene ref) *

Contig stats:

                             #elem   min     max     mean    median  n50     sum
 nucmer (-l 20,-c 65)        3137    65      3615    490     370     705     1535656
 promer (-l 6, -c 10)        3622    61      6125    623     448     922     2255378

Only the contigs CONTAINED in the genes get assembled; BEGIN/END are discarded

Locations:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/genes/AMOScmp-nucmer
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/genes/AMOScmp-promer

De-novo

minimus on velvet contigs (OVL=40)

All:

       #elem   min     max     mean    median  n50     sum
 ctg   179     45      1495    75      45      45      13336
 singl 10324   45      16239   661     384     1188    6819550
 all   10503   45      16239   651     374     1185    6832886
 breaks 257

200 bp+:

       #elem   min     max     mean    median  n50     sum
 ctg   6       227     1495    836     1129    1495    5015
 singl 7375    200     16239   877     586     1252    6469371
 all   7381    200     16239   877     586     1252    6474386
 breaks 251

minimus on velvet contigs (OVL=20)

All:

       #elem   min     max     mean    median  n50     sum
 ctg   1812    47      23446   1621    1164    2366    2936650
 singl 6135    45      15342   626     357     1136    3839779
 all   7947    45      23446   853     471     1574    6776429
 breaks(PA14) 299

200 bp+:

       #elem   min     max     mean    median  n50     sum
 ctg   1746    201     23446   1677    1211    2374    2927255
 singl 4246    200     15342   851     568     1216    3615358
 all   5992    200     23446   1092    702     1651    6542613
 breaks(PA14) 293

minimus1 on velvet contigs (OVL=20)

All:

       #elem   min     max     mean    median  n50     sum
 ctg   2046    73      23446   1628    1157    2396    3330276
 singl 5658    45      15342   609     344     1144    3445004
 all   7704    45      23446   879     489     1660    6775280
 breaks(PA14) 568

200 bp+:

       #elem   min     max     mean    median  n50     sum
 ctg   1978    201     23446   1678    1205    2396    3320032
 singl 3786    200     15342   857     575     1222    3243138
 all   5764    200     23446   1139    737     1729    6563170
 breaks(PA14) 559

If OVL is dropped to 16, fewer contigs get merged => fewer breakpoints

minimus2 on velvet & edena ctgs (OVL=40)

Input: velvet & edene ctgs

          #elem   min     max     mean    median  n50     sum       breaks
 velvet   10684   45      16239   640     363     1184    6841458   240 
 edena    11180   100     11300   552     356     837     6175460   226
 v+e      21864   45      16239   595     359     992     13016918  
 v+e(200+)15698   200     16239   779     528     1069    12233635

Output (ctg+singletons):

          #elem   min     max     mean    median  n50     sum        breaks
 all      8835    45      16296   774     424     1448    6841222    253      
 200bp+   6446    200     16296   1018    663     1514    6563221

Comment: for OVL=20 we get 475 breaks (too many?)

minimus2 on the AMOScmp PA14 & PAO1 contigs

  • Use PA14-PA14 (5 bp, adj ctg ends) , PA14-PAO1 (20 bp, contig ends) , PAO1-PAO1 (5 bp, adj ctg ends) overlaps

All:

       #elem   min     max     mean    median  n50     sum
 ctg   441     43      223777  13972   294     86090   6161798
 singl 1227    20      18974   237     42      6391    290505
 all   1668    20      223777  3868    52      78110   6452303
 breaks(PA14) 130

200 bp+:

       #elem   min     max     mean    median  n50     sum
 ctg   245     202     223777  25080   5100    86090   6144514
 singl 107     210     18974   2180    547     8193    233281
 all   352     202     223777  18119   1678    78110   6377795
 breaks(PA14) 121

Location:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/AMOScmp-PA14-PAO1/minimus2

minimus2 on AMOScmp PA14 contigs and PA14 genes (OVL=20)

INPUT:

         #elem   min     max     mean    median  n50     sum
 AMOScmp 1843    20      127984  3350    63      30995   6174765
 genes   5892    72      15639   992     861     1206    5846175

OUTPUT:

                 #elem   min     max     mean    median  n50     sum       singl
 ctg-all         278**   158     170499  22136   10332   52297   6153935
 ctg-200bp+      277     315     170499  22216   10332   52297   6153777
 ctg-AMOS        278     1       23      2       1       3       604       1239
 ctg-genes       278     1       157     20      10      49      5582      310

minimus2 on velvet contigs and PA14 genes (OVL=20)

  • min %id = 94

INPUT

                 #elem   min     max     mean    median  n50     sum
 gene            5892    72      15639   992     861     1206    5846175
 velvet          10684   45      16239   640     363     1184    6841458

OUTPUT (nucmer,OVL=20)

                 #elem   min     max     mean    median  n50     sum  
 ctg             923**   181     66497   6728    4737    10446   6210244
 singl-gene      352     93      7347    979     777     1272    344478
 singl-velvet    1144    45      16239   580     214     1603    663940
 all             2419    45      66497   2984    901     9049    7218662

Best alignments:

 show-coords Pa.filter-1.delta | ~/bin/AMOS/mummerAnnotate.pl -all | awk '{print $20}' | count.pl
 [CONTAINED]     5306
 [END]           3482
 [BEGIN]         3453
 [CONTAINS]      1977  # ??? why only 1977 velvet ctgs contain full genes; genes are not repetitive; even worse for edena 
 [PARTIAL]       168
 [IDENTITY]      1
 Total           14387

0cvg regions in the genes

 #genes      #elem   min     max     mean    median  n50     sum
 1391        1940    1       1762    36      18      67      69336

=>4166 out of 5557 assembled genes are fully covered by velvet ctgs

Location: /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/genes-PA14/minimus2-nucmer.OVL20

minimus2 on velvet contigs and Ps genes (OVL20)

  • min %id: = 74 (nucmer default)
  • avg %id: = 84
 min%id   overlaps
 74       1162      # all considered
 85       415
 90       110

INPUT

         #elem   min     max     mean    median  n50     sum
 gene    5478    75      18825   996     852     1227    5456880
 velvet  10684   45      16239   640     363     1184    6841458

OUTPUT

               #elem   min     max     mean    median  n50     sum        "singl"
 ctg           652**   323     18825   2150    1768    2451    1401627                 !!! 1.4M of genome gets covered
 ctg-gene      757     96      18825   1271    1122    1455    962199      4721
 ctg-velvet    998     45      11120   938     558     1767    935636      9686

minimus2 on velvet contigs and Ps genes (OVL36: promer -l 6 -c 12)

 min%id   overlaps  genes  velvet 
 22       8496      2985   4827        #default
 50       7209      2813   4495
 60       5948      2571   4087
 70       4361      2122   3269
 74       3700      1880   2827
 80       2615      1407   2036
 85       1714      956    1311
 90       976       516    659

min%id=74 =>

 * 3700 overlaps 
 * 2169 Ps genes
 * 3206 velvet contigs

make-consensus error

                #elem   min     max     mean    median  n50     sum      "singl"
ctg             929**   206     16291   2012    1669    2335    1869539            !!! 1.8M of genome gets covered
ctg-gene        991     111     9522    1118    990     1272    1107648   4487     # 2169-991 genes had overlaps but were not assembled
ctg-velvet      1032    45      16239   1199    910     1742    1237453   9652     # 3206-1032  velvet contigs had overlaps but were not assembled

!!! Slightly more genes/velvet contigs get assembled with promer than nucmer


minimusN on PA14,PA2192,PAC3719,PACS2,PAO1 genes

Output:

 strains        #elem   min     max     mean    median  n50     sum
 2+             5702    72      15450   1015    870     1227    5785616
 3+             5469    153     15450   1007    876     1212    5505737
 4+             5292    153     13067   1008    883     1209    5334219
 5              4962    153     13067   1020    891     1212    5061120  (5021 in the article)

Total and strain specific genes:

 strain          total   singletons
 PA14            5892    519
 PA2192          6157    387   (507 in the article)
 PAC3719         5650    132   (79  in the article)
 PACS2           5676    127
 PAO1            5568    68
 total           28943   1233

all vs all genes: from the aligned ones ~ 1556 are CONTAINED|BEGIN|END (426 are PA14)

Final assembly (v3)

Merged some of the 120 AMOScmp contigs from v2 using velvet contigs. Grouped overlapping contigs and used minimus to merge them.

 Contig length stats:
 
  desc            #ctgs   min     max     mean              stdev           sum
  final3          512     200     512638  13099.41          53526.8         6706902
  AMOScmp3        76      212     512638  82763.22          117199.28       6290005
  novo3           436     200     10493   956.18            1336.44         416897
 
  final3-10KB+    49      10493   512638  127086.12         125757.98       6227220 

Comment:

 ~ 53-74 (out of 436) the velvet denovo that contigs align to PA14.
 
 Length statistics:
                    #ctgs   min     max     mean    stdev   sum
   CONTAINED        53      204     7228    514.81  978.25  27285
   ALL              74      204     7228    622.85  926.13  46091 

Locations:

 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/final3/
 /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/2008_0311_velvet-24/nucmer/merge/

Simulations

First ~ 100KB from PA14, PAO1

All PA14, PAO1

Uniform read error : 1%

Repetitive regions in PA14 (100 bp found my repeat-match)

         #rep.   min     max     mean    median  n50     sum
 PA14    282     100     5571    340     171     520     95919

AMOScmp

 PA14=sample
 PAO1=reference

Conclusions: (results are below)

  • more 33bp reads align with soap
  • more 50bp reads align with nucmer
  • slightly more reads align to the reference if the errors have exponential than uniform distribution ; no much impact on the assembly
  • AMOScmp 50bp read assemblies have fewer ctgs than the 33bp read assemblies
  • 33bp reads: more errors(snps & breaks) in the AMOScmp soap than nucmer assembly
  • 50bp reads: more errors(snps & breaks) in the AMOScmp nucmer than soap assembly
  • more snps & breaks in the non-unique regions => many introduced by repeats
  • more indels (than substitutions) in the unique regions
  • more deletions from the reference
  • the indels are actually longer than 1bp; usually 5-20bp; occur where there are larger differences in the 2 genomes; they seem to be assembly errors
  • there are fewer longer insertions in the qry than in the ref (for all PA14 33bp, PAO1 33bp, PA14 50bp)
  • decreasing the make-consensus alignment wiggle -w to 1 makes no much difference than -w 2
  • if we incease the min% id for the alignments, fewer reads will align , more 1X cvg regions => more snps; these snps are mostly substitutions, not indels

Read alignments to PAO1

Uniform read error : 1%

Alignment programs: nucmer & soap

  • 33bp: min alignment=28bp both nucmer & soap
  • 50bp: min alignment=45bp both nucmer & soap


       error    total    aligned(nucmer)   aligned(soap)       aligned(nucmer&soap)
 33bp  uniform  7528200  6631195(88.08%)   6851948(91.01%)     6602210(87.69%)
 33bp  exp.     7528200  6713822           6879256
 50bp  uniform  4968612  4524844(91.06%)   4497468(90.51%)

0cvg regions in PAO1:

                 #elem   min     max     mean    median  n50     sum
 33bp-nucmer     969     1       13696   245     22      1470    237643
 33bp-soap       543     1       18331   434     26      3262    235621
 50bp-nucmer     326     1       18663   716     96      3178    233556
 50bp-soap       453     1       19256   546     29      4840    247205  !!! largest values

PA14 33 & 50bp simulated reads

Uniform read error : 1%

Legend:

  • uniq:contigs that align uniquely to the reference
  • snps-I: substitutions only

33bp

                                #ctg    min     max     mean    median  n50     sum     snps    snps-I   snps/1KB  snps-I/1KB    breaks
 AMOScmp-PAO1.nucmer            1298    33      60570   4641    1342    13211   6023652 1655    915      0.274     0.151         57   *
 AMOScmp-PAO1.nucmer(uniq)      851     40      46537   3482    1167    9474    2963269 425     57       0.143     0.019         2
  
 AMOScmp-PAO1.soap              774     33      214591  7791    527     30892   6030156 2037    1209     0.337     0.200         85
 AMOScmp-PAO1.soap(uniq)        395     40      42671   3496    194     14876   1380824 189     29       0.136     0.021         2

50bp

                                #ctg    min     max     mean    median  n50     sum     snps    snps-I   snps/1KB  snps-I/1KB    breaks
 AMOScmp-PAO1.nucmer            366     50      214838  16466   1998    59036   6026608 1966    1190     0.326     0.197         101   *
 AMOScmp-PAO1.nucmer(uniq)      158     50      91686   4954    233     23389   782667  77      32       0.098     0.040         0
 
 AMOScmp-PAO1.soap              634     50      144624  9492    2302    28330   6018087 1927    1112     0.320     0.184         60 
 AMOScmp-PAO1.soap(uniq)        349     50      62962   4705    1149    15406   1641990 133     13       0.080     0.007         0      !!! fewest snps
                snps    snps-I  breaks
 *common        838     441     29

 PA14: 6537648    
                                                                   all               common
                                                        Coverage   snps  snps-I      snps   snps-I  
 --------------------------------------------------------------------------------------------------
 sim.PA14/AMOScmp-PAO1/PA14-ctgs.filter-q               6003782    1655  915         1281   761
 sim.PA14.50bp//AMOScmp-PAO1.redo//PA14-ctgs.filter-q   6017813    1966  1190        1198   734
 common(last 2)                                         5996786
  • The larger number of snps in the 50bp assemblies is an artifact of gap closing in 33bp assembly
  • The number of snps in the 33bp covered regions goes slightly down if read length is increased

PA14 33bp simulated reads

Uniform read error : 1%

Legend:

  • uniq : reads that seem to be unique have been filtered out
                         #elem   min     max     mean    median  n50     sum     snps    snps-I   breaks
 AMOScmp-PAO1            1298    33      60570   4641    1342    13211   6023652 1655    915      57* 
 AMOScmp-PAO1.uniq       1383    33      60570   4297    1026    13192   5943384 1207    57       48
  
 AMOScmp-PAO1.soap       774     33      214591  7791    527     30892   6030156 2037    1209     85*
 
 velvet                  4145    45      25746   1577    529     4317    6536518 151     148      0
 
 minimus2                582     33      275737  11306   261     91069   6579905 1610    1056     69   # AMOS & velvet ctg phred score=30
 minimus2.qual           582     33      275730  11305   261     91069   6579772 1240    928      67   # AMOS ctg phred score=20; velvet ctg phred score=30
 minimus2.pre            575     23      275737  11371   238     84812   6538289 1569    1131     34   # some of the AMOS ctg are broken by velvet ctg
 AMOScmp-PACS2           1713    33      55026   3521    423     13041   6031518 1344    831     56    # more ctgs than in the AMOScmp-PAO1 assembly
          snps    snps-I   breaks
 *common  1039    598      34

PA14 33bp simulated reads

Exponential read error : for a 33 bp read avg=~1%

Base 0 error: 1/(1.2**48-1) Multiplied by 1.2 for next base ... Base 48+ error: 1

 $ ~/bin/mutateSeq.pl -exp 48 -test
 0     0.000158259181534141
 1     0.000189911017840969
 2     0.000227893221409163
 ...
 30    0.0375669811375429
 31    0.0450803773650515
 32    0.0540964528380618
 ----------------------------------
 total 0.323
 47    0.833465215984612
 48    1.00015825918153


AMOScmp alignment

                                 #ctgs   min     max     mean    median  n50     sum     snps    snps-I  breaks  #reads-aligned
 AMOScmp-PAO1.exp                1396    33      55370   4315    1188    12277   6023140 1773    877     56      6713822(89.18%)
 AMOScmp-PAO1.exp.uniq           908     40      46548   3394    1054    9533    3081827 481     22      3       .

Annotation

  • ( 5769 genes /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/merge/gene_list.ptt ) OLD ?
  • 5602 genes /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Annotation/final/gene_list.ptt
  • 512 contigs /fs/szasmg2/Bacteria/Pseudomonas_aeruginosa/Assembly/final3/PAb1.final3.fasta

Submission

  • PAb1 genome project
  • Taxonomy
  • Genbank accession: ABKZ01000000
  • SRA001120
  • PLoS aricle
    • Our final assembly contains one large scaffold with 76 contigs whose total length is 6,290,005 bp, with the longest contig at 512,638 bp. The 436 unplaced contigs, which should fit into the remaining gaps, represent sequence that is unique to PAb1 ... The final assembly thus consists of 512 contigs covering 6,706,902 bp ...
    • Our annotation of the PAb1 genome identified 5,602 protein-coding genes, ...