Culex pipiens symbiont: Difference between revisions

From Cbcb
 
(151 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Data Sources =
= Data Sources =


== Sanger: Wolbachia pipientis endosymbiont of Culex quinquefasciatus ==
== Sanger ==
'''Wolbachia pipientis endosymbiont of Culex quinquefasciatus'''


* [http://www.sanger.ac.uk/Projects/W_pipientis/ Sanger Wolbachia Genome Project]
* [http://www.sanger.ac.uk/Projects/W_pipientis/ Sanger Wolbachia Genome Project]
* [ftp://ftp.sanger.ac.uk/pub/pathogens/Wolbachia/ Sanger Wolbachia FTP]
* [ftp://ftp.sanger.ac.uk/pub/pathogens/Wolbachia/ Sanger Wolbachia FTP] 24,532 Sanger traces
 
Old reference:
file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq_061226.dbs


* December 2006 reference (95 sequences):
  file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq_061226.dbs
 
   Top 10 seqs
   Top 10 seqs
   Name                          Length    %GC
   Name                          Length    %GC
Line 23: Line 24:
   ...
   ...


  New reference (12 sequences):
* July 2007 reference (12 sequences; 7 "good"; 4 "unique"):
   file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq.dbs
   file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq.dbs
 
 
   All seqs:
   All seqs:
   Name                    Length  %GC
   Name                    Length  %GC
   1  culexbac1b5Ab03.q1k    1136301 34.17           
   1  culexbac1b5Ab03.q1k    1136301 34.17           
   2  culex161b01.q1k        346054  34.25             
   2  culex161b01.q1k        346054  34.25             
   3  culex166f03.q1k        13962  36.17          share almost all sequence with culex161b01.q1k & 1996bp with culexbac1b5Ab03.q1k
   3  #culex166f03.q1k        13962  36.17          share almost all sequence with culex161b01.q1k & 1996bp with culexbac1b5Ab03.q1k
     subtotal(3)            1496317
     subtotal(3)            1496317
    
    
   4  culex49c07.p1k          9245    35.53          looks circular(misoriented mates at the ends); region 4979-6364 aligns to culexbac1b5Ab03.q1k 3 times
   4  culex49c07.p1k          9245    35.53          misoriented mates at the ends; region 4979-6364(1.3Kbp) aligns to culexbac1b5Ab03.q1k 3 times
   5  culex53a02.p1k          5343    33.58          ~ 1Kbp alignments to culexbac1b5Ab03.q1k & culex161b01.q1k
   5  culex53a02.p1k          5343    33.58          ~ 1Kbp alignments to culexbac1b5Ab03.q1k & culex161b01.q1k
   6  culex117e02.p2kA55      3501    33.10          contained (in 2 pieces) in culexbac1b5Ab03.q1k
   6  #culex117e02.p2kA55      3501    33.10          contained (in 2 pieces) in culexbac1b5Ab03.q1k
   7  culex141a08.q1k        1920    33.44          few hundred bp alignments to other culex* seqs
   7  #culex141a08.q1k        1920    33.44          contained in other seqs
     subtotal(7)            1516326
     subtotal(7)            1516326
    
    
Line 45: Line 46:
   12 culex166d08.p1k        2071    55.53          repeat (higher GC%): culex 18S rRNA & 28S rRNA ; no alignments to NC_002978 & NC_006833
   12 culex166d08.p1k        2071    55.53          repeat (higher GC%): culex 18S rRNA & 28S rRNA ; no alignments to NC_002978 & NC_006833
     total(12)              1546442
     total(12)              1546442
* Sept 2008 reference
  file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq_080903.dbs
 
  1  contig000310  1136301 34.17
  2  contig000307  346054 34.25
  3  contig000311  15587  21.81
  4  contig000305  13962  36.17
  5  contig000312  9245  35.53
  6  contig000309  6967  36.63
  7  contig000306  5343  33.58
  8  contig000315  3501  33.10
  9  contig000308  3350  51.73
  10 contig000313  2148  54.89
  11 contig000314  2071  55.53
  12 contig000304  1994  33.85


== NCBI ==
== NCBI ==
'''Culex quinquefasciatus'''


* Taxonomy:
* Taxonomy:
Line 58: Line 76:
     * Culex pipiens x Culex quinquefasciatus   
     * Culex pipiens x Culex quinquefasciatus   
     * Culex quinquefasciatus (southern house mosquito)    1(project)
     * Culex quinquefasciatus (southern house mosquito)    1(project)
  * Wolbachia Lineage:
    root; cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Wolbachieae; Wolbachia
  * Wolbachia phage WO
    http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=112596
    http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=6723230
    up to 2K alignments at ~90%id of our genome to this virus


* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=12963 Culex quinquefasciatus Genome Project]
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=12963 Culex quinquefasciatus Genome Project]
* Taxonomy ID: 7176
* [ftp://ftp.ncbi.nih.gov/pub/TraceDB/culex_pipiens_quinquefasciatus/ Culex quinquefasciatus TA] : 7,379,314 traces (Sept 2007)
* [ftp://ftp.ncbi.nih.gov/pub/TraceDB/culex_pipiens_quinquefasciatus/ Culex quinquefasciatus TA] : 7,379,314 traces (Sept 2007)


Line 136: Line 163:
   Total                                  7354992        100
   Total                                  7354992        100


Broad:
* [http://www.broad.mit.edu/annotation/genome/culex_pipiens.4/Info.html Culex pipiens quinquefasciatus JHB whole-genome sequencing project]


JCVI:  
JCVI:  
* [http://msc.tigr.org/c_pipiens/index.shtml MSC]
* [http://msc.tigr.org/c_pipiens/index.shtml Culex pipiens Genome Project(MSC)]


Articles
Articles
* [http://rana.lbl.gov/papers/Salzberg_GB_2005.pdf Salzberg_GB_2005.pdf]
* [http://rana.lbl.gov/papers/Salzberg_GB_2005.pdf Salzberg_GB_2005.pdf]
* [http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18550617 Sanger_2008]
* [http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=18550617 Sanger_2008]
* [http://arjournals.annualreviews.org/doi/abs/10.1146%2Fannurev.micro.53.1.71 WOLBACHIA PIPIENTIS: Microbial Manipulator of Arthropod Reproduction(1999)]
* [http://en.wikipedia.org/wiki/Obligate_intracellular_parasite Obligate intracellular parasite] Wikipedia
* [http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WBK-458W13V-W7&_user=961305&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_version=1&_urlVersion=0&_userid=961305&md5=2cb06656fc6ea7f3100ba1cfa894c43f  Bacteriophage WO and Virus-like Particles in Wolbachia, an Endosymbiont of Arthropods]


= Other Strains (complete) =
= Other Strains (complete) =


                                                                     RefSeq GenBank Pub Length (Mbp) GC Prot RNAs
                                                                     RefSeq GenBank Pub Length (Mbp) GC Prot RNAs
   Wolbachia endosymbiont of Drosophila melanogaster(TIGR)            NC_002978 AE017196 1 1.26778 35.2% 1195 39
   Wolbachia endosymbiont of Drosophila melanogaster(TIGR)            NC_002978 AE017196 1 1.267782 35.2% 1195 39
   Wolbachia endosymbiont strain TRS of Brugia malayi strain wMel(NEB) NC_006833 AE017321 1 1.08008 34.2% 805 37
   Wolbachia endosymbiont strain TRS of Brugia malayi strain wMel(NEB) NC_006833 AE017321 1 1.080084 34.2% 805 37
   Wolbachia pipientis wPip(Sanger)                                  NC_010981 AM999887 1 1.48246 34.2% 1275 37
   Wolbachia pipientis wPip(Sanger)                                  NC_010981 AM999887 1 1.482455 34.2% 1275 37 # 1386 CDSs (Sanger article 2008)
  # several others @ JCVI, Sanger ...
 
!!!  Wolbachia pipientis wPip(Sanger) = culex161b01.q1k(346,054) + N(102) + culexbac1b5Ab03.q1k(1,136,301-2)
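
The length arithmetic behind this identity can be checked with a short sketch. The lengths are copied from the line above; reading the "-2" as two bases dropped at the junction is an assumption, and the function name is hypothetical.

```python
# Lengths (bp) copied from the wiki text above.
CULEX161B01 = 346_054
N_GAP = 102
CULEXBAC1B5AB03 = 1_136_301
TRIMMED = 2  # the "-2" in the formula; assumed to be bases dropped at the junction


def concatenated_length(parts, trimmed=0):
    """Total length of the concatenated pieces minus any trimmed bases."""
    return sum(parts) - trimmed


total = concatenated_length([CULEX161B01, N_GAP, CULEXBAC1B5AB03], TRIMMED)
print(total)  # 1482455, i.e. the 1.482455 Mbp listed for NC_010981
```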


!!!  Wolbachia pipientis wPip(Sanger) = culex161b01.q1k(346,054) + N(102) + culexbac1b5Ab03.q1k(1,136,301)
  $ cat NC_010981.gb | grep '\.\.' | egrep -v 'anticodon|source' | awk '{print $1}' | count.pl
  #      total
  gene    1423
  CDS    1275
  tRNA    34
  rRNA    3
 
  $ cat /fs/szasmg2/Culex_pipiens_symbiont/NCBI/NC_010981.gb | grep -c "\/pseudo"
  110
 
  1275+34+3+110=1422
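
The feature tally above relies on the local `count.pl` helper, which is not shown. A minimal Python sketch of the same idea follows; it mirrors the grep/egrep/awk steps literally, and the sample records are made up for illustration (they are not taken from NC_010981.gb).

```python
from collections import Counter


def count_feature_keys(lines):
    """Mimic: grep '\\.\\.' | egrep -v 'anticodon|source' | awk '{print $1}' | count."""
    counts = Counter()
    for line in lines:
        if ".." not in line:
            continue  # keep only lines that carry a location range
        if "anticodon" in line or "source" in line:
            continue  # same exclusions as the egrep -v step
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


# Hypothetical GenBank-style feature lines, for illustration only.
sample = """\
     source          1..1482455
     gene            1..1398
     CDS             1..1398
                     /product="example"
     gene            complement(2000..2075)
     tRNA            complement(2000..2075)
                     /anticodon=(pos:2040..2042,aa:Ala)
""".splitlines()

counts = count_feature_keys(sample)
print(dict(counts))

# Tally reported on this page: CDS + tRNA + rRNA + pseudo vs. gene features.
reported = {"CDS": 1275, "tRNA": 34, "rRNA": 3, "pseudo": 110}
accounted = sum(reported.values())
print(accounted, 1423 - accounted)  # 1422 accounted for, 1 gene feature left over
```

Like the shell pipeline, this counts any line containing "..", so multi-line locations would be counted under whatever token starts the continuation line.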


= Read Counts =
= Read Counts =
Line 167: Line 213:


Steps:
Steps:
   1. All cpqg reads have been downloaded from the TA (July 2006). The
   1. All cpqg reads have been downloaded from the TA (July 2006). The reads have been grouped by libraries and the clear range has been computed. There were 6.6M reads in the download compared with 7.3M now. Unfortunately I've only noticed this difference at the end of my experiment.
  reads have been grouped by libraries and the clear range has been
  computed. There were 6.6M reads in the download compared with 7.3M now.
  Unfortunately I've only noticed this difference at the end of my experiment.


   2. The Wolbachia endosymbiont of Culex quinquefasciatus assembly has
   2. The Wolbachia endosymbiont of Culex quinquefasciatus assembly has been downloaded from the Sanger ftp site ( ftp://ftp.sanger.ac.uk/pub/pathogens/Wolbachia/Wb_Cq.dbs ) ; there are 95 sequences in this file. Most of them are very short. Below are listed the name,length & gc% of the longest 10:  
  been downloaded from the Sanger ftp site
  ( ftp://ftp.sanger.ac.uk/pub/pathogens/Wolbachia/Wb_Cq.dbs ) ; there are
  95 sequences in this file. Most of them are very short. Below are listed
  the name,length & gc% of the longest 10:  
     name                        length(bp)  gc%
     name                        length(bp)  gc%
     culex173d08.p1k              1457497    34.17
     culex173d08.p1k              1457497    34.17
Line 189: Line 228:
     culex53a02.p1k                  5343    33.58
     culex53a02.p1k                  5343    33.58


   3. The cpqg random reads (clr only) have been aligned to symbiont
   3. The cpqg random reads (clr only) have been aligned to symbiont sequences using nucmer (default parameters)
  sequences using nucmer (default parameters)


   4.  The nucmer output has been analyzed. It's been noticed that many of
   4.  The nucmer output has been analyzed. It's been noticed that many of the short symbiont sequences (2-3KB in length) have a higher than expected number of alignments. To avoid the repeats I've selected only the reads that aligned to the longest 10 symbiont sequences (see above).
  the short symbiont sequences (2-3KB in length) have a higher than
  expected number of alignments. To avoid the repeats I've selected only
  the reads that aligned to the longest 10 symbiont sequences (see above).


   5. A 95% identity and minimum of 400 bp alignment threshold has been used to
   5. A 95% identity and minimum of 400 bp alignment threshold has been used to determine the symbiont reads. There were 29,110 unique reads (30,690 reads+mates) selected. Below is a per library breakdown (reads+mates):
  determine the symbiont reads. There were 29,110 unique reads (30,690
  reads+mates) selected. Below is a per library breakdown (reads+mates):
     MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_01-G-CULEX-10KB      9581
     MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_01-G-CULEX-10KB      9581
     MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_06-G-CULEX-10KB      4549
     MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_06-G-CULEX-10KB      4549
Line 220: Line 253:
     1099641499000                                          2
     1099641499000                                          2


   6. The reads have been assembled using the runCA-OBT.pl script (default
   6. The reads have been assembled using the runCA-OBT.pl script (default parameters).  Most of the reads got assembled into 3 large scaffolds. There is mate pair evidence (outie mates) that the largest scaffold is circular.  
  parameters).   
  Most of the reads got assembled into 3 large scaffolds. There is mate
  pair evidence (outie mates) that the largest scaffold is circular.  


   All the scaffolds end up in surrogates (20-50KB total surrogate length)
   All the scaffolds end up in surrogates (20-50KB total surrogate length)
Line 235: Line 265:
   [[Media:Cpqg.2006_1226_WGA.6.png|scaff_6]] Small scaff that Looks circular
   [[Media:Cpqg.2006_1226_WGA.6.png|scaff_6]] Small scaff that Looks circular


   7. The scaffolds/contigs have been aligned to longest 10 Wolbachia
   7. The scaffolds/contigs have been aligned to longest 10 Wolbachia endosymbiont sequences. Most of the long alignments were at over 99% identity. However, several large rearrangements have been noticed.
  endosymbiont sequences. Most of the long alignments were at over 99%
  identity. However, several large rearrangements have been noticed.


   [[Media:Cpqg.2006_1226_WGA.Wb_Cq-cpqg_scaff.png|Wb_Cq-vs-scaff]] Reference vs scaff
   [[Media:Cpqg.2006_1226_WGA.Wb_Cq-cpqg_scaff.png|Wb_Cq-vs-scaff]] Reference vs scaff
Line 246: Line 274:
Steps:  
Steps:  


   1. All Culex reads have been downloaded from TA . ~1M new reads sincd 2006_1226
   1. All Culex reads have been downloaded from TA . ~1M new reads since 2006_1226


   2. The reads have been aligned to the new reference (exclude mito,repeats) using nucmer (default parameters)
   2. The reads have been aligned to the new reference (exclude mito,repeats) using nucmer (default parameters)


   3. A 95% identity and minimum of 400 bp alignment threshold has been used to
   3. A 95% identity and minimum of 400 bp alignment threshold has been used to determine the symbiont reads.  3850 new reads & mates were identified in addition to the previous ones
  determine the symbiont reads.  3850 new reads & mates in addition to the previous
  ones were identified


   4. 33,783 reads have been assembled using the runCA-OBT.pl script (default
   4. 33,783 reads have been assembled using the runCA-OBT.pl script (default parameters).   
  parameters).   
   [[Media:Cpqg.2007_0802_WGA-default.qc|Cpqg.qc]]
   [[Media:Cpqg.2007_0802_WGA-default.qc|Cpqg.qc]]


   Compared to the initial assembly, many metrics went down (TotalBasesInScaffolds,MaxBasesInScaffolds,MaxContigLength ...)
   Compared to the initial assembly, many metrics went down (TotalBasesInScaffolds,MaxBasesInScaffolds,MaxContigLength ...)
   TotalSurrogates & SurrogateInstances more than doubled
   TotalSurrogates & SurrogateInstances more than doubled
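
The read-selection rule used in both runs (at least 95% identity over at least 400 bp) can be sketched as below. This assumes the nucmer alignments have already been parsed into `(read_id, alignment_length, percent_identity)` tuples; that record layout, the function name, the sample reads, and treating both thresholds as inclusive are all assumptions, not details from the original scripts.

```python
def select_symbiont_reads(alignments, min_identity=95.0, min_length=400):
    """Return the set of read IDs with at least one alignment passing both thresholds."""
    selected = set()
    for read_id, aln_len, pct_id in alignments:
        if pct_id >= min_identity and aln_len >= min_length:
            selected.add(read_id)
    return selected


# Hypothetical alignment records, for illustration only.
example = [
    ("readA", 650, 99.1),  # passes
    ("readB", 380, 99.8),  # too short
    ("readC", 700, 93.0),  # identity too low
    ("readA", 420, 96.0),  # second hit for the same read, counted once
    ("readD", 400, 95.0),  # exactly on both thresholds
]

picked = select_symbiont_reads(example)
print(sorted(picked))
```

Returning a set keeps the count in terms of unique reads; mates of the selected reads would still have to be added in a separate step.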


== 2007_0802_WGA-0.5E  ==
== 2007_0802_WGA-0.5E  ==
Line 304: Line 328:
   total=8,1477210,1478945,184651,578
   total=8,1477210,1478945,184651,578


== 2008_0829_WGA-wgs e=2.0 ==  
== 2008_0829_WGA-wgs e=2.0 -> best ==  
 
=== Assembly description ===
 
# The Wolbachia pipientis endosymbiont of Culex quinquefasciatus assembly was downloaded from the Sanger web site (July 2007 version: 12 contigs)
# 5 of the 12 contigs were discarded due to their high GC% or repetitive content; 7 contigs were kept to be used for sequence alignments
# The NCBI TA Culex quinquefasciatus traces were downloaded locally (Sept 2007: 7,379,314 total Sanger traces; 7,183,129 WGS)
# The WGS traces were aligned to the 7 reference contigs using nucmer (default parameters: minimum 65bp length, 80% identity)
# The traces that aligned and their mates were filtered out and formatted as input for Celera Assembler (36,767 total traces; 35,750 mated, 1,017 unmated)
# The traces were assembled with CA (wgs-5.1) (default parameters except for unitiggerErrorRate=2%)
# The assembler generated 16 scaffolds, 21 contigs and 92 degenerates;
# 5 scf, 10 ctg & 11 deg were filtered based on their "uniqueness"; 2 of the scaffolds contain multiple contigs
# There are 2 unique regions in reference not present in this genome  (NC_010981.1: 775928-776047 120bp; 1253284-1254139 856bp)
# There are 4 unique regions (~ 500bp each ) in this genome not present in the reference sequence/assembly:    ctg7180000001230_202_725, ctg7180000001305_11303_11867, ctg7180000001305_13367_14006, deg7180000001252_328_851
# 10 large scale rearrangements
 


Comments
# wgs-5.2-beta generated the same results
# modifying astatLowBound, astatHighBound  did not result in better assembly
[[Media:Cpqg.qc|Cpqg.qc]]
   all:                                                                16 scf, 21 ctg , 92 deg
   all:                                                                16 scf, 21 ctg , 92 deg
   ones with gc% in the 32..36 range or have Wp genes aligned to them: 11 scf, 16 ctg , 41 deg
   ones with gc% in the 32..36 range or have Wp genes aligned to them: 11 scf, 16 ctg , 41 deg:
  filtered(submission)                                                5 scf, 10 ctg,  11 deg
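
The first filter above (GC% in the 32..36 range, or Wp genes aligned) can be expressed as a small predicate. This is a sketch under assumptions: the function names are hypothetical, the range is treated as inclusive, and GC% is computed over A/C/G/T only, which may differ from how the original GC% column was produced.

```python
def gc_percent(seq):
    """GC content in percent, ignoring any character other than A, C, G, T."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def keep_sequence(gc, wp_gene_hits, low=32.0, high=36.0):
    """Keep a scaffold/contig/degenerate if its GC% is in range or it has Wp gene hits."""
    return low <= gc <= high or wp_gene_hits > 0


# (gc%, NC_gene) pairs taken from the tables on this page.
print(keep_sequence(34.17, 1250))  # scf7180000001311: in range
print(keep_sequence(36.88, 18))    # deg7180000001280: out of range, but has gene hits
print(keep_sequence(62.76, 0))     # scf7180000001307: high GC, no gene hits
```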


   [Top5Scaffolds=contigs,size,span,avgContig,avgGap]
   [Top5Scaffolds=contigs,size,span,avgContig,avgGap]
Line 316: Line 361:
   4=1,2425,2425,2425,0
   4=1,2425,2425,2425,0
   total=10,1511725,1513222,151172,299
   total=10,1511725,1513222,151172,299
  [[Media:NC_010981-scf.filter-q.png]]
  [[Media:NC_010981-ctg-deg.filter-q.png]]
  [[Media:Cpqg.all.infoseq‎]]


   top 2 scaffold size=1458420
   top 2 scaffold size=1458420
   top 3 scaffold size=1500985
   top 3 scaffold size=1500985
  Alignment files
  [[Media:NC_010981-scf.filter-q.png]]    [[Media:NC_010981-scf.filter-1.png]]  [[Media:scf-NC_010981.filter-1.png]]
  [[Media:NC_010981-ctg.filter-q.png]]    [[Media:NC_010981-ctg.filter-1.png]]
  [[Media:NC_010981-ctg-deg.filter-q.png]] [[Media:NC_010981-ctg-deg.filter-1.png]]
  Stats:
        #elem  min    max    mean    median  n50    sum
  scf  16      1035    1389537 95513  1501    1389537 1528210 ; 4  CONTAINED in bigger scf
  ctg  21      1035    478325  72697  1583    478325  1526633 ; 4  CONTAINED in bigger ctg
  deg  92      245    7632    1079    843    1000    99229  ; 22 CONTAINED in        ctg
  #id                    len    gc%    Wb_Cq.7 NC_gene cvg
  scf7180000001311        1389537 34.17  3744    1250    14.59 #ctg7180000001298..ctg7180000001301
  scf7180000001316        70460  34.82  85      62      6.22  #ctg7180000001303..ctg7180000001305
  scf7180000001315        42565  34.15  72      110    13.01
  scf7180000001310        8315    35.65  5      3      4.84
  scf7180000001307        2425    62.76  0      0      2.64
  ...
  scf7180000001320        1315    35.67  3      1      20.27
  ... 
  ctg7180000001299        478325  34.12  1253    478    15.56
  ctg7180000001300        466173  34.13  1401    532    14.5
  ctg7180000001298        316943  34.05  541    343    13.99
  ctg7180000001301        126623  34.61  550    225    12.77
  ctg7180000001302        42565  34.15  72      110    13.01
  ctg7180000001305        37016  34.66  47      52      6.49
  ...
  deg7180000001279        7632    33.32  6      6      26.85  # the long degenerates have high coverage
  deg7180000001277        4159    35.18  61      72      40.3
  deg7180000001280        3685    36.88  12      18      36.67
  ...
  deg7180000001231        245    33.06  0      0      1.3
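
The columns of the Stats table above (#elem, min, max, mean, median, n50, sum) can be computed from a list of sequence lengths as sketched below. The script that produced the table is not shown, so its n50 convention may differ from the one used here (the length at which the running total of lengths, sorted longest first, reaches half of the sum); the function names and the sample lengths are hypothetical.

```python
from statistics import median


def n50(lengths):
    """Length at which the descending cumulative sum reaches half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summarize_lengths(lengths):
    """Summary statistics in the same order as the table columns."""
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
        "median": median(lengths),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }


# Hypothetical lengths, for illustration only.
stats = summarize_lengths([10, 20, 30, 40])
print(stats)
```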
=== Filtering ===
Steps:
# Align scaffold & degenerates to top3 ref ctgs culex161b01.q1k(346,054)+culexbac1b5Ab03.q1k(1,136,301)+culex49c07.p1k(9,245) using nucmer;
# Filter alignments using "delta-filter -r"
# Remove CONTAINED scf & deg
# add scf & deg that contain UNIQUE seq & not in the list: scf7180000001309(1,361) & deg7180000001252(937)
# order & orient ctgs
  #id                    len    gc%    Wb_Cq.7 NC_gene cvg    contained
  ctg7180000001230        1361    37.99  17      5      3.03    N
  ctg7180000001248        8315    35.65  5      3      4.84    N
  ctg7180000001298        316943  34.05  541    343    13.99  N
  ctg7180000001299        478325  34.12  1253    478    15.56  N
  ctg7180000001300        466173  34.13  1401    532    14.5    N
  ctg7180000001301        126623  34.61  550    225    12.77  N
  ctg7180000001302        42565  34.15  72      110    13.01  N
  ctg7180000001303        29919  34.92  35      52      6.29    N
  ctg7180000001304        3421    35.19  3      2      2.83    N
  ctg7180000001305        37016  34.66  47      52      6.49    N
 
  scf7180000001309        1361    37.99  17      5      3.03    N
  scf7180000001310        8315    35.65  5      3      4.84    N
  scf7180000001311        1389537 34.17  3744    1250    14.59  N # origin of replication at pos 112,7008 (-)
  scf7180000001315        42565  34.15  72      110    13.01  N
  scf7180000001316        70460  34.82  85      62      6.22    N
 
  deg7180000001236        2346    34.02  4      5      35.34  N
  deg7180000001244        3090    33.66  8      10      33.09  N
  deg7180000001252        937    36.29  4      5      1.37    N
  deg7180000001256        1888    32.42  110    49      19.98  N
  deg7180000001260        1198    37.73  5      3      13.1    N
  deg7180000001266        2375    32.80  55      49      28.4    N
  deg7180000001272        2923    32.91  54      46      35.26  N
  deg7180000001277        4159    35.18  61      72      40.3    N
  deg7180000001279        7632    33.32  6      6      26.85  N
  deg7180000001280        3685    36.88  12      18      36.67  N
  deg7180000001290        1879    31.40  10      5      33.06  N
 
  => 10 ctgs (5 scaff) & 11 deg
  .scaff file
  >7180000001309 1 1365 1364
  7180000001230 BE 1365 0
 
  >7180000001310 1 8319 8318
  7180000001248 BE 8319 0
 
  >7180000001311 4 1404424 1405836
  7180000001298 BE 320424 -19
  7180000001299 BE 484180 1434
  7180000001300 BE 471690 1
  7180000001301 BE 128130 0
 
  >7180000001315 1 42888 42887
  7180000001302 BE 42888 0
 
  >7180000001316 3 70739 70822
  7180000001303 BE 30041 1
  7180000001304 BE 3421 85
  7180000001305 BE 37277 0
=== Reference sequence not present in the assembly ===


100+ bp 0cvg regions in the reference:
100+ bp 0cvg regions in the reference:
Line 333: Line 472:


   1.1. NC_010981.1    RefSeq  gene    775763  777826  .      +      .    contains  GeneID:6385213 # WP0709 Putative outer membrane protein
   1.1. NC_010981.1    RefSeq  gene    775763  777826  .      +      .    contains  GeneID:6385213 # WP0709 Putative outer membrane protein
   1.2. NC_010981.1    RefSeq  gene    1252115 1253287 .      +      .    begin    GeneID:6385392 # tuf translation elongation factor tu
   1.2. NC_010981.1    RefSeq  gene    1252115 1253287 .      +      .    begin    GeneID:6385392 # tuf translation elongation factor tu (2 in Sanger wPip, none in Dan's annotation)
   1.3. NC_010981.1    RefSeq  gene    1253302 1253622 .      +      .    contained GeneID:6385310 # rpsJ 30s ribosomal protein s10 ??? missing;                                                                                                     
   1.3. NC_010981.1    RefSeq  gene    1253302 1253622 .      +      .    contained GeneID:6385310 # rpsJ 30s ribosomal protein s10 ??? missing;                                                                                                     
                                                                                                       # very conserved in Wolbachia endosymbiont of Drosophila melanogaster
                                                                                                       # very conserved in Wolbachia endosymbiont of Drosophila melanogaster
   2.1. NC_010981.1    RefSeq  gene    1253632 1254354 .      +      .    end      GeneID:6385679 # rplC ribosomal protein L3
   2.1. NC_010981.1    RefSeq  gene    1253632 1254354 .      +      .    end      GeneID:6385679 # rplC ribosomal protein L3 (partially present in Dan's annotation)
                                                                                                       # very conserved in several species : Wolbachia, Ehrlichia ...
                                                                                                       # very conserved in several species : Wolbachia, Ehrlichia ...


   No promer alignments of sequences to these regions
   No promer alignments of sequences to these regions
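
Finding the 100+ bp zero-coverage regions amounts to taking the complement of the aligned intervals on the reference. A minimal sketch, assuming 1-based inclusive reference coordinates with start <= end; the function name is hypothetical and the example intervals are constructed for illustration so that they leave exactly the two gaps reported for NC_010981.1.

```python
def uncovered_regions(ref_len, intervals, min_len=100):
    """Return (start, end, length) for reference stretches not covered by any interval."""
    gaps = []
    next_uncovered = 1  # first reference position not yet covered
    for start, end in sorted(intervals):
        if start > next_uncovered:
            gap_len = start - next_uncovered
            if gap_len >= min_len:
                gaps.append((next_uncovered, start - 1, gap_len))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= ref_len:
        gap_len = ref_len - next_uncovered + 1
        if gap_len >= min_len:
            gaps.append((next_uncovered, ref_len, gap_len))
    return gaps


# Illustrative intervals (not real alignments) on a 1,482,455 bp reference.
covered = [(1, 775927), (776048, 1253283), (1254140, 1482455)]
gaps = uncovered_regions(1482455, covered)
print(gaps)  # [(775928, 776047, 120), (1253284, 1254139, 856)]
```

The same function applies to the assembly side by swapping the roles of reference and query.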
 
=== Assembly sequence not present in the reference ===
 
100+ bp 0cvg regions in the assembly:
100+ bp 0cvg regions in the assembly:


     ctg_start_stop              len    gc%    comments
     ctg_start_stop              len    gc%    comments
   1. ctg7180000001230_202_1361   1160  37.50  # first 600 have no alignments; last 400bp are cloning vector
   1. ctg7180000001230_202_725    524   37.50  # first 202bp have multiple alignments to NC_010981
   2. ctg7180000001305_11303_11867 565    33.45  # aligns at 100%len, 100%id to Wolbachia endosymbiont of Drosophila melanogaster, complete genome; NC_002978.6:243504..243803
                                                # bases 203..725 have no alignments to NC_010981
                                                 # this contig used to contain cloning vector at the 3' end which was removed (725..1160)
                                                # scf7180000001309 
   2. ctg7180000001305_11303_11867 565    33.45  # aligns at 100%len, 100%id to Wolbachia endosymbiont of Drosophila melanogaster, complete genome; NC_002978.6:243504..243803  
                                                # 11070..11735  Putative dna repair protein radc [Wolbachia pipientis] (Dan's annotation)
                                                # scf7180000001316 3 70739 70822
   3. ctg7180000001305_13367_14006 640    33.91  # good blastx alignment to Wolbachia gene on 100% length; NC_002978.6:488974..489912
   3. ctg7180000001305_13367_14006 640    33.91  # good blastx alignment to Wolbachia gene on 100% length; NC_002978.6:488974..489912
   4. deg7180000001252_328_851    524    35.88  # blastx align to We of Bm NC_006833.1:754520..755170
   4. deg7180000001252_328_851    524    35.88  # blastx align to We of Bm NC_006833.1:754520..755170
 
  no alignments to Sanger raw reads
Others: might be contaminated?
  5. ctg7180000001257            1378    32.80  # 313:1378  Culex pipiens LINE repeat !!!
                                                # 537..1379 reverse transcriptase [Bacteroides thetaiotaomicron VPI-5482] (Dan's annotation)
  6. ctg7180000001285            1568  37.37  # GC% higher than avg
                                                # 125..700  transcriptional regulator, XRE family [Thermotoga lettingae TMO] (Dan's annotation)
                                                # 984..1571 putative outer membrane protein probably involved in nutrient binding [Bacteroides fragilis YCH46] (Dan's annotation)
-------
ORFs
  1.1 ctg7180000001230:orf00001 ctg7180000001230 -1 262 +2        transposase, IS256 family [Wolbachia endosymbiont of Drosophila melanogaster]
  $ cat NC_010981.ptt | grepi -c transposase
  80
  1.2 ctg7180000001230:orf00002 ctg7180000001230 1277 840 -3      blast e-val:3e-84 chloramphenicol acetyltransferase [Salmonella enterica subsp. enterica serovar Typhi str. CT18] not in NC_010981 !!!
                                                                  cloning vector !!!
  >ctg7180000001230:orf00002  ctg7180000001230  1277 840  len=438
  ATGGCAATGAAAGACGGTGAGCTGGTGATATGGGATAGTGTTCACCCTTGTTACACCGTT
  TTCCATGAGCAAACTGAAACGTTTTCATCGCTCTGGAGTGAATACCACGACGATTTCCGG
  CAGTTTCTACACATATATTCGCAAGATGTGGCGTGTTACGGTGAAAACCTGGCCTATTTC
  CCTAAAGGGTTTATTGAGAATATGTTTTTCGTCTCAGCCAATCCCTGGGTGAGTTTCACC
  AGTTTTGATTTAAACGTGGCCAATATGGACAACTTCTTCGCCCCCGTTTTCACCATGGGC
  AAATATTATACGCAAGGCGACAAGGTGCTGATGCCGCTGGCGATTCAGGTTCATCATGCC
  GTTTGTGATGGCTTCCATGTCGGCAGAATGCTTAATGAATTACAACAGTACTGCGATGAG
  TGGCAGGGCGGGGCGTAA
  >ctg7180000001230:orf00002  ctg7180000001230  1277 840  len=438
  MAMKDGELVIWDSVHPCYTVFHEQTETFSSLWSEYHDDFRQFLHIYSQDVACYGENLAYF
  PKGFIENMFFVSANPWVSFTSFDLNVANMDNFFAPVFTMGKYYTQGDKVLMPLAIQVHHA
  VCDGFHVGRMLNELQQYCDEWQGGA*
  >gi|18466598|ref|NP_569406.1| chloramphenicol acetyltransferase [Salmonella enterica subsp. enterica serovar Typhi str. CT18]
  MEKKITGYTTVDISQWHRKEHFEAFQSVAQCTYNQTVQLDITAFLKTVKKNKHKFYPAFIHILARLMNAH
  PEFRMAMKDGELVIWDSVHPCYTVFHEQTETFSSLWSEYHDDFRQFLHIYSQDVACYGENLAYFPKGFIE
  NMFFVSANPWVSFTSFDLNVANMDNFFAPVFTMGKYYTQGDKVLMPLAIQVHHAVCDGFHVGRMLNELQQ
  YCDEWQGGA
  2. ctg7180000001305:orf00013 ctg7180000001305 11070 11735 +3  DNA repair protein RadC, putative [Wolbachia endosymbiont of Drosophila melanogaster]
 
  # 3 copies in NC_010981
  $ cat NC_010981.ptt | grepi RadC
  280207..280863  -      218    190570723      -      WP0276  -      -      Putative dna repair protein radc
  488966..489634  -      222    190570883      -      WP0459  -      -      Putative dna repair protein radc
  1418058..1418726        -      222    190571715      -      WP1343  -      -      Putative dna repair protein radc
  3. ctg7180000001305:orf00015 ctg7180000001305 13185 14093 +3  transcriptional regulator, putative [Wolbachia endosymbiont of Drosophila melanogaster]
 
  # 10 copies
  cat NC_010981.ptt | grepi "transcriptional regulator"
  247653..248570  -      305    190570687      -      WP0239  -      -      Putative transcriptional regulator
  277056..277895  -      279    190570720      -      WP0273  -      -      Putative transcriptional regulator
  277921..278835  -      304    190570721      -      WP0274  -      -      Putative transcriptional regulator
  281034..281954  -      306    190570724      -      WP0277  -      -      Putative transcriptional regulator
  296912..297466  -      184    190570733      -      WP0290  -      -      Putative transcriptional regulator
  486388..487365  -      325    190570881      -      WP0457  -      -      Putative transcriptional regulator
  630467..631237  +      256    190570997      -      WP0585  -      -      two component transcriptional regulator
  806511..806837  +      108    190571141      -      WP0739  -      -      Putative transcriptional regulator, MerR family
  1129005..1129301        +      98      190571445      -      WP1058  -      -      Putative transcriptional regulator
  1415480..1416457        -      325    190571713      -      WP1341  -      -      putative transcriptional regulator
  4. ?
=== Repeats ===
* No alignments of ctg/deg to RepeatMaskerLib
* Tandem repeats
** Minisatellite copy-number variation can be used to genotype bacterial strains
  $ show-coords NC_010981.trf-wPip.trf.filter-1.delta | grep -f NC_010981.trf-wPip.trf.filter-1.qry_diff
      1      100  |      126      27  |      100      100  |  100.00  |      100      208  |  100.00    48.08  | 34.18.100  98.54.208      [CONTAINED]
      1      122  |      122        1  |      122      122  |  100.00  |      122      208  |  100.00    58.65  | 35.54.122  98.54.208      [CONTAINED]
      1      273  |        1      273  |      273      273  |  100.00  |      280      355  |    97.50    76.90  | 68.75.280  75.75.355      [CONTAINED]
  $ infoseq wPip.trf.fasta    | grep -f NC_010981.trf-wPip.trf.filter-1.qry_diff
  60.65.213      213    38.97
  75.75.355      355    34.08
  98.54.208      208    45.19
  99.18.146      146    43.84
* RepeatScout pipeline summary
  10 ctg+11 degen
                  #elem  min    max    mean    median  n50    sum
  families        90      67      7071    678    358    989    61024
  repeats        465    30      7071    631    378    984    293381
  uniq            248    68      46556  5071    1549    16001  1257709
  NC_010981
                  #elem  min    max    mean    median  n50    sum
  families        51      71      5779    726    307    1360    37012
  repeats        304    31      5779    682    548    989    207331
  uniq            199    68      58998  6384    2512    16001  1270466
* !!! more repeats in our assembly
* Comparison of the longest repeats (our strain vs Sanger strain):
  $ cd /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/RepeatScout
  $ sort -nk2 -r wPip-NC_010981.families.infocount
  fam    len    gc%    #ref    #qry
  12      7071    36.18  5      2        # repeat family 12 has 5 copies in our assembly and 2 copies in NC_010981
  57      6770    35.05  4      2
  77      3129    34.48  4      5
  6      2461    35.11  3      2
  87      1468    34.81  4      2
  26      1399    35.67  0      0
  1      1346    36.18  2      2
  2      1345    38.74  33      31
  60      1097    39.56  3      5
* there are differences in the copy numbers
* there is no frequent repeat present in one genome but not in the other
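
The two observations above can be checked directly against the family table. A small sketch using the rows listed there (family, length, #ref, #qry); the function name is hypothetical.

```python
# (family, length, #ref copies, #qry copies) copied from the table above.
FAMILIES = [
    (12, 7071, 5, 2),
    (57, 6770, 4, 2),
    (77, 3129, 4, 5),
    (6, 2461, 3, 2),
    (87, 1468, 4, 2),
    (26, 1399, 0, 0),
    (1, 1346, 2, 2),
    (2, 1345, 33, 31),
    (60, 1097, 3, 5),
]


def compare_copy_numbers(rows):
    """Return (families whose copy numbers differ, families present in only one genome)."""
    differing = [fam for fam, _length, ref, qry in rows if ref != qry]
    exclusive = [fam for fam, _length, ref, qry in rows if (ref == 0) != (qry == 0)]
    return differing, exclusive


differing, exclusive = compare_copy_numbers(FAMILIES)
print(differing)  # families with different copy numbers
print(exclusive)  # families found in one genome only
```

For these nine rows, seven families differ in copy number and none is exclusive to one genome.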
=== Multiple copies in reference ===
=== Multiple copies in assembly ===
=== Snps ===
=== Rearrangements ===
* ~ 10 rearrangements
* some rearrangements are associated with IS elements: the 20 copy 1.3K repeats belong to "12 IS5 (IS256-family)" , transposase gene
----
=== Improving strategy ===


Identify more reads that align to those regions (blastn TA):


Adding these reads did not improve the assembly.
----


   gi|42519920|ref|NC_002978.6|    243504  243803 # ctg7180000001305_11303_11867 565    33.45  : there are reads aligned to 241173-243822


All the reads aligned to the 3 regions above have been assembled; the 3 regions seem to contain rearrangements
---
=== Files & Directories ===
* Wolbachia pipientis, complete genome
  /fs/szasmg2/Culex_pipiens_symbiont/NCBI/NC_010981.fna
* qc file
  /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.qc
* AMOS bank
  /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.bnk/
* nucmer alignment files : assembly scaffolds/contigs/degenerates/unitigs vs the reference genome
  *.filter-q.* were generated using "delta-filter -q"
  *.filter-1.* were generated using "delta-filter -1"
  /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/nucmer/NC_010981-*delta
  Filtered scaffolds:
  #id                  len    gc%    Wb_Cq.7 NC_gene cvg
  scf7180000001311      1389537 34.17  3744    1250    14.59  *
  scf7180000001316      70460  34.82  85      62      6.22
  scf7180000001315      42565  34.15  72      110    13.01  *
  scf7180000001310      8315    35.65  5      3      4.84
  scf7180000001319      1501    36.64  3      1      7.9
  scf7180000001312      1378    32.80  0      0      1.62
  scf7180000001309      1361    37.99  17      5      3.03
  scf7180000001320      1315    35.67  3      1      20.27
  scf7180000001317      1173    34.53  3      1      2.18
  scf7180000001318      1115    36.41  2      3      2.32
  scf7180000001321      1035    34.01  3      1      3.4
  ctg7180000001299      478325  34.12  1253    478    15.56  *
  ctg7180000001300      466173  34.13  1401    532    14.5    *
  ctg7180000001298      316943  34.05  541    343    13.99  *
  ctg7180000001301      126623  34.61  550    225    12.77  *
  ctg7180000001302      42565  34.15  72      110    13.01  *
  ctg7180000001305      37016  34.66  47      52      6.49
  ctg7180000001303      29919  34.92  35      52      6.29
  ctg7180000001248      8315    35.65  5      3      4.84
  ctg7180000001304      3421    35.19  3      2      2.83
  ctg7180000001270      1501    36.64  3      1      7.9
  ctg7180000001257      1378    32.80  0      0      1.62
  ctg7180000001230      1361    37.99  17      5      3.03
  ctg7180000001284      1315    35.67  3      1      6.83
  ctg7180000001232      1173    34.53  3      1      2.18
  ctg7180000001237      1115    36.41  2      3      2.32
  ctg7180000001297      1035    34.01  3      1      3.4
* filtered contigs & degens
  /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.ctg-deg.filter.fasta
  /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.ctg-deg.filter.infoseq
=== Annotation (original) ===
Format annotation for NCBI submission:
  $ wc -l  cpqg.ctg.CDS cpqg.ctg.tRNA cpqg.ctg.rRNA
    1476 cpqg.ctg.CDS
      34 cpqg.ctg.tRNA
      4 cpqg.ctg.rRNA
  cat cpqg.ctg.CDS  | sed 's/orf00//' | sed 's/ctg718000000//' | ~/bin/tab2annotation.pl -hl 1 -t CDS >! cpqg.ctg.CDS.tbl
  cat  cpqg.ctg.tRNA | sed 's/orf00//' | sed 's/ctg718000000//' | ~/bin/tab2annotation.pl -hl 1 -t tRNA >! cpqg.ctg.tRNA.tbl
  cat cpqg.ctg.rRNA  | sed 's/orf00//' | sed 's/ctg718000000//' | ~/bin/tab2annotation.pl -hl 1 -t rRNA >! cpqg.ctg.rRNA.tbl
  cat cpqg.ctg.CDS.tbl cpqg.ctg.tRNA.tbl  cpqg.ctg.rRNA.tbl > cpqg.ctg.tbl
Sanger Wolbachia: far fewer genes !!!
  NC_010981.ptt
  1248 CDS
    25 tRNA : 1Leu (vs 5 in our strain) !!!
    2 rRNA
No CRISPRs found by CRISPRFinder
=== Annotation (revised) ===
* Genes manually curated by Dan; many transposases deleted
  wc -l cpqg.ctg.CDS cpqg.ctg.tRNA cpqg.ctg.rRNA cpqg.deg.CDS
    1342 cpqg.ctg.CDS
      36 cpqg.deg.CDS
      34 cpqg.ctg.tRNA
      4 cpqg.ctg.rRNA   
    1416 total
= NCBI submission =
* name: Wolbachia pipientis wPip(strain) JHB(substrain) (got this from Steven)
* NCBI suggestion:
  [organism=Wolbachia endosymbiont of Culex quinquefasciatus JHB]
  [host=Culex quinquefasciatus JHB]
* [http://www.ncbi.nlm.nih.gov/Genbank/genomesubmit.html registration]
  http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=EB95B67D-199C-42C9-80CE-F2AC9C7C7A02
  Project ID: 32209
  Locus Tag Prefix: C1A
* Submission dir:
  /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/submission2/
* Submission via GenomesMacroSend;
** Direct Submit ID: DSub8465 (1st submission)
** Direct Submit ID: DSub8474,DSub8475  (revisions to the 1st submission)
* TaxId: 569881
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=search&term=ABZA00000000  ABZA00000000] Genome Project
* [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=212995898 ABZA00000000] Project accession number
** ctg:  ABZA01000001..ABZA01000021
** scaff: DS996929-DS996944
* NCBI files:
  /fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA01_accs :          21 ctg & deg accession numbers ABZA01000001..ABZA01000021
  /fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA.01.modified.p2g  1378  gene accession numbers  EEB55160.. EEB56537
  /fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA01_scfld_DS_accs  16 scaffold id's
* Future updates
** Protein id formats to use(?):
  gnl|umiacs|C1A_1|gb|EEB55198
  gnl|WGS:ABZA|C1A_1|gb|EEB55198
* [http://www.ncbi.nlm.nih.gov/Traces/assembly/assmbrowser.cgi?cmd=browse&ai=3900&m=main&s=browse AA] AI 3900
= Article =
* [[wPip_article|article submitted]]

Latest revision as of 12:36, 11 December 2008

Data Sources

Sanger

Wolbachia pipientis endosymbiont of Culex quinquefasciatus

  • December 2006 reference (95 sequences):
 file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq_061226.dbs
 
 Top 10 seqs
 Name                           Length     %GC
 culex173d08.p1k               1457497    34.17
 culexbac1d10Bg07.p1k            24726    35.11
 culex3d09.p1k                   15587    21.81
 culex166f03.q1k                 13962    36.17
 culex_1177_1189-1a02.w2k1177    13564    37.10
 culex26b07.p1k                   9245    35.53
 culex174d04.p1k                  8832    33.64
 J28015Ag08.q1ka                  7809    36.04
 culex180e07.p1k                  6960    36.59
 culex53a02.p1k                   5343    33.58
 ...
  • July 2007 reference (12 sequences; 7 "good"; 4 "unique"):
 file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq.dbs
 
 All seqs:
 Name                    Length  %GC
 1  culexbac1b5Ab03.q1k     1136301 34.17           
 2  culex161b01.q1k         346054  34.25            
 3  #culex166f03.q1k         13962   36.17          share almost all sequence with culex161b01.q1k & 1996bp with culexbac1b5Ab03.q1k
    subtotal(3)             1496317
 
 4  culex49c07.p1k          9245    35.53          misoriented mates at the ends; region 4979-6364(1.3Kbp) aligns to culexbac1b5Ab03.q1k 3 times
 5  culex53a02.p1k          5343    33.58          ~ 1Kbp alignments to culexbac1b5Ab03.q1k & culex161b01.q1k
 6  #culex117e02.p2kA55      3501    33.10          contained (in 2 pieces) in culexbac1b5Ab03.q1k
 7  #culex141a08.q1k         1920    33.44          contained in other seqs
    subtotal(7)             1516326
 
 8  culex180e07.p1k         6960    36.62          "CONTAINED" culexbac1b5Ab03.q1k (surrogate in WGA) 
 9  culex5c05.p1k           15587   21.81          low GC%; no alignments to NC_002978 & NC_006833; best hit is  Anopheles gambiae complete mitochondrial genome : 15363 bp (96% coverage, 86% max id)
 10 culex14h11.p1k          3350    51.73          repeat (higher GC%): good cvg of culex 18SrRNA gene ; no alignments to NC_002978 & NC_006833
 11 culex22h10.q1k          2148    54.89          repeat (higher GC%): some alignment to culex 18S rRNA ; no alignments to NC_002978 & NC_006833
 12 culex166d08.p1k         2071    55.53          repeat (higher GC%): culex 18S rRNA & 28S rRNA ; no alignments to NC_002978 & NC_006833
    total(12)               1546442
  • Sept 2008 reference
 file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq_080903.dbs
 
 1  contig000310   1136301 34.17
 2  contig000307   346054 34.25
 3  contig000311   15587  21.81
 4  contig000305   13962  36.17
 5  contig000312   9245   35.53
 6  contig000309   6967   36.63
 7  contig000306   5343   33.58
 8  contig000315   3501   33.10
 9  contig000308   3350   51.73
 10 contig000313   2148   54.89
 11 contig000314   2071   55.53
 12 contig000304   1994   33.85

NCBI

Culex quinquefasciatus

  • Taxonomy:
 * Culex pipiens complex 
   * Culex australicus   
   * Culex pipiens (house mosquito)    1(project)
         o Culex pipiens molestus   
         o Culex pipiens pallens   
         o Culex pipiens pipiens (northern house mosquito)    
   * Culex pipiens x Culex quinquefasciatus   
   * Culex quinquefasciatus (southern house mosquito)    1(project)
 * Wolbachia Lineage: 
   root; cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Wolbachieae; Wolbachia
 * Wolbachia phage WO
   http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=112596
   http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=6723230
   up to 2K alignments at ~90%id of our genome to this virus
 SEQ_LIB_ID                                            SIZE    STDEV    CENTER_NAME    TYPE            COUNT   PERCENT
 
 1099499586718                                         9000    2700    TIGR_JCVIJTC    WGS             15349   0.21
 1099522705601                                         3500    1050    TIGR_JCVIJTC    WGS             16116   0.22
 1099641499000                                         33000   9900    TIGR_JCVIJTC    WGS             768     0.01
 
 G766BES1                                              120000  .       WIBR            WGS             100434  1.37   BE
 
 G771K1                                                5000    500     WIBR            CLONEEND        51540   0.7
 G772K1                                                5000    500     WIBR            CLONEEND        25314   0.34
 G809K1                                                2000    200     WIBR            CLONEEND        29949   0.41
 G810K1                                                2000    200     WIBR            CLONEEND        2295    0.03
 
 G818F1                                                40000   4000    WIBR            WGS             437994  5.96
 G818F2                                                40000   4000    WIBR            WGS             8505    0.12
 G818P1                                                4000    400     WIBR            WGS             580557  7.89
 G818P2                                                4000    400     WIBR            WGS             1091326 14.84
 G818P3                                                4000    400     WIBR            WGS             350523  4.77
 G818P4                                                4000    400     WIBR            WGS             1017105 13.83
 
 L31420P2                                              5000    .       WIBR            SHOTGUN         2259    0.03
 L31422P1                                              4000    .       WIBR            SHOTGUN         3766    0.05
 L31424P2                                              5000    .       WIBR            SHOTGUN         2226    0.03
 L31425P1                                              4000    .       WIBR            SHOTGUN         3817    0.05
 L31426P1                                              4000    .       WIBR            SHOTGUN         2274    0.03
 L31427P1                                              4000    .       WIBR            SHOTGUN         2273    0.03
 L31428P1                                              4000    .       WIBR            SHOTGUN         3045    0.04
 L31429P1                                              4000    .       WIBR            SHOTGUN         3034    0.04
 L31430P1                                              4000    .       WIBR            SHOTGUN         2261    0.03
 L31431P1                                              4000    .       WIBR            SHOTGUN         2918    0.04
 L31432P1                                              4000    .       WIBR            SHOTGUN         2947    0.04
 L31433P1                                              4000    .       WIBR            SHOTGUN         2251    0.03
 L31435P1                                              4000    .       WIBR            SHOTGUN         2987    0.04
 L31439P1                                              4000    .       WIBR            SHOTGUN         2292    0.03
 L31440P1                                              4000    400     WIBR            SHOTGUN         1478    0.02
 L31440P1                                              4000    .       WIBR            SHOTGUN         2281    0.03
 L31441P1                                              4000    .       WIBR            SHOTGUN         2297    0.03
 L31444P2                                              5000    .       WIBR            SHOTGUN         2241    0.03
 L31446P2                                              5000    .       WIBR            SHOTGUN         2278    0.03
 L31448P1                                              4000    .       WIBR            SHOTGUN         3052    0.04
 L31449P1                                              4000    .       WIBR            SHOTGUN         2234    0.03
 
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_01-G-CULEX-10KB    10000   2000    TIGR_JCVIJTC    WGS             1939130 26.36
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_02-G-CULEX-4KB     4000    800     TCAG_JCVIJTC    WGS             119990  1.63
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_02-G-CULEX-4KB     4000    800     TIGR_JCVIJTC    WGS             213407  2.9
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_03-F-CULEX-40KB    40000   8000    TCAG_JCVIJTC    WGS             2405    0.03
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_03-F-CULEX-40KB    40000   8000    TIGR_JCVIJTC    WGS             101370  1.38
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_04-F-CULEX-40KB    40000   8000    TCAG_JCVIJTC    WGS             16126   0.22
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_04-F-CULEX-40KB    40000   8000    TIGR_JCVIJTC    WGS             22134   0.3
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_05-F-CULEX-40KB    40000   8000    TIGR_JCVIJTC    WGS             51281   0.7
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_06-G-CULEX-10KB    11000   2200    TIGR_JCVIJTC    WGS             992283  13.49
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_07-G-CULEX-10KB    9000    1800    TIGR_JCVIJTC    WGS             106326  1.45
 
 .       .       .                                                     WIBR            OTHER           229     0
 .       .       .                                                     WIBR            PCR             228     0
 .       .       .                                                     WIBR            TRANSPOSON      8096    0.11
 
 Total                                                                                                 7354992 100

 CENTER_NAME     TRACE_TYPE_CODE         COUNT           PERCENT
 
 WIBR            WGS                     3586444         48.76
 TIGR_JCVIJTC    WGS                     3458164         47.02
 TCAG_JCVIJTC    WGS                     138521          1.88
 WIBR            CLONEEND                109098          1.48
 WIBR            SHOTGUN                 54211           0.74
 WIBR            TRANSPOSON              8096            0.11
 WIBR            OTHER                   229             0
 WIBR            PCR                     228             0
 
 Total                                   7354992         100


Broad:

JCVI:

Articles

Other Strains (complete)

                                                                     RefSeq 	GenBank 	Pub 	Length (Mbp) 	GC 	Prot 	RNAs
 Wolbachia endosymbiont of Drosophila melanogaster(TIGR)             NC_002978 	AE017196 	1 	1.267782 	35.2% 	1195 	39
 Wolbachia endosymbiont strain TRS of Brugia malayi strain wMel(NEB) NC_006833 	AE017321 	1 	1.080084 	34.2% 	805 	37
 Wolbachia pipientis wPip(Sanger)                                    NC_010981 	AM999887 	1 	1.482455 	34.2% 	1275 	37  # 1386 CDSs (Sanger article 2008)
 # several others @ JCVI, Sanger ...

!!! Wolbachia pipientis wPip(Sanger) = culex161b01.q1k(346,054) + N(102) + culexbac1b5Ab03.q1k(1,136,301-2)

 $ cat NC_010981.gb | grep '\.\.' | egrep -v 'anticodon|source' | awk '{print $1}' | count.pl
 #       total
 gene    1423
 CDS     1275
 tRNA    34
 rRNA    3
 
 $ cat /fs/szasmg2/Culex_pipiens_symbiont/NCBI/NC_010981.gb | grep -c "\/pseudo"
 110
 
 1275+34+3+110=1422
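
The feature tally above relies on count.pl, a local script that is not shown on this page. A minimal sketch of the same count using only standard tools, assuming count.pl simply counts how often each key occurs; the function name count_features is made up for illustration, and the output order may differ from count.pl:

```shell
# Hypothetical helper: tally feature keys in a GenBank flat file.
# Assumption: count.pl only counts identical keys; this is not the original script.
count_features() {
    # keep lines that carry a location (contain ".."), drop the source
    # feature and /anticodon qualifier lines, then count each feature key
    grep '\.\.' "$1" \
        | grep -E -v 'anticodon|source' \
        | awk '{print $1}' \
        | LC_ALL=C sort \
        | uniq -c \
        | awk '{print $2 "\t" $1}'
}

# usage: count_features NC_010981.gb
```

As with the original pipeline, any qualifier line that happens to contain ".." would be counted as a feature, so the totals are worth checking against the gene/CDS numbers in the .ptt file.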

Read Counts

 query_tracedb "query count SPECIES_CODE='CULEX PIPIENS QUINQUEFASCIATUS'"                                #  7552113  : all traces 
 query_tracedb "query count SPECIES_CODE='CULEX PIPIENS QUINQUEFASCIATUS' AND load_date >='09/01/2007'"   #  172799   : new traces (all cDNA)

Assembly

Locations:

 /fs/szasmg2/Culex_pipiens_symbiont/

2006_1226_WGA

initial assembly

Steps:

 1. All cpqg reads have been downloaded from the TA (July 2006). The reads have been grouped by libraries and the clear range has been computed. There were 6.6M reads in the download compared with 7.3M now.  Unfortunately, I noticed this difference only at the end of the experiment.
 2. The Wolbachia endosymbiont of Culex quinquefasciatus assembly has  been downloaded from the Sanger ftp site ( ftp://ftp.sanger.ac.uk/pub/pathogens/Wolbachia/Wb_Cq.dbs ) ; there are 95 sequences in this file. Most of them are very short. Below are listed the name,length & gc% of the longest 10: 
    name                        length(bp)   gc%
    culex173d08.p1k               1457497    34.17
    culexbac1d10Bg07.p1k            24726    35.11
    culex3d09.p1k                   15587    21.81
    culex166f03.q1k                 13962    36.17
    culex_1177_1189-1a02.w2k1177    13564    37.10
    culex26b07.p1k                   9245    35.53
    culex174d04.p1k                  8832    33.64
    J28015Ag08.q1ka                  7809    36.04
    culex180e07.p1k                  6960    36.59
    culex53a02.p1k                   5343    33.58
 3. The cpqg random reads (clr only) have been aligned to symbiont sequences using nucmer (default parameters)
 4. The nucmer output has been analyzed. Many of the short symbiont sequences (2-3KB in length) have a higher than expected number of alignments. To avoid the repeats, I selected only the reads that aligned to the 10 longest symbiont sequences (see above).
 5. A threshold of 95% identity and a minimum alignment length of 400 bp has been used to determine the symbiont reads. There were 29,110 unique reads (30,690 reads+mates) selected. Below is a per library breakdown (reads+mates):
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_01-G-CULEX-10KB      9581
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_06-G-CULEX-10KB      4549
    G818P4                                                  3784
    G818P2                                                  3478
    G818P1                                                  2238
    G818F1                                                  1283
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_02-G-CULEX-4KB       1156
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_03-F-CULEX-40KB      738
    G818P3                                                  723
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_07-G-CULEX-10KB      556
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_05-F-CULEX-40KB      327
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_04-F-CULEX-40KB      185
    1099522705601                                           99
    G809K1                                                  89 : cDNA , should be removed
    1099499586718                                           77
    G772K1                                                  12 : cDNA , should be removed
    G771K1                                                  10 : cDNA , should be removed
    G766BES1                                                4 :  BE library
    1099641499000                                           2
 6. The reads have been assembled using the runCA-OBT.pl script (default parameters).  Most of the reads got assembled into 3 large scaffolds. There is mate pair evidence (outie mates) that the largest scaffold is circular. 
 All the scaffolds end up in surrogates (20-50KB total surrogate length)
 Are there not enough BE to span the unique regions?  
 Cpqg.qc
 scaff_8 Longest scaff
 scaff_9 2nd longest scaff
 scaff_7 3rd longest scaff
 scaff_6 Small scaff that Looks circular
 7. The scaffolds/contigs have been aligned to the 10 longest Wolbachia endosymbiont sequences. Most of the long alignments were at over 99% identity. However, several large rearrangements have been noticed.
 Wb_Cq-vs-scaff Reference vs scaff
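
The 95% identity / 400 bp read selection in step 5 can be sketched as a filter over tab-delimited alignment coordinates. The column layout assumed here (S1 E1 S2 E2 LEN1 LEN2 %IDY REF QRY, i.e. what show-coords -T -H is expected to print without extra flags), the use of the read-side length LEN2, and the function name select_reads are all assumptions; check them against the actual coords file before relying on the output:

```shell
# Hypothetical helper: list the reads that pass the identity/length thresholds.
# stdin : tab-delimited alignments, assumed columns
#         S1 E1 S2 E2 LEN1 LEN2 %IDY REF QRY
# $1    : minimum % identity (default 95)
# $2    : minimum alignment length on the read, in bp (default 400)
# stdout: unique read ids, one per line
select_reads() {
    awk -F'\t' -v min_id="${1:-95}" -v min_len="${2:-400}" \
        '$7 >= min_id && $6 >= min_len { print $9 }' \
        | LC_ALL=C sort -u
}

# usage (file names are placeholders):
#   show-coords -T -H reads.delta | select_reads 95 400 > symbiont.reads
```

The mates of the selected reads still have to be added in a separate step, as in the reads+mates counts above.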

2007_0802_WGA-default

new assembly

Steps:

 1. All Culex reads have been downloaded from TA . ~1M new reads since 2006_1226
 2. The reads have been aligned to the new reference (exclude mito,repeats) using nucmer (default parameters)
 3. A threshold of 95% identity and a minimum alignment length of 400 bp has been used to determine the symbiont reads.  3850 new reads & mates in addition to the previous ones were identified
 4. 33,783 reads have been assembled using the runCA-OBT.pl script (default parameters).  
 Cpqg.qc
 Compared to the initial assembly, many metrics went down (TotalBasesInScaffolds,MaxBasesInScaffolds,MaxContigLength ...)
 TotalSurrogates & SurrogateInstances more than doubled

2007_0802_WGA-0.5E

error rate =0.5 % => more fragmented assembly

2007_0802_WGA-0.5M

genome size=1.5M => more TotalBasesInScaffolds but more unhappy mates

What to do next?

  • use CA 5.1 (latest version)
  • remove 958 cDNA's aligned to culex*
  • increase utg error rate from 1.5% to 2% (3% gave worse results than 2%)
  • recruit reads that align to contig ends: some ends are repetitive => too many; others no alignments
  • use 2 other complete strains; only 2 new aligned reads were identified
  • AMOScmp new reference => more unhappy mates than before
  • dropping the min Astat from 1 to -1 made some degens into placed ctgs; did not improve overall stats
  • separate JCVI & WIBR reads, assemble separately => 5 obvious alignment breaks
  • use only the reads from lib with insert size <=11Kbp => more fragmented
  • use only the reads that aligned to the top2 Sanger ctgs (36606 instead of 36767)


Reads aligned to the 7 Sanger sequences:

 CENTER          STRATEGY        COUNT   PERCENTAGE
 TIGR_JCVIJTC    WGS             18268   48.97
 WIBR            WGS             17316   46.42          # 155 BE align but mostly at 80-90% id, only 4 at >=95% id, >=400bp
 WIBR            CLONEEND(CDNA)  884     2.37           # about avg 1.48
 TCAG_JCVIJTC    WGS             815     2.18           
 WIBR            SHOTGUN         20      0.05
 total                           37303   100            
 
 total+mates                     39027 (37724 in .frg file)
 wgs+mates                       38069 (36767 in .frg file)  # 302 BE

2008_0829_WGA-wgs

e=1.5

 [Top5Scaffolds=contigs,size,span,avgContig,avgGap]
 0=3,1363559,1364974,454520,708
 1=1,53307,53307,53307,0
 2=1,28821,28821,28821,0
 3=2,23208,23528,11604,320
 4=1,8315,8315,8315,0
 total=8,1477210,1478945,184651,578

2008_0829_WGA-wgs e=2.0 -> best

Assembly description

  1. The Wolbachia pipientis endosymbiont of Culex quinquefasciatus assembly was downloaded from the Sanger web site (July 2007 version: 12 contigs)
  2. 5 of the 12 contigs were discarded due to their high GC% or repetitive content; 7 contigs were kept to be used for sequence alignments
  3. The NCBI TA Culex quinquefasciatus traces were downloaded locally (Sept 2007: 7,379,314 total Sanger traces; 7,183,129 WGS)
  4. The WGS traces were aligned to the 7 reference contigs using nucmer (default parameters: minimum 65bp length, 80% identity)
  5. The traces that aligned and their mates were extracted and formatted as input for Celera Assembler (36,767 total traces; 35,750 mated, 1,017 unmated)
  6. The traces were assembled with CA (wgs-5.1) (default parameters except for a unitigger error rate of 2%)
  7. The assembler generated 16 scaffolds, 21 contigs and 92 degenerates;
  8. 5 scf, 10 ctg & 11 deg were filtered based on their "uniqueness"; 2 of the scaffolds contain multiple contigs
  9. There are 2 unique regions in reference not present in this genome (NC_010981.1: 775928-776047 120bp; 1253284-1254139 856bp)
  10. There are 4 unique regions (~ 500bp each ) in this genome not present in the reference sequence/assembly: ctg7180000001230_202_725, ctg7180000001305_11303_11867, ctg7180000001305_13367_14006, deg7180000001252_328_851
  11. 10 large scale rearrangements


Comments

  1. wgs-5.2-beta generated the same results
  2. modifying astatLowBound, astatHighBound did not result in better assembly
Cpqg.qc
 all:                                                                16 scf, 21 ctg , 92 deg
 ones with gc% in the 32..36 range or have Wp genes aligned to them: 11 scf, 16 ctg , 41 deg:
 filtered(submission)                                                5 scf, 10 ctg,  11 deg
 [Top5Scaffolds=contigs,size,span,avgContig,avgGap]
 0=4,1388064,1389477,347016,471
 1=3,70356,70440,23452,42
 2=1,42565,42565,42565,0
 3=1,8315,8315,8315,0
 4=1,2425,2425,2425,0
 total=10,1511725,1513222,151172,299
 top 2 scaffold size=1458420
 top 3 scaffold size=1500985
 Alignment files
 Media:NC_010981-scf.filter-q.png     Media:NC_010981-scf.filter-1.png   Media:scf-NC_010981.filter-1.png
 Media:NC_010981-ctg.filter-q.png     Media:NC_010981-ctg.filter-1.png
 Media:NC_010981-ctg-deg.filter-q.png Media:NC_010981-ctg-deg.filter-1.png
 Stats:
       #elem   min     max     mean    median  n50     sum
 scf   16      1035    1389537 95513   1501    1389537 1528210 ; 4  CONTAINED in bigger scf
 ctg   21      1035    478325  72697   1583    478325  1526633 ; 4  CONTAINED in bigger ctg
 deg   92      245     7632    1079    843     1000    99229   ; 22 CONTAINED in        ctg
 #id                     len     gc%     Wb_Cq.7 NC_gene cvg
 scf7180000001311        1389537 34.17   3744    1250    14.59 #ctg7180000001298..ctg7180000001301 
 scf7180000001316        70460   34.82   85      62      6.22  #ctg7180000001303..ctg7180000001305
 scf7180000001315        42565   34.15   72      110     13.01
 scf7180000001310        8315    35.65   5       3       4.84
 scf7180000001307        2425    62.76   0       0       2.64
 ...
 scf7180000001320        1315    35.67   3       1       20.27
 ...  
 ctg7180000001299        478325  34.12   1253    478     15.56
 ctg7180000001300        466173  34.13   1401    532     14.5
 ctg7180000001298        316943  34.05   541     343     13.99
 ctg7180000001301        126623  34.61   550     225     12.77
 ctg7180000001302        42565   34.15   72      110     13.01
 ctg7180000001305        37016   34.66   47      52      6.49
 ...
 deg7180000001279        7632    33.32   6       6       26.85  # the long degenerates have high coverage
 deg7180000001277        4159    35.18   61      72      40.3
 deg7180000001280        3685    36.88   12      18      36.67
 ...
 deg7180000001231        245     33.06   0       0       1.3
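
The #elem/min/max/mean/median/n50/sum rows in the Stats table above can be recomputed from a plain list of sequence lengths. A rough sketch, assuming n50 is the length at which the cumulative sum, taken from the longest sequence down, first reaches half of the total, and that mean and median are rounded to whole numbers; len_stats is a made-up name and not the script that produced the table:

```shell
# Hypothetical helper: summary statistics for a list of sequence lengths.
# stdin : one length per line
# stdout: #elem min max mean median n50 sum (tab-delimited)
len_stats() {
    sort -n | awk '
        NF { n++; len[n] = $1; sum += $1 }      # skip blank lines
        END {
            if (n == 0) exit
            # median: middle value, or the average of the two middle values
            median = (n % 2) ? len[(n + 1) / 2] : (len[n / 2] + len[n / 2 + 1]) / 2
            # n50: walk from the longest sequence down until half the total is reached
            for (i = n; i >= 1; i--) {
                acc += len[i]
                if (2 * acc >= sum) { n50 = len[i]; break }
            }
            printf "%d\t%d\t%d\t%.0f\t%.0f\t%d\t%d\n", n, len[1], len[n], sum / n, median, n50, sum
        }'
}

# usage: the five filtered scaffold lengths
#   printf '1389537\n70460\n42565\n8315\n1361\n' | len_stats
```

For the five filtered scaffolds the largest one alone holds more than half of the bases, so n50 equals the maximum length, as in the scf row above.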

Filtering

Steps:

  1. Align scaffold & degenerates to top3 ref ctgs culex161b01.q1k(346,054)+culexbac1b5Ab03.q1k(1,136,301)+culex49c07.p1k(9,245) using nucmer;
  2. Filter alignments using "delta-filter -r"
  3. Remove CONTAINED scf & deg
  4. add scf & deg that contain UNIQUE seq & not in the list: scf7180000001309(1,361) & deg7180000001252(937)
  5. order & orient ctgs
 #id                     len     gc%     Wb_Cq.7 NC_gene cvg     contained
 ctg7180000001230        1361    37.99   17      5       3.03    N
 ctg7180000001248        8315    35.65   5       3       4.84    N
 ctg7180000001298        316943  34.05   541     343     13.99   N
 ctg7180000001299        478325  34.12   1253    478     15.56   N
 ctg7180000001300        466173  34.13   1401    532     14.5    N
 ctg7180000001301        126623  34.61   550     225     12.77   N
 ctg7180000001302        42565   34.15   72      110     13.01   N
 ctg7180000001303        29919   34.92   35      52      6.29    N
 ctg7180000001304        3421    35.19   3       2       2.83    N
 ctg7180000001305        37016   34.66   47      52      6.49    N
 
 scf7180000001309        1361    37.99   17      5       3.03    N
 scf7180000001310        8315    35.65   5       3       4.84    N
 scf7180000001311        1389537 34.17   3744    1250    14.59   N # origin of replication at pos 1,127,008 (-)
 scf7180000001315        42565   34.15   72      110     13.01   N
 scf7180000001316        70460   34.82   85      62      6.22    N
 
 deg7180000001236        2346    34.02   4       5       35.34   N
 deg7180000001244        3090    33.66   8       10      33.09   N
 deg7180000001252        937     36.29   4       5       1.37    N
 deg7180000001256        1888    32.42   110     49      19.98   N
 deg7180000001260        1198    37.73   5       3       13.1    N
 deg7180000001266        2375    32.80   55      49      28.4    N
 deg7180000001272        2923    32.91   54      46      35.26   N
 deg7180000001277        4159    35.18   61      72      40.3    N
 deg7180000001279        7632    33.32   6       6       26.85   N
 deg7180000001280        3685    36.88   12      18      36.67   N
 deg7180000001290        1879    31.40   10      5       33.06   N
  
 => 10 ctgs (5 scaff) & 11 deg
 .scaff file
 >7180000001309 1 1365 1364
 7180000001230 BE 1365 0
 
 >7180000001310 1 8319 8318
 7180000001248 BE 8319 0
 
 >7180000001311 4 1404424 1405836
 7180000001298 BE 320424 -19
 7180000001299 BE 484180 1434
 7180000001300 BE 471690 1
 7180000001301 BE 128130 0
 
 >7180000001315 1 42888 42887
 7180000001302 BE 42888 0
 
 >7180000001316 3 70739 70822
 7180000001303 BE 30041 1
 7180000001304 BE 3421 85
 7180000001305 BE 37277 0

Reference sequence not present in the assembly

100+ bp 0cvg regions in the reference:

 1. culexbac1b5Ab03.q1k     429746  429972  226     0
 2. culexbac1b5Ab03.q1k     907129  908079  950     0
 
 1. NC_010981.1             775928  776047  120     0
 2. NC_010981.1             1253284 1254139 856     0
 1.1. NC_010981.1     RefSeq  gene    775763  777826  .       +       .     contains  GeneID:6385213 # WP0709 Putative outer membrane protein
 1.2. NC_010981.1     RefSeq  gene    1252115 1253287 .       +       .     begin     GeneID:6385392 # tuf translation elongation factor tu (2 in Sanger wPip, none in Dan's annotation)
 1.3. NC_010981.1     RefSeq  gene    1253302 1253622 .       +       .     contained GeneID:6385310 # rpsJ 30s ribosomal protein s10 ??? missing;                                                                                                     
                                                                                                     # very conserved in Wolbachia endosymbiont of Drosophila melanogaster
 2.1. NC_010981.1     RefSeq  gene    1253632 1254354 .       +       .     end       GeneID:6385679 # rplC ribosomal protein L3 (partially present in Dan's annotation)
                                                                                                     # very conserved in several species : Wolbachia, Ehrlichia ...
 No promer alignments of sequences to these regions
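
The 0cvg regions listed above can be found by sweeping the sorted alignment intervals along the reference and reporting the uncovered stretches. A minimal sketch, assuming a single reference sequence, 1-based inclusive start/end pairs on stdin, and inclusive gap lengths (end - start + 1); zero_cov is a hypothetical name and this is not the script that produced the list:

```shell
# Hypothetical helper: report reference regions with no alignment coverage.
# $1    : reference length
# $2    : minimum gap length to report (default 100)
# stdin : "start end" per line, reference coordinates, 1-based inclusive
# stdout: gap_start gap_end gap_length
zero_cov() {
    sort -k1,1n -k2,2n | awk -v reflen="$1" -v min="${2:-100}" '
        BEGIN { covered = 0 }                   # rightmost covered position so far
        {
            gap = $1 - covered - 1              # uncovered bases before this interval
            if (gap >= min) print covered + 1, $1 - 1, gap
            if ($2 > covered) covered = $2
        }
        END {
            tail = reflen - covered             # uncovered bases at the end of the reference
            if (tail > 0 && tail >= min) print covered + 1, reflen, tail
        }'
}

# usage (toy intervals on a 2000 bp reference):
#   printf '1 500\n400 900\n1050 1500\n1450 1700\n' | zero_cov 2000 100
```

On a circular genome a gap reported at the very start or end of the reference may just reflect where the sequence was linearized, so those two cases need a manual look.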

Assembly sequence not present in the reference

100+ bp 0cvg regions in the assembly:

    ctg_start_stop               len    gc%    comments
 1. ctg7180000001230_202_725     524    37.50  # first 202bp have multiple alignments to NC_010981
                                               # bases 203..725 have no alignments to NC_010981
                                               # this contig used to contain cloning vector at the 3' end, which was removed (725..1160)
                                               # scf7180000001309  
 2. ctg7180000001305_11303_11867 565    33.45  # aligns at 100%len, 100%id to Wolbachia endosymbiont of Drosophila melanogaster, complete genome; NC_002978.6:243504..243803 
                                               # 11070..11735  Putative dna repair protein radc [Wolbachia pipientis] (Dan's annotation) 
                                               # scf7180000001316 3 70739 70822
 3. ctg7180000001305_13367_14006 640    33.91  # good blastx alignment to Wolbachia gene on 100% length; NC_002978.6:488974..489912
 4. deg7180000001252_328_851     524    35.88  # blastx align to We of Bm NC_006833.1:754520..755170
 
 no alignments to Sanger raw reads

Others: might be contaminated?

 5. ctg7180000001257            1378    32.80  # 313:1378  Culex pipiens LINE repeat !!!
                                               # 537..1379 reverse transcriptase [Bacteroides thetaiotaomicron VPI-5482] (Dan's annotation) 
 6. ctg7180000001285            1568   37.37   # GC% higher than avg 
                                               # 125..700  transcriptional regulator, XRE family [Thermotoga lettingae TMO] (Dan's annotation) 
                                               # 984..1571 putative outer membrane protein probably involved in nutrient binding [Bacteroides fragilis YCH46] (Dan's annotation) 

ORFs

 1.1 ctg7180000001230:orf00001 ctg7180000001230 -1 262 +2        transposase, IS256 family [Wolbachia endosymbiont of Drosophila melanogaster]
 $ cat NC_010981.ptt | grepi -c transposase
 80
 1.2 ctg7180000001230:orf00002 ctg7180000001230 1277 840 -3      blast e-val:3e-84 chloramphenicol acetyltransferase [Salmonella enterica subsp. enterica serovar Typhi str. CT18] not in NC_010981 !!!
                                                                 cloning vector !!!
 >ctg7180000001230:orf00002  ctg7180000001230  1277 840  len=438
 ATGGCAATGAAAGACGGTGAGCTGGTGATATGGGATAGTGTTCACCCTTGTTACACCGTT
 TTCCATGAGCAAACTGAAACGTTTTCATCGCTCTGGAGTGAATACCACGACGATTTCCGG
 CAGTTTCTACACATATATTCGCAAGATGTGGCGTGTTACGGTGAAAACCTGGCCTATTTC
 CCTAAAGGGTTTATTGAGAATATGTTTTTCGTCTCAGCCAATCCCTGGGTGAGTTTCACC
 AGTTTTGATTTAAACGTGGCCAATATGGACAACTTCTTCGCCCCCGTTTTCACCATGGGC
 AAATATTATACGCAAGGCGACAAGGTGCTGATGCCGCTGGCGATTCAGGTTCATCATGCC
 GTTTGTGATGGCTTCCATGTCGGCAGAATGCTTAATGAATTACAACAGTACTGCGATGAG
 TGGCAGGGCGGGGCGTAA

 >ctg7180000001230:orf00002  ctg7180000001230  1277 840  len=438
 MAMKDGELVIWDSVHPCYTVFHEQTETFSSLWSEYHDDFRQFLHIYSQDVACYGENLAYF
 PKGFIENMFFVSANPWVSFTSFDLNVANMDNFFAPVFTMGKYYTQGDKVLMPLAIQVHHA
 VCDGFHVGRMLNELQQYCDEWQGGA*

 >gi|18466598|ref|NP_569406.1| chloramphenicol acetyltransferase [Salmonella enterica subsp. enterica serovar Typhi str. CT18]
 MEKKITGYTTVDISQWHRKEHFEAFQSVAQCTYNQTVQLDITAFLKTVKKNKHKFYPAFIHILARLMNAH
 PEFRMAMKDGELVIWDSVHPCYTVFHEQTETFSSLWSEYHDDFRQFLHIYSQDVACYGENLAYFPKGFIE
 NMFFVSANPWVSFTSFDLNVANMDNFFAPVFTMGKYYTQGDKVLMPLAIQVHHAVCDGFHVGRMLNELQQ
 YCDEWQGGA
 2. ctg7180000001305:orf00013 ctg7180000001305 11070 11735 +3   DNA repair protein RadC, putative [Wolbachia endosymbiont of Drosophila melanogaster]
 
 # 3 copies in NC_010981
 $ cat NC_010981.ptt | grepi RadC
 280207..280863  -       218     190570723       -       WP0276  -       -       Putative dna repair protein radc
 488966..489634  -       222     190570883       -       WP0459  -       -       Putative dna repair protein radc
 1418058..1418726        -       222     190571715       -       WP1343  -       -       Putative dna repair protein radc


 3. ctg7180000001305:orf00015 ctg7180000001305 13185 14093 +3   transcriptional regulator, putative [Wolbachia endosymbiont of Drosophila melanogaster]
 
 # 10 copies
 cat NC_010981.ptt | grepi "transcriptional regulator"
 247653..248570  -       305     190570687       -       WP0239  -       -       Putative transcriptional regulator
 277056..277895  -       279     190570720       -       WP0273  -       -       Putative transcriptional regulator
 277921..278835  -       304     190570721       -       WP0274  -       -       Putative transcriptional regulator
 281034..281954  -       306     190570724       -       WP0277  -       -       Putative transcriptional regulator
 296912..297466  -       184     190570733       -       WP0290  -       -       Putative transcriptional regulator
 486388..487365  -       325     190570881       -       WP0457  -       -       Putative transcriptional regulator
 630467..631237  +       256     190570997       -       WP0585  -       -       two component transcriptional regulator
 806511..806837  +       108     190571141       -       WP0739  -       -       Putative transcriptional regulator, MerR family
 1129005..1129301        +       98      190571445       -       WP1058  -       -       Putative transcriptional regulator
 1415480..1416457        -       325     190571713       -       WP1341  -       -       putative transcriptional regulator
 4. ?
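
The copy counts above come from case-insensitive greps on the .ptt file (`grepi` is presumably a local alias for `grep -i`; that is an assumption). A rough Python equivalent that matches only the Product column, so a keyword cannot hit a locus tag or header line by accident. The sample rows are copied from the listings above and the column order is the standard 9-column .ptt layout.

```python
import re

# A few .ptt rows from the listings above (tab-separated, 9 columns:
# Location, Strand, Length, PID, Gene, Synonym, Code, COG, Product)
SAMPLE_PTT = "\n".join(
    "\t".join(row)
    for row in [
        ("280207..280863", "-", "218", "190570723", "-", "WP0276", "-", "-",
         "Putative dna repair protein radc"),
        ("488966..489634", "-", "222", "190570883", "-", "WP0459", "-", "-",
         "Putative dna repair protein radc"),
        ("1418058..1418726", "-", "222", "190571715", "-", "WP1343", "-", "-",
         "Putative dna repair protein radc"),
        ("630467..631237", "+", "256", "190570997", "-", "WP0585", "-", "-",
         "two component transcriptional regulator"),
        ("806511..806837", "+", "108", "190571141", "-", "WP0739", "-", "-",
         "Putative transcriptional regulator, MerR family"),
    ]
)

LOCATION = re.compile(r"^\d+\.\.\d+$")


def products_matching(ptt_text, keyword):
    """Return (synonym, product) pairs whose product contains keyword."""
    hits = []
    for line in ptt_text.splitlines():
        fields = line.rstrip("\n").split("\t")
        # Skip title/header lines: only data rows start with start..end
        if len(fields) < 9 or not LOCATION.match(fields[0]):
            continue
        if keyword.lower() in fields[8].lower():
            hits.append((fields[5], fields[8]))
    return hits


radc_hits = products_matching(SAMPLE_PTT, "RadC")
regulator_hits = products_matching(SAMPLE_PTT, "transcriptional regulator")
```

On the full NC_010981.ptt the same call would be made on the file contents, e.g. `products_matching(open("NC_010981.ptt").read(), "transposase")`.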

Repeats

  • No alignments of ctg/deg to RepeatMaskerLib
  • Tandem repeats
    • Minisatellite copy-number variation can be used to genotype bacterial strains
 $ show-coords NC_010981.trf-wPip.trf.filter-1.delta | grep -f NC_010981.trf-wPip.trf.filter-1.qry_diff
      1      100  |      126       27  |      100      100  |   100.00  |      100      208  |   100.00    48.08  | 34.18.100  98.54.208       [CONTAINED]
      1      122  |      122        1  |      122      122  |   100.00  |      122      208  |   100.00    58.65  | 35.54.122  98.54.208       [CONTAINED]
      1      273  |        1      273  |      273      273  |   100.00  |      280      355  |    97.50    76.90  | 68.75.280  75.75.355       [CONTAINED]


 $ infoseq wPip.trf.fasta    | grep -f NC_010981.trf-wPip.trf.filter-1.qry_diff
 60.65.213      213    38.97
 75.75.355      355    34.08
 98.54.208      208    45.19
 99.18.146      146    43.84
  • RepeatScout pipeline summary
 10 ctg+11 degen
                 #elem   min     max     mean    median  n50     sum
 families        90      67      7071    678     358     989     61024
 repeats         465     30      7071    631     378     984     293381
 uniq            248     68      46556   5071    1549    16001   1257709
 NC_010981
                 #elem   min     max     mean    median  n50     sum
 families        51      71      5779    726     307     1360    37012
 repeats         304     31      5779    682     548     989     207331
 uniq            199     68      58998   6384    2512    16001   1270466
  • !!! more repeats in our assembly
  • Comparison of the longest repeats (our strain vs Sanger strain):
 $ cd /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/RepeatScout
 $ sort -nk2 -r wPip-NC_010981.families.infocount
 fam     len     gc%     #ref    #qry
 12      7071    36.18   5       2         # repeat family 12 has 5 copies in our assembly and 2 copies in NC_010981
 57      6770    35.05   4       2
 77      3129    34.48   4       5
 6       2461    35.11   3       2
 87      1468    34.81   4       2
 26      1399    35.67   0       0
 1       1346    36.18   2       2
 2       1345    38.74   33      31
 60      1097    39.56   3       5
  • there are differences in the copy numbers
  • there is no frequent repeat present in one genome but not in the other
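
The two observations above (copy numbers differ; no frequent family is exclusive to one genome) can be checked directly from the family table. A minimal sketch using the nine rows shown; the column order follows the comment on family 12 (our assembly first, NC_010981 second), and the `min_copies` threshold for "frequent" is an arbitrary choice for illustration.

```python
# (family, length, copies in our assembly, copies in NC_010981)
FAMILIES = [
    (12, 7071, 5, 2),
    (57, 6770, 4, 2),
    (77, 3129, 4, 5),
    (6, 2461, 3, 2),
    (87, 1468, 4, 2),
    (26, 1399, 0, 0),
    (1, 1346, 2, 2),
    (2, 1345, 33, 31),
    (60, 1097, 3, 5),
]


def differing(families):
    """Families whose copy number differs between the two genomes."""
    return [fam for fam, _len, ours, ref in families if ours != ref]


def exclusive(families, min_copies=2):
    """Families present in only one genome with at least min_copies there."""
    return [
        fam
        for fam, _len, ours, ref in families
        if (ours == 0) != (ref == 0) and max(ours, ref) >= min_copies
    ]


diff_families = differing(FAMILIES)
exclusive_families = exclusive(FAMILIES)
```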

Multiple copies in reference

Multiple copies in assembly

Snps

Rearrangements

  • ~ 10 rearrangements
  • some rearrangements are associated with IS elements: the 20-copy 1.3 kb repeats belong to "12 IS5 (IS256-family)", a transposase gene

Improving strategy

Identify more reads that align to those regions (blastn TA):

 236 : all
 198 : new
 395 : new+mates

Adding these reads did not improve the assembly.

 gi|42519920|ref|NC_002978.6|    243504  243803 # ctg7180000001305_11303_11867 565    33.45  : there are reads aligned to 241173-243822
 gi|42519920|ref|NC_002978.6|    488974  489912 # ctg7180000001305_13367_14006 640    33.91  : there are reads aligned to 487182-492517
 gi|58584261|ref|NC_006833.1|    754520  755170 # deg7180000001252_328_851     524    35.88  : there are reads aligned to 754300-755148 

All the reads aligned to the 3 regions above have been assembled; the 3 regions seem to contain rearrangements.
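
To see how much of each region the read alignments actually span, a small sketch that compares each region with the read-aligned interval noted next to it. Coordinates are assumed to be inclusive; the intervals are copied from the three lines above.

```python
# (label, region on the reference, interval with aligned reads)
REGIONS = [
    ("ctg7180000001305_11303_11867", (243504, 243803), (241173, 243822)),
    ("ctg7180000001305_13367_14006", (488974, 489912), (487182, 492517)),
    ("deg7180000001252_328_851", (754520, 755170), (754300, 755148)),
]


def covered_fraction(region, reads):
    """Fraction of region (inclusive coords) overlapped by the reads interval."""
    r_start, r_end = region
    a_start, a_end = reads
    overlap = min(r_end, a_end) - max(r_start, a_start) + 1
    if overlap <= 0:
        return 0.0
    return overlap / (r_end - r_start + 1)


fractions = {label: covered_fraction(reg, aln) for label, reg, aln in REGIONS}
```

By these numbers the first two regions fall entirely inside their read-aligned intervals, while the interval for deg7180000001252 stops 22 bp before the end of the region.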

Files & Directories

  • Wolbachia pipientis, complete genome
  /fs/szasmg2/Culex_pipiens_symbiont/NCBI/NC_010981.fna
  • qc file
  /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.qc
  • AMOS bank
  /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.bnk/
  • nucmer alignment files : assembly scaffolds/contigs/degenerates/unitigs vs the reference genome
  *.filter-q.* were generated using "delta-filter -q"
  *.filter-1.* were generated using "delta-filter -1"
  /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/nucmer/NC_010981-*delta
 Filtered scaffolds:
 #id                   len     gc%     Wb_Cq.7 NC_gene cvg
 scf7180000001311      1389537 34.17   3744    1250    14.59   *
 scf7180000001316      70460   34.82   85      62      6.22
 scf7180000001315      42565   34.15   72      110     13.01   *
 scf7180000001310      8315    35.65   5       3       4.84
 scf7180000001319      1501    36.64   3       1       7.9
 scf7180000001312      1378    32.80   0       0       1.62
 scf7180000001309      1361    37.99   17      5       3.03
 scf7180000001320      1315    35.67   3       1       20.27
 scf7180000001317      1173    34.53   3       1       2.18
 scf7180000001318      1115    36.41   2       3       2.32
 scf7180000001321      1035    34.01   3       1       3.4
 ctg7180000001299      478325  34.12   1253    478     15.56   *
 ctg7180000001300      466173  34.13   1401    532     14.5    *
 ctg7180000001298      316943  34.05   541     343     13.99   *
 ctg7180000001301      126623  34.61   550     225     12.77   *
 ctg7180000001302      42565   34.15   72      110     13.01   *
 ctg7180000001305      37016   34.66   47      52      6.49
 ctg7180000001303      29919   34.92   35      52      6.29
 ctg7180000001248      8315    35.65   5       3       4.84
 ctg7180000001304      3421    35.19   3       2       2.83
 ctg7180000001270      1501    36.64   3       1       7.9
 ctg7180000001257      1378    32.80   0       0       1.62
 ctg7180000001230      1361    37.99   17      5       3.03
 ctg7180000001284      1315    35.67   3       1       6.83
 ctg7180000001232      1173    34.53   3       1       2.18
 ctg7180000001237      1115    36.41   2       3       2.32
 ctg7180000001297      1035    34.01   3       1       3.4
  • filtered contigs & degens
 /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.ctg-deg.filter.fasta
 /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.ctg-deg.filter.infoseq

Annotation (original)

Format annotation for NCBI submission:

 $ wc -l  cpqg.ctg.CDS cpqg.ctg.tRNA cpqg.ctg.rRNA
   1476 cpqg.ctg.CDS
     34 cpqg.ctg.tRNA
      4 cpqg.ctg.rRNA
 cat cpqg.ctg.CDS   | sed 's/orf00//' | sed 's/ctg718000000//' | ~/bin/tab2annotation.pl -hl 1 -t CDS >! cpqg.ctg.CDS.tbl
 cat  cpqg.ctg.tRNA | sed 's/orf00//' | sed 's/ctg718000000//' | ~/bin/tab2annotation.pl -hl 1 -t tRNA >! cpqg.ctg.tRNA.tbl
 cat cpqg.ctg.rRNA  | sed 's/orf00//' | sed 's/ctg718000000//' | ~/bin/tab2annotation.pl -hl 1 -t rRNA >! cpqg.ctg.rRNA.tbl
 cat cpqg.ctg.CDS.tbl cpqg.ctg.tRNA.tbl  cpqg.ctg.rRNA.tbl > cpqg.ctg.tbl
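
What the two `sed` substitutions above do to an identifier, as a Python sketch. Without the `g` flag, `sed 's/x//'` removes only the first match on each line, so a second contig id later on the same line is left untouched. The example input line is hypothetical: it is modelled on the ORF listing in these notes, and the real column layout of cpqg.ctg.CDS may differ.

```python
def strip_prefixes(line):
    """Mimic: sed 's/orf00//' | sed 's/ctg718000000//' (first match only)."""
    line = line.replace("orf00", "", 1)
    line = line.replace("ctg718000000", "", 1)
    return line


# Hypothetical input line (tab-separated), based on the ORF listing above
example_in = "ctg7180000001230:orf00001\tctg7180000001230\t-1\t262\t+2"
example_out = strip_prefixes(example_in)
```

Here the leading id is shortened to `1230:001`, while the contig id in the second column keeps its full form.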

Sanger Wolbachia: far fewer genes !!!

 NC_010981.ptt
 1248 CDS
   25 tRNA : 1Leu (vs 5 in our strain) !!!
    2 rRNA

No CRISPRs found by CRISPRFinder

Annotation (revised)

  • Genes manually curated by Dan; many transposases deleted
 wc -l cpqg.ctg.CDS cpqg.ctg.tRNA cpqg.ctg.rRNA cpqg.deg.CDS
   1342 cpqg.ctg.CDS
     36 cpqg.deg.CDS
     34 cpqg.ctg.tRNA
      4 cpqg.ctg.rRNA    
   1416 total

NCBI submission

  • name: Wolbachia pipientis wPip(strain) JHB(substrain) (got this from Steven)
  • NCBI suggestion:
  [organism=Wolbachia endosymbiont of Culex quinquefasciatus JHB]
  [host=Culex quinquefasciatus JHB]
 http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=EB95B67D-199C-42C9-80CE-F2AC9C7C7A02
 Project ID: 	32209
 Locus Tag Prefix:	C1A
  • Submission dir:
 /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/submission2/
  • Submission via GenomesMacroSend;
    • Direct Submit ID: DSub8465 (1st submission)
    • Direct Submit ID: DSub8474,DSub8475 (revisions to the 1st submission)
  • TaxId: 569881
  • ABZA00000000 Genome Project
  • ABZA00000000 Project accession number
    • ctg: ABZA01000001..ABZA01000021
    • scaff: DS996929-DS996944
  • NCBI files:
 /fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA01_accs :          21 ctg & deg accession numbers ABZA01000001..ABZA01000021
 /fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA.01.modified.p2g   1378  gene accession numbers   EEB55160.. EEB56537
 /fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA01_scfld_DS_accs   16 scaffold id's
  • Future updates
    • Protein id formats to use(?):
 gnl|umiacs|C1A_1|gb|EEB55198
 gnl|WGS:ABZA|C1A_1|gb|EEB55198
  • AA AI 3900
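
A sanity check on the accession ranges listed above, as a minimal sketch: each range should contain as many accessions as the count given next to it (21 ctg and deg, 1378 genes, 16 scaffolds). The helper name is made up for illustration, and it assumes an accession is an uppercase letter prefix followed by digits.

```python
import re

ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def accession_count(first, last):
    """Number of accessions in the inclusive range first..last."""
    m_first = ACCESSION.fullmatch(first)
    m_last = ACCESSION.fullmatch(last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("accessions must share the same letter prefix")
    return int(m_last.group(2)) - int(m_first.group(2)) + 1


n_contigs = accession_count("ABZA01000001", "ABZA01000021")
n_genes = accession_count("EEB55160", "EEB56537")
n_scaffolds = accession_count("DS996929", "DS996944")
```

The gene count also agrees with the revised annotation: 1342 contig CDS plus 36 degenerate CDS is 1378.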

Article