Culex pipiens symbiont

From Cbcb

Data Sources

Sanger

Wolbachia pipientis endosymbiont of Culex quinquefasciatus

  • December 2006 reference (95 sequences):
 file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq_061226.dbs
 
 Top 10 seqs
 Name                           Length     %GC
 culex173d08.p1k               1457497    34.17
 culexbac1d10Bg07.p1k            24726    35.11
 culex3d09.p1k                   15587    21.81
 culex166f03.q1k                 13962    36.17
 culex_1177_1189-1a02.w2k1177    13564    37.10
 culex26b07.p1k                   9245    35.53
 culex174d04.p1k                  8832    33.64
 J28015Ag08.q1ka                  7809    36.04
 culex180e07.p1k                  6960    36.59
 culex53a02.p1k                   5343    33.58
 ...
  • July 2007 reference (12 sequences):
 file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq.dbs
 
 All seqs:
 Name                    Length  %GC
 1  culexbac1b5Ab03.q1k     1136301 34.17           
 2  culex161b01.q1k         346054  34.25            
 3  culex166f03.q1k         13962   36.17          share almost all sequence with culex161b01.q1k & 1996bp with culexbac1b5Ab03.q1k
    subtotal(3)             1496317
 
 4  culex49c07.p1k          9245    35.53          looks circular(misoriented mates at the ends); region 4979-6364 aligns to culexbac1b5Ab03.q1k 3 times
 5  culex53a02.p1k          5343    33.58          ~ 1Kbp alignments to culexbac1b5Ab03.q1k & culex161b01.q1k
 6  culex117e02.p2kA55      3501    33.10          contained (in 2 pieces) in culexbac1b5Ab03.q1k
 7  culex141a08.q1k         1920    33.44          few hundred bp alignments to other culex* seqs
    subtotal(7)             1516326
 
 8  culex180e07.p1k         6960    36.62          "CONTAINED" culexbac1b5Ab03.q1k (surrogate in WGA) 
 9  culex5c05.p1k           15587   21.81          low GC%; no alignments to NC_002978 & NC_006833; best hit is  Anopheles gambiae complete mitochondrial genome : 15363 bp (96% coverage, 86% max id)
 10 culex14h11.p1k          3350    51.73          repeat (higher GC%): good cvg of culex 18SrRNA gene ; no alignments to NC_002978 & NC_006833
 11 culex22h10.q1k          2148    54.89          repeat (higher GC%): some alignment to culex 18S rRNA ; no alignments to NC_002978 & NC_006833
 12 culex166d08.p1k         2071    55.53          repeat (higher GC%): culex 18S rRNA & 28S rRNA ; no alignments to NC_002978 & NC_006833
    total(12)               1546442
  • Sept 2008 reference
 file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq_080903.dbs
 
 1  contig000310   1136301 34.17
 2  contig000307   346054 34.25
 3  contig000311   15587  21.81
 4  contig000305   13962  36.17
 5  contig000312   9245   35.53
 6  contig000309   6967   36.63
 7  contig000306   5343   33.58
 8  contig000315   3501   33.10
 9  contig000308   3350   51.73
 10 contig000313   2148   54.89
 11 contig000314   2071   55.53
 12 contig000304   1994   33.85

NCBI

Culex quinquefasciatus

  • Taxonomy:
 * Culex pipiens complex 
   * Culex australicus   
   * Culex pipiens (house mosquito)    1(project)
         o Culex pipiens molestus   
         o Culex pipiens pallens   
         o Culex pipiens pipiens (northern house mosquito)    
   * Culex pipiens x Culex quinquefasciatus   
   * Culex quinquefasciatus (southern house mosquito)    1(project)
 SEQ_LIB_ID                                            SIZE    STDEV    CENTER_NAME    TYPE            COUNT   PERCENT
 
 1099499586718                                         9000    2700    TIGR_JCVIJTC    WGS             15349   0.21
 1099522705601                                         3500    1050    TIGR_JCVIJTC    WGS             16116   0.22
 1099641499000                                         33000   9900    TIGR_JCVIJTC    WGS             768     0.01
 
 G766BES1                                              120000  .       WIBR            WGS             100434  1.37   BE
 
 G771K1                                                5000    500     WIBR            CLONEEND        51540   0.7
 G772K1                                                5000    500     WIBR            CLONEEND        25314   0.34
 G809K1                                                2000    200     WIBR            CLONEEND        29949   0.41
 G810K1                                                2000    200     WIBR            CLONEEND        2295    0.03
 
 G818F1                                                40000   4000    WIBR            WGS             437994  5.96
 G818F2                                                40000   4000    WIBR            WGS             8505    0.12
 G818P1                                                4000    400     WIBR            WGS             580557  7.89
 G818P2                                                4000    400     WIBR            WGS             1091326 14.84
 G818P3                                                4000    400     WIBR            WGS             350523  4.77
 G818P4                                                4000    400     WIBR            WGS             1017105 13.83
 
 L31420P2                                              5000    .       WIBR            SHOTGUN         2259    0.03
 L31422P1                                              4000    .       WIBR            SHOTGUN         3766    0.05
 L31424P2                                              5000    .       WIBR            SHOTGUN         2226    0.03
 L31425P1                                              4000    .       WIBR            SHOTGUN         3817    0.05
 L31426P1                                              4000    .       WIBR            SHOTGUN         2274    0.03
 L31427P1                                              4000    .       WIBR            SHOTGUN         2273    0.03
 L31428P1                                              4000    .       WIBR            SHOTGUN         3045    0.04
 L31429P1                                              4000    .       WIBR            SHOTGUN         3034    0.04
 L31430P1                                              4000    .       WIBR            SHOTGUN         2261    0.03
 L31431P1                                              4000    .       WIBR            SHOTGUN         2918    0.04
 L31432P1                                              4000    .       WIBR            SHOTGUN         2947    0.04
 L31433P1                                              4000    .       WIBR            SHOTGUN         2251    0.03
 L31435P1                                              4000    .       WIBR            SHOTGUN         2987    0.04
 L31439P1                                              4000    .       WIBR            SHOTGUN         2292    0.03
 L31440P1                                              4000    400     WIBR            SHOTGUN         1478    0.02
 L31440P1                                              4000    .       WIBR            SHOTGUN         2281    0.03
 L31441P1                                              4000    .       WIBR            SHOTGUN         2297    0.03
 L31444P2                                              5000    .       WIBR            SHOTGUN         2241    0.03
 L31446P2                                              5000    .       WIBR            SHOTGUN         2278    0.03
 L31448P1                                              4000    .       WIBR            SHOTGUN         3052    0.04
 L31449P1                                              4000    .       WIBR            SHOTGUN         2234    0.03
 
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_01-G-CULEX-10KB    10000   2000    TIGR_JCVIJTC    WGS             1939130 26.36
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_02-G-CULEX-4KB     4000    800     TCAG_JCVIJTC    WGS             119990  1.63
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_02-G-CULEX-4KB     4000    800     TIGR_JCVIJTC    WGS             213407  2.9
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_03-F-CULEX-40KB    40000   8000    TCAG_JCVIJTC    WGS             2405    0.03
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_03-F-CULEX-40KB    40000   8000    TIGR_JCVIJTC    WGS             101370  1.38
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_04-F-CULEX-40KB    40000   8000    TCAG_JCVIJTC    WGS             16126   0.22
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_04-F-CULEX-40KB    40000   8000    TIGR_JCVIJTC    WGS             22134   0.3
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_05-F-CULEX-40KB    40000   8000    TIGR_JCVIJTC    WGS             51281   0.7
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_06-G-CULEX-10KB    11000   2200    TIGR_JCVIJTC    WGS             992283  13.49
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_07-G-CULEX-10KB    9000    1800    TIGR_JCVIJTC    WGS             106326  1.45
 
 .       .       .                                                     WIBR            OTHER           229     0
 .       .       .                                                     WIBR            PCR             228     0
 .       .       .                                                     WIBR            TRANSPOSON      8096    0.11
 
 Total                                                                                                 7354992 100

 CENTER_NAME     TRACE_TYPE_CODE         COUNT           PERCENT
 
 WIBR            WGS                     3586444         48.76
 TIGR_JCVIJTC    WGS                     3458164         47.02
 TCAG_JCVIJTC    WGS                     138521          1.88
 WIBR            CLONEEND                109098          1.48
 WIBR            SHOTGUN                 54211           0.74
 WIBR            TRANSPOSON              8096            0.11
 WIBR            OTHER                   229             0
 WIBR            PCR                     228             0
 
 Total                                   7354992         100
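
The per-center summary above can be reproduced from the per-library table by summing COUNT per (CENTER_NAME, TYPE) and dividing by the total. A minimal Python sketch follows; only a handful of rows are copied in, and the function name `summarize` is an illustrative choice, not a tool used on this page.

```python
from collections import defaultdict

# A few rows copied from the library table above:
# (SEQ_LIB_ID, CENTER_NAME, TYPE, COUNT)
ROWS = [
    ("G771K1", "WIBR", "CLONEEND", 51540),
    ("G772K1", "WIBR", "CLONEEND", 25314),
    ("G809K1", "WIBR", "CLONEEND", 29949),
    ("G810K1", "WIBR", "CLONEEND", 2295),
    ("G818F1", "WIBR", "WGS", 437994),
]
TOTAL_TRACES = 7354992  # "Total" row of the table


def summarize(rows, total):
    """Sum COUNT per (center, type) and express it as a percentage of total."""
    counts = defaultdict(int)
    for _lib, center, trace_type, count in rows:
        counts[(center, trace_type)] += count
    return {key: (n, round(100.0 * n / total, 2)) for key, n in counts.items()}


for (center, trace_type), (n, pct) in sorted(summarize(ROWS, TOTAL_TRACES).items()):
    print(center, trace_type, n, pct)
```

With the four CLONEEND libraries included, the WIBR CLONEEND row comes out as 109098 traces, matching the summary table.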


JCVI:

Articles

Other Strains (complete)

                                                                    RefSeq 	GenBank 	Pub 	Length (Mbp) 	GC 	Prot 	RNAs
 Wolbachia endosymbiont of Drosophila melanogaster(TIGR)            NC_002978 	AE017196 	1 	1.267782 	35.2% 	1195 	39
 Wolbachia endosymbiont strain TRS of Brugia malayi strain wBm(NEB) NC_006833 	AE017321 	1 	1.080084 	34.2% 	805 	37
 Wolbachia pipientis wPip(Sanger)                                   NC_010981 	AM999887 	1 	1.482455 	34.2% 	1275 	37  # 1386 CDSs (Sanger article 2008)

!!! Wolbachia pipientis wPip(Sanger) = culex161b01.q1k(346,054) + N(102) + culexbac1b5Ab03.q1k(1,136,301-2)

 $ cat NC_010981.gb | grep '\.\.' | egrep -v 'anticodon|source' | awk '{print $1}' | count.pl
 #       total
 gene    1423
 CDS     1275
 tRNA    34
 rRNA    3
 
 $ cat /fs/szasmg2/Culex_pipiens_symbiont/NCBI/NC_010981.gb | grep -c "\/pseudo"
 110
 
 1275+34+3+110=1422
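
The same feature tally can be done without the local count.pl helper. The sketch below is an approximation of the shell pipeline above, not a replacement for it: it assumes the standard GenBank flat-file layout (feature key indented by 5 spaces, qualifiers starting with "/"), skips `source`, and counts `/pseudo` qualifiers separately. The sample record lines are made up for illustration.

```python
import re
from collections import Counter

# Feature-key lines: 5 spaces, the key, then a location containing "..".
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+\S*\d+\.\.")


def count_features(lines):
    """Return (Counter of feature keys, number of /pseudo qualifiers)."""
    keys = Counter()
    pseudo = 0
    for line in lines:
        match = FEATURE_LINE.match(line)
        if match:
            if match.group(1) != "source":
                keys[match.group(1)] += 1
        elif line.strip().startswith("/pseudo"):
            pseudo += 1
    return keys, pseudo


# Hypothetical excerpt of a feature table (not taken from NC_010981.gb).
SAMPLE = [
    "     source          1..5000",
    "     gene            1..1383",
    "     CDS             1..1383",
    "     gene            complement(2000..2300)",
    "                     /pseudo",
    "     tRNA            3000..3075",
]

keys, pseudo = count_features(SAMPLE)
print(dict(keys), pseudo)
```

On the real file this should be called as `count_features(open(path))`. Note that the check above, 1275+34+3+110 = 1422, is one short of the 1423 gene features reported.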

Read Counts

 query_tracedb "query count SPECIES_CODE='CULEX PIPIENS QUINQUEFASCIATUS'"                                #  7552113  : all traces 
 query_tracedb "query count SPECIES_CODE='CULEX PIPIENS QUINQUEFASCIATUS' AND load_date >='09/01/2007'"   #  172799   : new traces (all cDNA)

Assembly

Locations:

 /fs/szasmg2/Culex_pipiens_symbiont/

2006_1226_WGA

initial assembly

Steps:

 1. All cpqg reads were downloaded from the NCBI Trace Archive (TA) in July 2006. The reads were grouped by library and their clear ranges computed. The download contained 6.6M reads, compared with 7.3M available now. Unfortunately, I noticed this difference only at the end of the experiment.
 2. The Wolbachia endosymbiont of Culex quinquefasciatus assembly was downloaded from the Sanger FTP site (ftp://ftp.sanger.ac.uk/pub/pathogens/Wolbachia/Wb_Cq.dbs); the file contains 95 sequences, most of them very short. The name, length and GC% of the 10 longest are listed below:
    name                        length(bp)   gc%
    culex173d08.p1k               1457497    34.17
    culexbac1d10Bg07.p1k            24726    35.11
    culex3d09.p1k                   15587    21.81
    culex166f03.q1k                 13962    36.17
    culex_1177_1189-1a02.w2k1177    13564    37.10
    culex26b07.p1k                   9245    35.53
    culex174d04.p1k                  8832    33.64
    J28015Ag08.q1ka                  7809    36.04
    culex180e07.p1k                  6960    36.59
    culex53a02.p1k                   5343    33.58
 3. The cpqg random reads (clear range only) were aligned to the symbiont sequences using nucmer (default parameters).
 4. The nucmer output was analyzed. Many of the short symbiont sequences (2-3 Kbp in length) had a higher than expected number of alignments. To avoid these repeats, I selected only the reads that aligned to the 10 longest symbiont sequences (see above).
 5. A threshold of 95% identity and a minimum alignment length of 400 bp was used to identify the symbiont reads. 29,110 unique reads (30,690 reads+mates) were selected. Below is a per-library breakdown (reads+mates):
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_01-G-CULEX-10KB      9581
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_06-G-CULEX-10KB      4549
    G818P4                                                  3784
    G818P2                                                  3478
    G818P1                                                  2238
    G818F1                                                  1283
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_02-G-CULEX-4KB       1156
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_03-F-CULEX-40KB      738
    G818P3                                                  723
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_07-G-CULEX-10KB      556
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_05-F-CULEX-40KB      327
    MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_04-F-CULEX-40KB      185
    1099522705601                                           99
    G809K1                                                  89 : cDNA , should be removed
    1099499586718                                           77
    G772K1                                                  12 : cDNA , should be removed
    G771K1                                                  10 : cDNA , should be removed
    G766BES1                                                4 :  BE library
    1099641499000                                           2
 6. The reads were assembled using the runCA-OBT.pl script (default parameters). Most of the reads were assembled into 3 large scaffolds. There is mate-pair evidence (outie mates) that the largest scaffold is circular.
 All the scaffolds end in surrogates (20-50 Kbp total surrogate length).
 Are there not enough BE to span the unique regions?  
 Cpqg.qc
 scaff_8 Longest scaff
 scaff_9 2nd longest scaff
 scaff_7 3rd longest scaff
 scaff_6 Small scaff that Looks circular
 7. The scaffolds/contigs were aligned to the 10 longest Wolbachia endosymbiont sequences. Most of the long alignments were at over 99% identity. However, several large rearrangements were observed.
 Wb_Cq-vs-scaff Reference vs scaff
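
The read selection in steps 4 and 5 can be sketched as follows. This is a minimal Python illustration, assuming the nucmer alignments have already been reduced to (read id, % identity, alignment length) tuples and that a read-to-mate mapping is available; the function names and the sample ids are hypothetical, not the scripts actually used.

```python
def select_symbiont_reads(alignments, min_identity=95.0, min_length=400):
    """Keep ids of reads with at least one alignment passing both thresholds.

    alignments: iterable of (read_id, percent_identity, alignment_length_bp).
    """
    return {
        read_id
        for read_id, identity, length in alignments
        if identity >= min_identity and length >= min_length
    }


def add_mates(selected, mates):
    """Return the selected reads plus their mates (mates: read_id -> mate_id)."""
    with_mates = set(selected)
    for read_id in selected:
        mate = mates.get(read_id)
        if mate is not None:
            with_mates.add(mate)
    return with_mates


# Hypothetical example data.
alignments = [
    ("r1", 99.1, 812),  # passes both thresholds
    ("r2", 96.0, 350),  # too short
    ("r3", 88.5, 900),  # identity too low
    ("r1", 97.0, 450),  # second hit for r1, still counted once
]
mates = {"r1": "r1_mate", "r2": "r2_mate"}

selected = select_symbiont_reads(alignments)
print(sorted(selected), sorted(add_mates(selected, mates)))
```

Counting unique ids separately from ids plus mates corresponds to the two numbers reported in step 5 (unique reads vs. reads+mates).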

2007_0802_WGA-default

new assembly

Steps:

 1. All Culex reads were downloaded from the TA: ~1M new reads since 2006_1226.
 2. The reads were aligned to the new reference (excluding the mitochondrial and repeat sequences) using nucmer (default parameters).
 3. A threshold of 95% identity and a minimum alignment length of 400 bp was used to identify the symbiont reads. 3850 new reads & mates were identified in addition to the previous ones.
 4. 33,783 reads have been assembled using the runCA-OBT.pl script (default parameters).  
 Cpqg.qc
 Compared to the initial assembly, many metrics went down (TotalBasesInScaffolds,MaxBasesInScaffolds,MaxContigLength ...)
 TotalSurrogates & SurrogateInstances more than doubled

2007_0802_WGA-0.5E

error rate =0.5 % => more fragmented assembly

2007_0802_WGA-0.5M

genome size=1.5M => more TotalBasesInScaffolds but more unhappy mates

What to do next?

  • use CA 5.1 (latest version)
  • remove 958 cDNAs aligned to culex*
  • increase utg error rate from 1.5% to 2% (3% gave worse results than 2%)
  • recruit reads that align to contig ends: some ends are repetitive => too many; others have no alignments
  • use 2 other complete strains; only 2 new aligned reads were identified
  • AMOScmp new reference => more unhappy mates than before
  • dropping the min Astat from 1 to -1 made some degens into placed ctgs; did not improve overall stats
  • separate JCVI & WIBR reads, assemble separately => 5 obvious alignment breaks
  • use only the reads from lib with insert size <=11Kbp => more fragmented
  • use only the reads that aligned to the top2 Sanger ctgs (36606 instead of 36767)


Reads aligned to the 7 Sanger sequences:

 CENTER          STRATEGY        COUNT   PERCENTAGE
 TIGR_JCVIJTC    WGS             18268   48.97
 WIBR            WGS             17316   46.42          # 155 BE align but mostly at 80-90% id, only 4 at >=95% id, >=400bp
 WIBR            CLONEEND(CDNA)  884     2.37           # about avg 1.48
 TCAG_JCVIJTC    WGS             815     2.18           
 WIBR            SHOTGUN         20      0.05
 total                           37303   100            
 
 total+mates                     39027 (37724 in .frg file)
 wgs+mates                       38069 (36767 in .frg file)  # 302 BE

2008_0829_WGA-wgs

e=1.5

 [Top5Scaffolds=contigs,size,span,avgContig,avgGap]
 0=3,1363559,1364974,454520,708
 1=1,53307,53307,53307,0
 2=1,28821,28821,28821,0
 3=2,23208,23528,11604,320
 4=1,8315,8315,8315,0
 total=8,1477210,1478945,184651,578
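
The last two columns of the Top5Scaffolds lines look derivable from the first three: avgContig = size / contigs and avgGap = (span - size) / number of gaps, where the number of gaps is contigs minus scaffolds. This is inferred from the numbers above, not taken from the QC tool's documentation. A small Python check under that assumption (the function name is illustrative, and the tool's rounding rule is unknown):

```python
def scaffold_stats(contigs, size, span, scaffolds=1):
    """Recompute (avgContig, avgGap) from contig count, total size and span.

    Assumes one gap between each pair of adjacent contigs in a scaffold,
    i.e. contigs - scaffolds gaps in total. Rounding uses Python's round().
    """
    avg_contig = round(size / contigs)
    gaps = contigs - scaffolds
    avg_gap = round((span - size) / gaps) if gaps > 0 else 0
    return avg_contig, avg_gap


print(scaffold_stats(3, 1363559, 1364974))               # scaffold 0
print(scaffold_stats(2, 23208, 23528))                   # scaffold 3
print(scaffold_stats(8, 1477210, 1478945, scaffolds=5))  # total line
```

The three calls reproduce the avgContig and avgGap values of scaffold 0, scaffold 3 and the total line above.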

2008_0829_WGA-wgs e=2.0

Reads used in the assembly:

  • all WGS reads aligned by nucmer (default params: min 65bp, 80%id) to the top 7 Sanger contigs(good) + mates
=> 36767 reads,  17875*2 mated reads
 all:                                                                16 scf, 21 ctg , 92 deg
 ones with gc% in the 32..36 range or have Wp genes aligned to them: 11 scf, 16 ctg , 41 deg
 [Top5Scaffolds=contigs,size,span,avgContig,avgGap]
 0=4,1388064,1389477,347016,471
 1=3,70356,70440,23452,42
 2=1,42565,42565,42565,0
 3=1,8315,8315,8315,0
 4=1,2425,2425,2425,0
 total=10,1511725,1513222,151172,299
 Media:NC_010981-scf.filter-q.png
 Media:NC_010981-ctg-deg.filter-q.png
 Media:Cpqg.all.infoseq
 top 2 scaffold size=1458420
 top 3 scaffold size=1500985
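
The "gc% in the 32..36 range or have Wp genes aligned" filter above can be sketched in Python as below. Assumptions: GC% is computed over A/C/G/T only (Ns ignored), the range is inclusive, and the gene-alignment evidence is passed in as a boolean; the function names are illustrative.

```python
def gc_percent(seq):
    """GC content in percent, computed over unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def keep_sequence(seq, has_gene_hit=False, low=32.0, high=36.0):
    """Keep a scaffold/contig/degenerate if its GC% is in range or a gene aligns."""
    return has_gene_hit or low <= gc_percent(seq) <= high


# Toy sequences, for illustration only.
wolbachia_like = "G" * 17 + "C" * 17 + "A" * 33 + "T" * 33  # 34% GC
rrna_like = "G" * 27 + "C" * 27 + "A" * 23 + "T" * 23        # 54% GC

print(keep_sequence(wolbachia_like), keep_sequence(rrna_like))
```

The higher-GC rRNA repeats listed in the July 2007 reference (51-56% GC) would be dropped by such a filter unless a Wolbachia gene aligned to them.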

100+ bp 0cvg regions in the reference:

 1. culexbac1b5Ab03.q1k     429746  429972  226     0
 2. culexbac1b5Ab03.q1k     907129  908079  950     0
 
 1. NC_010981.1             775928  776047  120     0
 2. NC_010981.1             1253284 1254139 856     0
 1.1. NC_010981.1     RefSeq  gene    775763  777826  .       +       .     contains  GeneID:6385213 # WP0709 Putative outer membrane protein
 1.2. NC_010981.1     RefSeq  gene    1252115 1253287 .       +       .     begin     GeneID:6385392 # tuf translation elongation factor tu
 1.3. NC_010981.1     RefSeq  gene    1253302 1253622 .       +       .     contained GeneID:6385310 # rpsJ 30s ribosomal protein s10 ??? missing;                                                                                                     
                                                                                                     # very conserved in Wolbachia endosymbiont of Drosophila melanogaster
 2.1. NC_010981.1     RefSeq  gene    1253632 1254354 .       +       .     end       GeneID:6385679 # rplC ribosomal protein L3
                                                                                                     # very conserved in several species : Wolbachia, Ehrlichia ...
 No promer alignments of sequences to these regions
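
Zero-coverage regions like the ones listed above can be found by sweeping over the read alignment intervals on the reference. A minimal Python sketch, assuming 1-based inclusive (start, end) coordinates and a 100 bp minimum; the function name and the toy intervals are hypothetical. Some rows above report end - start as the length and others end - start + 1; the sketch uses the inclusive length.

```python
def zero_coverage_regions(ref_length, intervals, min_length=100):
    """Return (start, end, length) of reference stretches covered by no read.

    intervals: iterable of 1-based inclusive (start, end) alignment coordinates.
    """
    regions = []
    next_uncovered = 1  # first reference position not yet covered
    for start, end in sorted(intervals):
        if start > next_uncovered:
            length = start - next_uncovered
            if length >= min_length:
                regions.append((next_uncovered, start - 1, length))
        next_uncovered = max(next_uncovered, end + 1)
    if ref_length >= next_uncovered:
        length = ref_length - next_uncovered + 1
        if length >= min_length:
            regions.append((next_uncovered, ref_length, length))
    return regions


# Toy example: a 1000 bp reference with three read alignments.
print(zero_coverage_regions(1000, [(1, 300), (250, 400), (521, 900)]))
```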

100+ bp 0cvg regions in the assembly:

    ctg_start_stop               len    gc%    comments
 1. ctg7180000001230_202_1361    1160   37.50  # first 600 have no alignments; last 400bp are cloning vector
 2. ctg7180000001305_11303_11867 565    33.45  # aligns at 100%len, 100%id to Wolbachia endosymbiont of Drosophila melanogaster, complete genome; NC_002978.6:243504..243803
 3. ctg7180000001305_13367_14006 640    33.91  # good blastx alignment to Wolbachia gene on 100% length; NC_002978.6:488974..489912
 4. deg7180000001252_328_851     524    35.88  # blastx align to We of Bm NC_006833.1:754520..755170

Identify more reads that align to those regions (blastn TA):

 236 : all
 198 : new
 395 : new+mates

Adding these reads did not improve the assembly.


 gi|42519920|ref|NC_002978.6|    243504  243803 # ctg7180000001305_11303_11867 565    33.45  : there are reads aligned to 241173-243822
 gi|42519920|ref|NC_002978.6|    488974  489912 # ctg7180000001305_13367_14006 640    33.91  : there are reads aligned to 487182-492517
 gi|58584261|ref|NC_006833.1|    754520  755170 # deg7180000001252_328_851     524    35.88  : there are reads aligned to 754300-755148 

All the reads aligned to the 3 regions above have been assembled; the 3 regions seem to contain rearrangements