Culex pipiens symbiont: Difference between revisions
(16 intermediate revisions by the same user not shown) | |||
Line 563: | Line 563: | ||
* No alignments of ctg/deg to RepeatMaskerLib | * No alignments of ctg/deg to RepeatMaskerLib | ||
* Tandem repeats | |||
** Minisatellites copy number variation can be used to genotype bacteria strains | |||
$ show-coords NC_010981.trf-wPip.trf.filter-1.delta | grep -f NC_010981.trf-wPip.trf.filter-1.qry_diff | |||
1 100 | 126 27 | 100 100 | 100.00 | 100 208 | 100.00 48.08 | 34.18.100 98.54.208 [CONTAINED] | |||
1 122 | 122 1 | 122 122 | 100.00 | 122 208 | 100.00 58.65 | 35.54.122 98.54.208 [CONTAINED] | |||
1 273 | 1 273 | 273 273 | 100.00 | 280 355 | 97.50 76.90 | 68.75.280 75.75.355 [CONTAINED] | |||
$ infoseq wPip.trf.fasta | grep -f NC_010981.trf-wPip.trf.filter-1.qry_diff | |||
60.65.213 213 38.97 | |||
75.75.355 355 34.08 | |||
98.54.208 208 45.19 | |||
99.18.146 146 43.84 | |||
* RepeatScout pipeline summary | * RepeatScout pipeline summary | ||
10 ctg+11 degen | |||
#elem min max mean median n50 sum | #elem min max mean median n50 sum | ||
families 90 67 7071 678 358 989 61024 | families 90 67 7071 678 358 989 61024 | ||
Line 676: | Line 691: | ||
/fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.ctg-deg.filter.infoseq | /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.ctg-deg.filter.infoseq | ||
=== Annotation === | === Annotation (original) === | ||
Format annotation for NCBI submission: | Format annotation for NCBI submission: | ||
Line 696: | Line 711: | ||
No CRISPRs found by CRISPRFinder | No CRISPRs found by CRISPRFinder | ||
=== Annotation (revised) === | |||
* Genes manually curated by Dan; many transposases deleted | |||
wc -l cpqg.ctg.CDS cpqg.ctg.tRNA cpqg.ctg.rRNA cpqg.deg.CDS | |||
1342 cpqg.ctg.CDS | |||
36 cpqg.deg.CDS | |||
34 cpqg.ctg.tRNA | |||
4 cpqg.ctg.rRNA | |||
1416 total | |||
= NCBI submission = | = NCBI submission = | ||
Line 710: | Line 735: | ||
* Submission dir: | * Submission dir: | ||
/fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/submission/ | /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/submission2/ | ||
* Submission via GenomesMacroSend; | |||
** Direct Submit ID: DSub8465 (1st submission) | |||
** Direct Submit ID: DSub8474,DSub8475 (revisions to the 1st submission) | |||
* TaxId: 569881 | |||
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome&cmd=search&term=ABZA00000000 ABZA00000000] Genome Project | |||
* [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=212995898 ABZA00000000] Project accession number | |||
** ctg: ABZA01000001..ABZA01000021 | |||
** scaff: DS996929-DS996944 | |||
* NCBI files: | |||
/fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA01_accs : 21 ctg & deg accession numbers ABZA01000001..ABZA01000021 | |||
/fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA.01.modified.p2g 1378 gene accession numbers EEB55160.. EEB56537 | |||
/fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA01_scfld_DS_accs 16 scaffold id's | |||
* Future updates | |||
** Protein id formats to use(?): | |||
gnl|umiacs|C1A_1|gb|EEB55198 | |||
gnl|WGS:ABZA|C1A_1|gb|EEB55198 | |||
* [http://www.ncbi.nlm.nih.gov/Traces/assembly/assmbrowser.cgi?cmd=browse&ai=3900&m=main&s=browse AA] AI 3900 | |||
= Article = | |||
* [[wPip_article|article submitted]] |
Latest revision as of 12:36, 11 December 2008
Data Sources
Sanger
Wolbachia pipientis endosymbiont of Culex quinquefasciatus
- Sanger Wolbachia Genome Project
- Sanger Wolbachia FTP 24,532 Sanger traces
- December 2006 reference (95 sequences):
file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq_061226.dbs
Top 10 seqs
 Name                          Length   %GC
 culex173d08.p1k               1457497  34.17
 culexbac1d10Bg07.p1k          24726    35.11
 culex3d09.p1k                 15587    21.81
 culex166f03.q1k               13962    36.17
 culex_1177_1189-1a02.w2k1177  13564    37.10
 culex26b07.p1k                9245     35.53
 culex174d04.p1k               8832     33.64
 J28015Ag08.q1ka               7809     36.04
 culex180e07.p1k               6960     36.59
 culex53a02.p1k                5343     33.58
 ...
- July 2007 reference (12 sequences; 7 "good"; 4 "unique"):
file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq.dbs
All seqs:
     Name                  Length   %GC    Comments
 1   culexbac1b5Ab03.q1k   1136301  34.17
 2   culex161b01.q1k       346054   34.25
 3   #culex166f03.q1k      13962    36.17  shares almost all sequence with culex161b01.q1k & 1996bp with culexbac1b5Ab03.q1k
     subtotal(3)           1496317
 4   culex49c07.p1k        9245     35.53  misoriented mates at the ends; region 4979-6364 (1.3Kbp) aligns to culexbac1b5Ab03.q1k 3 times
 5   culex53a02.p1k        5343     33.58  ~ 1Kbp alignments to culexbac1b5Ab03.q1k & culex161b01.q1k
 6   #culex117e02.p2kA55   3501     33.10  contained (in 2 pieces) in culexbac1b5Ab03.q1k
 7   #culex141a08.q1k      1920     33.44  contained in other seqs
     subtotal(7)           1516326
 8   culex180e07.p1k       6960     36.62  "CONTAINED" culexbac1b5Ab03.q1k (surrogate in WGA)
 9   culex5c05.p1k         15587    21.81  low GC%; no alignments to NC_002978 & NC_006833; best hit is Anopheles gambiae complete mitochondrial genome : 15363 bp (96% coverage, 86% max id)
 10  culex14h11.p1k        3350     51.73  repeat (higher GC%): good cvg of culex 18S rRNA gene ; no alignments to NC_002978 & NC_006833
 11  culex22h10.q1k        2148     54.89  repeat (higher GC%): some alignment to culex 18S rRNA ; no alignments to NC_002978 & NC_006833
 12  culex166d08.p1k       2071     55.53  repeat (higher GC%): culex 18S rRNA & 28S rRNA ; no alignments to NC_002978 & NC_006833
     total(12)             1546442
- Sept 2008 reference
file name: /fs/szasmg2/Culex_pipiens_symbiont/Sanger/Wb_Cq_080903.dbs
 1   contig000310  1136301  34.17
 2   contig000307  346054   34.25
 3   contig000311  15587    21.81
 4   contig000305  13962    36.17
 5   contig000312  9245     35.53
 6   contig000309  6967     36.63
 7   contig000306  5343     33.58
 8   contig000315  3501     33.10
 9   contig000308  3350     51.73
 10  contig000313  2148     54.89
 11  contig000314  2071     55.53
 12  contig000304  1994     33.85
NCBI
Culex quinquefasciatus
- Taxonomy:
* Culex pipiens complex
  * Culex australicus
  * Culex pipiens (house mosquito) 1(project)
    o Culex pipiens molestus
    o Culex pipiens pallens
    o Culex pipiens pipiens (northern house mosquito)
  * Culex pipiens x Culex quinquefasciatus
  * Culex quinquefasciatus (southern house mosquito) 1(project)
* Wolbachia Lineage: root; cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Wolbachieae; Wolbachia
* Wolbachia phage WO
  http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=112596
  http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=6723230
  up to 2K alignments at ~90%id of our genome to this virus
- Culex quinquefasciatus Genome Project
- Taxonomy ID: 7176
- Culex quinquefasciatus TA : 7,379,314 traces (Sept 2007)
 SEQ_LIB_ID  SIZE  STDEV  CENTER_NAME  TYPE  COUNT  PERCENT
 1099499586718  9000  2700  TIGR_JCVIJTC  WGS  15349  0.21
 1099522705601  3500  1050  TIGR_JCVIJTC  WGS  16116  0.22
 1099641499000  33000  9900  TIGR_JCVIJTC  WGS  768  0.01
 G766BES1  120000  .  WIBR  WGS  100434  1.37  BE
 G771K1  5000  500  WIBR  CLONEEND  51540  0.7
 G772K1  5000  500  WIBR  CLONEEND  25314  0.34
 G809K1  2000  200  WIBR  CLONEEND  29949  0.41
 G810K1  2000  200  WIBR  CLONEEND  2295  0.03
 G818F1  40000  4000  WIBR  WGS  437994  5.96
 G818F2  40000  4000  WIBR  WGS  8505  0.12
 G818P1  4000  400  WIBR  WGS  580557  7.89
 G818P2  4000  400  WIBR  WGS  1091326  14.84
 G818P3  4000  400  WIBR  WGS  350523  4.77
 G818P4  4000  400  WIBR  WGS  1017105  13.83
 L31420P2  5000  .  WIBR  SHOTGUN  2259  0.03
 L31422P1  4000  .  WIBR  SHOTGUN  3766  0.05
 L31424P2  5000  .  WIBR  SHOTGUN  2226  0.03
 L31425P1  4000  .  WIBR  SHOTGUN  3817  0.05
 L31426P1  4000  .  WIBR  SHOTGUN  2274  0.03
 L31427P1  4000  .  WIBR  SHOTGUN  2273  0.03
 L31428P1  4000  .  WIBR  SHOTGUN  3045  0.04
 L31429P1  4000  .  WIBR  SHOTGUN  3034  0.04
 L31430P1  4000  .  WIBR  SHOTGUN  2261  0.03
 L31431P1  4000  .  WIBR  SHOTGUN  2918  0.04
 L31432P1  4000  .  WIBR  SHOTGUN  2947  0.04
 L31433P1  4000  .  WIBR  SHOTGUN  2251  0.03
 L31435P1  4000  .  WIBR  SHOTGUN  2987  0.04
 L31439P1  4000  .  WIBR  SHOTGUN  2292  0.03
 L31440P1  4000  400  WIBR  SHOTGUN  1478  0.02
 L31440P1  4000  .  WIBR  SHOTGUN  2281  0.03
 L31441P1  4000  .  WIBR  SHOTGUN  2297  0.03
 L31444P2  5000  .  WIBR  SHOTGUN  2241  0.03
 L31446P2  5000  .  WIBR  SHOTGUN  2278  0.03
 L31448P1  4000  .  WIBR  SHOTGUN  3052  0.04
 L31449P1  4000  .  WIBR  SHOTGUN  2234  0.03
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_01-G-CULEX-10KB  10000  2000  TIGR_JCVIJTC  WGS  1939130  26.36
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_02-G-CULEX-4KB  4000  800  TCAG_JCVIJTC  WGS  119990  1.63
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_02-G-CULEX-4KB  4000  800  TIGR_JCVIJTC  WGS  213407  2.9
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_03-F-CULEX-40KB  40000  8000  TCAG_JCVIJTC  WGS  2405  0.03
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_03-F-CULEX-40KB  40000  8000  TIGR_JCVIJTC  WGS  101370  1.38
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_04-F-CULEX-40KB  40000  8000  TCAG_JCVIJTC  WGS  16126  0.22
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_04-F-CULEX-40KB  40000  8000  TIGR_JCVIJTC  WGS  22134  0.3
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_05-F-CULEX-40KB  40000  8000  TIGR_JCVIJTC  WGS  51281  0.7
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_06-G-CULEX-10KB  11000  2200  TIGR_JCVIJTC  WGS  992283  13.49
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_07-G-CULEX-10KB  9000  1800  TIGR_JCVIJTC  WGS  106326  1.45
 .  .  .  WIBR  OTHER  229  0
 .  .  .  WIBR  PCR  228  0
 .  .  .  WIBR  TRANSPOSON  8096  0.11
 Total  7354992  100
 CENTER_NAME   TRACE_TYPE_CODE  COUNT    PERCENT
 WIBR          WGS              3586444  48.76
 TIGR_JCVIJTC  WGS              3458164  47.02
 TCAG_JCVIJTC  WGS              138521   1.88
 WIBR          CLONEEND         109098   1.48
 WIBR          SHOTGUN          54211    0.74
 WIBR          TRANSPOSON       8096     0.11
 WIBR          OTHER            229      0
 WIBR          PCR              228      0
 Total                          7354992  100
Broad:
JCVI:
Articles
- Salzberg_GB_2005.pdf
- Sanger_2008
- WOLBACHIA PIPIENTIS: Microbial Manipulator of Arthropod Reproduction(1999)
- Obligate intracellular parasite Wikipedia
- Bacteriophage WO and Virus-like Particles in Wolbachia, an Endosymbiont of Arthropods
Other Strains (complete)
                                                                       RefSeq     GenBank   Pub  Length (Mbp)  GC     Prot  RNAs
 Wolbachia endosymbiont of Drosophila melanogaster(TIGR)               NC_002978  AE017196  1    1.267782      35.2%  1195  39
 Wolbachia endosymbiont strain TRS of Brugia malayi strain wMel(NEB)   NC_006833  AE017321  1    1.080084      34.2%  805   37
 Wolbachia pipientis wPip(Sanger)                                      NC_010981  AM999887  1    1.482455      34.2%  1275  37    # 1386 CDSs (Sanger article 2008)
 # several others @ JCVI, Sanger ...
!!! Wolbachia pipientis wPip(Sanger) = culex161b01.q1k(346,054) + N(102) + culexbac1b5Ab03.q1k(1,136,301-2)
 $ cat NC_010981.gb | grep '\.\.' | egrep -v 'anticodon|source' | awk '{print $1}' | count.pl
 # total
 gene 1423
 CDS  1275
 tRNA 34
 rRNA 3
 $ cat /fs/szasmg2/Culex_pipiens_symbiont/NCBI/NC_010981.gb | grep -c "\/pseudo"
 110
 1275+34+3+110=1422
Read Counts
 query_tracedb "query count SPECIES_CODE='CULEX PIPIENS QUINQUEFASCIATUS'"
 # 7552113 : all traces
 query_tracedb "query count SPECIES_CODE='CULEX PIPIENS QUINQUEFASCIATUS' AND load_date >='09/01/2007'"
 # 172799 : new traces (all cDNA)
Assembly
Locations:
/fs/szasmg2/Culex_pipiens_symbiont/
2006_1226_WGA
initial assembly
Steps:
1. All cpqg reads have been downloaded from the TA (July 2006). The reads have been grouped by libraries and the clear range has been computed. There were 6.6M reads in the download compared with 7.3M now. Unfortunately I've only noticed this difference at the end of my experiment.
2. The Wolbachia endosymbiont of Culex quinquefasciatus assembly has been downloaded from the Sanger ftp site ( ftp://ftp.sanger.ac.uk/pub/pathogens/Wolbachia/Wb_Cq.dbs ); there are 95 sequences in this file, most of them very short. The name, length and GC% of the 10 longest are listed below:
 name                          length(bp)  gc%
 culex173d08.p1k               1457497     34.17
 culexbac1d10Bg07.p1k          24726       35.11
 culex3d09.p1k                 15587       21.81
 culex166f03.q1k               13962       36.17
 culex_1177_1189-1a02.w2k1177  13564       37.10
 culex26b07.p1k                9245        35.53
 culex174d04.p1k               8832        33.64
 J28015Ag08.q1ka               7809        36.04
 culex180e07.p1k               6960        36.59
 culex53a02.p1k                5343        33.58
3. The cpqg random reads (clr only) have been aligned to symbiont sequences using nucmer (default parameters)
4. The nucmer output has been analyzed. Many of the short symbiont sequences (2-3KB in length) have a higher than expected number of alignments. To avoid the repeats I've selected only the reads that aligned to the longest 10 symbiont sequences (see above).
5. A threshold of 95% identity and a minimum alignment length of 400 bp has been used to determine the symbiont reads. There were 29,110 unique reads (30,690 reads+mates) selected. Below is a per library breakdown (reads+mates):
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_01-G-CULEX-10KB  9581
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_06-G-CULEX-10KB  4549
 G818P4  3784
 G818P2  3478
 G818P1  2238
 G818F1  1283
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_02-G-CULEX-4KB  1156
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_03-F-CULEX-40KB  738
 G818P3  723
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_07-G-CULEX-10KB  556
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_05-F-CULEX-40KB  327
 MSC-CULEX-PIPIENS-QUINQUEFASCIATUS_04-F-CULEX-40KB  185
 1099522705601  99
 G809K1  89  : cDNA , should be removed
 1099499586718  77
 G772K1  12  : cDNA , should be removed
 G771K1  10  : cDNA , should be removed
 G766BES1  4  : BE library
 1099641499000  2
6. The reads have been assembled using the runCA-OBT.pl script (default parameters). Most of the reads got assembled into 3 large scaffolds. There is mate pair evidence (outie mates) that the largest scaffold is circular.
All the scaffolds end up in surrogates (20-50KB total surrogate length). Are there not enough BE to span the unique regions?
Cpqg.qc
 scaff_8  Longest scaff
 scaff_9  2nd longest scaff
 scaff_7  3rd longest scaff
 scaff_6  Small scaff that looks circular
7. The scaffolds/contigs have been aligned to longest 10 Wolbachia endosymbiont sequences. Most of the long alignments were at over 99% identity. However, several large rearrangements have been noticed.
Wb_Cq-vs-scaff Reference vs scaff
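The read-selection rule from steps 3-5 (keep a read when it has an alignment of at least 400 bp at 95% identity or better, then add its mate) can be sketched in Python. This is only an illustration, not the script that was actually run: the record layout, the `mates` dictionary and the read names are assumptions made for the example.

```python
from typing import Dict, Iterable, Set, Tuple

# One alignment record: (read_id, alignment_length_bp, percent_identity).
# This layout is hypothetical; real nucmer/show-coords output has more columns.
Alignment = Tuple[str, int, float]


def select_reads(alignments: Iterable[Alignment],
                 min_len: int = 400,
                 min_idy: float = 95.0) -> Set[str]:
    """Return ids of reads with at least one alignment passing both thresholds."""
    return {read for read, length, idy in alignments
            if length >= min_len and idy >= min_idy}


def add_mates(reads: Set[str], mates: Dict[str, str]) -> Set[str]:
    """Return the selected reads plus the mate of each one, when a mate is known."""
    return reads | {mates[r] for r in reads if r in mates}


# Made-up example data.
example_alignments = [
    ("readA", 650, 99.1),  # passes both thresholds
    ("readB", 380, 99.5),  # alignment too short
    ("readC", 700, 91.0),  # identity too low
    ("readD", 400, 95.0),  # passes (thresholds are inclusive)
]
example_mates = {"readA": "readA_mate", "readA_mate": "readA",
                 "readD": "readB", "readB": "readD"}

selected = select_reads(example_alignments)
with_mates = add_mates(selected, example_mates)
print(sorted(selected))
print(sorted(with_mates))
```

Note that a mate is pulled in even when it failed the thresholds itself (readB here), which is why the reads+mates count is larger than the unique-read count.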
2007_0802_WGA-default
new assembly
Steps:
1. All Culex reads have been downloaded from TA . ~1M new reads since 2006_1226
2. The reads have been aligned to the new reference (exclude mito,repeats) using nucmer (default parameters)
3. A threshold of 95% identity and a minimum alignment length of 400 bp has been used to determine the symbiont reads. 3850 new reads & mates were identified in addition to the previous ones
4. 33,783 reads have been assembled using the runCA-OBT.pl script (default parameters). Cpqg.qc
Compared to the initial assembly, many metrics went down (TotalBasesInScaffolds,MaxBasesInScaffolds,MaxContigLength ...) TotalSurrogates & SurrogateInstances more than doubled
2007_0802_WGA-0.5E
error rate =0.5 % => more fragmented assembly
2007_0802_WGA-0.5M
genome size=1.5M => more TotalBasesInScaffolds but more unhappy mates
What to do next?
- use CA 5.1 (latest version)
- remove 958 cDNA's aligned to culex*
- increase utg error rate from 1.5% to 2% (3% gave worse results than 2%)
- recruit reads that align to contig ends: some ends are repetitive => too many reads; others have no alignments
- use 2 other complete strains; only 2 new aligned reads were identified
- AMOScmp with the new reference => more unhappy mates than before
- dropping the min Astat from 1 to -1 turned some degens into placed ctgs; did not improve overall stats
- separate JCVI & WIBR reads, assemble separately => 5 obvious alignment breaks
- use only the reads from lib with insert size <=11Kbp => more fragmented
- use only the reads that aligned to the top2 Sanger ctgs (36606 instead of 36767)
Reads aligned to the 7 Sanger sequences:
 CENTER        STRATEGY        COUNT  PERCENTAGE
 TIGR_JCVIJTC  WGS             18268  48.97
 WIBR          WGS             17316  46.42   # 155 BE align but mostly at 80-90% id, only 4 at >=95% id, >=400bp
 WIBR          CLONEEND(CDNA)  884    2.37    # about avg 1.48
 TCAG_JCVIJTC  WGS             815    2.18
 WIBR          SHOTGUN         20     0.05
 total                         37303  100
 total+mates                   39027  (37724 in .frg file)
 wgs+mates                     38069  (36767 in .frg file)   # 302 BE
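As a quick consistency check, the percentage column of the table above can be recomputed from the counts. The sketch below is illustrative only; the dictionary layout is an assumption, the numbers are copied from the table.

```python
# Read counts per (center, strategy), copied from the table above.
counts = {
    ("TIGR_JCVIJTC", "WGS"): 18268,
    ("WIBR", "WGS"): 17316,
    ("WIBR", "CLONEEND(CDNA)"): 884,
    ("TCAG_JCVIJTC", "WGS"): 815,
    ("WIBR", "SHOTGUN"): 20,
}

total = sum(counts.values())

# Share of each category in percent, rounded to two decimals as in the table.
percentages = {key: round(100.0 * n / total, 2) for key, n in counts.items()}

for (center, strategy), pct in percentages.items():
    print(f"{center}\t{strategy}\t{counts[(center, strategy)]}\t{pct:.2f}")
print(f"total\t{total}")
```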
2008_0829_WGA-wgs
e=1.5
 [Top5Scaffolds=contigs,size,span,avgContig,avgGap]
 0=3,1363559,1364974,454520,708
 1=1,53307,53307,53307,0
 2=1,28821,28821,28821,0
 3=2,23208,23528,11604,320
 4=1,8315,8315,8315,0
 total=8,1477210,1478945,184651,578
2008_0829_WGA-wgs e=2.0 -> best
Assembly description
- The Wolbachia pipientis endosymbiont of Culex quinquefasciatus assembly was downloaded from the Sanger web site (July 2007 version: 12 contigs)
- 5 of the 12 contigs were discarded due to their high GC% or repetitive content; 7 contigs were kept to be used for sequence alignments
- The NCBI TA Culex quinquefasciatus traces were downloaded locally (Sept 2007: 7,379,314 total Sanger traces; 7,183,129 WGS)
- The WGS traces were aligned to the 7 reference contigs using nucmer (default parameters: minimum 65bp length, 80% identity)
- The traces that aligned, together with their mates, were extracted and formatted as input for Celera Assembler (36,767 total traces; 35,750 mated, 1,017 unmated)
- The traces were assembled with CA (wgs-5.1) (default parameters except for unitiggerErrorRate=2%)
- The assembler generated 16 scaffolds, 21 contigs and 92 degenerates;
- 5 scf, 10 ctg & 11 deg were filtered based on their "uniqueness"; 2 of the scaffolds contain multiple contigs
- There are 2 unique regions in reference not present in this genome (NC_010981.1: 775928-776047 120bp; 1253284-1254139 856bp)
- There are 4 unique regions (~ 500bp each ) in this genome not present in the reference sequence/assembly: ctg7180000001230_202_725, ctg7180000001305_11303_11867, ctg7180000001305_13367_14006, deg7180000001252_328_851
- 10 large scale rearrangements
Comments
- wgs-5.2-beta generated the same results
- modifying astatLowBound, astatHighBound did not result in better assembly
Cpqg.qc
 all: 16 scf, 21 ctg, 92 deg
 ones with gc% in the 32..36 range or with Wp genes aligned to them: 11 scf, 16 ctg, 41 deg
 filtered (submission): 5 scf, 10 ctg, 11 deg
 [Top5Scaffolds=contigs,size,span,avgContig,avgGap]
 0=4,1388064,1389477,347016,471
 1=3,70356,70440,23452,42
 2=1,42565,42565,42565,0
 3=1,8315,8315,8315,0
 4=1,2425,2425,2425,0
 total=10,1511725,1513222,151172,299
 top 2 scaffold size=1458420
 top 3 scaffold size=1500985
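The "top 2" and "top 3" figures are running totals of the scaffold sizes in the Top5Scaffolds lines. A minimal sketch that reproduces them (the variable names are made up for the example; the sizes are copied from above):

```python
from itertools import accumulate

# Sizes of the five largest scaffolds, from the Top5Scaffolds lines above.
top5_sizes = [1388064, 70356, 42565, 8315, 2425]

# Running totals: element i is the combined size of the i+1 largest scaffolds.
cumulative = list(accumulate(top5_sizes))

for rank, size in enumerate(cumulative, start=1):
    print(f"top {rank} scaffold size={size}")
```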
Alignment files
 Media:NC_010981-scf.filter-q.png
 Media:NC_010981-scf.filter-1.png
 Media:scf-NC_010981.filter-1.png
 Media:NC_010981-ctg.filter-q.png
 Media:NC_010981-ctg.filter-1.png
 Media:NC_010981-ctg-deg.filter-q.png
 Media:NC_010981-ctg-deg.filter-1.png
Stats:
      #elem  min   max      mean   median  n50      sum
 scf  16     1035  1389537  95513  1501    1389537  1528210  ; 4 CONTAINED in bigger scf
 ctg  21     1035  478325   72697  1583    478325   1526633  ; 4 CONTAINED in bigger ctg
 deg  92     245   7632     1079   843     1000     99229    ; 22 CONTAINED in ctg
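For reference, the columns of the Stats table (including n50) can be computed from a list of sequence lengths as sketched below. This is not the tool that produced the table; the function names are made up, and the example input is the 12 contig lengths of the Sept 2008 Sanger reference listed under Data Sources, since the full scf/ctg/deg length lists are not reproduced on this page.

```python
from statistics import median
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length at which the running total (longest first) reaches half the sum."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("empty length list")


def summarize(lengths: List[int]) -> Dict[str, float]:
    """Return the same columns as the Stats table: #elem min max mean median n50 sum."""
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(sum(lengths) / len(lengths)),
        "median": median(lengths),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }


# Contig lengths of the Sept 2008 reference (Wb_Cq_080903.dbs).
ref_lengths = [1136301, 346054, 15587, 13962, 9245, 6967,
               5343, 3501, 3350, 2148, 2071, 1994]

stats = summarize(ref_lengths)
print(stats)
```

Because one contig holds most of the bases, n50 equals the longest contig here, the same pattern seen in the scf and ctg rows above.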
 #id               len      gc%    Wb_Cq.7  NC_gene  cvg
 scf7180000001311  1389537  34.17  3744     1250     14.59   #ctg7180000001298..ctg7180000001301
 scf7180000001316  70460    34.82  85       62       6.22    #ctg7180000001303..ctg7180000001305
 scf7180000001315  42565    34.15  72       110      13.01
 scf7180000001310  8315     35.65  5        3        4.84
 scf7180000001307  2425     62.76  0        0        2.64
 ...
 scf7180000001320  1315     35.67  3        1        20.27
 ...
 ctg7180000001299  478325   34.12  1253     478      15.56
 ctg7180000001300  466173   34.13  1401     532      14.5
 ctg7180000001298  316943   34.05  541      343      13.99
 ctg7180000001301  126623   34.61  550      225      12.77
 ctg7180000001302  42565    34.15  72       110      13.01
 ctg7180000001305  37016    34.66  47       52       6.49
 ...
 deg7180000001279  7632     33.32  6        6        26.85   # the long degenerates have high coverage
 deg7180000001277  4159     35.18  61       72       40.3
 deg7180000001280  3685     36.88  12       18       36.67
 ...
 deg7180000001231  245      33.06  0        0        1.3
Filtering
Steps:
- Align scaffold & degenerates to top3 ref ctgs culex161b01.q1k(346,054)+culexbac1b5Ab03.q1k(1,136,301)+culex49c07.p1k(9,245) using nucmer;
- Filter alignments using "delta-filter -r"
- Remove CONTAINED scf & deg
- add scf & deg that contain UNIQUE seq & not in the list: scf7180000001309(1,361) & deg7180000001252(937)
- order & orient ctgs
 #id               len      gc%    Wb_Cq.7  NC_gene  cvg    contained
 ctg7180000001230  1361     37.99  17       5        3.03   N
 ctg7180000001248  8315     35.65  5        3        4.84   N
 ctg7180000001298  316943   34.05  541      343      13.99  N
 ctg7180000001299  478325   34.12  1253     478      15.56  N
 ctg7180000001300  466173   34.13  1401     532      14.5   N
 ctg7180000001301  126623   34.61  550      225      12.77  N
 ctg7180000001302  42565    34.15  72       110      13.01  N
 ctg7180000001303  29919    34.92  35       52       6.29   N
 ctg7180000001304  3421     35.19  3        2        2.83   N
 ctg7180000001305  37016    34.66  47       52       6.49   N
 scf7180000001309  1361     37.99  17       5        3.03   N
 scf7180000001310  8315     35.65  5        3        4.84   N
 scf7180000001311  1389537  34.17  3744     1250     14.59  N   # origin of replication at pos 112,7008 (-)
 scf7180000001315  42565    34.15  72       110      13.01  N
 scf7180000001316  70460    34.82  85       62       6.22   N
 deg7180000001236  2346     34.02  4        5        35.34  N
 deg7180000001244  3090     33.66  8        10       33.09  N
 deg7180000001252  937      36.29  4        5        1.37   N
 deg7180000001256  1888     32.42  110      49       19.98  N
 deg7180000001260  1198     37.73  5        3        13.1   N
 deg7180000001266  2375     32.80  55       49       28.4   N
 deg7180000001272  2923     32.91  54       46       35.26  N
 deg7180000001277  4159     35.18  61       72       40.3   N
 deg7180000001279  7632     33.32  6        6        26.85  N
 deg7180000001280  3685     36.88  12       18       36.67  N
 deg7180000001290  1879     31.40  10       5        33.06  N
 => 10 ctgs (5 scaff) & 11 deg
.scaff file
 >7180000001309 1 1365 1364
 7180000001230 BE 1365 0
 >7180000001310 1 8319 8318
 7180000001248 BE 8319 0
 >7180000001311 4 1404424 1405836
 7180000001298 BE 320424 -19
 7180000001299 BE 484180 1434
 7180000001300 BE 471690 1
 7180000001301 BE 128130 0
 >7180000001315 1 42888 42887
 7180000001302 BE 42888 0
 >7180000001316 3 70739 70822
 7180000001303 BE 30041 1
 7180000001304 BE 3421 85
 7180000001305 BE 37277 0
Reference sequence not present in the assembly
100+ bp 0cvg regions in the reference:
 1. culexbac1b5Ab03.q1k  429746   429972   226  0
 2. culexbac1b5Ab03.q1k  907129   908079   950  0
 1. NC_010981.1          775928   776047   120  0
 2. NC_010981.1          1253284  1254139  856  0

 1.1. NC_010981.1 RefSeq gene 775763 777826 . + . contains GeneID:6385213
      # WP0709 Putative outer membrane protein
 1.2. NC_010981.1 RefSeq gene 1252115 1253287 . + . begin GeneID:6385392
      # tuf translation elongation factor tu (2 in Sanger wPip, none in Dan's annotation)
 1.3. NC_010981.1 RefSeq gene 1253302 1253622 . + . contained GeneID:6385310
      # rpsJ 30s ribosomal protein s10 ??? missing;
      # very conserved in Wolbachia endosymbiont of Drosophila melanogaster
 2.1. NC_010981.1 RefSeq gene 1253632 1254354 . + . end GeneID:6385679
      # rplC ribosomal protein L3 (partially present in Dan's annotation)
      # very conserved in several species : Wolbachia, Ehrlichia ...
No promer alignments of sequences to these regions
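The "100+ bp 0cvg regions" above are the stretches of the reference that no assembly sequence aligns to. One way to derive such regions from a list of aligned intervals is sketched below. It is an assumption-laden illustration rather than the procedure actually used: intervals are taken to be 1-based, inclusive and already oriented so that start <= end, and the example coordinates are invented.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def uncovered_regions(seq_len: int,
                      aligned: List[Interval],
                      min_len: int = 100) -> List[Interval]:
    """Return the gaps of at least min_len bp not covered by any aligned interval."""
    gaps: List[Interval] = []
    next_uncovered = 1  # first position not yet known to be covered
    for start, end in sorted(aligned):
        # Everything between next_uncovered and start-1 has zero coverage.
        if start - next_uncovered >= min_len:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    # Uncovered tail of the sequence, if long enough.
    if seq_len - next_uncovered + 1 >= min_len:
        gaps.append((next_uncovered, seq_len))
    return gaps


# Invented example: a 1000 bp sequence with four alignments.
example = [(1, 200), (150, 400), (520, 900), (950, 1000)]
print(uncovered_regions(1000, example))
```

In the example the 119 bp gap at 401..519 is reported, while the 49 bp gap at 901..949 falls under the 100 bp cutoff.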
Assembly sequence not present in the reference
100+ bp 0cvg regions in the assembly:
    ctg_start_stop                 len  gc%    comments
 1. ctg7180000001230_202_725       524  37.50
    # first 202bp have multiple alignments to NC_010981
    # bases 203..725 have no alignments to NC_010981
    # this contig used to contain cloning vector at the 3' end, which was removed (725..1160)
    # scf7180000001309
 2. ctg7180000001305_11303_11867   565  33.45
    # aligns at 100%len, 100%id to Wolbachia endosymbiont of Drosophila melanogaster, complete genome; NC_002978.6:243504..243803
    # 11070..11735 Putative dna repair protein radc [Wolbachia pipientis] (Dan's annotation)
    # scf7180000001316 3 70739 70822
 3. ctg7180000001305_13367_14006   640  33.91
    # good blastx alignment to Wolbachia gene on 100% length; NC_002978.6:488974..489912
 4. deg7180000001252_328_851       524  35.88
    # blastx align to We of Bm NC_006833.1:754520..755170
    no alignments to Sanger raw reads
Others: might be contaminated?
 5. ctg7180000001257  1378  32.80
    # 313:1378 Culex pipiens LINE repeat !!!
    # 537..1379 reverse transcriptase [Bacteroides thetaiotaomicron VPI-5482] (Dan's annotation)
 6. ctg7180000001285  1568  37.37
    # GC% higher than avg
    # 125..700 transcriptional regulator, XRE family [Thermotoga lettingae TMO] (Dan's annotation)
    # 984..1571 putative outer membrane protein probably involved in nutrient binding [Bacteroides fragilis YCH46] (Dan's annotation)
ORFs
 1.1 ctg7180000001230:orf00001 ctg7180000001230 -1 262 +2 transposase, IS256 family [Wolbachia endosymbiont of Drosophila melanogaster]
     $ cat NC_010981.ptt | grepi -c transposase
     80
 1.2 ctg7180000001230:orf00002 ctg7180000001230 1277 840 -3
     blast e-val:3e-84 chloramphenicol acetyltransferase [Salmonella enterica subsp. enterica serovar Typhi str. CT18]
     not in NC_010981
     !!! cloning vector !!!
 >ctg7180000001230:orf00002 ctg7180000001230 1277 840 len=438
 ATGGCAATGAAAGACGGTGAGCTGGTGATATGGGATAGTGTTCACCCTTGTTACACCGTT
 TTCCATGAGCAAACTGAAACGTTTTCATCGCTCTGGAGTGAATACCACGACGATTTCCGG
 CAGTTTCTACACATATATTCGCAAGATGTGGCGTGTTACGGTGAAAACCTGGCCTATTTC
 CCTAAAGGGTTTATTGAGAATATGTTTTTCGTCTCAGCCAATCCCTGGGTGAGTTTCACC
 AGTTTTGATTTAAACGTGGCCAATATGGACAACTTCTTCGCCCCCGTTTTCACCATGGGC
 AAATATTATACGCAAGGCGACAAGGTGCTGATGCCGCTGGCGATTCAGGTTCATCATGCC
 GTTTGTGATGGCTTCCATGTCGGCAGAATGCTTAATGAATTACAACAGTACTGCGATGAG
 TGGCAGGGCGGGGCGTAA
 >ctg7180000001230:orf00002 ctg7180000001230 1277 840 len=438
 MAMKDGELVIWDSVHPCYTVFHEQTETFSSLWSEYHDDFRQFLHIYSQDVACYGENLAYF
 PKGFIENMFFVSANPWVSFTSFDLNVANMDNFFAPVFTMGKYYTQGDKVLMPLAIQVHHA
 VCDGFHVGRMLNELQQYCDEWQGGA*
 >gi|18466598|ref|NP_569406.1| chloramphenicol acetyltransferase [Salmonella enterica subsp. enterica serovar Typhi str. CT18]
 MEKKITGYTTVDISQWHRKEHFEAFQSVAQCTYNQTVQLDITAFLKTVKKNKHKFYPAFIHILARLMNAH
 PEFRMAMKDGELVIWDSVHPCYTVFHEQTETFSSLWSEYHDDFRQFLHIYSQDVACYGENLAYFPKGFIE
 NMFFVSANPWVSFTSFDLNVANMDNFFAPVFTMGKYYTQGDKVLMPLAIQVHHAVCDGFHVGRMLNELQQ
 YCDEWQGGA
 2. ctg7180000001305:orf00013 ctg7180000001305 11070 11735 +3 DNA repair protein RadC, putative [Wolbachia endosymbiont of Drosophila melanogaster]
    # 3 copies in NC_010981
    $ cat NC_010981.ptt | grepi RadC
    280207..280863    -  218  190570723  -  WP0276  -  -  Putative dna repair protein radc
    488966..489634    -  222  190570883  -  WP0459  -  -  Putative dna repair protein radc
    1418058..1418726  -  222  190571715  -  WP1343  -  -  Putative dna repair protein radc
 3. ctg7180000001305:orf00015 ctg7180000001305 13185 14093 +3 transcriptional regulator, putative [Wolbachia endosymbiont of Drosophila melanogaster]
    # 10 copies
    cat NC_010981.ptt | grepi "transcriptional regulator"
    247653..248570    -  305  190570687  -  WP0239  -  -  Putative transcriptional regulator
    277056..277895    -  279  190570720  -  WP0273  -  -  Putative transcriptional regulator
    277921..278835    -  304  190570721  -  WP0274  -  -  Putative transcriptional regulator
    281034..281954    -  306  190570724  -  WP0277  -  -  Putative transcriptional regulator
    296912..297466    -  184  190570733  -  WP0290  -  -  Putative transcriptional regulator
    486388..487365    -  325  190570881  -  WP0457  -  -  Putative transcriptional regulator
    630467..631237    +  256  190570997  -  WP0585  -  -  two component transcriptional regulator
    806511..806837    +  108  190571141  -  WP0739  -  -  Putative transcriptional regulator, MerR family
    1129005..1129301  +  98   190571445  -  WP1058  -  -  Putative transcriptional regulator
    1415480..1416457  -  325  190571713  -  WP1341  -  -  putative transcriptional regulator
4. ?
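ORF 1.2 above is reported on the reverse strand (coordinates 1277 down to 840, 438 bp) together with its nucleotide and protein sequences. The relationship between the two can be checked with a small translation routine. The sketch below builds the standard codon table (bacterial table 11 uses the same codon-to-amino-acid assignments) and translates the first and last codons of the ORF; the helper names are made up for the example, and it is not the gene finder that produced the ORF calls.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (needed to read a minus-strand ORF)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate in frame 1; unknown codons become 'X', trailing bases are ignored."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


# First 30 and last 18 bases of ctg7180000001230:orf00002, copied from above
# (the listed sequence is already in the coding orientation).
orf_start = "ATGGCAATGAAAGACGGTGAGCTGGTGATA"
orf_end = "TGGCAGGGCGGGGCGTAA"

print(translate(orf_start))  # first residues of the protein
print(translate(orf_end))    # last residues, ending in the stop codon '*'

# The ORF length follows from its coordinates and is a whole number of codons.
orf_len = 1277 - 840 + 1
print(orf_len, orf_len % 3)
```

To translate straight from the contig instead, the 840..1277 slice would first be passed through `reverse_complement`.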
Repeats
- No alignments of ctg/deg to RepeatMaskerLib
- Tandem repeats
- Minisatellite copy number variation can be used to genotype bacterial strains
 $ show-coords NC_010981.trf-wPip.trf.filter-1.delta | grep -f NC_010981.trf-wPip.trf.filter-1.qry_diff
 1 100 | 126 27 | 100 100 | 100.00 | 100 208 | 100.00 48.08 | 34.18.100 98.54.208 [CONTAINED]
 1 122 | 122 1 | 122 122 | 100.00 | 122 208 | 100.00 58.65 | 35.54.122 98.54.208 [CONTAINED]
 1 273 | 1 273 | 273 273 | 100.00 | 280 355 | 97.50 76.90 | 68.75.280 75.75.355 [CONTAINED]
 $ infoseq wPip.trf.fasta | grep -f NC_010981.trf-wPip.trf.filter-1.qry_diff
 60.65.213 213 38.97
 75.75.355 355 34.08
 98.54.208 208 45.19
 99.18.146 146 43.84
- RepeatScout pipeline summary
 10 ctg+11 degen
           #elem  min  max    mean  median  n50    sum
 families  90     67   7071   678   358     989    61024
 repeats   465    30   7071   631   378     984    293381
 uniq      248    68   46556  5071  1549    16001  1257709

 NC_010981
           #elem  min  max    mean  median  n50    sum
 families  51     71   5779   726   307     1360   37012
 repeats   304    31   5779   682   548     989    207331
 uniq      199    68   58998  6384  2512    16001  1270466
- !!! more repeats in our assembly
- Comparison of the longest repeats (our strain vs Sanger strain):
 $ cd /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/RepeatScout
 $ sort -nk2 -r wPip-NC_010981.families.infocount
 fam  len   gc%    #ref  #qry
 12   7071  36.18  5     2     # repeat family 12 has 5 copies in our assembly and 2 copies in NC_010981
 57   6770  35.05  4     2
 77   3129  34.48  4     5
 6    2461  35.11  3     2
 87   1468  34.81  4     2
 26   1399  35.67  0     0
 1    1346  36.18  2     2
 2    1345  38.74  33    31
 60   1097  39.56  3     5
- there are differences in the copy numbers
- there is no frequent repeat present in one genome but not in the other
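Both observations can be read off the copy-number table directly. The sketch below does so for the nine longest families listed above; the tuple layout is an assumption, and the column meaning (#ref = our assembly, #qry = NC_010981) follows the comment on family 12.

```python
# (family id, length, copies in our assembly, copies in NC_010981),
# copied from the table of the longest repeat families above.
families = [
    (12, 7071, 5, 2),
    (57, 6770, 4, 2),
    (77, 3129, 4, 5),
    (6, 2461, 3, 2),
    (87, 1468, 4, 2),
    (26, 1399, 0, 0),
    (1, 1346, 2, 2),
    (2, 1345, 33, 31),
    (60, 1097, 3, 5),
]

# Families whose copy number differs between the two genomes.
differing = [fam for fam, _length, ours, sanger in families if ours != sanger]

# Families present in exactly one of the two genomes.
exclusive = [fam for fam, _length, ours, sanger in families
             if (ours == 0) != (sanger == 0)]

print("copy number differs:", differing)
print("present in only one genome:", exclusive)
```

Among these families, seven differ in copy number and none is exclusive to one genome.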
Multiple copies in reference
Multiple copies in assembly
SNPs
Rearrangements
- ~ 10 rearrangements
- some rearrangements are associated with IS elements: the 20 copy 1.3K repeats belong to "12 IS5 (IS256-family)" , transposase gene
Improving strategy
Identify more reads that align to those regions (blastn TA):
 236 : all
 198 : new
 395 : new+mates
Adding these reads did not improve the assembly.
 gi|42519920|ref|NC_002978.6| 243504 243803 # ctg7180000001305_11303_11867 565 33.45 : there are reads aligned to 241173-243822
 gi|42519920|ref|NC_002978.6| 488974 489912 # ctg7180000001305_13367_14006 640 33.91 : there are reads aligned to 487182-492517
 gi|58584261|ref|NC_006833.1| 754520 755170 # deg7180000001252_328_851 524 35.88 : there are reads aligned to 754300-755148
All the reads aligned to the 3 regions above have been assembled; the 3 regions seem to contain rearrangements ---
Files & Directories
- Wolbachia pipientis, complete genome
/fs/szasmg2/Culex_pipiens_symbiont/NCBI/NC_010981.fna
- qc file
/fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.qc
- AMOS bank
/fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.bnk/
- nucmer alignment files : assembly scaffolds/contigs/degenerates/unitigs vs the reference genome
 *.filter-q.* were generated using "delta-filter -q"
 *.filter-1.* were generated using "delta-filter -1"
 /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/nucmer/NC_010981-*delta
Filtered scaffolds:
 #id               len      gc%    Wb_Cq.7  NC_gene  cvg
 scf7180000001311  1389537  34.17  3744     1250     14.59  *
 scf7180000001316  70460    34.82  85       62       6.22
 scf7180000001315  42565    34.15  72       110      13.01  *
 scf7180000001310  8315     35.65  5        3        4.84
 scf7180000001319  1501     36.64  3        1        7.9
 scf7180000001312  1378     32.80  0        0        1.62
 scf7180000001309  1361     37.99  17       5        3.03
 scf7180000001320  1315     35.67  3        1        20.27
 scf7180000001317  1173     34.53  3        1        2.18
 scf7180000001318  1115     36.41  2        3        2.32
 scf7180000001321  1035     34.01  3        1        3.4

 ctg7180000001299  478325   34.12  1253     478      15.56  *
 ctg7180000001300  466173   34.13  1401     532      14.5   *
 ctg7180000001298  316943   34.05  541      343      13.99  *
 ctg7180000001301  126623   34.61  550      225      12.77  *
 ctg7180000001302  42565    34.15  72       110      13.01  *
 ctg7180000001305  37016    34.66  47       52       6.49
 ctg7180000001303  29919    34.92  35       52       6.29
 ctg7180000001248  8315     35.65  5        3        4.84
 ctg7180000001304  3421     35.19  3        2        2.83
 ctg7180000001270  1501     36.64  3        1        7.9
 ctg7180000001257  1378     32.80  0        0        1.62
 ctg7180000001230  1361     37.99  17       5        3.03
 ctg7180000001284  1315     35.67  3        1        6.83
 ctg7180000001232  1173     34.53  3        1        2.18
 ctg7180000001237  1115     36.41  2        3        2.32
 ctg7180000001297  1035     34.01  3        1        3.4
- filtered contigs & degens
 /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.ctg-deg.filter.fasta
 /fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/cpqg.ctg-deg.filter.infoseq
Annotation (original)
Format annotation for NCBI submission:
 $ wc -l cpqg.ctg.CDS cpqg.ctg.tRNA cpqg.ctg.rRNA
 1476 cpqg.ctg.CDS
 34 cpqg.ctg.tRNA
 4 cpqg.ctg.rRNA

 cat cpqg.ctg.CDS | sed 's/orf00//' | sed 's/ctg718000000//' | ~/bin/tab2annotation.pl -hl 1 -t CDS >! cpqg.ctg.CDS.tbl
 cat cpqg.ctg.tRNA | sed 's/orf00//' | sed 's/ctg718000000//' | ~/bin/tab2annotation.pl -hl 1 -t tRNA >! cpqg.ctg.tRNA.tbl
 cat cpqg.ctg.rRNA | sed 's/orf00//' | sed 's/ctg718000000//' | ~/bin/tab2annotation.pl -hl 1 -t rRNA >! cpqg.ctg.rRNA.tbl
 cat cpqg.ctg.CDS.tbl cpqg.ctg.tRNA.tbl cpqg.ctg.rRNA.tbl > cpqg.ctg.tbl

Sanger Wolbachia: far fewer genes !!!
 NC_010981.ptt
 1248 CDS
 25 tRNA : 1Leu (vs 5 in our strain) !!!
 2 rRNA
No CRISPRs found by CRISPRFinder
Annotation (revised)
- Genes manually curated by Dan; many transposases deleted
 wc -l cpqg.ctg.CDS cpqg.ctg.tRNA cpqg.ctg.rRNA cpqg.deg.CDS
 1342 cpqg.ctg.CDS
 36 cpqg.deg.CDS
 34 cpqg.ctg.tRNA
 4 cpqg.ctg.rRNA
 1416 total
NCBI submission
- name: Wolbachia pipientis wPip(strain) JHB(substrain) (got this from Steven)
- NCBI suggestion:
[organism=Wolbachia endosymbiont of Culex quinquefasciatus JHB] [host=Culex quinquefasciatus JHB]
 http://www.ncbi.nlm.nih.gov/genomes/mpfsubmission.cgi?show=EB95B67D-199C-42C9-80CE-F2AC9C7C7A02
 Project ID: 32209
 Locus Tag Prefix: C1A
- Submission dir:
/fs/szasmg2/Culex_pipiens_symbiont/2008_0829_WGA-wgs-e.20/submission2/
- Submission via GenomesMacroSend;
- Direct Submit ID: DSub8465 (1st submission)
- Direct Submit ID: DSub8474,DSub8475 (revisions to the 1st submission)
- TaxId: 569881
- ABZA00000000 Genome Project
- ABZA00000000 Project accession number
- ctg: ABZA01000001..ABZA01000021
- scaff: DS996929-DS996944
- NCBI files:
 /fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA01_accs : 21 ctg & deg accession numbers ABZA01000001..ABZA01000021
 /fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA.01.modified.p2g 1378 gene accession numbers EEB55160.. EEB56537
 /fs/szasmg2/Culex_pipiens_symbiont/best/submission2/ABZA01_scfld_DS_accs 16 scaffold id's
- Future updates
- Protein id formats to use(?):
 gnl|umiacs|C1A_1|gb|EEB55198
 gnl|WGS:ABZA|C1A_1|gb|EEB55198
- AA AI 3900
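The counts quoted under "NCBI files" can be cross-checked against the accession ranges and against the revised annotation totals. A minimal sketch, assuming each accession is an alphabetic prefix followed by digits (the helper name is made up for the example):

```python
import re


def accession_range_size(first: str, last: str) -> int:
    """Number of accessions in an inclusive range that shares one alphabetic prefix."""
    m_first = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m_last = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError(f"incompatible accessions: {first} {last}")
    return int(m_last.group(2)) - int(m_first.group(2)) + 1


n_contigs = accession_range_size("ABZA01000001", "ABZA01000021")
n_scaffolds = accession_range_size("DS996929", "DS996944")
n_proteins = accession_range_size("EEB55160", "EEB56537")

# CDS counts from the revised annotation (cpqg.ctg.CDS + cpqg.deg.CDS).
n_cds = 1342 + 36

print(n_contigs, n_scaffolds, n_proteins, n_cds)
```

The protein accession range covers exactly as many entries as there are CDS lines in the revised annotation, and the contig and scaffold ranges match the 21 and 16 quoted above.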