Pine tree: Difference between revisions
Revision as of 20:34, 10 August 2011
Links
- dendrome@ucdavis
- pinegenome.org
- NCBI Taxonomy record Pinus taeda or "loblolly pine"
- LOBLOLLY PINE BAC LIBRARY@MSSTATE.EDU AC241263..AC241361
- Adventures in the enormous: a 1.8 million clone BAC library for the 21.7 Gb genome of loblolly pine. PLoS One Jan 2011
Abstract: Loblolly pine (LP; Pinus taeda L.) is the most economically important tree in the U.S. and a cornerstone species in southeastern forests. However, genomics research on LP and other conifers has lagged behind studies on flowering plants due, in part, to the large size of conifer genomes. As a means to accelerate conifer genome research, we constructed a BAC library for the LP genotype 7-56. The LP BAC library consists of 1,824,768 individually-archived clones making it the largest single BAC library constructed to date, has a mean insert size of 96 kb, and affords 7.6X coverage of the 21.7 Gb LP genome. To demonstrate the efficacy of the library in gene isolation, we screened macroarrays with overgos designed from a pine EST anchored on LP chromosome 10. A positive BAC was sequenced and found to contain the expected full-length target gene, several gene-like regions, and both known and novel repeats. Macroarray analysis using the retrotransposon IFG-7 (the most abundant repeat in the sequenced BAC) as a probe indicates that IFG-7 is found in roughly 210,557 copies and constitutes about 5.8% or 1.26 Gb of LP nuclear DNA; this DNA quantity is eight times the Arabidopsis genome. In addition to its use in genome characterization and gene isolation as demonstrated herein, the BAC library should hasten whole genome sequencing of LP via next-generation sequencing strategies/technologies and facilitate improvement of trees through molecular breeding and genetic engineering. The library and associated products are distributed by the Clemson University Genomics Institute (www.genome.clemson.edu).
Data
UCDAVIS plone
- Links
https://dendrome.ucdavis.edu/TGPlone/research-projects/pinerefseq dpuiu ddr5fft6 https://dendrome.ucdavis.edu/TGPlone/research-projects/pinerefseq/files/library-and-flow-cell-data/prs-tracking-database-archive/
- Documents
- PRS_experiment_agenda_2011-07-28_05-43pm_PDT.ods 21 July 2011
IPST ftp
ftp genomepc1.umd.edu
ftpuser
pinegenome
cd PineUpload052911/
bin
prompt # no Y/N?
mget *
Local data
ginkgo:
/fs/szattic-asmg7/PINE/PineUpload052911
/fs/szattic-asmg7/PINE/PineUpload070711
PineUpload052911
Chloroplast
              len     gc%
cChloroplast  120481  38.55
cBACs
.    elem  min    q1     q2      q3      max     mean    n50     sum
len  102   8288   89909  116121  140549  172161  113400  126689  11566806
gc%  102   34.44  36.56  37.61   38.80   52.88   37.94   37.66   3870.87
Reads
lane           readLen  #mates      mea,std  ~gc%
FC638TR_001_8  146      22,729,231  400      39.04
FC638TR_002_8  146      18,412,638  400      39.04
- Quality decreases sharply after pos 120
FC638TR.qual.png
- First 10bp of each read have higher AG count
FC638TR.content.png
- Over 0.5% Ns at certain positions
fwd: 1.015% pos=100 ; 0.81% pos=119
rev: 1.114% pos=101 ; 0.92% pos=107 ; 0.87% pos=30 ; 0.21% pos=21
FC638TR.Ns.png
- GC% variation: cBAC(37.5%) < cChloroplast(38.5%) < reads(39%) < mito (44%+)
- cChloroplast alignments (bwasw)
lane             #hits    %hits  #hits(uniq)
FC638TR_001_8_1  475254   2.09   468309
FC638TR_001_8_2  473331   2.08   466185
FC638TR_002_8_1  1009331  5.48   995291
FC638TR_002_8_2  1004341  5.45   990122
- cBAC alignments (bwasw)
lane             #hits    %hits  #hits(uniq)
FC638TR_001_8_1  9722204  42.77  9533849
FC638TR_001_8_2  9481188  41.71  9303475
FC638TR_002_8_1  7684164  41.73  7535809
FC638TR_002_8_2  7469151  40.56  7330078
Sampled reads
- 100K sampled reads from each library (2*2*100K=400K)
.    elem    min   q1     q2     q3     max    mean   n50    sum
gc%  400000  0.68  34.93  39.04  43.15  95.89  39.20  40.41  .
- FC638TR_001_8_1 alignments
ref           qry              aligner  #hits  %hits  %identity(median)
cBAC          FC638TR_001_8_1  bwasw    42971  43
                               nucmer   12477  12.5   95
                               bowtie   1186   1.2%
cChloroplast                   bwasw    2031   2%
                               nucmer   1943   1.9%   100
                               bowtie   1490   1.5%
- FC638TR_00[12]_8_[12] bwa alignments
ref           qry              aligner  #hits  %hits
cBAC          FC638TR_001_8_1  bwasw    42971  43
              FC638TR_001_8_2           41915  42
              FC638TR_002_8_1           42128  42
              FC638TR_002_8_2           40606  41
cChloroplast  FC638TR_001_8_1           2031   2
              FC638TR_001_8_2           2033   2
              FC638TR_002_8_1           5370   5.3
              FC638TR_002_8_2           5330   5.3
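The page does not say how the 100K reads per file were drawn. One common way to take a fixed-size uniform sample from a large read file in a single pass is reservoir sampling. The sketch below is only an illustration of that technique: the function name, the seed argument, and the toy input are assumptions, not the procedure used for these libraries.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return k records chosen uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(rec)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


# Toy stand-in for read records; a real run would iterate over FASTQ entries.
toy_reads = [f"read_{n}" for n in range(1000)]
picked = reservoir_sample(toy_reads, k=10, seed=1)
print(len(picked))
```

Because the random choices depend only on the record index and the seed, running the function with the same seed over the two mate files (assuming equal record counts and order) selects the same positions in both, which keeps mates paired.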
SOAPdenovo's
#scaffold stats
.                              elem      min  q1   q2    q3   max     mean    n50  sum
-K47 -max_rd_len100            211820    100  143  156*  187  23273   227.95  .    48284629
-K31 -max_rd_len100            13747338  100  100  100   100  9185    108.04  .    1485269562
-K31 -d2 -D3 -max_rd_len100    74820     100  105  125   390  31673   320.75  .    23998536
-K31 -d20 -M3 -max_rd_len100   7859*     100  113  139   284  43079*  331.49  .    2605184*
-K27 -d 2 -D 3 -max_rd_len100  70246     100  107  137   413  30683   369.81  .    25977758
-K27 -d 2 -D 2 -max_rd_len146  224963    100  110  128   343  23410   260.64  .    58635190
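The columns used in these tables (elem, min, q1, q2, q3, max, mean, n50, sum) can be reproduced from a list of scaffold lengths. The following is a minimal sketch; the page does not state which tool or quartile convention produced its numbers, so the "inclusive" quartile method and the function names here are assumptions.

```python
import statistics
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Largest length L such that sequences of length >= L cover at least half of all bases."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("empty length list")


def length_summary(lengths: List[int]) -> Dict[str, float]:
    """Summary row in the same column order as the tables on this page."""
    # Quartile convention is an assumption; other conventions give slightly different q1/q3.
    q1, q2, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "q1": q1,
        "q2": q2,
        "q3": q3,
        "max": max(lengths),
        "mean": round(statistics.mean(lengths), 2),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }


toy_lengths = [2, 3, 4, 5, 6]  # toy data, not real scaffold lengths
summary = length_summary(toy_lengths)
print(summary)
```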
SOAPdenovo-31mer -K 27 -d 2 -D 3 -max_rd_len 100
#stats
.    elem     min  q1   q2   q3   max     mean    n50  sum
scf  70246    100  107  137  413  30683*  369.81  .    25977758
ctg  8641885  28   28   31   37   7238    36.1    .    312425669
Alignments
.             elem   min  q1   q2    q3    max    mean     n50  sum
cChloroplast  136    100  117  142   187   628    168.34   0    22894
cBAC          6385   100  116  187   499   23267  597.00   0    3811871
mito          84     110  479  1791  7050  30683  4268.99  0    358595
other         63641  100  106  134   409   22471  342.30   0    21784398
Alignment1 (old)
nucmer default parameters
# Legend:
all                     : all SOAPdenovo scaffolds
cBAC                    : scaffolds aligned to cBACs
cChloroplast            : scaffolds aligned to cChloroplast
mito                    : scaffolds aligned to at least one of the 31 complete plant mitochondrion sequences
mito.Cycas_taitungensis : scaffolds aligned to the Cycas_taitungensis mitochondrion sequence (most hits)
other                   : unaligned scaffolds
# scaffold length stats
.                        elem   min  q1   q2    q3    max     mean     n50  sum
all                      70246  100  107  137   413   30683   369.81   .    25977758
cBAC                     1839   100  124  242   625   23267   637.13   .    1171678
cChloroplast             73     100  117  139   185   416     161.47   .    11787     # why so bad???
mito                     68     131  867  2274  7241  30683   4675.18  .    317912
mito.Cycas_taitungensis  64     111  844  1931  7114  30683*  4529.91  .    289914
other                    68266  100  106  136   412   26715   358.54   .    24476381

#scaffold gc stats
.                        elem   min    q1     q2     q3     max    mean   n50  sum
all                      70246  4.90   35.40  40.74  44.52  74.26  39.78  .    .
cBAC                     1839   10.64  35.63  41.22  44.87  74.26  39.95  .    .
cChloroplast             73     25.65  31.09  33.33  36.89  42.31  33.76  .    .
mito                     68     43.08  45.96  47.45  49.19  56.41  47.77  .    .
mito.Cycas_taitungensis  64     41.44  46.27  47.81  50.00  56.41  48.16  .    .
other                    68266  4.90   35.40  40.71  44.50  70.00  39.77  .    .
- The longest assembled scaffold was 30683bp and aligned to the mitochondrion database.
- The mitochondrial gc% appears to be significantly higher than that of the rest of the genome (48% vs 40%).
- The Cycas taitungensis mitochondrion (414903bp, 46.92%gc) had the most scaffolds aligned to it (64 out of 68).
NC_009618 Cycas taitungensis chloroplast, complete genome DNA; circular; Length: 163,403 nt
NC_010303 Cycas taitungensis mitochondrion, complete genome DNA; circular; Length: 414,903 nt
Cycas_taitungensis_mito-chloroplast.png
- Mitochondrial scaffolds
.           elem  min    q1     q2     q3     max    mean     n50    sum
scf         68    131    867    2274   7241   30683  4675.18  9407   317912   # used for alignment
scf.gc%     68    43.08  45.96  47.45  49.19  56.41  47.77    47.45  3248.1
scf.noGaps  68    131    743    2049   6660   27931  4262.46  9052   289847
- Reads aligned to mitochondrial scaffolds (bwa bwasw)
lane             #hits  %hits
FC638TR_001_8_1  12307  0.054
FC638TR_001_8_2  11933
FC638TR_002_8_1  28707  0.12
FC638TR_002_8_2  27211
total            80158         # 20X cvg for 100bp read len & 400K mito genome ; 29X cvg for 146bp read len
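The coverage figures in the comment above follow from the usual estimate: depth = number of aligned reads × read length / genome size. A short sketch of that arithmetic, using the per-lane hit counts from the table and the 400K mitochondrial genome size assumed in the note (the function name is illustrative):

```python
def fold_coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Average depth of coverage: total aligned bases divided by genome size."""
    return n_reads * read_len / genome_size


MITO_GENOME_SIZE = 400_000  # assumed mito genome size, as in the note above

# Per-lane hit counts from the table above.
lane_hits = {
    "FC638TR_001_8_1": 12307,
    "FC638TR_001_8_2": 11933,
    "FC638TR_002_8_1": 28707,
    "FC638TR_002_8_2": 27211,
}
total_hits = sum(lane_hits.values())

cov_100 = fold_coverage(total_hits, 100, MITO_GENOME_SIZE)
cov_146 = fold_coverage(total_hits, 146, MITO_GENOME_SIZE)
print(total_hits, round(cov_100, 1), round(cov_146, 1))
```

The 100bp figure treats every read as trimmed to the SOAPdenovo -max_rd_len value; the 146bp figure uses the full read length.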
Alignment2 (old)
nucmer -l 20 -c 20; delta-filter -l 65 -q -o 75 ; filter for gc% >=44
#some of the mito hits align to cChloroplast & cBAC => might have an overestimate

# Mitochondrial scaffolds
.        elem  min    q1     q2     q3     max    mean     n50    sum
scf.len  102   101    608    1931   7271   30683  5044.88  11204  514578
scf.gc%  102   44.07  46.12  47.45  49.33  56.41  48.05    47.47  4901.06

lane             #hits   %hits
FC638TR_001_8_1  18614
FC638TR_001_8_2  18035
FC638TR_002_8_1  43961
FC638TR_002_8_2  42101
total            122707         # 30X cvg for 100bp read len & 400K mito genome
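The "filter for gc% >=44" step above can be sketched as a simple per-scaffold GC computation followed by a threshold. This is an illustration only: the script actually used is not given on this page, the scaffold names and sequences below are toy data, and counting Ns in the denominator is an assumption.

```python
from typing import Dict


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases over the full sequence length (Ns included)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def filter_by_gc(scaffolds: Dict[str, str], min_gc: float = 44.0) -> Dict[str, str]:
    """Keep only scaffolds whose GC% is at least min_gc."""
    return {name: seq for name, seq in scaffolds.items() if gc_percent(seq) >= min_gc}


# Toy scaffolds (hypothetical names and sequences).
toy_scaffolds = {
    "scf_a": "GCGCATGC",  # 6 of 8 bases are G/C
    "scf_b": "ATATATGC",  # 2 of 8 bases are G/C
}
kept = filter_by_gc(toy_scaffolds)
print(sorted(kept))
```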
SOAPdenovo-31mer -K 31 -d 20 -M 3 -max_rd_len 100
#scaffold stats
.    elem    min  q1   q2   q3   max     mean    n50  sum
scf  7859*   100  113  139  284  43079*  331.49  .    2605184
ctg  200062  32   33   37   47   10392   48.52   .    9707307

# scaffold length stats
.             elem   min  q1   q2   q3    max     mean     n50  sum
all           7859*  100  113  139  284   43079*  331.49   .    2605184
cChloroplast  20     111  193  436  6140  43079   5951.05  0    119021
cBAC          5117   100  114  141  320   13733   334.94   0    1713870
mito          8      101  134  685  1396  2166    749.75   0    5998     !!! VERY BAD
other         2714   100  111  133  226   7353    282.35   0    766295
SOAPdenovo-31mer -K 31 -d 48 -max_rd_len 100 -M 3 choloplast_mated_reads
#scaffold stats
.    elem  min  q1   q2   q3    max    mean     n50  sum
scf  20    111  193  436  6140  42707  5928.20  0    118564
PineUpload070711
Ecoli
         len      gc%
cE_coli  4639675  50.79
Cloning vector
           len   gc%
pFosDT5_2  8345  47.93
Drosophila refseq
Chromosome     len          gc%
2L             23,011,544   41
2R             21,146,708   43
3L             24,543,557   41
3R             27,905,053   42
4              1,351,857    35
X              22,422,827   42
un             10,049,037   ?
mitochondrion  19,517       17
total          137,586,636  ?    # actually the chromosome lengths sum to 130,450,100
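The discrepancy flagged in the comment on the total row can be checked directly by summing the listed lengths. A small sketch (the variable names are illustrative):

```python
# Lengths as listed in the Drosophila refseq table above.
chrom_len = {
    "2L": 23_011_544,
    "2R": 21_146_708,
    "3L": 24_543_557,
    "3R": 27_905_053,
    "4": 1_351_857,
    "X": 22_422_827,
    "un": 10_049_037,
    "mitochondrion": 19_517,
}
listed_total = 137_586_636  # value given in the "total" row

computed_total = sum(chrom_len.values())
difference = listed_total - computed_total
print(f"{computed_total:,} (listed total is larger by {difference:,})")
```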
Reads (Drosophila)
lib              readLen  #reads    #cE_coli         #pFosDT5_2       #cChloroplast  #cBAC
FC70M6V_6_001_1  160      23546475  2931496(12.44%)  5473141(23.24%)  24148(0.10%)   7739576(32.86%)
FC70M6V_6_001_2  156      23546475  2885406(12.25%)  5854468(24.86%)  21794(0.09%)   7520343(31.93%)

lib readLen #mates mea,std ~gc% %merged(Tanja) %cE_coli %cpFosDT5_2 %cChloroplast %cBAC %other
FC70M6V_6_001 160,156 23546475 343,30 42.5 12.5% 24% 0.09% 32.5 34 # sampled 100K
TIL_242_FC70M6V_2_002 160,156 9917211 242 . 91.4%
TIL_242_FC70M6V_3_002 160,156 6276300 242 92.7%
TIL_254_FC70M6V_2_004 160,156 9279789 254 . 91.5%
TIL_254_FC70M6V_3_004 160,156 5924239 254 92.9%
TIL_270_FC70M6V_2_003 160,156 10188776 270 . 88.1%
TIL_270_FC70M6V_3_003 160,156 6556676 270 90.3%
TIL_288_FC70M6V_2_001 160,156 9524524 288 . 80.0%
TIL_288_FC70M6V_3_001 160,156 6158919 288 83.0%
- kastevens@ucdavis.edu:
- The files labeled TIL_XXX_FC70M6V_Y_00Z, are Drosophila libraries with a median target insert size of XXX. They come in pairs and can be merged.
- Regarding pairing, each insert size was run in two lanes Y at two different concentrations.
- Lane 3, with the lower concentration, should have higher quality data than lane 2 but with a higher cost per bp.
- The loss in quality was quantitatively small, so we don't expect the extra expense of lowering the concentration will be justified empirically.
- The first library, FC70M6V_6_001, is a ~40x library created from a pool of ~1000 fosmids. In general, we do not put the insert size in the filename.
- However, we did estimate the insert size to be 343bp with a below median standard deviation of 30. So roughly 15% of the inserts are < 313bp and have > 3bp overlap. This seems to fit well with your result.
- Each lane is multiplexed into sub-lanes indicated by 00Z. So the number of reads in the file is variable and not necessarily reflective of the cluster density.
- The Drosophila libraries were each run in 1/4 lane and the fosmid pool was run in 1/2 lane. The pool has roughly double the sequence content of the Drosophila libraries run in lane 2 at nominal density.
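The "roughly 15%" estimate in the notes above can be reproduced under the assumption that insert sizes are normally distributed with mean 343bp and standard deviation 30bp. Mates of 160bp and 156bp overlap by (160 + 156 - insert) bases, so an overlap of more than 3bp requires an insert shorter than 313bp, which is one standard deviation below the mean. A sketch of that calculation (the normality assumption and the function name are not from the source):

```python
import math


def normal_cdf(x: float, mean: float, sd: float) -> float:
    """P(X < x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


READ_LEN_FWD = 160
READ_LEN_REV = 156
MIN_OVERLAP = 3  # overlap threshold quoted in the notes above

INSERT_MEAN = 343.0
INSERT_SD = 30.0

# Mates overlap by more than MIN_OVERLAP when the insert is shorter than this.
max_insert = READ_LEN_FWD + READ_LEN_REV - MIN_OVERLAP
frac_overlapping = normal_cdf(max_insert, INSERT_MEAN, INSERT_SD)
print(max_insert, round(100 * frac_overlapping, 1))
```

This estimate is of the same order as the %merged value reported for FC70M6V_6_001 in the table.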