Pine tree: Difference between revisions

From Cbcb
* [http://www.pinegenome.org/pinerefseq pinegenome.org]
* [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=3352 NCBI Taxonomy record] Pinus taeda or "loblolly pine"
* [http://www.pine.msstate.edu/bac.htm LOBLOLLY PINE BAC LIBRARY@MSSTATE.EDU] AC241263..AC241361
* [http://www.ncbi.nlm.nih.gov/pubmed/21283709 Adventures in the enormous: a 1.8 million clone BAC library for the 21.7 Gb genome of loblolly pine.] PLoS One Jan 2011
Abstract:
''Loblolly pine (LP; Pinus taeda L.) is the most economically important tree in the U.S. and a cornerstone species in southeastern forests. However, genomics research on LP and other conifers has lagged behind studies on flowering plants due, in part, to the large size of conifer genomes. As a means to accelerate conifer genome research, we constructed a BAC library for the LP genotype 7-56. The LP BAC library consists of 1,824,768 individually-archived clones making it the largest single BAC library constructed to date, has a mean insert size of 96 kb, and affords 7.6X coverage of the 21.7 Gb LP genome. To demonstrate the efficacy of the library in gene isolation, we screened macroarrays with overgos designed from a pine EST anchored on LP chromosome 10. A positive BAC was sequenced and found to contain the expected full-length target gene, several gene-like regions, and both known and novel repeats. Macroarray analysis using the retrotransposon IFG-7 (the most abundant repeat in the sequenced BAC) as a probe indicates that IFG-7 is found in roughly 210,557 copies and constitutes about 5.8% or 1.26 Gb of LP nuclear DNA; this DNA quantity is eight times the Arabidopsis genome. In addition to its use in genome characterization and gene isolation as demonstrated herein, the BAC library should hasten whole genome sequencing of LP via next-generation sequencing strategies/technologies and facilitate improvement of trees through molecular breeding and genetic engineering. The library and associated products are distributed by the Clemson University Genomics Institute (www.genome.clemson.edu).''
* [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies&f=study&term=%28Pinus+taeda%29+&go=Go SRA traces]


= Data =
== NCBI ==
* [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=studies&f=study&term=%28Pinus+taeda%29+&go=Go SRA traces] BAC 454 reads
* BAC assembled sequences : AC241263..AC241361, HQ141589, GU477256..GU477266
* Plant mitochondrion finished sequences
  .      elem    min    q1      q2      q3      max      mean    sum
  len    31      45223  209482  414903  539368  982833  402851  12488404
  gc%    31      32.80  43.73  43.93  44.98  46.92    43.41    .
* Cycas taitungensis has the most similar mitochondrion
  NC_009618 chloroplast    163,403
  NC_010303 mitochondrion  414,903
  mitochondrion vs chloroplast:  [[Media:Cycas_taitungensis_mito-chloroplast.png|Cycas_taitungensis_mito-chloroplast.png]]
== UCDAVIS plone ==
* Links
   FC638TR_002_8  146      18,412,638    400          39.04


* Quality decreases sharply after pos 120       [[Media:FC638TR.qual.png|FC638TR.qual.png]]
* First 10bp of each read have higher AG count  [[Media:FC638TR.content.png|FC638TR.content.png]]
* Over 0.5% Ns certain positions                [[Media:FC638TR.Ns.png|FC638TR.Ns.png]]
   fwd: 1.015% pos=100 ; 0.81% pos=119
   rev: 1.114% pos=101 ; 0.92% pos=107 ; 0.87% pos=30 ; 0.21% pos=21


* GC% variation: cBAC(37.5%) < cChloroplast(38.5%) < reads(39%) < mito (44%+)  


* Contamination:
   lane             #reads       #cChloroplast   #cBAC               #mito
   FC638TR_001_8_1  22,729,231   468,309(2%)     9,533,849(42.7%)    12715(0.056%)
   FC638TR_001_8_2  22,729,231   466,185(2%)     9,303,475(41.7%)    12291
   FC638TR_002_8_1  18,412,638   995,291(5.4%)   7,535,809(41.7%)    30839(0.16%)
   FC638TR_002_8_2  18,412,638   990,122(5.4%)   7,330,078(40.5%)    29444
   total                                                             85289            # ~21X cvg for 100bp read len & 400K mito genome

* cChloroplast alignments (bwasw)
   lane             #hits    %hits  #hits(uniq)
   FC638TR_001_8_1  475254   2.09   468309
   FC638TR_001_8_2  473331   2.08   466185
   FC638TR_002_8_1  1009331  5.48   995291
   FC638TR_002_8_2  1004341  5.45   990122

* cBAC alignments (bwasw)
   lane             #hits    %hits  #hits(uniq)
   FC638TR_001_8_1  9722204  42.77  9533849
   FC638TR_001_8_2  9481188  41.71  9303475
   FC638TR_002_8_1  7684164  41.73  7535809
   FC638TR_002_8_2  7469151  40.56  7330078
* alignments:
   program: bwa bwasw
   cChloroplast ref: 1 seq
   cBAC:             101 seqs
   mito:             83 scaffolds ~358162bp

== Sampled reads ==
*  100K sampled reads from each library (2*2*100K=400K)
   .       elem      min   q1     q2     q3     max        mean       n50        sum
   gc%     400000    0.68  34.93  39.04  43.15  95.89      39.20      40.41      .

* FC638TR_001_8_1 alignments
   ref            qry              aligner      #hits      %hits   %identity(median)
   cBAC           FC638TR_001_8_1  bwasw        42971      43
                                   nucmer       12477      12.5    95
                                   bowtie       1186       1.2%
   cChloroplast                    bwasw        2031       2%
                                   nucmer       1943       1.9%    100
                                   bowtie       1490       1.5%
 
* FC638TR_00[12]_8_[12] bwa alignments
  ref            qry              aligner      #hits      %hits
   cBAC           FC638TR_001_8_1  bwasw        42971      43
                FC638TR_001_8_2                41915      42
                FC638TR_002_8_1                42128      42
                FC638TR_002_8_2                40606      41
   cChloroplast  FC638TR_001_8_1                2031      2
                FC638TR_001_8_2                2033      2
                FC638TR_002_8_1                5370      5.3
                FC638TR_002_8_2                5330      5.3


== SOAPdenovo's ==
   #scaffold stats
   .                               elem        min    q1     q2     q3     max        mean       n50        sum
   -K31 -d0  -max_rd_len100        13,747,338  100    100    100    100    9,185      108.04     .          1,485,269,562

   -K31 -d2  -max_rd_len72         28,934      100    111    136    426    23,376     378.53*    0          10,952,507
   -K31 -d2  -max_rd_len100        74,820      100    105    125    390    31,673     320.75     .          23,998,536
   -K31 -d2  -max_rd_len146        264,547     100    108    123    169    32,435     228.49     0          60,445,493
   -K31 -d20 -max_rd_len100        7,859*      100    113    139    284    43,079     331.49     .          2,605,184
   -K31 -d48 -max_rd_len100        3,626       100    113    139    255    43,131*    339.01     .          1,229,250
   -K47 -d0  -max_rd_len100        211,820     100    143    156*   187    23,273     227.95     .          48,284,629
   -K47 -d2  -max_rd_len100        61,152      100    121    151    200    30,846     286.05     0          17,492,450

   #scaffold stats
   .                                     elem      min    q1     q2     q3     max        mean       n50        sum
   -K47           -max_rd_len100         211820    100    143    156*   187    23273      227.95     .          48284629

   -K31           -max_rd_len100         13747338  100    100    100    100    9185       108.04     .          1485269562
   -K31 -d2 -D3   -max_rd_len100         74820     100    105    125    390    31673      320.75     .          23998536
   -K31 -d20 -M3  -max_rd_len100         7859*     100    113    139    284    43079*     331.49     .          2605184*
   -K27 -d 2 -D 3 -max_rd_len100         70246     100    107    137    413    30683      369.81     .          25977758
   -K27 -d 2 -D 2 -max_rd_len146         224963    100    110    128    343    23410      260.64     .          58635190

==  SOAPdenovo-31mer -K 27 -d 2 -D 3 -max_rd_len 100 ==
   #stats
   .                          elem     min    q1     q2     q3     max     mean    n50    sum
   scf                        70246    100    107    137    413    30683*  369.81  .      25977758
   ctg                        8641885  28     28     31     37     7238    36.1    .      312425669

=== Alignments ===
   .              elem   min  q1   q2    q3    max    mean     n50  sum
   cChloroplast   136    100  117  142   187   628    168.34   0    22894
   cBAC           6385   100  116  187   499   23267  597.00   0    3811871
   mito           84     110  479  1791  7050  30683  4268.99  0    358595
   other          63641  100  106  134   409   22471  342.30   0    21784398

===  Alignment1 (old) ===
   nucmer default parameters
   # Legend:
   all                        : all SOAPdenovo scaffolds
   cBAC                       : scaffolds aligned to cBACs
   cChloroplast               : scaffolds aligned to cChloroplast
   mito                       : scaffolds aligned to at least one of the 31 complete plant mitochondrion sequences
   mito.Cycas_taitungensis    : scaffolds aligned to the Cycas_taitungensis mitochondrion sequence (most hits)
   other                      : unaligned scaffolds

   # scaffold length stats
   .                          elem   min    q1     q2     q3     max     mean     n50    sum
   all                        70246  100    107    137    413    30683   369.81   .      25977758
   cBAC                       1839   100    124    242    625    23267   637.13   .      1171678
   cChloroplast               73     100    117    139    185    416     161.47   .      11787        # why so bad???
   mito                       68     131    867    2274   7241   30683   4675.18  .      317912
   mito.Cycas_taitungensis    64     111    844    1931   7114   30683*  4529.91  .      289914
   other                      68266  100    106    136    412    26715   358.54   .      24476381

   #scaffold gc stats
   .                          elem   min    q1     q2     q3     max     mean     n50    sum
   all                        70246  4.90   35.40  40.74  44.52  74.26   39.78    .      .
   cBAC                       1839   10.64  35.63  41.22  44.87  74.26   39.95    .      .
   cChloroplast               73     25.65  31.09  33.33  36.89  42.31   33.76    .      .
   mito                       68     43.08  45.96  47.45  49.19  56.41   47.77    .      .
   mito.Cycas_taitungensis    64     41.44  46.27  47.81  50.00  56.41   48.16    .      .
   other                      68266  4.90   35.40  40.71  44.50  70.00   39.77    .      .

* The longest assembled scaffold was 30683bp and aligned to the mitochondrion database.
* The mitochondrion gc% seems to be significantly higher than that of the rest of the genome (48% vs 40%)
* The Cycas taitungensis mitochondrion (414903bp, 46.92%gc) had the most scaffolds aligned to it (64 out of 68).
   NC_009618 Cycas taitungensis chloroplast, complete genome   DNA; circular; Length: 163,403 nt
   NC_010303 Cycas taitungensis mitochondrion, complete genome DNA; circular; Length: 414,903 nt
   [[Media:Cycas_taitungensis_mito-chloroplast.png|Cycas_taitungensis_mito-chloroplast.png]]

* Mitochondrial scaffolds
   .                    elem      min    q1     q2     q3     max        mean       n50        sum
   scf                  68        131    867    2274   7241   30683      4675.18    9407       317912          # used for alignment
   scf.gc%              68        43.08  45.96  47.45  49.19  56.41      47.77      47.45      3248.1
   scf.noGaps           68        131    743    2049   6660   27931      4262.46    9052       289847

* Reads aligned to mitochondrial scaffolds (bwa bwasw)
   lane               #hits  %hits
   FC638TR_001_8_1    12307  0.054
   FC638TR_001_8_2    11933
   FC638TR_002_8_1    28707  0.12
   FC638TR_002_8_2    27211
   total              80158          # 20X cvg for 100bp read len & 400K mito genome ; 29X cvg for 146bp read len

===  Alignment2 (old) ===
   nucmer -l 20 -c 20; delta-filter -l 65 -q -o 75 ; filter for gc% >=44
   #some of the mito hits align to cChloroplast & cBAC => might have an overestimate

   # Mitochondrial scaffolds
   .                    elem       min    q1     q2     q3     max        mean       n50        sum
   scf.len              102        101    608    1931   7271   30683      5044.88    11204      514578
   scf.gc%              102        44.07  46.12  47.45  49.33  56.41      48.05      47.47      4901.06

   lane               #hits  %hits
   FC638TR_001_8_1    18614
   FC638TR_001_8_2    18035
   FC638TR_002_8_1    43961
   FC638TR_002_8_2    42101
   total              122707            # 30X cvg for 100bp read len & 400K mito genome

== SOAPdenovo-31mer -K 31 -d 20 -M 3 -max_rd_len 100 ==
   #scaffold stats
   .                          elem    min    q1     q2     q3     max     mean     n50    sum
   scf                        7859*   100    113    139    284    43079*  331.49   .      2605184
   ctg                        200062  32     33     37     47     10392   48.52    .      9707307

   # scaffold length stats
   .                          elem    min    q1     q2     q3     max     mean     n50    sum
   all                        7859*   100    113    139    284    43079*  331.49   .      2605184
   cChloroplast               20      111    193    436    6140   43079   5951.05  0      119021
   cBAC                       5117    100    114    141    320    13733   334.94   0      1713870
   mito                       8       101    134    685    1396   2166    749.75   0      5998        !!! VERY BAD
   other                      2714    100    111    133    226    7353    282.35   0      766295

== SOAPdenovo-31mer -K 31 -d 48 -max_rd_len 100 -M 3 chloroplast_mated_reads ==
   #scaffold stats
   .                    elem      min    q1     q2     q3     max        mean       n50        sum
   scf                  20        111    193    436    6140   42707      5928.20    0          118564

==  SOAPdenovo-31mer -K 31 -d 2 -max_rd_len 100 ==
   #stats
   .               elem        min  q1   q2    q3    max    mean     n50  sum           readOnContig
   scf             74,820      100  105  125   390   31,673 320.75   0    23,998,536
   ctg             5,755,282   32   32   35    43    7,195  41.63    0    239,620,204   33,083,609(40%)
   edge            11,015,468  1    2    4     11    7,164  8.75     0    96,380,983
   reads           82,283,738                                             6,006,712,874

   #scf alignments
   .               elem      min  q1   q2    q3    max     mean     n50  sum
   all             74,820    100  105  125   390   31,673  320.75   0    23,998,536
   cChloroplast    206       100  122  159   229   767     191.56   0    39,462       # VERY BAD
   cBAC            10,533    100  113  143   428   26,589  477.68   0    5,031,439
   mito            83        105  448  1730  6851  26,364  4315.20  0    358,162
   other           63,998    100  104  122   382   31,673  290.16   0    18,569,473   # align to mito database ; Cycas_taitungensis was top hit
   other.long.hiGC 45        5066 6717 8233  10488 31,673  9662.07  0    434,793

== SOAPdenovo-31mer -K 31 -d 20 -max_rd_len 100 ==
   #stats
   .               elem      min  q1   q2    q3    max     mean     n50  sum          readOnContig
   scf             7,859     100  113  139   284   43,079* 331.49   .    2,605,184
   ctg             200,062   32   33   37    47    10,392  48.52    .    9,707,307    19,002,331(23%)
   reads           82,283,738

   #scf alignments
   .               elem      min  q1   q2    q3    max     mean     n50  sum
   all             7,859*    100  113  139   284   43,079* 331.49   .    2,605,184
   cChloroplast    20        111  193  436   6140  43,079  5951.05  0    119,021      # MUCH BETTER
   cBAC            5,117     100  114  141   320   13,733  334.94   0    1,713,870
   mito            8         101  134  685   1396  2,166   749.75   0    5,998        # VERY BAD
   other           2,714     100  111  133   226   7,353   282.35   0    766,295

== SOAPdenovo-31mer -K 31 -d 48 -max_rd_len 100 chloroplast_mated_reads ==
   #scaffold stats
   .               elem      min  q1   q2    q3    max    mean     n50  sum
   scf             20        111  193  436   6140  42707  5928.20  0    118564

= PineUpload070711 =
   total          137,586,636    ?    # actually the chromosome lengths sum to 130,450,100

== Reads ==

   lib                      readLen  #reads    #cE_coli         #pFosDT5_2       #cChloroplast  #cBAC
   FC70M6V_6_001_2          156      23546475  2885406(12.25%)  5854468(24.86%)  21794(0.09%)   7520343(31.93%)

   lib                      readLen  #mates    mean,std  ~gc%  %merged(Tanja) %cE_coli %cpFosDT5_2 %cChloro  %cBAC  %pBAC-DE %other
   FC70M6V_6_001            160,156  23546475  343,30    42.5                 12.5%    24%         0.09%     32.5   19.3          # sampled 100K

   TIL_242_FC70M6V_2_002    160,156  9917211   242       .     91.4%
   TIL_242_FC70M6V_3_002    160,156  6276300   242             92.7%
   TIL_288_FC70M6V_2_001    160,156  9524524   288       .     80.0%
   TIL_288_FC70M6V_3_001    160,156  6158919   288             83.0%

* kastevens@ucdavis.edu:
** Drosophila libraries run in lane 2 at nominal density.

== SOAPdenovo-31mer -K 31 -d 2 -D 3 -max_rd_len 100 ==
* About 8.5% of the reads contain a high-copy kmer; they don't get assembled.
AAAGAGTGTAGATCTCGGTGGT
AACTCCAGTCACTTAGGCATCT
AAGACGGCATACGAGATGCCTA
AAGAGTGTAGATCTCGGTGGTC
AAGATCGGAAGAGCGTCGTGTA
AAGCAGAAGACGGCATACGAGA
AAGTGACTGGAGTTCAGACGTG
ACACGTCTGAACTCCAGTCACT
ACCGAGATCTACACTCTTTCCC
ACGAGATGCCTAAGTGACTGGA
ACGGCATACGAGATGCCTAAGT
ACGGCGACCACCGAGATCTACA
ACGTCTGAACTCCAGTCACTTA
ACTCCAGTCACTTAGGCATCTC
AGAAGACGGCATACGAGATGCC
AGACGGCATACGAGATGCCTAA
AGACGTGTGCTCTTCCGATCTA
AGAGTGTAGATCTCGGTGGTCG
AGATCGGAAGAGCACACGTCTG
AGATCGGAAGAGCGTCGTGTAG
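
A figure like the 8.5% above could be obtained by scanning each read for any of the listed kmers. The sketch below is only an illustration, not the procedure actually used for this page: the function name, the toy reads, and the choice to count a read once no matter how many kmers it contains are all assumptions.

```python
def fraction_with_kmer(reads, kmers):
    """Return the fraction of reads containing at least one of the given kmers."""
    kmer_set = set(kmers)
    sizes = {len(k) for k in kmer_set}  # normally a single k (22 for the list above)
    hits = 0
    for read in reads:
        found = any(
            read[i:i + k] in kmer_set
            for k in sizes
            for i in range(len(read) - k + 1)
        )
        if found:
            hits += 1
    return hits / len(reads) if reads else 0.0


# Two kmers copied from the list above; the reads are made up for the example.
high_copy = ["AAGATCGGAAGAGCGTCGTGTA", "AGATCGGAAGAGCACACGTCTG"]
toy_reads = [
    "TTTT" + high_copy[0] + "CCCC",     # contains the first kmer
    "ACGTACGTACGTACGTACGTACGTACGT",     # contains neither
    "GG" + high_copy[1],                # contains the second kmer
    "TTTTTTTTTTTTTTTTTTTTTTTTTT",       # contains neither
]
fraction = fraction_with_kmer(toy_reads, high_copy)
print(f"{100 * fraction:.1f}% of reads contain a high-copy kmer")
```

For tens of millions of reads a dedicated kmer counter would be far faster; the point here is only what is being counted.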
 
== SOAPdenovo-31mer -K 31 -d 2 -max_rd_len 100 ==
  #stats
  .              elem      min  q1  q2    q3    max      mean      n50  sum              readOnContig
  scf            20,441    100  124  374  1980  291,000  2575.50  0    52,645,707
  ctg            802,463  32  33  39    63    73,415  91.13    0    73,131,767        37,254,577
  edge            1,013,801 1    2    7    32    30,919  48.85    0    49,525,815
  reads          47,092,950                                              7,440,686,100
 
  #scf alignments
  .              elem      min  q1  q2    q3    max      mean      n50  sum
  all            20,441    100  124  374  1980  291,000  2575.50  0    52,645,707
  cE_coli        149      100  325  6612  41908  291,000  30160.59  0    4,493,928
  cpFosDT5_2      0
  cChloroplast    58        105  166  374  1950  24,932  1875.86  0    108,800
  cBAC            12,294    100  141  785  4204  45,781  3513.34  0    43,192,987
  other          7953      100  113  171  599    41,416  619.60    0    4,927,664
 
=== Alignments ===
   .                    elem       min    q1     q2     q3     max        mean       n50        sum
   all                  20441      100    124    374    1980   291000     2575.50    0          52645707
   cE_coli              149        100    325    6612   41908  291000     30160.59   0          4493928
   cpFosDT5_2           0
   cChloroplast         58         105    166    374    1950   24932      1875.86    0          108800
   cBAC                 12294      100    141    785    4204   45781      3513.34    0          43192987
   other                7953       100    113    171    599    41416      619.60     0          4927664

== SOAPdenovo-31mer -K 31 -d 20 -max_rd_len 100 ==
   #stats
   .               elem      min  q1   q2    q3     max      mean      n50  sum               readOnContig
   scf             25,482    100  127  262   993    239,672  1339.89   0    34,143,040
   ctg             265,450   32   34   50    121    49,599   143.69    0    38,141,459        40,191,864(85%)
   edge            530,926   1    3    11    40     41,918   63.06     0    33,477,999
   reads           47,092,950                                               7,440,686,100

   #scf alignments
   .               elem      min  q1   q2    q3     max      mean      n50  sum
   all             25,482    100  127  262   993    239,672  1339.89   0    34,143,040
   cE_coli         205       100  252  2244  30571  239,672  21916.78  0    4,492,939
   cpFosDT5_2      17        100  118  171   272    855      275.24    0    4,679
   cChloroplast    31        100  130  322   1363   5,717    986.52    0    30,582
   cBAC            15,668    100  133  336   1529   33,075   1559.92   0    24,440,863
   other           9,574     100  117  171   522    27,341   542.74    0    5,196,233

Latest revision as of 21:35, 11 August 2011

Links

Abstract: Loblolly pine (LP; Pinus taeda L.) is the most economically important tree in the U.S. and a cornerstone species in southeastern forests. However, genomics research on LP and other conifers has lagged behind studies on flowering plants due, in part, to the large size of conifer genomes. As a means to accelerate conifer genome research, we constructed a BAC library for the LP genotype 7-56. The LP BAC library consists of 1,824,768 individually-archived clones making it the largest single BAC library constructed to date, has a mean insert size of 96 kb, and affords 7.6X coverage of the 21.7 Gb LP genome. To demonstrate the efficacy of the library in gene isolation, we screened macroarrays with overgos designed from a pine EST anchored on LP chromosome 10. A positive BAC was sequenced and found to contain the expected full-length target gene, several gene-like regions, and both known and novel repeats. Macroarray analysis using the retrotransposon IFG-7 (the most abundant repeat in the sequenced BAC) as a probe indicates that IFG-7 is found in roughly 210,557 copies and constitutes about 5.8% or 1.26 Gb of LP nuclear DNA; this DNA quantity is eight times the Arabidopsis genome. In addition to its use in genome characterization and gene isolation as demonstrated herein, the BAC library should hasten whole genome sequencing of LP via next-generation sequencing strategies/technologies and facilitate improvement of trees through molecular breeding and genetic engineering. The library and associated products are distributed by the Clemson University Genomics Institute (www.genome.clemson.edu).

Data

NCBI

  • BAC assembled sequences : AC241263..AC241361, HQ141589, GU477256..GU477266
  • Plant mitochondrion finished sequences
 .      elem    min    q1      q2      q3      max      mean     sum
 len    31      45223  209482  414903  539368  982833   402851   12488404
 gc%    31      32.80  43.73   43.93   44.98   46.92    43.41    .
  • Cycas taitungensis has the most similar mitochondrion
 NC_009618	chloroplast     163,403
 NC_010303	mitochondrion   414,903
 mitochondrion vs chloroplast:  Cycas_taitungensis_mito-chloroplast.png

UCDAVIS plone

  • Links
 https://dendrome.ucdavis.edu/TGPlone/research-projects/pinerefseq  
 dpuiu
 ddr5fft6 
 https://dendrome.ucdavis.edu/TGPlone/research-projects/pinerefseq/files/library-and-flow-cell-data/prs-tracking-database-archive/

IPST ftp

 ftp genomepc1.umd.edu
 ftpuser
 pinegenome

 cd PineUpload052911/
 bin
 prompt             # no Y/N?
 mget *

Local data

 ginkgo:
 /fs/szattic-asmg7/PINE/PineUpload052911
 /fs/szattic-asmg7/PINE/PineUpload070711

PineUpload052911

Chloroplast

                len      gc%
 cChloroplast   120481   38.55

cBACs

 .       elem       min    q1     q2     q3     max        mean       n50        sum            
 len     102        8288   89909  116121 140549 172161     113400     126689     11566806       
 gc%     102        34.44  36.56  37.61  38.80  52.88      37.94      37.66      3870.87        
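
The gc% rows in these tables are per-sequence GC percentages. A minimal sketch of that calculation follows; the function name is made up, and leaving Ns and other ambiguity codes out of the denominator is an assumption, since the page does not say how they were handled.

```python
def gc_percent(seq: str) -> float:
    """GC content as a percentage of the unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    # Assumption: N and other IUPAC codes are ignored rather than counted as non-GC.
    return 100.0 * gc / acgt if acgt else 0.0


print(round(gc_percent("GGCCAATT"), 2))    # half of the bases are G or C
print(round(gc_percent("ATGCGCNN"), 2))    # the two Ns are not counted
```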

Reads

 lane           readLen   #mates        mean,std     ~gc%
 FC638TR_001_8  146       22,729,231    400           39.04
 FC638TR_002_8  146       18,412,638    400           39.04
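  • Quality decreases sharply after pos 120: FC638TR.qual.png
  • First 10bp of each read have higher AG count: FC638TR.content.png
  • Over 0.5% Ns at certain positions: FC638TR.Ns.png
 fwd: 1.015% pos=100 ; 0.81% pos=119
 rev: 1.114% pos=101 ; 0.92% pos=107 ; 0.87% pos=30 ; 0.21% pos=21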
  • GC% variation: cBAC(37.5%) < cChloroplast(38.5%) < reads(39%) < mito (44%+)
  • Contamination:
 lane                   #reads       #cChloroplast   #cBAC               #mito
 FC638TR_001_8_1	22,729,231   468,309(2%)     9,533,849(42.7%)    12715(0.056%)
 FC638TR_001_8_2	22,729,231   466,185(2%)     9,303,475(41.7%)    12291
 FC638TR_002_8_1	18,412,638   995,291(5.4%)   7,535,809(41.7%)    30839 (0.16%) 
 FC638TR_002_8_2	18,412,638   990,122(5.4%)   7,330,078(40.5%)    29444
 total                                                                   85289             # ~21X cvg for 100bp read len & 400K mito genome
  • alignments:
 program: bwa bwasw
 cChloroplast ref: 1 seq
 cBAC:             101 seqs
 mito:             83 scaffolds ~358162bp

SOAPdenovo's

 #scaffold stats
 .                                elem        min    q1     q2     q3     max        mean       n50        sum 
 -K31 -d0  -max_rd_len100         13,747,338  100    100    100    100    9,185      108.04     .          1,485,269,562

 -K31 -d2  -max_rd_len72          28,934      100    111    136    426    23,376     378.53*    0          10,952,507
 -K31 -d2  -max_rd_len100         74,820      100    105    125    390    31,673     320.75     .          23,998,536  
 -K31 -d2  -max_rd_len146         264,547     100    108    123    169    32,435     228.49     0          60,445,493
 -K31 -d20 -max_rd_len100         7,859*      100    113    139    284    43,079     331.49     .          2,605,184            
 -K31 -d48 -max_rd_len100         3,626       100    113    139    255    43,131*    339.01     .          1,229,250
 -K47 -d0  -max_rd_len100         211,820     100    143    156*   187    23,273     227.95     .          48,284,629
 -K47 -d2  -max_rd_len100         61,152      100    121    151    200    30,846     286.05     0          17,492,450
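
The n50 column in these tables is mostly left as 0 or ".". If it needs filling in, the usual definition (the length L such that scaffolds of length L or longer hold at least half of the assembled bases) can be computed from a list of scaffold lengths. This is a sketch under that assumption, not the tool that produced the tables; q1 and q3 are omitted because the quartile convention used here is not stated.

```python
from statistics import mean, median


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Summary row in the spirit of the tables above (non-empty input expected)."""
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "q2": median(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 2),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }


# Toy scaffold lengths, made up for the example.
toy_lengths = [100, 200, 300, 400, 1000]
stats = summarize(toy_lengths)
print(stats)
```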

SOAPdenovo-31mer -K 31 -d 2 -max_rd_len 100

 #stats
 .               elem        min  q1   q2    q3    max    mean     n50  sum           readOnContig
 scf             74,820      100  105  125   390   31,673 320.75   0    23,998,536
 ctg             5,755,282   32   32   35    43    7,195  41.63    0    239,620,204   33,083,609(40%)
 edge            11,015,468  1    2    4     11    7,164  8.75     0    96,380,983
 reads           82,283,738                                             6,006,712,874
 #scf alignments
 .               elem      min  q1   q2    q3    max     mean     n50  sum
 all             74,820    100  105  125   390   31,673  320.75   0    23,998,536
 cChloroplast    206       100  122  159   229   767     191.56   0    39,462       # VERY BAD
 cBAC            10,533    100  113  143   428   26,589  477.68   0    5,031,439
 mito            83        105  448  1730  6851  26,364  4315.20  0    358,162
 other           63,998    100  104  122   382   31,673  290.16   0    18,569,473   # align to mito database ; Cycas_taitungensis was top hit
 other.long.hiGC 45        5066 6717 8233  10488 31,673  9662.07  0    434,793

SOAPdenovo-31mer -K 31 -d 20 -max_rd_len 100

 #stats
 .               elem      min  q1   q2    q3    max     mean     n50  sum          readOnContig
 scf             7,859     100  113  139   284   43,079* 331.49   .    2,605,184
 ctg             200,062   32   33   37    47    10,392  48.52    .    9,707,307    19,002,331(23%)
 reads           82,283,738
 #scf alignments
 .               elem      min  q1   q2    q3    max     mean     n50  sum
 all             7,859*    100  113  139   284   43,079* 331.49   .    2,605,184
 cChloroplast    20        111  193  436   6140  43,079  5951.05  0    119,021      # MUCH BETTER
 cBAC            5,117     100  114  141   320   13,733  334.94   0    1,713,870
 mito            8         101  134  685   1396  2,166   749.75   0    5,998        # VERY BAD
 other           2,714     100  111  133   226   7,353   282.35   0    766,295

SOAPdenovo-31mer -K 31 -d 48 -max_rd_len 100 chloroplast_mated_reads

 #scaffold stats
 .               elem      min  q1   q2    q3    max    mean     n50  sum            
 scf             20        111  193  436   6140  42707  5928.20  0    118564

PineUpload070711

Ecoli

                len     gc%
 cE_coli        4639675 50.79  

Cloning vector

                len    gc% 
 pFosDT5_2      8345   47.93

Drosophila refseq

 Chromosome      len            gc%
 2L              23,011,544     41
 2R              21,146,708     43
 3L              24,543,557     41
 3R              27,905,053     42
 4               1,351,857      35
 X               22,422,827     42 
 un              10,049,037     ?    
 mitochondrion   19,517         17
 total           137,586,636    ?     # actually the chromosome lengths sum to 130,450,100
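
The note on the total row can be checked by re-adding the lengths listed in the table. This is a small sketch that only uses the values shown above; the variable names are illustrative.

```python
# Lengths copied from the Drosophila refseq table above.
drosophila_lengths = {
    "2L": 23_011_544,
    "2R": 21_146_708,
    "3L": 24_543_557,
    "3R": 27_905_053,
    "4": 1_351_857,
    "X": 22_422_827,
    "un": 10_049_037,
    "mitochondrion": 19_517,
}

reported_total = 137_586_636  # value in the "total" row
summed_total = sum(drosophila_lengths.values())

print(f"sum of listed lengths: {summed_total:,}")
print(f"reported total:        {reported_total:,}")
print(f"difference:            {reported_total - summed_total:,}")
```

The sum includes the "un" and mitochondrion rows; the table does not say where the remaining difference comes from.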

Reads

 lib                      readLen  #reads    #cE_coli         #pFosDT5_2       #cChloroplast  #cBAC  
 FC70M6V_6_001_1          160      23546475  2931496(12.44%)  5473141(23.24%)  24148(0.10%)   7739576(32.86%)
 FC70M6V_6_001_2          156      23546475  2885406(12.25%)  5854468(24.86%)  21794(0.09%)   7520343(31.93%)
 lib                      readLen  #mates    mean,std  ~gc%  %merged(Tanja) %cE_coli %cpFosDT5_2 %cChloro  %cBAC  %pBAC-DE %other
 FC70M6V_6_001            160,156  23546475  343,30    42.5                 12.5%    24%         0.09%     32.5   19.3          # sampled 100K

 TIL_242_FC70M6V_2_002    160,156  9917211   242       .      91.4%  
 TIL_242_FC70M6V_3_002    160,156  6276300   242              92.7%  

 TIL_254_FC70M6V_2_004    160,156  9279789   254        .     91.5%
 TIL_254_FC70M6V_3_004    160,156  5924239   254              92.9%

 TIL_270_FC70M6V_2_003    160,156  10188776  270        .     88.1%
 TIL_270_FC70M6V_3_003    160,156  6556676   270              90.3%

 TIL_288_FC70M6V_2_001    160,156  9524524   288        .     80.0%
 TIL_288_FC70M6V_3_001    160,156  6158919   288              83.0%
  • kastevens@ucdavis.edu:
    • The files labeled TIL_XXX_FC70M6V_Y_00Z are Drosophila libraries with a median target insert size of XXX. They come in pairs and can be merged.
    • Regarding pairing, each insert size was run in two lanes Y at two different concentrations.
    • Lane 3, with the lower concentration, should have higher quality data than lane 2 but with a higher cost per bp.
    • The loss in quality was quantitatively small, so we don't expect the extra expense of lowering the concentration to be justified empirically.
    • The first library, FC70M6V_6_001, is a ~40x library created from a pool of ~1000 fosmids. In general, we do not put the insert size in the filename.
    • However, we did estimate the insert size to be 343bp with a below-median standard deviation of 30. So roughly 15% of the inserts are < 313bp (one standard deviation below the mean) and, with 160bp + 156bp reads, have > 3bp overlap. This seems to fit well with your result.
    • Each lane is multiplexed into sub-lanes indicated by 00Z, so the number of reads in a file is variable and not necessarily reflective of the cluster density.
    • The Drosophila libraries were each run in 1/4 lane and the fosmid pool was run in 1/2 lane. The pool has roughly double the sequence content of the Drosophila libraries run in lane 2 at nominal density.
  • About 8.5% of the reads contain a high-copy kmer; they don't get assembled:
AAAGAGTGTAGATCTCGGTGGT
AACTCCAGTCACTTAGGCATCT
AAGACGGCATACGAGATGCCTA
AAGAGTGTAGATCTCGGTGGTC
AAGATCGGAAGAGCGTCGTGTA
AAGCAGAAGACGGCATACGAGA
AAGTGACTGGAGTTCAGACGTG
ACACGTCTGAACTCCAGTCACT
ACCGAGATCTACACTCTTTCCC
ACGAGATGCCTAAGTGACTGGA
ACGGCATACGAGATGCCTAAGT
ACGGCGACCACCGAGATCTACA
ACGTCTGAACTCCAGTCACTTA
ACTCCAGTCACTTAGGCATCTC
AGAAGACGGCATACGAGATGCC
AGACGGCATACGAGATGCCTAA
AGACGTGTGCTCTTCCGATCTA
AGAGTGTAGATCTCGGTGGTCG
AGATCGGAAGAGCACACGTCTG
AGATCGGAAGAGCGTCGTGTAG
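
Several of the k-mers above begin with AGATCGGAAGAGC, the 13-base prefix commonly used to trim Illumina TruSeq adapters, which suggests (but does not prove) adapter read-through in short inserts. The sketch below counts how many listed k-mers contain that prefix on either strand. The helper names are hypothetical, and the choice of this particular prefix as the motif is an assumption rather than something stated in these notes.

```python
# Adapter prefix commonly used for Illumina TruSeq trimming (assumption:
# these libraries used TruSeq-style adapters).
ADAPTER_PREFIX = "AGATCGGAAGAGC"

# High-copy k-mers copied from the list above.
HIGH_COPY_KMERS = """
AAAGAGTGTAGATCTCGGTGGT AACTCCAGTCACTTAGGCATCT AAGACGGCATACGAGATGCCTA
AAGAGTGTAGATCTCGGTGGTC AAGATCGGAAGAGCGTCGTGTA AAGCAGAAGACGGCATACGAGA
AAGTGACTGGAGTTCAGACGTG ACACGTCTGAACTCCAGTCACT ACCGAGATCTACACTCTTTCCC
ACGAGATGCCTAAGTGACTGGA ACGGCATACGAGATGCCTAAGT ACGGCGACCACCGAGATCTACA
ACGTCTGAACTCCAGTCACTTA ACTCCAGTCACTTAGGCATCTC AGAAGACGGCATACGAGATGCC
AGACGGCATACGAGATGCCTAA AGACGTGTGCTCTTCCGATCTA AGAGTGTAGATCTCGGTGGTCG
AGATCGGAAGAGCACACGTCTG AGATCGGAAGAGCGTCGTGTAG
""".split()

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_motif_either_strand(kmer, motif):
    """True if the motif or its reverse complement occurs in the k-mer."""
    return motif in kmer or revcomp(motif) in kmer


adapter_like = [k for k in HIGH_COPY_KMERS
                if has_motif_either_strand(k, ADAPTER_PREFIX)]

if __name__ == "__main__":
    print(f"{len(adapter_like)} of {len(HIGH_COPY_KMERS)} k-mers contain the prefix")
    for k in adapter_like:
        print(k)
```

K-mers that do not contain this prefix may still derive from other parts of the adapters or from index sequences; checking that would need the full adapter sequences for the kit actually used.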

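The overlap estimate quoted in the sequencing notes above (mean insert 343bp, standard deviation 30, reads of 160bp and 156bp) can be reproduced with a normal approximation. This is a rough sketch: the normal model for insert sizes and the function names are assumptions, not part of the original notes.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def fraction_overlapping(read1_len, read2_len, insert_mean, insert_std,
                         min_overlap):
    """Fraction of mate pairs whose reads overlap by more than min_overlap.

    Two reads overlap by (read1_len + read2_len - insert) bases, so the
    overlap exceeds min_overlap when the insert is shorter than
    read1_len + read2_len - min_overlap.
    """
    max_insert = read1_len + read2_len - min_overlap
    return normal_cdf(max_insert, insert_mean, insert_std)


# Values for library FC70M6V_6_001 from the table and notes above.
frac = fraction_overlapping(160, 156, 343, 30, min_overlap=3)

if __name__ == "__main__":
    print(f"{frac:.1%} of inserts are expected to overlap by more than 3bp")
```

Under these assumptions the threshold of 313bp sits exactly one standard deviation below the mean, which gives about 16%, in line with the "roughly 15%" in the notes.
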
SOAPdenovo-31mer -K 31 -d 2 -max_rd_len 100

 #stats
 .               elem      min  q1   q2    q3     max      mean      n50  sum               readOnContig
 scf             20,441    100  124  374   1980   291,000  2575.50   0    52,645,707
 ctg             802,463   32   33   39    63     73,415   91.13     0    73,131,767        37,254,577
 edge            1,013,801 1    2    7     32     30,919   48.85     0    49,525,815
 reads           47,092,950                                               7,440,686,100
 #scf alignments
 .               elem      min  q1   q2    q3     max      mean      n50  sum
 all             20,441    100  124  374   1980   291,000  2575.50   0    52,645,707
 cE_coli         149       100  325  6612  41908  291,000  30160.59  0    4,493,928
 cpFosDT5_2      0
 cChloroplast    58        105  166  374   1950   24,932   1875.86   0    108,800
 cBAC            12,294    100  141  785   4204   45,781   3513.34   0    43,192,987
 other           7,953     100  113  171   599    41,416   619.60    0    4,927,664

SOAPdenovo-31mer -K 31 -d 20 -max_rd_len 100

 #stats
 .               elem      min  q1   q2    q3     max      mean      n50  sum               readOnContig
 scf             25,482    100  127  262   993    239,672  1339.89   0    34,143,040
 ctg             265,450   32   34   50    121    49,599   143.69    0    38,141,459        40,191,864(85%)
 edge            530,926   1    3    11    40     41,918   63.06     0    33,477,999
 reads           47,092,950                                               7,440,686,100
 #scf alignments
 .               elem      min  q1   q2    q3     max      mean      n50  sum
 all             25,482    100  127  262   993    239,672  1339.89   0    34,143,040
 cE_coli         205       100  252  2244  30571  239,672  21916.78  0    4,492,939
 cpFosDT5_2      17        100  118  171   272    855      275.24    0    4,679
 cChloroplast    31        100  130  322   1363   5,717    986.52    0    30,582
 cBAC            15,668    100  133  336   1529   33,075   1559.92   0    24,440,863
 other           9,574     100  117  171   522    27,341   542.74    0    5,196,233