Bos taurus 3.0: Difference between revisions

From Cbcb
Jump to navigation Jump to search
Line 365: Line 365:


* '''Goal: find all syntenic regions longer than a certain % of the Cow/Human genome'''
* '''Goal: find all syntenic regions longer than a certain % of the Cow/Human genome'''
  Example: 2.86G => 0.1% = 2.86M
   
 
* nucmer params: -l 12 -c 65 -g 1000 -b 1000
* delta-filter -l 200
* 24 * 30 = 720 alignments (except for BtChrU)
 
* Chromosome counts (include gaps)
* Chromosome counts (include gaps)
   .                    elem      min      q1      q2        q3        max        mean      n50        sum             
   .                    elem      min      q1      q2        q3        max        mean      n50        sum             
Line 386: Line 381:
   cow                  72454      1        99      99        248      1074158    286        698        20,737,263  => 0.7% gaps
   cow                  72454      1        99      99        248      1074158    286        698        20,737,263  => 0.7% gaps


* nucmer params: -l 12 -c 65 -g 1000 -b 1000
* delta-filter -l 200
* 24 * 30 = 720 alignments (except for BtChrU)


* '''Alignments counts'''
* '''Alignments counts'''
   HsChr-BtChr.delta            532,866
                                >=200    >=2000  >=5000
  HsChr-BtChr.filter-1.delta    392,789
   HsChr-BtChr.delta            532,866 39,663  3,570
 
   HsChr-BtChr.filter-1.delta   392,789 38,185   3,560
  HsChr-BtChr.filter-2K.delta  39,663
      
  HsChr-BtChr.filter-2K-1.delta 38,185
 
  HsChr-BtChr.filter-5K.delta   3,570
   HsChr-BtChr.filter-5K-1.delta 3,560
* '''Best alignment summary:'''
  43 chr sets align over 1%
    nl #HsChr    BtChr  HsChrLen  BtChrLen  #al    #2Kal  min  max    median  sum        HsChrAl%  BtChAl%
    1  1        2      247249719  137060424  3229    238    13  9112  618    2763498    1        2
    2  1        3      247249719  121430405  18208  2573  15  19711  855    21390308  8        17
    3  1        16    247249719  81724687  11817  1057  13  26957  727    11384863  4        13
    4  2        2      242951149  137060424  18147  2379  15  28727  857    21057071  8        15
    5 2        11    242951149  107310763  13922  1292  15  21608  740    13879256  5        12
    6  3        1      199501827  158337067  17808  2100  14  24243  837    19561651  9        12
    7  3        22    199501827  61435874  10408  973    14  18031  752    10518104  5        17
    8  4        6      191273063  119458736  14857  1378  14  15997  776    14933999  7        12
    9  4        17    191273063  75158596  5498    507    15  14456  800    5603523    2        7
    10  5        7      180857866  112638659  14201  1891  13  26095  866    16433557  9        14
    11  5        20    180857866  72042655  9426    855    15  13811  767    9274385    5        12
    12  6        9      170899992  105708250  14577  1559  14  18079  818    15402674  9        14
    13  6        23    170899992  52530062  8082    804    14  15593  731    8040414    4        15
    14  7        4      158821424  120829699  17230  1835  13  34598  803    18223663  11        15
    15  7        25    158821424  42904170  2574    129    14  11489  536    1927855    1        4
    16  8        8      146274826  113384836  1891    100    15  11015  598    1491998    1        1
    17  8        14    146274826  84648390  11817  1025  14  30396  744    11637718  7        13
    18  8        27    146274826  45407902  3207    143    14  8051  651    2597335    1        5
    19  9        8      140273252  113384836  14511  1540  14  17317  801    15339451  10        13
    20  9        11    140273252  107310763  2874    253    15  19675  609    2568633    1        2
    21  10        13    135374737  84240350  4371    245    14  12436  691    3784917    2        4
    22  10        26    135374737  51681464  8467    917    13  19568  757    8745051    6        16
    23  10        28    135374737  46312546  5712    567    14  11567  761    5753856    4        12
    24  11        15    134452384  85296676  12655  1340  14  22774  808    13419432  9        15
    25  11        29    134452384  51505224  7120    528    13  16632  685    6446478    4        12
    26  12        5      132349534  121191424  15222  1787  14  29685  804    16677907  12        13
    27  12        17    132349534  75158596  3924    213    13  8484  585    3042898    2        4
    28  13        12    114142980  91163125  11389  837    14  19959  738    10665467  9        11
    29  14        10    106368585  104305016  8626    980    14  17863  807    9394454    8        9
    30  14        21    106368585  71599096  5319    651    14  18953  785    5766996    5        8
    31  15        10    100338915  104305016  7914    1088  14  16238  805    8977632    8        8
    32  15        21    100338915  71599096  5057    438    15  10071  686    4689441    4        6
    33  16        18    88827254  66004023  7632    655    14  19289  711    7291016    8        11
    34  16        25    88827254  42904170  4680    278    12  9947  589    3698904    4        8
    35  17        19    78774742  64057457  11876  1195  13  15954  651    11528275  14        17
    36  18        24    76117153  62714930  9815    755    13  16525  704    9096937    11        14
    37  19        7      63811651  112638659  2718    181    14  9082  507    2027086    3        1
    38 19        18    63811651  66004023   3765    354    12  13895  624    3457684    5        5
    39  20        13    62435964  84240350  8838    692    13  13905  671    7962468    12        9
    40  21        1      46944323  158337067  3653    205    13  10470  644    3032218    6        1
    41  22        5      49691432  121191424  2618    161    14  7411  540    1966904    3         1
     42  22        17    49691432  75158596  1681    114    15  14130  558    1328105    2        1
    43  23        30    154913754  148823899  17707  2895  15  19690  934    22575384  14        15
    ...
    .  .        .      .          .          532794  41843  12  34598  718    493618379  .        .
* 54 chr sets have at least one 5K alignments
* 54 chr sets have at least one 5K alignments



Revision as of 16:18, 30 September 2009

Sequence

The genome of the domestic cow, Bos taurus, was sequenced using a mixture of hierarchical and whole-genome shotgun sequencing methods.

Read download

  • All reads were downloaded from the NCBI Trace Archive (TA) ftp: ftp://ftp.ncbi.nih.gov/pub/TraceDB/bos_taurus/
  • There were 37,829,394 reads organized into 91 volumes
    • 36,820,485 WGS, SHOTGUN, CLONEEND & FINISHING reads
      • 36,170,352 quality reads
      • 650,133 quality-less reads
    • 1,008,909 EST & PCR reads
  • 25,312 read libraries

Sequencing centers

  • Most reads were sequenced by the Baylor College of Medicine
    TRACE_COUNT     CENTER_NAME     
 1  35629020        BCM             Baylor College of Medicine
 2  737900          NISC            NIH Intramural Sequencing Center
 3  652614          BCCAGSC         British Columbia Cancer Agency Genome Sciences Center
 4  378871          MARC            USDA, ARS, US Meat Animal Research Center
 5  114753          UIUC            University of Illinois at Urbana-Champaign
 6  107367          BARC            USDA, ARS, Beltsville Agricultural Research Center
 7  65171           TIGR            The Institute for Genome Research
 8  53556           GSC             Genoscope
 9  43033           CENARGEN        Embrapa Genetic Resources and Biotechnology
 10 18623           SC              The Sanger Center
 11 15301           UOKNOR          University of Oklahoma Norman Campus, Advanced Center for Genome Technology
 12 10651           TIGR_JCVIJTC    The Institute for Genomic Research, Traces generated at JCVIJTC
 13 2485            UIACBCB         University of Iowa Center for Bioinformatics and Computation Biology (UIACBCB)
 14 49              WUGSC           Washington University, Genome Sequencing Center
    37829394        total           total                                    

Trace counts

    TRACE_COUNT   CENTER_NAME     TRACE_TYPE_CODE        
 1  24863599      BCM*            WGS                    
 2  10748529      BCM*            SHOTGUN                
 3  737900        NISC            SHOTGUN                
 4  125597        BCCAGSC         CLONEEND               
 5  114753        UIUC            CLONEEND               
 6  65171         TIGR            CLONEEND               
 7  53556         GSC             CLONEEND               
 8  26246         CENARGEN        WGS                    
 9  25454         BARC            CLONEEND               
 10 16892         BCM*            CLONEEND               
 11 16787         CENARGEN        CLONEEND               
 12 15150         UOKNOR          SHOTGUN                
 13 10651         TIGR_JCVIJTC    CLONEEND               
 14 151           UOKNOR          FINISHING              
 15 49            WUGSC           CLONEEND               
    36820485      total

 16 527017        BCCAGSC         EST
 17 207204        MARC            EST
 18 171667        MARC            PCR
 19 81913         BARC            EST
 20 18623         SC              EST 
 21 2485          UIACBCB         EST
    1008909       total

Data processing

Data issues

Issues:

  • Qualities
    • 650,133 reads don't have quality values and can't be reliably trimmed
  • Libraries
    • There are totally 25,312 libraries
    • Very fragmented especially the SHOTGUN and CLONEEND ones; can't be accurately re-estimated by the assembler
  • Clear ranges
    • Many traces are missing vector trimming coordinates (CLV=CLIP_VECTOR_LEFT..CLIP_VECTOR_RIGHT) or don't contain 3' trimming information (CLIP_VECTOR_RIGHT==0)
    • The read CLV's are need by the Celera Assembler overlap based trimming module (OBT) as input
    • Solution: identify the sequencing vector & linker sequences for each library and re-trim the reads

Identify linkers

For each library identify linker sequences:

  • Separate forward/reverse reads
  • Identify most frequent kmers (8mers,24mers)
  • Check if kmers a overrepresented
  • Verify if the most frequent 8mer is present in the top 10 most frequent 24mers
  • Align 24mers (extend them by a few bp) => linker

Identify vectors

For each library identify vector sequences:

  • Align linkers to the opposite strand sequences (nucmer -l 12 -c 24 -r)
  • Extract the subsequences following to linker (50..150bp)
  • Align the subsequences; if they align we've probably identified the vector
  • Identify the vector name/id by alignment to the UniVec database (nucmer -l 12 -c 24)
  • Check if the forward/reverse vector(s) are the same : we should find a common vector sequence; the UniVec alignments should be adjacent
  • create the Lucy vector & splice files that contain the linker+vector sequences

Trimming

  • Run Lucy on quality reads
  • Get CLV statistics: depending on the library, the Lucy CLV is 20bp+ shorter than the original CLV
  • Trim reads according to Lucy output CLV
  • Align Lucy trimmed reads to linker,vector,splice site & UniVec (there should be no alignments)
  • Method worked on BCM & NISC libraries (~ 98% of the reads)
  • For the other reads use the factory clipping points

BCM reads

  • linker:
 >J01636.linker.fwd 27bp
 TCGAGTTCGACTGCAAGTAGTTCATCA
 >J01636.linker.rev 27bp
 CTAATCAGATGGTACAGTAGTTCATCA
  • vector: J01636 E.coli lactose operon with lacI, lacZ, lacY and lacA genes (7477 bp)
  • avg(original CLV) - avg(Lucy CLV)> 20bp (1015 vs 973 in quality WGS reads , ...)

NISC reads

  • linker:
 >NGB00080.linker.fwd 24bp
 TATCATCGCCACTGTGGTGGAATT
 >NGB00080.linker.rev 26bp
 GCTGAAGCTCCATGTGGTGGAATTCC
  • vector NGB00080 (pOTW13 with linkers)
  • avg(original CLV) - avg(Lucy CLV)> 20bp (771 vs 747)

Preliminary assembly

  • Assembly version: wgs-5.2
  • Use only quality reads
  • Set read CLV to Lucy CLV or original CLV
  • Set non random flag = 1 on all reads except for WGS ones
  • Set obtMerThreshold = 200 (default 1000)
  • Set doOBT = 1

Input

 Reads=35,348,776 # WGS, SHOTGUN, CLONEEND & FINISHING quality reads
 Libraries=25,312   # mostly SHOTGUN and BARC.CLONEEND

Output

 TotalScaffolds=66,141
 MaxBasesInScaffolds=26,048,998
 MeanBasesInScaffolds=40,861
 
 TotalContigsInScaffolds=120,461
 MaxContigLength=627,911
 MeanContigLength=22,436
 
 TotalDegenContigs=269,031
 MaxDegenContig=33,824
 
 SingletonReads=3,721,123
 DeletedReads=421,379 (too short or zero CLR)

Preliminary assembly processing

Read clear ranges

  • Quality reads: extract OBT CLR from gatekeeper store
  • Qualityless reads:
    • Align them to contigs (no degenerates) : nucmer -l 50 -c 200 -b 10 -g 5 -d 0.05
    • Set CLR to the maximum alignment coordinates or 50..min(len,600)
    • Reduce CLR if there are multiple N's or low complexity regions in the read

Contaminant search

Databases:

  • Ecoli : 22 completed genomes + plasmids
  • UniVec_Core 1,348 sequences : mostly cloning vectors & primers, avg 250bp long
  • OtherVec: 100 other vector sequences (mostly complete), identified by aligning UMD2.0 contaminants to GenBank
  • bos_taurus UMD2.0 contaminant : 4,813 whole contigs and 30,329 partial contigs identified by NCBI as contamination in UMD2.0; many partial contigs contained cow sequences as well
  • Databases FASTA files:
 /nfshomes/dpuiu/db/Ecoli.all
 /nfshomes/dpuiu/db/UniVec_Core
 /nfshomes/dpuiu/db/OtherVec
 /nfshomes/dpuiu/db/bos_taurus.UMD2.contaminant.fasta

Alignment parameters:

 nucmer -maxmatch -l 40 -c 100 -b 10 -g 5 -d 0.05 

Contig/degenerate counts:

  • 2,951/1,266 aligned to Ecoli
  • 5,387/1,908 aligned to UniVec_Core
  • 5,657/1,963 aligned to OtherVec

Read/mate counts: TO BE DELETED

  • 40,699/22,607 in contaminated regions

Library estimates

  • Some library estimates are complete wrong
 Example: BCM.SHOTGUN libraries listed as long (180Kbp mean) are all short (2-6Kbp mean)  
  • Extract library insert estimates; merge libraries sequenced by same center that have similar mean/std : 25,312 libs => 344 libs
  • Assign new library ids; assign average means & stdevs to the libraries

Final assembly

  • Assembly version: wgs-5.2
  • Use all traces
  • Set read CLR to:
    • Quality reads: OBT CLR
    • Qualityless reads: alignment coordinates or 50..min(len,600)
  • Set nonRandom flag = 1 on all reads except for WGS reads
  • Set deleted flag = 1 on all reads deleted by OBT in the preliminary assembly
  • Set obtMerThreshold = 200 (default 1000)
  • Set doOBT = 0 (reads have been already trimmed)

Input

 Reads=35,973,728   # WGS, SHOTGUN, CLONEEND & FINISHING with and without qualities
 Libraries=344

Output

 TotalScaffolds=39,978
 TotalContigsInScaffolds=90,135
 MeanBasesInScaffolds=66,947
 MaxBasesInScaffolds=3,3907,885
 
 TotalContigsInScaffolds=90,135
 MeanContigLength=29,693
 MaxContigLength=1,160,130
 
 TotalDegenContigs=251,413
 MaxDegenContig=39,964

 SingletonReads=3,634,305(10.24%)

Final assembly processing

Contaminant search

  • Use same databases and alignment parameters as in preliminary assembly processing
  • Delete full contaminants & trim partial contaminants

Delete summary:

  • 65 Acinetobacter ctgs
  • 91 other contaminant ctgs <2Kbp
  • Total: 156 ctgs, 152 scf, 4105 reads

Trim summary:

  • 12 contigs >=2Kbp , 44 reads

Marker mapping

  • 126,013 total markers
  • Avg distance between markers is 25Kbp; marker position error is 50Kbp
  • Markers were aligned to all contigs/degenerates
  • Best alignments with %IDY>90 & %Matched>85 were identified
  • 107,271 markers align to 31,407 ctg & 2,640 scf
    • 552 scf have markers from multiple chromosomes
    • 212 scf have multiple markers from multiple chromosomes
    • 38 scf have multiple adjacent markers from multiple chromosomes: MIGHT BE MISASSEMBLED
  • 628 markers align to 562 degenerates

Scaffold/contig breaking

  • Analyze 38 scf that have multiple adjacent markers from multiple chromosomes
  • Compute coverage in the suspicious region (between different chromosome markers):
    • read cvg
    • mate ctg: good, bad
  • Break ctg/scf unless the region has "high read cvg" , "high good mate cvg" , "low bad mate cvg"
  • Break summary:
    • 14 scaffolds
    • 15 breaks : 8 on the same contig , 3 on adjacent contigs , 4 on non adjacent contigs

Assignment to chromosomes

Markers

  • 2640 scaffolds and 562 degenerates have markers
  • Assignment to chromosomes: use best alignment & majority rule
  • Position:
    • Filter out outliers according to position on chromosome & scaffold (interquartile range method)
    • Compute the average position on chromosome of the markers
  • Orientation:
    • use LeastSequareFit method : if slope is positive => forward; if slope is negative => reverse
    • if only 1 markers/scaffolds => direction=unknown (0)

Human synteny

  • Align all scaffolds/degenerates to the 24 Human chromosomes; filter all alignments longer than 200bp
 nucmer -mum -l 12 -c 30 -g 1000
 delta-filter -q -l 200 
  • 9,914 scaffolds and 16,527 degenerates align to Human chromosomes; most alignments are short, just over 200bp

Combine Human synteny & Marker data

  • 1,908 scaffolds and 118 degenerates both align to human and contain markers
  • 10,790 scaffolds and 16,590 degenerates align to human or contain markers
  • Try to infer the position/orientation on the chromosomes for the scaffolds/degenerates that align to human but contain no markers
  • Iteratively:
    • Find 2 adjacent scaffolds (preferably on left & right side) which both align to human, contain markers and placements agree (chromosome, position, direction)
    • Otherwise, find 1 adjacent scaffolds which both aligns to human, and contains markers
    • Extrapolate the position/orientation of the "unplaced" sequence based on its neighbor(s)
    • Sort the scaffolds/degenerates based on chromosome positions, identify incorrect markers & alignments, remove them from the input data and repeat the process

By linking information

  • Once scaffolds/degenerates were assign to chromosome use mate pair information to refine placements
  • Identify unplaced scaffolds/degenerates linked to placed scaffolds/degenerates and fit them into gaps

Comparison to UMD2.0

Alignment parameters:

 nucmer -mum -l 200 -c 1000

Haplotype search

Daniela:

  • Place scf/deg on Chr
  • Align each pair X,Y (len(X)<len(Y))of adjacent/overlapping scf/deg : nucmer -mum -l 40 -c 250 ( => avg 96 %id)
  • compute (X,Y) cvg
  • identify the X regions which had no alignments to Y; if the length of these regions were less than 2K bp => X is a variant

Guillaume:

  • there is a contig Y larger than X and, X and Y are placed on top of each other (with some play allowed)
  • there are a high quality sequence alignment such that: at least 200 bases out of the first 400 bases of X align with Y AND at least 200 bases out of the last 400 bases of X align with Y.
  • In other word, the ends of X have to align well with Y, but the middle can be significantly different.

Files:

 /fs/szasmg3/bos_taurus/UMD_Freeze3.0/contigs.haplotype-variants.fa.gz            # 40611 haplotype-variants sequences
 /fs/szasmg3/bos_taurus/UMD_Freeze3.0/contigs.haplotype-variants.ids              # 40611 haplotype-variants sequence ids
 /fs/szasmg3/bos_taurus/UMD_Freeze3.0/contigs.haplotype-variants.pairs            # 40300 pairs (haplotype-variants & reference sequences)
 /fs/szasmg3/bos_taurus/UMD_Freeze3.0/contigs.haplotype-variants.pairs.delta      # 39665 alignment pairs (haplotype-variant is the query)
 /fs/szasmg3/bos_taurus/UMD_Freeze3.0/contigs.haplotype-variants.pairs.cvg        # 39665 coverage  pairs (ref=col 1 ; haplotype-variant=col 5)

Summary:

 .                    elem       <=0        >0         min    max        mean       med        n50        sum
 ctg+deg              40611      0          40611      263    97877      1476       1205       1372       59958728
 ctg                  29452      0          29452      471    97877      1631       1297       1469       48039280
 deg                  11159      0          11159      263    12208      1068       979        1006       11919448

Other Files

/scratch1/bos_taurus/Assembly/2009_0312_CA/scf_placements/UMD_Freeze3.0/
 39864 contigs.haplotype-variants.daniela.pairs
   443 contigs.haplotype-variants.guillaume.pairs
   436 contigs.haplotype-variants.guillaume.pairs.orig
   334 contigs.haplotype-variants.ids.missing

Issues:

 .                    elem       <=0        >0         min    max        mean       med        n50        sum
 missing              339        0          339        462    97877      1613       1011       1204       547135
 mislabeled           6          0          6          2973   97877      31271      8275       97877      187628

Mislabeled haplotypes:

 Chr      begin      end        Pos   W  ctg            1  len    dir
 Chr2     131428091  131475109  5387  W  7180001925346  1  47019  +
 Chr3     11270384   11278308   637   W  7180002020315  1  7925   -
 Chr12    9183395    9281271    429   W  7180002024890  1  97877  +
 Chr14    34404395   34412669   3579  W  7180002021388  1  8275   -
 Chr15    50865607   50889165   2531  W  7180002015261  1  23559  -
 ChrU     8589989    8592961    5791  W  7180002026074  1  2973   +

Chromosome mapping

Assembly Summary

...

Hs vs Bt

  • Goal: find all syntenic regions longer than a certain % of the Cow/Human genome
  • Chromosome counts (include gaps)
 .                    elem       min      q1       q2        q3        max        mean       n50        sum            
 human                24         46944323 78774742 134452384 170899992 247249719  128350811  154913754  3,080,419,480     
 cow                  31         9828056  61435874 84240350  113384836 158337067  86152724   105708250  2,670,734,461
  • Chromosome counts (no gaps)
 .                    elem       sum
 human                24         2,858,012,910
 cow                  31         2649,997,198
  • Gap counts
 .                    elem       min      q1       q2        q3        max        mean       n50        sum
 human                290        100      35000    47000     90000     30000000   766919     17918000   222,406,570 => 7.2% gaps
 cow                  72454      1        99       99        248       1074158    286        698        20,737,263  => 0.7% gaps


  • nucmer params: -l 12 -c 65 -g 1000 -b 1000
  • delta-filter -l 200
  • 24 * 30 = 720 alignments (except for BtChrU)
  • Alignments counts
                               >=200    >=2000   >=5000
 HsChr-BtChr.delta             532,866  39,663   3,570
 HsChr-BtChr.filter-1.delta    392,789  38,185   3,560
   
  • 54 chr sets have at least one 5K alignments