Dpuiu Assemblathon: Difference between revisions

(125 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Links =
   
* http://www.genomeweb.com/informatics/us-european-teams-launch-parallel-challenges-improve-computational-methods-genom
* [http://assemblathon.org/ The Assemblathon] University of California, Santa Cruz & UC Davis;  synthetic & real genome.
* [http://cnag.bsc.es/ dnGASP] De Novo Genome Assembly Assessment Project (dnGASP):  Centro Nacional de Análisis Genómico in Barcelona, Spain, synthetic genome
* [http://gage.cbcb.umd.edu/ GAGE]
* [http://www.genomeweb.com/informatics/us-european-teams-launch-parallel-challenges-improve-computational-methods-genom genomeweb announcement]


= GAGE =
Line 7: Line 10:
* Location
   http://gage.cbcb.umd.edu/ -> /fs/web-cbcb-new/html/gage
* Answer the following questions:
# How much sequencing coverage do I need for my genome project?
# What can I expect the resulting assembly to look like?
# Which assembly software should I use?     
# What parameters should I use when I run the software?
= Read correction =
* quake
  echo frag_1.fastq      frag_2.fastq      >  genome.ls
  echo shortjump_1.fastq shortjump_2.fastq >> genome.ls
  echo longjump_1.fastq  longjump_2.fastq  >> genome.ls
  /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/quake.py -f genome.ls  -k 18 -p 20 >&! quake.log


= Assemblers =


<pre>
* CA            /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/Linux-amd64/bin/runCA
* Newbler
* Velvet
* SOAPdenovo
* Maq
</pre>

* [http://www.broadinstitute.org/software/allpaths-lg/blog/ Allpaths-LG]
  paths:
    /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/allpaths3-35218/
    /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/
 
  RunAllPaths3G \
    PRE=$PWD REFERENCE_NAME=. DATA_SUBDIR=. RUN=allpaths SUBDIR=run1.orig THREADS=$P
 
* [http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page CA]              
  paths:
    /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/
    /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/
 
  runCA \
    -d . \
    -p asm \
    -s /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/runCA.parallel.spec \
    doOverlapBasedTrimming=0 ovlOverlapper=ovl unitigger=bog bogBreakAtIntersections=0 bogBadMateDepth=1000 \
    *.frg
 
* [http://www.ebi.ac.uk/~zerbino/velvet/ Velvet]       
  paths:
    /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/velvet_1.0.13/
    /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/
 
  velveth . $K -fastq \
    -shortPaired  frag_12.fastq \
    -shortPaired2 shortjump_12.rev.fastq \
    -shortPaired3 longjump_12.fastq
  velvetg . -exp_cov auto \
    -ins_length  $MEA_FRAG      -ins_length_sd  $STD_FRAG \
    -ins_length2 $MEA_SHORTJUMP -ins_length2_sd $STD_SHORTJUMP \
    -ins_length3 $MEA_LONGJUMP  -ins_length3_sd $STD_LONGJUMP \
    -scaffolding yes -exportFiltered yes -unused_reads yes
 
* [http://soap.genomics.org.cn/soapdenovo.html SOAPdenovo]   
  paths:
    /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/SOAPdenovo-V1.05/
    /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/
    /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/GapCloser/
 
  echo "[LIB]\navg_ins=$MEA_FRAG\nreverse_seq=0\nasm_flags=1\nrank=1\nq1=frag_1.fastq\nq2=frag_2.fastq\n" >! SOAPdenovo.config
  echo "[LIB]\navg_ins=$MEA_SHORTJUMP\nreverse_seq=1\nasm_flags=2\nrank=2\nq1=shortjump_1.fastq\nq2=shortjump_2.fastq\n" >> SOAPdenovo.config
  echo "[LIB]\navg_ins=$MEA_LONGJUMP\nreverse_seq=0\nasm_flags=2\nrank=4\nq1=longjump_1.fastq\nq2=longjump_2.fastq\n" >> SOAPdenovo.config
  SOAPdenovo all -K $K -p $P -s ./SOAPdenovo.config -o asm
  GapCloser -b SOAPdenovo.config -a asm.scafSeq -o asm2.scafSeq -t $P -p 31
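
The three `echo` commands above build one `[LIB]` section per library. A minimal Python sketch of the same config file follows. The keys (`avg_ins`, `reverse_seq`, `asm_flags`, `rank`, `q1`, `q2`) and their per-library values are copied from the commands above; the insert sizes are hypothetical placeholders standing in for `$MEA_FRAG`, `$MEA_SHORTJUMP` and `$MEA_LONGJUMP`, and the helper names are made up for illustration.

```python
# Sketch: build the text of SOAPdenovo.config, one [LIB] section per library.
# Insert sizes below are placeholders, not measured values.

def soap_lib_section(avg_ins, reverse_seq, asm_flags, rank, q1, q2):
    """Return one [LIB] section as text, ending with a newline."""
    lines = [
        "[LIB]",
        f"avg_ins={avg_ins}",
        f"reverse_seq={reverse_seq}",
        f"asm_flags={asm_flags}",
        f"rank={rank}",
        f"q1={q1}",
        f"q2={q2}",
    ]
    return "\n".join(lines) + "\n"


def soap_config(libs):
    """Join the sections with a blank line between them, as the echo commands do."""
    return "\n".join(soap_lib_section(**lib) for lib in libs)


LIBS = [
    dict(avg_ins=180, reverse_seq=0, asm_flags=1, rank=1,
         q1="frag_1.fastq", q2="frag_2.fastq"),
    dict(avg_ins=3500, reverse_seq=1, asm_flags=2, rank=2,
         q1="shortjump_1.fastq", q2="shortjump_2.fastq"),
    dict(avg_ins=35000, reverse_seq=0, asm_flags=2, rank=4,
         q1="longjump_1.fastq", q2="longjump_2.fastq"),
]

config = soap_config(LIBS)
print(config)
```

Writing `config` to `SOAPdenovo.config` would give a file equivalent to the one the shell commands create, without depending on how the shell's `echo` treats `\n`.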
 
* [http://www.genome.umd.edu/SR_CA_MANUAL.htm MSR-CA] Maryland Super-Reads + Celera Assembler
  paths:
    /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/SR-CA-1.1/CA/Linux-amd64/bin/
* [http://www.bcgsc.ca/platform/bioinfo/software/abyss ABySS]
  paths:
    /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/abyss-1.2.7/
    /fs/szdevel/core-cbcb-software/Linux-x86_64/bin
 
  abyss-pe  \
    k=$K n=5 name=asm lib='frag short' frag=frag_12.fastq short=short_12.fastq aligner=bowtie
 
* [https://github.com/jts/sga/wiki SGA]           
  paths:
    /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/sga/src/SGA/                     # Version: 0.9.8
    /fs/szdevel/core-cbcb-software/Linux-x86_64/bin
 
  sga preprocess -p 1 frag_?.fastq > frag.pp.fa
  sga index -t $P frag.pp.fa
  sga correct -k $K -t $P frag.pp.fa -o frag.pp.ec.fa 
  sga index -t $P frag.pp.ec.fa
  sga filter frag.pp.ec.fa
  sga overlap -t $P frag.pp.ec.filter.pass.fa
  sga assemble frag.pp.ec.filter.pass.asqg.gz


= CBCB genomes =
Line 26: Line 114:
If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one.  I'm thinking we should also trim all the data with Quake.


== Argentine ant ==
== Bombus impatiens ==
== Bee, Bombus impatiens ==
 
=== Data ===
 
* Estimated haploid genome size: 250M
* 497,318,144 Illumina 124bp reads (246X cvg)
* Reads:
  .        readLen  orientation  insLen  #reads      readCvg  comments 
  frag    124      innie        400    303,118,594  150X    6 libs
  short    124      outie        3-8K    194,199,550  96X      2 libs
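
The readCvg column above appears to follow from reads x read length / genome size. A small sketch that reproduces the table's figures, assuming the 250M haploid genome size estimate and truncation to whole numbers (the function name is illustrative only):

```python
# Sketch: sequencing coverage = number of reads * read length / genome size.
# All inputs are taken from the Bombus impatiens Data table.

def read_coverage(n_reads, read_len, genome_size):
    """Return average per-base read coverage as a float."""
    return n_reads * read_len / genome_size


GENOME_SIZE = 250_000_000      # estimated haploid genome size
READ_LEN = 124                 # Illumina read length in bp
LIBS = {"frag": 303_118_594, "short": 194_199_550}

for name, n_reads in LIBS.items():
    print(f"{name:6s} {int(read_coverage(n_reads, READ_LEN, GENOME_SIZE))}X")

total_reads = sum(LIBS.values())
print(f"total  {int(read_coverage(total_reads, READ_LEN, GENOME_SIZE))}X")
```

The per-library values come out at about 150X and 96X and the total at about 246X, in line with the table.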


Data
* [https://wiki.umiacs.umd.edu/cbcb/index.php/Bumblebee#Traces Online Traces]
* 497,318,144 Illumina 124bp reads
* 8 libraries; inserts:
**  400bp
**  3k (outie)
**  8k (outie)
* [https://wiki.umiacs.umd.edu/cbcb/index.php/Bumblebee#Traces Traces]


Adapters: in 3k & 8k libraries
* Issue: Adapters: in 3k & 8k libraries
  C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
  3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
  5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG


Location:
* Read directories:
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt                            # original fastq files
   
   
Line 48: Line 138:
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)


== Bacterium, Staph aureus USA300 ==
* Original read files:
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_2_1_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_2_2_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_3_1_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_3_2_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_5_1_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_5_2_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_6_1_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_6_2_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_7_1_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_7_2_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_8_1_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_8_2_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_9_1_sequence.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/s_9_2_sequence.txt 
 
* Quake corrected files:
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_1_1_sequence.cor.rev.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_1_2_sequence.cor.rev.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_2_1_sequence.cor.rev.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_2_2_sequence.cor.rev.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_3_1_sequence.cor.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_3_2_sequence.cor.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_5_1_sequence.cor.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_5_2_sequence.cor.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_6_1_sequence.cor.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_6_2_sequence.cor.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_7_1_sequence.cor.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_7_2_sequence.cor.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_8_1_sequence.cor.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_8_2_sequence.cor.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_9_1_sequence.cor.rev.txt 
  /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_9_2_sequence.cor.rev.txt
 
* k_unitig corrected files: (in progress --[[User:Dpuiu|Dpuiu]] 10:38, 5 April 2011 (EDT))
 
=== Assembly ===


* [[Media:Bombus_impatiens.assembly.summary|Bombus_impatiens.assembly.summary]]
* Assembly directories:
  CA.quakeCor                 /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/                              # Celera Assembly
  SOAPdenovo.quakeCor(K=47)   /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.K47.s_1-9.cor/                        # SOAPdenovo assembly (2011) K=47 quake corrected reads
  #SOAPdenovo.orig(K=47)      /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.K47.s_1-9.orig/                       # SOAPdenovo assembly (2011) K=47 original reads
  #SOAPdenovo.quakeCor(K=31)  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/                            # SOAPdenovo assembly (2010) K=31 quake corrected reads (prev assembler version)
  MSR-CA                      /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/MSR-CA  ; genome9.umd.edu:/genome9/raid/alekseyz/GAGE/bombus/assembly/CA      # MSR-CA

== Staph aureus USA300 ==

  Complete genome        : NC_010079       2872915bp Staphylococcus aureus subsp. aureus USA300_TCH1516
  In progress genome     : NZ_AASB00000000 2810505bp Staphylococcus aureus subsp. aureus USA300_TCH959, 256 contigs

  454 FLX                : Staphylococcus aureus subsp. aureus USA300_TCH959 HMP0023  http://www.ncbi.nlm.nih.gov/sra/SRX002327?report=full
  Illumina 101bp paired  : Staphylococcus aureus subsp. aureus USA300_TCH1516         http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full


=== 454FLX ===

* Data
                         reads   min  q1   q2   q3   max  mean  n50  sum
  sff(original)          334241  36   201  256  277  362  235   264  78432227
  sffToCA(adaptor free)  325555  51   103  148  208  358  157   185  51225615  # 58723 mates

=== Data ===

* Complete genome:
  id            len      description
  NC_010079     2872915  Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome
  NC_010063.1   27041    Staphylococcus aureus subsp. aureus USA300_TCH1516 plasmid pUSA300HOUMR, complete sequence
  NC_012417.1   3125     Staphylococcus aureus subsp. aureus USA300_TCH1516 plasmid pUSA01-HOU, complete sequence
                2903081  total


* Reads (90X):
  .           readLen  insLen  orientation  #reads     readCvg  SRA runs
  frag        101      180     innie        1,294,104  45X      SRR022868
  shortjump   37       3500    outie        3,494,070  45X      SRR022865

  [http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP001086 SRP001086] Staphylococcus aureus Sequencing on Illumina
  [http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full SRX007714] pair lib
  [http://www.ncbi.nlm.nih.gov/sra/SRX007711?report=full SRX007711] jumping lib

* Read directories:
  /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100/
  /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminaj/

* DeNovo
  # ctg stats
  .                     ctgs  min   q1     q2      q3       max       mean    n50       sum
  CA.6.1.bog            18    238   21014  135971  239466   567548*   155836  277888*   2805055
  newbler.2.5p1.deNovo  100   103   295    3287    39467    229053    27879   78379     2787870

  # scf stats
  CA.6.1.bog            6     284   21014  173065  1032129  1458733*  467554  1458733*  2805325
  newbler.2.5p1.deNovo  8     2475  20731  110137  1030785  1408642   349895  1408642   2799157

* Reference based(Saureus USA300)
  .                     ctgs  min   q1     q2      q3       max       mean    n50       sum
  newbler.2.3.refMapper 206   103   556    3098    15366    117487    12749   40687     2626469

=== Illumina101.78X.cor ===

* Comments:
  mated reads; lib mean/std (CA estimates)=170/21
  first 2 and last base show composition bias; should we trim them???

* Data
                     reads       min  max  mean  sum         cvg
  Total              30,597,352  101  101  101   3090332552  1065
  Sampled            2,295,176   101  101  101   231812776   78
  Corrected(paired)  1,479,510   30   101  89    131973249   45.9

* Original read files:
  /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100/frag_1.fastq
  /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100/frag_2.fastq
  /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminaj/short_1.fastq
  /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminaj/short_2.fastq

* Quake corrected files:
  /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/frag_1.cor.fastq
  /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/frag_2.cor.fastq
  /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/short_1.cor.fastq
  /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/short_2.cor.fastq


* Allpaths-LG corrected files:
  /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/frag_1.cor.fasta
  /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/frag_2.cor.fasta
  /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/short_1.cor.fasta
  /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/short_2.cor.fasta

* k_unitig corrected files:
  /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/frag_1.cor.seq
  /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/frag_2.cor.seq
  /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/short_1.cor.seq
  /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/short_2.cor.seq

* DeNovo
  # ctg stats
              elem  min   q1     q2     q3     max      mean   n50     sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
  CA.bog      73    1716  14503  26431  52086  148524*  39043  58038*  2850157  31075   7700    31075   75      14            6
  SOAPdenovo  9382  32    32     52     63     85850    347    16726   3259522  34960   722     37358   24      0             0
  velvet      453   61    85     224    3279   137163   6297   36496   2852552  53075   10478   53410   156     17            6

  # scf stats
              elem  min   q1     q2     q3     max      mean   n50     sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
  CA.bog      67    1716  14869  32929  58038  148524*  42541  65383*  2850277  31075   7700    31075   75      20            7             # 1 large rearrangement : scf120001252361 : NC_010079:622381-671743
  SOAPdenovo  186   100   333    1528   17623  144079   15625  55558   2906207  61878   26070   62049   65      453           6
  velvet      427   61    82     177    2982   137163   6685   37874   2854649  53268   10478   53623   156     42            8
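
The n50 column in the contig and scaffold statistics for this genome summarizes contiguity. A minimal sketch of the usual definition, assuming N50 is taken relative to the total assembly length (the sum column) rather than the reference genome size; the page does not say which convention its stats script uses:

```python
# Sketch: N50 = length L such that sequences of length >= L
# together cover at least half of the total assembled bases.

def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy example: total is 54, and 10 + 9 + 8 = 27 reaches half of it.
example = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(example))
```

Applied to the lengths of the sequences in an assembly FASTA file, the same function would give a value comparable to the n50 column, provided the convention matches.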


* Location
  /fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100.78X/
  /fs/szattic-asmg4/dpuiu/HTS/Staphylococcus_aureus/Assembly//Illuminap100.78X.cor/

=== Assembly ===


=== Illumina101.150X.cor ===
* [[Media:Staphylococcus_aureus.genome.summary|Staphylococcus_aureus.genome.summary]]
* Assembly directories:
  allpaths.orig                    /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpaths
  CA.orig                          /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/CA.orig
  CA.quakeCor                      /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/CA.quakeCor.k18
  CA.allpathsCor                    /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/CA.allpathsCor
  CA.SuperReads                    /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/CA.SuperReads.latest
  SOAPdenovo.orig(K=31)            /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K31.orig
  SOAPdenovo.orig(K=47)            /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K47.orig
  SOAPdenovo.quakeCor(K=31)        /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K31.quakeCor.k18
  SOAPdenovo.quakeCor(K=47)        /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K47.quakeCor.k18
  SOAPdenovo.allpathsCor(K=31)      /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K31.allpathsCor
  SOAPdenovo.allpathsCor(K=47)      /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K47.allpathsCor
  velvet.orig                      /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/velvet.orig
  velvet.quakeCor                  /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/velvet.quakeCor.k18
  velvet.allpathsCor                /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/velvet.allpathsCor
 
  ABYSS.quakeCor                    /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/ABYSS.K31.quakeCor.k18
  SGA.orig                          /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SGA.orig
 
* SOAPdenovo v1.05 :
** new quake version did not help much (quake-0.2.2 vs davek44-error_correction-28dbe11)
** SOAPdenovo map -K 37+ : fails on quakeCor.k18 corrected reads
** according to kmerFreq, should probably not use -K > 47
** longer kmer => longer scaffolds (K=63 : largest N50scf)
** longer kmer => shorter contigs  (K=31 : largest N50ctg)
** K40+ too large: no "valley" in the kmerFreq histogram
  paste SOAPdenovo.K??.quakeCor.k18/genome.K??.kmerFreq | nl0 | head
  paste SOAPdenovo.K??.allpathsCor/genome.K??.kmerFreq | nl0 | more
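
The "valley" in the notes above is the dip in the k-mer frequency histogram between low-multiplicity (mostly erroneous) k-mers and the main coverage peak. A rough sketch of checking for it, assuming the histogram has already been loaded into a list whose entry `i` is the number of distinct k-mers seen `i + 1` times. The loading step and the function name are hypothetical, and real histograms are noisier than this strict test allows for.

```python
# Sketch: find the first strict local minimum in a k-mer frequency histogram.
# hist[i] = number of distinct k-mers with multiplicity i + 1.

def find_valley(hist):
    """Return the multiplicity of the first local minimum, or None if there is none."""
    for i in range(1, len(hist) - 1):
        if hist[i] < hist[i - 1] and hist[i] < hist[i + 1]:
            return i + 1  # convert list index to multiplicity
    return None


# Toy histograms, not real data.
with_valley = [1000, 200, 50, 20, 60, 300, 800, 500, 100]
monotone = [1000, 500, 200, 80, 30, 10]

print(find_valley(with_valley))
print(find_valley(monotone))
```

A histogram that only decreases, as described for the large K values, returns `None`, meaning error k-mers and genomic k-mers are no longer separable by multiplicity.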


* Comments:
  first 2 and last base show composition bias; should we trim them???

* Data
                     reads       min  max  mean  sum         cvg
  Total              30,597,352  101  101  101   3090332552  1065
  Sampled            4,404,626   101  101  101   444867226   154
  Corrected(paired)  2,815,584   30   101  89    252331825   87

* DeNovo
  # ctg stats
              elem   min   q1     q2     q3     max      mean   n50      sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
  CA.bog      47     1866  11740  37171  88747  388257*  60272  121078*  2832762  39703   28204   39775   142     12            9
  SOAPdenovo  15229  32    33     48     63     74601    232    13773    3530251  31690   269     36416   17      0             0
  velvet      429    61    86     206    3147   134034   6650   40448    2852695  52754   9400    52940   152     24            9

  # scf stats
              elem   min   q1     q2     q3     max      mean   n50      sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
  CA.bog      46     1866  11740  39650  88747  388257*  61582  129426*  2832782  39703   28204   39775   142     13            10
  SOAPdenovo  158    101   202    1249   20910  150305   18353  74149    2899760  64028   30903   64392   52      506           7
  velvet      409    61    85     180    2854   142341   6978   42466    2853854  52940   9400    52751   150     43            12

== Rhodobacter sphaeroides ==

=== Data ===

* Complete genome: 2 chromosomes, 5 plasmids
  id            len      description
  CP000143      3188609  Rhodobacter sphaeroides 2.4.1 chromosome 1, complete sequence.
  CP000144      943016   Rhodobacter sphaeroides 2.4.1 chromosome 2, complete sequence.
  DQ232586      114045   Rhodobacter sphaeroides 2.4.1 plasmid A, partial sequence.
  CP000145      114178   Rhodobacter sphaeroides 2.4.1 plasmid B, complete sequence.
  CP000146      105284   Rhodobacter sphaeroides 2.4.1 plasmid C, complete sequence.
  CP000147      100828   Rhodobacter sphaeroides 2.4.1 plasmid D, complete sequence.
  DQ232587      37100    Rhodobacter sphaeroides 2.4.1 plasmid E, partial sequence.
                4603060  total

* Reads (90X):
  .           readLen  insLen  orientation  #reads     readCvg  SRA runs
  frag        101      180     innie        2,050,868  45X      SRR081522
  shortjump   101      3500    outie        2,050,868  45X      SRR034528

* SRA traces
  [http://www.ncbi.nlm.nih.gov/sra/SRX033397?report=full SRX033397] pair lib ;    readLen=101 ; insMea=180
  [http://www.ncbi.nlm.nih.gov/sra/SRX016063?report=full SRX016063] jumping lib ; readLen=101 ; insMea~=3455; ~15% of the mates are short inserts (~250bp)
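
The Rhodobacter read counts are consistent with sampling each library down to a fixed 45X target. A sketch of that arithmetic follows, assuming the target is applied to the 4,603,060 bp genome total with 101 bp reads and that the count is rounded down to an even number so mates stay paired. The page does not state how the sampling was actually done, so this is a guess.

```python
# Sketch: number of reads to sample for a target coverage.
# reads = target coverage * genome size / read length, rounded down to an even count.

def reads_for_coverage(target_cov, genome_size, read_len):
    """Return the largest even read count giving at most target_cov coverage."""
    n_reads = (target_cov * genome_size) // read_len
    return n_reads - (n_reads % 2)  # keep pairs intact


GENOME_SIZE = 4_603_060   # Rhodobacter sphaeroides 2.4.1, chromosomes plus plasmids
READ_LEN = 101

n_frag = reads_for_coverage(45, GENOME_SIZE, READ_LEN)
print(f"{n_frag:,} reads for 45X")
```

With these inputs the result is 2,050,868 reads, the #reads value listed for both the frag and shortjump libraries.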


* Original read files:
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminap/frag_1.fastq
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminap/frag_2.fastq
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminaj/short_1.fastq
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminaj/short_2.fastq

* Quake corrected read files:
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/frag_1.cor.fastq
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/frag_2.cor.fastq
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/short_1.cor.fastq
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/short_2.cor.fastq

* QuakeIter2 corrected read files:
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/frag_1.cor.fastq
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/frag_2.cor.fastq
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/short_1.cor.fastq
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/short_2.cor.fastq

* Allpaths-LG corrected files:
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/frag_1.cor.fasta
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/frag_2.cor.fasta
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/short_1.cor.fasta
  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/short_2.cor.fasta

* k_unitig corrected files:
  /nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/frag_1.cor.seq
  /nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/frag_2.cor.seq
  /nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/short_1.cor.seq
  /nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/short_2.cor.seq

== Bacterium, E coli ==

  Complete genome        : NC_000913      4639675bp  Escherichia coli str. K-12 substr. MG1655

  454 FLX                : Escherichia coli str. K-12 substr. MG1655                  http://www.ncbi.nlm.nih.gov/sra/SRX000348?report=full
  Illumina 101bp paired  : Escherichia coli str. K-12 substr. MG1655                  http://www.ncbi.nlm.nih.gov/sra/SRX016044?report=full

=== Illumina101.cor ===

* Data
  mated reads; lib mean/std (CA estimates)=160/24

                     reads       min  max  mean  sum         cvg
  Total              20,635,060  101  101  101   2084141060  449
  Sampled            3,591,676   101  101  101   362759276   78
  Corrected(paired)  1,556,316   30   101  79    122785294   26


* DeNovo
  # ctg stats
              elem  min   q1    q2    q3     max      mean   n50     sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
  CA.bog      677   1006  2265  4527  8853   45500    6604   10186   4471053  189441  162537  190445  170     16            4
  SOAPdenovo  4000  32    35    57    101    43816    1172   9327    4687158  63179   3026    78607   67      1             0
  velvet      711   61    166   2135  8579   57936*   6375   16400*  4532895  113612  30252   114128  166     31            5

  # scf stats
              elem  min   q1    q2    q3     max      mean   n50     sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+
  CA.bog      658   1006  2343  4621  9033   45500    6795   10483   4471433  189441  162537  190445  170     35            4
  SOAPdenovo  270   100   269   3431  24735  219248*  17273  53882*  4663670  111651  71505   112331  128     826           2
  velvet      480   61    92    1147  13028  114907   9523   32233   4571240  113936  30253   114192  161     262           11

* Location
  /fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Data/Illuminap100/
  /fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Assembly/Illuminap100.cor/

=== Assembly ===

* [[Media:Rhodobacter_sphaeroides.assembly.summary|Rhodobacter_sphaeroides.assembly.summary]]
* [[Media:Rhodobacter_sphaeroides.runCA|Rhodobacter_sphaeroides.runCA]] modified CA run
* Assembly directories:
  allpaths.orig                    /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpaths
 
  CA.orig                          /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/CA.orig
  CA.quakeCor                      /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/CA.quakeCor.k18
  CA.allpathsCor                   /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/CA.allpathsCor
  CA.SuperReads                    /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/CA.SuperReads.latest
  SOAPdenovo.orig(K=31)            /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.orig/K31
  SOAPdenovo.orig(K=47)            /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.orig
  SOAPdenovo.quakeCor(K=31)        /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.quakeCor.k18/K31
  SOAPdenovo.quakeCor(K=47)        /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.quakeCor.k18
  SOAPdenovo.allpathsCor(K=31)     /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.allpathsCor/K31
  SOAPdenovo.allpathsCor(K=47)     /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.allpathsCor
  velvet.orig                      /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/velvet.orig
  velvet.quakeCor                  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/velvet.quakeCor
  velvet.allpathsCor               /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/velvet.allpathsCor
  ABYSS.quakeCor                   /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/ABYSS.K31.quakeCor.k18
 
  SGA.orig                         /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SGA.quakeCor.k18


== Human, a single chromosome, medium-sized ==
=== Data ===


* Latest online assembly
   ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/
   NC_000014.8    107,349,540  # total, with telomeric N's
                   88,289,540  # clean


* Human bowtie indexes  
   /fs/szdata/bowtie_indexes/h_sapiens_37_asm


* Illumina data
* Chr14 filtered reads (69.3X):
  .            readLen  insLen        orientation    #reads        readCvg       
  frag        101      155          innie          36,504,800    42
  shortjump    101      2283-2803    outie          22,669,408    26
  longjump    76-101  35295-35318  innie          2,405,064    1.3
 
* Illumina reads (all genome)
   [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP003680 Human NA12878 Genome on Illumina]  
   ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680/
Line 179: Line 377:


   #Fragment (mean insert size: 155bp, SD 26), 101 bp read length
   Lib          #Spots  #Bases  #Reads    #Mates    ReadLen  InsMea  InsStd  InsMin  InsMax  TrimReadLen   Comments
   SRR067787    82.4M  16.6G  652448124  324283604  101      155    26    77      458                     Human HapMap individual NA12878 HiSeq 2000
   SRR067789    82.6M  16.7G  654133372  324876520  101      155    26    77      458
   SRR067780    83.3M  16.8G  660001672  328021140  101      155    26    77      458
Line 187: Line 385:
   SRR067784    83.3M  16.8G  660118460  328244560  101      155    26    77      458
   SRR067785    81.6M  16.5G  646350512  321174108  101      155    26    77      458
   SRR067792    83.8M  16.9G  663997828  330084304  101      155    26    77      458
   SRR067577    46.3M  9.3G    367673108  183472948  101      155    26    77      458                     Human HapMap individual NA12878 Illumina GAII
   SRR067579    46.0M  9.3G    365743380  182532676  101      155    26    77      458
   SRR067578    46.5M  9.4G    369557476  184410788  101      155    26    77      458
    
    
   #Jumping1 (mean insert size: 2283bp, SD 221), 101 bp read length
   #Jumping1 (mean insert size: 2283bp, SD 221), 101 bp read length
   SRR067771    81.5M  16.5G  644846296  320822716  101      2283    221    1620    2586  
   SRR067771    81.5M  16.5G  644846296  320822716  101      2283    221    1620    2586                     Human HapMap individual NA12878 HiSeq 2000
   SRR067777    82.6M  16.7G  653163608  325232944  101      2283    221    1620    2586     
   SRR067777    82.6M  16.7G  653163608  325232944  101      2283    221    1620    2586     
   SRR067781    82.1M  16.6G  649748720  323656576  101      2283    221    1620    2586     
   SRR067781    82.1M  16.6G  649748720  323656576  101      2283    221    1620    2586     
Line 199: Line 398:
    
    
   #Jumping2 (mean insert size: 2803bp, SD 271), 101 bp read length
   #Jumping2 (mean insert size: 2803bp, SD 271), 101 bp read length
   SRR067773    93.1M  18.8G  736456192  366884512  101      2803    271    1990    3106  
   SRR067773    93.1M  18.8G  736456192  366884512  101      2803    271    1990    3106                     Human HapMap individual NA12878 HiSeq 2000
   SRR067779    94.0M  19.0G  743564440  370214028  101      2803    271    1990    3106     
   SRR067779    94.0M  19.0G  743564440  370214028  101      2803    271    1990    3106     
   SRR067778    97.3M  19.6G  767984324  381879652  101      2803    271    1990    3106     
   SRR067778    97.3M  19.6G  767984324  381879652  101      2803    271    1990    3106     
Line 205: Line 404:
    
    
   #Fosmid1  (mean insert size: 35295bp, SD 2703), 76 bp read length
   #Fosmid1  (mean insert size: 35295bp, SD 2703), 76 bp read length
   SRR068214    13.1M  2.0G    104505420  52087176  76      35295  2703  27186  35523  36(trim 20bp at 5',20bp at 3')
   SRR068214    13.1M  2.0G    104505420  52087176  76      35295  2703  27186  35523  36(trim 20bp at 5',20bp at 3')       Human HapMap individual NA12878 Illumina GAII
   SRR068211    4.8M    736.9M  38612196  19252408  76      35295  2703  27186  35523  36(trim 20bp at 5',20bp at 3')
   SRR068211    4.8M    736.9M  38612196  19252408  76      35295  2703  27186  35523  36(trim 20bp at 5',20bp at 3')       Human HapMap individual NA12878 Illumina GAII
    
    
   #Fosmid2 (mean insert size: 35318bp, SD 2759),  101 bp read length
   #Fosmid2 (mean insert size: 35318bp, SD 2759),  101 bp read length
   SRR068335    67.4M  13.6G  533805860  265481252  101      35318  2759  27041  35621  61(trim 20bp at 5',20bp at 3')
   SRR068335    67.4M  13.6G  533805860  265481252  101      35318  2759  27041  35621  61(trim 20bp at 5',20bp at 3')       Human HapMap individual NA12878 HiSeq 2000


* Comments
* Comments
** Human chromosome 14.  The chromosome may change, but this is a new data set with 100X coverage in 100bp and 76bp reads, just assembled by the Broad group using Allpaths-LG and Soap.  We've downloaded the data and Todd is going to create a data set representing just chr 14, to make it feasible.  We'll then try to assemble that data w/all 3 assemblers: CA, SOAP, Allpaths-LG.
** Human chromosome 14.  The chromosome may change, but this is a new data set with 100X coverage in 100bp and 76bp reads, just assembled by the Broad group using Allpaths-LG and Soap.  We've downloaded the data and Todd is going to create a data set representing just chr 14, to make it feasible.  We'll then try to assemble that data w/all 3 assemblers: CA, SOAP, Allpaths-LG.
* Illumina chr14 reads (aligned with bowtie & corrected)
  /fs/szattic-asmg8/treangen/*fastq
  hard to align: bowtie -5 20 -3 20 -e 1000 ...
  jumping reads: only pairs aligned within the correct insert mean and stdev were selected; these libraries usually have a high % of short inserts!
* Original read files:
  /fs/szattic-asmg8/treangen/chr14_fragment_1.fastq 
  /fs/szattic-asmg8/treangen/chr14_fragment_2.fastq 
  /fs/szattic-asmg8/treangen/chr14_shortjump_1.fastq 
  /fs/szattic-asmg8/treangen/chr14_shortjump_2.fastq 
  /fs/szattic-asmg8/treangen/chr14_longjump_1.fastq 
  /fs/szattic-asmg8/treangen/chr14_longjump_2.fastq 
* Quake corrected files:
  /fs/szattic-asmg8/treangen/chr14_fragment_1.cor.fastq 
  /fs/szattic-asmg8/treangen/chr14_fragment_2.cor.fastq 
  /fs/szattic-asmg8/treangen/chr14_shortjump_1.cor.fastq 
  /fs/szattic-asmg8/treangen/chr14_shortjump_2.cor.fastq 
  /fs/szattic-asmg8/treangen/chr14_longjump_1.cor.fastq 
  /fs/szattic-asmg8/treangen/chr14_longjump_2.cor.fastq 
* Allpaths-LG corrected files:
  /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_fragment_1.cor.fasta
  /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_fragment_2.cor.fasta
  /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_shortjump_1.cor.fasta
  /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_shortjump_2.cor.fasta
  /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_longjump_1.cor.fasta
  /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_longjump_2.cor.fasta
=== Assembly ===
* [[Media:Homo_sapiens.assembly.summary|Homo_sapiens.assembly.summary]]
* Assembly directories
  allpaths                        /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpaths
  CA.allpathsCor                  /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/CA.allpathsCor , /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/CA.allpathsCor             
  CA.quakeCor                      /fs/szattic-asmg8/tmagoc/GAGE/human
  CA.SuperReads                    ginkgo:/scratch1/dpuiu/HTS/Homo_sapiens/Assembly/CA.SuperReads
  SOAPdenovo.orig(K=47)          /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.orig/
  SOAPdenovo.quakeCor(K=31)      /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.quakeCor/K31    , /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.quakeCor/K31
  SOAPdenovo.quakeCor(K=47)      /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.quakeCor/      , /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.quakeCor
  SOAPdenovo.allpathsCor(K=31)    /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.allpathsCor/K31 , /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.allpathsCor/K31
  SOAPdenovo.allpathsCor(K=47)    /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.allpathsCor ,    /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.allpathsCor
  velvet.quakeCor                /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/velvet.quakeCor
  ABYSS.quakeCor                  /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/ABYSS.K31.quakeCor.K18  ,    /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/ABYSS.K31.quakeCor.K18
  SGA.orig                        /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SGA.orig                ,    /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/SGA.orig
=== Allpaths-lg ===
* Read counts
                              orig      cor              cor(paired,all >64bp)
  chr14_fragment_12.fastq    36504800  35571477(97.44%)  34268444(10+bp ovl F/R)
  chr14_shortjump_12.fastq    22669408  11255320(49.64%)  11255320
  chr14_longjump_12.fastq    2405064    187398  (7.79%)  187398 
* Assembly stats:
  .          elem  min    q1    q2    q3      max      mean    n50      sum     
  scf        418  96    131    256    1236    81646936  209781  81646936  87688255 
  scf10K+    17    10330  11780  26536  269876  81646936  5135452  81646936  87302692 
  ctg        4722  96    2342  9101  24174  240773    17887    36530      84461065 
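
The n50 column in the stats above can be reproduced from a list of sequence lengths. The following is a minimal sketch of the usual N50 definition (the length at which the running sum of lengths, sorted largest first, reaches half of the total); the function name and the toy input are illustrative and not taken from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy example only; real input would be the scaffold or contig lengths.
print(n50([2, 3, 4, 5, 6]))
```

This also explains why, in the scf row, n50 equals max: the single 81646936 bp scaffold alone exceeds half of the 87688255 bp total.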
* Runtime 1104299.893u 126549.756s 18:50:05.80 1815.2%    0+0k 0+0io 8463pf+0w
  18hr 50min :                  multiprocessor
  1104299/(3600*24)=12.78 days : singleprocessor
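
The single-processor estimate above is the user CPU time converted to days. A small sketch of that arithmetic, using only the numbers from the Runtime line (the function names are illustrative):

```python
def cpu_days(user_seconds):
    """Convert CPU seconds to days, as in 1104299/(3600*24)."""
    return user_seconds / (3600 * 24)


def cpu_utilization(user_seconds, sys_seconds, elapsed_seconds):
    """Percent CPU as reported by the shell's time: (user + sys) / elapsed."""
    return 100.0 * (user_seconds + sys_seconds) / elapsed_seconds


user_s = 1104299.893                      # "u" field of the time output
sys_s = 126549.756                        # "s" field of the time output
elapsed_s = 18 * 3600 + 50 * 60 + 5.80    # 18:50:05.80 wall clock

print(f"{cpu_days(user_s):.2f} days of user CPU time")
print(f"about {cpu_utilization(user_s, sys_s, elapsed_s):.0f}% CPU")
```

The utilization of roughly 1815% corresponds to about 18 cores kept busy on average during the 18hr 50min run.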
== Argentine ant ==
=== Data ===
          #reads        readLen  readCvg
  Shotgun: 39,741,216    75        12
  3kb:    46,435,880    75        13 
  8kb:    43,839,748    75        13
  Total:  130,016,844  75        40
* Location
  /fs/szattic-asmg7/argentine_ant/Illumina/
= UC Assemblathon1 =
* [http://www.drive5.com/evolver/ Evolver]
* [https://github.com/jstjohn/SimSeq Read simulator]
* [http://korflab.ucdavis.edu/Datasets/Assemblathon/Assemblathon1/ Data download]
* speciesA.diploid.fa len
  chr0_1        76252953
  chr0_2        76285600
  chr1_1        18509915
  chr1_2        18539192
  chr2_1        17699484
  chr2_2        17710169
= UC Assemblathon1 =
...

Latest revision as of 17:07, 1 August 2011

Links

  • The Assemblathon University of California, Santa Cruz & UC Davis; synthetic & real genome.
  • dnGASP De Novo Genome Assembly Assessment Project (dnGASP): Centro Nacional de Análisis Genómico in Barcelona, Spain, synthetic genome
  • GAGE
  • genomeweb announcement

GAGE

  • Location
 http://gage.cbcb.umd.edu/ -> /fs/web-cbcb-new/html/gage
  • Answer the following questions:
  1. How much sequencing coverage do I need for my genome project?
  2. What can I expect the resulting assembly to look like?
  3. Which assembly software should I use?
  4. What parameters should I use when I run the software?

Read correction

  • quake
 echo frag_1.fastq      frag_2.fastq      >  genome.ls
 echo shortjump_1.fastq shortjump_2.fastq >> genome.ls
 echo longjump_1.fastq  longjump_2.fastq  >> genome.ls 

 /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/quake.py -f genome.ls  -k 18 -p 20 >&! quake.log

Assemblers

 paths: 
   /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/allpaths3-35218/
   /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/
 RunAllPaths3G \
    PRE=$PWD REFERENCE_NAME=. DATA_SUBDIR=. RUN=allpaths SUBDIR=run1.orig THREADS=$P
 paths: 
   /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/wgs-6.1/
   /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/
 runCA \
    -d . \
    -p asm \
    -s /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/runCA.parallel.spec \
    doOverlapBasedTrimming=0 ovlOverlapper=ovl unitigger=bog bogBreakAtIntersections=0 bogBadMateDepth=1000 \
    *.frg
 paths: 
   /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/velvet_1.0.13/
   /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/
 velveth . $K -fastq \
   -shortPaired  frag_12.fastq \
   -shortPaired2 shortjump_12.rev.fastq \
   -shortPaired3 longjump_12.fastq

 velvetg . -exp_cov auto \
   -ins_length  $MEA_FRAG      -ins_length_sd  $STD_FRAG \
   -ins_length2 $MEA_SHORTJUMP -ins_length2_sd $STD_SHORTJUMP \
   -ins_length3 $MEA_LONGJUMP  -ins_length3_sd $STD_LONGJUMP \
   -scaffolding yes -exportFiltered yes -unused_reads yes
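
The velveth call above reads `shortjump_12.rev.fastq`, while the jumping libraries are listed as "outie". A minimal sketch of how one read could be flipped to the "innie" orientation follows, assuming the `.rev` suffix means each read was reverse-complemented (the notes do not say how the file was produced; the function names are illustrative).

```python
# Translation table for DNA bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_record(header, seq, qual):
    """Reverse-complement one FASTQ record.

    The quality string is reversed as well so that each score
    stays attached to the base it describes.
    """
    return header, reverse_complement(seq), qual[::-1]


print(reverse_record("@read1/1", "AACG", "IIH#"))
```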
 paths:
   /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/SOAPdenovo-V1.05/
   /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/
   /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/GapCloser/
 echo "[LIB]\navg_ins=$MEA_FRAG\nreverse_seq=0\nasm_flags=1\nrank=1\nq1=frag_1.fastq\nq2=frag_2.fastq\n" >! SOAPdenovo.config
 echo "[LIB]\navg_ins=$MEA_SHORTJUMP\nreverse_seq=1\nasm_flags=2\nrank=2\nq1=shortjump_1.fastq\nq2=shortjump_2.fastq\n" >> SOAPdenovo.config
 echo "[LIB]\navg_ins=$MEA_LONGJUMP\nreverse_seq=0\nasm_flags=2\nrank=4\nq1=longjump_1.fastq\nq2=longjump_2.fastq\n" >> SOAPdenovo.config

 SOAPdenovo all -K $K -p $P -s ./SOAPdenovo.config -o asm

 GapCloser -b SOAPdenovo.config -a asm.scafSeq -o asm2.scafSeq -t $P -p 31
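
The three echo commands above depend on the shell's echo expanding `\n`. As an alternative, the same config text can be built in Python. This is only a sketch: the keys and file names are the ones in the echo lines, but the insert sizes are illustrative placeholders for $MEA_FRAG, $MEA_SHORTJUMP and $MEA_LONGJUMP.

```python
def lib_block(avg_ins, reverse_seq, asm_flags, rank, q1, q2):
    """Return one [LIB] section with the keys used in the echo lines."""
    return (
        "[LIB]\n"
        f"avg_ins={avg_ins}\n"
        f"reverse_seq={reverse_seq}\n"
        f"asm_flags={asm_flags}\n"
        f"rank={rank}\n"
        f"q1={q1}\n"
        f"q2={q2}\n"
    )


def build_config(libs):
    """Join the [LIB] sections, separated by a blank line."""
    return "\n".join(lib_block(**lib) for lib in libs)


# avg_ins values are placeholders, not measured insert sizes.
libs = [
    dict(avg_ins=155, reverse_seq=0, asm_flags=1, rank=1,
         q1="frag_1.fastq", q2="frag_2.fastq"),
    dict(avg_ins=2500, reverse_seq=1, asm_flags=2, rank=2,
         q1="shortjump_1.fastq", q2="shortjump_2.fastq"),
    dict(avg_ins=35000, reverse_seq=0, asm_flags=2, rank=4,
         q1="longjump_1.fastq", q2="longjump_2.fastq"),
]

config = build_config(libs)
print(config)
```

Writing `config` to SOAPdenovo.config leaves the SOAPdenovo and GapCloser commands unchanged.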
 paths:
   /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/SR-CA-1.1/CA/Linux-amd64/bin/

 paths:
   /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/abyss-1.2.7/
   /fs/szdevel/core-cbcb-software/Linux-x86_64/bin
 abyss-pe  \
   k=$K n=5 name=asm lib='frag short' frag=frag_12.fastq short=short_12.fastq aligner=bowtie
 paths:
   /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/sga/src/SGA/                      # Version: 0.9.8
   /fs/szdevel/core-cbcb-software/Linux-x86_64/bin
 sga preprocess -p 1 frag_?.fastq > frag.pp.fa 
 sga index -t $P frag.pp.fa 

 sga correct -k $K -t $P frag.pp.fa -o frag.pp.ec.fa  
 sga index -t $P frag.pp.ec.fa 

 sga filter frag.pp.ec.fa 
 sga overlap -t $P frag.pp.ec.filter.pass.fa

 sga assemble frag.pp.ec.filter.pass.asqg.gz

CBCB genomes

  • a bacterial genome. Instead of E. coli, we can use S. aureus USA300, which has sequence data in SRA from 454 and Illumina, paired and unpaired. Daniela has already assembled it using CA, Newbler, Velvet, SOAPdenovo, and Maq (using its comparative assembly mode, where it aligns to a reference).
  • A medium-sized eukaryote. I'd like to use the Argentine ant or the Bombus impatiens bee - I've just written to Gene Robinson to ask about the bee.
  • Another eukaryote, ideally a larger one. Human would be great, but we just don't have enough time to do multiple human assemblies. So maybe another insect, or perhaps a plant if we can find one for which data is available.

If we can agree on the data sets, then the next step would be to design the experiment - decide in advance which assemblers to run and how many ways to try each one. I'm thinking we should also trim all the data with Quake.

Bombus impatiens

Data

  • Estimated haploid genome size: 250M
  • 497,318,144 Illumina 124bp reads (246X cvg)
  • Reads:
 .        readLen  orientation  insLen  #reads       readCvg  comments   
 frag     124      innie        400     303,118,594  150X     6 libs 
 short    124      outie        3-8K    194,199,550  96X      2 libs
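
The readCvg values in the table follow from reads × read length / genome size. A short sketch of that check, using the read counts above and the estimated 250M haploid genome size (the function name is illustrative):

```python
def coverage(n_reads, read_len, genome_size):
    """Expected read coverage: total sequenced bases over genome size."""
    return n_reads * read_len / genome_size


GENOME_SIZE = 250_000_000   # estimated haploid genome size
READ_LEN = 124

frag_cvg = coverage(303_118_594, READ_LEN, GENOME_SIZE)
short_cvg = coverage(194_199_550, READ_LEN, GENOME_SIZE)
total_cvg = coverage(303_118_594 + 194_199_550, READ_LEN, GENOME_SIZE)

print(f"frag  {frag_cvg:.1f}X")
print(f"short {short_cvg:.1f}X")
print(f"total {total_cvg:.1f}X")
```

The table values (150X, 96X, 246X) are these results with the fractional part dropped.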
  • Issue: Adapters: in 3k & 8k libraries
C CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
3 CGGCATTCCTGCTGAACCGAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
5 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
  • Read directories:
/fs/szattic-asmg4/Bees/Bombus_impatiens/s_[12356789]_[12]_sequence.txt                             # original fastq files

/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[129]_[012]_sequence.cor.rev.txt        # adaptor free corrected reads (long inserts)
/fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_[35678]_[012]_sequence.cor.txt          # corrected reads (short inserts)
  • Original read files:
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_2_1_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_2_2_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_3_1_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_3_2_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_5_1_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_5_2_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_6_1_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_6_2_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_7_1_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_7_2_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_8_1_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_8_2_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_9_1_sequence.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/s_9_2_sequence.txt  
  • Quake corrected files:
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_1_1_sequence.cor.rev.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_1_2_sequence.cor.rev.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_2_1_sequence.cor.rev.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_2_2_sequence.cor.rev.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_3_1_sequence.cor.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_3_2_sequence.cor.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_5_1_sequence.cor.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_5_2_sequence.cor.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_6_1_sequence.cor.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_6_2_sequence.cor.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_7_1_sequence.cor.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_7_2_sequence.cor.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_8_1_sequence.cor.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_8_2_sequence.cor.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_9_1_sequence.cor.rev.txt  
 /fs/szattic-asmg4/Bees/Bombus_impatiens/error_free/fastq/s_9_2_sequence.cor.rev.txt
  • k_unitig corrected files: (in progress --Dpuiu 10:38, 5 April 2011 (EDT))

Assembly

CA.quakeCor                /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/CA.s_1-8.cor.redo2/                               # Celera Assembly

SOAPdenovo.quakeCor(K=47)  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.K47.s_1-9.cor/                         # SOAPdenovo assembly (2011) K=47 quake corrected reads
#SOAPdenovo.orig(K=47)      /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.K47.s_1-9.orig/                       # SOAPdenovo assembly (2011) K=47 original reads   
#SOAPdenovo.quakeCor(K=31)  /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/SOAPdenovo.s_1-9.cor/                            # SOAPdenovo assembly (2010) K=31 quake corrected reads (prev assembler version)

MSR-CA                     /fs/szattic-asmg5/Bees/Bombus_impatiens/Assembly/MSR-CA  ; genome9.umd.edu:/genome9/raid/alekseyz/GAGE/bombus/assembly/CA       # MSR-CA

Staph aureus USA300

Data

  • Complete genome:
 id             len     description
 NC_010079      2872915 Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome
 NC_010063.1    27041   Staphylococcus aureus subsp. aureus USA300_TCH1516 plasmid pUSA300HOUMR, complete sequence
 NC_012417.1    3125    Staphylococcus aureus subsp. aureus USA300_TCH1516 plasmid pUSA01-HOU, complete sequence
                2903081 total 
  • Reads (90X):
 .            readLen  insLen  orientation     #reads     readCvg      SRA runs
 frag         101      180     innie           1,294,104  45X          SRR022868 
 shortjump    37       3500    outie           3,494,070  45X          SRR022865 
 SRP001086 Staphylococcus aureus Sequencing on Illumina
 SRX007714 pair lib
 SRX007711 jumping lib
  • Read directories:
 /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100/
 /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminaj/
  • Original read files:
 /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100/frag_1.fastq  
 /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminap100/frag_2.fastq  
 /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminaj/short_1.fastq  
 /nfshomes/dpuiu/HTS/Staphylococcus_aureus/Data/Illuminaj/short_2.fastq  
  • Quake corrected files:
 /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/frag_1.cor.fastq  
 /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/frag_2.cor.fastq  
 /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/short_1.cor.fastq  
 /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/quake/short_2.cor.fastq
  • Allpaths-LG corrected files:
 /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/frag_1.cor.fasta  
 /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/frag_2.cor.fasta  
 /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/short_1.cor.fasta  
 /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpathsCor/short_2.cor.fasta  
  • k_unitig corrected files:
 /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/frag_1.cor.seq  
 /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/frag_2.cor.seq  
 /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/short_1.cor.seq  
 /nfshomes/dpuiu/GAGE/Staphylococcus_aureus/Illumina.180_45X.3500_45X/k_unitig/short_2.cor.seq

Assembly

 allpaths.orig                     /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/allpaths

 CA.orig                           /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/CA.orig
 CA.quakeCor                       /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/CA.quakeCor.k18
 CA.allpathsCor                    /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/CA.allpathsCor
 CA.SuperReads                     /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/CA.SuperReads.latest

 SOAPdenovo.orig(K=31)             /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K31.orig
 SOAPdenovo.orig(K=47)             /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K47.orig
 SOAPdenovo.quakeCor(K=31)         /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K31.quakeCor.k18
 SOAPdenovo.quakeCor(K=47)         /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K47.quakeCor.k18
 SOAPdenovo.allpathsCor(K=31)      /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K31.allpathsCor
 SOAPdenovo.allpathsCor(K=47)      /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SOAPdenovo.K47.allpathsCor

 velvet.orig                       /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/velvet.orig
 velvet.quakeCor                   /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/velvet.quakeCor.k18 
 velvet.allpathsCor                /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/velvet.allpathsCor
 
 ABYSS.quakeCor                    /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/ABYSS.K31.quakeCor.k18
 SGA.orig                          /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.180_45X.3500_45X/SGA.orig
  • SOAPdenovo v1.05 :
    • new quake version did not help much (quake-0.2.2 vs davek44-error_correction-28dbe11)
    • SOAPdenovo map -K 37+ : fails on quakeCor.k18 corrected reads
    • "according" to kmerFreq , should probably not use -K >47
    • longer kmer => longer scaffolds (K=63 : largest N50scf)
    • longer kmer => shorter contigs (K=31 : largest N50ctg)
    • K40+ too large: no "valley" in the kmerFreq histogram
 paste SOAPdenovo.K??.quakeCor.k18/genome.K??.kmerFreq | nl0 | head
 paste SOAPdenovo.K??.allpathsCor/genome.K??.kmerFreq | nl0 | more
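
The "valley" check on the kmerFreq histogram can also be automated. The sketch below assumes the histogram has already been loaded into a list of counts ordered by k-mer multiplicity (how the list positions map to multiplicities depends on the file layout, which is not described here); the function name and the example counts are made up for illustration.

```python
def find_valley(counts):
    """Return the index of the first local minimum that is followed by a
    higher value later on (the dip between error k-mers and the main
    coverage peak), or None if the histogram has no such dip."""
    for i in range(1, len(counts) - 1):
        if counts[i] < counts[i - 1] and counts[i] <= counts[i + 1]:
            if max(counts[i + 1:]) > counts[i]:
                return i
    return None


# Made-up histograms: one with a dip and a later peak, one that only decays.
with_valley = [100, 40, 10, 5, 8, 20, 35, 20, 5]
no_valley = [100, 50, 20, 10, 5, 2, 1]

print(find_valley(with_valley))
print(find_valley(no_valley))
```

A result of None corresponds to the "no valley" situation noted for K40+, where error k-mers and genomic k-mers are no longer separable by frequency.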

Rhodobacter sphaeroides

Data

  • Complete genome: 2 chromosomes, 5 plasmids
 id             len     description
 CP000143       3188609 Rhodobacter sphaeroides 2.4.1 chromosome 1, complete sequence.
 CP000144       943016  Rhodobacter sphaeroides 2.4.1 chromosome 2, complete sequence.
 DQ232586       114045  Rhodobacter sphaeroides 2.4.1 plasmid A, partial sequence.
 CP000145       114178  Rhodobacter sphaeroides 2.4.1 plasmid B, complete sequence.
 CP000146       105284  Rhodobacter sphaeroides 2.4.1 plasmid C, complete sequence.
 CP000147       100828  Rhodobacter sphaeroides 2.4.1 plasmid D, complete sequence.
 DQ232587       37100   Rhodobacter sphaeroides 2.4.1 plasmid E, partial sequence.
                4603060 total 
  • Reads (90X):
 .            readLen  insLen  orientation     #reads     readCvg      SRA runs  
 frag         101      180     innie           2,050,868  45X          SRR081522 
 shortjump    101      3500    outie           2,050,868  45X          SRR034528 
  • SRA traces
 SRX033397 pair lib ;    readLen=101 ; insMea=180 
 SRX016063 jumping lib ; readLen=101 ; insMea~=3455; ~15% of the mates are short inserts (~250bp)
  • Original read files:
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminap/frag_1.fastq  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminap/frag_2.fastq  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminaj/short_1.fastq  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Data/Illuminaj/short_2.fastq  
  • Quake corrected read files:
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/frag_1.cor.fastq  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/frag_2.cor.fastq  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/short_1.cor.fastq  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/short_2.cor.fastq  
  • QuakeIter2 corrected read files:
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/frag_1.cor.fastq  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/frag_2.cor.fastq  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/short_1.cor.fastq  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/quake/iter2_dk/short_2.cor.fastq  
  • Allpaths-LG corrected files:
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/frag_1.cor.fasta  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/frag_2.cor.fasta  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/short_1.cor.fasta  
 /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpathsCor/short_2.cor.fasta  
  • k_unitig corrected files:
 /nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/frag_1.cor.seq  
 /nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/frag_2.cor.seq  
 /nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/short_1.cor.seq  
 /nfshomes/dpuiu/GAGE/Rhodobacter_sphaeroides//Illumina.180_45X.3500_45X/k_unitig/short_2.cor.seq

Assembly

  • Assembly directories:
 allpaths.orig                    /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/allpaths
 
 CA.orig                          /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/CA.orig
 CA.quakeCor                      /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/CA.quakeCor.k18
 CA.allpathsCor                   /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/CA.allpathsCor

 CA.SuperReads                    /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/CA.SuperReads.latest

 SOAPdenovo.orig(K=31)            /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.orig/K31
 SOAPdenovo.orig(K=47)            /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.orig
 SOAPdenovo.quakeCor(K=31)        /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.quakeCor.k18/K31
 SOAPdenovo.quakeCor(K=47)        /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.quakeCor.k18
 SOAPdenovo.allpathsCor(K=31)     /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.allpathsCor/K31
 SOAPdenovo.allpathsCor(K=47)     /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SOAPdenovo.allpathsCor

 velvet.orig                      /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/velvet.orig
 velvet.quakeCor                  /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/velvet.quakeCor
 velvet.allpathsCor               /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/velvet.allpathsCor

 ABYSS.quakeCor                   /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/ABYSS.K31.quakeCor.k18
 
 SGA.orig                         /fs/szattic-asmg5/dpuiu/HTS/Rhodobacter_sphaeroides/Illumina.180_45X.3500_45X/SGA.quakeCor.k18

Human, a single chromosome, medium-sized

Data

  • Latest online assembly
 ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/
  NC_000014.8    107,349,540  # total, with telomeric N's 
                  88,289,540  # clean
  • Human bowtie indexes
  /fs/szdata/bowtie_indexes/h_sapiens_37_asm
  • Chr14 filtered reads (69.3X):
 .            readLen  insLen        orientation    #reads        readCvg        
 frag         101      155           innie          36,504,800    42
 shortjump    101      2283-2803     outie          22,669,408    26
 longjump     76-101   35295-35318   innie          2,405,064     1.3
  • Illumina reads (whole genome)
 Human NA12878 Genome on Illumina 
 ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP003/SRP003680/
 ginko:/scratch1/Human_NA12878_on_Illumina/
 #Fragment (mean insert size: 155bp, SD 26), 101 bp read length
 Lib          #Spots  #Bases  #Reads     #Mates     ReadLen  InsMea  InStd  InsMin  InsMax   TrimReadLen    Comments
 SRR067787    82.4M   16.6G   652448124  324283604  101      155     26     77      458                     Human HapMap individual NA12878 HiSeq 2000
 SRR067789    82.6M   16.7G   654133372  324876520  101      155     26     77      458     
 SRR067780    83.3M   16.8G   660001672  328021140  101      155     26     77      458     
 SRR067791    83.0M   16.8G   657963460  327205952  101      155     26     77      458     
 SRR067793    77.0M   15.5G   609634756  303094956  101      155     26     77      458     
 SRR067784    83.3M   16.8G   660118460  328244560  101      155     26     77      458     
 SRR067785    81.6M   16.5G   646350512  321174108  101      155     26     77      458     
 SRR067792    83.8M   16.9G   663997828  330084304  101      155     26     77      458                      

 SRR067577    46.3M   9.3G    367673108  183472948  101      155     26     77      458                      Human HapMap individual NA12878 Illumina GAII
 SRR067579    46.0M   9.3G    365743380  182532676  101      155     26     77      458     
 SRR067578    46.5M   9.4G    369557476  184410788  101      155     26     77      458     
 
 #Jumping1 (mean insert size: 2283bp, SD 221), 101 bp read length
 SRR067771    81.5M   16.5G   644846296  320822716  101      2283    221    1620    2586                     Human HapMap individual NA12878 HiSeq 2000
 SRR067777    82.6M   16.7G   653163608  325232944  101      2283    221    1620    2586    
 SRR067781    82.1M   16.6G   649748720  323656576  101      2283    221    1620    2586    
 SRR067776    79.9M   16.1G   632590344  315165892  101      2283    221    1620    2586    
 
 #Jumping2 (mean insert size: 2803bp, SD 271), 101 bp read length
 SRR067773    93.1M   18.8G   736456192  366884512  101      2803    271    1990    3106                      Human HapMap individual NA12878 HiSeq 2000
 SRR067779    94.0M   19.0G   743564440  370214028  101      2803    271    1990    3106    
 SRR067778    97.3M   19.6G   767984324  381879652  101      2803    271    1990    3106    
 SRR067786    94.6M   19.1G   747631104  372002548  101      2803    271    1990    3106    
 
 #Fosmid1  (mean insert size: 35295bp, SD 2703), 76 bp read length
 SRR068214    13.1M   2.0G    104505420  52087176   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')       Human HapMap individual NA12878 Illumina GAII
 SRR068211    4.8M    736.9M  38612196   19252408   76       35295   2703   27186   35523   36(trim 20bp at 5',20bp at 3')       Human HapMap individual NA12878 Illumina GAII
 
 #Fosmid2 (mean insert size: 35318bp, SD 2759),  101 bp read length
 SRR068335    67.4M   13.6G   533805860  265481252  101      35318   2759   27041   35621   61(trim 20bp at 5',20bp at 3')       Human HapMap individual NA12878 HiSeq 2000
  • Comments
    • Human chromosome 14. The choice of chromosome may change, but this is a new data set with 100X coverage in 100 bp and 76 bp reads, recently assembled by the Broad group using Allpaths-LG and SOAPdenovo. We have downloaded the data, and Todd will create a subset representing only chr14 to make the assembly feasible. We will then try to assemble that subset with all three assemblers: CA, SOAPdenovo, and Allpaths-LG.
  • Illumina chr14 reads (aligned with bowtie & corrected)
 /fs/szattic-asmg8/treangen/*fastq
 hard to align: bowtie -5 20 -3 20 -e 1000 ...
 jumping reads: only the pairs that aligned within the correct mean and stdev were selected; these libraries usually have a high % of short inserts!
  • Original read files:
 /fs/szattic-asmg8/treangen/chr14_fragment_1.fastq  
 /fs/szattic-asmg8/treangen/chr14_fragment_2.fastq  
 /fs/szattic-asmg8/treangen/chr14_shortjump_1.fastq  
 /fs/szattic-asmg8/treangen/chr14_shortjump_2.fastq  
 /fs/szattic-asmg8/treangen/chr14_longjump_1.fastq  
 /fs/szattic-asmg8/treangen/chr14_longjump_2.fastq  
  • Quake corrected files:
 /fs/szattic-asmg8/treangen/chr14_fragment_1.cor.fastq  
 /fs/szattic-asmg8/treangen/chr14_fragment_2.cor.fastq  
 /fs/szattic-asmg8/treangen/chr14_shortjump_1.cor.fastq  
 /fs/szattic-asmg8/treangen/chr14_shortjump_2.cor.fastq  
 /fs/szattic-asmg8/treangen/chr14_longjump_1.cor.fastq  
 /fs/szattic-asmg8/treangen/chr14_longjump_2.cor.fastq  
  • Allpaths-LG corrected files:
 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_fragment_1.cor.fasta
 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_fragment_2.cor.fasta
 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_shortjump_1.cor.fasta
 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_shortjump_2.cor.fasta
 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_longjump_1.cor.fasta
 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpathsCor/chr14_longjump_2.cor.fasta
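
The note above says that jumping pairs were kept only when they aligned within the expected mean and stdev. A minimal sketch of that kind of filter is shown below. It is not the script that was actually used: the function name, the toy pair list, and the window of 3 standard deviations are all assumptions made for illustration, and the insert sizes are presumed to have been extracted from the bowtie alignments already. The mean (2283) and SD (221) are the Jumping1 values from the library listing.

```python
from typing import Iterable, List, Tuple


def select_pairs_by_insert(
    pairs: Iterable[Tuple[str, int]],
    mean: float,
    sd: float,
    n_sd: float = 3.0,  # assumed window width; not stated in the notes
) -> List[Tuple[str, int]]:
    """Keep (pair_name, insert_size) entries whose insert size lies
    within mean +/- n_sd * sd."""
    lo = mean - n_sd * sd
    hi = mean + n_sd * sd
    return [(name, ins) for name, ins in pairs if lo <= abs(ins) <= hi]


# Toy input: hypothetical pair names with observed insert sizes in bp.
pairs = [("pair1", 2300), ("pair2", 310), ("pair3", 1900), ("pair4", 4100)]

# Jumping1 library: mean insert 2283 bp, SD 221.
kept = select_pairs_by_insert(pairs, mean=2283, sd=221)
print(kept)
```

In this toy input, pair2 stands for the short-insert contamination mentioned above and falls outside the window, as does the overlong pair4.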

Assembly

  • Assembly directories
 allpaths                        /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/allpaths

 CA.allpathsCor                  /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/CA.allpathsCor , /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/CA.allpathsCor
 CA.quakeCor                     /fs/szattic-asmg8/tmagoc/GAGE/human
 CA.SuperReads                   ginkgo:/scratch1/dpuiu/HTS/Homo_sapiens/Assembly/CA.SuperReads

 SOAPdenovo.orig(K=47)           /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.orig/
 SOAPdenovo.quakeCor(K=31)       /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.quakeCor/K31 , /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.quakeCor/K31
 SOAPdenovo.quakeCor(K=47)       /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.quakeCor/ , /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.quakeCor
 SOAPdenovo.allpathsCor(K=31)    /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.allpathsCor/K31 , /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.allpathsCor/K31
 SOAPdenovo.allpathsCor(K=47)    /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.allpathsCor , /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/SOAPdenovo.allpathsCor

 velvet.quakeCor                 /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/velvet.quakeCor

 ABYSS.quakeCor                  /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/ABYSS.K31.quakeCor.K18 , /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/ABYSS.K31.quakeCor.K18

 SGA.orig                        /fs/szattic-asmg5/dpuiu/HTS/Homo_sapiens/Assembly/SGA.orig , /scratch1/dpuiu/HTS/Homo_sapiens/Assembly/SGA.orig

Allpaths-LG

  • Read counts
                             orig       cor                cor(paired,all >64bp)
 chr14_fragment_12.fastq     36504800   35571477(97.44%)   34268444(10+bp ovl F/R)
 chr14_shortjump_12.fastq    22669408   11255320(49.64%)   11255320
 chr14_longjump_12.fastq     2405064    187398(7.79%)      187398
  • Assembly stats:
 .          elem  min    q1     q2     q3      max       mean     n50       sum       
 scf        418   96     131    256    1236    81646936  209781   81646936  87688255  
 scf10K+    17    10330  11780  26536  269876  81646936  5135452  81646936  87302692  
 ctg        4722  96     2342   9101   24174   240773    17887    36530     84461065
  • Runtime 1104299.893u 126549.756s 18:50:05.80 1815.2% 0+0k 0+0io 8463pf+0w
 18 hr 50 min :                 wall-clock time, multiprocessor
 1104299/(3600*24)=12.78 days : user CPU time, single-processor equivalent
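
The arithmetic behind the runtime figures can be reproduced with a short sketch. The input values are copied from the timing line above (user seconds, system seconds, elapsed time); the helper name `parse_elapsed` and the variable names are illustrative only. The 12.78-day figure counts user time alone, so the sketch also reports user plus system time for comparison.

```python
def parse_elapsed(hms: str) -> float:
    """Convert an elapsed-time string such as 'H:MM:SS.ss' to seconds."""
    seconds = 0.0
    for part in hms.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Values taken from the timing line in the notes.
user_s = 1104299.893   # user CPU seconds
sys_s = 126549.756     # system CPU seconds
elapsed_s = parse_elapsed("18:50:05.80")

SECONDS_PER_DAY = 3600 * 24

# CPU utilisation: total CPU time as a percentage of wall-clock time.
cpu_percent = 100 * (user_s + sys_s) / elapsed_s

# Single-processor equivalents, in days.
user_days = user_s / SECONDS_PER_DAY
total_cpu_days = (user_s + sys_s) / SECONDS_PER_DAY

print(f"elapsed: {elapsed_s / 3600:.2f} h")
print(f"CPU utilisation: {cpu_percent:.1f}%")
print(f"user CPU: {user_days:.2f} days")
print(f"user + system CPU: {total_cpu_days:.2f} days")
```

The utilisation of roughly 1815% corresponds to about 18 cores kept busy on average over the run.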

Argentine ant

Data

          #reads        readLen   readCvg
 Shotgun: 39,741,216    75        12
 3kb:     46,435,880    75        13  
 8kb:     43,839,748    75        13
 Total:   130,016,844   75        40
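
Read coverage is usually estimated as number of reads times read length divided by genome size. The sketch below applies that formula to the read counts in the table. The genome size is not stated in these notes, so `ASSUMED_GENOME_SIZE` is a placeholder assumption; the results will agree with the readCvg column only to the extent that the placeholder matches whatever size was used for the table.

```python
def coverage(n_reads: int, read_len: int, genome_size: int) -> float:
    """Estimated fold coverage: total sequenced bases / genome size."""
    return n_reads * read_len / genome_size


# ASSUMPTION: placeholder genome size in bp; not given in the notes.
ASSUMED_GENOME_SIZE = 250_000_000

READ_LEN = 75  # from the table

# Read counts per library, from the table.
libraries = {
    "Shotgun": 39_741_216,
    "3kb": 46_435_880,
    "8kb": 43_839_748,
}

per_lib = {
    name: coverage(n, READ_LEN, ASSUMED_GENOME_SIZE)
    for name, n in libraries.items()
}
total_reads = sum(libraries.values())
total_cvg = coverage(total_reads, READ_LEN, ASSUMED_GENOME_SIZE)

for name, cvg in per_lib.items():
    print(f"{name}: {cvg:.1f}X")
print(f"Total: {total_reads:,} reads, {total_cvg:.1f}X")
```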
  • Location
 /fs/szattic-asmg7/argentine_ant/Illumina/

UC Assemblathon1

 chr0_1         76252953
 chr0_2         76285600
 chr1_1         18509915
 chr1_2         18539192
 chr2_1         17699484
 chr2_2         17710169

UC Assemblathon1

...