Dpuiu CA: Difference between revisions

From Cbcb
Jump to navigation Jump to search
 
(155 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Links =
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2639302/?tool=pmcentrez BOG paper]
= Data Issues =
* Problems if the coverage > read length; solution: sample the reads
* Analyze the data before assemble it:
** If there is a reference genome map the reads to it
** Plot kmer frequencies (use multiple kmer sizes) ; see how they compare with the estimated coverage
= Sourceforge =
= Sourceforge =


* [http://apps.sourceforge.net/mediawiki/wgs-assembler/index.php?title=Main_Page Wiki]
* [http://apps.sourceforge.net/mediawiki/wgs-assembler/index.php?title=Main_Page Wiki]
* [http://sourceforge.net/projects/wgs-assembler/ WGS]
* [http://sourceforge.net/projects/wgs-assembler/ WGS Sourceforge Wiki]
* Download
* [http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Release_History WGS Sourceforge Wiki Release History]
 
* Constants:
  cat src/AS_global.h
  ...
  #define AS_READ_MIN_LEN                  64  # should decrease for short Illumina reads  32 ???
  #define AS_OVERLAP_MIN_LEN                40  # should decrease for short Illumina reads  30 ??? 
  #define AS_READ_MAX_PACKED_LEN            104  # max Illumina read length; incread to 124
 
  #define AS_READ_MAX_NORMAL_LEN_BITS      11  # should be increased to 12+ if there are many frequent kmesra nd dpn't fit in the hash
 
cat src/AS_CGW/AS_CGW_dataTypes.h
#define CGW_MIN_DISCRIMINATOR_UNIQUE_LENGTH 1000  # should be decraesed for short reads assemblies ; all unitigs less than that do not get scaffolded
 
 
* Download & compile CVS tip
   cvs -z3 -d:pserver:anonymous@wgs-assembler.cvs.sourceforge.net:/cvsroot/wgs-assembler co -P wgs-assembler
   cvs -z3 -d:pserver:anonymous@wgs-assembler.cvs.sourceforge.net:/cvsroot/wgs-assembler co -P wgs-assembler
* CBCB
   ~/szdevel/SourceForge/wgs-5.4/
   cd wgs-assembler/src
* Compile for debugging :
  make SITE_NAME=LOCAL
   BUILDDEBUG=1  # in "c_make.as"
* Compile
  => executables under wgs-assembler/Linux-amd64/bin/
 
* Download & compile release
   wget ...
   bzip2 -dc wgs-6.0-beta.tar.bz2 | tar -xf -
   bzip2 -dc wgs-6.0-beta.tar.bz2 | tar -xf -
   cd wgs-6.0-beta
   cd wgs-6.0-beta
Line 18: Line 49:
   gmake
   gmake
   cd ..
   cd ..
* Compile for debugging :
  make SITE_NAME=LOCAL  BUILDDEBUG=1  # in "c_make.as"


= Run =   
= Run =   
* Help  
* Help  
  runCA.pl -fields
  runCA.pl -fields
runCA.pl -options


* Batch submission
* Batch submission
Line 34: Line 69:


* doOverlapTrimming=1  (default)
* doOverlapTrimming=1  (default)
* low qual reads (ex: 20 or 'D') are discarded
* low qual Sanger reads (ex: 20 or 'D') are discarded
* Illumina reads are trimmed at 5'/3' (bp with qual < 5 (E) are deleted; read is deleted if new length< 64)
* astatLowBound  1 (default)  ; can be lowered for low/un-uniform coverage
* astatLowBound  1 (default)  ; can be lowered for low/un-uniform coverage
* astatHighBound 5 (default)
* astatHighBound 5 (default)
Line 43: Line 79:
   runCA -d . ovlOverlapper=mer unitigger=bog -p prefix *.frg
   runCA -d . ovlOverlapper=mer unitigger=bog -p prefix *.frg
   runCA -d . -s ~/bin/runCA.454.spec -p prefix *.frg  
   runCA -d . -s ~/bin/runCA.454.spec -p prefix *.frg  
* run concurently
  cnsConcurrency=4 frgCorrConcurrency=4 merOverlapperExtendConcurrency=4 merOverlapperSeedConcurrency=4 ovlConcurrency=4 ovlCorrConcurrency=4 ...


* gatekeeper -L : search for 454 paired end linker
* gatekeeper -L : search for 454 paired end linker
Line 54: Line 87:
* stopAfter:  
* stopAfter:  
     initialStoreBuilding
     initialStoreBuilding
     overlapBasedTrimming | OBT
     overlapBasedTrimming | OBT    
     overlapper
     overlapper                         # runs overlap, overlapStore
     unitigger
     unitigger                         # runs frg/ovl correction
     consensusAfterUnitigger
     consensusAfterUnitigger
     scaffolder
     scaffolder
     consensusAfterScaffolder
     consensusAfterScaffolder
   
   
  Example:
* Example:
   ~/bin/runCA -d . -p asm -s ~/bin/runCA.454.spec stopAfter=overlapper Saureus.frg
   ~/bin/runCA -d . -p asm -s ~/bin/runCA.454.spec stopAfter=initialStoreBuilding *.frg
 
* cgwOutputIntermediate = 1 => "cgw -G"
  => .cgw & .cgw_contigs (generated only at the last doExtendClearRanges step)
* doExtendClearRanges = 2 => cgw run 3 times, ECR 2 times


= Gatekeeper =
= Gatekeeper =
Line 95: Line 124:
   lib    uid    19070  stddev  167001
   lib    uid    19070  stddev  167001
   lib    uid    19070  isNotRandom    1
   lib    uid    19070  isNotRandom    1
  lib    uid    19070  Orientation    O            # outie orientation for Illimina libs >1K
   
  gatekeeper --edit prefix.update prefix.gkpStore


   gatekeeper --edit prefix.update prefix.gkpStore > prefix.gkpStore.seq
* Update read CLR/CLV
   gatekeeper -a -r prefix.clr -o prefix.gkpStore   # not in the wgs release
  gatekeeper -a -v prefix.clv -o prefix.gkpStore


== Delete ==
== Delete ==
Line 126: Line 160:
   }
   }


cat delete.ids | perl -ane 'print "{FRG\nact:D\nacc:$F[0]\n}\n";'> delete.frg
  cat delete.ids | perl -ane 'print "{FRG\nact:D\nacc:$F[0]\n}\n";'> delete.frg
  gatekeeper -a -o prefix.gkpStore -T -F delete.frg
  cat delete.ids | ~/bin/deleteFrg.pl -forward 1 -reverse 2 > delete.frg            # forward reads have suffix 1; forward reads have suffix 1
   
  gatekeeper -a -o prefix.gkpStore -T -F delete.frg


== Dump ==
== Dump ==
Line 147: Line 183:
   >SRR001355.3635.1a template=2020+2021 dir=F library=SRX000348 trim=20-117
   >SRR001355.3635.1a template=2020+2021 dir=F library=SRX000348 trim=20-117
   >SRR001355.3635.1b template=2020+2021 dir=R library=SRX000348 trim=1-130
   >SRR001355.3635.1b template=2020+2021 dir=R library=SRX000348 trim=1-130
* FASTQ reads
  gatekeeper -dumpfastq prefix. prefix.gkpStore
  =>
  prefix.1.fastq
  prefix.2.fastq
  prefix.paired.fastq    # contains both asm.1.fastq & asm.2.fastq reads
  prefix.unmated.fastq
  cat prefix.1.fastq | perl ~/bin/clrFastq.pl > prefix.1.clr.fastq
  cat prefix.2.fastq | perl ~/bin/clrFastq.pl > prefix.2.clr.fastq
  cat prefix.paired.fastq | perl ~/bin/clrFastq.pl > prefix.paired.clr.fastq
  cat prefix.unmated.fastq | perl ~/bin/clrFastq.pl > prefix.unmated.clr.fastq


* FRG file
* FRG file
Line 153: Line 204:


* READS in TAB format
* READS in TAB format
   gatekeeper -dumpfragments -tabular prefix.gkpStore > prefix.frg.info
   gatekeeper -dumpfragments -tabular prefix.gkpStore > prefix.gkpStore.info
  gatekeeper -dumpfragments -tabular prefix.gkpStore | grep -v ^UID | awk '{print $1,$2}' > prefix.gkpStore.uid2iid


* LIBS in TAB format
* LIBS in TAB format
Line 159: Line 211:
   asm2mdi.pl < prefix.asm > prefix.mdi                                                                                      # final vals
   asm2mdi.pl < prefix.asm > prefix.mdi                                                                                      # final vals


* READ CLR's (default=ECR2)
* READ CLR's (default=LATEST)
  gatekeeper -dumpfragments -tabular prefix.gkpStore/ | perl -ane 'print join "\t", @F[1,0,-2,-1,6]; print "\n";' | head | pretty -o
  IID    UID                                  clrBeginLATEST  clrEndLATEST  isDeleted 
  1      HWI-EAS385_0001:1:1:1569:15197#0/1  0              65            0         
  2      HWI-EAS385_0001:1:1:2095:12498#0/1  0              67            0         
  3      HWI-EAS385_0001:1:1:5452:4010#0/1    0              78            0         
  4      HWI-EAS385_0001:1:1:12400:16773#0/1  0              105          0         
  5      HWI-EAS385_0001:1:1:12884:18158#0/1  0              80            0         
  ...
 
   gatekeeper -dumpfragments -tabular prefix.gkpStore/  | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.clr
   gatekeeper -dumpfragments -tabular prefix.gkpStore/  | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.clr
   gatekeeper -dumpfragments -tabular -clear [CLR|OBTINITIAL|OBTMERGE|OBTCHIMERA|ECR_0|ECR_1|ECR_2|LATEST] prefix.gkpStore/  | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.?.clr
   gatekeeper -dumpfragments -tabular -clear [CLR|OBTINITIAL|OBTMERGE|OBTCHIMERA|ECR_0|ECR_1|ECR_2|LATEST] prefix.gkpStore/  | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.?.clr
   CLR:    initial
 
   LATEST: final
   gatekeeper -dumpfragments -tabular -clear CLR asm.gkpStore        | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.CLR.clr
  gatekeeper -dumpfragments -tabular -clear OBTINITIAL asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.OBTINITIAL.clr &
  gatekeeper -dumpfragments -tabular -clear OBTCHIMERA asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.OBTCHIMERA.clr &
  ...
   gatekeeper -dumpfragments -tabular -clear LATEST asm.gkpStore    | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.LATEST.clr &
 
* READ CLR's summary
  gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear | perl -ane '  print $1," ",$3-$2,"\n" if(/(\w+),(\d+),(\d+)$/);' | ~/bin/sum2.pl


* MATED
* MATED
   frg22mates.pl < prefix.frg > prerfix.mates
   frg22mates.pl < prefix.frg > prerfix.mates


* Subset
* Deleted reads
   touch file.uids
   gdt asm.gkpStore | p 'print $F[0],"\n" if($F[6]);'
   gatekeeper -uid file.uids ...
 
* All scaffold reads in FRG2 format
  cat asm.posmap.frgscf | grep scfid > scfid.posmap.frgscf
   gatekeeper -uid scfid.posmap.frgscf -dumpfrg -donotfixmates -format2 asm.gkpStore > scfid.frg
 
== Id mapping ==
 
* Frg FASTQ packed format id's are no recorded in the gatekeeper; assgned a new UID ; to map them back ...
  ~/bin/fastq2wgsUIDs.pl s_*txt > s.uid2iid2id
  gatekeeper -dumpfragments -tabular asm.gkpStore | awk '{print $1,$2,$10}' > s.uid2iid
  paste s.uid2iid2id s.uid2iid | p 'print "ERROR $_" if($F[3] ne $F[-1]);' # check the lengths


= Meryl =
= Meryl =
  # Get mer frequencies
  meryl -C -B -m 22 -s prefix.seq -o prefix.22mers
  # Dump mer sequences
  meryl -Dt -s prefix.22mers > prefix.22mers.seq
  # Dump histogram
  meryl -Dh -s prefix.22mers | sort -nk2 -r  >  prefix.22mers.histo
  # Get missing mers
  meryl -C -B -m 22 -s ref.fasta -o ref.22mers
  meryl -C -B -m 22 -s qry.fasta -o qry.22mers
  meryl -M sub -s ref.22mers -s qry.22mers -o diff.22mers
  meryl -Dt -s diff.22mers  > diff.22mers.seq
  # Combine mers
* jellyfish
  # Get mer frequencies
  jellyfish count -s 1000000000 -m 18 --both-strands -t 16 -o prefix.jf prefix*.fastq [prefix*.seq]
  jellyfish merge prefix.jf_? -o prefix.jf
  # Dump histogram
  jellyfish histo prefix.jf
* countKmers.pl (other)
  cat lib.seq | ~/bin/countKmers.pl -k 22 | head -20 |          grep -v "^>" > 22mers.fwd.grep
  cat lib.seq | ~/bin/countKmers.pl -k 22 | head -20 | revseq | grep -v "^>" > 22mers.rev.grep
  cat 22mers.fwd.grep 22mers.rev.grep > 22mers.grep
  cat lib.seq | ~/bin/fgrep.pl -f 22mer.grep | count.pl
----
meryl -C -B -m 12 -s prefix.seq -o prefix.12mers
meryl -Dh -s prefix.12mers | sort -nk2 -r | more


= OBT =
= OBT =
* runCA spec
  doOverlapTrimming = 1
* command
  overlap -G ...  # -G to compute partial overlaps
* Dump
  overlapStore -d ./0-overlaptrim/asm.obtStore
* Number of jobs: should be < fileListMax  (1024*10 by default) otherwise overlapStore fails
  ~/bin/getOvlJobsCount.pl ovlHashBlockSize ovlRefBlockSize frgCount
  #or
  cat 0-overlaptrim-overlap/ovlopts.pl | grep -c '^"h'


= Overlap =
= Overlap =
* Jobs
  merOverlapperSeedBatchSize    100000
  merOverlapperExtendBatchSize  75000
 
  overmerry.sh :      #frgs/merOverlapperSeedBatchSize  jobs -> seeds/ dir
  olap-from-seeds.sh : #frgs/merOverlapperExtendBatchSize jobs -> olaps/ dir
* Num ber of jobs
  cat 1-overlapper/ovlopts.pl | grep -c '^"h'
* Restart in runCA
  rm 1-overlapper/o*
= OverlapStore =
* Flags:
  -t -M            : don't improve performance much
  -I ignore_file  : not implemented


* Dump
* Dump
Line 185: Line 335:
   rm prefix.ovlStore.count.tmp
   rm prefix.ovlStore.count.tmp


* Get stats
* Get stats; median number of ovls (5'/3') for each lib
   overlapStats -G prefix.gkpStore -O prefix.ovlStore/ -o prefix
   overlapStats -G prefix.gkpStore -O prefix.ovlStore/ -o prefix
  cat prefix.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o
* Delete overlaps from *.obt.gz files
  zcat *.obt.gz  | convertOverlap -a -ovl |  ~/bin/difference1or2.pl -f delete.acc | convertOverlap -b -ovl | gzip > new.obt.gz
* Delete overlaps from ovlStore :  asm=>new
  overlapStore -d asm.ovlStore | perl -ane 'print $_ if($F[0]<$F[1]);' | ~/bin/difference1or2.pl -f delete.iid |  convertOverlap -b -ovl | gzip  >! new.ovb.gz
  overlapStore -c new.ovlStore -g asm.gkpStore/ new.ovb.gz
  #not necessary to gzip the overlap files
  overlapStore -d asm.ovlStore | perl -ane 'print $_ if($F[0]<$F[1]);' | ~/bin/difference1or2.pl -f delete.iid |  convertOverlap -b -ovl >! new.ovb
  overlapStore -c new.ovlStore -g asm.gkpStore/ new.ovb
* Edit(invert) overlaps from ovlStore : asm=>new
  ... > invert.iid    # get the iids of the reads to change         
  overlapStore -d asm.ovlStore  | p 'print $_ if($F[0]<=$F[1]);' | ~/bin/invertOvl.pl -f invert.iid | convertOverlap -b -ovl | gzip > new.ovb.gz
  overlapStore -c new.ovlStore -g asm.gkpStore new.ovb.gz
* Count how many overlaps have errors
  overlapStore -d asm.ovlStore | p 'print if($F[5]>0);' | wc -l
* Get 5'/3' overlap stats
  overlapStore -d asm.ovlStore | perl /nfshomes/dpuiu/Archives/JCVI/bin/countOverlaps53.pl | getSummary.pl -i 1 -t 5
  overlapStore -d asm.ovlStore | perl /nfshomes/dpuiu/Archives/JCVI/bin/countOverlaps53.pl | getSummary.pl -i 2 -t 3
* Build faster ...
  Example (Bee):
  e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job 
  e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                              # 3 min/job
  total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)
= Frgcorr =
* Fails on deleted reads with overlaps ; must delete the overlaps as well


= Unitiger =
= Unitiger =
* Unitig stats:
  cat 4-unitigger/*.cga.0 | perl ~/bin/utgLog2utgLen.pl | getSummary.pl -i 0 -j 1
* Library insert estimates:
  cat 4-unitigger/unitigger.err | grepi -A 9 LIB | pretty  # in this case 9 is the number of libs
  should compare with the original values
  gatekeeper -dumplibraries -tabular *.gkpStore | more
* Bog: don't let small inserts break unitigs:
  edit AS_BOG_MateChecker.cc
  void
  MateLocation::buildHappinessGraphs(Unitig *utg, DistanceCompute *globalStats) {
  for (uint32 mleidx=0; mleidx<_table.size(); mleidx++) {
    MateLocationEntry &loc = _table[mleidx];
    uint32 lib =  _fi->libraryIID(loc.mleFrgID1);
    //Aelksey Zimin also do not use short Illumina mate pairs
    //  if (globalStats[lib].samples < 10 )
    if (globalStats[lib].samples < 10 || globalStats[lib].stddev < 100 || globalStats[lib].mean < 500)
= Tigstore =
* list unitigs
  tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist    # ids 
  maID    isPresent  isDeleted  ptID  svID  fileOffset  covStat 
  0      1          0          1    2    0          3334   
  1      1          0          1    2    521908      -1     
  2      1          0          1    2    522764      543     
  ...
  tigStore -g asm.gkpStore -t asm.tigStore 2 -D properties    # exit: stdout    # properties
  unitig_coverage_stat        0 660
  unitig_microhet_prob        0 0.000000
  unitig_status              0 X
  unitig_unique_rept          0 X
  ...
* number of tigs
tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist | tail -1 | awk '{print $1}'
* list frags/consensus/layout in a unitig
  tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d frags
  tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d consensus
  tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d layout
  tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0  -d properties  # exit: stderr & stdout
* dump all unitig consensus
  tigStore -g asm.gkpStore -t asm.tigStore 2 -U -d consensus > utg.fasta
* get single read unitigs
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep ' 1$'
* get read/utg or ctg stats
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties |  grep ^unitig_num_frags | getSummary.pl -z 1 -i 2 -t unitig_num_frags
  tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties |  grep ^contig_num_frags | getSummary.pl -z 1 -i 2 -t contig_num_frags
* delete a unitig : get & update layout
  tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout | grep ^unitig | awk '{print $2}' | ~/bin/deleteUnitigIdLayout.pl > utg.layout.delete
  # or
  tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout |  ~/bin/deleteUnitigLayout.pl > utg.layout.delete
  tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.delete
  tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist | p 'print if($F[2]);'
* delete single frg unitigs
  tigStore -g asm.gkpStore -t asm.tigStore 2 -U -d layout | egrep '^unitig|^data.num_frags' | p 'chomp; print $_," "; print "\n" if(/^data/)' | p 'print $F[1],"\n" if($F[3]==1);' | ~/bin/deleteUnitigIdLayout.pl > utg.layout.delete
  tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.delete
* assign unitig -1 to all unitigs (for addition to a  new store)
  tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout | p 'if (/^unitig/) { print "unitig -1\n"} elsif(/^UTG/) { $F[7]=-1; print join " ",@F; print "\n" } else { print $_};' > utg.layout.add
  tigStore -g asm.gkpStore -t new.tigStore 2 -R utg.layout.add
  tigStore -g asm.gkpStore -t asm.tigStore 1 -u id -d layout | tee utg.layout | p 'if (/^unitig/) { print "unitig -1\n"} else { print $_};' > utg.layout.add
  tigStore -g asm.gkpStore -t new.tigStore 1 -R utg.layout.add
* fix unitig consensus
  touch utg.layout
  grep FAILED 5-consensus/asm*err | cut -f3 -d ' ' | sort -u -n | p 'system "tigStore -g asm.gkpStore -t asm.tigStore 2 -u $F[0] -d layout >> utg.layout\n";' 
  cat  utg.layout | updateUTGlayout.pl  >  utg.layout.update
  tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.update
  grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | cut -f1 | sort -u | p 'system "touch $F[0].success\n";'
  touch 5-consensus/consensus.success
* Filter 2+ fragment unitigs


= CGW =
= CGW =
* Long run times usually mean problems (conflicts) with the data
* Drbug
  gdb /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin//cgw
  $ run -j 1 -k 5 -r 5 -s 2 -z -P 2 -B 75000 -m 100 -g ./asm.gkpStore -t ./asm.tigStore -o asm
  bt
* Differences between the LAST run of CGW and PREVIOUS ones
  # options
  PREV      LAST
            -R 7            # restart from checkpoint file number 'ckp'
            -N ckp01-ABS    # restart from checkpoint location 'ckp' (see the timing file)
  -s 0      -s 2            # stone throwing level
  -S 0                        # S==0 : do not resolve surrogates
  -G                          # Don't generate output (cgw or cam)
  #files
  difference.pl 7-4-CGW.ls 7-0-CGW.ls
  asm.asm.cam
  asm.dregs.cam
  asm.partitionInfo  #  "equivalent of" 4-unitigger/asm.partitioningInfo
  asm.partitioning  #  "equivalent of" 4-unitigger/asm.partitioning
* running only one instance of ECR
  rm -fr 7-[234]*
  /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA -d . -p asm -s ./runCA.spec doExtendClearRanges=1 *.frg >> & runCA.log
* canceling ECR run (if doExtendClearRanges set to 1 or 2 and the run takes too long)
  grep ckp ./7-0-CGW/cgw.out
  ./7-0-CGW/asm.ckp.3 (logical ckp01-ABS) after building scaffolds at ...
  ./7-0-CGW/asm.ckp.4 (logical ckp03-ACD) after scaffold cleaning at ...
  ./7-0-CGW/asm.ckp.5 (logical ckp05-1SM) after 1st scaffold merge at ...
  ./7-0-CGW/asm.ckp.6 (logical ckp06-AS) after stone throwing at ...
  ./7-0-CGW/asm.ckp.7 (logical ckp08-2SM) after 2nd scaffold merge at ...
  ./7-0-CGW/asm.ckp.8 (logical ckp09-FR) after final rocks at ...
  cgw -R 7 -N ckp09-FR -j 1 -k 5 -r 5 -s 2 -z -P 2 -B 75000 -m 100 -g ./asm.gkpStore -t ./asm.tigStore -o ./7-0-CGW/asm
  grep ckp ./7-0-CGW/cgw.out
  ...
  ./7-0-CGW/asm.ckp.9 (logical ckp10-PS) after partial stones at ...
  ./7-0-CGW/asm.ckp.10 (logical ckp11-FCS) after final contained stones at ...
  ./7-0-CGW/asm.ckp.11 (logical ckp13-RS) after resolve surrogates at ...
  ./7-0-CGW/asm.ckp.12 (logical ckp14-FIN) after output at ...
* running ECR (if initially doExtendClearRanges=0) ; make sure cgwPurgeCheckpoints=0
  runCA -d . -p asm -s runCA.spec stopAfter=scaffolder cgwPurgeCheckpoints=0 doExtendClearRanges=0 *.frg >> & runCA.log
 
  rm 7-CGW
  cd 7-0-CGW
  mkdir OLD
  mv asm.ckp.9 asm.ckp.1? OLD asm.*.cam asm.partition* asm.timing OLD
  cd ..
 
  runCA -d . -p asm -s runCA.spec stopAfter=scaffolder cgwPurgeCheckpoints=0 doExtendClearRanges=1 *.frg >> & runCA.log
* DistUpadte file
  7-0-CGW/stat/scaffold_final.distupdate.dst
* Stats
  cat 7-0-CGW/rezlog/stone.i02.log | perl ~/bin/cgwLog2scfLen.pl | getSummary.pl -t scfLen
= ECR =
* doExtendClearRanges = 2 => cgw run 3 times, ECR 2 times


= Consensus =  
= Consensus =  


* Dump unitig seq
   tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D unitigs > unitigs.txt
   tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D unitigs > unitigs.txt
   cat unitigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -u  $F[0]\n";' >& unitigs.fasta.tmp
   cat unitigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -u  $F[0]\n";' >& unitigs.fasta.tmp
Line 199: Line 539:
   rm unitigs.fasta.tmp
   rm unitigs.fasta.tmp


* Dump contig seq
   tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D contigs > contigs.txt
   tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D contigs > contigs.txt
   cat contigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -c  $F[0]\n";'>& contigs.fasta.tmp
   cat contigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -c  $F[0]\n";'>& contigs.fasta.tmp
   cat contigs.fasta.tmp | grep -v null | sed 's/cns=//' | sed 's/-//g' | nl0 | tab2fasta.pl > contigs.fasta
   cat contigs.fasta.tmp | grep -v null | sed 's/cns=//' | sed 's/-//g' | nl0 | tab2fasta.pl > contigs.fasta
   rm contigs.fasta.tmp
   rm contigs.fasta.tmp
* Recompute unitig consensus
  utgcns -g ./asm.gkpStore -t ./asm.tigStore 2 011 -u 122533 -v -V


= Terminator =
= Terminator =
Line 208: Line 552:
= Log file & runtime (wgs-5.2) =
= Log file & runtime (wgs-5.2) =


* Use ~dpuiu/bin/getCAruntimes.pl
* Filter log
  ~/bin/getCAruntimes.pl -filter runCA.log > runCA.filter.log
 
* Display log
  cat runCA.log | ~/bin/getCAruntimes.pl -hour -total -plot  # => runCA.runTimes ; runCA.runTimes.gp ; runCA.runTimes.png
  display runCA.runTimes.png
* [[Media:RunCA.log|runCA.log]] [[Media:RunCA.log.times|runCA.times]] bos taurus BCM.CLONEEND 16,875 Sanger reads
* [[Media:RunCA.log|runCA.log]] [[Media:RunCA.log.times|runCA.times]] bos taurus BCM.CLONEEND 16,875 Sanger reads
* [[Media:RunCA.2.log|runCA.log]] [[Media:RunCA.2.times|runCA.times]] wolbachia symbioint of culex pipiens quinquefasciatus 36,767  Sanger reads
* [[Media:RunCA.2.log|runCA.log]] [[Media:RunCA.2.times|runCA.times]] wolbachia symbioint of culex pipiens quinquefasciatus 36,767  Sanger reads
* [[Media:runCA.runTimes.png|runCA.runTimes.png]] Bumblebee assembly


= Output processing =
= Output processing =
Line 228: Line 579:
   OLD/posmap2cvg.4.pl -mates prefix.posmap.mates < prefix.posmap.frgscf > prefix.posmap.scfmatecvg
   OLD/posmap2cvg.4.pl -mates prefix.posmap.mates < prefix.posmap.frgscf > prefix.posmap.scfmatecvg
   posmap2cvg.pl                                  < prefix.posmap.ctgscf > prefix.posmap.scfctgcvg
   posmap2cvg.pl                                  < prefix.posmap.ctgscf > prefix.posmap.scfctgcvg
* Lib assembly stats *
  cd 9-*
  gatekeeper -dumplibraries -tabular ../genome.gkpStore |                    pretty > genome.gkpStore.libinfo1
  gatekeeper -dumpinfo              ../genome.gkpStore | grep -A 100 ^Lib | pretty > genome.gkpStore.libinfo2
  gdt genome.gkpStore | p 'print "$F[0] $F[5]\n" unless($F[6]);' > genome.gkpStore.frg2lib
  paste genome.gkpStore.frg2lib genome.posmap.frags | p ' print "$F[1] $F[5]\n";' | count.pl >  genome.gkpStore.frgcount
  gdt genome.gkpStore | p 'print "$F[4] del\n" if($F[6] eq "1");' | count.pl >> genome.gkpStore.frgcount
  cat genome.gkpStore.frgcount | count2col.pl -i 0 -j 1 -k 2 | awk '{print $1,$4,$2,$3}' | pretty > genome.gkpStore.libcount
  more genome.gkpStore.lib* 
* Posmap2gff
  #getting frglib info
  gdt ../genome.gkpStore | p 'print "$F[0] $F[5]\n" unless($F[6]);' > genome.gkpStore.frg2lib
  # one of ...
  cat genome.posmap.frgscf | ~/bin/posmap2gff.pl -sou frg -fea frg -p > genome.posmap.frgscf.gff
  cat genome.posmap.frgscf | ~/bin/posmap2gff.pl -sf genome.gkpStore.frg2lib -fea frg -p | sed 's/[ab]//' > genome.posmap.frgscf.gff
  cat genome.posmap.frgscf | ~/bin/posmap2gff.pl -sf genome.gkpStore.frg2lib -fea frg -p >! genome.posmap.frgscf.gff
  cat genome.posmap.utgscf | ~/bin/posmap2gff.pl -sou utg -p >! genome.posmap.utgscf.gff
  cat genome.posmap.ctgscf | ~/bin/posmap2gff.pl -sou ctg -p >! genome.posmap.ctgscf.gff
  cat genome.posmap.scflen | perl -ane 'print "$F[0]\t$F[0]\t0\t$F[1]\tf\n";' | posmap2gff.pl -sou scf >>  genome.posmap.scflen.gff
  setenv SEQID 1374988
  cat genome*gff | p 'print $_ if($F[8]==$ENV{SEQID});' >! $SEQID.gff
  ~/szdevel/Apollo/apollo -i gff -f $SEQID.gff


* Get 0cvg regions
* Get 0cvg regions
Line 255: Line 638:
* Get scaff file
* Get scaff file
   ~/bin/posmap2scaff.pl prefix.posmap.ctgscf
   ~/bin/posmap2scaff.pl prefix.posmap.ctgscf
* Gaps longer than a certain size
  cat asm.posmap.ctgscf | p 'print $p,$_,"\n" if($F[1] eq $P[1] and $F[2]-$P[3]>1000); $p=$_;@P=@F;'
== Mate redundancy ==
* Sequence based (first 32bp in fwd/rev Illumina reads; max 1 mismatch)
  paste s_2_?_8kb*.txt | perl -ane 'chomp; if($.%4==1) { s/\@(\S+)[12]//; printf("%-45s",$1); } elsif($.%4==2) { print substr($F[0],0,32),"\t",substr($F[1],0,32),"\n" } ' | sort -k2,3 | ./getDuplicates.pl > s_2_8k.tab
  cat s_2_8k.tab | perl -ane 'print $F[0],"1\n";' >  s_2_1_8k.redundant
  cat s_2_8k.tab | perl -ane 'print $F[0],"2\n";' >  s_2_2_8k.redundant
* overlap(Store) based
  #Example
  getMateRedundancy.amos -D BEGIN=1 -D END=219  -D GKPSTORE=../asm.gkpStore -D OVLSTORE=../asm.ovlStore s_2_3kb
  #=> asm.redundant_mate_pairs , asm.redundant_mate_clusters
* Alignment based
  cat prefix*.soap2 | ~/bin/AMOS/soap2sameRef.pl -pos | sort -nk2 -nk3 | p 'print $_ if(@P and $F[1]-$P[1]<=1 and abs($F[2]-$P[2])<=1); @P=@F;' > prefix.redundant
* Posmap file based
  cat asm.posmap.frgscf |  ~/bin/posmap2ovl.pl -p 75 | grep -v ^ref | ~/bin/getMateRedundancy12.pl > asm.redundant_mate_pairs
  cat asm.redundant_mate_pairs |  perl ~/bin/cluster.pl | sort > asm.redundant_mate_clusters


== Fasta ==
== Fasta ==
Line 279: Line 684:


== Asm ==
== Asm ==
* Get message counts
cat asm.asm | grep "^{" | egrep -v 'MPS|UPS|VAR|CTP' | uniq -c | more
      1 {MDI
273375 {AFG  # reads
  36258 {AMP  # links
  2185 {UTG  # MPS+          messages
  4876 {ULK
  1443 {CCO  # VAR*,MPS+,UPS+ messages
  3277 {CLK
    15 {SCF
    40 {SLK  # CTP+          messages


* Get output seq
* Get output seq
Line 330: Line 747:
   scaff2agp.pl -i prefix.ctg.infoseq < prefix.scaff > prefix.agp
   scaff2agp.pl -i prefix.ctg.infoseq < prefix.scaff > prefix.agp


= Running subassemblies =
= Moving an assembly =
Update frg file:
* clr=ECR2
* MDI -> DST


runCA parameters:
* Minimum file set
* doOBT=no
  scp runCA.sh
  scp runCA.spec
  scp asm.gkpStore/*
  scp 3-overlapcorrection/asm.erates.updated
  scp 4-unitigger/asm.iidmap
  scp 4-unitigger/asm.partitioning
  scp 4-unitigger/asm.partitioningInfo
  scp 4-unitigger/unitigger.success
  scp 5-consensus/consensus.success
  scp 7-0-CGW/asm.timing
  scp 7-0-CGW/asm.ckp.11  # last one
  scp 7-0-CGW/asm.partitioning
  scp 7-0-CGW/asm.partitionInfo
  scp 7-0-CGW/cgw.out.00.ckp.11
  scp 7-0-CGW/cgw.out
  scp asm.tigStore/*
 
= Add reads to an assembly =
 
  ln -s ../wgs-prev/prefix.frg .
  ln -s ../Data/new.frg
 
  cp ../wgs-prev/runCA.spec ../wgs-prev/runCA.sh  .
 
  ln -s ../wgs-prev/0-mercounts
 
  cp -R ../wgs-prev/prefix.gkpStore
  gatekeeper -a -o prefix.gkpStore new.frg
  ll -d ../wgs-prev/1-overlapper/000000000*                  # if multiple jobs ...
  mkdir -p 1-overlapper/0000000001                          # mkdir -p 1-overlapper/0000000002 ...  ???
 
  cat ../wgs-prev/1-overlapper/*pl | grep '^"h' | tail -1    # get last qry/ref block  => "h0056000001r0050000000"
  cd 1-overlapper/0000000001
  ln -s ../../wgs-prev/1-overlapper/0000000001/*ovb.gz .
  rm h0056000001*.ovb.gz *r0050000000.ovb.gz                # this should change from case to case
  cd ../..
 
= Removing bad  mates =
 
  cat ../wgs-prev/9-terminator/asm.posmap.frags | grep bad > asm.posmap.frags.bad
  gatekeeper -dumpfragments -tabular -uid asm.posmap.frags.bad ./wgs-prev/asm.gkpStore |  grep -v ^UID | ~/bin/deleteLkg.pl -j 2 > asm.posmap.frags.bad.delete.lkg
 
  ln -s ../wgs-prev/*.frg
  gatekeeper -o asm.gkpStore *.frg asm.posmap.frags.bad.delete.lkg
 
  ln -s ../wgs-prev/0-mercounts/
  ln -s ../wgs-prev/asm.ovlStore/
  ln -s ../wgs-prev/3-overlapcorrection/
  ln -s ../wgs-prev/4-unitigger/
  cp -R ../wgs-prev/asm.tigStore/ .
  ln -s ../wgs-prev/5-consensus
  runCA ...


= Test datasets =
= Test datasets =
Line 359: Line 834:


== Read counts ==
== Read counts ==
   zcat anc.porphyromonas_gingivalis_w83.001.gz | ~/bin/traceanc2anc.pl SEQ_LIB_ID INSERT_SIZE  INSERT_STDEV SVECTOR_CODE | count.pl | t
   zcat anc.*.gz | ~/bin/traceanc2anc.pl SEQ_LIB_ID INSERT_SIZE  INSERT_STDEV SVECTOR_CODE | count.pl  
 
  SEQ_LIB_ID                                                                  INSERT_SIZE  INSERT_STDEV  SVECTOR_CODE          COUNT               
  T13146                                                                      500          150          PUC18                35287 
  PORPHYROMONAS-GINGIVALIS-W83,-FOSMID-END-SEQUENCING_PGINGIVALI-F-01-40KB    40000        8000          PCC1FOS              2876   
  T13150                                                                      10000        3000          LAMBDA DASHII        2352 
  T24315                                                                      .            .            DUMMYFORGLKMIGRATION  1149   
  T0612                                                                      500          150          PUC18                234   
  T0611                                                                      500          150          PUC18                177   
  T0628                                                                      10000        3000          LAMBDA DASHII        43   
  total                                                                      .            .            .                    42119


==  TA formatting ==
==  TA formatting ==


* frg FORMAT2  
* frg FORMAT2  
* TIs are frg uids
* TIs are frg uids


Line 390: Line 856:
   tracedb-to-frg.pl -only T13146 -frg xml.porphyromonas_gingivalis_w83.001.gz >>& tracedb-to-frg.log
   tracedb-to-frg.pl -only T13146 -frg xml.porphyromonas_gingivalis_w83.001.gz >>& tracedb-to-frg.log


* Sore reads are discarded?
== 454 SRA formatting ==
 
== Illumina formatting ==


== SRA formatting (454) ==
* use fastqToCA (wgs-6.0)


=== FASTQ ===
  fastqToCA \
    -insertsize 3000 300 \
    -libraryname 1 \
    -type illumina \
    -fastq /fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt | \
    <span style="background:yellow">sed 's/ori:I/ori:O/'</span> \
    > 1.frg
  =>
  {LIB                                                                                                                                     
  act:A                                                                                                                                   
  acc:1                                                                                                                                   
  ori:O                                                                                                                                   
  mea:3000.000                                                                                                                             
  std:300.000                                                                                                                             
  src:                                                                                                                                     
  .                                                                                                                                       
  nft:10                                                                                                                                   
  fea:                                                                                                                                     
  forceBOGunitigger=1                                                                                                                     
  isNotRandom=0                                                                                                                           
  doNotTrustHomopolymerRuns=0                                                                                                             
  doRemoveDuplicateReads=0                                                                                                                 
  doNotQVTrim=0                                                                                                                           
  goodBadQVThreshold=0                                                                                                                     
  doNotOverlapTrim=0                                                                                                                       
  usePackedFragments=1                                                                                                                     
  illuminaFastQType=illumina                                                                                                               
  illuminaSequence=/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt   
  .                                                                                                                                       
  }
 
  gatekeeper \
    -o asm.gkpStore.BUILDING \
    -T -F  \
    -E asm.gkpStore.errorLog \
    1.frg
  => Segmentation fault
 
== FASTQ ==


* This won't remove the linker:
* This won't remove the linker:
Line 400: Line 906:
   zcat SRR001351.fastq.gz | ~/bin/fastq2qual.pl > SRR001351.qual
   zcat SRR001351.fastq.gz | ~/bin/fastq2qual.pl > SRR001351.qual
   convert-fasta-to-v2.pl -454 -l SRR001351 -s SRR001351.seq -q SRR001351.qual
   convert-fasta-to-v2.pl -454 -l SRR001351 -s SRR001351.seq -q SRR001351.qual
* Cleaning seq
  ~/bin/OLD/cleanFastaq.pl \
    s_2_1_1.1kb_sequence.txt s_2_2_1.1kb_sequence.txt \                                    # input
    s_2_1_1.1kb_sequence.clean.txt  s_2_2_1.1kb_sequence.clean.txt \                        # output mates
    s_2_1_1.1kb_sequence.clean.single.txt  s_2_2_1.1kb_sequence.clean.single.txt            # output single
    echo prefix_1.fastq prefix_2.fastq  > prefix.ls
    quake.py -f prefix.ls
* Illumina note:
  ''We recommend removing reads that do not pass the GA analysis software Failed_Chastity fi lter before attempting to assemble the sequence.''
  ''The chastity of a base call is the ratio of the intensity of the greatest signal divided by the sum of the two greatest signals. ''
  ''Reads do not pass the quality fi lter if there are two or more base calls with chastity of less than 0.6 in the fi rst 25 cycles. ''
  ''These reads have an “N” in the last column of the GA analysis software export file.''
  ''Removed all reads that contained ambiguous characters (Ns).''
  ''Removed reads that did not contain at least 25 Q30 bases among the first 35 cycles (s35). ''


=== SFF ===
=== SFF ===
   sffToCA -libraryname SRR001351 -output SRR001351.frg SRR001351.sff
   sffToCA -libraryname SRR001351 -output SRR001351.frg SRR001351.sff


Line 408: Line 933:
* Command
* Command
   ~/wgs/Linux-amd64/bin/runCA -p pging -d testassembly porphyromonas_gingivalis_w83.* >& runCA.log
   ~/wgs/Linux-amd64/bin/runCA -p pging -d testassembly porphyromonas_gingivalis_w83.* >& runCA.log
= CAUG notes =
Illimina mates
* short: innies PE (paired ends)
* long:  outies MP (mate pairs) "lots of issues" (eg> : linker is one bp; solution : take 36bp from each end of the read to amke sure you don't get the linker
* wgs 6.1 : Illumina limit of 104 bp; can recompile the code o make it 232
<pre>
fastqToCA -insertsize 280 30 -libraryname yersinia_pestis.100bp.0280bp.005x -type illumina -innie -fastq $PWD/yersinia_pestis.100bp.0280bp.005x.1.fastq,$PWD/yersinia_pestis.100bp.0280bp.005x.2.fastq > illumina.frg
{VER
ver:2
}
{LIB
act:A
acc:yersinia_pestis.100bp.0280bp.005x
ori:I
mea:280.000
std:30.000
src:
.
nft:12
fea:
forceBOGunitigger=1
shortOverlapModel=1
isNotRandom=0
doNotTrustHomopolymerRuns=0
doRemoveDuplicateReads=0
doNotQVTrim=0
goodBadQVThreshold=0
doNotOverlapTrim=0
usePackedFragments=1
illuminaFastQType=illumina
illuminaOrientation=innie
illuminaSequence=/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.1.fastq,/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.2.fastq
.
}
{VER
ver:1
}
#####################
gatekeeper -o illumina.gkpStore illumina.frg
Starting file 'illumina.frg' at line 0.
Type set to ILLUMINA 1.3+.
Orientation set to INNIE.
Processing illumina reads from '/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.1.fastq'
                          and '/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.2.fastq'
GKP finished with no errors.
#####################
gatekeeper -dumpinfo illumina.gkpStore/
LOAD STATS
1      libInput
1      libLoaded
0      libErrors
0      libWarnings
0      frgInput
228742  frgLoaded
0      frgErrors
0      frgWarnings
0      lkgInput
0      lkgLoaded
0      lkgErrors
0      lkgWarnings
0      sffInput
0      sffLoaded
0      sffErrors
0      sffWarnings
0      sffLibCreated
0      plcInput
0      plcLoaded
0      plcErrors
0      plcWarnings
228742  numRandom
228742  numPacked
0      numNormal
0      numStrobe
GLOBAL STATS
228742  FRG
1      LIB
LibraryName                            numActiveFRG    numDeletedFRG  numMatedFRG    readLength      clearLength
GLOBAL                                  228742          0              218724          22874200        22191085
LegacyUnmatedReads                      0              0              0              0              0
yersinia_pestis.100bp.0280bp.005x      228742          0              218724          22874200        22191085
################
sffToCA -insertsize 3200 900 -libraryname porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff -trim chop -linker flx -output 454 porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff
loadSFF()-- Loading 'porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff'.
removeDuplicateReads()-- from 1 to 19622
removeDuplicateReads()-- finished
detectMates()-- from 1 to 19621
#################
Overlap: check the hash load
  6% is too low
~75% would be ideal
</pre>

Latest revision as of 20:16, 26 July 2011

Links

Data Issues

  • Problems if the coverage > read length; solution: sample the reads
  • Analyze the data before assemble it:
    • If there is a reference genome map the reads to it
    • Plot kmer frequencies (use multiple kmer sizes) ; see how they compare with the estimated coverage

Sourceforge

  • Constants:
 cat src/AS_global.h
 ...
 #define AS_READ_MIN_LEN                   64  # should decrease for short Illumina reads  32 ???
 #define AS_OVERLAP_MIN_LEN                40  # should decrease for short Illumina reads  30 ???  

 #define AS_READ_MAX_PACKED_LEN            104  # max Illumina read length; incread to 124
 
 #define AS_READ_MAX_NORMAL_LEN_BITS       11   # should be increased to 12+ if there are many frequent kmesra nd dpn't fit in the hash
cat src/AS_CGW/AS_CGW_dataTypes.h
#define CGW_MIN_DISCRIMINATOR_UNIQUE_LENGTH 1000   # should be decraesed for short reads assemblies ; all unitigs less than that do not get scaffolded


  • Download & compile CVS tip
 cvs -z3 -d:pserver:anonymous@wgs-assembler.cvs.sourceforge.net:/cvsroot/wgs-assembler co -P wgs-assembler

 cd wgs-assembler/src
 make SITE_NAME=LOCAL 

 => executables under wgs-assembler/Linux-amd64/bin/
  • Download & compile release
 wget ...

 bzip2 -dc wgs-6.0-beta.tar.bz2 | tar -xf -
 cd wgs-6.0-beta
 cd kmer
 sh configure.sh
 gmake install
 cd ../src
 gmake
 cd ..
  • Compile for debugging :
 make SITE_NAME=LOCAL  BUILDDEBUG=1  # in "c_make.as"

Run

  • Help
runCA.pl -fields
runCA.pl -options
  • Batch submission
 touch runCA.sh
   echo --- runCA $$ restarted by $USER at `date` --- >>& runCA.log
   ~/bin/runCA -d . -p prefix -s ~/bin/runCA.spec  *.frg >>& runCA.log
 at runCA.sh   
  • Sanger
 runCA -d . -p prefix -s spec Sanger.frg
  • doOverlapTrimming=1 (default)
  • low qual Sanger reads (ex: 20 or 'D') are discarded
  • Illumina reads are trimmed at 5'/3' (bp with qual < 5 (E) are deleted; read is deleted if new length< 64)
  • astatLowBound 1 (default)  ; can be lowered for low/un-uniform coverage
  • astatHighBound 5 (default)
  • computeInsertSize=0 (default)
  • 454 & Hybrid : convert sff to frg
 sffToCA -libraryname 454 -output 454 454.sff
 runCA -d . ovlOverlapper=mer unitigger=bog -p prefix *.frg
 runCA -d . -s ~/bin/runCA.454.spec -p prefix *.frg 
  • gatekeeper -L : search for 454 paired end linker
  • AS_GKP_sff.c  : Load_SFF function
  • for mated libs set mean=3000,stdev=300

Parameters

  • stopAfter:
   initialStoreBuilding
   overlapBasedTrimming | OBT     
   overlapper                         # runs overlap, overlapStore
   unitigger                          # runs frg/ovl correction
   consensusAfterUnitigger
   scaffolder
   consensusAfterScaffolder

  • Example:
 ~/bin/runCA -d . -p asm -s ~/bin/runCA.454.spec stopAfter=initialStoreBuilding *.frg

Gatekeeper

Build

  • Create & add
 gatekeeper    -o prefix.gkpStore -T -F prefix*frg
 gatekeeper -a -o prefix.gkpStore -T -F prefix*frg

Update

  • edit frg: act:R !!!
 {DST
 act:R
 acc:10691
 mea:161662
 std:28760
 }
  • append
 gatekeeper -a -o prefix.gkpStore -T -F edit.frg
  • Using tab file
 Example: No restrictions on lib stddev with --edit option
 cat prefix.update
 lib     uid     19070   mean    167001
 lib     uid     19070   stddev  167001
 lib     uid     19070   isNotRandom     1
 lib     uid     19070   Orientation     O            # outie orientation for Illimina libs >1K
    
 gatekeeper --edit prefix.update prefix.gkpStore
  • Update read CLR/CLV
 gatekeeper -a -r prefix.clr -o prefix.gkpStore   # not in the wgs release
 gatekeeper -a -v prefix.clv -o prefix.gkpStore

Delete

  • DST messages can not be deleted
  • A FRG can be deleted only if there is no related LKG message; an abbreviated MSG
  • act:D !!!
 # frg1
 {LKG
 act:D
 typ:M
 fg1:94376
 fg2:94654
 }
 {FRG
 act:D
 acc:94376
 }

 # frg2 ???
 {LKG
 act:D
 fg:94376
 fg:94654
 }
 {FRG
 act:D
 acc:94376
 }
 cat delete.ids | perl -ane 'print "{FRG\nact:D\nacc:$F[0]\n}\n";'> delete.frg
 cat delete.ids | ~/bin/deleteFrg.pl -forward 1 -reverse 2 > delete.frg            # forward reads have suffix 1; forward reads have suffix 1

 gatekeeper -a -o prefix.gkpStore -T -F delete.frg

Dump

  • INFO
 gatekeeper -dumpinfo prefix.gkpStore
  • LAST frg in store
 gatekeeper -dumpinfo -lastfragiid prefix.gkpStore
  • FASTA sequences
 gatekeeper -dumpfastaseq prefix.gkpStore > prefix.seq
 gatekeeper -dumpfastaqlt prefix.gkpStore > prefix.qual
  • FASTA reads for newbler
 gatekeeper -dumpnewbler newbler prefix.gkpStore => prefix.fna,prefix.fna.qual

 grep "^>" prefix.fna | head -2
 >SRR001355.3635.1a template=2020+2021 dir=F library=SRX000348 trim=20-117
 >SRR001355.3635.1b template=2020+2021 dir=R library=SRX000348 trim=1-130
  • FASTQ reads
 gatekeeper -dumpfastq prefix. prefix.gkpStore
 =>
 prefix.1.fastq
 prefix.2.fastq
 prefix.paired.fastq    # contains both asm.1.fastq & asm.2.fastq reads
 prefix.unmated.fastq
 cat prefix.1.fastq | perl ~/bin/clrFastq.pl > prefix.1.clr.fastq 
 cat prefix.2.fastq | perl ~/bin/clrFastq.pl > prefix.2.clr.fastq 
 cat prefix.paired.fastq | perl ~/bin/clrFastq.pl > prefix.paired.clr.fastq 
 cat prefix.unmated.fastq | perl ~/bin/clrFastq.pl > prefix.unmated.clr.fastq 


  • FRG file
 gatekeeper -dumpfrg prefix.gkpStore > prefix.frg
 gatekeeper -dumpfrg prefix.gkpStore -format2 > prefix.frg2 
  • READS in TAB format
 gatekeeper -dumpfragments -tabular prefix.gkpStore > prefix.gkpStore.info
 gatekeeper -dumpfragments -tabular prefix.gkpStore | grep -v ^UID | awk '{print $1,$2}' > prefix.gkpStore.uid2iid
  • LIBS in TAB format
 gatekeeper -dumplibraries -tabular prefix.gkpStore | grep -v ^UID | awk '{print $1,$4,$5}' | sed 's/\.000//g'> prefix.dst  # original vals
 asm2mdi.pl < prefix.asm > prefix.mdi                                                                                       # final vals
  • READ CLR's (default=LATEST)
 gatekeeper -dumpfragments -tabular prefix.gkpStore/ | perl -ane 'print join "\t", @F[1,0,-2,-1,6]; print "\n";' | head | pretty -o
 IID    UID                                  clrBeginLATEST  clrEndLATEST  isDeleted  
 1      HWI-EAS385_0001:1:1:1569:15197#0/1   0               65            0          
 2      HWI-EAS385_0001:1:1:2095:12498#0/1   0               67            0          
 3      HWI-EAS385_0001:1:1:5452:4010#0/1    0               78            0          
 4      HWI-EAS385_0001:1:1:12400:16773#0/1  0               105           0          
 5      HWI-EAS385_0001:1:1:12884:18158#0/1  0               80            0          
 ...
 
 gatekeeper -dumpfragments -tabular prefix.gkpStore/  | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.clr
 gatekeeper -dumpfragments -tabular -clear [CLR|OBTINITIAL|OBTMERGE|OBTCHIMERA|ECR_0|ECR_1|ECR_2|LATEST] prefix.gkpStore/  | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.?.clr
 gatekeeper -dumpfragments -tabular -clear CLR asm.gkpStore        | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.CLR.clr
 gatekeeper -dumpfragments -tabular -clear OBTINITIAL asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.OBTINITIAL.clr &
 gatekeeper -dumpfragments -tabular -clear OBTCHIMERA asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.OBTCHIMERA.clr &
 ...
 gatekeeper -dumpfragments -tabular -clear LATEST asm.gkpStore     | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.LATEST.clr &
  • READ CLR's summary
 gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear | perl -ane '  print $1," ",$3-$2,"\n" if(/(\w+),(\d+),(\d+)$/);' | ~/bin/sum2.pl 
  • MATED
 frg22mates.pl < prefix.frg > prerfix.mates
  • Deleted reads
 gdt asm.gkpStore | p 'print $F[0],"\n" if($F[6]);'
  • All scaffold reads in FRG2 format
 cat asm.posmap.frgscf | grep scfid > scfid.posmap.frgscf
 gatekeeper -uid scfid.posmap.frgscf -dumpfrg -donotfixmates -format2 asm.gkpStore > scfid.frg

Id mapping

  • Frg FASTQ packed format id's are no recorded in the gatekeeper; assgned a new UID ; to map them back ...
 ~/bin/fastq2wgsUIDs.pl s_*txt > s.uid2iid2id
 gatekeeper -dumpfragments -tabular asm.gkpStore | awk '{print $1,$2,$10}' > s.uid2iid

 paste s.uid2iid2id s.uid2iid | p 'print "ERROR $_" if($F[3] ne $F[-1]);' # check the lengths

Meryl

 # Get mer frequencies
 meryl -C -B -m 22 -s prefix.seq -o prefix.22mers 
 # Dump mer sequences
 meryl -Dt -s prefix.22mers > prefix.22mers.seq
 # Dump histogram
 meryl -Dh -s prefix.22mers | sort -nk2 -r  >  prefix.22mers.histo
 # Get missing mers
 meryl -C -B -m 22 -s ref.fasta -o ref.22mers
 meryl -C -B -m 22 -s qry.fasta -o qry.22mers
 meryl -M sub -s ref.22mers -s qry.22mers -o diff.22mers
 meryl -Dt -s diff.22mers  > diff.22mers.seq
 # Combine mers
  • jellyfish
 # Get mer frequencies
 jellyfish count -s 1000000000 -m 18 --both-strands -t 16 -o prefix.jf prefix*.fastq [prefix*.seq]
 jellyfish merge prefix.jf_? -o prefix.jf

 # Dump histogram
 jellyfish histo prefix.jf 


  • countKmers.pl (other)
 cat lib.seq | ~/bin/countKmers.pl -k 22 | head -20 |          grep -v "^>" > 22mers.fwd.grep
 cat lib.seq | ~/bin/countKmers.pl -k 22 | head -20 | revseq | grep -v "^>" > 22mers.rev.grep
 cat 22mers.fwd.grep 22mers.rev.grep > 22mers.grep
 cat lib.seq | ~/bin/fgrep.pl -f 22mer.grep | count.pl



meryl -C -B -m 12 -s prefix.seq -o prefix.12mers

meryl -Dh -s prefix.12mers | sort -nk2 -r | more

OBT

  • runCA spec
 doOverlapTrimming = 1
  • command
 overlap -G ...   # -G to compute partial overlaps
  • Dump
 overlapStore -d ./0-overlaptrim/asm.obtStore
  • Number of jobs: should be < fileListMax (1024*10 by default) otherwise overlapStore fails
 ~/bin/getOvlJobsCount.pl ovlHashBlockSize ovlRefBlockSize frgCount
 #or
 cat 0-overlaptrim-overlap/ovlopts.pl | grep -c '^"h'

Overlap

  • Jobs
 merOverlapperSeedBatchSize     100000
 merOverlapperExtendBatchSize   75000
 
 overmerry.sh :       #frgs/merOverlapperSeedBatchSize   jobs -> seeds/ dir
 olap-from-seeds.sh : #frgs/merOverlapperExtendBatchSize jobs -> olaps/ dir
  • Num ber of jobs
 cat 1-overlapper/ovlopts.pl | grep -c '^"h'
  • Restart in runCA
 rm 1-overlapper/o*

OverlapStore

  • Flags:
 -t -M            : don't improve performance much
 -I ignore_file   : not implemented
  • Dump
 overlapStore -d prefix.ovlStore > prefix.ovl.tab

 overlapStore -d prefix.ovlStore | awk '{print $1}' | uniq -c | awk '{print $2,$1}' > prefix.ovlStore.count.tmp
 cat prefix.ovlStore.count.tmp | perl -ane 'foreach ($p+1..$F[0]-1) { print "$_ 0\n" } $p=$F[0]; print $_'  > prefix.ovlStore.count
 rm prefix.ovlStore.count.tmp
  • Get stats; median number of ovls (5'/3') for each lib
 overlapStats -G prefix.gkpStore -O prefix.ovlStore/ -o prefix
 cat prefix.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o
  • Delete overlaps from *.obt.gz files
 zcat *.obt.gz  | convertOverlap -a -ovl |  ~/bin/difference1or2.pl -f delete.acc | convertOverlap -b -ovl | gzip > new.obt.gz
  • Delete overlaps from ovlStore : asm=>new
 overlapStore -d asm.ovlStore | perl -ane 'print $_ if($F[0]<$F[1]);' | ~/bin/difference1or2.pl -f delete.iid |  convertOverlap -b -ovl | gzip  >! new.ovb.gz
 overlapStore -c new.ovlStore -g asm.gkpStore/ new.ovb.gz

 #not necessary to gzip the overlap files
 overlapStore -d asm.ovlStore | perl -ane 'print $_ if($F[0]<$F[1]);' | ~/bin/difference1or2.pl -f delete.iid |  convertOverlap -b -ovl >! new.ovb
 overlapStore -c new.ovlStore -g asm.gkpStore/ new.ovb
  • Edit(invert) overlaps from ovlStore : asm=>new
 ... > invert.iid     # get the iids of the reads to change          
 overlapStore -d asm.ovlStore  | p 'print $_ if($F[0]<=$F[1]);' | ~/bin/invertOvl.pl -f invert.iid | convertOverlap -b -ovl | gzip > new.ovb.gz
 overlapStore -c new.ovlStore -g asm.gkpStore new.ovb.gz
  • Count how many overlaps have errors
 overlapStore -d asm.ovlStore | p 'print if($F[5]>0);' | wc -l
  • Get 5'/3' overlap stats
 overlapStore -d asm.ovlStore | perl /nfshomes/dpuiu/Archives/JCVI/bin/countOverlaps53.pl | getSummary.pl -i 1 -t 5 
 overlapStore -d asm.ovlStore | perl /nfshomes/dpuiu/Archives/JCVI/bin/countOverlaps53.pl | getSummary.pl -i 2 -t 3


  • Build faster ...
 Example (Bee):
 e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job  
 e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                               # 3 min/job 
 total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)

Frgcorr

  • Fails on deleted reads with overlaps ; must delete the overlaps as well

Unitiger

  • Unitig stats:
 cat 4-unitigger/*.cga.0 | perl ~/bin/utgLog2utgLen.pl | getSummary.pl -i 0 -j 1
  • Library insert estimates:
 cat 4-unitigger/unitigger.err | grepi -A 9 LIB | pretty   # in this case 9 is the number of libs
 should compare with the original values
 gatekeeper -dumplibraries -tabular *.gkpStore | more
  • Bog: don't let small inserts break unitigs:
 edit AS_BOG_MateChecker.cc

 void
 MateLocation::buildHappinessGraphs(Unitig *utg, DistanceCompute *globalStats) {

 for (uint32 mleidx=0; mleidx<_table.size(); mleidx++) {
   MateLocationEntry &loc = _table[mleidx];

   uint32 lib =  _fi->libraryIID(loc.mleFrgID1);

   //Aelksey Zimin also do not use short Illumina mate pairs
   //  if (globalStats[lib].samples < 10 )
   if (globalStats[lib].samples < 10 || globalStats[lib].stddev < 100 || globalStats[lib].mean < 500)

Tigstore

  • list unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist     # ids  
 maID    isPresent  isDeleted  ptID  svID  fileOffset  covStat  
 0       1          0          1     2     0           3334     
 1       1          0          1     2     521908      -1       
 2       1          0          1     2     522764      543      
 ...
 tigStore -g asm.gkpStore -t asm.tigStore 2 -D properties     # exit: stdout    # properties
 unitig_coverage_stat        0 660
 unitig_microhet_prob        0 0.000000
 unitig_status               0 X
 unitig_unique_rept          0 X
 ...
  • number of tigs
tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist | tail -1 | awk '{print $1}'
  • list frags/consensus/layout in a unitig
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d frags 
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d consensus
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d layout
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0  -d properties  # exit: stderr & stdout

  • dump all unitig consensus
 tigStore -g asm.gkpStore -t asm.tigStore 2 -U -d consensus > utg.fasta
  • get single read unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep ' 1$' 
  • get read/utg or ctg stats
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties |  grep ^unitig_num_frags | getSummary.pl -z 1 -i 2 -t unitig_num_frags 
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties |  grep ^contig_num_frags | getSummary.pl -z 1 -i 2 -t contig_num_frags 
  • delete a unitig : get & update layout
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout | grep ^unitig | awk '{print $2}' | ~/bin/deleteUnitigIdLayout.pl > utg.layout.delete
 # or
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout |  ~/bin/deleteUnitigLayout.pl > utg.layout.delete
 tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.delete
 tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist | p 'print if($F[2]);'
  • delete single frg unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 2 -U -d layout | egrep '^unitig|^data.num_frags' | p 'chomp; print $_," "; print "\n" if(/^data/)' | p 'print $F[1],"\n" if($F[3]==1);' | ~/bin/deleteUnitigIdLayout.pl > utg.layout.delete 
 tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.delete
  • assign unitig -1 to all unitigs (for addition to a new store)
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout | p 'if (/^unitig/) { print "unitig -1\n"} elsif(/^UTG/) { $F[7]=-1; print join " ",@F; print "\n" } else { print $_};' > utg.layout.add
 tigStore -g asm.gkpStore -t new.tigStore 2 -R utg.layout.add
 tigStore -g asm.gkpStore -t asm.tigStore 1 -u id -d layout | tee utg.layout | p 'if (/^unitig/) { print "unitig -1\n"} else { print $_};' > utg.layout.add
 tigStore -g asm.gkpStore -t new.tigStore 1 -R utg.layout.add
  • fix unitig consensus
 touch utg.layout
 grep FAILED 5-consensus/asm*err | cut -f3 -d ' ' | sort -u -n | p 'system "tigStore -g asm.gkpStore -t asm.tigStore 2 -u $F[0] -d layout >> utg.layout\n";'  
 cat  utg.layout | updateUTGlayout.pl  >  utg.layout.update
 tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.update

 grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | cut -f1 | sort -u | p 'system "touch $F[0].success\n";' 
 touch 5-consensus/consensus.success
  • Filter 2+ fragment unitigs

CGW

  • Long run times usually mean problems (conflicts) with the data
  • Drbug
 gdb /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin//cgw
 $ run -j 1 -k 5 -r 5 -s 2 -z -P 2 -B 75000 -m 100 -g ./asm.gkpStore -t ./asm.tigStore -o asm
 bt
  • Differences between the LAST run of CGW and PREVIOUS ones
 # options
 PREV       LAST
            -R 7             # restart from checkpoint file number 'ckp'
            -N ckp01-ABS     # restart from checkpoint location 'ckp' (see the timing file)
 -s 0       -s 2             # stone throwing level
 -S 0                        # S==0 : do not resolve surrogates
 -G                          # Don't generate output (cgw or cam) 
 #files
 difference.pl 7-4-CGW.ls 7-0-CGW.ls
 asm.asm.cam
 asm.dregs.cam
 asm.partitionInfo  #  "equivalent of" 4-unitigger/asm.partitioningInfo 
 asm.partitioning   #  "equivalent of" 4-unitigger/asm.partitioning
  • running only one instance of ECR
 rm -fr 7-[234]*
 /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA -d . -p asm -s ./runCA.spec doExtendClearRanges=1 *.frg >> & runCA.log
  • canceling ECR run (if doExtendClearRanges set to 1 or 2 and the run takes too long)
 grep ckp ./7-0-CGW/cgw.out 
 ./7-0-CGW/asm.ckp.3 (logical ckp01-ABS) after building scaffolds at ...
 ./7-0-CGW/asm.ckp.4 (logical ckp03-ACD) after scaffold cleaning at ...
 ./7-0-CGW/asm.ckp.5 (logical ckp05-1SM) after 1st scaffold merge at ...
 ./7-0-CGW/asm.ckp.6 (logical ckp06-AS) after stone throwing at ...
 ./7-0-CGW/asm.ckp.7 (logical ckp08-2SM) after 2nd scaffold merge at ...
 ./7-0-CGW/asm.ckp.8 (logical ckp09-FR) after final rocks at ...

 cgw -R 7 -N ckp09-FR -j 1 -k 5 -r 5 -s 2 -z -P 2 -B 75000 -m 100 -g ./asm.gkpStore -t ./asm.tigStore -o ./7-0-CGW/asm

 grep ckp ./7-0-CGW/cgw.out 
 ...
 ./7-0-CGW/asm.ckp.9 (logical ckp10-PS) after partial stones at ...
 ./7-0-CGW/asm.ckp.10 (logical ckp11-FCS) after final contained stones at ...
 ./7-0-CGW/asm.ckp.11 (logical ckp13-RS) after resolve surrogates at ...
 ./7-0-CGW/asm.ckp.12 (logical ckp14-FIN) after output at ...
  • running ECR (if initially doExtendClearRanges=0) ; make sure cgwPurgeCheckpoints=0
 runCA -d . -p asm -s runCA.spec stopAfter=scaffolder cgwPurgeCheckpoints=0 doExtendClearRanges=0 *.frg >> & runCA.log
 
 rm 7-CGW
 cd 7-0-CGW
 mkdir OLD
 mv asm.ckp.9 asm.ckp.1? OLD asm.*.cam asm.partition* asm.timing OLD
 cd ..
 
 runCA -d . -p asm -s runCA.spec stopAfter=scaffolder cgwPurgeCheckpoints=0 doExtendClearRanges=1 *.frg >> & runCA.log
  • DistUpadte file
 7-0-CGW/stat/scaffold_final.distupdate.dst
  • Stats
 cat 7-0-CGW/rezlog/stone.i02.log | perl ~/bin/cgwLog2scfLen.pl | getSummary.pl -t scfLen

ECR

  • doExtendClearRanges = 2 => cgw run 3 times, ECR 2 times

Consensus

  • Dump unitig seq
 tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D unitigs > unitigs.txt
 cat unitigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -u  $F[0]\n";' >& unitigs.fasta.tmp
 cat unitigs.fasta.tmp  | grep -v null | sed 's/cns=//' | sed 's/-//g' | nl0 | tab2fasta.pl > unitigs.fasta
 rm unitigs.fasta.tmp
  • Dump contig seq
 tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D contigs > contigs.txt
 cat contigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -c  $F[0]\n";'>& contigs.fasta.tmp
 cat contigs.fasta.tmp | grep -v null | sed 's/cns=//' | sed 's/-//g' | nl0 | tab2fasta.pl > contigs.fasta
 rm contigs.fasta.tmp
  • Recompute unitig consensus
 utgcns -g ./asm.gkpStore -t ./asm.tigStore 2 011 -u 122533 -v -V

Terminator

Log file & runtime (wgs-5.2)

  • Filter log
 ~/bin/getCAruntimes.pl -filter runCA.log > runCA.filter.log
  • Display log
 cat runCA.log | ~/bin/getCAruntimes.pl -hour -total -plot  # => runCA.runTimes ; runCA.runTimes.gp ; runCA.runTimes.png
 display runCA.runTimes.png

Output processing

Posmap

  • Get posMap files
buildPosMap -o prefix < prefix.asm
  • Get number of reads/ctg or scf (from the pos files)
 cat prefix.posmap.frgctg | count.pl -i 1 > prefix.posmap.ctgfrgcnt
 cat prefix.posmap.frgscf | count.pl -i 1 > prefix.posmap.scffrgcnt
 cat prefix.posmap.ctgscf | count.pl -i 1 > prefix.posmap.scfctgcnt
  • Compute read/mate/ctg cvg (from the pos files)
 posmap2cvg.pl                                  < prefix.posmap.frgscf > prefix.posmap.scffrgcvg
 OLD/posmap2cvg.4.pl -mates prefix.posmap.mates < prefix.posmap.frgscf > prefix.posmap.scfmatecvg
 posmap2cvg.pl                                  < prefix.posmap.ctgscf > prefix.posmap.scfctgcvg
  • Lib assembly stats *
 cd 9-*
 gatekeeper -dumplibraries -tabular ../genome.gkpStore |                    pretty > genome.gkpStore.libinfo1
 gatekeeper -dumpinfo               ../genome.gkpStore | grep -A 100 ^Lib | pretty > genome.gkpStore.libinfo2

 gdt genome.gkpStore | p 'print "$F[0] $F[5]\n" unless($F[6]);' > genome.gkpStore.frg2lib
 paste genome.gkpStore.frg2lib genome.posmap.frags | p ' print "$F[1] $F[5]\n";' | count.pl >  genome.gkpStore.frgcount
 gdt genome.gkpStore | p 'print "$F[4] del\n" if($F[6] eq "1");' | count.pl >> genome.gkpStore.frgcount
 cat genome.gkpStore.frgcount | count2col.pl -i 0 -j 1 -k 2 | awk '{print $1,$4,$2,$3}' | pretty > genome.gkpStore.libcount 

 more genome.gkpStore.lib*  
  • Posmap2gff
 #getting frglib info
 gdt ../genome.gkpStore | p 'print "$F[0] $F[5]\n" unless($F[6]);' > genome.gkpStore.frg2lib
 # one of ...
 cat genome.posmap.frgscf | ~/bin/posmap2gff.pl -sou frg -fea frg -p > genome.posmap.frgscf.gff
 cat genome.posmap.frgscf | ~/bin/posmap2gff.pl -sf genome.gkpStore.frg2lib -fea frg -p | sed 's/[ab]//' > genome.posmap.frgscf.gff
 cat genome.posmap.frgscf | ~/bin/posmap2gff.pl -sf genome.gkpStore.frg2lib -fea frg -p >! genome.posmap.frgscf.gff

 cat genome.posmap.utgscf | ~/bin/posmap2gff.pl -sou utg -p >! genome.posmap.utgscf.gff
 cat genome.posmap.ctgscf | ~/bin/posmap2gff.pl -sou ctg -p >! genome.posmap.ctgscf.gff
 cat genome.posmap.scflen | perl -ane 'print "$F[0]\t$F[0]\t0\t$F[1]\tf\n";' | posmap2gff.pl -sou scf >>  genome.posmap.scflen.gff
 setenv SEQID 1374988
 cat genome*gff | p 'print $_ if($F[8]==$ENV{SEQID});' >! $SEQID.gff
 ~/szdevel/Apollo/apollo -i gff -f $SEQID.gff
  • Get 0cvg regions
 posmap2cvg.pl -f prefix.posmap.scflen -Max 0 < prefix.posmap.frgscf  # scaff gaps
 posmap2cvg.pl -f prefix.posmap.ctglen -Max 0 < prefix.posmap.frgctg  # surrogates
  • Get links
 ~/bin/scf2scflnk.sh prefix                      
 ~/bin/ctg2ctglnk.sh prefix                      
 ...
  • Add ctg/scf into to the mates fils
 join12.pl prefix.posmap.mates prefix.posmap.frgctg > prefix.posmap.mates2.tmp
 join12.pl prefix.posmap.mates2.tmp prefix.posmap.frgscf > prefix.posmap.mates2
 rm prefix.posmap.mates2.tmp
  • Display graphs
cat prefix.posmap.scf2scflnk | ~/bin/tab2dot.pl | dot -Tpng -o prefix.posmap.scf2scflnk.png
cat prefix.posmap.ctg2ctglnk | ~/bin/tab2dot.pl | dot -Tpng -o prefix.posmap.ctg2ctglnk.png
  • Qc stats based on posmap files
 posmap2qc.pl -ctgscf prefix.posmap.ctgscf > prefix.qc
  • Break scf/ctg pipeline
 ~/bin/breakPosmapKeep.amos
  • Get scaff file
 ~/bin/posmap2scaff.pl prefix.posmap.ctgscf
  • Gaps longer than a certain size
 cat asm.posmap.ctgscf | p 'print $p,$_,"\n" if($F[1] eq $P[1] and $F[2]-$P[3]>1000); $p=$_;@P=@F;'

Mate redundancy

  • Sequence based (first 32bp in fwd/rev Illumina reads; max 1 mismatch)
 paste s_2_?_8kb*.txt | perl -ane 'chomp; if($.%4==1) { s/\@(\S+)[12]//; printf("%-45s",$1); } elsif($.%4==2) { print substr($F[0],0,32),"\t",substr($F[1],0,32),"\n" } ' | sort -k2,3 | ./getDuplicates.pl > s_2_8k.tab
 cat s_2_8k.tab | perl -ane 'print $F[0],"1\n";' >  s_2_1_8k.redundant
 cat s_2_8k.tab | perl -ane 'print $F[0],"2\n";' >  s_2_2_8k.redundant
  • overlap(Store) based
 #Example
 getMateRedundancy.amos -D BEGIN=1 -D END=219  -D GKPSTORE=../asm.gkpStore -D OVLSTORE=../asm.ovlStore s_2_3kb
 #=> asm.redundant_mate_pairs , asm.redundant_mate_clusters
  • Alignment based
 cat prefix*.soap2 | ~/bin/AMOS/soap2sameRef.pl -pos | sort -nk2 -nk3 | p 'print $_ if(@P and $F[1]-$P[1]<=1 and abs($F[2]-$P[2])<=1); @P=@F;' > prefix.redundant
  • Posmap file based
 cat asm.posmap.frgscf |  ~/bin/posmap2ovl.pl -p 75 | grep -v ^ref | ~/bin/getMateRedundancy12.pl > asm.redundant_mate_pairs 
 cat asm.redundant_mate_pairs |  perl ~/bin/cluster.pl | sort > asm.redundant_mate_clusters

Fasta

  • Align deg to scaff
 nucmer -mum -l 40 -c 250 scf.fasta deg.fasta  -p scf-deg
 delta-filter -q scf-deg.delta > scf-deg.filter-q.delta
 ~/bin/delta2pairCvg+.pl < scf-deg.filter-q.delta > scf-deg.filter-q.cvg.pairs
 ~/bin/max2.pl -i 4 -j 6 < scf-deg.filter-q.cvg.pairs > scf-deg.filter-q.maxCvg.pairs
 cat scf-deg.filter-q.maxCvg.pairs | p 'print $_ if($F[5]-$F[6]<2000);' > scf-deg.filter-q.variants.pairs
 awk '{print $5,$6}' scf-deg.filter-q.variants.pairs > scf-deg.filter-q.variants.ids
  • Indexing
 # contigs
 ~/bin/formatdb.pl < prefix.ctg.fasta > prefix.ctg.fasta.offset
 ~/bin/fastacmd.pl -o prefix.ctg.fasta.offset -f filter.ctg.ids < prefix.ctg.fasta > filter.ctg.fasta
 # same for scaff, deg ...
  • Contaminant search : nucmer against
 /nfshomes/dpuiu/db/Ecoli.all
 /nfshomes/dpuiu/db/UniVec_Core
 /nfshomes/dpuiu/db/OtherVec

Asm

  • Get message counts
cat asm.asm | grep "^{" | egrep -v 'MPS|UPS|VAR|CTP' | uniq -c | more
     1 {MDI
273375 {AFG  # reads
 36258 {AMP  # links
  2185 {UTG  # MPS+           messages
  4876 {ULK
  1443 {CCO  # VAR*,MPS+,UPS+ messages
  3277 {CLK
    15 {SCF
    40 {SLK  # CTP+           messages
  • Get output seq
 asmOutputFasta -p prefix < prefix.asm
  • Get output seq clr
 ~/bin/asm2clrs.pl < prefix.asm > < prefix.asm.clr
  • Get new library insert sizes:
 asm2mdi.pl < prefix.asm > prefix.mdi
  • Get surrogate ids
 #most of them have negative aStat's
 $ cat *.asm | grep -B 8 ^sta:S | grep ^acc | perl -ane '/(\d+)/; print "utg".$1,"\n";'

Qc

 caqc.pl -euid  prefix.asm

Scaffolds

  • Get scaff files
 cavalidate -e 10 prefix
 bank2scaff prefix.bnk/ > prefix.scaff
 ~/bin/ca2scaff -i prefix.asm -o prefix.scaff
 cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$2}' > prefix.scfctg
 cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$3}' > prefix.scflen  
 cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$4}' > prefix.scfspan  
 tac prefix.scaff | p '$count=0 if(/^>/); $count++; print $_ if($count>2);' | awk '{print $4}'   # sequence gaps

AGP

 asmToAGP.pl < prefix.asm > prefix.agp


        /--- buildPosMap ------> .posmap
       /
  .asm /---- asmToAGP ------------------------------------------------> .AGP 
      \                                                              /   
       \---- cavalidate, bank2scaff?--> .scaff --- scaff2agp.pl --- 
  • Other scripts:
 bank2scaff : incorrect gap sizes
 valiadteAgp.pl < prefix.agp
 infoseq 9-terminator/prefix.ctg.fasta | sed 's/ctg//' >! prefix.ctg.infoseq
 scaff2agp.pl -i prefix.ctg.infoseq < prefix.scaff > prefix.agp

Moving an assembly

  • Minimum file set
 scp runCA.sh
 scp runCA.spec

 scp asm.gkpStore/*

 scp 3-overlapcorrection/asm.erates.updated 

 scp 4-unitigger/asm.iidmap
 scp 4-unitigger/asm.partitioning
 scp 4-unitigger/asm.partitioningInfo
 scp 4-unitigger/unitigger.success

 scp 5-consensus/consensus.success

 scp 7-0-CGW/asm.timing
 scp 7-0-CGW/asm.ckp.11  # last one
 scp 7-0-CGW/asm.partitioning
 scp 7-0-CGW/asm.partitionInfo
 scp 7-0-CGW/cgw.out.00.ckp.11
 scp 7-0-CGW/cgw.out

 scp asm.tigStore/*

Add reads to an assembly

 ln -s ../wgs-prev/prefix.frg .
 ln -s ../Data/new.frg
 cp ../wgs-prev/runCA.spec ../wgs-prev/runCA.sh   .
 ln -s ../wgs-prev/0-mercounts
 cp -R ../wgs-prev/prefix.gkpStore
 gatekeeper -a -o prefix.gkpStore new.frg

 ll -d ../wgs-prev/1-overlapper/000000000*                  # if multiple jobs ...
 mkdir -p 1-overlapper/0000000001                           # mkdir -p 1-overlapper/0000000002 ...  ???
 
 cat ../wgs-prev/1-overlapper/*pl | grep '^"h' | tail -1    # get last qry/ref block  => "h0056000001r0050000000"

 cd 1-overlapper/0000000001
 ln -s ../../wgs-prev/1-overlapper/0000000001/*ovb.gz .
 rm h0056000001*.ovb.gz *r0050000000.ovb.gz                 # this should change from case to case 
 cd ../..

Removing bad mates

 cat ../wgs-prev/9-terminator/asm.posmap.frags | grep bad > asm.posmap.frags.bad

 gatekeeper -dumpfragments -tabular -uid asm.posmap.frags.bad ./wgs-prev/asm.gkpStore |  grep -v ^UID | ~/bin/deleteLkg.pl -j 2 > asm.posmap.frags.bad.delete.lkg
 ln -s ../wgs-prev/*.frg 
 gatekeeper -o asm.gkpStore *.frg asm.posmap.frags.bad.delete.lkg 
 ln -s ../wgs-prev/0-mercounts/
 ln -s ../wgs-prev/asm.ovlStore/
 ln -s ../wgs-prev/3-overlapcorrection/
 ln -s ../wgs-prev/4-unitigger/
 cp -R ../wgs-prev/asm.tigStore/ .
 ln -s ../wgs-prev/5-consensus

 runCA ...

Test datasets

Porphyromonas_gingivalis_W83 : Sanger + 454

 wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/porphyromonas_gingivalis_w83/anc.porphyromonas_gingivalis_w83.001.gz
 ...
 wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/porphyromonas_gingivalis_w83/xml.porphyromonas_gingivalis_w83.001.gz
 /fs/sztmpscratch/dpuiu/porphyromonas_gingivalis_w83/
  • Complete genome at NCBI : NC_002950.2 ( 2,343,476bp 48.29%GC)
 bp_fetch.pl net::genbank:NC_002950.2 > porphyromonas_gingivalis_w83.1con

Input formatting

Read counts

 zcat anc.*.gz | ~/bin/traceanc2anc.pl SEQ_LIB_ID INSERT_SIZE  INSERT_STDEV SVECTOR_CODE | count.pl 

TA formatting

  • frg FORMAT2
  • TIs are frg uids
  • All libs
 tracedb-to-frg.pl -xml xml.porphyromonas_gingivalis_w83.001.gz
 tracedb-to-frg.pl -lib xml.porphyromonas_gingivalis_w83.001.gz
 tracedb-to-frg.pl -frg xml.porphyromonas_gingivalis_w83.001.gz
 =>
 porphyromonas_gingivalis_w83.1.lib.frg  
 porphyromonas_gingivalis_w83.2.001.frg.bz2  
 porphyromonas_gingivalis_w83.3.lkg.frg 
  • One lib at a time
 tracedb-to-frg.pl -only T13146 -xml xml.porphyromonas_gingivalis_w83.001.gz >& ! tracedb-to-frg.log
 tracedb-to-frg.pl -only T13146 -lib xml.porphyromonas_gingivalis_w83.001.gz >>& tracedb-to-frg.log
 tracedb-to-frg.pl -only T13146 -frg xml.porphyromonas_gingivalis_w83.001.gz >>& tracedb-to-frg.log

454 SRA formatting

Illumina formatting

  • use fastqToCA (wgs-6.0)
 fastqToCA \
   -insertsize 3000 300 \
   -libraryname 1 \
   -type illumina \
   -fastq /fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt | \
    sed 's/ori:I/ori:O/' \
   > 1.frg
 =>
 {LIB                                                                                                                                      
 act:A                                                                                                                                     
 acc:1                                                                                                                                     
 ori:O                                                                                                                                     
 mea:3000.000                                                                                                                              
 std:300.000                                                                                                                               
 src:                                                                                                                                      
 .                                                                                                                                         
 nft:10                                                                                                                                    
 fea:                                                                                                                                      
 forceBOGunitigger=1                                                                                                                       
 isNotRandom=0                                                                                                                             
 doNotTrustHomopolymerRuns=0                                                                                                               
 doRemoveDuplicateReads=0                                                                                                                  
 doNotQVTrim=0                                                                                                                             
 goodBadQVThreshold=0                                                                                                                      
 doNotOverlapTrim=0                                                                                                                        
 usePackedFragments=1                                                                                                                      
 illuminaFastQType=illumina                                                                                                                
 illuminaSequence=/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt    
 .                                                                                                                                         
 }
 gatekeeper \
   -o asm.gkpStore.BUILDING \
   -T -F  \
   -E asm.gkpStore.errorLog \
   1.frg 
 => Segmentation fault

FASTQ

  • This won't remove the linker:
 zcat SRR001351.fastq.gz | ~/bin/fastq2seq.pl  > SRR001351.seq
 zcat SRR001351.fastq.gz | ~/bin/fastq2qual.pl > SRR001351.qual
 convert-fasta-to-v2.pl -454 -l SRR001351 -s SRR001351.seq -q SRR001351.qual
  • Cleaning seq
 ~/bin/OLD/cleanFastaq.pl \
    s_2_1_1.1kb_sequence.txt s_2_2_1.1kb_sequence.txt \                                     # input
    s_2_1_1.1kb_sequence.clean.txt  s_2_2_1.1kb_sequence.clean.txt \                        # output mates
    s_2_1_1.1kb_sequence.clean.single.txt  s_2_2_1.1kb_sequence.clean.single.txt            # output single
   echo prefix_1.fastq prefix_2.fastq  > prefix.ls
   quake.py -f prefix.ls
  • Illumina note:
 We recommend removing reads that do not pass the GA analysis software Failed_Chastity fi lter before attempting to assemble the sequence.
 The chastity of a base call is the ratio of the intensity of the greatest signal divided by the sum of the two greatest signals. 
 Reads do not pass the quality fi lter if there are two or more base calls with chastity of less than 0.6 in the fi rst 25 cycles. 
 These reads have an “N” in the last column of the GA analysis software export file.

 Removed all reads that contained ambiguous characters (Ns).
 Removed reads that did not contain at least 25 Q30 bases among the first 35 cycles (s35). 

SFF

 sffToCA -libraryname SRR001351 -output SRR001351.frg SRR001351.sff

Running

  • Command
 ~/wgs/Linux-amd64/bin/runCA -p pging -d testassembly porphyromonas_gingivalis_w83.* >& runCA.log

CAUG notes

Illimina mates

  • short: innies PE (paired ends)
  • long: outies MP (mate pairs) "lots of issues" (eg> : linker is one bp; solution : take 36bp from each end of the read to amke sure you don't get the linker
  • wgs 6.1 : Illumina limit of 104 bp; can recompile the code o make it 232
fastqToCA -insertsize 280 30 -libraryname yersinia_pestis.100bp.0280bp.005x -type illumina -innie -fastq $PWD/yersinia_pestis.100bp.0280bp.005x.1.fastq,$PWD/yersinia_pestis.100bp.0280bp.005x.2.fastq > illumina.frg
{VER
ver:2
}
{LIB
act:A
acc:yersinia_pestis.100bp.0280bp.005x
ori:I
mea:280.000
std:30.000
src:
.
nft:12
fea:
forceBOGunitigger=1
shortOverlapModel=1
isNotRandom=0
doNotTrustHomopolymerRuns=0
doRemoveDuplicateReads=0
doNotQVTrim=0
goodBadQVThreshold=0
doNotOverlapTrim=0
usePackedFragments=1
illuminaFastQType=illumina
illuminaOrientation=innie
illuminaSequence=/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.1.fastq,/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.2.fastq
.
}
{VER
ver:1
}

#####################

gatekeeper -o illumina.gkpStore illumina.frg 
Starting file 'illumina.frg' at line 0.
Type set to ILLUMINA 1.3+.
Orientation set to INNIE.
Processing illumina reads from '/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.1.fastq'
                           and '/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.2.fastq'
GKP finished with no errors.

#####################

gatekeeper -dumpinfo illumina.gkpStore/
LOAD STATS

1       libInput
1       libLoaded
0       libErrors
0       libWarnings

0       frgInput
228742  frgLoaded
0       frgErrors
0       frgWarnings

0       lkgInput
0       lkgLoaded
0       lkgErrors
0       lkgWarnings

0       sffInput
0       sffLoaded
0       sffErrors
0       sffWarnings

0       sffLibCreated

0       plcInput
0       plcLoaded
0       plcErrors
0       plcWarnings

228742  numRandom
228742  numPacked
0       numNormal
0       numStrobe

GLOBAL STATS

228742  FRG
1       LIB

LibraryName                             numActiveFRG    numDeletedFRG   numMatedFRG     readLength      clearLength
GLOBAL                                  228742          0               218724          22874200        22191085
LegacyUnmatedReads                      0               0               0               0               0
yersinia_pestis.100bp.0280bp.005x       228742          0               218724          22874200        22191085

################

sffToCA -insertsize 3200 900 -libraryname porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff -trim chop -linker flx -output 454 porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff 
loadSFF()-- Loading 'porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff'.
removeDuplicateReads()-- from 1 to 19622
removeDuplicateReads()-- finished
detectMates()-- from 1 to 19621

#################

Overlap: check the hash load 
  6% is too low
~75% would be ideal