Dpuiu CA

From Cbcb
Revision as of 19:28, 16 September 2010 by Dpuiu (talk | contribs) (→‎Tigstore)
Jump to navigation Jump to search

Data Issues

  • Problems if the coverage > read length; solution: sample the reads
  • Analyze the data before assemble it:
    • If there is a reference genome map the reads to it
    • Plot kmer frequencies (use multiple kmer sizes) ; see how they compare with the estimated coverage

Sourceforge

  • Constants:
 cat src/AS_global.h
 ...
 #define AS_READ_MIN_LEN                   64
 #define AS_OVERLAP_MIN_LEN                40
 ...
 #define AS_READ_MAX_NORMAL_LEN_BITS       11 # changed to 12 for Megachile ... otherwise problems with the hash table :  "Too many skip kmer strings"
  • Download & compile CVS tip
 cvs -z3 -d:pserver:anonymous@wgs-assembler.cvs.sourceforge.net:/cvsroot/wgs-assembler co -P wgs-assembler

 cd wgs-assembler/src
 make SITE_NAME=LOCAL 

 => executables under wgs-assembler/Linux-amd64/bin/
  • Download & compile release
 wget ...

 bzip2 -dc wgs-6.0-beta.tar.bz2 | tar -xf -
 cd wgs-6.0-beta
 cd kmer
 sh configure.sh
 gmake install
 cd ../src
 gmake
 cd ..
  • Compile for debugging :
 make SITE_NAME=LOCAL  BUILDDEBUG=1  # in "c_make.as"

Run

  • Help
runCA.pl -fields
  • Batch submission
 touch runCA.sh
   echo --- runCA $$ restarted by $USER at `date` --- >>& runCA.log
   ~/bin/runCA -d . -p prefix -s ~/bin/runCA.spec  *.frg >>& runCA.log
 at runCA.sh   
  • Sanger
 runCA -d . -p prefix -s spec Sanger.frg
  • doOverlapTrimming=1 (default)
  • low qual reads (ex: 20 or 'D') are discarded
  • astatLowBound 1 (default)  ; can be lowered for low/un-uniform coverage
  • astatHighBound 5 (default)
  • computeInsertSize=0 (default)
  • 454 & Hybrid : convert sff to frg
 sffToCA -libraryname 454 -output 454 454.sff
 runCA -d . ovlOverlapper=mer unitigger=bog -p prefix *.frg
 runCA -d . -s ~/bin/runCA.454.spec -p prefix *.frg 
  • run concurently
 cnsConcurrency=4 frgCorrConcurrency=4 merOverlapperExtendConcurrency=4 merOverlapperSeedConcurrency=4 ovlConcurrency=4 ovlCorrConcurrency=4 ...
  • gatekeeper -L : search for 454 paired end linker
  • AS_GKP_sff.c  : Load_SFF function
  • for mated libs set mean=3000,stdev=300

Parameters

  • stopAfter:
   initialStoreBuilding
   overlapBasedTrimming | OBT     
   overlapper                         # runs overlap, overlapStore
   unitigger                          # runs frg/ovl correction
   consensusAfterUnitigger
   scaffolder
   consensusAfterScaffolder

 Example:
 ~/bin/runCA -d . -p asm -s ~/bin/runCA.454.spec stopAfter=initialStoreBuilding Saureus.frg
  • cgwOutputIntermediate = 1 => "cgw -G"
 => .cgw & .cgw_contigs (generated only at the last doExtendClearRanges step)
  • doExtendClearRanges = 2 => cgw run 3 times, ECR 2 times

Restarting a job

 # kill runCA 
 ps -C perl                                                       # displays the runCA  PID
 ps -C perl | tac | perl -ane 'system "kill -9 $F[0]\n";'
 # in OBT/overlap
 ps -C perl    | tac | perl -ane 'system "kill -9 $F[0]\n";'
 ps -C overlap | tac | perl -ane 'system "kill -9 $F[0]\n";'

 # OBT 
 mkdir 0-overlaptrim-overlap/OLD
 mv 0-overlaptrim-overlap/*sh 0-overlaptrim-overlap/OLD
 # overlap
 mkdir 1-overlapper/OLD
 mv 1-overlapper/*sh 1-overlapper/OLD

Gatekeeper

Build

  • Create & add
 gatekeeper    -o prefix.gkpStore -T -F prefix*frg
 gatekeeper -a -o prefix.gkpStore -T -F prefix*frg

Update

  • edit frg: act:R !!!
 {DST
 act:R
 acc:10691
 mea:161662
 std:28760
 }
  • append
 gatekeeper -a -o prefix.gkpStore -T -F edit.frg
  • Using tab file
 Example: No restrictions on lib stddev with --edit option
 cat prefix.update
 lib     uid     19070   mean    167001
 lib     uid     19070   stddev  167001
 lib     uid     19070   isNotRandom     1
 lib     uid     19070   Orientation     O            # outie orientation for Illimina libs >1K
    
 gatekeeper --edit prefix.update prefix.gkpStore
  • Update read CLR/CLV
 gatekeeper -a -r prefix.clr -o prefix.gkpStore   # not in the wgs release
 gatekeeper -a -v prefix.clv -o prefix.gkpStore

Delete

  • DST messages can not be deleted
  • A FRG can be deleted only if there is no related LKG message; an abbreviated MSG
  • act:D !!!
 # frg1
 {LKG
 act:D
 typ:M
 fg1:94376
 fg2:94654
 }
 {FRG
 act:D
 acc:94376
 }

 # frg2 ???
 {LKG
 act:D
 fg:94376
 fg:94654
 }
 {FRG
 act:D
 acc:94376
 }
 cat delete.ids | perl -ane 'print "{FRG\nact:D\nacc:$F[0]\n}\n";'> delete.frg
 cat delete.ids | ~/bin/deleteFrg.pl -forward 1 -reverse 2 > delete.frg            # forward reads have suffix 1; forward reads have suffix 1

 gatekeeper -a -o prefix.gkpStore -T -F delete.frg

Dump

  • INFO
 gatekeeper -dumpinfo prefix.gkpStore
  • LAST frg in store
 gatekeeper -dumpinfo -lastfragiid prefix.gkpStore
  • FASTA sequences
 gatekeeper -dumpfastaseq prefix.gkpStore > prefix.seq
 gatekeeper -dumpfastaqlt prefix.gkpStore > prefix.qual
  • FASTA reads for newbler
 gatekeeper -dumpnewbler newbler prefix.gkpStore => prefix.fna,prefix.fna.qual

 grep "^>" prefix.fna | head -2
 >SRR001355.3635.1a template=2020+2021 dir=F library=SRX000348 trim=20-117
 >SRR001355.3635.1b template=2020+2021 dir=R library=SRX000348 trim=1-130
  • FRG file
 gatekeeper -dumpfrg prefix.gkpStore > prefix.frg
 gatekeeper -dumpfrg prefix.gkpStore -format2 > prefix.frg2 
  • READS in TAB format
 gatekeeper -dumpfragments -tabular prefix.gkpStore > prefix.gkpStore.info
 gatekeeper -dumpfragments -tabular prefix.gkpStore | grep -v ^UID | awk '{print $1,$2}' > prefix.gkpStore.uid2iid
  • LIBS in TAB format
 gatekeeper -dumplibraries -tabular prefix.gkpStore | grep -v ^UID | awk '{print $1,$4,$5}' | sed 's/\.000//g'> prefix.dst  # original vals
 asm2mdi.pl < prefix.asm > prefix.mdi                                                                                       # final vals
  • READ CLR's (default=LATEST)
 gatekeeper -dumpfragments -tabular prefix.gkpStore/ | perl -ane 'print join "\t", @F[1,0,-2,-1,6]; print "\n";' | head | pretty -o
 IID    UID                                  clrBeginLATEST  clrEndLATEST  isDeleted  
 1      HWI-EAS385_0001:1:1:1569:15197#0/1   0               65            0          
 2      HWI-EAS385_0001:1:1:2095:12498#0/1   0               67            0          
 3      HWI-EAS385_0001:1:1:5452:4010#0/1    0               78            0          
 4      HWI-EAS385_0001:1:1:12400:16773#0/1  0               105           0          
 5      HWI-EAS385_0001:1:1:12884:18158#0/1  0               80            0          
 ...
 
 gatekeeper -dumpfragments -tabular prefix.gkpStore/  | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.clr
 gatekeeper -dumpfragments -tabular -clear [CLR|OBTINITIAL|OBTMERGE|OBTCHIMERA|ECR_0|ECR_1|ECR_2|LATEST] prefix.gkpStore/  | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.?.clr
 gatekeeper -dumpfragments -tabular -clear CLR asm.gkpStore        | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.CLR.clr
 gatekeeper -dumpfragments -tabular -clear OBTINITIAL asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.OBTINITIAL.clr &
 gatekeeper -dumpfragments -tabular -clear OBTCHIMERA asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.OBTCHIMERA.clr &
 ...
 gatekeeper -dumpfragments -tabular -clear LATEST asm.gkpStore     | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.LATEST.clr &
  • READ CLR's summary
 gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear | perl -ane '  print $1," ",$3-$2,"\n" if(/(\w+),(\d+),(\d+)$/);' | ~/bin/sum2.pl 
  • MATED
 frg22mates.pl < prefix.frg > prerfix.mates
  • All scaffold reads in FRG2 format
 cat asm.posmap.frgscf | grep scfid > scfid.posmap.frgscf
 gatekeeper -uid scfid.posmap.frgscf -dumpfrg -donotfixmates -format2 asm.gkpStore > scfid.frg

Meryl

  • Get mer frequencies
 meryl -C -B -m 22 -s prefix.seq -o prefix.22mers
  • Dump mer sequences
 meryl -Dt -s prefix.22mers 
  • Dump histogram
 meryl -Dh -s prefix.22mers | sort -nk2 -r | more
  • Other programs
 cat lib.seq | ~/bin/countKmers.pl -k 22 | head -20 |          grep -v "^>" > 22mers.fwd.grep
 cat lib.seq | ~/bin/countKmers.pl -k 22 | head -20 | revseq | grep -v "^>" > 22mers.rev.grep
 cat 22mers.fwd.grep 22mers.rev.grep > 22mers.grep
 cat lib.seq | ~/bin/fgrep.pl -f 22mer.grep | count.pl

OBT

  • runCA spec
 doOverlapTrimming = 1
  • command
 overlap -G ...   # -G to compute partial overlaps
  • Dump
 overlapStore -d ./0-overlaptrim/asm.obtStore
  • Number of jobs: should be < fileListMax (1024*10 by default) otherwise overlapStore fails
 ~/bin/getOvlJobsCount.pl ovlHashBlockSize ovlRefBlockSize frgCount

Overlap

  • Jobs
 merOverlapperSeedBatchSize     100000
 merOverlapperExtendBatchSize   75000
 
 overmerry.sh :       #frgs/merOverlapperSeedBatchSize   jobs -> seeds/ dir
 olap-from-seeds.sh : #frgs/merOverlapperExtendBatchSize jobs -> olaps/ dir

OverlapStore

  • Flags:
 -t -M            : don't improve performance much
 -I ignore_file   : not implemented
  • Dump
 overlapStore -d prefix.ovlStore > prefix.ovl.tab

 overlapStore -d prefix.ovlStore | awk '{print $1}' | uniq -c | awk '{print $2,$1}' > prefix.ovlStore.count.tmp
 cat prefix.ovlStore.count.tmp | perl -ane 'foreach ($p+1..$F[0]-1) { print "$_ 0\n" } $p=$F[0]; print $_'  > prefix.ovlStore.count
 rm prefix.ovlStore.count.tmp
  • Get stats; median number of ovls (5'/3') for each lib
 overlapStats -G prefix.gkpStore -O prefix.ovlStore/ -o prefix
 cat prefix.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o
  • Delete overlaps from *.obt.gz files
 zcat *.obt.gz  | convertOverlap -a -ovl |  ~/bin/difference1or2.pl -f delete.acc | convertOverlap -b -ovl | gzip > new.obt.gz
  • Delete overlaps from ovlStore : asm=>new
 overlapStore -d asm.ovlStore | perl -ane 'print $_ if($F[0]<$F[1]);' | ~/bin/difference1or2.pl -f delete.iid |  convertOverlap -b -ovl | gzip  >! new.ovb.gz
 overlapStore -c new.ovlStore -g asm.gkpStore/ new.ovb.gz

 #not necessary to gzip the overlap files
 overlapStore -d asm.ovlStore | perl -ane 'print $_ if($F[0]<$F[1]);' | ~/bin/difference1or2.pl -f delete.iid |  convertOverlap -b -ovl >! new.ovb
 overlapStore -c new.ovlStore -g asm.gkpStore/ new.ovb
  • Edit(invert) overlaps from ovlStore : asm=>new
 ... > invert.iid     # get the iids of the reads to change          
 overlapStore -d asm.ovlStore  | p 'print $_ if($F[0]<=$F[1]);' | ~/bin/invertOvl.pl -f invert.iid | convertOverlap -b -ovl | gzip > new.ovb.gz
 overlapStore -c new.ovlStore -g asm.gkpStore new.ovb.gz
  • Count how many overlaps have errors
 overlapStore -d asm.ovlStore | p 'print if($F[5]>0);' | wc -l
  • Build faster ...
 Example (Bee):
 e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job  
 e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                               # 3 min/job 
 total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)

Frgcorr

  • Fails on deleted reads with overlaps ; must delete the overlaps as well

Unitiger

Tigstore

  • list unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist     # ids  
 maID    isPresent  isDeleted  ptID  svID  fileOffset  covStat  
 0       1          0          1     2     0           3334     
 1       1          0          1     2     521908      -1       
 2       1          0          1     2     522764      543      
 ...
 tigStore -g asm.gkpStore -t asm.tigStore 2 -D properties     # exit: stdout    # properties
 unitig_coverage_stat        0 660
 unitig_microhet_prob        0 0.000000
 unitig_status               0 X
 unitig_unique_rept          0 X
 ...
  • list frags/consensus/layout in a unitig
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d frags 
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d consensus
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d layout
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0  -d properties  # exit: stderr & stdout

  • dump all unitig consensus
 tigStore -g asm.gkpStore -t asm.tigStore 2 -U -d consensus > utg.fasta
  • get single read unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep ' 1$' 
  • get read/utg or ctg stats
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties |  grep ^unitig_num_frags | getSummary.pl -z 1 -i 2 -t unitig_num_frags 
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties |  grep ^contig_num_frags | getSummary.pl -z 1 -i 2 -t contig_num_frags 
  • delete a unitig : get & update layout
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee > utg.layout | grep -v FRG | p 'if(/^data.num_frags/) { print "data.num_frags\t0\n"} else { print $_ }' > utg.layout.delete
 tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.delete
 # or
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee > utg.layout |  ~/bin/deleteUnitigLayout.pl > utg.layout.delete
 tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.delete
  • assign unitig -1 to all unitigs (for addition to a new store)
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee > utg.layout | p 'if (/^unitig/) { print "unitig -1\n"} elsif(/^UTG/) { $F[7]=-1; print join " ",@F; print "\n" } else { print $_};' > utg.layout.add
 tigStore -g asm.gkpStore -t new.tigStore 2 -R utg.layout.add
  • fix unitig consensus
 touch utg.layout
 grep FAILED 5-consensus/asm*err | cut -f3 -d ' ' | sort -u -n | p 'system "tigStore -g asm.gkpStore -t asm.tigStore 2 -u $F[0] -d layout >> utg.layout\n";'  
 cat  utg.layout | updateUTGlayout.pl  >  utg.layout.update
 tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.update

 grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | cut -f1 | sort -u | p 'system "touch $F[0].success\n";' 
 touch 5-consensus/consensus.success

CGW

  • Long run times usually mean problems (conflicts) with the data
  • Differences between the LAST run of CGW and PREVIOUS ones
 # options
 PREV       LAST
            -R 7             # restart from checkpoint file number 'ckp'
            -N ckp01-ABS     # restart from checkpoint location 'ckp' (see the timing file)
 -s 0       -s 2             # stone throwing level
 -S 0                        # S==0 : do not resolve surrogates
 -G                          # Don't generate output (cgw or cam) 
 #files
 difference.pl 7-4-CGW.ls 7-0-CGW.ls
 asm.asm.cam
 asm.dregs.cam
 asm.partitionInfo  #  "equivalent of" 4-unitigger/asm.partitioningInfo 
 asm.partitioning   #  "equivalent of" 4-unitigger/asm.partitioning
  • running only one instance of ECR
 rm -fr 7-[234]*
 /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA -d . -p asm -s ./runCA.spec doExtendClearRanges=1 *.frg >> & runCA.log
  • canceling ECR run (if doExtendClearRanges set to 1 or 2 and the run takes too long)
 grep ckp ./7-0-CGW/cgw.out 
 ./7-0-CGW/asm.ckp.3 (logical ckp01-ABS) after building scaffolds at ...
 ./7-0-CGW/asm.ckp.4 (logical ckp03-ACD) after scaffold cleaning at ...
 ./7-0-CGW/asm.ckp.5 (logical ckp05-1SM) after 1st scaffold merge at ...
 ./7-0-CGW/asm.ckp.6 (logical ckp06-AS) after stone throwing at ...
 ./7-0-CGW/asm.ckp.7 (logical ckp08-2SM) after 2nd scaffold merge at ...
 ./7-0-CGW/asm.ckp.8 (logical ckp09-FR) after final rocks at ...

 cgw -R 7 -N ckp09-FR -j 1 -k 5 -r 5 -s 2 -z -P 2 -B 75000 -m 100 -g ./asm.gkpStore -t ./asm.tigStore -o ./7-0-CGW/asm

 grep ckp ./7-0-CGW/cgw.out 
 ...
 ./7-0-CGW/asm.ckp.9 (logical ckp10-PS) after partial stones at ...
 ./7-0-CGW/asm.ckp.10 (logical ckp11-FCS) after final contained stones at ...
 ./7-0-CGW/asm.ckp.11 (logical ckp13-RS) after resolve surrogates at ...
 ./7-0-CGW/asm.ckp.12 (logical ckp14-FIN) after output at ...
  • running ECR (if initially doExtendClearRanges=0) ; make sure cgwPurgeCheckpoints=0
 runCA -d . -p asm -s runCA.spec stopAfter=scaffolder cgwPurgeCheckpoints=0 doExtendClearRanges=0 *.frg >> & runCA.log
 
 rm 7-CGW
 cd 7-0-CGW
 mkdir OLD
 mv asm.ckp.9 asm.ckp.1? OLD asm.*.cam asm.partition* asm.timing OLD
 cd ..
 
 runCA -d . -p asm -s runCA.spec stopAfter=scaffolder cgwPurgeCheckpoints=0 doExtendClearRanges=1 *.frg >> & runCA.log
  • DistUpadte file
7-0-CGW/stat/scaffold_final.distupdate.dst

Consensus

  • Dump unitig seq
 tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D unitigs > unitigs.txt
 cat unitigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -u  $F[0]\n";' >& unitigs.fasta.tmp
 cat unitigs.fasta.tmp  | grep -v null | sed 's/cns=//' | sed 's/-//g' | nl0 | tab2fasta.pl > unitigs.fasta
 rm unitigs.fasta.tmp
  • Dump contig seq
 tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D contigs > contigs.txt
 cat contigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -c  $F[0]\n";'>& contigs.fasta.tmp
 cat contigs.fasta.tmp | grep -v null | sed 's/cns=//' | sed 's/-//g' | nl0 | tab2fasta.pl > contigs.fasta
 rm contigs.fasta.tmp
  • Recompute unitig consensus
 utgcns -g ./asm.gkpStore -t ./asm.tigStore 2 011 -u 122533 -v -V

Terminator

Log file & runtime (wgs-5.2)

  • Filter log
 ~/bin/getCAruntimes.pl -filter runCA.log > runCA.filter.log
  • Display log
 cat runCA.log | ~/bin/getCAruntimes.pl -hour -total -plot  # => runCA.runTimes ; runCA.runTimes.gp ; runCA.runTimes.png
 display runCA.runTimes.png

Output processing

Posmap

  • Get posMap files
buildPosMap -o prefix < prefix.asm
  • Get number of reads/ctg or scf (from the pos files)
 cat prefix.posmap.frgctg | count.pl -i 1 > prefix.posmap.ctgfrgcnt
 cat prefix.posmap.frgscf | count.pl -i 1 > prefix.posmap.scffrgcnt
 cat prefix.posmap.ctgscf | count.pl -i 1 > prefix.posmap.scfctgcnt
  • Compute read/mate/ctg cvg (from the pos files)
 posmap2cvg.pl                                  < prefix.posmap.frgscf > prefix.posmap.scffrgcvg
 OLD/posmap2cvg.4.pl -mates prefix.posmap.mates < prefix.posmap.frgscf > prefix.posmap.scfmatecvg
 posmap2cvg.pl                                  < prefix.posmap.ctgscf > prefix.posmap.scfctgcvg
  • Get 0cvg regions
 posmap2cvg.pl -f prefix.posmap.scflen -Max 0 < prefix.posmap.frgscf  # scaff gaps
 posmap2cvg.pl -f prefix.posmap.ctglen -Max 0 < prefix.posmap.frgctg  # surrogates
  • Get links
 ~/bin/scf2scflnk.sh prefix                      
 ~/bin/ctg2ctglnk.sh prefix                      
 ...
  • Add ctg/scf into to the mates fils
 join12.pl prefix.posmap.mates prefix.posmap.frgctg > prefix.posmap.mates2.tmp
 join12.pl prefix.posmap.mates2.tmp prefix.posmap.frgscf > prefix.posmap.mates2
 rm prefix.posmap.mates2.tmp
  • Display graphs
cat prefix.posmap.scf2scflnk | ~/bin/tab2dot.pl | dot -Tpng -o prefix.posmap.scf2scflnk.png
cat prefix.posmap.ctg2ctglnk | ~/bin/tab2dot.pl | dot -Tpng -o prefix.posmap.ctg2ctglnk.png
  • Qc stats based on posmap files
 posmap2qc.pl -ctgscf prefix.posmap.ctgscf > prefix.qc
  • Break scf/ctg pipeline
 ~/bin/breakPosmapKeep.amos
  • Get scaff file
 ~/bin/posmap2scaff.pl prefix.posmap.ctgscf
  • Mate redundancy (based on posmap file)
 cat asm.posmap.frgutg | awk '{print $1}' | sed 's/[12]$//' | count.pl -m 2 | awk '{print $1}' > asm.posmap.frgutg.mates

 cat asm.posmap.frgutg | ~/bin/posmap2ovl.pl -p 75 | ~/bin/getMateRedundancy12.pl              > asm.posmap.frgutg.redundant_mate_pairs
 cat asm.posmap.frgutg.redundant_mate_pairs | p 'print "$F[0]\n$F[1]\n"' | sort -u             > asm.posmap.frgutg.redundant_mates
 cat asm.posmap.frgutg.redundant_mate_pairs | ~/bin/cluster.pl                                 > asm.posmap.frgutg.redundant_mate_clusters
  • Mate redundancy (based on overlapStore)
 #Example
 /nfshomes/dpuiu/bin/getMateRedundancy.amos -D BEGIN=1 -D END=11301039  -D GKPSTORE=../asm.gkpStore -D OVLSTORE=../asm.ovlStore s_1
  • Gaps longer than a certain size
 cat asm.posmap.ctgscf | p 'print $p,$_,"\n" if($F[1] eq $P[1] and $F[2]-$P[3]>1000); $p=$_;@P=@F;'

Fasta

  • Align deg to scaff
 nucmer -mum -l 40 -c 250 scf.fasta deg.fasta  -p scf-deg
 delta-filter -q scf-deg.delta > scf-deg.filter-q.delta
 ~/bin/delta2pairCvg+.pl < scf-deg.filter-q.delta > scf-deg.filter-q.cvg.pairs
 ~/bin/max2.pl -i 4 -j 6 < scf-deg.filter-q.cvg.pairs > scf-deg.filter-q.maxCvg.pairs
 cat scf-deg.filter-q.maxCvg.pairs | p 'print $_ if($F[5]-$F[6]<2000);' > scf-deg.filter-q.variants.pairs
 awk '{print $5,$6}' scf-deg.filter-q.variants.pairs > scf-deg.filter-q.variants.ids
  • Indexing
 # contigs
 ~/bin/formatdb.pl < prefix.ctg.fasta > prefix.ctg.fasta.offset
 ~/bin/fastacmd.pl -o prefix.ctg.fasta.offset -f filter.ctg.ids < prefix.ctg.fasta > filter.ctg.fasta
 # same for scaff, deg ...
  • Contaminant search : nucmer against
 /nfshomes/dpuiu/db/Ecoli.all
 /nfshomes/dpuiu/db/UniVec_Core
 /nfshomes/dpuiu/db/OtherVec

Asm

  • Get message counts
cat asm.asm | grep "^{" | egrep -v 'MPS|UPS|VAR|CTP' | uniq -c | more
     1 {MDI
273375 {AFG  # reads
 36258 {AMP  # links
  2185 {UTG  # MPS+           messages
  4876 {ULK
  1443 {CCO  # VAR*,MPS+,UPS+ messages
  3277 {CLK
    15 {SCF
    40 {SLK  # CTP+           messages
  • Get output seq
 asmOutputFasta -p prefix < prefix.asm
  • Get output seq clr
 ~/bin/asm2clrs.pl < prefix.asm > < prefix.asm.clr
  • Get new library insert sizes:
 asm2mdi.pl < prefix.asm > prefix.mdi
  • Get surrogate ids
 #most of them have negative aStat's
 $ cat *.asm | grep -B 8 ^sta:S | grep ^acc | perl -ane '/(\d+)/; print "utg".$1,"\n";'

Qc

 caqc.pl -euid  prefix.asm

Scaffolds

  • Get scaff files
 cavalidate -e 10 prefix
 bank2scaff prefix.bnk/ > prefix.scaff
 ~/bin/ca2scaff -i prefix.asm -o prefix.scaff
 cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$2}' > prefix.scfctg
 cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$3}' > prefix.scflen  
 cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$4}' > prefix.scfspan  
 tac prefix.scaff | p '$count=0 if(/^>/); $count++; print $_ if($count>2);' | awk '{print $4}'   # sequence gaps

AGP

 asmToAGP.pl < prefix.asm > prefix.agp


        /--- buildPosMap ------> .posmap
       /
  .asm /---- asmToAGP ------------------------------------------------> .AGP 
      \                                                              /   
       \---- cavalidate, bank2scaff?--> .scaff --- scaff2agp.pl --- 
  • Other scripts:
 bank2scaff : incorrect gap sizes
 valiadteAgp.pl < prefix.agp
 infoseq 9-terminator/prefix.ctg.fasta | sed 's/ctg//' >! prefix.ctg.infoseq
 scaff2agp.pl -i prefix.ctg.infoseq < prefix.scaff > prefix.agp

Moving an assembly

  • Minimum file set
 scp runCA.sh
 scp runCA.spec

 scp asm.gkpStore/*

 scp 3-overlapcorrection/asm.erates.updated 

 scp 4-unitigger/asm.iidmap
 scp 4-unitigger/asm.partitioning
 scp 4-unitigger/asm.partitioningInfo
 scp 4-unitigger/unitigger.success

 scp 5-consensus/consensus.success

 scp 7-0-CGW/asm.timing
 scp 7-0-CGW/asm.ckp.11  # last one
 scp 7-0-CGW/asm.partitioning
 scp 7-0-CGW/asm.partitionInfo
 scp 7-0-CGW/cgw.out.00.ckp.11
 scp 7-0-CGW/cgw.out

 scp asm.tigStore/*

Test datasets

Porphyromonas_gingivalis_W83 : Sanger + 454

 wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/porphyromonas_gingivalis_w83/anc.porphyromonas_gingivalis_w83.001.gz
 ...
 wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/porphyromonas_gingivalis_w83/xml.porphyromonas_gingivalis_w83.001.gz
 /fs/sztmpscratch/dpuiu/porphyromonas_gingivalis_w83/
  • Complete genome at NCBI : NC_002950.2 ( 2,343,476bp 48.29%GC)
 bp_fetch.pl net::genbank:NC_002950.2 > porphyromonas_gingivalis_w83.1con

Input formatting

Read counts

 zcat anc.*.gz | ~/bin/traceanc2anc.pl SEQ_LIB_ID INSERT_SIZE  INSERT_STDEV SVECTOR_CODE | count.pl 

TA formatting

  • frg FORMAT2
  • TIs are frg uids
  • All libs
 tracedb-to-frg.pl -xml xml.porphyromonas_gingivalis_w83.001.gz
 tracedb-to-frg.pl -lib xml.porphyromonas_gingivalis_w83.001.gz
 tracedb-to-frg.pl -frg xml.porphyromonas_gingivalis_w83.001.gz
 =>
 porphyromonas_gingivalis_w83.1.lib.frg  
 porphyromonas_gingivalis_w83.2.001.frg.bz2  
 porphyromonas_gingivalis_w83.3.lkg.frg 
  • One lib at a time
 tracedb-to-frg.pl -only T13146 -xml xml.porphyromonas_gingivalis_w83.001.gz >& ! tracedb-to-frg.log
 tracedb-to-frg.pl -only T13146 -lib xml.porphyromonas_gingivalis_w83.001.gz >>& tracedb-to-frg.log
 tracedb-to-frg.pl -only T13146 -frg xml.porphyromonas_gingivalis_w83.001.gz >>& tracedb-to-frg.log

454 SRA formatting

Illumina formatting

  • use fastqToCA (wgs-6.0)
 fastqToCA \
   -insertsize 3000 300 \
   -libraryname 1 \
   -type illumina \
   -fastq /fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt | \
    sed 's/ori:I/ori:O/' \
   > 1.frg
 =>
 {LIB                                                                                                                                      
 act:A                                                                                                                                     
 acc:1                                                                                                                                     
 ori:O                                                                                                                                     
 mea:3000.000                                                                                                                              
 std:300.000                                                                                                                               
 src:                                                                                                                                      
 .                                                                                                                                         
 nft:10                                                                                                                                    
 fea:                                                                                                                                      
 forceBOGunitigger=1                                                                                                                       
 isNotRandom=0                                                                                                                             
 doNotTrustHomopolymerRuns=0                                                                                                               
 doRemoveDuplicateReads=0                                                                                                                  
 doNotQVTrim=0                                                                                                                             
 goodBadQVThreshold=0                                                                                                                      
 doNotOverlapTrim=0                                                                                                                        
 usePackedFragments=1                                                                                                                      
 illuminaFastQType=illumina                                                                                                                
 illuminaSequence=/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt    
 .                                                                                                                                         
 }
 gatekeeper \
   -o asm.gkpStore.BUILDING \
   -T -F  \
   -E asm.gkpStore.errorLog \
   1.frg 
 => Segmentation fault

FASTQ

  • This won't remove the linker:
 zcat SRR001351.fastq.gz | ~/bin/fastq2seq.pl  > SRR001351.seq
 zcat SRR001351.fastq.gz | ~/bin/fastq2qual.pl > SRR001351.qual
 convert-fasta-to-v2.pl -454 -l SRR001351 -s SRR001351.seq -q SRR001351.qual

SFF

 sffToCA -libraryname SRR001351 -output SRR001351.frg SRR001351.sff

Running

  • Command
 ~/wgs/Linux-amd64/bin/runCA -p pging -d testassembly porphyromonas_gingivalis_w83.* >& runCA.log

CAUG notes

Illimina mates

  • short: innies PE (paired ends)
  • long: outies MP (mate pairs) "lots of issues" (eg> : linker is one bp; solution : take 36bp from each end of the read to amke sure you don't get the linker
  • wgs 6.1 : Illumina limit of 104 bp; can recompile the code o make it 232

fastqToCA -insertsize 280 30 -libraryname yersinia_pestis.100bp.0280bp.005x -type illumina -innie -fastq $PWD/yersinia_pestis.100bp.0280bp.005x.1.fastq,$PWD/yersinia_pestis.100bp.0280bp.005x.2.fastq > illumina.frg {VER ver:2 } {LIB act:A acc:yersinia_pestis.100bp.0280bp.005x ori:I mea:280.000 std:30.000 src: . nft:12 fea: forceBOGunitigger=1 shortOverlapModel=1 isNotRandom=0 doNotTrustHomopolymerRuns=0 doRemoveDuplicateReads=0 doNotQVTrim=0 goodBadQVThreshold=0 doNotOverlapTrim=0 usePackedFragments=1 illuminaFastQType=illumina illuminaOrientation=innie illuminaSequence=/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.1.fastq,/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.2.fastq . } {VER ver:1 }

gatekeeper -o illumina.gkpStore illumina.frg Starting file 'illumina.frg' at line 0. Type set to ILLUMINA 1.3+. Orientation set to INNIE. Processing illumina reads from '/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.1.fastq'

                          and '/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.2.fastq'

GKP finished with no errors.

gatekeeper -dumpinfo illumina.gkpStore/ LOAD STATS

1 libInput 1 libLoaded 0 libErrors 0 libWarnings

0 frgInput 228742 frgLoaded 0 frgErrors 0 frgWarnings

0 lkgInput 0 lkgLoaded 0 lkgErrors 0 lkgWarnings

0 sffInput 0 sffLoaded 0 sffErrors 0 sffWarnings

0 sffLibCreated

0 plcInput 0 plcLoaded 0 plcErrors 0 plcWarnings

228742 numRandom 228742 numPacked 0 numNormal 0 numStrobe

GLOBAL STATS

228742 FRG 1 LIB

LibraryName numActiveFRG numDeletedFRG numMatedFRG readLength clearLength GLOBAL 228742 0 218724 22874200 22191085 LegacyUnmatedReads 0 0 0 0 0 yersinia_pestis.100bp.0280bp.005x 228742 0 218724 22874200 22191085

sffToCA -insertsize 3200 900 -libraryname porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff -trim chop -linker flx -output 454 porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff loadSFF()-- Loading 'porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff'. removeDuplicateReads()-- from 1 to 19622 removeDuplicateReads()-- finished detectMates()-- from 1 to 19621

Overlap: check the hash load

 6% is too low

~75% would be ideal