Dpuiu CA

From Cbcb
Revision as of 20:23, 22 October 2010 by Dpuiu (talk | contribs) (→‎Sourceforge)
Jump to navigation Jump to search

Data Issues

  • Problems if the coverage > read length; solution: sample the reads
  • Analyze the data before assemble it:
    • If there is a reference genome map the reads to it
    • Plot kmer frequencies (use multiple kmer sizes) ; see how they compare with the estimated coverage

Sourceforge

  • Constants:
 cat src/AS_global.h
 ...
 #define AS_READ_MIN_LEN                   64  # should decrease for short Illumina reads  32 ???
 #define AS_OVERLAP_MIN_LEN                40  # should decrease for short Illumina reads  30 ???  

 #define AS_READ_MAX_PACKED_LEN            104  # max Illumina read length; incread to 124
 
 #define AS_READ_MAX_NORMAL_LEN_BITS       11   # should be increased to 12+ if there are many frequent kmesra nd dpn't fit in the hash
cat src/AS_CGW/AS_CGW_dataTypes.h
#define CGW_MIN_DISCRIMINATOR_UNIQUE_LENGTH 1000   # should be decraesed for short reads assemblies ; all unitigs less than that do not get scaffolded


  • Download & compile CVS tip
 cvs -z3 -d:pserver:anonymous@wgs-assembler.cvs.sourceforge.net:/cvsroot/wgs-assembler co -P wgs-assembler

 cd wgs-assembler/src
 make SITE_NAME=LOCAL 

 => executables under wgs-assembler/Linux-amd64/bin/
  • Download & compile release
 wget ...

 bzip2 -dc wgs-6.0-beta.tar.bz2 | tar -xf -
 cd wgs-6.0-beta
 cd kmer
 sh configure.sh
 gmake install
 cd ../src
 gmake
 cd ..
  • Compile for debugging :
 make SITE_NAME=LOCAL  BUILDDEBUG=1  # in "c_make.as"

Run

  • Help
runCA.pl -fields
runCA.pl -options
  • Batch submission
 touch runCA.sh
   echo --- runCA $$ restarted by $USER at `date` --- >>& runCA.log
   ~/bin/runCA -d . -p prefix -s ~/bin/runCA.spec  *.frg >>& runCA.log
 at runCA.sh   
  • Sanger
 runCA -d . -p prefix -s spec Sanger.frg
  • doOverlapTrimming=1 (default)
  • low qual Sanger reads (ex: 20 or 'D') are discarded
  • Illumina reads are trimmed at 5'/3' (bp with qual < 5 (E) are deleted; read is deleted if new length< 64)
  • astatLowBound 1 (default)  ; can be lowered for low/un-uniform coverage
  • astatHighBound 5 (default)
  • computeInsertSize=0 (default)
  • 454 & Hybrid : convert sff to frg
 sffToCA -libraryname 454 -output 454 454.sff
 runCA -d . ovlOverlapper=mer unitigger=bog -p prefix *.frg
 runCA -d . -s ~/bin/runCA.454.spec -p prefix *.frg 
  • gatekeeper -L : search for 454 paired end linker
  • AS_GKP_sff.c  : Load_SFF function
  • for mated libs set mean=3000,stdev=300

Parameters

  • stopAfter:
   initialStoreBuilding
   overlapBasedTrimming | OBT     
   overlapper                         # runs overlap, overlapStore
   unitigger                          # runs frg/ovl correction
   consensusAfterUnitigger
   scaffolder
   consensusAfterScaffolder

  • Example:
 ~/bin/runCA -d . -p asm -s ~/bin/runCA.454.spec stopAfter=initialStoreBuilding *.frg

Gatekeeper

Build

  • Create & add
 gatekeeper    -o prefix.gkpStore -T -F prefix*frg
 gatekeeper -a -o prefix.gkpStore -T -F prefix*frg

Update

  • edit frg: act:R !!!
 {DST
 act:R
 acc:10691
 mea:161662
 std:28760
 }
  • append
 gatekeeper -a -o prefix.gkpStore -T -F edit.frg
  • Using tab file
 Example: No restrictions on lib stddev with --edit option
 cat prefix.update
 lib     uid     19070   mean    167001
 lib     uid     19070   stddev  167001
 lib     uid     19070   isNotRandom     1
 lib     uid     19070   Orientation     O            # outie orientation for Illimina libs >1K
    
 gatekeeper --edit prefix.update prefix.gkpStore
  • Update read CLR/CLV
 gatekeeper -a -r prefix.clr -o prefix.gkpStore   # not in the wgs release
 gatekeeper -a -v prefix.clv -o prefix.gkpStore

Delete

  • DST messages can not be deleted
  • A FRG can be deleted only if there is no related LKG message; an abbreviated MSG
  • act:D !!!
 # frg1
 {LKG
 act:D
 typ:M
 fg1:94376
 fg2:94654
 }
 {FRG
 act:D
 acc:94376
 }

 # frg2 ???
 {LKG
 act:D
 fg:94376
 fg:94654
 }
 {FRG
 act:D
 acc:94376
 }
 cat delete.ids | perl -ane 'print "{FRG\nact:D\nacc:$F[0]\n}\n";'> delete.frg
 cat delete.ids | ~/bin/deleteFrg.pl -forward 1 -reverse 2 > delete.frg            # forward reads have suffix 1; forward reads have suffix 1

 gatekeeper -a -o prefix.gkpStore -T -F delete.frg

Dump

  • INFO
 gatekeeper -dumpinfo prefix.gkpStore
  • LAST frg in store
 gatekeeper -dumpinfo -lastfragiid prefix.gkpStore
  • FASTA sequences
 gatekeeper -dumpfastaseq prefix.gkpStore > prefix.seq
 gatekeeper -dumpfastaqlt prefix.gkpStore > prefix.qual
  • FASTA reads for newbler
 gatekeeper -dumpnewbler newbler prefix.gkpStore => prefix.fna,prefix.fna.qual

 grep "^>" prefix.fna | head -2
 >SRR001355.3635.1a template=2020+2021 dir=F library=SRX000348 trim=20-117
 >SRR001355.3635.1b template=2020+2021 dir=R library=SRX000348 trim=1-130
  • FASTQ reads
 gatekeeper -dumpfastq prefix. prefix.gkpStore
 =>
 prefix.1.fastq
 prefix.2.fastq
 prefix.paired.fastq    # contains both asm.1.fastq & asm.2.fastq reads
 prefix.unmated.fastq
 cat prefix.1.fastq | perl ~/bin/clrFastq.pl > prefix.1.clr.fastq 
 cat prefix.2.fastq | perl ~/bin/clrFastq.pl > prefix.2.clr.fastq 
 cat prefix.paired.fastq | perl ~/bin/clrFastq.pl > prefix.paired.clr.fastq 
 cat prefix.unmated.fastq | perl ~/bin/clrFastq.pl > prefix.unmated.clr.fastq 


  • FRG file
 gatekeeper -dumpfrg prefix.gkpStore > prefix.frg
 gatekeeper -dumpfrg prefix.gkpStore -format2 > prefix.frg2 
  • READS in TAB format
 gatekeeper -dumpfragments -tabular prefix.gkpStore > prefix.gkpStore.info
 gatekeeper -dumpfragments -tabular prefix.gkpStore | grep -v ^UID | awk '{print $1,$2}' > prefix.gkpStore.uid2iid
  • LIBS in TAB format
 gatekeeper -dumplibraries -tabular prefix.gkpStore | grep -v ^UID | awk '{print $1,$4,$5}' | sed 's/\.000//g'> prefix.dst  # original vals
 asm2mdi.pl < prefix.asm > prefix.mdi                                                                                       # final vals
  • READ CLR's (default=LATEST)
 gatekeeper -dumpfragments -tabular prefix.gkpStore/ | perl -ane 'print join "\t", @F[1,0,-2,-1,6]; print "\n";' | head | pretty -o
 IID    UID                                  clrBeginLATEST  clrEndLATEST  isDeleted  
 1      HWI-EAS385_0001:1:1:1569:15197#0/1   0               65            0          
 2      HWI-EAS385_0001:1:1:2095:12498#0/1   0               67            0          
 3      HWI-EAS385_0001:1:1:5452:4010#0/1    0               78            0          
 4      HWI-EAS385_0001:1:1:12400:16773#0/1  0               105           0          
 5      HWI-EAS385_0001:1:1:12884:18158#0/1  0               80            0          
 ...
 
 gatekeeper -dumpfragments -tabular prefix.gkpStore/  | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.clr
 gatekeeper -dumpfragments -tabular -clear [CLR|OBTINITIAL|OBTMERGE|OBTCHIMERA|ECR_0|ECR_1|ECR_2|LATEST] prefix.gkpStore/  | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.?.clr
 gatekeeper -dumpfragments -tabular -clear CLR asm.gkpStore        | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.CLR.clr
 gatekeeper -dumpfragments -tabular -clear OBTINITIAL asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.OBTINITIAL.clr &
 gatekeeper -dumpfragments -tabular -clear OBTCHIMERA asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.OBTCHIMERA.clr &
 ...
 gatekeeper -dumpfragments -tabular -clear LATEST asm.gkpStore     | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.LATEST.clr &
  • READ CLR's summary
 gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear | perl -ane '  print $1," ",$3-$2,"\n" if(/(\w+),(\d+),(\d+)$/);' | ~/bin/sum2.pl 
  • MATED
 frg22mates.pl < prefix.frg > prerfix.mates
  • Deleted reads
 gdt asm.gkpStore | p 'print $F[0],"\n" if($F[6]);'
  • All scaffold reads in FRG2 format
 cat asm.posmap.frgscf | grep scfid > scfid.posmap.frgscf
 gatekeeper -uid scfid.posmap.frgscf -dumpfrg -donotfixmates -format2 asm.gkpStore > scfid.frg

Id mapping

  • Frg FASTQ packed format id's are no recorded in the gatekeeper; assgned a new UID ; to map them back ...
 ~/bin/fastq2wgsUIDs.pl s_*txt > s.uid2iid2id
 gatekeeper -dumpfragments -tabular asm.gkpStore | awk '{print $1,$2,$10}' > s.uid2iid

 paste s.uid2iid2id s.uid2iid | p 'print "ERROR $_" if($F[3] ne $F[-1]);' # check the lengths

Meryl

  • Get mer frequencies
 meryl -C -B -m 22 -s prefix.seq -o prefix.22mers
  • Dump mer sequences
 meryl -Dt -s prefix.22mers 
  • Dump histogram
 meryl -Dh -s prefix.22mers | sort -nk2 -r | more
  • Other programs
 cat lib.seq | ~/bin/countKmers.pl -k 22 | head -20 |          grep -v "^>" > 22mers.fwd.grep
 cat lib.seq | ~/bin/countKmers.pl -k 22 | head -20 | revseq | grep -v "^>" > 22mers.rev.grep
 cat 22mers.fwd.grep 22mers.rev.grep > 22mers.grep
 cat lib.seq | ~/bin/fgrep.pl -f 22mer.grep | count.pl

OBT

  • runCA spec
 doOverlapTrimming = 1
  • command
 overlap -G ...   # -G to compute partial overlaps
  • Dump
 overlapStore -d ./0-overlaptrim/asm.obtStore
  • Number of jobs: should be < fileListMax (1024*10 by default) otherwise overlapStore fails
 ~/bin/getOvlJobsCount.pl ovlHashBlockSize ovlRefBlockSize frgCount
 #or
 cat 0-overlaptrim-overlap/ovlopts.pl | grep -c '^"h'

Overlap

  • Jobs
 merOverlapperSeedBatchSize     100000
 merOverlapperExtendBatchSize   75000
 
 overmerry.sh :       #frgs/merOverlapperSeedBatchSize   jobs -> seeds/ dir
 olap-from-seeds.sh : #frgs/merOverlapperExtendBatchSize jobs -> olaps/ dir
  • Restart in runCA
 rm 1-overlapper/o*

OverlapStore

  • Flags:
 -t -M            : don't improve performance much
 -I ignore_file   : not implemented
  • Dump
 overlapStore -d prefix.ovlStore > prefix.ovl.tab

 overlapStore -d prefix.ovlStore | awk '{print $1}' | uniq -c | awk '{print $2,$1}' > prefix.ovlStore.count.tmp
 cat prefix.ovlStore.count.tmp | perl -ane 'foreach ($p+1..$F[0]-1) { print "$_ 0\n" } $p=$F[0]; print $_'  > prefix.ovlStore.count
 rm prefix.ovlStore.count.tmp
  • Get stats; median number of ovls (5'/3') for each lib
 overlapStats -G prefix.gkpStore -O prefix.ovlStore/ -o prefix
 cat prefix.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print "  $F[1]"}' | pretty -o
  • Delete overlaps from *.obt.gz files
 zcat *.obt.gz  | convertOverlap -a -ovl |  ~/bin/difference1or2.pl -f delete.acc | convertOverlap -b -ovl | gzip > new.obt.gz
  • Delete overlaps from ovlStore : asm=>new
 overlapStore -d asm.ovlStore | perl -ane 'print $_ if($F[0]<$F[1]);' | ~/bin/difference1or2.pl -f delete.iid |  convertOverlap -b -ovl | gzip  >! new.ovb.gz
 overlapStore -c new.ovlStore -g asm.gkpStore/ new.ovb.gz

 #not necessary to gzip the overlap files
 overlapStore -d asm.ovlStore | perl -ane 'print $_ if($F[0]<$F[1]);' | ~/bin/difference1or2.pl -f delete.iid |  convertOverlap -b -ovl >! new.ovb
 overlapStore -c new.ovlStore -g asm.gkpStore/ new.ovb
  • Edit(invert) overlaps from ovlStore : asm=>new
 ... > invert.iid     # get the iids of the reads to change          
 overlapStore -d asm.ovlStore  | p 'print $_ if($F[0]<=$F[1]);' | ~/bin/invertOvl.pl -f invert.iid | convertOverlap -b -ovl | gzip > new.ovb.gz
 overlapStore -c new.ovlStore -g asm.gkpStore new.ovb.gz
  • Count how many overlaps have errors
 overlapStore -d asm.ovlStore | p 'print if($F[5]>0);' | wc -l
  • Build faster ...
 Example (Bee):
 e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl  # 13min/job  
 e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }'                                               # 3 min/job 
 total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)

Frgcorr

  • Fails on deleted reads with overlaps ; must delete the overlaps as well

Unitiger

Tigstore

  • list unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist     # ids  
 maID    isPresent  isDeleted  ptID  svID  fileOffset  covStat  
 0       1          0          1     2     0           3334     
 1       1          0          1     2     521908      -1       
 2       1          0          1     2     522764      543      
 ...
 tigStore -g asm.gkpStore -t asm.tigStore 2 -D properties     # exit: stdout    # properties
 unitig_coverage_stat        0 660
 unitig_microhet_prob        0 0.000000
 unitig_status               0 X
 unitig_unique_rept          0 X
 ...
  • number of tigs
tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist | tail -1 | awk '{print $1}'
  • list frags/consensus/layout in a unitig
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d frags 
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d consensus
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d layout
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0  -d properties  # exit: stderr & stdout

  • dump all unitig consensus
 tigStore -g asm.gkpStore -t asm.tigStore 2 -U -d consensus > utg.fasta
  • get single read unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep ' 1$' 
  • get read/utg or ctg stats
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties |  grep ^unitig_num_frags | getSummary.pl -z 1 -i 2 -t unitig_num_frags 
 tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties |  grep ^contig_num_frags | getSummary.pl -z 1 -i 2 -t contig_num_frags 
  • delete a unitig : get & update layout
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout | grep ^unitig | awk '{print $2}' | ~/bin/deleteUnitigIdLayout.pl > utg.layout.delete
 # or
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout |  ~/bin/deleteUnitigLayout.pl > utg.layout.delete
 tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.delete
 tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist | p 'print if($F[2]);'
  • delete single frg unitigs
 tigStore -g asm.gkpStore -t asm.tigStore 2 -U -d layout | egrep '^unitig|^data.num_frags' | p 'chomp; print $_," "; print "\n" if(/^data/)' | p 'print $F[1],"\n" if($F[3]==1);' | ~/bin/deleteUnitigIdLayout.pl > utg.layout.delete 
 tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.delete
  • assign unitig -1 to all unitigs (for addition to a new store)
 tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout | p 'if (/^unitig/) { print "unitig -1\n"} elsif(/^UTG/) { $F[7]=-1; print join " ",@F; print "\n" } else { print $_};' > utg.layout.add
 tigStore -g asm.gkpStore -t new.tigStore 2 -R utg.layout.add
 tigStore -g asm.gkpStore -t asm.tigStore 1 -u id -d layout | tee utg.layout | p 'if (/^unitig/) { print "unitig -1\n"} else { print $_};' > utg.layout.add
 tigStore -g asm.gkpStore -t new.tigStore 1 -R utg.layout.add
  • fix unitig consensus
 touch utg.layout
 grep FAILED 5-consensus/asm*err | cut -f3 -d ' ' | sort -u -n | p 'system "tigStore -g asm.gkpStore -t asm.tigStore 2 -u $F[0] -d layout >> utg.layout\n";'  
 cat  utg.layout | updateUTGlayout.pl  >  utg.layout.update
 tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.update

 grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | cut -f1 | sort -u | p 'system "touch $F[0].success\n";' 
 touch 5-consensus/consensus.success

CGW

  • Long run times usually mean problems (conflicts) with the data
  • Drbug
 gdb /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin//cgw
 $ run -j 1 -k 5 -r 5 -s 2 -z -P 2 -B 75000 -m 100 -g ./asm.gkpStore -t ./asm.tigStore -o asm
 bt
  • Differences between the LAST run of CGW and PREVIOUS ones
 # options
 PREV       LAST
            -R 7             # restart from checkpoint file number 'ckp'
            -N ckp01-ABS     # restart from checkpoint location 'ckp' (see the timing file)
 -s 0       -s 2             # stone throwing level
 -S 0                        # S==0 : do not resolve surrogates
 -G                          # Don't generate output (cgw or cam) 
 #files
 difference.pl 7-4-CGW.ls 7-0-CGW.ls
 asm.asm.cam
 asm.dregs.cam
 asm.partitionInfo  #  "equivalent of" 4-unitigger/asm.partitioningInfo 
 asm.partitioning   #  "equivalent of" 4-unitigger/asm.partitioning
  • running only one instance of ECR
 rm -fr 7-[234]*
 /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA -d . -p asm -s ./runCA.spec doExtendClearRanges=1 *.frg >> & runCA.log
  • canceling ECR run (if doExtendClearRanges set to 1 or 2 and the run takes too long)
 grep ckp ./7-0-CGW/cgw.out 
 ./7-0-CGW/asm.ckp.3 (logical ckp01-ABS) after building scaffolds at ...
 ./7-0-CGW/asm.ckp.4 (logical ckp03-ACD) after scaffold cleaning at ...
 ./7-0-CGW/asm.ckp.5 (logical ckp05-1SM) after 1st scaffold merge at ...
 ./7-0-CGW/asm.ckp.6 (logical ckp06-AS) after stone throwing at ...
 ./7-0-CGW/asm.ckp.7 (logical ckp08-2SM) after 2nd scaffold merge at ...
 ./7-0-CGW/asm.ckp.8 (logical ckp09-FR) after final rocks at ...

 cgw -R 7 -N ckp09-FR -j 1 -k 5 -r 5 -s 2 -z -P 2 -B 75000 -m 100 -g ./asm.gkpStore -t ./asm.tigStore -o ./7-0-CGW/asm

 grep ckp ./7-0-CGW/cgw.out 
 ...
 ./7-0-CGW/asm.ckp.9 (logical ckp10-PS) after partial stones at ...
 ./7-0-CGW/asm.ckp.10 (logical ckp11-FCS) after final contained stones at ...
 ./7-0-CGW/asm.ckp.11 (logical ckp13-RS) after resolve surrogates at ...
 ./7-0-CGW/asm.ckp.12 (logical ckp14-FIN) after output at ...
  • running ECR (if initially doExtendClearRanges=0) ; make sure cgwPurgeCheckpoints=0
 runCA -d . -p asm -s runCA.spec stopAfter=scaffolder cgwPurgeCheckpoints=0 doExtendClearRanges=0 *.frg >> & runCA.log
 
 rm 7-CGW
 cd 7-0-CGW
 mkdir OLD
 mv asm.ckp.9 asm.ckp.1? OLD asm.*.cam asm.partition* asm.timing OLD
 cd ..
 
 runCA -d . -p asm -s runCA.spec stopAfter=scaffolder cgwPurgeCheckpoints=0 doExtendClearRanges=1 *.frg >> & runCA.log
  • DistUpadte file
7-0-CGW/stat/scaffold_final.distupdate.dst
  • Stats
cat 7-0-CGW/rezlog/stone.i02.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | p 'if(/^Scaff/) { print $sum,"\n"; $sum=0; } else { $sum+=$F[-1]}' | getSummary.pl -t scf
 cat 7-0-CGW/rezlog/stone.i02.log | grep -B 1000000 "^Scaffold Edges" | grep ':' | grep -v Scaff | getSummary.pl -i 8 -t ctg

ECR

  • doExtendClearRanges = 2 => cgw run 3 times, ECR 2 times

Consensus

  • Dump unitig seq
 tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D unitigs > unitigs.txt
 cat unitigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -u  $F[0]\n";' >& unitigs.fasta.tmp
 cat unitigs.fasta.tmp  | grep -v null | sed 's/cns=//' | sed 's/-//g' | nl0 | tab2fasta.pl > unitigs.fasta
 rm unitigs.fasta.tmp
  • Dump contig seq
 tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D contigs > contigs.txt
 cat contigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -c  $F[0]\n";'>& contigs.fasta.tmp
 cat contigs.fasta.tmp | grep -v null | sed 's/cns=//' | sed 's/-//g' | nl0 | tab2fasta.pl > contigs.fasta
 rm contigs.fasta.tmp
  • Recompute unitig consensus
 utgcns -g ./asm.gkpStore -t ./asm.tigStore 2 011 -u 122533 -v -V

Terminator

Log file & runtime (wgs-5.2)

  • Filter log
 ~/bin/getCAruntimes.pl -filter runCA.log > runCA.filter.log
  • Display log
 cat runCA.log | ~/bin/getCAruntimes.pl -hour -total -plot  # => runCA.runTimes ; runCA.runTimes.gp ; runCA.runTimes.png
 display runCA.runTimes.png

Output processing

Posmap

  • Get posMap files
buildPosMap -o prefix < prefix.asm
  • Get number of reads/ctg or scf (from the pos files)
 cat prefix.posmap.frgctg | count.pl -i 1 > prefix.posmap.ctgfrgcnt
 cat prefix.posmap.frgscf | count.pl -i 1 > prefix.posmap.scffrgcnt
 cat prefix.posmap.ctgscf | count.pl -i 1 > prefix.posmap.scfctgcnt
  • Compute read/mate/ctg cvg (from the pos files)
 posmap2cvg.pl                                  < prefix.posmap.frgscf > prefix.posmap.scffrgcvg
 OLD/posmap2cvg.4.pl -mates prefix.posmap.mates < prefix.posmap.frgscf > prefix.posmap.scfmatecvg
 posmap2cvg.pl                                  < prefix.posmap.ctgscf > prefix.posmap.scfctgcvg
  • Get 0cvg regions
 posmap2cvg.pl -f prefix.posmap.scflen -Max 0 < prefix.posmap.frgscf  # scaff gaps
 posmap2cvg.pl -f prefix.posmap.ctglen -Max 0 < prefix.posmap.frgctg  # surrogates
  • Get links
 ~/bin/scf2scflnk.sh prefix                      
 ~/bin/ctg2ctglnk.sh prefix                      
 ...
  • Add ctg/scf into to the mates fils
 join12.pl prefix.posmap.mates prefix.posmap.frgctg > prefix.posmap.mates2.tmp
 join12.pl prefix.posmap.mates2.tmp prefix.posmap.frgscf > prefix.posmap.mates2
 rm prefix.posmap.mates2.tmp
  • Display graphs
cat prefix.posmap.scf2scflnk | ~/bin/tab2dot.pl | dot -Tpng -o prefix.posmap.scf2scflnk.png
cat prefix.posmap.ctg2ctglnk | ~/bin/tab2dot.pl | dot -Tpng -o prefix.posmap.ctg2ctglnk.png
  • Qc stats based on posmap files
 posmap2qc.pl -ctgscf prefix.posmap.ctgscf > prefix.qc
  • Break scf/ctg pipeline
 ~/bin/breakPosmapKeep.amos
  • Get scaff file
 ~/bin/posmap2scaff.pl prefix.posmap.ctgscf
  • Gaps longer than a certain size
 cat asm.posmap.ctgscf | p 'print $p,$_,"\n" if($F[1] eq $P[1] and $F[2]-$P[3]>1000); $p=$_;@P=@F;'

Mate redundancy

  • Based on posmap file
 cat asm.posmap.frgscf |  ~/bin/posmap2ovl.pl -p 75 | grep -v ^ref | ~/bin/getMateRedundancy12.pl > asm.redundant_mate_pairs 
 cat asm.redundant_mate_pairs |  perl ~/bin/cluster.pl | sort > asm.redundant_mate_clusters
  • Based on overlapStore
 #Example
 getMateRedundancy.amos -D BEGIN=1 -D END=219  -D GKPSTORE=../asm.gkpStore -D OVLSTORE=../asm.ovlStore s_2_3kb
 #=> asm.redundant_mate_pairs , asm.redundant_mate_clusters
  • Based on sequence (first 32bp in fwd/rev Illumina reads; max 1 mismatch)
 paste s_2_?_8kb*.txt | perl -ane 'chomp; if($.%4==1) { s/\@(\S+)[12]//; printf("%-45s",$1); } elsif($.%4==2) { print substr($F[0],0,32),"\t",substr($F[1],0,32),"\n" } ' | sort -k2,3 | ./getDuplicates.pl > s_2_8k.tab
 cat s_2_8k.tab | perl -ane 'print $F[0],"1\n";' >  s_2_1_8k.redundant
 cat s_2_8k.tab | perl -ane 'print $F[0],"2\n";' >  s_2_2_8k.redundant

Fasta

  • Align deg to scaff
 nucmer -mum -l 40 -c 250 scf.fasta deg.fasta  -p scf-deg
 delta-filter -q scf-deg.delta > scf-deg.filter-q.delta
 ~/bin/delta2pairCvg+.pl < scf-deg.filter-q.delta > scf-deg.filter-q.cvg.pairs
 ~/bin/max2.pl -i 4 -j 6 < scf-deg.filter-q.cvg.pairs > scf-deg.filter-q.maxCvg.pairs
 cat scf-deg.filter-q.maxCvg.pairs | p 'print $_ if($F[5]-$F[6]<2000);' > scf-deg.filter-q.variants.pairs
 awk '{print $5,$6}' scf-deg.filter-q.variants.pairs > scf-deg.filter-q.variants.ids
  • Indexing
 # contigs
 ~/bin/formatdb.pl < prefix.ctg.fasta > prefix.ctg.fasta.offset
 ~/bin/fastacmd.pl -o prefix.ctg.fasta.offset -f filter.ctg.ids < prefix.ctg.fasta > filter.ctg.fasta
 # same for scaff, deg ...
  • Contaminant search : nucmer against
 /nfshomes/dpuiu/db/Ecoli.all
 /nfshomes/dpuiu/db/UniVec_Core
 /nfshomes/dpuiu/db/OtherVec

Asm

  • Get message counts
cat asm.asm | grep "^{" | egrep -v 'MPS|UPS|VAR|CTP' | uniq -c | more
     1 {MDI
273375 {AFG  # reads
 36258 {AMP  # links
  2185 {UTG  # MPS+           messages
  4876 {ULK
  1443 {CCO  # VAR*,MPS+,UPS+ messages
  3277 {CLK
    15 {SCF
    40 {SLK  # CTP+           messages
  • Get output seq
 asmOutputFasta -p prefix < prefix.asm
  • Get output seq clr
 ~/bin/asm2clrs.pl < prefix.asm > < prefix.asm.clr
  • Get new library insert sizes:
 asm2mdi.pl < prefix.asm > prefix.mdi
  • Get surrogate ids
 #most of them have negative aStat's
 $ cat *.asm | grep -B 8 ^sta:S | grep ^acc | perl -ane '/(\d+)/; print "utg".$1,"\n";'

Qc

 caqc.pl -euid  prefix.asm

Scaffolds

  • Get scaff files
 cavalidate -e 10 prefix
 bank2scaff prefix.bnk/ > prefix.scaff
 ~/bin/ca2scaff -i prefix.asm -o prefix.scaff
 cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$2}' > prefix.scfctg
 cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$3}' > prefix.scflen  
 cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$4}' > prefix.scfspan  
 tac prefix.scaff | p '$count=0 if(/^>/); $count++; print $_ if($count>2);' | awk '{print $4}'   # sequence gaps

AGP

 asmToAGP.pl < prefix.asm > prefix.agp


        /--- buildPosMap ------> .posmap
       /
  .asm /---- asmToAGP ------------------------------------------------> .AGP 
      \                                                              /   
       \---- cavalidate, bank2scaff?--> .scaff --- scaff2agp.pl --- 
  • Other scripts:
 bank2scaff : incorrect gap sizes
 valiadteAgp.pl < prefix.agp
 infoseq 9-terminator/prefix.ctg.fasta | sed 's/ctg//' >! prefix.ctg.infoseq
 scaff2agp.pl -i prefix.ctg.infoseq < prefix.scaff > prefix.agp

Moving an assembly

  • Minimum file set
 scp runCA.sh
 scp runCA.spec

 scp asm.gkpStore/*

 scp 3-overlapcorrection/asm.erates.updated 

 scp 4-unitigger/asm.iidmap
 scp 4-unitigger/asm.partitioning
 scp 4-unitigger/asm.partitioningInfo
 scp 4-unitigger/unitigger.success

 scp 5-consensus/consensus.success

 scp 7-0-CGW/asm.timing
 scp 7-0-CGW/asm.ckp.11  # last one
 scp 7-0-CGW/asm.partitioning
 scp 7-0-CGW/asm.partitionInfo
 scp 7-0-CGW/cgw.out.00.ckp.11
 scp 7-0-CGW/cgw.out

 scp asm.tigStore/*

Add reads to an assembly

 ln -s ../wgs-prev/prefix.frg .
 ln -s ../Data/new.frg
 cp ../wgs-prev/runCA.spec ../wgs-prev/runCA.sh   .
 ln -s ../wgs-prev/0-mercounts
 cp -R ../wgs-prev/prefix.gkpStore
 gatekeeper -a -o prefix.gkpStore new.frg

 ll -d ../wgs-prev/1-overlapper/000000000*                  # if multiple jobs ...
 mkdir -p 1-overlapper/0000000001                           # mkdir -p 1-overlapper/0000000002 ...  ???
 
 cat ../wgs-prev/1-overlapper/*pl | grep '^"h' | tail -1    # get last qry/ref block  => "h0056000001r0050000000"

 cd 1-overlapper/0000000001
 ln -s ../../wgs-prev/1-overlapper/0000000001/*ovb.gz .
 rm h0056000001*.ovb.gz *r0050000000.ovb.gz                 # this should change from case to case 
 cd ../..

Removing bad mates

 cat ../wgs-prev/9-terminator/asm.posmap.frags | grep bad > asm.posmap.frags.bad

 gatekeeper -dumpfragments -tabular -uid asm.posmap.frags.bad ./wgs-prev/asm.gkpStore |  grep -v ^UID | ~/bin/deleteLkg.pl -j 2 > asm.posmap.frags.bad.delete.lkg
 ln -s ../wgs-prev/*.frg 
 gatekeeper -o asm.gkpStore *.frg asm.posmap.frags.bad.delete.lkg 
 ln -s ../wgs-prev/0-mercounts/
 ln -s ../wgs-prev/asm.ovlStore/
 ln -s ../wgs-prev/3-overlapcorrection/
 ln -s ../wgs-prev/4-unitigger/
 cp -R ../wgs-prev/asm.tigStore/ .
 ln -s ../wgs-prev/5-consensus

 runCA ...

Test datasets

Porphyromonas_gingivalis_W83 : Sanger + 454

 wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/porphyromonas_gingivalis_w83/anc.porphyromonas_gingivalis_w83.001.gz
 ...
 wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/porphyromonas_gingivalis_w83/xml.porphyromonas_gingivalis_w83.001.gz
 /fs/sztmpscratch/dpuiu/porphyromonas_gingivalis_w83/
  • Complete genome at NCBI : NC_002950.2 ( 2,343,476bp 48.29%GC)
 bp_fetch.pl net::genbank:NC_002950.2 > porphyromonas_gingivalis_w83.1con

Input formatting

Read counts

 zcat anc.*.gz | ~/bin/traceanc2anc.pl SEQ_LIB_ID INSERT_SIZE  INSERT_STDEV SVECTOR_CODE | count.pl 

TA formatting

  • frg FORMAT2
  • TIs are frg uids
  • All libs
 tracedb-to-frg.pl -xml xml.porphyromonas_gingivalis_w83.001.gz
 tracedb-to-frg.pl -lib xml.porphyromonas_gingivalis_w83.001.gz
 tracedb-to-frg.pl -frg xml.porphyromonas_gingivalis_w83.001.gz
 =>
 porphyromonas_gingivalis_w83.1.lib.frg  
 porphyromonas_gingivalis_w83.2.001.frg.bz2  
 porphyromonas_gingivalis_w83.3.lkg.frg 
  • One lib at a time
 tracedb-to-frg.pl -only T13146 -xml xml.porphyromonas_gingivalis_w83.001.gz >& ! tracedb-to-frg.log
 tracedb-to-frg.pl -only T13146 -lib xml.porphyromonas_gingivalis_w83.001.gz >>& tracedb-to-frg.log
 tracedb-to-frg.pl -only T13146 -frg xml.porphyromonas_gingivalis_w83.001.gz >>& tracedb-to-frg.log

454 SRA formatting

Illumina formatting

  • use fastqToCA (wgs-6.0)
 fastqToCA \
   -insertsize 3000 300 \
   -libraryname 1 \
   -type illumina \
   -fastq /fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt | \
    sed 's/ori:I/ori:O/' \
   > 1.frg
 =>
 {LIB                                                                                                                                      
 act:A                                                                                                                                     
 acc:1                                                                                                                                     
 ori:O                                                                                                                                     
 mea:3000.000                                                                                                                              
 std:300.000                                                                                                                               
 src:                                                                                                                                      
 .                                                                                                                                         
 nft:10                                                                                                                                    
 fea:                                                                                                                                      
 forceBOGunitigger=1                                                                                                                       
 isNotRandom=0                                                                                                                             
 doNotTrustHomopolymerRuns=0                                                                                                               
 doRemoveDuplicateReads=0                                                                                                                  
 doNotQVTrim=0                                                                                                                             
 goodBadQVThreshold=0                                                                                                                      
 doNotOverlapTrim=0                                                                                                                        
 usePackedFragments=1                                                                                                                      
 illuminaFastQType=illumina                                                                                                                
 illuminaSequence=/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt    
 .                                                                                                                                         
 }
 gatekeeper \
   -o asm.gkpStore.BUILDING \
   -T -F  \
   -E asm.gkpStore.errorLog \
   1.frg 
 => Segmentation fault

FASTQ

  • This won't remove the linker:
 zcat SRR001351.fastq.gz | ~/bin/fastq2seq.pl  > SRR001351.seq
 zcat SRR001351.fastq.gz | ~/bin/fastq2qual.pl > SRR001351.qual
 convert-fasta-to-v2.pl -454 -l SRR001351 -s SRR001351.seq -q SRR001351.qual
  • Cleaning seq
 ~/bin/OLD/cleanFastaq.pl \
    s_2_1_1.1kb_sequence.txt s_2_2_1.1kb_sequence.txt \                                     # input
    s_2_1_1.1kb_sequence.clean.txt  s_2_2_1.1kb_sequence.clean.txt \                        # output mates
    s_2_1_1.1kb_sequence.clean.single.txt  s_2_2_1.1kb_sequence.clean.single.txt            # output single
   echo prefix_1.fastq prefix_2.fastq  > prefix.ls
   quake.py -f prefix.ls
  • Illumina note:
 We recommend removing reads that do not pass the GA analysis software Failed_Chastity fi lter before attempting to assemble the sequence.
 The chastity of a base call is the ratio of the intensity of the greatest signal divided by the sum of the two greatest signals. 
 Reads do not pass the quality fi lter if there are two or more base calls with chastity of less than 0.6 in the fi rst 25 cycles. 
 These reads have an “N” in the last column of the GA analysis software export file.

 Removed all reads that contained ambiguous characters (Ns).
 Removed reads that did not contain at least 25 Q30 bases among the first 35 cycles (s35). 

SFF

 sffToCA -libraryname SRR001351 -output SRR001351.frg SRR001351.sff

Running

  • Command
 ~/wgs/Linux-amd64/bin/runCA -p pging -d testassembly porphyromonas_gingivalis_w83.* >& runCA.log

CAUG notes

Illimina mates

  • short: innies PE (paired ends)
  • long: outies MP (mate pairs) "lots of issues" (eg> : linker is one bp; solution : take 36bp from each end of the read to amke sure you don't get the linker
  • wgs 6.1 : Illumina limit of 104 bp; can recompile the code o make it 232

fastqToCA -insertsize 280 30 -libraryname yersinia_pestis.100bp.0280bp.005x -type illumina -innie -fastq $PWD/yersinia_pestis.100bp.0280bp.005x.1.fastq,$PWD/yersinia_pestis.100bp.0280bp.005x.2.fastq > illumina.frg {VER ver:2 } {LIB act:A acc:yersinia_pestis.100bp.0280bp.005x ori:I mea:280.000 std:30.000 src: . nft:12 fea: forceBOGunitigger=1 shortOverlapModel=1 isNotRandom=0 doNotTrustHomopolymerRuns=0 doRemoveDuplicateReads=0 doNotQVTrim=0 goodBadQVThreshold=0 doNotOverlapTrim=0 usePackedFragments=1 illuminaFastQType=illumina illuminaOrientation=innie illuminaSequence=/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.1.fastq,/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.2.fastq . } {VER ver:1 }

gatekeeper -o illumina.gkpStore illumina.frg Starting file 'illumina.frg' at line 0. Type set to ILLUMINA 1.3+. Orientation set to INNIE. Processing illumina reads from '/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.1.fastq'

                          and '/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.2.fastq'

GKP finished with no errors.

gatekeeper -dumpinfo illumina.gkpStore/ LOAD STATS

1 libInput 1 libLoaded 0 libErrors 0 libWarnings

0 frgInput 228742 frgLoaded 0 frgErrors 0 frgWarnings

0 lkgInput 0 lkgLoaded 0 lkgErrors 0 lkgWarnings

0 sffInput 0 sffLoaded 0 sffErrors 0 sffWarnings

0 sffLibCreated

0 plcInput 0 plcLoaded 0 plcErrors 0 plcWarnings

228742 numRandom 228742 numPacked 0 numNormal 0 numStrobe

GLOBAL STATS

228742 FRG 1 LIB

LibraryName numActiveFRG numDeletedFRG numMatedFRG readLength clearLength GLOBAL 228742 0 218724 22874200 22191085 LegacyUnmatedReads 0 0 0 0 0 yersinia_pestis.100bp.0280bp.005x 228742 0 218724 22874200 22191085

sffToCA -insertsize 3200 900 -libraryname porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff -trim chop -linker flx -output 454 porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff loadSFF()-- Loading 'porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff'. removeDuplicateReads()-- from 1 to 19622 removeDuplicateReads()-- finished detectMates()-- from 1 to 19621

Overlap: check the hash load

 6% is too low

~75% would be ideal