Dpuiu CA
Jump to navigation
Jump to search
Links
Data Issues
- Problems if the coverage > read length; solution: sample the reads
- Analyze the data before assemble it:
- If there is a reference genome map the reads to it
- Plot kmer frequencies (use multiple kmer sizes) ; see how they compare with the estimated coverage
Sourceforge
- Constants:
cat src/AS_global.h ... #define AS_READ_MIN_LEN 64 # should decrease for short Illumina reads 32 ??? #define AS_OVERLAP_MIN_LEN 40 # should decrease for short Illumina reads 30 ??? #define AS_READ_MAX_PACKED_LEN 104 # max Illumina read length; incread to 124 #define AS_READ_MAX_NORMAL_LEN_BITS 11 # should be increased to 12+ if there are many frequent kmesra nd dpn't fit in the hash
cat src/AS_CGW/AS_CGW_dataTypes.h #define CGW_MIN_DISCRIMINATOR_UNIQUE_LENGTH 1000 # should be decraesed for short reads assemblies ; all unitigs less than that do not get scaffolded
- Download & compile CVS tip
cvs -z3 -d:pserver:anonymous@wgs-assembler.cvs.sourceforge.net:/cvsroot/wgs-assembler co -P wgs-assembler cd wgs-assembler/src make SITE_NAME=LOCAL => executables under wgs-assembler/Linux-amd64/bin/
- Download & compile release
wget ... bzip2 -dc wgs-6.0-beta.tar.bz2 | tar -xf - cd wgs-6.0-beta cd kmer sh configure.sh gmake install cd ../src gmake cd ..
- Compile for debugging :
make SITE_NAME=LOCAL BUILDDEBUG=1 # in "c_make.as"
Run
- Help
runCA.pl -fields runCA.pl -options
- Batch submission
touch runCA.sh echo --- runCA $$ restarted by $USER at `date` --- >>& runCA.log ~/bin/runCA -d . -p prefix -s ~/bin/runCA.spec *.frg >>& runCA.log
at runCA.sh
- Sanger
runCA -d . -p prefix -s spec Sanger.frg
- doOverlapTrimming=1 (default)
- low qual Sanger reads (ex: 20 or 'D') are discarded
- Illumina reads are trimmed at 5'/3' (bp with qual < 5 (E) are deleted; read is deleted if new length< 64)
- astatLowBound 1 (default) ; can be lowered for low/un-uniform coverage
- astatHighBound 5 (default)
- computeInsertSize=0 (default)
- 454 & Hybrid : convert sff to frg
sffToCA -libraryname 454 -output 454 454.sff runCA -d . ovlOverlapper=mer unitigger=bog -p prefix *.frg runCA -d . -s ~/bin/runCA.454.spec -p prefix *.frg
- gatekeeper -L : search for 454 paired end linker
- AS_GKP_sff.c : Load_SFF function
- for mated libs set mean=3000,stdev=300
Parameters
- stopAfter:
initialStoreBuilding overlapBasedTrimming | OBT overlapper # runs overlap, overlapStore unitigger # runs frg/ovl correction consensusAfterUnitigger scaffolder consensusAfterScaffolder
- Example:
~/bin/runCA -d . -p asm -s ~/bin/runCA.454.spec stopAfter=initialStoreBuilding *.frg
Gatekeeper
Build
- Create & add
gatekeeper -o prefix.gkpStore -T -F prefix*frg gatekeeper -a -o prefix.gkpStore -T -F prefix*frg
Update
- edit frg: act:R !!!
{DST act:R acc:10691 mea:161662 std:28760 }
- append
gatekeeper -a -o prefix.gkpStore -T -F edit.frg
- Using tab file
Example: No restrictions on lib stddev with --edit option
cat prefix.update lib uid 19070 mean 167001 lib uid 19070 stddev 167001 lib uid 19070 isNotRandom 1 lib uid 19070 Orientation O # outie orientation for Illimina libs >1K gatekeeper --edit prefix.update prefix.gkpStore
- Update read CLR/CLV
gatekeeper -a -r prefix.clr -o prefix.gkpStore # not in the wgs release gatekeeper -a -v prefix.clv -o prefix.gkpStore
Delete
- DST messages can not be deleted
- A FRG can be deleted only if there is no related LKG message; an abbreviated MSG
- act:D !!!
# frg1 {LKG act:D typ:M fg1:94376 fg2:94654 } {FRG act:D acc:94376 } # frg2 ??? {LKG act:D fg:94376 fg:94654 } {FRG act:D acc:94376 }
cat delete.ids | perl -ane 'print "{FRG\nact:D\nacc:$F[0]\n}\n";'> delete.frg cat delete.ids | ~/bin/deleteFrg.pl -forward 1 -reverse 2 > delete.frg # forward reads have suffix 1; forward reads have suffix 1 gatekeeper -a -o prefix.gkpStore -T -F delete.frg
Dump
- INFO
gatekeeper -dumpinfo prefix.gkpStore
- LAST frg in store
gatekeeper -dumpinfo -lastfragiid prefix.gkpStore
- FASTA sequences
gatekeeper -dumpfastaseq prefix.gkpStore > prefix.seq gatekeeper -dumpfastaqlt prefix.gkpStore > prefix.qual
- FASTA reads for newbler
gatekeeper -dumpnewbler newbler prefix.gkpStore => prefix.fna,prefix.fna.qual grep "^>" prefix.fna | head -2 >SRR001355.3635.1a template=2020+2021 dir=F library=SRX000348 trim=20-117 >SRR001355.3635.1b template=2020+2021 dir=R library=SRX000348 trim=1-130
- FASTQ reads
gatekeeper -dumpfastq prefix. prefix.gkpStore => prefix.1.fastq prefix.2.fastq prefix.paired.fastq # contains both asm.1.fastq & asm.2.fastq reads prefix.unmated.fastq
cat prefix.1.fastq | perl ~/bin/clrFastq.pl > prefix.1.clr.fastq cat prefix.2.fastq | perl ~/bin/clrFastq.pl > prefix.2.clr.fastq cat prefix.paired.fastq | perl ~/bin/clrFastq.pl > prefix.paired.clr.fastq cat prefix.unmated.fastq | perl ~/bin/clrFastq.pl > prefix.unmated.clr.fastq
- FRG file
gatekeeper -dumpfrg prefix.gkpStore > prefix.frg gatekeeper -dumpfrg prefix.gkpStore -format2 > prefix.frg2
- READS in TAB format
gatekeeper -dumpfragments -tabular prefix.gkpStore > prefix.gkpStore.info gatekeeper -dumpfragments -tabular prefix.gkpStore | grep -v ^UID | awk '{print $1,$2}' > prefix.gkpStore.uid2iid
- LIBS in TAB format
gatekeeper -dumplibraries -tabular prefix.gkpStore | grep -v ^UID | awk '{print $1,$4,$5}' | sed 's/\.000//g'> prefix.dst # original vals asm2mdi.pl < prefix.asm > prefix.mdi # final vals
- READ CLR's (default=LATEST)
gatekeeper -dumpfragments -tabular prefix.gkpStore/ | perl -ane 'print join "\t", @F[1,0,-2,-1,6]; print "\n";' | head | pretty -o IID UID clrBeginLATEST clrEndLATEST isDeleted 1 HWI-EAS385_0001:1:1:1569:15197#0/1 0 65 0 2 HWI-EAS385_0001:1:1:2095:12498#0/1 0 67 0 3 HWI-EAS385_0001:1:1:5452:4010#0/1 0 78 0 4 HWI-EAS385_0001:1:1:12400:16773#0/1 0 105 0 5 HWI-EAS385_0001:1:1:12884:18158#0/1 0 80 0 ... gatekeeper -dumpfragments -tabular prefix.gkpStore/ | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.clr gatekeeper -dumpfragments -tabular -clear [CLR|OBTINITIAL|OBTMERGE|OBTCHIMERA|ECR_0|ECR_1|ECR_2|LATEST] prefix.gkpStore/ | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > prefix.?.clr
gatekeeper -dumpfragments -tabular -clear CLR asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.CLR.clr gatekeeper -dumpfragments -tabular -clear OBTINITIAL asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.OBTINITIAL.clr & gatekeeper -dumpfragments -tabular -clear OBTCHIMERA asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.OBTCHIMERA.clr & ... gatekeeper -dumpfragments -tabular -clear LATEST asm.gkpStore | perl -ane 'print join "\t", @F[0,-2,-1]; print "\n";' > asm.LATEST.clr &
- READ CLR's summary
gatekeeper -dumpfragments asm.gkpStore | grep ^fragmentClear | perl -ane ' print $1," ",$3-$2,"\n" if(/(\w+),(\d+),(\d+)$/);' | ~/bin/sum2.pl
- MATED
frg22mates.pl < prefix.frg > prerfix.mates
- Deleted reads
gdt asm.gkpStore | p 'print $F[0],"\n" if($F[6]);'
- All scaffold reads in FRG2 format
cat asm.posmap.frgscf | grep scfid > scfid.posmap.frgscf gatekeeper -uid scfid.posmap.frgscf -dumpfrg -donotfixmates -format2 asm.gkpStore > scfid.frg
Id mapping
- Frg FASTQ packed format id's are no recorded in the gatekeeper; assgned a new UID ; to map them back ...
~/bin/fastq2wgsUIDs.pl s_*txt > s.uid2iid2id gatekeeper -dumpfragments -tabular asm.gkpStore | awk '{print $1,$2,$10}' > s.uid2iid paste s.uid2iid2id s.uid2iid | p 'print "ERROR $_" if($F[3] ne $F[-1]);' # check the lengths
Meryl
# Get mer frequencies meryl -C -B -m 22 -s prefix.seq -o prefix.22mers
# Dump mer sequences meryl -Dt -s prefix.22mers > prefix.22mers.seq
# Dump histogram meryl -Dh -s prefix.22mers | sort -nk2 -r > prefix.22mers.histo
# Get missing mers meryl -C -B -m 22 -s ref.fasta -o ref.22mers meryl -C -B -m 22 -s qry.fasta -o qry.22mers meryl -M sub -s ref.22mers -s qry.22mers -o diff.22mers meryl -Dt -s diff.22mers > diff.22mers.seq
# Combine mers
- jellyfish
# Get mer frequencies jellyfish count -s 1000000000 -m 18 --both-strands -t 16 -o prefix.jf prefix*.fastq [prefix*.seq] jellyfish merge prefix.jf_? -o prefix.jf # Dump histogram jellyfish histo prefix.jf
- countKmers.pl (other)
cat lib.seq | ~/bin/countKmers.pl -k 22 | head -20 | grep -v "^>" > 22mers.fwd.grep cat lib.seq | ~/bin/countKmers.pl -k 22 | head -20 | revseq | grep -v "^>" > 22mers.rev.grep cat 22mers.fwd.grep 22mers.rev.grep > 22mers.grep cat lib.seq | ~/bin/fgrep.pl -f 22mer.grep | count.pl
meryl -C -B -m 12 -s prefix.seq -o prefix.12mers
meryl -Dh -s prefix.12mers | sort -nk2 -r | more
OBT
- runCA spec
doOverlapTrimming = 1
- command
overlap -G ... # -G to compute partial overlaps
- Dump
overlapStore -d ./0-overlaptrim/asm.obtStore
- Number of jobs: should be < fileListMax (1024*10 by default) otherwise overlapStore fails
~/bin/getOvlJobsCount.pl ovlHashBlockSize ovlRefBlockSize frgCount
#or cat 0-overlaptrim-overlap/ovlopts.pl | grep -c '^"h'
Overlap
- Jobs
merOverlapperSeedBatchSize 100000 merOverlapperExtendBatchSize 75000 overmerry.sh : #frgs/merOverlapperSeedBatchSize jobs -> seeds/ dir olap-from-seeds.sh : #frgs/merOverlapperExtendBatchSize jobs -> olaps/ dir
- Num ber of jobs
cat 1-overlapper/ovlopts.pl | grep -c '^"h'
- Restart in runCA
rm 1-overlapper/o*
OverlapStore
- Flags:
-t -M : don't improve performance much -I ignore_file : not implemented
- Dump
overlapStore -d prefix.ovlStore > prefix.ovl.tab overlapStore -d prefix.ovlStore | awk '{print $1}' | uniq -c | awk '{print $2,$1}' > prefix.ovlStore.count.tmp cat prefix.ovlStore.count.tmp | perl -ane 'foreach ($p+1..$F[0]-1) { print "$_ 0\n" } $p=$F[0]; print $_' > prefix.ovlStore.count rm prefix.ovlStore.count.tmp
- Get stats; median number of ovls (5'/3') for each lib
overlapStats -G prefix.gkpStore -O prefix.ovlStore/ -o prefix cat prefix.repeatmodel.lib.00*stats | egrep '^Lib|nSamples|median' | p 'chomp; if(/Lib/) {print "\n$_"} else { print " $F[1]"}' | pretty -o
- Delete overlaps from *.obt.gz files
zcat *.obt.gz | convertOverlap -a -ovl | ~/bin/difference1or2.pl -f delete.acc | convertOverlap -b -ovl | gzip > new.obt.gz
- Delete overlaps from ovlStore : asm=>new
overlapStore -d asm.ovlStore | perl -ane 'print $_ if($F[0]<$F[1]);' | ~/bin/difference1or2.pl -f delete.iid | convertOverlap -b -ovl | gzip >! new.ovb.gz overlapStore -c new.ovlStore -g asm.gkpStore/ new.ovb.gz #not necessary to gzip the overlap files overlapStore -d asm.ovlStore | perl -ane 'print $_ if($F[0]<$F[1]);' | ~/bin/difference1or2.pl -f delete.iid | convertOverlap -b -ovl >! new.ovb overlapStore -c new.ovlStore -g asm.gkpStore/ new.ovb
- Edit(invert) overlaps from ovlStore : asm=>new
... > invert.iid # get the iids of the reads to change overlapStore -d asm.ovlStore | p 'print $_ if($F[0]<=$F[1]);' | ~/bin/invertOvl.pl -f invert.iid | convertOverlap -b -ovl | gzip > new.ovb.gz overlapStore -c new.ovlStore -g asm.gkpStore new.ovb.gz
- Count how many overlaps have errors
overlapStore -d asm.ovlStore | p 'print if($F[5]>0);' | wc -l
- Get 5'/3' overlap stats
overlapStore -d asm.ovlStore | perl /nfshomes/dpuiu/Archives/JCVI/bin/countOverlaps53.pl | getSummary.pl -i 1 -t 5 overlapStore -d asm.ovlStore | perl /nfshomes/dpuiu/Archives/JCVI/bin/countOverlaps53.pl | getSummary.pl -i 2 -t 3
- Build faster ...
Example (Bee): e 'foreach ("0000000001".."0000000058") { print "overlapStore -c $_.ovlStore -g asm.gkpStore/ -i 0 -M 8192 1-overlapper/$_/*gz\n" }' | scheduler.pl # 13min/job e 'foreach ("0000000002".."0000000058") { print "overlapStore -m 0000000001.ovlStore $_.ovlStore\n" }' # 3 min/job total=2*13+57*3=3.5 hrs (instead of 13.85 reported in runCA.log)
Frgcorr
- Fails on deleted reads with overlaps ; must delete the overlaps as well
Unitiger
- Unitig stats:
cat 4-unitigger/*.cga.0 | perl ~/bin/utgLog2utgLen.pl | getSummary.pl -i 0 -j 1
- Library insert estimates:
cat 4-unitigger/unitigger.err | grepi -A 9 LIB | pretty # in this case 9 is the number of libs
should compare with the original values gatekeeper -dumplibraries -tabular *.gkpStore | more
- Bog: don't let small inserts break unitigs:
edit AS_BOG_MateChecker.cc void MateLocation::buildHappinessGraphs(Unitig *utg, DistanceCompute *globalStats) { for (uint32 mleidx=0; mleidx<_table.size(); mleidx++) { MateLocationEntry &loc = _table[mleidx]; uint32 lib = _fi->libraryIID(loc.mleFrgID1); //Aelksey Zimin also do not use short Illumina mate pairs // if (globalStats[lib].samples < 10 ) if (globalStats[lib].samples < 10 || globalStats[lib].stddev < 100 || globalStats[lib].mean < 500)
Tigstore
- list unitigs
tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist # ids maID isPresent isDeleted ptID svID fileOffset covStat 0 1 0 1 2 0 3334 1 1 0 1 2 521908 -1 2 1 0 1 2 522764 543 ...
tigStore -g asm.gkpStore -t asm.tigStore 2 -D properties # exit: stdout # properties unitig_coverage_stat 0 660 unitig_microhet_prob 0 0.000000 unitig_status 0 X unitig_unique_rept 0 X ...
- number of tigs
tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist | tail -1 | awk '{print $1}'
- list frags/consensus/layout in a unitig
tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d frags tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d consensus tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d layout tigStore -g asm.gkpStore -t asm.tigStore 2 -u 0 -d properties # exit: stderr & stdout
- dump all unitig consensus
tigStore -g asm.gkpStore -t asm.tigStore 2 -U -d consensus > utg.fasta
- get single read unitigs
tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | grep ' 1$'
- get read/utg or ctg stats
tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^unitig_num_frags | getSummary.pl -z 1 -i 2 -t unitig_num_frags tigStore -g asm.gkpStore -t asm.tigStore 43 -D properties | grep ^contig_num_frags | getSummary.pl -z 1 -i 2 -t contig_num_frags
- delete a unitig : get & update layout
tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout | grep ^unitig | awk '{print $2}' | ~/bin/deleteUnitigIdLayout.pl > utg.layout.delete # or tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout | ~/bin/deleteUnitigLayout.pl > utg.layout.delete tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.delete tigStore -g asm.gkpStore -t asm.tigStore 2 -D unitiglist | p 'print if($F[2]);'
- delete single frg unitigs
tigStore -g asm.gkpStore -t asm.tigStore 2 -U -d layout | egrep '^unitig|^data.num_frags' | p 'chomp; print $_," "; print "\n" if(/^data/)' | p 'print $F[1],"\n" if($F[3]==1);' | ~/bin/deleteUnitigIdLayout.pl > utg.layout.delete tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.delete
- assign unitig -1 to all unitigs (for addition to a new store)
tigStore -g asm.gkpStore -t asm.tigStore 2 -u id -d layout | tee utg.layout | p 'if (/^unitig/) { print "unitig -1\n"} elsif(/^UTG/) { $F[7]=-1; print join " ",@F; print "\n" } else { print $_};' > utg.layout.add tigStore -g asm.gkpStore -t new.tigStore 2 -R utg.layout.add
tigStore -g asm.gkpStore -t asm.tigStore 1 -u id -d layout | tee utg.layout | p 'if (/^unitig/) { print "unitig -1\n"} else { print $_};' > utg.layout.add tigStore -g asm.gkpStore -t new.tigStore 1 -R utg.layout.add
- fix unitig consensus
touch utg.layout grep FAILED 5-consensus/asm*err | cut -f3 -d ' ' | sort -u -n | p 'system "tigStore -g asm.gkpStore -t asm.tigStore 2 -u $F[0] -d layout >> utg.layout\n";' cat utg.layout | updateUTGlayout.pl > utg.layout.update tigStore -g asm.gkpStore -t asm.tigStore 2 -R utg.layout.update grep FAILED 5-consensus/asm*err | sed 's/\.err:/ /' | cut -f1 | sort -u | p 'system "touch $F[0].success\n";' touch 5-consensus/consensus.success
- Filter 2+ fragment unitigs
CGW
- Long run times usually mean problems (conflicts) with the data
- Drbug
gdb /fs/szdevel/dpuiu/SourceForge/wgs/Linux-amd64/bin//cgw $ run -j 1 -k 5 -r 5 -s 2 -z -P 2 -B 75000 -m 100 -g ./asm.gkpStore -t ./asm.tigStore -o asm bt
- Differences between the LAST run of CGW and PREVIOUS ones
# options PREV LAST -R 7 # restart from checkpoint file number 'ckp' -N ckp01-ABS # restart from checkpoint location 'ckp' (see the timing file) -s 0 -s 2 # stone throwing level -S 0 # S==0 : do not resolve surrogates -G # Don't generate output (cgw or cam)
#files difference.pl 7-4-CGW.ls 7-0-CGW.ls asm.asm.cam asm.dregs.cam asm.partitionInfo # "equivalent of" 4-unitigger/asm.partitioningInfo asm.partitioning # "equivalent of" 4-unitigger/asm.partitioning
- running only one instance of ECR
rm -fr 7-[234]* /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA -d . -p asm -s ./runCA.spec doExtendClearRanges=1 *.frg >> & runCA.log
- canceling ECR run (if doExtendClearRanges set to 1 or 2 and the run takes too long)
grep ckp ./7-0-CGW/cgw.out ./7-0-CGW/asm.ckp.3 (logical ckp01-ABS) after building scaffolds at ... ./7-0-CGW/asm.ckp.4 (logical ckp03-ACD) after scaffold cleaning at ... ./7-0-CGW/asm.ckp.5 (logical ckp05-1SM) after 1st scaffold merge at ... ./7-0-CGW/asm.ckp.6 (logical ckp06-AS) after stone throwing at ... ./7-0-CGW/asm.ckp.7 (logical ckp08-2SM) after 2nd scaffold merge at ... ./7-0-CGW/asm.ckp.8 (logical ckp09-FR) after final rocks at ... cgw -R 7 -N ckp09-FR -j 1 -k 5 -r 5 -s 2 -z -P 2 -B 75000 -m 100 -g ./asm.gkpStore -t ./asm.tigStore -o ./7-0-CGW/asm grep ckp ./7-0-CGW/cgw.out ... ./7-0-CGW/asm.ckp.9 (logical ckp10-PS) after partial stones at ... ./7-0-CGW/asm.ckp.10 (logical ckp11-FCS) after final contained stones at ... ./7-0-CGW/asm.ckp.11 (logical ckp13-RS) after resolve surrogates at ... ./7-0-CGW/asm.ckp.12 (logical ckp14-FIN) after output at ...
- running ECR (if initially doExtendClearRanges=0) ; make sure cgwPurgeCheckpoints=0
runCA -d . -p asm -s runCA.spec stopAfter=scaffolder cgwPurgeCheckpoints=0 doExtendClearRanges=0 *.frg >> & runCA.log rm 7-CGW cd 7-0-CGW mkdir OLD mv asm.ckp.9 asm.ckp.1? OLD asm.*.cam asm.partition* asm.timing OLD cd .. runCA -d . -p asm -s runCA.spec stopAfter=scaffolder cgwPurgeCheckpoints=0 doExtendClearRanges=1 *.frg >> & runCA.log
- DistUpadte file
7-0-CGW/stat/scaffold_final.distupdate.dst
- Stats
cat 7-0-CGW/rezlog/stone.i02.log | perl ~/bin/cgwLog2scfLen.pl | getSummary.pl -t scfLen
ECR
- doExtendClearRanges = 2 => cgw run 3 times, ECR 2 times
Consensus
- Dump unitig seq
tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D unitigs > unitigs.txt cat unitigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -u $F[0]\n";' >& unitigs.fasta.tmp cat unitigs.fasta.tmp | grep -v null | sed 's/cns=//' | sed 's/-//g' | nl0 | tab2fasta.pl > unitigs.fasta rm unitigs.fasta.tmp
- Dump contig seq
tigStore -g asm.gkpStore -t asm.tigStore/ 20 -D contigs > contigs.txt cat contigs.txt | grep -v ^maID | perl -ane 'system "tigStore -g asm.gkpStore -t asm.tigStore 20 -d consensus -c $F[0]\n";'>& contigs.fasta.tmp cat contigs.fasta.tmp | grep -v null | sed 's/cns=//' | sed 's/-//g' | nl0 | tab2fasta.pl > contigs.fasta rm contigs.fasta.tmp
- Recompute unitig consensus
utgcns -g ./asm.gkpStore -t ./asm.tigStore 2 011 -u 122533 -v -V
Terminator
Log file & runtime (wgs-5.2)
- Filter log
~/bin/getCAruntimes.pl -filter runCA.log > runCA.filter.log
- Display log
cat runCA.log | ~/bin/getCAruntimes.pl -hour -total -plot # => runCA.runTimes ; runCA.runTimes.gp ; runCA.runTimes.png display runCA.runTimes.png
- runCA.log runCA.times bos taurus BCM.CLONEEND 16,875 Sanger reads
- runCA.log runCA.times wolbachia symbioint of culex pipiens quinquefasciatus 36,767 Sanger reads
- runCA.runTimes.png Bumblebee assembly
Output processing
Posmap
- Get posMap files
buildPosMap -o prefix < prefix.asm
- Get number of reads/ctg or scf (from the pos files)
cat prefix.posmap.frgctg | count.pl -i 1 > prefix.posmap.ctgfrgcnt cat prefix.posmap.frgscf | count.pl -i 1 > prefix.posmap.scffrgcnt cat prefix.posmap.ctgscf | count.pl -i 1 > prefix.posmap.scfctgcnt
- Compute read/mate/ctg cvg (from the pos files)
posmap2cvg.pl < prefix.posmap.frgscf > prefix.posmap.scffrgcvg OLD/posmap2cvg.4.pl -mates prefix.posmap.mates < prefix.posmap.frgscf > prefix.posmap.scfmatecvg posmap2cvg.pl < prefix.posmap.ctgscf > prefix.posmap.scfctgcvg
- Lib assembly stats *
cd 9-* gatekeeper -dumplibraries -tabular ../genome.gkpStore | pretty > genome.gkpStore.libinfo1 gatekeeper -dumpinfo ../genome.gkpStore | grep -A 100 ^Lib | pretty > genome.gkpStore.libinfo2 gdt genome.gkpStore | p 'print "$F[0] $F[5]\n" unless($F[6]);' > genome.gkpStore.frg2lib paste genome.gkpStore.frg2lib genome.posmap.frags | p ' print "$F[1] $F[5]\n";' | count.pl > genome.gkpStore.frgcount gdt genome.gkpStore | p 'print "$F[4] del\n" if($F[6] eq "1");' | count.pl >> genome.gkpStore.frgcount cat genome.gkpStore.frgcount | count2col.pl -i 0 -j 1 -k 2 | awk '{print $1,$4,$2,$3}' | pretty > genome.gkpStore.libcount more genome.gkpStore.lib*
- Posmap2gff
#getting frglib info gdt ../genome.gkpStore | p 'print "$F[0] $F[5]\n" unless($F[6]);' > genome.gkpStore.frg2lib
# one of ... cat genome.posmap.frgscf | ~/bin/posmap2gff.pl -sou frg -fea frg -p > genome.posmap.frgscf.gff cat genome.posmap.frgscf | ~/bin/posmap2gff.pl -sf genome.gkpStore.frg2lib -fea frg -p | sed 's/[ab]//' > genome.posmap.frgscf.gff cat genome.posmap.frgscf | ~/bin/posmap2gff.pl -sf genome.gkpStore.frg2lib -fea frg -p >! genome.posmap.frgscf.gff cat genome.posmap.utgscf | ~/bin/posmap2gff.pl -sou utg -p >! genome.posmap.utgscf.gff cat genome.posmap.ctgscf | ~/bin/posmap2gff.pl -sou ctg -p >! genome.posmap.ctgscf.gff
cat genome.posmap.scflen | perl -ane 'print "$F[0]\t$F[0]\t0\t$F[1]\tf\n";' | posmap2gff.pl -sou scf >> genome.posmap.scflen.gff
setenv SEQID 1374988 cat genome*gff | p 'print $_ if($F[8]==$ENV{SEQID});' >! $SEQID.gff
~/szdevel/Apollo/apollo -i gff -f $SEQID.gff
- Get 0cvg regions
posmap2cvg.pl -f prefix.posmap.scflen -Max 0 < prefix.posmap.frgscf # scaff gaps posmap2cvg.pl -f prefix.posmap.ctglen -Max 0 < prefix.posmap.frgctg # surrogates
- Get links
~/bin/scf2scflnk.sh prefix ~/bin/ctg2ctglnk.sh prefix ...
- Add ctg/scf into to the mates fils
join12.pl prefix.posmap.mates prefix.posmap.frgctg > prefix.posmap.mates2.tmp join12.pl prefix.posmap.mates2.tmp prefix.posmap.frgscf > prefix.posmap.mates2 rm prefix.posmap.mates2.tmp
- Display graphs
cat prefix.posmap.scf2scflnk | ~/bin/tab2dot.pl | dot -Tpng -o prefix.posmap.scf2scflnk.png cat prefix.posmap.ctg2ctglnk | ~/bin/tab2dot.pl | dot -Tpng -o prefix.posmap.ctg2ctglnk.png
- Qc stats based on posmap files
posmap2qc.pl -ctgscf prefix.posmap.ctgscf > prefix.qc
- Break scf/ctg pipeline
~/bin/breakPosmapKeep.amos
- Get scaff file
~/bin/posmap2scaff.pl prefix.posmap.ctgscf
- Gaps longer than a certain size
cat asm.posmap.ctgscf | p 'print $p,$_,"\n" if($F[1] eq $P[1] and $F[2]-$P[3]>1000); $p=$_;@P=@F;'
Mate redundancy
- Sequence based (first 32bp in fwd/rev Illumina reads; max 1 mismatch)
paste s_2_?_8kb*.txt | perl -ane 'chomp; if($.%4==1) { s/\@(\S+)[12]//; printf("%-45s",$1); } elsif($.%4==2) { print substr($F[0],0,32),"\t",substr($F[1],0,32),"\n" } ' | sort -k2,3 | ./getDuplicates.pl > s_2_8k.tab cat s_2_8k.tab | perl -ane 'print $F[0],"1\n";' > s_2_1_8k.redundant cat s_2_8k.tab | perl -ane 'print $F[0],"2\n";' > s_2_2_8k.redundant
- overlap(Store) based
#Example getMateRedundancy.amos -D BEGIN=1 -D END=219 -D GKPSTORE=../asm.gkpStore -D OVLSTORE=../asm.ovlStore s_2_3kb #=> asm.redundant_mate_pairs , asm.redundant_mate_clusters
- Alignment based
cat prefix*.soap2 | ~/bin/AMOS/soap2sameRef.pl -pos | sort -nk2 -nk3 | p 'print $_ if(@P and $F[1]-$P[1]<=1 and abs($F[2]-$P[2])<=1); @P=@F;' > prefix.redundant
- Posmap file based
cat asm.posmap.frgscf | ~/bin/posmap2ovl.pl -p 75 | grep -v ^ref | ~/bin/getMateRedundancy12.pl > asm.redundant_mate_pairs cat asm.redundant_mate_pairs | perl ~/bin/cluster.pl | sort > asm.redundant_mate_clusters
Fasta
- Align deg to scaff
nucmer -mum -l 40 -c 250 scf.fasta deg.fasta -p scf-deg delta-filter -q scf-deg.delta > scf-deg.filter-q.delta ~/bin/delta2pairCvg+.pl < scf-deg.filter-q.delta > scf-deg.filter-q.cvg.pairs ~/bin/max2.pl -i 4 -j 6 < scf-deg.filter-q.cvg.pairs > scf-deg.filter-q.maxCvg.pairs cat scf-deg.filter-q.maxCvg.pairs | p 'print $_ if($F[5]-$F[6]<2000);' > scf-deg.filter-q.variants.pairs awk '{print $5,$6}' scf-deg.filter-q.variants.pairs > scf-deg.filter-q.variants.ids
- Indexing
# contigs ~/bin/formatdb.pl < prefix.ctg.fasta > prefix.ctg.fasta.offset ~/bin/fastacmd.pl -o prefix.ctg.fasta.offset -f filter.ctg.ids < prefix.ctg.fasta > filter.ctg.fasta
# same for scaff, deg ...
- Contaminant search : nucmer against
/nfshomes/dpuiu/db/Ecoli.all /nfshomes/dpuiu/db/UniVec_Core /nfshomes/dpuiu/db/OtherVec
Asm
- Get message counts
cat asm.asm | grep "^{" | egrep -v 'MPS|UPS|VAR|CTP' | uniq -c | more 1 {MDI 273375 {AFG # reads 36258 {AMP # links 2185 {UTG # MPS+ messages 4876 {ULK 1443 {CCO # VAR*,MPS+,UPS+ messages 3277 {CLK 15 {SCF 40 {SLK # CTP+ messages
- Get output seq
asmOutputFasta -p prefix < prefix.asm
- Get output seq clr
~/bin/asm2clrs.pl < prefix.asm > < prefix.asm.clr
- Get new library insert sizes:
asm2mdi.pl < prefix.asm > prefix.mdi
- Get surrogate ids
#most of them have negative aStat's $ cat *.asm | grep -B 8 ^sta:S | grep ^acc | perl -ane '/(\d+)/; print "utg".$1,"\n";'
Qc
caqc.pl -euid prefix.asm
Scaffolds
- Get scaff files
cavalidate -e 10 prefix bank2scaff prefix.bnk/ > prefix.scaff
~/bin/ca2scaff -i prefix.asm -o prefix.scaff
cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$2}' > prefix.scfctg cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$3}' > prefix.scflen cat prefix.scaff | grep "^>" | sed 's/>//' | awk '{print $1,$4}' > prefix.scfspan
tac prefix.scaff | p '$count=0 if(/^>/); $count++; print $_ if($count>2);' | awk '{print $4}' # sequence gaps
AGP
asmToAGP.pl < prefix.asm > prefix.agp
/--- buildPosMap ------> .posmap / .asm /---- asmToAGP ------------------------------------------------> .AGP \ / \---- cavalidate, bank2scaff?--> .scaff --- scaff2agp.pl ---
- Other scripts:
bank2scaff : incorrect gap sizes
valiadteAgp.pl < prefix.agp
infoseq 9-terminator/prefix.ctg.fasta | sed 's/ctg//' >! prefix.ctg.infoseq scaff2agp.pl -i prefix.ctg.infoseq < prefix.scaff > prefix.agp
Moving an assembly
- Minimum file set
scp runCA.sh scp runCA.spec scp asm.gkpStore/* scp 3-overlapcorrection/asm.erates.updated scp 4-unitigger/asm.iidmap scp 4-unitigger/asm.partitioning scp 4-unitigger/asm.partitioningInfo scp 4-unitigger/unitigger.success scp 5-consensus/consensus.success scp 7-0-CGW/asm.timing scp 7-0-CGW/asm.ckp.11 # last one scp 7-0-CGW/asm.partitioning scp 7-0-CGW/asm.partitionInfo scp 7-0-CGW/cgw.out.00.ckp.11 scp 7-0-CGW/cgw.out scp asm.tigStore/*
Add reads to an assembly
ln -s ../wgs-prev/prefix.frg . ln -s ../Data/new.frg
cp ../wgs-prev/runCA.spec ../wgs-prev/runCA.sh .
ln -s ../wgs-prev/0-mercounts
cp -R ../wgs-prev/prefix.gkpStore gatekeeper -a -o prefix.gkpStore new.frg ll -d ../wgs-prev/1-overlapper/000000000* # if multiple jobs ... mkdir -p 1-overlapper/0000000001 # mkdir -p 1-overlapper/0000000002 ... ??? cat ../wgs-prev/1-overlapper/*pl | grep '^"h' | tail -1 # get last qry/ref block => "h0056000001r0050000000" cd 1-overlapper/0000000001 ln -s ../../wgs-prev/1-overlapper/0000000001/*ovb.gz . rm h0056000001*.ovb.gz *r0050000000.ovb.gz # this should change from case to case cd ../..
Removing bad mates
cat ../wgs-prev/9-terminator/asm.posmap.frags | grep bad > asm.posmap.frags.bad gatekeeper -dumpfragments -tabular -uid asm.posmap.frags.bad ./wgs-prev/asm.gkpStore | grep -v ^UID | ~/bin/deleteLkg.pl -j 2 > asm.posmap.frags.bad.delete.lkg
ln -s ../wgs-prev/*.frg gatekeeper -o asm.gkpStore *.frg asm.posmap.frags.bad.delete.lkg
ln -s ../wgs-prev/0-mercounts/ ln -s ../wgs-prev/asm.ovlStore/ ln -s ../wgs-prev/3-overlapcorrection/ ln -s ../wgs-prev/4-unitigger/ cp -R ../wgs-prev/asm.tigStore/ . ln -s ../wgs-prev/5-consensus runCA ...
Test datasets
Porphyromonas_gingivalis_W83 : Sanger + 454
wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/porphyromonas_gingivalis_w83/anc.porphyromonas_gingivalis_w83.001.gz ... wget ftp://ftp.ncbi.nih.gov/pub/TraceDB/porphyromonas_gingivalis_w83/xml.porphyromonas_gingivalis_w83.001.gz
- SRA : http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Link&LinkName=genomeprj_sra&from_uid=29961
- CBCB
/fs/sztmpscratch/dpuiu/porphyromonas_gingivalis_w83/
- Complete genome at NCBI : NC_002950.2 ( 2,343,476bp 48.29%GC)
bp_fetch.pl net::genbank:NC_002950.2 > porphyromonas_gingivalis_w83.1con
Input formatting
Read counts
zcat anc.*.gz | ~/bin/traceanc2anc.pl SEQ_LIB_ID INSERT_SIZE INSERT_STDEV SVECTOR_CODE | count.pl
TA formatting
- frg FORMAT2
- TIs are frg uids
- All libs
tracedb-to-frg.pl -xml xml.porphyromonas_gingivalis_w83.001.gz tracedb-to-frg.pl -lib xml.porphyromonas_gingivalis_w83.001.gz tracedb-to-frg.pl -frg xml.porphyromonas_gingivalis_w83.001.gz => porphyromonas_gingivalis_w83.1.lib.frg porphyromonas_gingivalis_w83.2.001.frg.bz2 porphyromonas_gingivalis_w83.3.lkg.frg
- One lib at a time
tracedb-to-frg.pl -only T13146 -xml xml.porphyromonas_gingivalis_w83.001.gz >& ! tracedb-to-frg.log tracedb-to-frg.pl -only T13146 -lib xml.porphyromonas_gingivalis_w83.001.gz >>& tracedb-to-frg.log tracedb-to-frg.pl -only T13146 -frg xml.porphyromonas_gingivalis_w83.001.gz >>& tracedb-to-frg.log
454 SRA formatting
Illumina formatting
- use fastqToCA (wgs-6.0)
fastqToCA \
-insertsize 3000 300 \
-libraryname 1 \
-type illumina \
-fastq /fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt | \
sed 's/ori:I/ori:O/' \
> 1.frg
=>
{LIB
act:A
acc:1
ori:O
mea:3000.000
std:300.000
src:
.
nft:10
fea:
forceBOGunitigger=1
isNotRandom=0
doNotTrustHomopolymerRuns=0
doRemoveDuplicateReads=0
doNotQVTrim=0
goodBadQVThreshold=0
doNotOverlapTrim=0
usePackedFragments=1
illuminaFastQType=illumina
illuminaSequence=/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_1_sequence.txt,/fs/szattic-asmg4/Bees/Bombus_impatiens/s_1_2_sequence.txt
.
}
gatekeeper \ -o asm.gkpStore.BUILDING \ -T -F \ -E asm.gkpStore.errorLog \ 1.frg => Segmentation fault
FASTQ
- This won't remove the linker:
zcat SRR001351.fastq.gz | ~/bin/fastq2seq.pl > SRR001351.seq zcat SRR001351.fastq.gz | ~/bin/fastq2qual.pl > SRR001351.qual convert-fasta-to-v2.pl -454 -l SRR001351 -s SRR001351.seq -q SRR001351.qual
- Cleaning seq
~/bin/OLD/cleanFastaq.pl \ s_2_1_1.1kb_sequence.txt s_2_2_1.1kb_sequence.txt \ # input s_2_1_1.1kb_sequence.clean.txt s_2_2_1.1kb_sequence.clean.txt \ # output mates s_2_1_1.1kb_sequence.clean.single.txt s_2_2_1.1kb_sequence.clean.single.txt # output single
echo prefix_1.fastq prefix_2.fastq > prefix.ls quake.py -f prefix.ls
- Illumina note:
We recommend removing reads that do not pass the GA analysis software Failed_Chastity fi lter before attempting to assemble the sequence. The chastity of a base call is the ratio of the intensity of the greatest signal divided by the sum of the two greatest signals. Reads do not pass the quality fi lter if there are two or more base calls with chastity of less than 0.6 in the fi rst 25 cycles. These reads have an “N” in the last column of the GA analysis software export file. Removed all reads that contained ambiguous characters (Ns). Removed reads that did not contain at least 25 Q30 bases among the first 35 cycles (s35).
SFF
sffToCA -libraryname SRR001351 -output SRR001351.frg SRR001351.sff
Running
- Command
~/wgs/Linux-amd64/bin/runCA -p pging -d testassembly porphyromonas_gingivalis_w83.* >& runCA.log
CAUG notes
Illimina mates
- short: innies PE (paired ends)
- long: outies MP (mate pairs) "lots of issues" (eg> : linker is one bp; solution : take 36bp from each end of the read to amke sure you don't get the linker
- wgs 6.1 : Illumina limit of 104 bp; can recompile the code o make it 232
fastqToCA -insertsize 280 30 -libraryname yersinia_pestis.100bp.0280bp.005x -type illumina -innie -fastq $PWD/yersinia_pestis.100bp.0280bp.005x.1.fastq,$PWD/yersinia_pestis.100bp.0280bp.005x.2.fastq > illumina.frg {VER ver:2 } {LIB act:A acc:yersinia_pestis.100bp.0280bp.005x ori:I mea:280.000 std:30.000 src: . nft:12 fea: forceBOGunitigger=1 shortOverlapModel=1 isNotRandom=0 doNotTrustHomopolymerRuns=0 doRemoveDuplicateReads=0 doNotQVTrim=0 goodBadQVThreshold=0 doNotOverlapTrim=0 usePackedFragments=1 illuminaFastQType=illumina illuminaOrientation=innie illuminaSequence=/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.1.fastq,/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.2.fastq . } {VER ver:1 } ##################### gatekeeper -o illumina.gkpStore illumina.frg Starting file 'illumina.frg' at line 0. Type set to ILLUMINA 1.3+. Orientation set to INNIE. Processing illumina reads from '/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.1.fastq' and '/home/training/CAUG/USERS/Daniela_Puiu/yersinia_pestis.100bp.0280bp.005x.2.fastq' GKP finished with no errors. ##################### gatekeeper -dumpinfo illumina.gkpStore/ LOAD STATS 1 libInput 1 libLoaded 0 libErrors 0 libWarnings 0 frgInput 228742 frgLoaded 0 frgErrors 0 frgWarnings 0 lkgInput 0 lkgLoaded 0 lkgErrors 0 lkgWarnings 0 sffInput 0 sffLoaded 0 sffErrors 0 sffWarnings 0 sffLibCreated 0 plcInput 0 plcLoaded 0 plcErrors 0 plcWarnings 228742 numRandom 228742 numPacked 0 numNormal 0 numStrobe GLOBAL STATS 228742 FRG 1 LIB LibraryName numActiveFRG numDeletedFRG numMatedFRG readLength clearLength GLOBAL 228742 0 218724 22874200 22191085 LegacyUnmatedReads 0 0 0 0 0 yersinia_pestis.100bp.0280bp.005x 228742 0 218724 22874200 22191085 ################ sffToCA -insertsize 3200 900 -libraryname porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff -trim chop -linker flx -output 454 porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff loadSFF()-- Loading 'porphyromonas_gingivalis_w83.3200bp.0900bp.E8YURXS01.1x.sff'. removeDuplicateReads()-- from 1 to 19622 removeDuplicateReads()-- finished detectMates()-- from 1 to 19621 ################# Overlap: check the hash load 6% is too low ~75% would be ideal