Data
- 7 pairs of data files (paired-end reads) : lanes 1..3 and 5..8 (lane 4 was not used)
Lane  Insert              ReadLen  #Reads
1     3K (2..6, avg 4K)   124      34,944,099
2     8K (7..9, avg 8K)   124      32,540,640
3     500 (450..600)      124      34,745,750   # gDNA
5     500                          34,601,239
6     500                          34,553,857
7     500                          34,682,612
8     500                          12,975,839
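A quick tally of the table above, as a sketch. It assumes the #Reads column is directly summable across lanes (the notes do not say whether it counts reads or pairs); the dictionary and variable names are made up for illustration.

```python
# Read counts per lane, copied from the table above.
reads_per_lane = {
    1: 34_944_099,
    2: 32_540_640,
    3: 34_745_750,
    5: 34_601_239,
    6: 34_553_857,
    7: 34_682_612,
    8: 12_975_839,
}

# Lanes 1 and 2 are the long-insert libraries; the rest are the 500 bp libraries.
long_insert_lanes = {1, 2}

total_reads = sum(reads_per_lane.values())
long_insert_reads = sum(n for lane, n in reads_per_lane.items() if lane in long_insert_lanes)
short_insert_reads = total_reads - long_insert_reads

if __name__ == "__main__":
    print(f"total        : {total_reads:,}")
    print(f"long insert  : {long_insert_reads:,}")
    print(f"short insert : {short_insert_reads:,}")
```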
1. Erroneous reads/bases, which we need to correct or discard
2. GC bias, which we need to correct for so that a-stats can be computed properly
3. Redundancy in the long paired ends (lanes 1 and 2)
- Used the 454 protocol to circularize the DNA for sequencing with the Illumina instrument.
- Some reads will begin in the circularization adaptor, so the pair will have only one usable read
- Some reads have a few bases of DNA sequence and hit the circularization adaptor right away
- Most reads will have at least 36bp from each end before hitting the adaptor.
- Many reads will not have any adaptor to trim (>125bp of DNA sequence at both ends of the adaptor)
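Circularization adaptor:
TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG   (forward)
AGCATATTGAAGCATATTACATACGATATGCTTCAATAATGC   (complement strand, shown aligned under the forward strand)
CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA   (reverse complement)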
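The read cases listed above (starts in the adaptor, hits it after a few bases, has a usable stretch before it, or has no adaptor at all) could be sorted out roughly as follows. This is only a sketch: it uses exact string matching, so it misses adaptors with sequencing errors and partial adaptors at the end of a read. The function names, the status labels, and the use of 36 bp as the minimum usable length are assumptions for illustration.

```python
# Adaptor sequence as listed above (forward strand).
ADAPTOR = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


ADAPTOR_RC = revcomp(ADAPTOR)


def trim_at_adaptor(read, min_len=36):
    """Cut a read at the first exact adaptor hit, in either orientation.

    Returns (trimmed_sequence, status). min_len is an assumed cutoff for
    how many bases must precede the adaptor for the read to be useful.
    """
    hits = [pos for pos in (read.find(ADAPTOR), read.find(ADAPTOR_RC)) if pos >= 0]
    if not hits:
        return read, "no_adaptor"
    cut = min(hits)
    trimmed = read[:cut]
    if cut == 0:
        return trimmed, "starts_in_adaptor"
    if cut < min_len:
        return trimmed, "too_short"
    return trimmed, "trimmed"


if __name__ == "__main__":
    example = "ACGT" * 10 + ADAPTOR_RC + "GGGG"
    print(trim_at_adaptor(example))
```

A real run would more likely use an aligner or a dedicated trimming tool that tolerates mismatches; exact matching is just the simplest way to see the four cases.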
- Formatting: keep only the first 100 bp of each read (the last 24 bp are low quality anyway)
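A minimal sketch of the 100 bp clipping step, assuming the reads are in plain 4-line FASTQ records (the notes do not state the file format). The function name `truncate_fastq` is hypothetical.

```python
def truncate_fastq(lines, keep=100):
    """Yield FASTQ lines with sequence and quality clipped to `keep` characters.

    Assumes strict 4-line records: header, sequence, '+', quality.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 in (1, 3):  # sequence line and quality line
            line = line[:keep]
        yield line


if __name__ == "__main__":
    record = ["@read1\n", "A" * 124 + "\n", "+\n", "I" * 124 + "\n"]
    for out_line in truncate_fastq(record):
        print(out_line)
```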
/fs/szattic-asmg4/Bees/Bombus_impatiens
Assembly
No OBT (overlap-based trimming)
Adaptors still present in the seqs
meryl -Dh -s 0-mercounts/asm-C-ms22-cm1 >! 22mers.hist
Found 3136399464 mers.
Found 379123530 distinct mers.
Found 201257394 unique mers.
Largest mercount is 12006651; 90 mers are too big for histogram.
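The totals reported by meryl above give a few quick summary ratios. The sketch below assumes "unique mers" means 22-mers seen exactly once; variable names are made up for illustration.

```python
# Totals copied from the meryl output above.
total_mers = 3_136_399_464
distinct_mers = 379_123_530
unique_mers = 201_257_394  # assumed: distinct mers with count == 1

# Share of distinct 22-mers that occur only once (often dominated by errors).
unique_fraction = unique_mers / distinct_mers

# Average number of occurrences per distinct 22-mer.
mean_multiplicity = total_mers / distinct_mers

# Same average, restricted to 22-mers seen at least twice.
repeated_distinct = distinct_mers - unique_mers
mean_multiplicity_repeated = (total_mers - unique_mers) / repeated_distinct

if __name__ == "__main__":
    print(f"unique / distinct          : {unique_fraction:.3f}")
    print(f"mean multiplicity          : {mean_multiplicity:.2f}")
    print(f"mean multiplicity (count>1): {mean_multiplicity_repeated:.2f}")
```

Roughly half of the distinct 22-mers occur only once, and the average count per distinct 22-mer is a little over 8.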
countKmers.pl
most frequent 42-mer : CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA, in ~20% of the seqs : circularization adaptor (reverse complement)
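The kind of count that points at the adaptor can be sketched as below. This is not countKmers.pl, whose contents are not shown here; it is a toy stand-in that holds every k-mer in memory, so it only suits a small sample of reads. Function names are hypothetical.

```python
from collections import Counter


def most_common_kmer(seqs, k=42):
    """Return (kmer, count) for the most frequent k-mer, or None if there is none."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts.most_common(1)[0] if counts else None


def fraction_containing(seqs, kmer):
    """Fraction of sequences that contain `kmer` at least once."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    return sum(kmer in seq for seq in seqs) / len(seqs)


if __name__ == "__main__":
    adaptor_rc = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"
    sample = ["AAAA" + adaptor_rc, adaptor_rc + "CC", "ACGT" * 11]
    print(most_common_kmer(sample))
    print(fraction_containing(sample, adaptor_rc))
```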
- Unitigger : max utg len=852bp
- Consensus after unitigger : 3 out of 129 jobs failed
/fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA