Revision as of 15:20, 4 March 2010

Data

~ 500B genome

7 pairs of data files (paired ends): lanes 1..3 and 5..8 (lane 4 was not used)

 Lane   Insert            ReadLen    #Reads
 1      3Kbp(2..6,avg 4)  124        34,944,099
 2      8Kbp(7..9,avg 8)  124        32,540,640
 3      gDNA              ~500       34,745,750
 5      gDNA                         34,601,239
 6      gDNA                         34,553,857
 7      gDNA                         34,682,612
 8      gDNA                         12,975,839

Tasks to figure out:

1. Erroneous reads/bases, which we need to correct or discard.
2. GC bias, so we can compute a-stats properly.
3. Redundancy in the long paired-end libraries (lanes 1 and 2).

The 454 protocol was used to circularize the DNA for sequencing on the Illumina instrument. Consequences for the reads:
- Some reads begin in the circularization adaptor, so only one read of the pair is usable.
- Some reads have only a few bases of DNA sequence before they hit the circularization adaptor.
- Most reads have at least 36 bp from each end before hitting the adaptor.
- Many reads have no adaptor to trim (>125 bp of DNA sequence at both ends of the adaptor).
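
The read categories above can be sketched as a small classifier. This is only an illustration, not the trimming actually used for this data set: it looks for an exact match to the first 12 bases of either adaptor orientation (the `seed` length is a hypothetical choice), so it misses reads that start in the middle of the adaptor, reads with sequencing errors in the adaptor, and partial adaptors at the read end. Using the 36 bp figure as a length cutoff is also an assumption.

```python
# Adaptor sequences copied from this page (forward and reverse complement).
ADAPTOR_FWD = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"
ADAPTOR_REV = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"

MIN_USABLE = 36  # assumption: reuse the 36 bp figure from the notes as a cutoff


def trim_at_adaptor(read, adaptors=(ADAPTOR_FWD, ADAPTOR_REV), seed=12):
    """Return the part of `read` that precedes the earliest adaptor hit.

    Only an exact match to the first `seed` bases of an adaptor is
    searched for (hypothetical simplification; no mismatches allowed).
    """
    cut = len(read)
    for adaptor in adaptors:
        pos = read.find(adaptor[:seed])
        if pos != -1:
            cut = min(cut, pos)
    return read[:cut]


def classify(read):
    """Label a read by how much DNA sequence precedes the adaptor."""
    kept = trim_at_adaptor(read)
    if len(kept) == len(read):
        return "no_adaptor"         # nothing to trim
    if len(kept) == 0:
        return "starts_in_adaptor"  # no usable sequence in this read
    if len(kept) < MIN_USABLE:
        return "too_short"          # a few bases, then adaptor
    return "trimmed_ok"             # enough DNA before the adaptor
```

A real run would more likely use an alignment-based trimmer that tolerates mismatches; the sketch only makes the four cases concrete.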

Circularization adaptors (the second line is the complementary strand of the first):

 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG - revcomp -> CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
 AGCATATTGAAGCATATTACATACGATATGCTTCAATAATGC
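
The relationship between the adaptor strings can be checked with a few lines of Python. This is a minimal sketch that handles only the unambiguous bases A, C, G and T; the function names are illustrative.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Complement each base, keeping the original order."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq):
    """Complement each base and reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


adaptor = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"
print(reverse_complement(adaptor))  # sequence after "- revcomp ->" on this page
print(complement(adaptor))          # second adaptor line on this page
```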

Formatting: keep only the first 100 bp of each read (the last 24 bp are low quality anyway)
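
A minimal sketch of this formatting step, assuming the reads are stored as strict 4-line FASTQ records (the file format is not stated on this page, and `trim_fastq` is a hypothetical helper, not the script that was actually used):

```python
def trim_fastq(lines, keep=100):
    """Yield FASTQ lines with sequence and quality cut to `keep` bases.

    Assumes strict 4-line records: header, sequence, '+', quality.
    Wrapped (multi-line) sequences are not supported.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 in (1, 3):  # sequence line and quality line
            line = line[:keep]
        yield line


record = ["@read1\n", "A" * 124 + "\n", "+\n", "I" * 124 + "\n"]
for out_line in trim_fastq(record):
    print(out_line)
```

Cutting the sequence and quality lines to the same length keeps each record valid.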

Location:

 /fs/szattic-asmg4/Bees/Bombus_impatiens

Assembly

Meryl

 meryl -Dh -s 0-mercounts/asm-C-ms22-cm1 >! 22mers.hist
 Found 3136399464 mers.
 Found 379123530 distinct mers.
 Found 201257394 unique mers.
 Largest mercount is 12006651; 90 mers are too big for histogram.
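
A few summary ratios follow directly from the meryl totals above. The sketch below is plain arithmetic on those three numbers. Reading the unique mers as mostly sequencing errors is a common interpretation, not something this page states.

```python
# Totals copied from the meryl output above (22-mers).
total_mers = 3_136_399_464
distinct_mers = 379_123_530
unique_mers = 201_257_394

# Average number of times each distinct 22-mer occurs.
mean_multiplicity = total_mers / distinct_mers

# Share of distinct 22-mers that occur exactly once.
unique_fraction = unique_mers / distinct_mers

# Distinct 22-mers seen at least twice.
repeated_distinct = distinct_mers - unique_mers

print(f"mean multiplicity : {mean_multiplicity:.2f}")
print(f"unique fraction   : {unique_fraction:.2%}")
print(f"repeated distinct : {repeated_distinct:,}")
```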

countKmers

 most frequent 22mer :                 AGCATACATTATACGAAGTTAT     ~ 16% of the seqs
 most frequent 42mer : CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA ~ 10%  of the seqs (pPAC7.9124-9165)
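
Both k-mers come from the circularization adaptor: the 42-mer is its reverse complement, and the 22-mer is a substring of that. The sketch below checks this and shows one simple way to count k-mers and measure how many sequences contain a given k-mer. It is not the `countKmers` tool itself, whose implementation is not shown here, and holding every k-mer in an in-memory `Counter` would not scale to the full data set.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every overlapping k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def fraction_containing(reads, kmer):
    """Fraction of reads that contain `kmer` at least once."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(kmer in read for read in reads) / len(reads)


# Sequences copied from this page.
ADAPTOR_REV = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"
TOP_22MER = "AGCATACATTATACGAAGTTAT"
TOP_42MER = "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA"

print(TOP_22MER in ADAPTOR_REV)   # is the 22-mer part of the adaptor?
print(TOP_42MER == ADAPTOR_REV)   # is the 42-mer the whole adaptor?
```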

Location

 /fs/szdevel/dpuiu/SourceForge/wgs-assembler.030210/Linux-amd64/bin/runCA


Bumblebee: Difference between revisions
