Cbcb:Pop-Lab:Ted-Report

Revision as of 06:09, 24 January 2010

Older Entries

2009

January 15, 2010

Minimus Documentation

Presently, the only relevant Google hit for "minimus" on the first page of results is the SourceForge wiki. The only example on that page is incomplete and appears to be an early draft made during development.

Ideally, it should be easy to find a complete guide with the general format:

  • Simple use case:
`toAmos -s path/to/fastaFile.seq -o path/to/fastaFile.afg`
`minimus path/to/fastaFile`
  • Necessary tools for setup (toAmos)
  • Other options
  • etc.

The description found on the Minimus/README page (linked to from the middle of the starting page) is more appropriate, but it features use cases that may no longer be common and references another required tool (toAmos) without linking to it or describing how to obtain it. A description of that tool can be found on the Amos File Conversion Utilities page (again, linked to from the starting page), but it is less organized than what I've come to expect from a project page, and it is easy to get lost or distracted by the rest of the Amos documentation while trying to piece together the necessary steps for a basic assembly.

Comparative Network Analysis pt. 2

  • Meeting with Volker this Friday to discuss how best to apply network alignment to what he's doing
  • I'm simultaneously trying to find a way to apply my network alignment technique to predicting genes in metagenomic samples
    • I've been trying to find a way to get beyond the restriction that my current program requires genes to be annotated with an EC number. A potentially interesting next step may be to use BioPython to BLAST the sequence of each enzyme annotated in every micro-organism in KEGG against a metagenomic library.
      • The results would be stretches of linked reactions that have been annotated in KEGG pathways.
      • This method could be applied to contigs just as easily as finished sequences. In a low-coverage scenario, it could identify genes that are probably present but simply weren't sampled, by showing that the rest of their pathway is there. In short, this could finally accomplish what Mihai asked me to work on when I showed up.
      • The major theoretical shortcoming of this approach is that it could only identify relatively well characterized pathways.
      • The practical shortcomings of this approach begin with obtaining a fairly complete copy of KEGG (which, as we've learned, is a mess to parse locally and unusably slow to call through the API) and continue with the computational challenge of such a large-scale BLAST operation.
    • Ask Bo about this when he gets back. He may have already done this.
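A rough illustration of the idea above (nothing here exists yet): with NCBI BLAST+ installed and the metagenomic library formatted as a BLAST database, the per-enzyme searches could be driven from a script. The file names, database prefix, and E-value cutoff below are all hypothetical placeholders.

```python
# Hedged sketch: searching KEGG enzyme sequences against a metagenomic
# read library via NCBI BLAST+. Paths and the E-value cutoff are
# placeholders, not values from this report.
import subprocess


def build_blast_command(query_fasta, db_prefix, evalue=1e-5, out_path="hits.tsv"):
    """Assemble a blastp invocation (tabular output) without running it."""
    return [
        "blastp",
        "-query", query_fasta,   # KEGG enzyme sequences, one FASTA file
        "-db", db_prefix,        # metagenomic library, formatted with makeblastdb
        "-evalue", str(evalue),  # significance cutoff (placeholder value)
        "-outfmt", "6",          # tab-separated hit table
        "-out", out_path,
    ]


cmd = build_blast_command("kegg_enzymes.fasta", "oral_microbiome")
# subprocess.run(cmd, check=True)  # uncomment on a machine with BLAST+ installed
```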


January 29, 2010

I'm testing minimus and bambus in preparation for the oral microbiome data, and after spamming several lab members with email, it occurred to me that it would be more considerate to put the information here instead. I'll put it in a table later.


Linear memory scaling of the Minimus overlapper:

  • 1 million reads => 1.2GB (3.7% of 32GB)
  • 2 million reads => 2.4GB (7.4% of 32GB)
  • 20 million reads => 21.5GB (~67% of 32GB)


Linear memory scaling of the Minimus contigger:

  • 1 million reads => 3.0GB (9.3% of 32GB)
  • 2 million reads => 6.0GB (18.6% of 32GB)
  • 20 million reads => (yet to be seen but it's probably going to be about 60GB, which will presumably cause it to break)
    • Update: `top` froze while showing that tigger was using 99+% of the 32GB of RAM on privet, and the AMOS log showed a core dump. So the 60GB estimate is probably reasonable. Will try later on walnut, which has 64GB of RAM.
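The linear extrapolations behind both memory estimates can be sanity-checked in a few lines (a sketch, assuming memory scales strictly linearly in the number of reads):

```python
# Fit a slope (GB per million reads) from the 1M and 2M data points
# above and project it out to 20 million reads.
def project_memory(points, target_millions):
    """points: list of (millions_of_reads, gb_used); returns projected GB."""
    slopes = [gb / millions for millions, gb in points]
    slope = sum(slopes) / len(slopes)   # average GB per million reads
    return slope * target_millions


overlapper = project_memory([(1, 1.2), (2, 2.4)], 20)  # -> 24.0 GB (observed: 21.5)
contigger = project_memory([(1, 3.0), (2, 6.0)], 20)   # -> 60.0 GB (hence the crash on 32GB)
```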


Non-Linear run time scaling of the Minimus overlapper:

  • 1 million reads => 3 min (using 100% of a single 2.4GHz processor)
  • 2 million reads => 9 min (using 100% of a single 2.4GHz processor)
  • 20 million reads => 13hrs (using 100% of a single 2.4GHz processor)

I built the following model by fitting a simple polynomial to the run times for 1 million and 2 million reads, then averaging the constants from each fit (1.7 & 1.5, respectively).

(#reads in millions × 1.6) ^ 2 = run time in min
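A sketch of the fitting step (my reconstruction, assuming each per-point constant is √(run time in min) divided by reads in millions, and the model squares the read count times the averaged constant):

```python
import math

# Reconstruct the overlapper model from the two measured points above.
points = [(1, 3), (2, 9)]                    # (millions of reads, minutes)
constants = [math.sqrt(t) / n for n, t in points]  # ~1.73 and 1.5
c = sum(constants) / len(constants)          # ~1.6


def predicted_minutes(millions):
    return (millions * c) ** 2


# predicted_minutes(20) comes out near 1045 min (~17 hrs); the observed
# run took 13 hrs, so the model is rough but in the right ballpark.
```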


Non-Linear run time scaling of the Minimus contigger:

  • 1 million reads => 9 min (using 100% of a single 2.4GHz processor)
  • 2 million reads => 66 min (using 100% of a single 2.4GHz processor)
  • 20 million reads => N/A (crashed on privet due to lack of RAM)

Applying the same technique to these numbers (which works noticeably less well here, since it requires averaging constants of 3 & 4 instead of 1.5 & 1.7) gives the following model.

(#reads in millions × 3.5) ^ 2 = run time in min
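The same reconstruction for the contigger (again my sketch, not from the report) shows why this fit is rougher: the two per-point constants disagree noticeably.

```python
import math

# Reconstruct the contigger model from the two measured points above.
points = [(1, 9), (2, 66)]                   # (millions of reads, minutes)
constants = [math.sqrt(t) / n for n, t in points]  # 3.0 and ~4.06
c = sum(constants) / len(constants)          # ~3.5


def predicted_minutes(millions):
    return (millions * c) ** 2
```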