Partition
==Summary==
Partition is a Python script that takes regular expressions and metadata as input and builds an XML file of matching header information from a FASTA-formatted file.  Partition.py is located at:
    /fs/szasmg/metagenomics/Partition/Partition.py ['''stable''']
    /fs/szasmg/metagenomics/Partition/MetaPart/src/Partition.py ['''latest, but possibly unstable builds''']
===Options===
<pre>
-p    Populate the given partition/XML.
-b    Given the input file, build a partition.
-m    Metadata file that will be used to populate the partitions.
-h    Header information for the metadata. If not given, the column information is read from the first line of the metadata file.
-f    Input FASTA file.
-s    Split the FASTA file based on the partition information and write the output to the given directory.
-o    Name of the output .part file.
-c    Convert an old partition format into the new XML format.
</pre>
==Tutorial==
===Format of an input regular expression (*.re) file===
The format of the *.re input file is:
   [name]
   key1 = value1
   key2 = value2
   ...
   [name2]
   key1 = value1
   ...
where [name] is the name of the partition and the key-value pairs are its attributes.
Example of an input file that clusters animals by their first letter:
   [animals]
   info = all animals
   [A]
   info = animals that start with A
   regexp = a.*
   [B]
   info = animals that start with B
   regexp = b.*
   ...
With the format above, a partition can have only two levels.  More levels are possible, but the input file must then be an XML file (explained below).
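The *.re layout resembles an INI file, so its structure can be explored with Python's standard configparser module. The following is only a sketch of how the animal example could be read and applied; it is not the parser used by Partition.py. The helper names load_partitions and assign are hypothetical, and anchoring each regexp at the start of the name (re.match) is an assumption.

```python
import configparser
import re

# The animal example from above, as *.re-style text.
RE_TEXT = """
[animals]
info = all animals
[A]
info = animals that start with A
regexp = a.*
[B]
info = animals that start with B
regexp = b.*
"""


def load_partitions(text):
    """Return {partition name: {attribute: value}} from *.re-style text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep attribute names exactly as written
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def assign(names, partitions):
    """Map every partition that has a regexp to the names it matches."""
    result = {}
    for part, attrs in partitions.items():
        pattern = attrs.get("regexp")
        if pattern is None:
            continue  # e.g. the top-level [animals] partition
        regex = re.compile(pattern)
        # Assumption: the pattern must match from the start of the name.
        result[part] = [n for n in names if regex.match(n)]
    return result


partitions = load_partitions(RE_TEXT)
groups = assign(["ant", "bee", "bat", "cow"], partitions)
print(groups)
```

Names that match no regexp, such as "cow" here, are simply left out of every group in this sketch.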
===Given the input file, build and populate a partition===
   ./Partition.py -b [input.re] -f [input.fasta] [-o dir/file]
The optional -o option saves the partition file under a name other than the default, temppart.xml.
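Conceptually, the build step matches each FASTA header against the partition regexps and records the hits in an XML tree. The sketch below illustrates that idea with the standard library only. The element and attribute names (partition, seq, name, regexp) are assumptions made for illustration and are not the schema that Partition.py writes; the helper functions are hypothetical as well.

```python
import re
import xml.etree.ElementTree as ET


def fasta_headers(lines):
    """Yield the header text (without the leading '>') of each FASTA record."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


def build_partition_xml(partitions, headers):
    """Build an XML tree with one child node per partition and matching header."""
    root = ET.Element("partition", name="root")
    for name, pattern in partitions.items():
        node = ET.SubElement(root, "partition", name=name, regexp=pattern)
        regex = re.compile(pattern)
        for header in headers:
            # Assumption: the regexp is anchored at the start of the header.
            if regex.match(header):
                ET.SubElement(node, "seq").text = header
    return root


# A tiny in-memory FASTA file with made-up records.
fasta = [">ant_001 sample", "ACGT", ">bee_002", "GGCC", ">bat_003", "TTAA"]
headers = list(fasta_headers(fasta))
root = build_partition_xml({"A": "a.*", "B": "b.*"}, headers)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```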
===Use a metadata file to populate the partitions===
A metadata file contains information about a set of sequences, for example:
   #SampleID       LRHand        Sex
   S1              R             F
   S2              L             M
If the first row of the metadata file does not contain the column information, specify a separate column header file with '''-h [file]'''.
The format of a *.re file changes slightly when a metadata file is used.  In addition to the regexp, each partition must also have a '''category''' field that names the metadata column the regexp is checked against.
To populate the XML file with the metadata information and split a given FASTA file:
   ./Partition.py -b [input.re] -f [input.fasta] -m [metadata.map] -s [dir/output]
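The role of the category field can be illustrated with a short sketch: a partition with, say, category = Sex and regexp = F would select the samples whose Sex column matches F. The code below assumes whitespace-separated columns and a header line starting with '#', as in the example table; the functions read_metadata and samples_matching are hypothetical and do not reflect how Partition.py is implemented.

```python
import re

# The example metadata table from above.
METADATA = """\
#SampleID       LRHand        Sex
S1              R             F
S2              L             M
"""


def read_metadata(lines, header=None):
    """Return one dict per sample, keyed by column name.

    If no header is supplied, the first line is used as the column
    header (this mirrors the described behaviour of omitting -h).
    """
    rows = [line.split() for line in lines if line.strip()]
    if header is None:
        header = [col.lstrip("#") for col in rows[0]]
        rows = rows[1:]
    return [dict(zip(header, row)) for row in rows]


def samples_matching(records, category, pattern, id_column="SampleID"):
    """Return the IDs of samples whose `category` column matches `pattern`."""
    regex = re.compile(pattern)
    return [r[id_column] for r in records if regex.match(r.get(category, ""))]


records = read_metadata(METADATA.splitlines())
females = samples_matching(records, "Sex", "F")
left_handed = samples_matching(records, "LRHand", "L")
print(females, left_handed)
```

Passing an explicit header list to read_metadata covers the case where the first row of the file holds data rather than column names.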
Latest revision as of 21:23, 27 August 2009