<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.umiacs.umd.edu/cbcb/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ghodsi</id>
	<title>Cbcb - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.umiacs.umd.edu/cbcb/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Ghodsi"/>
	<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php/Special:Contributions/Ghodsi"/>
	<updated>2026-04-12T19:12:46Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.7</generator>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7584</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7584"/>
		<updated>2010-09-27T19:16:29Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: /* Step 6: Run clustering tool */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated by barcoded 454 sequencing and are received as one of: (i) an .sff file; (ii) a fasta file and a quality file; (iii) a fasta file only&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
      [batch1].csv   - meta-information about the batch&lt;br /&gt;
      [fasta1]       - fasta files containing the batch&lt;br /&gt;
      ...&lt;br /&gt;
      [fastan]&lt;br /&gt;
      [batch1].part  - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
      part/          - directory where all the partitioned files live&lt;br /&gt;
    ...&lt;br /&gt;
    [batchn]/&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=&amp;quot;${i%.sff}&amp;quot;&lt;br /&gt;
  sff_extract -c -s &amp;quot;$name.seq&amp;quot; -q &amp;quot;$name.qual&amp;quot; &amp;quot;$i&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into tab-delimited file (and run dos2unix) making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTS}/add_barcode.pl [batch].csv ${MAIN}/IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
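&lt;br /&gt;
The cleanup described in the first bullet can be partly scripted. The following is only a sketch and not one of the pipeline scripts: the function name clean_batch_table is hypothetical, and the code assumes that the table exported from the spreadsheet arrives on standard input as tab-delimited text with a single header row and the Sample ID in the first column. It strips carriage returns (the dos2unix step) and double quotes, keeps the header row first, and sorts the remaining rows by the first column. Adding the date and checking the header format are not covered.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper, not part of ${SCRIPTS}.
# Assumes: tab-delimited input on stdin, one header row, Sample ID in column 1.
clean_batch_table() {
  tab=$(printf '\t')
  # Delete carriage returns (DOS line endings) and double quotes.
  tr -d '\r"' | {
    # Pass the header row through unchanged, then sort the data rows.
    IFS= read -r header
    printf '%s\n' "$header"
    sort -t "$tab" -k1,1
  }
}
```
&lt;br /&gt;
A possible use is to pipe the exported table through clean_batch_table and save the result as [batch].csv. Note that this removes every double quote, which is only safe if no field legitimately contains one.&lt;br /&gt;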
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTS}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks: BAD lists sequences that are too short or contain Ns, and NONE lists sequences whose barcode is not recognized. Each entry carries extra information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, and sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
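&lt;br /&gt;
When troubleshooting, it can help to see how many entries in a list file fall into each category. The sketch below is an illustration only: the function name summarize_list is hypothetical, and the exact layout of the list files is an assumption (sequence name, whitespace, then the extra information). A purely numeric second field is read as a cycle count, meaning the sequence was too short; anything else is read as the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper, not part of ${SCRIPTS}.
# Assumes each line of the list is: sequence name, whitespace, annotation.
summarize_list() {
  awk '
    # Numeric annotation: a cycle count, so the sequence was too short.
    $2 ~ /^[0-9]+$/ { too_short++; next }
    # Any other annotation: Ns in the sequence or an unknown barcode.
    NF > 1          { other++ }
    END { printf("too_short\t%d\nother\t%d\n", too_short, other) }
  ' "$1"
}
```
&lt;br /&gt;
Running summarize_list on [batch].BAD.list would then print two tab-delimited lines, one per category.&lt;br /&gt;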
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  Delete the .nbc files produced from the BAD and NONE partitions so that they are not picked up by later steps of the pipeline.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: file names must be added batch by batch because multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key but then the tables need to be sorted  by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
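&lt;br /&gt;
The merge behaviour noted in this step (on a key conflict, only the empty fields of the existing record are filled in) can be illustrated with a small sketch. This is not merge_csv.pl: the function name merge_fill_empty is hypothetical, and the code assumes that both tables are tab-delimited, share the same column layout, and carry the key in the first column. It also does not append records that exist only in the new table.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper, not merge_csv.pl.
# $1 = existing table, $2 = new table; both tab-delimited, key in column 1.
merge_fill_empty() {
  awk -F '\t' -v OFS='\t' '
    # First file on the command line is the new table: remember each row by key.
    NR == FNR { fresh[$1] = $0; next }
    # Existing row with a matching key: fill only the fields that are empty.
    ($1 in fresh) {
      n = split(fresh[$1], f, FS)
      for (i = n; i > 1; i--)
        if ($i == "") $i = f[i]
    }
    { print }
  ' "$2" "$1"
}
```
&lt;br /&gt;
Fields that already hold a value in the existing table are left untouched, even when the new table disagrees.&lt;br /&gt;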
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
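&lt;br /&gt;
The naming scheme used in the combined file can be sketched for a single input file as follows. This is an illustration and not combinefa.pl: the function name renumber_fasta is hypothetical, and the sketch assumes that the index of the first sequence in a file is 1.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper, not combinefa.pl itself.
# $1 = file number (the "File #" value), $2 = fasta file.
# Assumes the per-file sequence index starts at 1.
renumber_fasta() {
  awk -v n="$1" '
    # Replace each header line with the file number and a running index.
    /^>/ { printf(">%s_%d\n", n, ++idx); next }
    # Copy sequence lines unchanged.
    { print }
  ' "$2"
}
```
&lt;br /&gt;
Concatenating the output of renumber_fasta over all files listed in 454.csv would give a combined file in which the prefix of every sequence name identifies the file it came from.&lt;br /&gt;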
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
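&lt;br /&gt;
Because the .cluster file has one cluster per line with the center listed first, a quick size summary is easy to produce. The sketch below is an illustration only: the function name cluster_sizes is hypothetical, and it assumes that the identifiers on a line are separated by whitespace.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper. Assumes identifiers on a line are separated by whitespace.
cluster_sizes() {
  # Print: cluster center, TAB, number of sequences in the cluster.
  awk 'NF > 0 { printf("%s\t%d\n", $1, NF) }' "$1"
}
```
&lt;br /&gt;
Running cluster_sizes on Run[date].fna.cluster gives one line per cluster, which is handy for spotting singletons before assigning taxonomy.&lt;br /&gt;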
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, taxnames at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells are numbers of sequences assigned to the specific group. If looking at taxonomic levels, the sequences without an assignment at that level are assigned to a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/clusters2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
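&lt;br /&gt;
The counts behind the OTU table can be sketched directly from the .cluster file. This is an illustration and not taxpart2summary.pl: the function name otu_counts is hypothetical, the output is a long list (cluster center, file number, count) rather than the OTU by sample matrix, and the sketch assumes whitespace-separated sequence names of the form &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the file number. Mapping file numbers to Sample IDs through 454.csv is left out.&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical helper, not taxpart2summary.pl.
# Assumes: one cluster per line, whitespace-separated sequence names,
# cluster center first, and names of the form n_nn (n = file number).
otu_counts() {
  awk '
    {
      for (i = NF; i > 0; i--) {
        filenum = $i
        sub(/_[^_]*$/, "", filenum)      # strip the _nn suffix, keep n
        count[$1 "\t" filenum]++         # key: cluster center + file number
      }
    }
    END { for (k in count) printf("%s\t%d\n", k, count[k]) }
  ' "$1" | LC_ALL=C sort
}
```
&lt;br /&gt;
Dividing each count by the total number of sequences with the same file number would give the percentages described for the percent tables.&lt;br /&gt;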
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through a command line use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the user/password combination &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: [[Media:Schema.pdf]]&lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=&amp;quot;${i%.sff}&amp;quot;&lt;br /&gt;
  sff_extract -c -s &amp;quot;$name.seq&amp;quot; -q &amp;quot;$name.qual&amp;quot; &amp;quot;$i&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into tab-delimited file (and run dos2unix) making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTS}/add_barcode.pl [batch].csv ${MAIN}/IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTS}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks: BAD lists sequences that are too short or contain Ns, and NONE lists sequences whose barcode is not recognized. Each entry carries extra information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, and sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  Delete the .nbc files produced from the BAD and NONE partitions so that they are not picked up by later steps of the pipeline.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from 454 directory&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Upload OTU information into the database ===&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Then upload the resulting partition to the database&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/upload_otus.pl [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Generate summary tables ===&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7583</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7583"/>
		<updated>2010-09-27T19:14:21Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: /* Step 9: Run clustering tool */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated by barcoded 454 sequencing and are received as one of: (i) an .sff file; (ii) a fasta file and a quality file; (iii) a fasta file only&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
      [batch1].csv   - meta-information about the batch&lt;br /&gt;
      [fasta1]       - fasta files containing the batch&lt;br /&gt;
      ...&lt;br /&gt;
      [fastan]&lt;br /&gt;
      [batch1].part  - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
      part/          - directory where all the partitioned files live&lt;br /&gt;
    ...&lt;br /&gt;
    [batchn]/&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=&amp;quot;${i%.sff}&amp;quot;&lt;br /&gt;
  sff_extract -c -s &amp;quot;$name.seq&amp;quot; -q &amp;quot;$name.qual&amp;quot; &amp;quot;$i&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into tab-delimited file (and run dos2unix) making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTS}/add_barcode.pl [batch].csv ${MAIN}/IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTS}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks: BAD lists sequences that are too short or contain Ns, and NONE lists sequences whose barcode is not recognized. Each entry carries extra information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, and sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  Delete the .nbc files produced from the BAD and NONE partitions so that they are not picked up by later steps of the pipeline.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: file names must be added batch by batch because multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key but then the tables need to be sorted  by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
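&lt;br /&gt;
A quick way to sanity-check the tab-delimited output is to count how many cluster centers received each taxid. This is only a sketch: the function name taxid_tally is hypothetical and the taxids in the example are made up.&lt;br /&gt;
&lt;br /&gt;
```shell
# Read the findtaxid.sh output on stdin: sequence name, TAB, taxid.
# Prints one line per taxid: taxid, TAB, number of cluster centers.
taxid_tally() {
  LC_ALL=C awk -F '\t' '
    { n[$2]++ }
    END { for (t in n) printf "%s\t%d\n", t, n[t] }
  ' | LC_ALL=C sort
}

# Example with three made-up centers and two made-up taxids.
printf '1_1\t562\n2_2\t1280\n3_5\t562\n' | taxid_tally
```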
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and taxnames at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells are numbers of sequences assigned to the specific group. If looking at taxonomic levels, the sequences without an assignment at that level are assigned to a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/clusters2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
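&lt;br /&gt;
To make the count and percent entries concrete, the sketch below derives both directly from a .cluster file. It is not taxpart2summary.pl: it assumes whitespace-separated identifiers, uses the File # prefix of each sequence name (the part before the underscore) as a stand-in for the sample, and prints one line per OTU/file pair rather than a CSV table. The function name otu_table is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```shell
# Read a .cluster file on stdin (one cluster per line, center first).
# Prints: OTU center, TAB, File # prefix, TAB, count, TAB, percent,
# where percent is relative to all sequences carrying that prefix.
otu_table() {
  LC_ALL=C awk '
    {
      for (i = 1; i != NF + 1; i++) {
        split($i, parts, "_")            # parts[1] is the File # prefix
        count[$1 "\t" parts[1]]++        # sequences per OTU and file
        total[parts[1]]++                # sequences per file
      }
    }
    END {
      for (key in count) {
        split(key, kv, "\t")             # kv[2] is the File # prefix
        printf "%s\t%d\t%.1f\n", key, count[key], 100 * count[key] / total[kv[2]]
      }
    }
  ' | LC_ALL=C sort
}

# Example: two made-up clusters drawn from files 1 and 2.
printf '1_1 1_2 2_1\n2_2 1_3\n' | otu_table
```
&lt;br /&gt;
In the example, file 1 contributes three sequences, two of which fall in the OTU centered on 1_1, so that pair is reported with a count of 2 and 66.7 percent.&lt;br /&gt;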
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through a command line use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the user/password combination &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: [[Media:Schema.pdf]]&lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=&amp;quot;${i%.sff}&amp;quot;  # strip the .sff extension&lt;br /&gt;
  sff_extract -c -s &amp;quot;$name.seq&amp;quot; -q &amp;quot;$name.qual&amp;quot; &amp;quot;$i&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix). Make sure the quotes added by Excel/OOffice are removed, add the date (if not already in the file), and sort the file by Sample ID.  At this stage also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (fewer than 75 cycles on the 454) are followed by the number of cycles, while sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  Remove the .nbc files produced from the BAD/NONE files so that they are not added to the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from the root of the 454 directory.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Upload OTU information into the database ===&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Then upload the resulting partition to the database&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/upload_otus.pl [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Generate summary tables ===&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb_talk:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7582</id>
		<title>Cbcb talk:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb_talk:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7582"/>
		<updated>2010-09-27T19:11:28Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Step 9: Run OLD clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6109</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6109"/>
		<updated>2009-11-25T18:10:58Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: /* Step 9: Run clustering tool */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=&amp;quot;${i%.sff}&amp;quot;  # strip the .sff extension&lt;br /&gt;
  sff_extract -c -s &amp;quot;$name.seq&amp;quot; -q &amp;quot;$name.qual&amp;quot; &amp;quot;$i&amp;quot;&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix). Make sure the quotes added by Excel/OOffice are removed, add the date (if not already in the file), and sort the file by Sample ID.  At this stage also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (fewer than 75 cycles on the 454) are followed by the number of cycles, while sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  Remove the .nbc files produced from the BAD/NONE files so that they are not added to the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID. Within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key but then the tables need to be sorted  by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating that 454 sequences are available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.clusters&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.clusters, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.clusters &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and taxnames at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells are numbers of sequences assigned to the specific group. If looking at taxonomic levels, the sequences without an assignment at that level are assigned to a generic &amp;quot;No Assignment&amp;quot; bin.&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4858</id>
		<title>Cbcb:Pop-Lab:Mohammad-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4858"/>
		<updated>2009-03-11T17:40:36Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Known bounds for embedding Levenshtein distance into metric spaces&lt;br /&gt;
*[http://portal.acm.org/citation.cfm?id=1109557.1109669 Lower bound of &amp;lt;math&amp;gt;\Omega(\log n)&amp;lt;/math&amp;gt; where n is the number of points]&lt;br /&gt;
*[http://portal.acm.org/citation.cfm?id=1284322 Upper bound of &amp;lt;math&amp;gt;2^{O(\sqrt{\log d \log \log d})}&amp;lt;/math&amp;gt;  where d is the dimension (i.e. length)]&lt;br /&gt;
&lt;br /&gt;
FFT and random projections&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/f/f6/Projection.pdf Presentation]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/1/13/Fft2.pdf Writeup]&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4857</id>
		<title>Cbcb:Pop-Lab:Mohammad-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4857"/>
		<updated>2009-03-11T17:39:48Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Bounds for embedding Levenshtein distance into metric spaces&lt;br /&gt;
*[http://portal.acm.org/citation.cfm?id=1109557.1109669 Lower bound of &amp;lt;math&amp;gt;\Omega(\log n)&amp;lt;/math&amp;gt; where n is the number of points]&lt;br /&gt;
*[http://portal.acm.org/citation.cfm?id=1284322 Upper bound of &amp;lt;math&amp;gt;2^{O(\sqrt{\log d \log \log d})}&amp;lt;/math&amp;gt;  where d is the dimension (i.e. length)]&lt;br /&gt;
&lt;br /&gt;
FFT and random projections&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/f/f6/Projection.pdf Presentation]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/1/13/Fft2.pdf Writeup]&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4856</id>
		<title>Cbcb:Pop-Lab:Mohammad-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4856"/>
		<updated>2009-03-11T17:39:31Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Bounds for embedding Levenshtein distance into metric spaces&lt;br /&gt;
*[http://portal.acm.org/citation.cfm?id=1109557.1109669 Lower bound of &amp;lt;math&amp;gt;\Omega(\log n)&amp;lt;/math&amp;gt; where n is the number of points]&lt;br /&gt;
*[http://portal.acm.org/citation.cfm?id=1284322 Upper bound of &amp;lt;math&amp;gt;2^{O(\sqrt{\log d \log \log d})}&amp;lt;/math&amp;gt;  where d is the dimension (i.e. length)]&lt;br /&gt;
&lt;br /&gt;
FFT and random projections&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/f/f6/Projection.pdf Presentation]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/1/13/Fft2.pdf Writeup]&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4855</id>
		<title>Cbcb:Pop-Lab:Mohammad-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4855"/>
		<updated>2009-03-11T17:31:47Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Bounds for embedding Levenshtein distance into l1&lt;br /&gt;
* &amp;lt;math&amp;gt;\Omega(n)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
FFT and random projections&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/f/f6/Projection.pdf Presentation]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/1/13/Fft2.pdf Writeup]&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4729</id>
		<title>Cbcb:Pop-Lab:Mohammad-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4729"/>
		<updated>2009-02-26T22:34:00Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;FFT and random projections&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/f/f6/Projection.pdf Presentation]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/1/13/Fft2.pdf Writeup]&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:Fft2.pdf&amp;diff=4728</id>
		<title>File:Fft2.pdf</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:Fft2.pdf&amp;diff=4728"/>
		<updated>2009-02-26T22:32:13Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: FFT for overlapping&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;FFT for overlapping&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab&amp;diff=4727</id>
		<title>Cbcb:Pop-Lab</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab&amp;diff=4727"/>
		<updated>2009-02-26T17:47:07Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: /* Progress Reports */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Important pages =&lt;br /&gt;
* Group members [[Cbcb:Pop-Lab:Members]] &lt;br /&gt;
* Blog: [http://genomes.blogspot.com Genomes.blogspot.com]&lt;br /&gt;
* How-tos: [[Cbcb:Pop-Lab:How-to]] - List of documentation on how to perform certain analyses, use different software, etc.&lt;br /&gt;
* Software developed by us: [[Cbcb:Pop-Lab:Software]]&lt;br /&gt;
* Metagenomics papers: [[Cbcb:Pop-Lab:Papers]]&lt;br /&gt;
&lt;br /&gt;
= Meetings/seminars =&lt;br /&gt;
* [[Pop_group_meeting|Group meeting]]: every Monday from 2:30pm in corner conference room (3120 C)&lt;br /&gt;
* [[seminars|CBCB seminar]]: Thursdays 2-3pm during the semester in main conference room&lt;br /&gt;
&lt;br /&gt;
= Progress Reports =&lt;br /&gt;
* [[Cbcb:Pop-Lab:Ben-Report|Ben]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Niranjan-Report|Niranjan]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Chris-Report|Chris]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:James-Report|James]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Serge-Report|Serge]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Bo-Report|Bo]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Dan-Report|Dan]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Mohammad-Report|Mohammad]]&lt;br /&gt;
&lt;br /&gt;
= Pop Lab Presentations =&lt;br /&gt;
* [[Media:Xoo_Cripsr.odg|Xanthomonas Crispr Slides (openoffice format)]] &lt;br /&gt;
* Listeria Monocytogenes Slides (powerpoint)&lt;br /&gt;
* Metagenome pipelines overview [[Media:Pipeline_outline.pdf|Presentation(pdf)]]&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4726</id>
		<title>Cbcb:Pop-Lab:Mohammad-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4726"/>
		<updated>2009-02-26T17:36:50Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;FFT and random projections&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/f/f6/Projection.pdf Presentation]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb/images/f/f6/Projection.pdf Writeup]&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad&amp;diff=4725</id>
		<title>Cbcb:Pop-Lab:Mohammad</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad&amp;diff=4725"/>
		<updated>2009-02-26T17:30:04Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad&amp;diff=4724</id>
		<title>Cbcb:Pop-Lab:Mohammad</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad&amp;diff=4724"/>
		<updated>2009-02-26T17:29:15Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Presentation&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:Projection.pdf&amp;diff=4723</id>
		<title>File:Projection.pdf</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:Projection.pdf&amp;diff=4723"/>
		<updated>2009-02-26T17:27:53Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: Presentation to meeting on FFT and random projections&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Presentation to meeting on FFT and random projections&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4722</id>
		<title>Cbcb:Pop-Lab:Mohammad-Report</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Mohammad-Report&amp;diff=4722"/>
		<updated>2009-02-26T17:23:26Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;FFT and random projections&lt;br /&gt;
presentation&lt;br /&gt;
writeup&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3998</id>
		<title>Pop group meeting</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3998"/>
		<updated>2008-12-02T20:18:42Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 09/29/2008 =&lt;br /&gt;
&lt;br /&gt;
# Benjamin Langmead [http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0003263 Metagenomic Analysis of Lysogeny in Tampa Bay: Implications for Prophage Gene Expression] (PLoS ONE)&lt;br /&gt;
# Bo Liu [http://genomebiology.com/2008/9/9/R139 Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues] (Genome Biology)&lt;br /&gt;
# Daniela Puiu [http://genome.cshlp.org/cgi/reprint/gr.080200.108v1 Sequencing of natural strains of Arabidopsis thaliana with short reads] (Genome Research)&lt;br /&gt;
# Dan Sommer, Niranjan Nagarajan [http://www.nature.com/nature/journal/v455/n7212/edsumm/e080925-04.html Metagenomics: questions to answer] (Nature)&lt;br /&gt;
# James Robert White [http://www.sciencedirect.com/science?_ob=ArticleURL&amp;amp;_udi=B6V73-4SHVST0-1&amp;amp;_user=961305&amp;amp;_rdoc=1&amp;amp;_fmt=&amp;amp;_orig=search&amp;amp;_sort=d&amp;amp;view=c&amp;amp;_version=1&amp;amp;_urlVersion=0&amp;amp;_userid=961305&amp;amp;md5=4115e7b8186a36db60f6b11a217a8798 Microbial population dynamics during aerobic sludge granulation at different organic loading rates] (ScienceDirect)&lt;br /&gt;
# Mohammad Reza Ghodsi [http://nar.oxfordjournals.org/cgi/content/short/36/16/5180 A comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library] (NAR)&lt;br /&gt;
# Sergey Koren [http://www.biomedcentral.com/1471-2105/9/390 R/parallel - speeding up bioinformatics analysis with R] (BMC Bioinformatics)&lt;br /&gt;
# Ted Gibbons [http://www.ncbi.nlm.nih.gov/pubmed/18711340 High-resolution metagenomics targets specific functional types in complex microbial communities] (Nature Biotechnology) &amp;lt;BR&amp;gt; [http://www.nature.com/nrmicro/journal/v6/n9/abs/nrmicro1935.html Molecular eco-systems biology: towards an understanding of community function] (Systems Microbiology)&lt;br /&gt;
&lt;br /&gt;
= 10/06/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/39/15076.short The convergence of carbohydrate active gene repertoires in human gut microbes.] PNAS&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/20/2395 SeqMap: mapping massive amount of oligonucleotides to the genome] Bioinformatics&lt;br /&gt;
# James Robert White [http://www.pnas.org/content/104/21/8918.full Artificial selection of simulated microbial ecosystems.] PNAS&lt;br /&gt;
# Mohammad Reza Ghodsi [http://aem.asm.org/cgi/content/abstract/74/5/1453 Metagenomics: Read Length Matters] Applied and Environmental Microbiology&lt;br /&gt;
# Sergey Koren [http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1000186 Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads] PLoS Computational Biology&lt;br /&gt;
&lt;br /&gt;
= 10/13/2008 =&lt;br /&gt;
# Daniela Puiu, Daniel Sommer [http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0003373 MetaSim—A Sequencing Simulator for Genomics and Metagenomics] PLoS ONE&lt;br /&gt;
# Bo Liu, James Robert White [http://www.sciencemag.org/cgi/content/full/322/5899/275 Environmental Genomics Reveals a Single-Species Ecosystem Deep Within Earth.] Science&lt;br /&gt;
# MohammadReza Ghodsi [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2557142 Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets] (PLoS ONE)&lt;br /&gt;
&lt;br /&gt;
MetaSim software:&lt;br /&gt;
      Location:         /fs/sz-user-supported/common/packages/MetaSim/ &lt;br /&gt;
      Executable(Java): /fs/sz-user-supported/common/packages/MetaSim/metasim/MetaSim &lt;br /&gt;
      Database:         /fs/sz-user-supported/common/packages/MetaSim/database           &lt;br /&gt;
      Create a symlink to the database from your working directory; otherwise you have to import it &lt;br /&gt;
      NCBI databases: [ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.fna.tar.gz Complete bacteria];   [ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz Taxonomy]&lt;br /&gt;
&lt;br /&gt;
= 10/20/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/37/13977.short?rss=1 Accelerated evolution of resistance in multidrug environments] PNAS &amp;lt;br&amp;gt; [http://www.pnas.org/content/105/39/14918.short?rss=1 Drug interactions modulate the potential for evolution of resistance] PNAS&lt;br /&gt;
# Daniel Sommer [http://www.ncbi.nlm.nih.gov/pubmed/18927115?dopt=Abstract NCBI Reference Sequences: current status, policy and new initiatives] NAR&lt;br /&gt;
# Daniela Puiu [http://genomebiology.com/content/pdf/gb-2008-9-10-r151.pdf A simple, fast, and accurate method of phylogenomic inference] Genome Biology&lt;br /&gt;
# Niranjan Nagarajan [http://www.springerlink.com/content/cn3551154w232l73/fulltext.pdf The Statistical Power of Phylogenetic Motif Models] RECOMB 2008&lt;br /&gt;
# Sergey Koren [http://www.nature.com/nmeth/journal/v5/n10/full/nmeth1008-903.html Scientific software: seeing the SNPs between us] Nature Methods&lt;br /&gt;
# Mohammadreza Ghodsi [http://www.biomedcentral.com/1471-2105/9/386 The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes] BMC Bioinformatics&lt;br /&gt;
&lt;br /&gt;
= 10/27/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/37/14130.short?rss=1 Frequent emergence and limited geographic dispersal of methicillin-resistant Staphylococcus aureus] PNAS&lt;br /&gt;
# Sergey Koren [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btn548 Aggressive Assembly of Pyrosequencing Reads with Mates] Bioinformatics&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/21/2431 ZOOM! Zillions of oligos mapped] Bioinformatics&lt;br /&gt;
# Mohammadreza Ghodsi [http://www.ncbi.nlm.nih.gov/pubmed/18692931 Biodiversity of the microbial community in a Spanish farmhouse cheese as revealed by culture-dependent and culture-independent methods]&lt;br /&gt;
&lt;br /&gt;
= 11/03/2008 = &lt;br /&gt;
# Daniela Puiu [http://genome.cshlp.org/cgi/content/abstract/gr.7088808v1 Short read fragment assembly of bacterial genomes] (Genome Research)&lt;br /&gt;
&lt;br /&gt;
= 11/10/2008 = &lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/early/2008/11/04/0802782105.abstract Metagenome analysis of an extreme microbial symbiosis reveals eurythermal adaptation and metabolic flexibility] PNAS&lt;br /&gt;
&lt;br /&gt;
= 11/17/2008 =&lt;br /&gt;
# Bo Liu [http://bioinformatics.oxfordjournals.org/cgi/content/short/24/22/2579?rss=1 Phylogenetic distances are encoded in networks of interacting pathways] Bioinformatics&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btn582 Human genomes as email attachments] Bioinformatics&lt;br /&gt;
# Mohammad Ghodsi [http://nar.oxfordjournals.org/cgi/content/short/36/18/e120 Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers] Nucleic Acids Research&lt;br /&gt;
# Sergey Koren [http://www.biomedcentral.com/1471-2105/9/420/abstract XplorSeq: A software environment for integrated management and phylogenetic analysis of metagenomic sequence data] (BMC Bioinformatics)&lt;br /&gt;
# James Robert White [http://www.pnas.org/content/early/2008/08/08/0801925105.abstract?etoc Resistance, resilience, and redundancy in microbial communities] PNAS&lt;br /&gt;
&lt;br /&gt;
= 11/24/2008 =&lt;br /&gt;
# Ted Gibbons [http://biology.plosjournals.org/perlserv/?request=get-document&amp;amp;doi=10.1371%2Fjournal.pbio.0060295 Gut Reaction: Pyrosequencing Provides the Poop on Distal Gut Bacteria (summary)] PLoS Biology &amp;lt;br&amp;gt; [http://biology.plosjournals.org/perlserv/?request=get-document&amp;amp;doi=10.1371%2Fjournal.pbio.0060280 The Pervasive Effects of an Antibiotic on the Human Gut Microbiota, as Revealed by Deep 16S rRNA Sequencing] PLoS Biology &amp;lt;br&amp;gt; [http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000255 Exploring Microbial Diversity and Taxonomy Using SSU rRNA Hypervariable Tag Sequencing] PLoS Genetics&lt;br /&gt;
&lt;br /&gt;
= 12/01/2008 =&lt;br /&gt;
# Mohammad Ghodsi [http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0002836 Comparative Analysis of Human Gut Microbiota by Barcoded Pyrosequencing] PLoS ONE&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3883</id>
		<title>Pop group meeting</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3883"/>
		<updated>2008-11-17T19:15:46Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: /* 11/10/2008 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 09/29/2008 =&lt;br /&gt;
&lt;br /&gt;
# Benjamin Langmead [http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0003263 Metagenomic Analysis of Lysogeny in Tampa Bay: Implications for Prophage Gene Expression] (PLoS ONE)&lt;br /&gt;
# Bo Liu [http://genomebiology.com/2008/9/9/R139 Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues] (Genome Biology)&lt;br /&gt;
# Daniela Puiu [http://genome.cshlp.org/cgi/reprint/gr.080200.108v1 Sequencing of natural strains of Arabidopsis thaliana with short reads] (Genome Research)&lt;br /&gt;
# Dan Sommer, Niranjan Nagarajan [http://www.nature.com/nature/journal/v455/n7212/edsumm/e080925-04.html Metagenomics: questions to answer] (Nature)&lt;br /&gt;
# James Robert White [http://www.sciencedirect.com/science?_ob=ArticleURL&amp;amp;_udi=B6V73-4SHVST0-1&amp;amp;_user=961305&amp;amp;_rdoc=1&amp;amp;_fmt=&amp;amp;_orig=search&amp;amp;_sort=d&amp;amp;view=c&amp;amp;_version=1&amp;amp;_urlVersion=0&amp;amp;_userid=961305&amp;amp;md5=4115e7b8186a36db60f6b11a217a8798 Microbial population dynamics during aerobic sludge granulation at different organic loading rates] (ScienceDirect)&lt;br /&gt;
# Mohammad Reza Ghodsi [http://nar.oxfordjournals.org/cgi/content/short/36/16/5180 A comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library] (NAR)&lt;br /&gt;
# Sergey Koren [http://www.biomedcentral.com/1471-2105/9/390 R/parallel - speeding up bioinformatics analysis with R] (BMC Bioinformatics)&lt;br /&gt;
# Theodore Gibbons [http://www.ncbi.nlm.nih.gov/pubmed/18711340 High-resolution metagenomics targets specific functional types in complex microbial communities] (Nature Biotechnology) &amp;lt;BR&amp;gt; [http://www.nature.com/nrmicro/journal/v6/n9/abs/nrmicro1935.html Molecular eco-systems biology: towards an understanding of community function] (Systems Microbiology)&lt;br /&gt;
&lt;br /&gt;
= 10/06/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/39/15076.short The convergence of carbohydrate active gene repertoires in human gut microbes.] PNAS&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/20/2395 SeqMap: mapping massive amount of oligonucleotides to the genome] Bioinformatics&lt;br /&gt;
# James Robert White [http://www.pnas.org/content/104/21/8918.full Artificial selection of simulated microbial ecosystems.] PNAS&lt;br /&gt;
# MohammadReza Ghodsi [http://aem.asm.org/cgi/content/abstract/74/5/1453 Metagenomics: Read Length Matters] Applied and Environmental Microbiology&lt;br /&gt;
# Sergey Koren [http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1000186 Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads] PLoS Computational Biology&lt;br /&gt;
&lt;br /&gt;
= 10/13/2008 =&lt;br /&gt;
# Daniela Puiu, Daniel Sommer [http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0003373 MetaSim—A Sequencing Simulator for Genomics and Metagenomics] PLoS ONE&lt;br /&gt;
# Bo Liu, James Robert White [http://www.sciencemag.org/cgi/content/full/322/5899/275 Environmental Genomics Reveals a Single-Species Ecosystem Deep Within Earth.] Science&lt;br /&gt;
# MohammadReza Ghodsi [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2557142 Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets] (PLoS ONE)&lt;br /&gt;
&lt;br /&gt;
MetaSim software:&lt;br /&gt;
      Location:         /fs/sz-user-supported/common/packages/MetaSim/ &lt;br /&gt;
      Executable(Java): /fs/sz-user-supported/common/packages/MetaSim/metasim/MetaSim &lt;br /&gt;
      Database:         /fs/sz-user-supported/common/packages/MetaSim/database           &lt;br /&gt;
      Create a symlink to the database from your working directory; otherwise you have to import it &lt;br /&gt;
      NCBI databases: [ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.fna.tar.gz Complete bacteria];   [ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz Taxonomy]&lt;br /&gt;
&lt;br /&gt;
= 10/20/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/37/13977.short?rss=1 Accelerated evolution of resistance in multidrug environments] PNAS &amp;lt;br&amp;gt; [http://www.pnas.org/content/105/39/14918.short?rss=1 Drug interactions modulate the potential for evolution of resistance] PNAS&lt;br /&gt;
# Daniel Sommer [http://www.ncbi.nlm.nih.gov/pubmed/18927115?dopt=Abstract NCBI Reference Sequences: current status, policy and new initiatives] NAR&lt;br /&gt;
# Daniela Puiu [http://genomebiology.com/content/pdf/gb-2008-9-10-r151.pdf A simple, fast, and accurate method of phylogenomic inference] Genome Biology&lt;br /&gt;
# Niranjan Nagarajan [http://www.springerlink.com/content/cn3551154w232l73/fulltext.pdf The Statistical Power of Phylogenetic Motif Models] RECOMB 2008&lt;br /&gt;
# Sergey Koren [http://www.nature.com/nmeth/journal/v5/n10/full/nmeth1008-903.html Scientific software: seeing the SNPs between us] Nature Methods&lt;br /&gt;
# Mohammadreza Ghodsi [http://www.biomedcentral.com/1471-2105/9/386 The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes] BMC Bioinformatics&lt;br /&gt;
&lt;br /&gt;
= 10/27/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/37/14130.short?rss=1 Frequent emergence and limited geographic dispersal of methicillin-resistant Staphylococcus aureus] PNAS&lt;br /&gt;
# Sergey Koren [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btn548 Aggressive Assembly of Pyrosequencing Reads with Mates] Bioinformatics&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/21/2431 ZOOM! Zillions of oligos mapped] Bioinformatics&lt;br /&gt;
# Mohammadreza Ghodsi [http://www.ncbi.nlm.nih.gov/pubmed/18692931 Biodiversity of the microbial community in a Spanish farmhouse cheese as revealed by culture-dependent and culture-independent methods]&lt;br /&gt;
&lt;br /&gt;
= 11/03/2008 = &lt;br /&gt;
# Daniela Puiu [http://genome.cshlp.org/cgi/content/abstract/gr.7088808v1 Short read fragment assembly of bacterial genomes] (Genome Research)&lt;br /&gt;
&lt;br /&gt;
= 11/10/2008 = &lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/early/2008/11/04/0802782105.abstract Metagenome analysis of an extreme microbial symbiosis reveals eurythermal adaptation and metabolic flexibility] PNAS&lt;br /&gt;
# Sergey Koren [http://www.biomedcentral.com/1471-2105/9/420/abstract XplorSeq: A software environment for integrated management and phylogenetic analysis of metagenomic sequence data] (BMC Bioinformatics)&lt;br /&gt;
&lt;br /&gt;
= 11/17/2008 =&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btn582 Human genomes as email attachments] Bioinformatics&lt;br /&gt;
# Mohammad Ghodsi [http://nar.oxfordjournals.org/cgi/content/short/36/18/e120 Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers] Nucleic Acids Research&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3785</id>
		<title>Pop group meeting</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3785"/>
		<updated>2008-10-27T22:43:36Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: /* 10/27/2008 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 09/29/2008 =&lt;br /&gt;
&lt;br /&gt;
# Benjamin Langmead [http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0003263 Metagenomic Analysis of Lysogeny in Tampa Bay: Implications for Prophage Gene Expression] (PLoS ONE)&lt;br /&gt;
# Bo Liu [http://genomebiology.com/2008/9/9/R139 Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues] (Genome Biology)&lt;br /&gt;
# Daniela Puiu [http://genome.cshlp.org/cgi/reprint/gr.080200.108v1 Sequencing of natural strains of Arabidopsis thaliana with short reads] (Genome Research)&lt;br /&gt;
# Dan Sommer, Niranjan Nagarajan [http://www.nature.com/nature/journal/v455/n7212/edsumm/e080925-04.html Metagenomics: questions to answer] (Nature)&lt;br /&gt;
# James Robert White [http://www.sciencedirect.com/science?_ob=ArticleURL&amp;amp;_udi=B6V73-4SHVST0-1&amp;amp;_user=961305&amp;amp;_rdoc=1&amp;amp;_fmt=&amp;amp;_orig=search&amp;amp;_sort=d&amp;amp;view=c&amp;amp;_version=1&amp;amp;_urlVersion=0&amp;amp;_userid=961305&amp;amp;md5=4115e7b8186a36db60f6b11a217a8798 Microbial population dynamics during aerobic sludge granulation at different organic loading rates] (ScienceDirect)&lt;br /&gt;
# Mohammad Reza Ghodsi [http://nar.oxfordjournals.org/cgi/content/short/36/16/5180 A comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library] (NAR)&lt;br /&gt;
# Sergey Koren [http://www.biomedcentral.com/1471-2105/9/390 R/parallel - speeding up bioinformatics analysis with R] (BMC Bioinformatics)&lt;br /&gt;
# Theodore Gibbons [http://www.ncbi.nlm.nih.gov/pubmed/18711340 High-resolution metagenomics targets specific functional types in complex microbial communities] (Nature Biotechnology) &amp;lt;BR&amp;gt; [http://www.nature.com/nrmicro/journal/v6/n9/abs/nrmicro1935.html Molecular eco-systems biology: towards an understanding of community function] (Systems Microbiology)&lt;br /&gt;
&lt;br /&gt;
= 10/06/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/39/15076.short The convergence of carbohydrate active gene repertoires in human gut microbes.] PNAS&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/20/2395 SeqMap: mapping massive amount of oligonucleotides to the genome] Bioinformatics&lt;br /&gt;
# James Robert White [http://www.pnas.org/content/104/21/8918.full Artificial selection of simulated microbial ecosystems.] PNAS&lt;br /&gt;
# MohammadReza Ghodsi [http://aem.asm.org/cgi/content/abstract/74/5/1453 Metagenomics: Read Length Matters] Applied and Environmental Microbiology&lt;br /&gt;
# Sergey Koren [http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1000186 Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads] PLoS Computational Biology&lt;br /&gt;
&lt;br /&gt;
= 10/13/2008 =&lt;br /&gt;
# Daniela Puiu, Daniel Sommer [http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0003373 MetaSim—A Sequencing Simulator for Genomics and Metagenomics] PLoS ONE&lt;br /&gt;
# Bo Liu, James Robert White [http://www.sciencemag.org/cgi/content/full/322/5899/275 Environmental Genomics Reveals a Single-Species Ecosystem Deep Within Earth.] Science&lt;br /&gt;
# MohammadReza Ghodsi [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2557142 Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets] (PLoS ONE)&lt;br /&gt;
&lt;br /&gt;
MetaSim software:&lt;br /&gt;
      Location:         /fs/sz-user-supported/common/packages/MetaSim/ &lt;br /&gt;
      Executable(Java): /fs/sz-user-supported/common/packages/MetaSim/metasim/MetaSim &lt;br /&gt;
      Database:         /fs/sz-user-supported/common/packages/MetaSim/database           &lt;br /&gt;
      Create a symlink to the database from your working directory; otherwise you have to import it &lt;br /&gt;
      NCBI databases: [ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.fna.tar.gz Complete bacteria];   [ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz Taxonomy]&lt;br /&gt;
&lt;br /&gt;
= 10/20/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/37/13977.short?rss=1 Accelerated evolution of resistance in multidrug environments] PNAS &amp;lt;br&amp;gt; [http://www.pnas.org/content/105/39/14918.short?rss=1 Drug interactions modulate the potential for evolution of resistance] PNAS&lt;br /&gt;
# Daniel Sommer [http://www.ncbi.nlm.nih.gov/pubmed/18927115?dopt=Abstract NCBI Reference Sequences: current status, policy and new initiatives] NAR&lt;br /&gt;
# Daniela Puiu [http://genomebiology.com/content/pdf/gb-2008-9-10-r151.pdf A simple, fast, and accurate method of phylogenomic inference] Genome Biology&lt;br /&gt;
# Niranjan Nagarajan [http://www.springerlink.com/content/cn3551154w232l73/fulltext.pdf The Statistical Power of Phylogenetic Motif Models] RECOMB 2008&lt;br /&gt;
# Sergey Koren [http://www.nature.com/nmeth/journal/v5/n10/full/nmeth1008-903.html Scientific software: seeing the SNPs between us] Nature Methods&lt;br /&gt;
# Mohammadreza Ghodsi [http://www.biomedcentral.com/1471-2105/9/386 The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes] BMC Bioinformatics&lt;br /&gt;
&lt;br /&gt;
= 10/27/2008 =&lt;br /&gt;
&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/37/14130.short?rss=1 Frequent emergence and limited geographic dispersal of methicillin-resistant Staphylococcus aureus] PNAS&lt;br /&gt;
# Sergey Koren [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btn548 Aggressive Assembly of Pyrosequencing Reads with Mates] Bioinformatics&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/21/2431 ZOOM! Zillions of oligos mapped] Bioinformatics&lt;br /&gt;
# Mohammadreza Ghodsi [http://www.ncbi.nlm.nih.gov/pubmed/18692931 Biodiversity of the microbial community in a Spanish farmhouse cheese as revealed by culture-dependent and culture-independent methods]&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3781</id>
		<title>Pop group meeting</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3781"/>
		<updated>2008-10-27T15:06:40Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: /* 10/20/2008 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 09/29/2008 =&lt;br /&gt;
&lt;br /&gt;
# Benjamin Langmead [http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0003263 Metagenomic Analysis of Lysogeny in Tampa Bay: Implications for Prophage Gene Expression] (PLoS ONE)&lt;br /&gt;
# Bo Liu [http://genomebiology.com/2008/9/9/R139 Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues] (Genome Biology)&lt;br /&gt;
# Daniela Puiu [http://genome.cshlp.org/cgi/reprint/gr.080200.108v1 Sequencing of natural strains of Arabidopsis thaliana with short reads] (Genome Research)&lt;br /&gt;
# Dan Sommer, Niranjan Nagarajan [http://www.nature.com/nature/journal/v455/n7212/edsumm/e080925-04.html Metagenomics: questions to answer] (Nature)&lt;br /&gt;
# James Robert White [http://www.sciencedirect.com/science?_ob=ArticleURL&amp;amp;_udi=B6V73-4SHVST0-1&amp;amp;_user=961305&amp;amp;_rdoc=1&amp;amp;_fmt=&amp;amp;_orig=search&amp;amp;_sort=d&amp;amp;view=c&amp;amp;_version=1&amp;amp;_urlVersion=0&amp;amp;_userid=961305&amp;amp;md5=4115e7b8186a36db60f6b11a217a8798 Microbial population dynamics during aerobic sludge granulation at different organic loading rates] (ScienceDirect)&lt;br /&gt;
# Mohammad Reza Ghodsi [http://nar.oxfordjournals.org/cgi/content/short/36/16/5180 A comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library] (NAR)&lt;br /&gt;
# Sergey Koren [http://www.biomedcentral.com/1471-2105/9/390 R/parallel - speeding up bioinformatics analysis with R] (BMC Bioinformatics)&lt;br /&gt;
# Theodore Gibbons [http://www.ncbi.nlm.nih.gov/pubmed/18711340 High-resolution metagenomics targets specific functional types in complex microbial communities] (Nature Biotechnology) &amp;lt;BR&amp;gt; [http://www.nature.com/nrmicro/journal/v6/n9/abs/nrmicro1935.html Molecular eco-systems biology: towards an understanding of community function] (Systems Microbiology)&lt;br /&gt;
&lt;br /&gt;
= 10/06/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/39/15076.short The convergence of carbohydrate active gene repertoires in human gut microbes.] PNAS&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/20/2395 SeqMap: mapping massive amount of oligonucleotides to the genome] Bioinformatics&lt;br /&gt;
# James Robert White [http://www.pnas.org/content/104/21/8918.full Artificial selection of simulated microbial ecosystems.] PNAS&lt;br /&gt;
# MohammadReza Ghodsi [http://aem.asm.org/cgi/content/abstract/74/5/1453 Metagenomics: Read Length Matters] Applied and Environmental Microbiology&lt;br /&gt;
# Sergey Koren [http://www.ploscompbiol.org/article/info:doi%2F10.1371%2Fjournal.pcbi.1000186 Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads] PLoS Computational Biology&lt;br /&gt;
&lt;br /&gt;
= 10/13/2008 =&lt;br /&gt;
# Daniela Puiu, Daniel Sommer [http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0003373 MetaSim—A Sequencing Simulator for Genomics and Metagenomics] PLoS ONE&lt;br /&gt;
# Bo Liu, James Robert White [http://www.sciencemag.org/cgi/content/full/322/5899/275 Environmental Genomics Reveals a Single-Species Ecosystem Deep Within Earth.] Science&lt;br /&gt;
# MohammadReza Ghodsi [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2557142 Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets] (PLoS ONE)&lt;br /&gt;
&lt;br /&gt;
MetaSim software:&lt;br /&gt;
      Location:         /fs/sz-user-supported/common/packages/MetaSim/ &lt;br /&gt;
      Executable(Java): /fs/sz-user-supported/common/packages/MetaSim/metasim/MetaSim &lt;br /&gt;
      Database:         /fs/sz-user-supported/common/packages/MetaSim/database           &lt;br /&gt;
      Create a symlink to the database from your working directory; otherwise you have to import it &lt;br /&gt;
      NCBI databases: [ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.fna.tar.gz Complete bacteria];   [ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz Taxonomy]&lt;br /&gt;
&lt;br /&gt;
= 10/20/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/37/13977.short?rss=1 Accelerated evolution of resistance in multidrug environments] PNAS &amp;lt;br&amp;gt; [http://www.pnas.org/content/105/39/14918.short?rss=1 Drug interactions modulate the potential for evolution of resistance] PNAS&lt;br /&gt;
# Daniel Sommer [http://www.ncbi.nlm.nih.gov/pubmed/18927115?dopt=Abstract NCBI Reference Sequences: current status, policy and new initiatives] NAR&lt;br /&gt;
# Daniela Puiu [http://genomebiology.com/content/pdf/gb-2008-9-10-r151.pdf A simple, fast, and accurate method of phylogenomic inference] Genome Biology&lt;br /&gt;
# Niranjan Nagarajan [http://www.springerlink.com/content/cn3551154w232l73/fulltext.pdf The Statistical Power of Phylogenetic Motif Models] RECOMB 2008&lt;br /&gt;
# Sergey Koren [http://www.nature.com/nmeth/journal/v5/n10/full/nmeth1008-903.html Scientific software: seeing the SNPs between us] Nature Methods&lt;br /&gt;
# Mohammadreza Ghodsi [http://www.biomedcentral.com/1471-2105/9/386 The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes] BMC Bioinformatics&lt;br /&gt;
&lt;br /&gt;
= 10/27/2008 =&lt;br /&gt;
&lt;br /&gt;
# Sergey Koren [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btn548 Aggressive Assembly of Pyrosequencing Reads with Mates] Bioinformatics&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/21/2431 ZOOM! Zillions of oligos mapped] Bioinformatics&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3592</id>
		<title>Pop group meeting</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3592"/>
		<updated>2008-10-13T20:05:38Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: /* 10/13/2008 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 09/29/2008 =&lt;br /&gt;
&lt;br /&gt;
# Benjamin Langmead [http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0003263 Metagenomic Analysis of Lysogeny in Tampa Bay: Implications for Prophage Gene Expression] (PLoS ONE)&lt;br /&gt;
# Bo Liu [http://genomebiology.com/2008/9/9/R139 Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues] (Genome Biology)&lt;br /&gt;
# Daniela Puiu [http://genome.cshlp.org/cgi/reprint/gr.080200.108v1 Sequencing of natural strains of Arabidopsis thaliana with short reads] (Genome Research)&lt;br /&gt;
# Dan Sommer [http://www.nature.com/nature/journal/v455/n7212/edsumm/e080925-04.html Metagenomics: questions to answer] (Nature)&lt;br /&gt;
# James Robert White [http://www.sciencedirect.com/science?_ob=ArticleURL&amp;amp;_udi=B6V73-4SHVST0-1&amp;amp;_user=961305&amp;amp;_rdoc=1&amp;amp;_fmt=&amp;amp;_orig=search&amp;amp;_sort=d&amp;amp;view=c&amp;amp;_version=1&amp;amp;_urlVersion=0&amp;amp;_userid=961305&amp;amp;md5=4115e7b8186a36db60f6b11a217a8798 Microbial population dynamics during aerobic sludge granulation at different organic loading rates] (ScienceDirect)&lt;br /&gt;
# Mohammad Reza Ghodsi [http://nar.oxfordjournals.org/cgi/content/short/36/16/5180 A comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library] (NAR)&lt;br /&gt;
# Niranjan Nagarajan [http://www.nature.com/nature/journal/v455/n7212/edsumm/e080925-04.html Metagenomics: questions to answer] (Nature)&lt;br /&gt;
# Sergey Koren [http://www.biomedcentral.com/1471-2105/9/390 R/parallel - speeding up bioinformatics analysis with R] (BMC Bioinformatics)&lt;br /&gt;
# Theodore Gibbons [http://www.ncbi.nlm.nih.gov/pubmed/18711340 High-resolution metagenomics targets specific functional types in complex microbial communities] (Nature Biotechnology) &amp;lt;BR&amp;gt; [http://www.nature.com/nrmicro/journal/v6/n9/abs/nrmicro1935.html Molecular eco-systems biology: towards an understanding of community function] (Nature Reviews Microbiology)&lt;br /&gt;
&lt;br /&gt;
= 10/06/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/39/15076.short The convergence of carbohydrate active gene repertoires in human gut microbes.] PNAS&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/20/2395 SeqMap: mapping massive amount of oligonucleotides to the genome] Bioinformatics&lt;br /&gt;
# James Robert White [http://www.pnas.org/content/104/21/8918.full Artificial selection of simulated microbial ecosystems.] PNAS&lt;br /&gt;
# MohammadReza Ghodsi [http://aem.asm.org/cgi/content/abstract/74/5/1453 Metagenomics: Read Length Matters] Applied and Environmental Microbiology&lt;br /&gt;
&lt;br /&gt;
= 10/13/2008 =&lt;br /&gt;
# Daniela Puiu&lt;br /&gt;
# Bo Liu &amp;amp; James Robert White [http://www.sciencemag.org/cgi/content/full/322/5899/275 Environmental Genomics Reveals a Single-Species Ecosystem Deep Within Earth.] Science&lt;br /&gt;
# MohammadReza Ghodsi [http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2557142 Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets] (PLoS ONE)&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3560</id>
		<title>Pop group meeting</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3560"/>
		<updated>2008-10-08T19:25:07Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: /* 10/06/2008 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 09/29/2008 =&lt;br /&gt;
&lt;br /&gt;
# Benjamin Langmead [http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0003263 Metagenomic Analysis of Lysogeny in Tampa Bay: Implications for Prophage Gene Expression] (PLoS ONE)&lt;br /&gt;
# Bo Liu [http://genomebiology.com/2008/9/9/R139 Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues] (Genome Biology)&lt;br /&gt;
# Daniela Puiu [http://genome.cshlp.org/cgi/reprint/gr.080200.108v1 Sequencing of natural strains of Arabidopsis thaliana with short reads] (Genome Research)&lt;br /&gt;
# Dan Sommer [http://www.nature.com/nature/journal/v455/n7212/edsumm/e080925-04.html Metagenomics: questions to answer] (Nature)&lt;br /&gt;
# James Robert White [http://www.sciencedirect.com/science?_ob=ArticleURL&amp;amp;_udi=B6V73-4SHVST0-1&amp;amp;_user=961305&amp;amp;_rdoc=1&amp;amp;_fmt=&amp;amp;_orig=search&amp;amp;_sort=d&amp;amp;view=c&amp;amp;_version=1&amp;amp;_urlVersion=0&amp;amp;_userid=961305&amp;amp;md5=4115e7b8186a36db60f6b11a217a8798 Microbial population dynamics during aerobic sludge granulation at different organic loading rates] (ScienceDirect)&lt;br /&gt;
# Mohammad Reza Ghodsi [http://nar.oxfordjournals.org/cgi/content/short/36/16/5180 A comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library] (NAR)&lt;br /&gt;
# Niranjan Nagarajan [http://www.nature.com/nature/journal/v455/n7212/edsumm/e080925-04.html Metagenomics: questions to answer] (Nature)&lt;br /&gt;
# Sergey Koren [http://www.biomedcentral.com/1471-2105/9/390 R/parallel - speeding up bioinformatics analysis with R] (BMC Bioinformatics)&lt;br /&gt;
# Theodore Gibbons [http://www.ncbi.nlm.nih.gov/pubmed/18711340 High-resolution metagenomics targets specific functional types in complex microbial communities] (Nature Biotechnology) &amp;lt;BR&amp;gt; [http://www.nature.com/nrmicro/journal/v6/n9/abs/nrmicro1935.html Molecular eco-systems biology: towards an understanding of community function] (Nature Reviews Microbiology)&lt;br /&gt;
&lt;br /&gt;
= 10/06/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/39/15076.short The convergence of carbohydrate active gene repertoires in human gut microbes.] PNAS&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/20/2395 SeqMap: mapping massive amount of oligonucleotides to the genome] Bioinformatics&lt;br /&gt;
# James Robert White [http://www.pnas.org/content/104/21/8918.full Artificial selection of simulated microbial ecosystems.] PNAS&lt;br /&gt;
# MohammadReza Ghodsi [http://aem.asm.org/cgi/content/abstract/74/5/1453 Metagenomics: Read Length Matters] Applied and Environmental Microbiology&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3559</id>
		<title>Pop group meeting</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Pop_group_meeting&amp;diff=3559"/>
		<updated>2008-10-08T19:23:28Z</updated>

		<summary type="html">&lt;p&gt;Ghodsi: /* 10/06/2008 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 09/29/2008 =&lt;br /&gt;
&lt;br /&gt;
# Benjamin Langmead [http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0003263 Metagenomic Analysis of Lysogeny in Tampa Bay: Implications for Prophage Gene Expression] (PLoS ONE)&lt;br /&gt;
# Bo Liu [http://genomebiology.com/2008/9/9/R139 Systematic bioinformatic analysis of expression levels of 17,330 human genes across 9,783 samples from 175 types of healthy and pathological tissues] (Genome Biology)&lt;br /&gt;
# Daniela Puiu [http://genome.cshlp.org/cgi/reprint/gr.080200.108v1 Sequencing of natural strains of Arabidopsis thaliana with short reads] (Genome Research)&lt;br /&gt;
# Dan Sommer [http://www.nature.com/nature/journal/v455/n7212/edsumm/e080925-04.html Metagenomics: questions to answer] (Nature)&lt;br /&gt;
# James Robert White [http://www.sciencedirect.com/science?_ob=ArticleURL&amp;amp;_udi=B6V73-4SHVST0-1&amp;amp;_user=961305&amp;amp;_rdoc=1&amp;amp;_fmt=&amp;amp;_orig=search&amp;amp;_sort=d&amp;amp;view=c&amp;amp;_version=1&amp;amp;_urlVersion=0&amp;amp;_userid=961305&amp;amp;md5=4115e7b8186a36db60f6b11a217a8798 Microbial population dynamics during aerobic sludge granulation at different organic loading rates] (ScienceDirect)&lt;br /&gt;
# Mohammad Reza Ghodsi [http://nar.oxfordjournals.org/cgi/content/short/36/16/5180 A comparison of random sequence reads versus 16S rDNA sequences for estimating the biodiversity of a metagenomic library] (NAR)&lt;br /&gt;
# Niranjan Nagarajan [http://www.nature.com/nature/journal/v455/n7212/edsumm/e080925-04.html Metagenomics: questions to answer] (Nature)&lt;br /&gt;
# Sergey Koren [http://www.biomedcentral.com/1471-2105/9/390 R/parallel - speeding up bioinformatics analysis with R] (BMC Bioinformatics)&lt;br /&gt;
# Theodore Gibbons [http://www.ncbi.nlm.nih.gov/pubmed/18711340 High-resolution metagenomics targets specific functional types in complex microbial communities] (Nature Biotechnology) &amp;lt;BR&amp;gt; [http://www.nature.com/nrmicro/journal/v6/n9/abs/nrmicro1935.html Molecular eco-systems biology: towards an understanding of community function] (Nature Reviews Microbiology)&lt;br /&gt;
&lt;br /&gt;
= 10/06/2008 =&lt;br /&gt;
# Bo Liu [http://www.pnas.org/content/105/39/15076.short The convergence of carbohydrate active gene repertoires in human gut microbes.] PNAS&lt;br /&gt;
# Daniela Puiu [http://bioinformatics.oxfordjournals.org/cgi/content/full/24/20/2395 SeqMap: mapping massive amount of oligonucleotides to the genome] Bioinformatics&lt;br /&gt;
# James Robert White [http://www.pnas.org/content/104/21/8918.full Artificial selection of simulated microbial ecosystems.] PNAS&lt;br /&gt;
# MohammadReza Ghodsi [http://aem.asm.org/cgi/content/abstract/74/5/1453 Metagenomics: Read Length Matters]&lt;/div&gt;</summary>
		<author><name>Ghodsi</name></author>
	</entry>
</feed>