<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.umiacs.umd.edu/cbcb/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Mpop</id>
	<title>Cbcb - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.umiacs.umd.edu/cbcb/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Mpop"/>
	<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php/Special:Contributions/Mpop"/>
	<updated>2026-04-30T03:52:08Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.7</generator>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Getting_Started_in_CBCB&amp;diff=8969</id>
		<title>Getting Started in CBCB</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Getting_Started_in_CBCB&amp;diff=8969"/>
		<updated>2014-10-03T02:05:24Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Last update 9/11/13&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may want to review the following:&lt;br /&gt;
&lt;br /&gt;
[[Media:CBCB-quick-guide.pdf]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Getting Building Access and Room Keys==&lt;br /&gt;
&lt;br /&gt;
CBCB is located on the 3rd floor of the Biomolecular Sciences Building, identified on campus maps as Building #296.  The building is secure; access is gained either by using your UM ID card or guest card, or by entering the 3-digit code of the person you wish to visit at the call box on the right side of the front door.&lt;br /&gt;
&lt;br /&gt;
Contact the Center Coordinator, Christine Bogan, about gaining card access to the building.  She will need the following information:&lt;br /&gt;
*A notification email from your sponsor/adviser&lt;br /&gt;
*Your Name&lt;br /&gt;
*Your 9 Digit University ID number&lt;br /&gt;
*Your Contact email&lt;br /&gt;
&lt;br /&gt;
Along with your assigned space and phone numbers, the coordinator will send your information to UMIACS Coordinator Edna Walker, who will contact campus security to add you to their system.  &#039;&#039;Note: clearance usually takes a number of days, so contact the coordinator as soon as possible.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you prefer not to send your information through email, feel free to contact the coordinator in person.&lt;br /&gt;
&lt;br /&gt;
You must get your key from the coordinator in person.&lt;br /&gt;
&lt;br /&gt;
Christine Bogan&amp;lt;br&amp;gt;&lt;br /&gt;
Room 3121&amp;lt;br&amp;gt;&lt;br /&gt;
Biomolecular Sciences Bldg #296&amp;lt;br&amp;gt;&lt;br /&gt;
301.405.5936&amp;lt;br&amp;gt;&lt;br /&gt;
dcross[at]umd.edu&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Research In Progress Seminars==&lt;br /&gt;
&lt;br /&gt;
These seminars are held throughout the year.  For information, go to [http://cbcb.umd.edu/node/18625 CBCB Research in Progress]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
For a list of our Disk Storage and amount of available space left on each one, see [http://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage wiki.umiacs.umd.edu/cbcb-private/index.php/Storage]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Configuring Your Home Directory and Shell ==&lt;br /&gt;
&lt;br /&gt;
We import common settings files that set our path variables to include commonly shared software repositories.&lt;br /&gt;
&lt;br /&gt;
As a start, add the following line to the top of the file called &amp;quot;.bashrc&amp;quot; located in your home directory (/nfshomes/username/):&lt;br /&gt;
&lt;br /&gt;
 . /fs/sz-user-supported/share/dotfiles/bashrc.cbcb&lt;br /&gt;
&lt;br /&gt;
This will import the common bashrc.cbcb file into your own bashrc file every time you log in.&lt;br /&gt;
&lt;br /&gt;
Now add this line to your &amp;quot;.bash_profile&amp;quot; file, also located in your home directory:&lt;br /&gt;
&lt;br /&gt;
 . ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
This will import your personal bashrc file every time you log in. Now you should have access to most of the locally installed software like &amp;quot;blastall&amp;quot; and &amp;quot;AMOS.&amp;quot;&lt;br /&gt;
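The &amp;quot;.&amp;quot; at the start of each of these lines is bash&#039;s source command: it runs the named file in your current shell, so any PATH changes it makes carry over. A minimal sketch of the effect (the directory values here are illustrative, not the actual contents of bashrc.cbcb):&lt;br /&gt;

```shell
# Illustrative only: how sourcing a shared dotfile composes a PATH-like variable.
# The real bashrc.cbcb prepends shared software directories; these values are made up.
example_path="/usr/bin:/bin"
# Pretend this next line lives inside bashrc.cbcb:
example_path="/fs/sz-user-supported/bin:$example_path"
echo "$example_path"   # → /fs/sz-user-supported/bin:/usr/bin:/bin
```

After both files are in place, a new login shell picks the shared paths up automatically; you can check the result with &amp;quot;echo $PATH&amp;quot;.&lt;br /&gt;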
&lt;br /&gt;
If you want to add any additional commands to your bashrc file, such as setting your default text editor to &amp;quot;vim&amp;quot; or formatting the output of bash commands (e.g. &amp;quot;ls&amp;quot;), add the appropriate commands after the imported common files, as shown in this example:&lt;br /&gt;
&lt;br /&gt;
 . /fs/sz-user-supported/share/dotfiles/bashrc.cbcb&lt;br /&gt;
 &lt;br /&gt;
 alias vi=&#039;vim&#039;&lt;br /&gt;
&lt;br /&gt;
 alias ls=&#039;ls --color&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you decide you want to change the settings in the common bashrc.cbcb to better suit your personal needs, then please copy and paste its contents into your personal bashrc file. &#039;&#039;&#039;Do not modify the common bashrc.cbcb file as it will affect everyone&#039;s environment.&#039;&#039;&#039; Also check back periodically as people may add common paths for new software.&lt;br /&gt;
&lt;br /&gt;
If you have any other problems, contact staff [at] umiacs.umd.edu or your PI. &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
More resources can be found at [https://wiki.umiacs.umd.edu/umiacs/index.php/GettingStarted umiacs wiki]&lt;br /&gt;
&lt;br /&gt;
== Printing ==&lt;br /&gt;
Go to the umiacs wiki to find [https://wiki.umiacs.umd.edu/umiacs/index.php/Printing system-specific guides for printing], and be sure to [https://wiki.umiacs.umd.edu/umiacs/index.php/PrinterQueueNaming add &#039;nb&#039; to the end of your print queue] to avoid wasting paper printing banners.&lt;br /&gt;
&lt;br /&gt;
== Using the Wiki ==&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet Wiki Formatting Cheatsheet]&lt;br /&gt;
* [http://www.mediawiki.org/wiki/Help:Configuration_settings Configuration settings list]&lt;br /&gt;
* [http://www.mediawiki.org/wiki/Help:FAQ MediaWiki FAQ]&lt;br /&gt;
* [http://mail.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list]&lt;br /&gt;
&lt;br /&gt;
== When You Travel ==&lt;br /&gt;
&lt;br /&gt;
[[Media:CBCB_Travel_Approval_may_2013.pdf‎]] basic travel approval form&lt;br /&gt;
&lt;br /&gt;
[[Media:Travel_101_cbcb_revised_4-2013.pdf]] for more detailed travel info&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Getting_Started_in_CBCB&amp;diff=8968</id>
		<title>Getting Started in CBCB</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Getting_Started_in_CBCB&amp;diff=8968"/>
		<updated>2014-10-03T02:04:11Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Last update 9/11/13&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may want to review the following:&lt;br /&gt;
&lt;br /&gt;
[[Media:CBCB-quick-guide.pdf]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Getting Building Access and Room Keys==&lt;br /&gt;
&lt;br /&gt;
CBCB is located on the 3rd floor of the Biomolecular Sciences Building, identified on campus maps as Building #296.  The building is secure; access is gained either by using your UM ID card or guest card, or by entering the 3-digit code of the person you wish to visit at the call box on the right side of the front door.&lt;br /&gt;
&lt;br /&gt;
Contact the Center Coordinator, Denise Cross, about gaining card access to the building.  She will need the following information:&lt;br /&gt;
*A notification email from your sponsor/adviser&lt;br /&gt;
*Your Name&lt;br /&gt;
*Your 9 Digit University ID number&lt;br /&gt;
*Your Contact email&lt;br /&gt;
&lt;br /&gt;
Along with your assigned space and phone numbers, the coordinator will send your information to UMIACS Coordinator Edna Walker, who will contact campus security to add you to their system.  &#039;&#039;Note: clearance usually takes a number of days, so contact the coordinator as soon as possible.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you prefer not to send your information through email, feel free to contact the coordinator in person.&lt;br /&gt;
&lt;br /&gt;
You must get your key from the coordinator in person.&lt;br /&gt;
&lt;br /&gt;
Christine Bogan&amp;lt;br&amp;gt;&lt;br /&gt;
Room 3121&amp;lt;br&amp;gt;&lt;br /&gt;
Biomolecular Sciences Bldg #296&amp;lt;br&amp;gt;&lt;br /&gt;
301.405.5936&amp;lt;br&amp;gt;&lt;br /&gt;
dcross[at]umd.edu&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Research In Progress Seminars==&lt;br /&gt;
&lt;br /&gt;
These seminars are held throughout the year.  For information, go to [http://cbcb.umd.edu/node/18625 CBCB Research in Progress]&lt;br /&gt;
==Understanding the Layout of Available Resources==&lt;br /&gt;
When you first log into a server (e.g., flicker01@umiacs.umd.edu), you will probably be placed in one of the following personalized directories:&lt;br /&gt;
*/fs/wrenhome/yourUserName/&lt;br /&gt;
*/nfshomes/yourUserName/&lt;br /&gt;
nfshomes has a limit on available disk space (in the tens of megabytes), while wrenhome allows you more freedom. You should therefore use wrenhome for your personal work files, but be very mindful of how much space you are using and how much free space remains. For large amounts of data, you should use one of the shared storage volumes listed below (after first checking with your sponsor).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For a list of our Disk Storage and amount of available space left on each one, see [http://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage wiki.umiacs.umd.edu/cbcb-private/index.php/Storage]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Configuring Your Home Directory and Shell ==&lt;br /&gt;
&lt;br /&gt;
We import common settings files that set our path variables to include commonly shared software repositories.&lt;br /&gt;
&lt;br /&gt;
As a start, add the following line to the top of the file called &amp;quot;.bashrc&amp;quot; located in your home directory (/nfshomes/username/):&lt;br /&gt;
&lt;br /&gt;
 . /fs/sz-user-supported/share/dotfiles/bashrc.cbcb&lt;br /&gt;
&lt;br /&gt;
This will import the common bashrc.cbcb file into your own bashrc file every time you log in.&lt;br /&gt;
&lt;br /&gt;
Now add this line to your &amp;quot;.bash_profile&amp;quot; file, also located in your home directory:&lt;br /&gt;
&lt;br /&gt;
 . ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
This will import your personal bashrc file every time you log in. Now you should have access to most of the locally installed software like &amp;quot;blastall&amp;quot; and &amp;quot;AMOS.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to add any additional commands to your bashrc file, such as setting your default text editor to &amp;quot;vim&amp;quot; or formatting the output of bash commands (e.g. &amp;quot;ls&amp;quot;), add the appropriate commands after the imported common files, as shown in this example:&lt;br /&gt;
&lt;br /&gt;
 . /fs/sz-user-supported/share/dotfiles/bashrc.cbcb&lt;br /&gt;
 &lt;br /&gt;
 alias vi=&#039;vim&#039;&lt;br /&gt;
&lt;br /&gt;
 alias ls=&#039;ls --color&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you decide you want to change the settings in the common bashrc.cbcb to better suit your personal needs, then please copy and paste its contents into your personal bashrc file. &#039;&#039;&#039;Do not modify the common bashrc.cbcb file as it will affect everyone&#039;s environment.&#039;&#039;&#039; Also check back periodically as people may add common paths for new software.&lt;br /&gt;
&lt;br /&gt;
If you have any other problems, contact staff [at] umiacs.umd.edu or your PI. &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
More resources can be found at [https://wiki.umiacs.umd.edu/umiacs/index.php/GettingStarted umiacs wiki]&lt;br /&gt;
&lt;br /&gt;
== Printing ==&lt;br /&gt;
Go to the umiacs wiki to find [https://wiki.umiacs.umd.edu/umiacs/index.php/Printing system-specific guides for printing], and be sure to [https://wiki.umiacs.umd.edu/umiacs/index.php/PrinterQueueNaming add &#039;nb&#039; to the end of your print queue] to avoid wasting paper printing banners.&lt;br /&gt;
&lt;br /&gt;
== Using the Wiki ==&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet Wiki Formatting Cheatsheet]&lt;br /&gt;
* [http://www.mediawiki.org/wiki/Help:Configuration_settings Configuration settings list]&lt;br /&gt;
* [http://www.mediawiki.org/wiki/Help:FAQ MediaWiki FAQ]&lt;br /&gt;
* [http://mail.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list]&lt;br /&gt;
&lt;br /&gt;
== When You Travel ==&lt;br /&gt;
&lt;br /&gt;
[[Media:CBCB_Travel_Approval_may_2013.pdf‎]] basic travel approval form&lt;br /&gt;
&lt;br /&gt;
[[Media:Travel_101_cbcb_revised_4-2013.pdf]] for more detailed travel info&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Getting_Started_in_CBCB&amp;diff=8967</id>
		<title>Getting Started in CBCB</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Getting_Started_in_CBCB&amp;diff=8967"/>
		<updated>2014-10-03T02:03:24Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Last update 9/11/13&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may want to review the following:&lt;br /&gt;
&lt;br /&gt;
[[Media:CBCB-quick-guide.pdf]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Getting Building Access and Room Keys==&lt;br /&gt;
&lt;br /&gt;
CBCB is located on the 3rd floor of the Biomolecular Sciences Building, identified on campus maps as Building #296.  The building is secure; access is gained either by using your UM ID card or guest card, or by entering the 3-digit code of the person you wish to visit at the call box on the right side of the front door.&lt;br /&gt;
&lt;br /&gt;
Contact the Center Coordinator, Denise Cross, about gaining card access to the building.  She will need the following information:&lt;br /&gt;
*A notification email from your sponsor/adviser&lt;br /&gt;
*Your Name&lt;br /&gt;
*Your 9 Digit University ID number&lt;br /&gt;
*Your Contact email&lt;br /&gt;
&lt;br /&gt;
Along with your assigned space and phone numbers, the coordinator will send your information to UMIACS Coordinator Edna Walker, who will contact campus security to add you to their system.  &#039;&#039;Note: clearance usually takes a number of days, so contact the coordinator as soon as possible.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you prefer not to send your information through email, feel free to contact the coordinator in person.&lt;br /&gt;
&lt;br /&gt;
You must get your key from the coordinator in person.&lt;br /&gt;
&lt;br /&gt;
Denise Cross&amp;lt;br&amp;gt;&lt;br /&gt;
Room 3121&amp;lt;br&amp;gt;&lt;br /&gt;
Biomolecular Sciences Bldg #296&amp;lt;br&amp;gt;&lt;br /&gt;
301.405.5936&amp;lt;br&amp;gt;&lt;br /&gt;
dcross[at]umd.edu&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Research In Progress Seminars==&lt;br /&gt;
&lt;br /&gt;
These seminars are held throughout the year.  For information, go to [http://cbcb.umd.edu/node/18625 CBCB Research in Progress]&lt;br /&gt;
==Understanding the Layout of Available Resources==&lt;br /&gt;
When you first log into a server (e.g., flicker01@umiacs.umd.edu), you will probably be placed in one of the following personalized directories:&lt;br /&gt;
*/fs/wrenhome/yourUserName/&lt;br /&gt;
*/nfshomes/yourUserName/&lt;br /&gt;
nfshomes has a limit on available disk space (in the tens of megabytes), while wrenhome allows you more freedom. You should therefore use wrenhome for your personal work files, but be very mindful of how much space you are using and how much free space remains. For large amounts of data, you should use one of the shared storage volumes listed below (after first checking with your sponsor).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For a list of our Disk Storage and amount of available space left on each one, see [http://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage wiki.umiacs.umd.edu/cbcb-private/index.php/Storage]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Configuring Your Home Directory and Shell ==&lt;br /&gt;
&lt;br /&gt;
We import common settings files that set our path variables to include commonly shared software repositories.&lt;br /&gt;
&lt;br /&gt;
As a start, add the following line to the top of the file called &amp;quot;.bashrc&amp;quot; located in your home directory (/nfshomes/username/):&lt;br /&gt;
&lt;br /&gt;
 . /fs/sz-user-supported/share/dotfiles/bashrc.cbcb&lt;br /&gt;
&lt;br /&gt;
This will import the common bashrc.cbcb file into your own bashrc file every time you log in.&lt;br /&gt;
&lt;br /&gt;
Now add this line to your &amp;quot;.bash_profile&amp;quot; file, also located in your home directory:&lt;br /&gt;
&lt;br /&gt;
 . ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
This will import your personal bashrc file every time you log in. Now you should have access to most of the locally installed software like &amp;quot;blastall&amp;quot; and &amp;quot;AMOS.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to add any additional commands to your bashrc file, such as setting your default text editor to &amp;quot;vim&amp;quot; or formatting the output of bash commands (e.g. &amp;quot;ls&amp;quot;), add the appropriate commands after the imported common files, as shown in this example:&lt;br /&gt;
&lt;br /&gt;
 . /fs/sz-user-supported/share/dotfiles/bashrc.cbcb&lt;br /&gt;
 &lt;br /&gt;
 alias vi=&#039;vim&#039;&lt;br /&gt;
&lt;br /&gt;
 alias ls=&#039;ls --color&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you decide you want to change the settings in the common bashrc.cbcb to better suit your personal needs, then please copy and paste its contents into your personal bashrc file. &#039;&#039;&#039;Do not modify the common bashrc.cbcb file as it will affect everyone&#039;s environment.&#039;&#039;&#039; Also check back periodically as people may add common paths for new software.&lt;br /&gt;
&lt;br /&gt;
If you have any other problems, contact staff [at] umiacs.umd.edu or your PI. &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
More resources can be found at [https://wiki.umiacs.umd.edu/umiacs/index.php/GettingStarted umiacs wiki]&lt;br /&gt;
&lt;br /&gt;
== Printing ==&lt;br /&gt;
Go to the umiacs wiki to find [https://wiki.umiacs.umd.edu/umiacs/index.php/Printing system-specific guides for printing], and be sure to [https://wiki.umiacs.umd.edu/umiacs/index.php/PrinterQueueNaming add &#039;nb&#039; to the end of your print queue] to avoid wasting paper printing banners.&lt;br /&gt;
&lt;br /&gt;
== Using the Wiki ==&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet Wiki Formatting Cheatsheet]&lt;br /&gt;
* [http://www.mediawiki.org/wiki/Help:Configuration_settings Configuration settings list]&lt;br /&gt;
* [http://www.mediawiki.org/wiki/Help:FAQ MediaWiki FAQ]&lt;br /&gt;
* [http://mail.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list]&lt;br /&gt;
&lt;br /&gt;
== When You Travel ==&lt;br /&gt;
&lt;br /&gt;
[[Media:CBCB_Travel_Approval_may_2013.pdf‎]] basic travel approval form&lt;br /&gt;
&lt;br /&gt;
[[Media:Travel_101_cbcb_revised_4-2013.pdf]] for more detailed travel info&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Getting_Started_in_CBCB&amp;diff=8966</id>
		<title>Getting Started in CBCB</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Getting_Started_in_CBCB&amp;diff=8966"/>
		<updated>2014-10-03T02:03:04Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Last update 9/11/13&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may want to review the following:&lt;br /&gt;
&lt;br /&gt;
[[Media:]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Getting Building Access and Room Keys==&lt;br /&gt;
&lt;br /&gt;
CBCB is located on the 3rd floor of the Biomolecular Sciences Building, identified on campus maps as Building #296.  The building is secure; access is gained either by using your UM ID card or guest card, or by entering the 3-digit code of the person you wish to visit at the call box on the right side of the front door.&lt;br /&gt;
&lt;br /&gt;
Contact the Center Coordinator, Denise Cross, about gaining card access to the building.  She will need the following information:&lt;br /&gt;
*A notification email from your sponsor/adviser&lt;br /&gt;
*Your Name&lt;br /&gt;
*Your 9 Digit University ID number&lt;br /&gt;
*Your Contact email&lt;br /&gt;
&lt;br /&gt;
Along with your assigned space and phone numbers, the coordinator will send your information to UMIACS Coordinator Edna Walker, who will contact campus security to add you to their system.  &#039;&#039;Note: clearance usually takes a number of days, so contact the coordinator as soon as possible.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you prefer not to send your information through email, feel free to contact the coordinator in person.&lt;br /&gt;
&lt;br /&gt;
You must get your key from the coordinator in person.&lt;br /&gt;
&lt;br /&gt;
Denise Cross&amp;lt;br&amp;gt;&lt;br /&gt;
Room 3121&amp;lt;br&amp;gt;&lt;br /&gt;
Biomolecular Sciences Bldg #296&amp;lt;br&amp;gt;&lt;br /&gt;
301.405.5936&amp;lt;br&amp;gt;&lt;br /&gt;
dcross[at]umd.edu&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Research In Progress Seminars==&lt;br /&gt;
&lt;br /&gt;
These seminars are held throughout the year.  For information, go to [http://cbcb.umd.edu/node/18625 CBCB Research in Progress]&lt;br /&gt;
==Understanding the Layout of Available Resources==&lt;br /&gt;
When you first log into a server (e.g., flicker01@umiacs.umd.edu), you will probably be placed in one of the following personalized directories:&lt;br /&gt;
*/fs/wrenhome/yourUserName/&lt;br /&gt;
*/nfshomes/yourUserName/&lt;br /&gt;
nfshomes has a limit on available disk space (in the tens of megabytes), while wrenhome allows you more freedom. You should therefore use wrenhome for your personal work files, but be very mindful of how much space you are using and how much free space remains. For large amounts of data, you should use one of the shared storage volumes listed below (after first checking with your sponsor).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For a list of our Disk Storage and amount of available space left on each one, see [http://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage wiki.umiacs.umd.edu/cbcb-private/index.php/Storage]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Configuring Your Home Directory and Shell ==&lt;br /&gt;
&lt;br /&gt;
We import common settings files that set our path variables to include commonly shared software repositories.&lt;br /&gt;
&lt;br /&gt;
As a start, add the following line to the top of the file called &amp;quot;.bashrc&amp;quot; located in your home directory (/nfshomes/username/):&lt;br /&gt;
&lt;br /&gt;
 . /fs/sz-user-supported/share/dotfiles/bashrc.cbcb&lt;br /&gt;
&lt;br /&gt;
This will import the common bashrc.cbcb file into your own bashrc file every time you log in.&lt;br /&gt;
&lt;br /&gt;
Now add this line to your &amp;quot;.bash_profile&amp;quot; file, also located in your home directory:&lt;br /&gt;
&lt;br /&gt;
 . ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
This will import your personal bashrc file every time you log in. Now you should have access to most of the locally installed software like &amp;quot;blastall&amp;quot; and &amp;quot;AMOS.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to add any additional commands to your bashrc file, such as setting your default text editor to &amp;quot;vim&amp;quot; or formatting the output of bash commands (e.g. &amp;quot;ls&amp;quot;), add the appropriate commands after the imported common files, as shown in this example:&lt;br /&gt;
&lt;br /&gt;
 . /fs/sz-user-supported/share/dotfiles/bashrc.cbcb&lt;br /&gt;
 &lt;br /&gt;
 alias vi=&#039;vim&#039;&lt;br /&gt;
&lt;br /&gt;
 alias ls=&#039;ls --color&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you decide you want to change the settings in the common bashrc.cbcb to better suit your personal needs, then please copy and paste its contents into your personal bashrc file. &#039;&#039;&#039;Do not modify the common bashrc.cbcb file as it will affect everyone&#039;s environment.&#039;&#039;&#039; Also check back periodically as people may add common paths for new software.&lt;br /&gt;
&lt;br /&gt;
If you have any other problems, contact staff [at] umiacs.umd.edu or your PI. &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
More resources can be found at [https://wiki.umiacs.umd.edu/umiacs/index.php/GettingStarted umiacs wiki]&lt;br /&gt;
&lt;br /&gt;
== Printing ==&lt;br /&gt;
Go to the umiacs wiki to find [https://wiki.umiacs.umd.edu/umiacs/index.php/Printing system-specific guides for printing], and be sure to [https://wiki.umiacs.umd.edu/umiacs/index.php/PrinterQueueNaming add &#039;nb&#039; to the end of your print queue] to avoid wasting paper printing banners.&lt;br /&gt;
&lt;br /&gt;
== Using the Wiki ==&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet Wiki Formatting Cheatsheet]&lt;br /&gt;
* [http://www.mediawiki.org/wiki/Help:Configuration_settings Configuration settings list]&lt;br /&gt;
* [http://www.mediawiki.org/wiki/Help:FAQ MediaWiki FAQ]&lt;br /&gt;
* [http://mail.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list]&lt;br /&gt;
&lt;br /&gt;
== When You Travel ==&lt;br /&gt;
&lt;br /&gt;
[[Media:CBCB_Travel_Approval_may_2013.pdf‎]] basic travel approval form&lt;br /&gt;
&lt;br /&gt;
[[Media:Travel_101_cbcb_revised_4-2013.pdf]] for more detailed travel info&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:CBCB-quick-guide.pdf&amp;diff=8965</id>
		<title>File:CBCB-quick-guide.pdf</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:CBCB-quick-guide.pdf&amp;diff=8965"/>
		<updated>2014-10-03T02:01:45Z</updated>

		<summary type="html">&lt;p&gt;Mpop: Mpop uploaded a new version of &amp;amp;quot;File:CBCB-quick-guide.pdf&amp;amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:CBCB-quick-guide.pdf&amp;diff=8964</id>
		<title>File:CBCB-quick-guide.pdf</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:CBCB-quick-guide.pdf&amp;diff=8964"/>
		<updated>2014-10-03T02:00:59Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Getting_Started_in_CBCB&amp;diff=8963</id>
		<title>Getting Started in CBCB</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Getting_Started_in_CBCB&amp;diff=8963"/>
		<updated>2014-10-03T02:00:37Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Last update 9/11/13&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You may want to review the following:&lt;br /&gt;
&lt;br /&gt;
[[Media:CBCB-quick-guide.pdf]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Getting Building Access and Room Keys==&lt;br /&gt;
&lt;br /&gt;
CBCB is located on the 3rd floor of the Biomolecular Sciences Building, identified on campus maps as Building #296.  The building is secure; access is gained either by using your UM ID card or guest card, or by entering the 3-digit code of the person you wish to visit at the call box on the right side of the front door.&lt;br /&gt;
&lt;br /&gt;
Contact the Center Coordinator, Denise Cross, about gaining card access to the building.  She will need the following information:&lt;br /&gt;
*A notification email from your sponsor/adviser&lt;br /&gt;
*Your Name&lt;br /&gt;
*Your 9 Digit University ID number&lt;br /&gt;
*Your Contact email&lt;br /&gt;
&lt;br /&gt;
Along with your assigned space and phone numbers, the coordinator will send your information to UMIACS Coordinator Edna Walker, who will contact campus security to add you to their system.  &#039;&#039;Note: clearance usually takes a number of days, so contact the coordinator as soon as possible.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
If you prefer not to send your information through email, feel free to contact the coordinator in person.&lt;br /&gt;
&lt;br /&gt;
You must get your key from the coordinator in person.&lt;br /&gt;
&lt;br /&gt;
Denise Cross&amp;lt;br&amp;gt;&lt;br /&gt;
Room 3121&amp;lt;br&amp;gt;&lt;br /&gt;
Biomolecular Sciences Bldg #296&amp;lt;br&amp;gt;&lt;br /&gt;
301.405.5936&amp;lt;br&amp;gt;&lt;br /&gt;
dcross[at]umd.edu&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Research In Progress Seminars==&lt;br /&gt;
&lt;br /&gt;
These seminars are held throughout the year.  For information, go to [http://cbcb.umd.edu/node/18625 CBCB Research in Progress]&lt;br /&gt;
==Understanding the Layout of Available Resources==&lt;br /&gt;
When you first log into a server (e.g., flicker01@umiacs.umd.edu), you will probably be placed in one of the following personalized directories:&lt;br /&gt;
*/fs/wrenhome/yourUserName/&lt;br /&gt;
*/nfshomes/yourUserName/&lt;br /&gt;
nfshomes has a limit on available disk space (in the tens of megabytes), while wrenhome allows you more freedom. You should therefore use wrenhome for your personal work files, but be very mindful of how much space you are using and how much free space remains. For large amounts of data, you should use one of the shared storage volumes listed below (after first checking with your sponsor).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For a list of our Disk Storage and amount of available space left on each one, see [http://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage wiki.umiacs.umd.edu/cbcb-private/index.php/Storage]&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Configuring Your Home Directory and Shell ==&lt;br /&gt;
&lt;br /&gt;
We import common settings files that set our path variables to include commonly shared software repositories.&lt;br /&gt;
&lt;br /&gt;
As a start, add the following line to the top of the file called &amp;quot;.bashrc&amp;quot; located in your home directory (/nfshomes/username/):&lt;br /&gt;
&lt;br /&gt;
 . /fs/sz-user-supported/share/dotfiles/bashrc.cbcb&lt;br /&gt;
&lt;br /&gt;
This will import the common bashrc.cbcb file into your own bashrc file every time you log in.&lt;br /&gt;
&lt;br /&gt;
Now add this line to your &amp;quot;.bash_profile&amp;quot; file, also located in your home directory:&lt;br /&gt;
&lt;br /&gt;
 . ~/.bashrc&lt;br /&gt;
&lt;br /&gt;
This will import your personal bashrc file every time you log in. Now you should have access to most of the locally installed software like &amp;quot;blastall&amp;quot; and &amp;quot;AMOS.&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to add any additional commands to your bashrc file, such as setting your default text editor to &amp;quot;vim&amp;quot; or formatting the output of bash commands (e.g. &amp;quot;ls&amp;quot;), add the appropriate commands after the imported common files, as shown in this example:&lt;br /&gt;
&lt;br /&gt;
 . /fs/sz-user-supported/share/dotfiles/bashrc.cbcb&lt;br /&gt;
 &lt;br /&gt;
 alias vi=&#039;vim&#039;&lt;br /&gt;
&lt;br /&gt;
 alias ls=&#039;ls --color&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
If you decide you want to change the settings in the common bashrc.cbcb to better suit your personal needs, then please copy and paste its contents into your personal bashrc file. &#039;&#039;&#039;Do not modify the common bashrc.cbcb file as it will affect everyone&#039;s environment.&#039;&#039;&#039; Also check back periodically as people may add common paths for new software.&lt;br /&gt;
&lt;br /&gt;
If you have any other problems, contact staff [at] umiacs.umd.edu or your PI. &amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
More resources can be found at [https://wiki.umiacs.umd.edu/umiacs/index.php/GettingStarted umiacs wiki]&lt;br /&gt;
&lt;br /&gt;
== Printing ==&lt;br /&gt;
Go to the umiacs wiki to find [https://wiki.umiacs.umd.edu/umiacs/index.php/Printing system-specific guides for printing], and be sure to [https://wiki.umiacs.umd.edu/umiacs/index.php/PrinterQueueNaming add &#039;nb&#039; to the end of your print queue] to avoid wasting paper printing banners.&lt;br /&gt;
&lt;br /&gt;
== Using the Wiki ==&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet Wiki Formatting Cheatsheet]&lt;br /&gt;
* [http://www.mediawiki.org/wiki/Help:Configuration_settings Configuration settings list]&lt;br /&gt;
* [http://www.mediawiki.org/wiki/Help:FAQ MediaWiki FAQ]&lt;br /&gt;
* [http://mail.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list]&lt;br /&gt;
&lt;br /&gt;
== When You Travel ==&lt;br /&gt;
&lt;br /&gt;
[[Media:CBCB_Travel_Approval_may_2013.pdf‎]] basic travel approval form&lt;br /&gt;
&lt;br /&gt;
[[Media:Travel_101_cbcb_revised_4-2013.pdf]] for more detailed travel info&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=8962</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=8962"/>
		<updated>2014-10-03T01:57:44Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* People */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== IMPORTANT (Read First) ==&lt;br /&gt;
The CBCB computational infrastructure is a shared resource, and we all need to pitch in to keep it working well for everyone.  Most importantly, we need to ensure that our disk space and computational resources are used responsibly.  Disk space in particular is a valuable commodity, so please pay attention to the following:&lt;br /&gt;
* There are three types of disk space available (a full list of volumes is available at [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage CBCB Storage]) :&lt;br /&gt;
** Local harddrives (usually mounted as /scratch on the machines that have this resource).  These are not backed up, are quite fast (they live physically close to the processor), but can only be &#039;seen&#039; by the machine where they are mounted and, thus, require data staging in/out (which can take a while)&lt;br /&gt;
** Shared 3Par storage (/fs/szasmg*, /fs/szdata/*, etc.).  This is very fast and very expensive disk, and thus a limited resource. Please use this space only to store data temporarily, while you are running analyses on it.  As a rule of thumb, if a file or collection of files of any considerable size has lived on this space for more than 1-2 weeks, it should probably be moved to the attic space (see below)&lt;br /&gt;
** Attic storage (/fs/szattic*).  This is cheaper and ample, but slow and brittle storage.  Your data-sets should primarily live here.  Due to its brittleness, the IT department does not recommend running analyses directly on this volume; instead, copy the files to a local hard drive or the 3Par, and copy the results back when done.&lt;br /&gt;
* You should remove any temporary results you don&#039;t need in the long term as soon as you&#039;ve generated them, and compress all of the large files.  Bzip2 compresses better than gzip, but either should dramatically reduce the space required, especially for text files such as fasta or fastq.&lt;br /&gt;
* For any file stored on the 3Par, especially in /fs/szscratch, make sure it is owned by the cbcb group and has group write permission.  This will allow your colleagues to remove files in case the disk runs out of space while you are, for example, on vacation (in which case you shouldn&#039;t have any major files sitting around on the 3Par anyway).&lt;br /&gt;
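The housekeeping steps above can be sketched as shell commands; the group name &amp;quot;cbcb&amp;quot; and the bzip2-over-gzip preference come from this page, while the directory and file names below are placeholders for illustration:&lt;br /&gt;

```shell
# Sketch of the cleanup rules above; directory and file names are
# placeholders, run the real thing inside your project space on the 3Par.
mkdir -p demo_proj
printf '>seq1\nACGTACGTACGT\n' > demo_proj/reads.fasta
# Compress large text files; bzip2 usually beats gzip on fasta/fastq,
# but gzip is shown here as the more universally installed tool:
gzip -c demo_proj/reads.fasta > demo_proj/reads.fasta.gz
rm demo_proj/reads.fasta
# Give the group write access so colleagues can clean up if needed:
# chgrp -R cbcb demo_proj   # requires membership in the cbcb group
chmod -R g+w demo_proj
```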
&lt;br /&gt;
For more information on the CBCB resources see [[Getting Started in CBCB]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Seminars ==&lt;br /&gt;
* [http://www.cbcb.umd.edu/seminars Regular CBCB seminars (during academic year)] &amp;lt;br&amp;gt;&lt;br /&gt;
* [[Cbcb:Works-In-Progress]] - Works in progress seminar schedule (Summer 2008) &amp;lt;br&amp;gt;&lt;br /&gt;
* [[short_read_sequencing|Short read sequencing Meeting]] (Fridays at 3pm)&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
* [[Project:Pop-Lab|Pop-Lab]]&lt;br /&gt;
* [[Project:Hcbravo-lab|HCBravo Lab]]&lt;br /&gt;
* [[Project:Cloud-Computing|Cloud Computing]]&lt;br /&gt;
* [[Project:SummerInternships|Summer Internship Projects]]&lt;br /&gt;
* [[Metagenomics Reading Group (Wed 2pm)]]&lt;br /&gt;
&lt;br /&gt;
== People ==&lt;br /&gt;
 &lt;br /&gt;
* [[User:ayres|Daniel Ayres]]&lt;br /&gt;
* [[User:pknut777|Adam Bazinet]] &lt;br /&gt;
* [[User:irina|Irina Astrovskaya]] &lt;br /&gt;
* [[User:jpaulson|Joseph Paulson]]  &lt;br /&gt;
* [[User:mpop|Mihai Pop]] &lt;br /&gt;
* [[User:nelsayed|Najib El-Sayed]]&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
If you have just received a new UMIACS account through CBCB, follow the pages below for the basic information you&#039;ll need to start working:&amp;lt;br&amp;gt;&lt;br /&gt;
*[[Getting Started in CBCB]]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage CBCB Storage]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Compute CBCB Computers]&lt;br /&gt;
*[[Communal Software]]&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=8961</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=8961"/>
		<updated>2014-10-03T01:56:49Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Projects */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== IMPORTANT (Read First) ==&lt;br /&gt;
The CBCB computational infrastructure is a shared resource, and we all need to pitch in to keep it working well for everyone.  Most importantly, we need to ensure that our disk space and computational resources are used responsibly.  Disk space in particular is a valuable commodity, so please pay attention to the following:&lt;br /&gt;
* There are three types of disk space available (a full list of volumes is available at [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage CBCB Storage]) :&lt;br /&gt;
** Local harddrives (usually mounted as /scratch on the machines that have this resource).  These are not backed up, are quite fast (they live physically close to the processor), but can only be &#039;seen&#039; by the machine where they are mounted and, thus, require data staging in/out (which can take a while)&lt;br /&gt;
** Shared 3Par storage (/fs/szasmg*, /fs/szdata/*, etc.).  This is very fast and very expensive disk, and thus a limited resource. Please use this space only to store data temporarily, while you are running analyses on it.  As a rule of thumb, if a file or collection of files of any considerable size has lived on this space for more than 1-2 weeks, it should probably be moved to the attic space (see below)&lt;br /&gt;
** Attic storage (/fs/szattic*).  This is cheaper and ample, but slow and brittle storage.  Your data-sets should primarily live here.  Due to its brittleness, the IT department does not recommend running analyses directly on this volume; instead, copy the files to a local hard drive or the 3Par, and copy the results back when done.&lt;br /&gt;
* You should remove any temporary results you don&#039;t need in the long term as soon as you&#039;ve generated them, and compress all of the large files.  Bzip2 compresses better than gzip, but either should dramatically reduce the space required, especially for text files such as fasta or fastq.&lt;br /&gt;
* For any file stored on the 3Par, especially in /fs/szscratch, make sure it is owned by the cbcb group and has group write permission.  This will allow your colleagues to remove files in case the disk runs out of space while you are, for example, on vacation (in which case you shouldn&#039;t have any major files sitting around on the 3Par anyway).&lt;br /&gt;
&lt;br /&gt;
For more information on the CBCB resources see [[Getting Started in CBCB]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Seminars ==&lt;br /&gt;
* [http://www.cbcb.umd.edu/seminars Regular CBCB seminars (during academic year)] &amp;lt;br&amp;gt;&lt;br /&gt;
* [[Cbcb:Works-In-Progress]] - Works in progress seminar schedule (Summer 2008) &amp;lt;br&amp;gt;&lt;br /&gt;
* [[short_read_sequencing|Short read sequencing Meeting]] (Fridays at 3pm)&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
* [[Project:Pop-Lab|Pop-Lab]]&lt;br /&gt;
* [[Project:Hcbravo-lab|HCBravo Lab]]&lt;br /&gt;
* [[Project:Cloud-Computing|Cloud Computing]]&lt;br /&gt;
* [[Project:SummerInternships|Summer Internship Projects]]&lt;br /&gt;
* [[Metagenomics Reading Group (Wed 2pm)]]&lt;br /&gt;
&lt;br /&gt;
== People ==&lt;br /&gt;
 &lt;br /&gt;
* [[User:ayres|Daniel Ayres]]&lt;br /&gt;
* [[User:pknut777|Adam Bazinet]] &lt;br /&gt;
* [[User:amp|Adam M Phillippy]] &lt;br /&gt;
* [[User:adelcher|Arthur L. Delcher]] &lt;br /&gt;
* [[User:carlk|Carl Kingsford]]  &lt;br /&gt;
* [[User:dpuiu|Daniela Puiu]] &lt;br /&gt;
* [[User:dsommer|Dan Sommer]] &lt;br /&gt;
* [[User:gpertea|Geo Pertea]] &lt;br /&gt;
* [[User:irina|Irina Astrovskaya]] &lt;br /&gt;
* [[User:jeallen|Jonathan Edward Allen]] &lt;br /&gt;
* [[User:ayanbule|Kunmi Ayanbule]]&lt;br /&gt;
* [[User:mschatz|Michael Schatz]]&lt;br /&gt;
* [[User:jpaulson|Joseph Paulson]]  &lt;br /&gt;
* [[User:mpertea|Mihaela Pertea]] &lt;br /&gt;
* [[User:mpop|Mihai Pop]] &lt;br /&gt;
* [[User:nelsayed|Najib El-Sayed]] &lt;br /&gt;
* [[User:nedwards|Nathan Edwards]]&lt;br /&gt;
* [[User:niranjan|Niranjan Nagarajan]] &lt;br /&gt;
* [[User:saket|Saket Navlakha]]&lt;br /&gt;
* [[User:angiuoli|Samuel V Angiuoli]] &lt;br /&gt;
* [[User:salzberg|Steven Salzberg]]&lt;br /&gt;
* [[User:tgibbons | Ted Gibbons]]&lt;br /&gt;
* [[User:treangen | Todd J. Treangen]]&lt;br /&gt;
* [[User:whitej|James Robert White]]&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
If you have just received a new UMIACS account through CBCB, follow the pages below for the basic information you&#039;ll need to start working:&amp;lt;br&amp;gt;&lt;br /&gt;
*[[Getting Started in CBCB]]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage CBCB Storage]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Compute CBCB Computers]&lt;br /&gt;
*[[Communal Software]]&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Dpuiu_HTS&amp;diff=8923</id>
		<title>Dpuiu HTS</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Dpuiu_HTS&amp;diff=8923"/>
		<updated>2011-12-22T03:35:33Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Newbler */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
* [http://www.politigenomics.com/next-generation-sequencing-informatics Politigenomics source]&lt;br /&gt;
  Vendor:                       Roche                   Illumina                          ABI             &lt;br /&gt;
  Technology:                   454                     Solexa GA                        SOLiD&lt;br /&gt;
                        --------------------    --------------------------------     ------------------      &lt;br /&gt;
  Platform:             GS20    FLX     Ti      I       II      IIx     HiSeq2000    1       2       3&lt;br /&gt;
  Reads: (M)            0.5     0.5     1.25    28      100     150                  40      115     320&lt;br /&gt;
 &lt;br /&gt;
  Fragment&lt;br /&gt;
  Read length:          100     200     400     35      50      100     100          25      35      50&lt;br /&gt;
  Insert: 0 (unmated)&lt;br /&gt;
  Run time: (d)         0.25    0.3     0.4     3       3       5       8            6       5       8&lt;br /&gt;
  Yield: (Gb)           0.05    0.1     0.5     1       5       15      200          1       4       16&lt;br /&gt;
  Rate: (Gb/d)          0.2     0.33    1.25    0.33    1.67    3       25           0.34    1.6     2&lt;br /&gt;
  Images: (TB)          0.01    0.01    0.03    0.5     1.1     2.8                  1.8     2.5     1.9&lt;br /&gt;
  PA Disk: (GB)         3       3       15      175     300     300                  300     750     1200&lt;br /&gt;
  PA CPU: (hr)          10      140     220     100     70      NA                   NA      NA      NA&lt;br /&gt;
  SRA: (GB)             0.5     1       4       30      50      2.5                  100     140     600&lt;br /&gt;
 &lt;br /&gt;
  Paired-end&lt;br /&gt;
  Read length:                  200     400     2×35    2×50    2×100                2×25    2×35    2×50&lt;br /&gt;
  Insert: (kb)                  3.5     3.5     0.2     0.2     0.2                  3       3       3&lt;br /&gt;
  Run time: (d)                 0.3     0.4     6       10      10                   12      10      16&lt;br /&gt;
  Yield: (Gb)                   0.1     0.5     2       9       30                   2       8       32&lt;br /&gt;
  Rate: (Gb/d)                  0.33    1.25    0.33    1.67    3                    0.34    1.6     2&lt;br /&gt;
  Images: (TB)                  0.01    0.03    1       2.2     5.6                  3.6     5       3.8&lt;br /&gt;
  PA Disk: (GB)                 3       15      350     500     500                  600     1500    2400&lt;br /&gt;
  PA CPU: (hr)                  140     220     160     120     NA                   NA      NA      NA&lt;br /&gt;
  SRA: (GB)                     1       4       60      100     3.5                  200     280     1200&lt;br /&gt;
 &lt;br /&gt;
  Mate-pair             &lt;br /&gt;
&lt;br /&gt;
* [http://marray.economia.unimi.it/2009/material/lectures/L5_HTS_Overview.pdf Other ..]&lt;br /&gt;
&lt;br /&gt;
  Vendor:                       Roche                   Illumina                 ABI                  &lt;br /&gt;
  Technology:                   454                     Solexa GA               SOLiD                  &lt;br /&gt;
                        --------------------    --------------------    ------------------     &lt;br /&gt;
  Year                          2005                     2007                   2007                 &lt;br /&gt;
  Amplification             emulsion PCR               bridge PCR          emulsion PCR           &lt;br /&gt;
  Sequencing               pyrosequencing       sequencing-by-synthesis  ligase-based sequencing  &lt;br /&gt;
  Error rate/bp                  5%                       1%                   &amp;lt;0.1% &lt;br /&gt;
  DominantError         indels(esp homopolym)         substitutions          substitutions&lt;br /&gt;
&lt;br /&gt;
= Links =&lt;br /&gt;
&lt;br /&gt;
* [http://www.illumina.com/Documents/products/technotes/technote_denovo_assembly_ecoli.pdf Illumina technote]&lt;br /&gt;
* [http://www.illumina.com/Documents/products/technotes/technote_denovo_assembly.pdf Illumina technote]&lt;br /&gt;
* [https://fts.illumina.com/seos/1000/mpd/110120112b55535854860a599df1c97ab5a6a2c3 Illumina technote data] &lt;br /&gt;
* [http://samtools.sourceforge.net/SAM1.pdf SAM format]&lt;br /&gt;
* [http://samtools.sourceforge.net/ SAM tools]&lt;br /&gt;
* [http://sourceforge.net/projects/dnaa/files/ DNAA]&lt;br /&gt;
* [http://www.cbcb.umd.edu/research/complexity/graphs/ genome complexity graphs]&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2639302/?tool=pmcentrez CA BOG paper]&lt;br /&gt;
&#039;&#039;Modified Celera Assembler: the Celera Assembler software has modules for successive phases of assembly: pairwise overlap detection; initial ungapped multiple sequence alignments called unitigs; unitig consensus calculation; combination of unitigs with mate constraints to form contigs and scaffolds, which are ungapped and gapped multiple sequence alignments, respectively; and finally, scaffold consensus determination (Myers et al., 2000). Our approach to hybrid data assembly reuses the Celera Assembler scaffold and consensus modules. Independent of the hybrid problem, the scaffold module was revised to recover trimmed base calls confirmed by co-locating reads, and the consensus module was revised to determine alternate consensus sequences in regions of apparent polymorphism (Denisov et al., 2008). Our analysis narrowed the source of hybrid assembly problems to the overlap and unitig stages.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;For speed, Celera Assembler relies on short exact matches between reads as seeds for overlap detection. Its exact-match algorithms were sensitive to the different proclivities for stutter observed between platforms. Stutter, that is, incorrect determination of the number of bases in homopolymer (single-letter) runs, is more prevalent in pyro reads than Sanger reads. We therefore modified the software to search for matches in compressed sequence, in which all single-letter repeats are reduced to a single base. The uncompressed sequence is consulted later before the seeds become overlaps.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Celera Assembler was sensitive to the different average read lengths between platforms. The shorter reads are more likely to be entirely contained within genomic repeats. Over-collapsed alignments of short repeat reads induce true and false overlaps to the interior of longer reads. Where the longer reads extend beyond the genomic repeats, they do not all overlap each other. The result is short reads with containment overlaps to multiple long reads that do not overlap each other. These overlap tangles were triggering Celera heuristics designed to detect mis-assembly, leading to unnecessarily short contigs.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Celera Assembler was also sensitive to the higher coverage typical of lower cost pyrosequencing. Higher coverage leads to increased collisions of reads with exactly the same prefix sequence. The assembler&#039;s arbitrary tie-breaking heuristics, sufficient for infrequent ties, had the potential to lead the assembler away from the global optimum in hybrid data. To address these problems we developed an aggressive approach to unitig construction that builds unitigs in greedy fashion, always following a read&#039;s best overlap (by an appropriate criterion), and ignoring contained reads at first. The aggressive unitigs initially incorporate mistakes that, ideally, are caught and corrected later by pattern analysis applied to best overlaps and mate constraints.&lt;br /&gt;
High coverage could also increase the number of spurs, that is, reads with invalid sequence at one end. These seemed to contribute to fractured unitigs on hybrid data. We realized the software could turn higher coverage to its advantage by carefully trimming reads of unconfirmed sequence.&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2532726/?tool=pubmed Substantial biases in ultra-short read data sets from high-throughput DNA sequencing]&lt;br /&gt;
* [http://genomebiology.com/2009/10/3/R32 Evaluation of next generation sequencing platforms for population targeted sequencing studies]&lt;br /&gt;
* [http://nar.oxfordjournals.org/content/38/6/1767.full The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants]&lt;br /&gt;
&lt;br /&gt;
= De novo assemblers =&lt;br /&gt;
&lt;br /&gt;
Performance measures&lt;br /&gt;
* ctg/scf/singleton stats&lt;br /&gt;
&lt;br /&gt;
== ABYSS ==&lt;br /&gt;
&lt;br /&gt;
* [http://www.bcgsc.ca/platform/bioinfo/software/abyss Website]&lt;br /&gt;
* [[Media:ABYSS.README.txt|README]]&lt;br /&gt;
* Example:&lt;br /&gt;
&lt;br /&gt;
  # single reads; one library&lt;br /&gt;
  # getting the best kmer size&lt;br /&gt;
  ABYSS -k31 frag_12.fastq -o contigs.fa &amp;gt; ABYSS.log&lt;br /&gt;
  perl -e &#039;foreach $k (21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51) { print &amp;quot;ABYSS -k$k frag_12.fastq -o frag.K$k.fa\n&amp;quot; }&#039;&lt;br /&gt;
  perl -e &#039;foreach $k (21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51) { print &amp;quot;lenseq frag.cor.K$k.fa | getSummary.pl -nh -i 1 -t K$k\n&amp;quot; }&#039; &amp;gt; ABYSS.summary&lt;br /&gt;
&lt;br /&gt;
  #paired &amp;amp; single reads&lt;br /&gt;
  abyss-pe -j2 k=31 n=10 name=ecoli lib=&#039;lib1 lib2&#039; lib1=&#039;lib1_1.fa lib1_2.fa&#039; lib2=&#039;lib2_1.fa lib2_2.fa&#039; se=&#039;se1.fa se2.fa&#039;&lt;br /&gt;
&lt;br /&gt;
  ##############################&lt;br /&gt;
 &lt;br /&gt;
  #paired reads; multiple libraries; n: minimum number of links between 2 contigs in a scaffold&lt;br /&gt;
  abyss-pe k=31 n=5 name=genome lib=&#039;frag short&#039; frag=&#039;frag_1.fastq frag_2.fastq&#039; short=&#039;short_1.fastq short_2.fastq&#039; aligner=bowtie &amp;gt;&amp;amp; abyss-pe.log&lt;br /&gt;
&lt;br /&gt;
  #paired reads; multiple libraries; n: minimum number of links between 2 contigs in a scaffold&lt;br /&gt;
  abyss-pe k=31 n=5 name=genome lib=&#039;frag short&#039; frag=frag_12.cor.fastq short=short_12.cor.fastq aligner=bowtie &amp;gt;&amp;amp; abyss-pe.log&lt;br /&gt;
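The perl one-liners above just generate one ABYSS command per odd kmer value; the same sweep can be sketched as a plain bash loop (binary and input file names are the ones from the example above):&lt;br /&gt;

```shell
# Sketch of the kmer sweep above as a bash loop; it only prints the
# ABYSS commands (pipe the output to sh, or to a batch scheduler, to
# actually run them).
for k in 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51; do
  echo "ABYSS -k$k frag_12.fastq -o frag.K$k.fa"
done
```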
&lt;br /&gt;
== ALLPATHS-LG ==&lt;br /&gt;
&lt;br /&gt;
* [http://www.broadinstitute.org/software/allpaths-lg/blog/ Website]&lt;br /&gt;
* [http://www.broadinstitute.org/science/programs/genome-biology/crd Download]&lt;br /&gt;
* [ftp://ftp.broadinstitute.org/pub/crd/ALLPATHS/Release-LG/latest_source_code/LATEST_VERSION.tar.gz Latest version download]&lt;br /&gt;
* [http://www.pnas.org/content/early/2010/12/20/1017351108.full.pdf+html PNAS article]&lt;br /&gt;
  The resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome. &lt;br /&gt;
  In particular, the base accuracy is high (≥99.95%) and &lt;br /&gt;
  the scaffold sizes (N50 size = 11.5 Mb for human and 7.2 Mb for mouse) approach those obtained with capillary-based sequencing&lt;br /&gt;
&lt;br /&gt;
  Good assemblies definition: ctgLen=~20–100 kb ; scfLen=~10 Mb&lt;br /&gt;
&lt;br /&gt;
  Libraries,insert_types*  Fragment_size(bp)  Read_length(bp)  Sequence_coverage(×)  Required  Protocols&lt;br /&gt;
  Fragment                 180                ≥100             45                    Yes       existing ; try to improve high GC-content reg &lt;br /&gt;
  ShortJump                3,000              ≥100             45                    Yes       Illumina &lt;br /&gt;
  LongJump                 6,000              ≥100             5                     No        SOLiD EcoP15I based &lt;br /&gt;
  FosmidJump               40,000             ≥26              1                     No        “ShARC” and “Fosill” developed by BROAD&lt;br /&gt;
&lt;br /&gt;
* [ftp://ftp.broadinstitute.org/pub/crd/ALLPATHS/Release-LG/AllPaths-LG_Manual.pdf Manual]&lt;br /&gt;
* Data tested on:&lt;br /&gt;
 &lt;br /&gt;
* Compile:&lt;br /&gt;
  #wget ftp://ftp.broadinstitute.org/pub/crd/ALLPATHS/Release-LG/LATEST_VERSION.tar.gz&lt;br /&gt;
  wget ftp://ftp.broadinstitute.org/pub/crd/ALLPATHS/Release-LG/latest_source_code/LATEST_VERSION.tar.gz&lt;br /&gt;
  tar xzvf LATEST_VERSION.tar.gz&lt;br /&gt;
  cd allpathslg-*&lt;br /&gt;
  autoconf &lt;br /&gt;
  ./configure --with-boost=/fs/szdevel/core-cbcb-software/Linux-x86_64/  &lt;br /&gt;
  #ls -l /fs/szdevel/core-cbcb-software/Linux-x86_64/include/boost/&lt;br /&gt;
  #ll -l /fs/szdevel/core-cbcb-software/Linux-x86_64/lib/libboost_*&lt;br /&gt;
  make -j8 # use -j&amp;lt;n&amp;gt; to parallelize compilation&lt;br /&gt;
&lt;br /&gt;
* Compile issues : the static libs (.a) should be linked before the dynamic ones (-l)&lt;br /&gt;
  Fix: &lt;br /&gt;
  g++ -fopenmp -imacros config.h -Wextra -Wall -Wno-unused -ansi -pedantic -Wno-long-long -Wsign-promo -Woverloaded-virtual \&lt;br /&gt;
  -Wendif-labels -O3 -ggdb3  -ftemplate-depth-50 -Wno-deprecated -Wno-parentheses -fno-strict-aliasing -mieee-fp -iquote . \&lt;br /&gt;
  -o ErrorCorrectJumpNew -pthread ErrorCorrectJumpNew.o \&lt;br /&gt;
  -L/fs/szdevel/core-cbcb-software/Linux-x86_64//lib libAllPaths3.a -lboost_system -lboost_filesystem  \&lt;br /&gt;
  -pthread -Wl,-rpath -Wl,/fs/szdevel/core-cbcb-software/Linux-x86_64//lib &lt;br /&gt;
 &lt;br /&gt;
  -o CreateLookupTab -pthread CreateLookupTab.o \&lt;br /&gt;
  -o SamplePairedReadStatsOld -pthread SamplePairedReadStatsOld.o \&lt;br /&gt;
  -o SamplePairedReadStats -pthread SamplePairedReadStats.o \&lt;br /&gt;
  -o ErrorCorrectJump -pthread ErrorCorrectJump.o \&lt;br /&gt;
&lt;br /&gt;
 g++ -fopenmp -Wextra -Wall -Wno-unused -Wno-deprecated -ansi -pedantic -Wno-long-long -Wsign-promo -Woverloaded-virtual \&lt;br /&gt;
  -Wendif-labels -Wno-empty-body -Wno-parentheses -Wno-array-bounds -fno-nonansi-builtins -ftemplate-depth-50 -mieee-fp -fno-strict-aliasing -iquote . -imacros config.h -ggdb3  \&lt;br /&gt;
  -o ErrorCorrectJumpNew \&lt;br /&gt;
  -L/fs/szdevel/core-cbcb-software/Linux-x86_64//lib -R/fs/szdevel/core-cbcb-software/Linux-x86_64//lib \&lt;br /&gt;
  -pthread ErrorCorrectJumpNew.o libAllPathsLG.a -lboost_filesystem &lt;br /&gt;
&lt;br /&gt;
* Configuration:&lt;br /&gt;
  set path=($path /fs/szdevel/core-cbcb-software/Linux-x86_64/packages/allpaths3/bin/)&lt;br /&gt;
  limit stacksize 100000&lt;br /&gt;
&lt;br /&gt;
* Sequence formatting: sort and add /[12] suffix if needed&lt;br /&gt;
  [http://picard.sourceforge.net/ Picard]&lt;br /&gt;
  java -jar /nfshomes/dpuiu/core-cbcb-software/Linux-x86_64/packages/picard-tools/FastqToSam.jar SAMPLE_NAME=in_12 QUALITY_FORMAT=Standard FASTQ=in_1.fastq FASTQ2=in_2.fastq OUTPUT=out_12.bam &lt;br /&gt;
  java -jar /nfshomes/dpuiu/core-cbcb-software/Linux-x86_64/packages/picard-tools/SamToFastq.jar                                           INPUT=in_12.bam                    FASTQ=out_1.fastq SECOND_END_FASTQ=out_2.fastq&lt;br /&gt;
&lt;br /&gt;
* in_groups.csv &amp;amp; in_libs.csv files required &lt;br /&gt;
&lt;br /&gt;
  echo group_name, library_name, file_name &amp;gt;! in_groups.csv&lt;br /&gt;
  echo frag,  frag,  chr14_fragment_12.bam &amp;gt;&amp;gt; in_groups.csv&lt;br /&gt;
  echo short,  short, chr14_shortjump_12.bam &amp;gt;&amp;gt; in_groups.csv&lt;br /&gt;
  echo long,  long,  chr14_longjump_12.bam &amp;gt;&amp;gt; in_groups.csv&lt;br /&gt;
&lt;br /&gt;
  echo  library_name, project_name, organism_name, type,  paired, chr14_fragment_size, chr14_fragment_stddev, insert_size, insert_stddev, read_orientation, genomic_start, genomic_end &amp;gt;! in_libs.csv&lt;br /&gt;
  echo frag,    genome,       genome,        fragment, 1, 180, 20, , , inward, 0, 0 &amp;gt;&amp;gt; in_libs.csv&lt;br /&gt;
  echo short,   genome,       genome,        jumping,  1, , , 3000, 300, outward, 20, 81 &amp;gt;&amp;gt; in_libs.csv&lt;br /&gt;
  echo long,    genome,       genome,        jumping,  1, , , 35000, 3500, inward, 20, 56 &amp;gt;&amp;gt; in_libs.csv&lt;br /&gt;
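The redirect syntax in the echo lines above is csh-specific; a portable sh sketch producing the same in_groups.csv (file names taken from the example above) could look like:&lt;br /&gt;

```shell
# Sketch: build the same in_groups.csv with plain sh; the echo lines
# above rely on csh redirect syntax. File names come from the example.
printf '%s\n' \
  'group_name, library_name, file_name' \
  'frag,  frag,  chr14_fragment_12.bam' \
  'short, short, chr14_shortjump_12.bam' \
  'long,  long,  chr14_longjump_12.bam' > in_groups.csv
cat in_groups.csv
```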
&lt;br /&gt;
* prepare data&lt;br /&gt;
  BAMsToAllPathsInputs.pl DATA_DIR=$PWD PLOIDY=2 HOSTS=16 GENOME_SIZE=107349540 PICARD_TOOLS_DIR=/nfshomes/dpuiu/core-cbcb-software/Linux-x86_64/packages/picard-tools/&lt;br /&gt;
  or &lt;br /&gt;
  Allpaths-LG-preprocess.amos &lt;br /&gt;
&lt;br /&gt;
* run assembler&lt;br /&gt;
  (time RunAllPaths3G PRE=$PWD REFERENCE_NAME=. DATA_SUBDIR=. RUN=allpaths SUBDIR=run ERROR_CORRECTION=True ) &amp;gt;&amp;amp; RunAllPaths3G.log&lt;br /&gt;
  or&lt;br /&gt;
  Allpaths-LG-.amos&lt;br /&gt;
&lt;br /&gt;
* prepare data and run assembler in one subdirectory&lt;br /&gt;
  ~/bin/Allpaths-LG/Allpaths-LG-preprocess.amos -D PLOIDY=1 -D DATA_DIR=$PWD/mydata&lt;br /&gt;
  RunAllPaths3G PRE=$PWD REFERENCE_NAME=. DATA_SUBDIR=mydata RUN=RunAllPaths3G SUBDIR=run  &amp;gt;&amp;amp;! mydata.log&lt;br /&gt;
&lt;br /&gt;
* process run log&lt;br /&gt;
  cat RunAllPaths3G.log | grep RUN | p &#039;next unless (/^\[/ and $F[1]=~/^\w+/);shift @F; print join &amp;quot; &amp;quot;,@F; print &amp;quot;\n&amp;quot;;&#039; | sed -f RUN.sed | p &#039;print $.*10,&amp;quot;: &amp;quot;; print &amp;quot;\$(BINDIR)/&amp;quot; unless(/^ln/); print $_;&#039; &amp;gt; RUN.amos&lt;br /&gt;
  or &lt;br /&gt;
  allpathsLogFilter.pl&lt;br /&gt;
 &lt;br /&gt;
 Example: Ecoli: 0.05 simulated error ; 3 libs 50X,50X,5X&lt;br /&gt;
   step 2-30,32-36  : read error correction ; should run only once&lt;br /&gt;
 &lt;br /&gt;
   step program                     start     duration(5+ min)&lt;br /&gt;
     1  RunAllPaths3G               17:36:59  &lt;br /&gt;
 &lt;br /&gt;
     2  RemoveDodgyReads            17:37:03  &lt;br /&gt;
     3  PreCorrect                  17:38:08  9 &lt;br /&gt;
     4  FindErrors                  17:47:24  30   !!! most time consuming: 50% of the total time&lt;br /&gt;
     5  CommonPather                18:27:59  11&lt;br /&gt;
     6  MakeRcDb                    18:38:09  &lt;br /&gt;
     7  Unipather                   18:42:05  &lt;br /&gt;
     8  CommonPather                18:43:54  &lt;br /&gt;
     9  MakeRcDb                    18:48:50  &lt;br /&gt;
    10  FillFragments               18:50:09  &lt;br /&gt;
    11  CommonPather                18:51:07  &lt;br /&gt;
    12  MakeRcDb                    18:54:43  &lt;br /&gt;
    13  Unipather                   18:55:08  &lt;br /&gt;
    14  CloseUnipathGaps            18:55:29  &lt;br /&gt;
    15  LittleHelpsBig              18:58:23  &lt;br /&gt;
    16  ShaveUnipathGraph           19:00:19  &lt;br /&gt;
    17  ReplacePairsStats           19:01:44  &lt;br /&gt;
    18  RemoveDodgyReads            19:01:48  &lt;br /&gt;
    19  SamplePairedReadStats       19:02:48  &lt;br /&gt;
    20  UnipathPatcher              19:03:45  &lt;br /&gt;
    21  CommonPather                19:06:42  &lt;br /&gt;
    22  MakeRcDb                    19:10:40  &lt;br /&gt;
    23  Unipather                   19:10:41  &lt;br /&gt;
    24  FilterPrunedReads           19:10:43  &lt;br /&gt;
    25  CreateLookupTab             19:10:54  &lt;br /&gt;
    26  ErrorCorrectJump            19:11:41  &lt;br /&gt;
    27  SplitUnibases               19:12:26  &lt;br /&gt;
    28  MergeReadSets               19:12:28    &lt;br /&gt;
    29  GenerateLengthsFile         19:12:42  &lt;br /&gt;
    30  MakeRcDb                    19:12:52  &lt;br /&gt;
    31  MergeReadSets               19:13:09  &lt;br /&gt;
    32  UnibaseCopyNumber2          19:13:19  &lt;br /&gt;
    33  UnipathLocs3G               19:15:20  &lt;br /&gt;
    34  RemoveDodgyReads            19:15:38  &lt;br /&gt;
    35  ErrorCorrectJump            19:15:46  &lt;br /&gt;
    36  MergeReadSets               19:16:02  &lt;br /&gt;
 &lt;br /&gt;
    37  BuildUnipathLinkGraphs3G    19:16:05  &lt;br /&gt;
    38  SelectSeeds                 19:16:23  &lt;br /&gt;
    39  LocalizeReads3G             19:16:36  9&lt;br /&gt;
    40  MergeNeighborhoods1         19:25:30  &lt;br /&gt;
    41  MergeNeighborhoods2         19:26:48  &lt;br /&gt;
    42  MergeNeighborhoods3         19:26:50  &lt;br /&gt;
    43  DumpHyper                   19:27:06  &lt;br /&gt;
    44  RecoverUnipaths             19:27:12  &lt;br /&gt;
    45  FlattenHKP                  19:27:54  &lt;br /&gt;
    46  AlignPairsToFasta           19:27:56  &lt;br /&gt;
    47  RemoveHighCNAligns          19:28:12  &lt;br /&gt;
    48  MakeRcDb                    19:29:02  &lt;br /&gt;
    49  MakeScaffolds3G             19:29:06  &lt;br /&gt;
    50  PostPatcher                 19:29:36  &lt;br /&gt;
    51  FixSomeIndels               19:31:35  &lt;br /&gt;
    52  MakeReadLocs                19:37:53  &lt;br /&gt;
    53  ParseAssemblyReport         19:38:16&lt;br /&gt;
&lt;br /&gt;
* getting the corrected reads &lt;br /&gt;
  Allpaths-LG-postprocess.amos&lt;br /&gt;
  ...&lt;br /&gt;
  Fastb2Fasta IN=jump_reads_ec.fastb OUT=jump_reads_ec.fasta BREAKCOL=101&lt;br /&gt;
  ...&lt;br /&gt;
&lt;br /&gt;
  Example: S. aureus: 0.00 vs 0.02 simulated error; 4 libs: 180_45X,3000_45X,6000_5x,40000_1x&lt;br /&gt;
&lt;br /&gt;
  more /fs/szattic-asmg5/dpuiu/HTS/Staphylococcus_aureus/Illumina.sim.180_45X.3000_45X.6000_5x.40000_1x/allpathsCor/README&lt;br /&gt;
  orig(e=0.00;0.02)                   len       allpathsCor(e=0.00;0.02)                        len:min;max      orientation&lt;br /&gt;
  ---------------------------------------------------------------------------------------------------------- &lt;br /&gt;
  frag/1                    647052    100       filled_reads_filt.fasta   645445   644487       120  240         inward&lt;br /&gt;
  frag                      1294104   100       frag_reads_corr.fasta     1294093  1293557      100  100         inward&lt;br /&gt;
  short                     1294104                                                                              outward-&amp;gt;inward&lt;br /&gt;
  medium                    143644                                                                               outward-&amp;gt;inward&lt;br /&gt;
  long                      28728     100       long_jump_reads_ec.fasta  28716    17260*       96   100         inward  &lt;br /&gt;
  ----------------------------------------------------------------------------------------------------------&lt;br /&gt;
  short+medium              1437748   100       jump_reads_ec.fasta       1437188  872734*      97   100     &lt;br /&gt;
  short+medium+long         1466476   100       scaffold_reads.fasta      1465904  889994       96   100    &lt;br /&gt;
  frag+short+medium         2731852   100       reads_merged_ec.fasta     2731281  2166291      97   100     &lt;br /&gt;
  frag+short+medium+long    2760580   100       all_reads.fasta           2178124  1593104      96   240&lt;br /&gt;
  ----------------------------------------------------------------------------------------------------------&lt;br /&gt;
&lt;br /&gt;
* Notes &lt;br /&gt;
  On a simulated data set, Ploidy = 1 or 2 did not help much&lt;br /&gt;
  the outward-facing jump pairs (outies) get reversed&lt;br /&gt;
&lt;br /&gt;
== CA ==&lt;br /&gt;
&lt;br /&gt;
* Version: 6.1&lt;br /&gt;
* [http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=Main_Page Web site]&lt;br /&gt;
* Input: 454, Sanger&lt;br /&gt;
* Illumina format (Sanger or Illumina FASTQ)&lt;br /&gt;
  fastqToCA                      -libraryname prefix -type sanger            -fastq $PWD/prefix.fastq                              &amp;gt; prefix.frg&lt;br /&gt;
  fastqToCA -insertsize 300  30  -libraryname prefix -type [sanger|illumina] -innie -fastq $PWD/prefix_1.fastq,$PWD/prefix_2.fastq &amp;gt; prefix_12.frg&lt;br /&gt;
  fastqToCA -insertsize 3000 300 -libraryname prefix -type [sanger|illumina] -outie -fastq $PWD/prefix_1.fastq,$PWD/prefix_2.fastq &amp;gt; prefix_12.frg # long inserts&lt;br /&gt;
* Command&lt;br /&gt;
  ~/bin/runCA -d . -s runCA.spec -p asm prefix*.frg &amp;gt;&amp;amp; runCA.log&lt;br /&gt;
* 454 &lt;br /&gt;
** must use bog/mer params, otherwise library insert sizes get re-estimated and portions of scaffolds are moved and rotated (compared to the reference sequence)&lt;br /&gt;
  cat runCA.454.spec&lt;br /&gt;
  unitigger =                      bog&lt;br /&gt;
  ovlOverlapper =                  mer&lt;br /&gt;
* Sanger/Illumina &lt;br /&gt;
  cat runCA.Sanger.spec&lt;br /&gt;
  unitigger =                      bog&lt;br /&gt;
  ovlOverlapper =                  ovl &lt;br /&gt;
* Large assemblies, run in parallel&lt;br /&gt;
* Repeats: usually end up as degenerates&lt;br /&gt;
* Breaks:  seem to be caused by short tandem repeats (the multiple contig alignments are adjacent)&lt;br /&gt;
&lt;br /&gt;
== CA ShortReads assembly issues ==&lt;br /&gt;
Reads (~100bp), high error&lt;br /&gt;
&lt;br /&gt;
* GC% bias, nonrandomness &amp;amp; redundancy in shortjump libs&lt;br /&gt;
  Use an estimated genome size&lt;br /&gt;
  runCA genomeSize=...&lt;br /&gt;
&lt;br /&gt;
  Redundancy/chimera detection script&lt;br /&gt;
  ~/dpuiu/bin/azimin/filter_library.sh&lt;br /&gt;
&lt;br /&gt;
* Set the kmer threshold manually to ~10 times the coverage&lt;br /&gt;
  74X cvg =&amp;gt; 750 kmer threshold&lt;br /&gt;
&lt;br /&gt;
  runCA obtMerThreshold=750 ovlMerThreshold=750&lt;br /&gt;
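  The rule of thumb above is simple arithmetic; a minimal sketch (the coverage value is illustrative):&lt;br /&gt;

```shell
# Derive obt/ovl mer thresholds from estimated read coverage,
# using the ~10x-coverage rule of thumb above (74X -> 740, rounded to 750 in the text)
CVG=74
THOLD=$(( CVG * 10 ))
echo "runCA obtMerThreshold=$THOLD ovlMerThreshold=$THOLD"
```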
&lt;br /&gt;
* Recompile the CA code&lt;br /&gt;
  cat AS_CGW/AS_CGW_dataTypes.h&lt;br /&gt;
   #define CGW_MIN_DISCRIMINATOR_UNIQUE_LENGTH 250&lt;br /&gt;
&lt;br /&gt;
* Avoid unitig breaks caused by fragment libraries (~180bp inserts)&lt;br /&gt;
&lt;br /&gt;
  cat AS_BOG_MateChecker.cc&lt;br /&gt;
  //  if (globalStats[lib].samples &amp;lt; 10 )&lt;br /&gt;
  if (globalStats[lib].samples &amp;lt; 10 || globalStats[lib].stddev &amp;lt; 100 || globalStats[lib].mean &amp;lt; 500)&lt;br /&gt;
&lt;br /&gt;
== CA VeryShortReads assembly issues ==&lt;br /&gt;
Reads &amp;lt;40bp, high error&lt;br /&gt;
&lt;br /&gt;
* Recompile the CA code&lt;br /&gt;
  cat AS_global.h&lt;br /&gt;
    #define AS_READ_MIN_LEN  32&lt;br /&gt;
    #define AS_OVERLAP_MIN_LEN  30&lt;br /&gt;
&lt;br /&gt;
=== Read correction ===&lt;br /&gt;
* quake: better&lt;br /&gt;
* OBT: takes too long, trims too much sequence&lt;br /&gt;
&lt;br /&gt;
=== Kmers ===&lt;br /&gt;
&lt;br /&gt;
* decrease kmer size 22-&amp;gt;?&lt;br /&gt;
&lt;br /&gt;
=== Unitigger ===&lt;br /&gt;
&lt;br /&gt;
* bog&lt;br /&gt;
* increase utgErrorRate 0.015-&amp;gt;?&lt;br /&gt;
* Issues&lt;br /&gt;
** Snps in the repetitive unitigs&lt;br /&gt;
   6 out of 529 unitigs don&#039;t align perfectly to genome =&amp;gt; misassemblies&lt;br /&gt;
   data : 75bp error-free simulated reads &lt;br /&gt;
   200bp lib: 25X cvg &lt;br /&gt;
   6K lib   : 14X&lt;br /&gt;
   command: ~/bin/runCA -s ./runCA.spec  doOverlapBasedTrimming=0 unitigger=bog utgErrorRate=0.0 cgwErrorRate=0.0 cnsErrorRate=0.0 ovlErrorRate=0.0 -d . -p test test*.frg &lt;br /&gt;
&lt;br /&gt;
  show-coords -L 200 test.utg.delta | egrep -n &#039;220002412853|220002412688&#039;&lt;br /&gt;
  736581   736875  |        1      295  |      295      295  |    99.32  |  4639675      295  |     0.01   100.00  | NC_000913      utg220002412853 [CONTAINS]&lt;br /&gt;
  736741   737024  |        1      284  |      284      284  |   100.00  |  4639675      284  |     0.01   100.00  | NC_000913      utg220002412688 [CONTAINS]&lt;br /&gt;
&lt;br /&gt;
  show-snps test.utg.filter-q.delta | grep 220002412853&lt;br /&gt;
  736815   A G   235       |       36       61  |    1    0  |  1  1  NC_000913 utg220002412853&lt;br /&gt;
  736851   A C   271       |       25       25  |    1    0  |  1  1  NC_000913 utg220002412853&lt;br /&gt;
&lt;br /&gt;
=== Cgw ===&lt;br /&gt;
  filter out single fragment unitigs (most of them) and fix aStat from &amp;lt;1 to 1&lt;br /&gt;
   ~/bin/filterUnitigs.sh&lt;br /&gt;
&lt;br /&gt;
== Edena ==&lt;br /&gt;
&lt;br /&gt;
* Version 2.1.1&lt;br /&gt;
* [http://www.genomic.ch/edena.php Web site] &lt;br /&gt;
* Input: Illumina ; unmated only !!!&lt;br /&gt;
* Format: seq, fastq &lt;br /&gt;
* Command:&lt;br /&gt;
  edena --readsFile  prefix.fastq --prefix prefix --minOverlap [20]   # overlap mode&lt;br /&gt;
  edena --edenaFile  prefix.ovl -p prefix  --overlapCutoff [22]       # assembly mode&lt;br /&gt;
* Output: no information about the position of the reads in assembly !!!&lt;br /&gt;
&lt;br /&gt;
== Newbler ==&lt;br /&gt;
* [http://contig.wordpress.com/] more info ...&lt;br /&gt;
* Input: 454, Sanger&lt;br /&gt;
* Format: sff, seq/qual &lt;br /&gt;
* The linker is automatically detected only in the sff files  &lt;br /&gt;
* The linker has to be &amp;quot;manually&amp;quot; removed from seq/qual files &lt;br /&gt;
* Q: How to assemble paired reads from seq files? &lt;br /&gt;
* A: Header line DIR=F, DIR=R : see &amp;quot;gatekeeper&amp;quot; command &lt;br /&gt;
* No way to specify the library insert/std !!!&lt;br /&gt;
* The input file order matters !!!&lt;br /&gt;
* Can perform slightly better if CA OBT trimmings are used (Ex:Ecoli 454)&lt;br /&gt;
* DeNovo commands: &lt;br /&gt;
  # sff input&lt;br /&gt;
  runAssembly -force -o dir *.sff&lt;br /&gt;
 &lt;br /&gt;
  # seq/qual files ; -p : explicit pair ends&lt;br /&gt;
  test -f prefix.seq&lt;br /&gt;
  test -f prefix.seq.qual&lt;br /&gt;
  runAssembly -o dir [-p] *.seq&lt;br /&gt;
 &lt;br /&gt;
  # step by step; multiple sequence types/insert size; order matters&lt;br /&gt;
  newAssembly .&lt;br /&gt;
  addRun -p Sanger.seq&lt;br /&gt;
  addRun -p 454.flx.sff&lt;br /&gt;
  addRun -p 454.tit.sff&lt;br /&gt;
  runProject .&lt;br /&gt;
 &lt;br /&gt;
* RefMapper commands:&lt;br /&gt;
  #sff input&lt;br /&gt;
  runMapping -o . prefix.1con *.sff&lt;br /&gt;
 &lt;br /&gt;
  # step by step&lt;br /&gt;
  newMapping .&lt;br /&gt;
  addRun -p Sanger.seq&lt;br /&gt;
  addRun -p 454.flx.sff&lt;br /&gt;
  addRun -p 454.tit.sff&lt;br /&gt;
  setRef ref.1con&lt;br /&gt;
  runProject .&lt;br /&gt;
&lt;br /&gt;
* paired reads: must have the same template/library; dir=F|R&lt;br /&gt;
  Example:  &lt;br /&gt;
  &amp;gt;SRR001355.3635.1a template=2020+2021 dir=F library=SRX000348 trim=20-117&lt;br /&gt;
  ...&lt;br /&gt;
  &amp;gt;SRR001355.3635.1b template=2020+2021 dir=R library=SRX000348 trim=1-130&lt;br /&gt;
  ...&lt;br /&gt;
* config file &lt;br /&gt;
  assembly/454AssemblyProject.xml&lt;br /&gt;
  Not tested: edit the params in this file and relaunch the assembly&lt;br /&gt;
* viewer&lt;br /&gt;
  toAmos -ace 454Contigs.ace -m prefix.mates -o prefix.afg&lt;br /&gt;
  bank-transact -c -z -b prefix.bnk -m prefix.afg&lt;br /&gt;
&lt;br /&gt;
== SGA ==&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/jts/sga/wiki Website]&lt;br /&gt;
* SGA (String Graph Assembler) is a de novo assembler for DNA sequence reads. It is based on Gene Myers&#039; string graph formulation of assembly and uses the FM-index/Burrows-Wheeler transform to efficiently find overlaps between sequence reads. &lt;br /&gt;
* [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/12/i367 Paper] Efficient construction of an assembly string graph using the FM-index : Bioinformatics 2010&lt;br /&gt;
* [http://people.pwf.cam.ac.uk/js779/ Author website]&lt;br /&gt;
* [https://github.com/jts/sga Git download]&lt;br /&gt;
* [[Media:sga.README|README]]&lt;br /&gt;
* [[Media:sga.HELP|HELP]]&lt;br /&gt;
&lt;br /&gt;
* Compile&lt;br /&gt;
  ./configure --with-sparsehash=/fs/szdevel/core-cbcb-software/Linux-x86_64/ --with-bamtools=/fs/szdevel/core-cbcb-software/Linux-x86_64/packages/bamtools/&lt;br /&gt;
&lt;br /&gt;
* Example&lt;br /&gt;
  #!/bin/tcsh -x&lt;br /&gt;
 &lt;br /&gt;
  setenv THREADS 20&lt;br /&gt;
 &lt;br /&gt;
  setenv MINQUAL 4&lt;br /&gt;
 &lt;br /&gt;
  setenv MINOVL 45&lt;br /&gt;
  setenv MAXERR 0.04 &lt;br /&gt;
  setenv K 31&lt;br /&gt;
  setenv ITTERATIONS 10&lt;br /&gt;
  &lt;br /&gt;
  setenv k 27&lt;br /&gt;
  setenv MINCVG 3&lt;br /&gt;
  ##################################################################&lt;br /&gt;
 &lt;br /&gt;
  rm frag.pp.* default-*&lt;br /&gt;
 &lt;br /&gt;
  sga preprocess -p 1 -f $MINQUAL frag_?.fastq &amp;gt; frag.pp.fa&lt;br /&gt;
  sga index      -t $THREADS frag.pp.fa &lt;br /&gt;
 &lt;br /&gt;
  sga correct    -k $K -i $ITTERATIONS -t $THREADS frag.pp.fa -e $MAXERR -m $MINOVL -o frag.pp.ec.fa&lt;br /&gt;
  sga index      -t $THREADS frag.pp.ec.fa &lt;br /&gt;
  &lt;br /&gt;
  sga filter     -k $k -x $MINCVG frag.pp.ec.fa &lt;br /&gt;
  sga overlap    -m $MINOVL -t $THREADS frag.pp.ec.filter.pass.fa &lt;br /&gt;
  &lt;br /&gt;
  sga assemble   frag.pp.ec.filter.pass.asqg.gz&lt;br /&gt;
&lt;br /&gt;
== SOAPdenovo ==&lt;br /&gt;
&lt;br /&gt;
* Version: 1.05&lt;br /&gt;
* [http://soap.genomics.org.cn/soapdenovo.html Web site] , [[Media:SOAPdenovo-V1.05.manual.txt|Manual]]&lt;br /&gt;
* Input: Illumina&lt;br /&gt;
* Format: seq, fastq (Sanger/Illumina scale: no conversion needed)&lt;br /&gt;
* Other BGI tools for read correction, snp calling, gap closing &lt;br /&gt;
&lt;br /&gt;
=== Issues ===&lt;br /&gt;
&lt;br /&gt;
* SOAPdenovo scaff: &lt;br /&gt;
** Floating point exception if the insert is too short &lt;br /&gt;
** No output difference if the insert size given is longer than the real estimate&lt;br /&gt;
&lt;br /&gt;
=== Options ===&lt;br /&gt;
&lt;br /&gt;
  soapdenovo all –s config_file –K 25 –R –o graph_prefix&lt;br /&gt;
&lt;br /&gt;
  soapdenovo pregraph –s config_file –K 25 [–R -d -p] –o graph_prefix&lt;br /&gt;
  soapdenovo contig –g graph_prefix [–R –M 1 -D]&lt;br /&gt;
  soapdenovo map –s config_file –g graph_prefix [-p]&lt;br /&gt;
  soapdenovo scaff –g graph_prefix [–F -u -G -p]&lt;br /&gt;
&lt;br /&gt;
  -s	STR	configuration file&lt;br /&gt;
  -o	STR	output graph file prefix&lt;br /&gt;
  -g	STR	input graph file prefix&lt;br /&gt;
  -K	INT	K-mer size [default 23, min 13, max 63]&lt;br /&gt;
  -p	INT	multithreads, n threads [default 8]&lt;br /&gt;
  -R		use reads to solve tiny repeats [default no]&lt;br /&gt;
  -d	INT	remove low-frequency K-mers with frequency no larger than [default 0] &lt;br /&gt;
  -D	INT	remove edges with coverage no larger than [default 1]&lt;br /&gt;
  -M	INT	strength of merging similar sequences during contiging [default 1, min 0, max 3]&lt;br /&gt;
  -F		intra-scaffold gap closure [default no]&lt;br /&gt;
  -u		un-mask high coverage contigs before scaffolding [default mask]&lt;br /&gt;
  -G    INT     allowed length difference between estimated and filled gap&lt;br /&gt;
  -L		minimum contigs length used for scaffolding&lt;br /&gt;
&lt;br /&gt;
=== Config file ===&lt;br /&gt;
* Example 1:&lt;br /&gt;
  max_rd_len=100                        #maximal read length ; set to 100 by default; if not changed, longer reads get truncated&lt;br /&gt;
 &lt;br /&gt;
  [LIB]                                 #paired reads&lt;br /&gt;
  avg_ins=200                           #average insert size&lt;br /&gt;
  reverse_seq=0                         #if sequence needs to be reversed (FF instead of FR : not checked yet)&lt;br /&gt;
  asm_flags=3                           #in which part(s) the reads are used; 1:ctg 2:scf 3:ctg+scf 4:gapClosure&lt;br /&gt;
  rank=1                                #in which order the reads are used while scaffolding&lt;br /&gt;
  rd_len_cutoff=75                      #cut the reads from the current library to this length&lt;br /&gt;
  pair_num_cutoff=3                     #cutoff value of pair number for a reliable connection between two contigs or pre-scaffolds; 3:shortInserts; 5:longInserts   (don&#039;t seem to be able to decrease it under 3; larger values help with the repeats)&lt;br /&gt;
  map_len=32                            #minimum alignment length between a read and a contig required for a reliable read location; 32:shortInserts; 35:longInserts  (don&#039;t seem to be able to decrease it under 32; larger values help with the repeats)&lt;br /&gt;
  &lt;br /&gt;
  q1=/path/fastq_read_1.fq              #fastq file for read fwd; names in q1&amp;amp;q2 don&#039;t have to match but the positions do !!! &lt;br /&gt;
  q2=/path/fastq_read_2.fq              #fastq file for read rev&lt;br /&gt;
  &lt;br /&gt;
  f1=/path/fasta_read_1.fa              #fasta file for read fwd&lt;br /&gt;
  f2=/path/fasta_read_2.fa              #fasta file for read rev&lt;br /&gt;
 &lt;br /&gt;
  p=/path/fasta_read_12.fa              #fwd followed by rev&lt;br /&gt;
 &lt;br /&gt;
  q=/path/fastq_read.fq                 #fastq file for read&lt;br /&gt;
  f=/path/fasta_read.fa                 #fasta file for read&lt;br /&gt;
  &lt;br /&gt;
  [LIB]                                 &lt;br /&gt;
  avg_ins=3000&lt;br /&gt;
  reverse_seq=1&lt;br /&gt;
  asm_flags=2&lt;br /&gt;
  rank=2&lt;br /&gt;
  pair_num_cutoff=5&lt;br /&gt;
  map_len=35&lt;br /&gt;
  &lt;br /&gt;
  q1=...&lt;br /&gt;
  q2=...&lt;br /&gt;
&lt;br /&gt;
=== iid2uid mapping &amp;amp; posmap file ===&lt;br /&gt;
* Example:&lt;br /&gt;
  cat s_1.unmated s_1.mates s_2.unmated s_2.mates s_3.unmated s_3.mates ... | p &#039; print $F[0],&amp;quot;\n&amp;quot;; print $F[1],&amp;quot;\n&amp;quot; if($F[1]);&#039; | nl &amp;gt; s.iid2uid&lt;br /&gt;
  cat             s_1.mates             s_2.mates             s_3.mates ... | p &#039; print $F[0],&amp;quot;\n&amp;quot;; print $F[1],&amp;quot;\n&amp;quot; if($F[1]);&#039; | nl &amp;gt; s.iid2uid&lt;br /&gt;
&lt;br /&gt;
  paste s_1.fastq s_2.fastq | perl -ane &#039;next unless($.%4==1); s/\@//g; @F=split /\s+/; print $F[0],&amp;quot;\n&amp;quot;,$F[1],&amp;quot;\n&amp;quot;;&#039; | nl &amp;gt; s.iid2uid&lt;br /&gt;
  sort -nk2 -nk3 s.readOnContig  | grep -v ^read |  ~/bin/replaceFirstCol.pl -f $1.iid2uid | perl -ane &#039;$F[4]=($F[3] eq &amp;quot;+&amp;quot;)?&amp;quot;f&amp;quot;:&amp;quot;r&amp;quot;; $F[3]=$F[2]+64; print join &amp;quot;\t&amp;quot;,@F; print &amp;quot;\n&amp;quot;;&#039; &amp;gt; s.posmap.frgctg&lt;br /&gt;
&lt;br /&gt;
=== links ===&lt;br /&gt;
&lt;br /&gt;
* Shows the links between contigs&lt;br /&gt;
  lenseq test.contig | egrep  &#039;1376|2376&#039;&lt;br /&gt;
  ctg        len&lt;br /&gt;
  ...&lt;br /&gt;
  1376       100    &lt;br /&gt;
  2376       69673  &lt;br /&gt;
  ...&lt;br /&gt;
&lt;br /&gt;
  egrep &#039;1376|2376&#039; test.scaf&lt;br /&gt;
  #ctg       pos        dir len        #DOWN ctg:totLinks:avgLibMea &lt;br /&gt;
  ...&lt;br /&gt;
  1376       109421     +   100        #DOWN 2376:52:55                #UP 2334:43:11 &lt;br /&gt;
  2376       109545     +   69673      #DOWN 2380:255:13               #UP 2334:255:8 1376:52:55 &lt;br /&gt;
  ...&lt;br /&gt;
 &lt;br /&gt;
  cat test.links | grep 2376 | grep 1376&lt;br /&gt;
  #ctg1      ctg2         dist    #Links  libMea&lt;br /&gt;
  1376       2376         34      32      240&lt;br /&gt;
  1376       2376         91      20      6054&lt;br /&gt;
&lt;br /&gt;
  avgLibMea= (sum #Links*libMea)/(sum #Links)&lt;br /&gt;
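  A worked check with the numbers above. Note that the reported value for the 1376-2376 pair in test.scaf (55) appears to match the link-count-weighted average of the dist column rather than of libMea, which would give ~2476:&lt;br /&gt;

```shell
# Link-count-weighted averages over the two 1376-2376 link bundles above
awk 'BEGIN {
  n = 32 + 20                                    # total links
  printf "dist   %d\n", (32*34  + 20*91)   / n   # 2908/52   = 55
  printf "libMea %d\n", (32*240 + 20*6054) / n   # 128760/52 = 2476
}'
```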
&lt;br /&gt;
=== Commands ===&lt;br /&gt;
* All at once:&lt;br /&gt;
  SOAPdenovo all -s SOAPdenovo.config -o prefix      # all: do all assembly steps  &lt;br /&gt;
&lt;br /&gt;
* Step by step:&lt;br /&gt;
  1. (time SOAPdenovo pregraph -s SOAPdenovo.config -o prefix -K 23 -p 6 ) &amp;gt;&amp;gt;&amp;amp;  SOAPdenovo.log&lt;br /&gt;
  prefix.kmerFreq                                                                                # up to 256&lt;br /&gt;
  prefix.edge                                                                                    # FASTA format&lt;br /&gt;
  prefix.preArc &lt;br /&gt;
  prefix.vertex                                                                                  # md5 sums ???&lt;br /&gt;
  prefix.preGraphBasic&lt;br /&gt;
&lt;br /&gt;
  2. (time SOAPdenovo contig -g prefix -M 1 ) &amp;gt;&amp;gt;&amp;amp;  SOAPdenovo.log&lt;br /&gt;
  prefix.contig&lt;br /&gt;
  prefix.ContigIndex&lt;br /&gt;
  prefix.updated.edge&lt;br /&gt;
  prefix.Arc         &lt;br /&gt;
&lt;br /&gt;
  3. (time SOAPdenovo map -s SOAPdenovo.config -g prefix -p 6 ) &amp;gt;&amp;gt;&amp;amp;  SOAPdenovo.log&lt;br /&gt;
  prefix.readOnContig&lt;br /&gt;
  prefix.peGrads&lt;br /&gt;
  prefix.readInGap&lt;br /&gt;
&lt;br /&gt;
  4. (time SOAPdenovo scaff -g prefix -F ) &amp;gt;&amp;gt;&amp;amp;  SOAPdenovo.log               &lt;br /&gt;
  prefix.newContigIndex  &lt;br /&gt;
  prefix.links&lt;br /&gt;
  prefix.scaf&lt;br /&gt;
  prefix.scaf_gap&lt;br /&gt;
  prefix.scafSeq&lt;br /&gt;
  prefix.gapSeq&lt;br /&gt;
&lt;br /&gt;
=== Run time ===&lt;br /&gt;
* On Saureus simulated data set &lt;br /&gt;
  1. 216.460u 2.792s 0:52.27 419.4% &lt;br /&gt;
  2. 0.425u   0.011s 0:00.50 86.0%    &lt;br /&gt;
  3. 59.921u  1.597s 0:21.21 290.0%   &lt;br /&gt;
  4. 6.113u   0.263s 0:06.22 102.4%   &lt;br /&gt;
&lt;br /&gt;
* Larger genomes ???&lt;br /&gt;
&lt;br /&gt;
=== Output ===&lt;br /&gt;
* No information about the position of the reads in assembly !!!&lt;br /&gt;
&lt;br /&gt;
  cat scaffold1.filter-q.delta | ~/bin/delta2posmap.pl | sort -nk3 | head -20 | pretty&lt;br /&gt;
&lt;br /&gt;
=== Adding more reads for scaffolding ===&lt;br /&gt;
&lt;br /&gt;
Files to link:&lt;br /&gt;
  perl -e &#039;@F=split /\s+/,&amp;quot;Arc contig ContigIndex edge kmerFreq peGrads preArc preGraphBasic vertex&amp;quot;; \&lt;br /&gt;
           foreach (@F) { print &amp;quot;ln -s ../SOAP-prev/*.$_\n&amp;quot; }&#039;&lt;br /&gt;
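  The same commands can be generated with a plain shell loop (printing them for review, as the perl one-liner above does; the ../SOAP-prev path is from that example):&lt;br /&gt;

```shell
# Print the symlink commands for the SOAPdenovo files to reuse
for ext in Arc contig ContigIndex edge kmerFreq peGrads preArc preGraphBasic vertex; do
  echo "ln -s ../SOAP-prev/*.$ext"
done
```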
&lt;br /&gt;
=== GapCloser (works with SOAPdenovo) ===&lt;br /&gt;
&lt;br /&gt;
* Version: 1.10&lt;br /&gt;
* [http://soap.genomics.org.cn/soapdenovo.html Web site]&lt;br /&gt;
  ~/szdevel/bin/SOAPGapCloser -b SOAPdenovo.config -a Bb.scafSeq -o Bb.scafSeq.new -t 12&lt;br /&gt;
  =&amp;gt;&lt;br /&gt;
  Bb.scafSeq.new      : FASTA file(Bb.scafSeq with fewer N&#039;s)&lt;br /&gt;
  Bb.scafSeq.new.fill : gap fill status&lt;br /&gt;
  Example:&lt;br /&gt;
  &amp;gt;scaffold2 175.2&lt;br /&gt;
  ...&lt;br /&gt;
  #begin  end     5&#039;fill  3&#039;fill  closed?&lt;br /&gt;
  55203   57185   0       23      0&lt;br /&gt;
  69718   71607   0       28      0&lt;br /&gt;
  79195   79219   24      0       1&lt;br /&gt;
  80681   80741   0       23      0&lt;br /&gt;
  81738   81876   0       37      0&lt;br /&gt;
  ...&lt;br /&gt;
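  A small awk sketch to tally the fill status (assuming the .fill layout above: five columns per gap, the closed? flag last; header/scaffold lines start with # or &amp;gt;):&lt;br /&gt;

```shell
# Count how many gaps GapCloser closed, per the .fill columns above
count_closed() {
  awk '$0 !~ /^[#>]/ { if (NF == 5) { total++; closed += $5 } }
       END { printf "closed %d of %d gaps\n", closed, total }'
}
# usage: count_closed Bb.scafSeq.new.fill  (or pipe the file in)
```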
&lt;br /&gt;
=== GC vs Coverage correlation ===&lt;br /&gt;
  infoseq -description asm.K31.contig | p &#039;/cvg_(\d+.\d+)/; print &amp;quot;$F[0] $F[1] $F[2] $1\n&amp;quot; ;&#039; &amp;gt;! asm.K31.contig.infoseq &lt;br /&gt;
&lt;br /&gt;
  cat asm.K31.contig.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3 -min 100&lt;br /&gt;
  cat asm.K31.contig.infoseq | ~/bin/getCorrelationDescriptive.pl -i 2 -j 3 -min 1000&lt;br /&gt;
  ...&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== SOAPdenovo-prepare ===&lt;br /&gt;
&lt;br /&gt;
* Data Preparation Module generates the data necessary for SOAPdenovo to run the &amp;quot;map&amp;quot; and &amp;quot;scaff&amp;quot; steps from contigs generated by SOAPdenovo or other assemblers with various kmer lengths.&lt;br /&gt;
* options:&lt;br /&gt;
&lt;br /&gt;
    * -g [string] Prefix of output.&lt;br /&gt;
    * -K [int] Kmer length.&lt;br /&gt;
    * -c [string] Input Contigs FASTA. (Filename cannot be prefix.contig)&lt;br /&gt;
&lt;br /&gt;
* Example:&lt;br /&gt;
  /fs/szdevel/core-cbcb-software/Linux-x86_64/bin/SOAPdenovo-prepare -K 31 -c default-contigs.fa  -g genome  &lt;br /&gt;
  =&amp;gt;&lt;br /&gt;
  genome.ContigIndex&lt;br /&gt;
  genome.updated.edge&lt;br /&gt;
  genome.contig&lt;br /&gt;
  genome.Arc&lt;br /&gt;
  genome.conver&lt;br /&gt;
  genome.preGraphBasic&lt;br /&gt;
&lt;br /&gt;
  SOAPdenovo map -s SOAPdenovo.config -g genome    -p 20&lt;br /&gt;
  SOAPdenovo scaff                    -g genome -F -p 20&lt;br /&gt;
&lt;br /&gt;
* Example: use it if the jump reads are shorter than the contigging kmer size&lt;br /&gt;
&lt;br /&gt;
  more ~/bin/SOAPdenovo-63mer.sh&lt;br /&gt;
  #!/bin/sh -x&lt;br /&gt;
  # Usage: SOAPdenovo-63mer.sh prefix ctgK scfK&lt;br /&gt;
 &lt;br /&gt;
  # pregraphing ...&lt;br /&gt;
  SOAPdenovo-63mer pregraph -s SOAPdenovo.config -K $2 -p 16 -o $1  &amp;gt;! SOAPdenovo.log&lt;br /&gt;
 &lt;br /&gt;
  # contigging ...&lt;br /&gt;
  SOAPdenovo-63mer contig   -g $1 -M 1                              &amp;gt;&amp;gt; SOAPdenovo.log&lt;br /&gt;
 &lt;br /&gt;
  # preparing ...&lt;br /&gt;
  mv $1.contig $1.contig.tmp &lt;br /&gt;
  SOAPdenovo-prepare -K $3 -c $1.contig.tmp  -g $1                  &amp;gt;&amp;gt; SOAPdenovo.log&lt;br /&gt;
 &lt;br /&gt;
  # read mapping ...&lt;br /&gt;
  SOAPdenovo-63mer map      -s SOAPdenovo.config -g $1 -p 16        &amp;gt;&amp;gt; SOAPdenovo.log&lt;br /&gt;
  &lt;br /&gt;
  # scaffolding ...&lt;br /&gt;
  SOAPdenovo-63mer scaff    -g $1 -F                   -p 16        &amp;gt;&amp;gt; SOAPdenovo.log&lt;br /&gt;
&lt;br /&gt;
== Velvet ==&lt;br /&gt;
&lt;br /&gt;
* Version: 0.7.55&lt;br /&gt;
* Version 1.1.04: &lt;br /&gt;
* [http://www.ebi.ac.uk/~zerbino/velvet/ Web site]&lt;br /&gt;
* [http://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf Manual]&lt;br /&gt;
&lt;br /&gt;
  Concerning read orientation, Velvet expects paired-end reads to come from opposite strands facing each other, as in the traditional Sanger format.&lt;br /&gt;
  If you have paired-end reads produced from circularisation (i.e. from the same strand), it will be necessary to replace the first read in each pair &lt;br /&gt;
  by its reverse complement before running velveth.&lt;br /&gt;
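  A sketch of that reverse-complement step for a one-line-per-record FASTA (filenames illustrative; headers pass through untouched):&lt;br /&gt;

```shell
# Reverse-complement every sequence line of a one-line-per-record FASTA
revcomp_fasta() {
  while read -r line; do
    case $line in
      ">"*) printf '%s\n' "$line" ;;                         # header: keep as-is
      *)    printf '%s\n' "$line" | rev | tr ACGTacgt TGCAtgca ;;
    esac
  done
}
# usage: cat mates_1.fa | revcomp_fasta > mates_1.rc.fa
```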
&lt;br /&gt;
* [https://banana-slug.soe.ucsc.edu/lecture_notes:04-19-2010 Lecture]&lt;br /&gt;
&lt;br /&gt;
* Compile:&lt;br /&gt;
  make &#039;CATEGORIES=16&#039; \                   # max libraries&lt;br /&gt;
        &#039;MAXKMERLENGTH=64&#039; \               # max kmer length&lt;br /&gt;
        &#039;OPENMP=1&#039; \                       # multithreaded&lt;br /&gt;
        &#039;BIGASSEMBLY=1&#039; \                  # more than 2.2 billion reads&lt;br /&gt;
        &#039;LONGSEQUENCES=1&#039;                  # contigs longer than 32kb&lt;br /&gt;
&lt;br /&gt;
  make CATEGORIES=16 MAXKMERLENGTH=64 OPENMP=1 BIGASSEMBLY=1 LONGSEQUENCES=1&lt;br /&gt;
&lt;br /&gt;
* Input: 454 , Illumina,  ABI SOLiD as well? , AssembledSequences &lt;br /&gt;
* Format: seq, fastq (Sanger/Illumina style, no conversion needed); all mates should be innies&lt;br /&gt;
* 454: &lt;br /&gt;
** does poorly&lt;br /&gt;
** slightly better if the cvg is given&lt;br /&gt;
** same results for fasta/fastq input&lt;br /&gt;
** results get worse for higher cvg&lt;br /&gt;
* Commands: &lt;br /&gt;
  # unmated reads&lt;br /&gt;
  velveth . &amp;lt;hash_size&amp;gt; -fastq -short *.fastq&lt;br /&gt;
  velvetg . -amos_file yes -read_trkg yes -exp_cov &amp;lt;cvg&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  # mated reads&lt;br /&gt;
  shuffleSequences_fastq.pl prefix_1.fastq prefix_2.fastq  prefix_12.fastq&lt;br /&gt;
  velveth . &amp;lt;hash_size&amp;gt; -fastq -short prefix.fastq -shortPaired prefix_12.fastq&lt;br /&gt;
  velvetg . -ins_length &amp;lt;mean&amp;gt; -ins_length_sd &amp;lt;std&amp;gt; -amos_file yes -read_trkg yes -exp_cov &amp;lt;cvg&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  #reference based&lt;br /&gt;
  bowtie prefix.1con  -1 prefix_1.fastq -2 prefix_2.fastq -l 28 -n 2 --best -I 100 -X 200 -p 12 --sam | sort &amp;gt; prefix.bowtie.sam&lt;br /&gt;
  velveth . 21 -reference prefix.1con -shortPaired -sam prefix.bowtie.sam&lt;br /&gt;
  velvetg . -exp_cov auto -ins_length 150 -scaffolding yes -exportFiltered yes -unused_reads yes&lt;br /&gt;
&lt;br /&gt;
== Aleksey&#039;s create_k_unitigs ==&lt;br /&gt;
&lt;br /&gt;
* Usage&lt;br /&gt;
** genome8.umd.edu&lt;br /&gt;
 &lt;br /&gt;
  jellyfish count -m 31 -s 1073741824 -r -C -t 16 -o frag.k31.jf *.fa &lt;br /&gt;
  mv -i frag.k31.jf_0 frag.k31.jf&lt;br /&gt;
 &lt;br /&gt;
  create_k_unitigs -C -l 64 -m 5 -M 5 -t 16 -o frag.k31 frag.k31.jf&lt;br /&gt;
  &lt;br /&gt;
  cat frag.k31_*.fa     &amp;gt; frag.k31.fa&lt;br /&gt;
  cat frag.k31_*.counts &amp;gt; frag.k31.counts&lt;br /&gt;
  rm frag.k31_*&lt;br /&gt;
 &lt;br /&gt;
  ~alekseyz/myprogs/getNumBasesPerReadInFastaFile.perl frag.k31.fa | sort -gr | head&lt;br /&gt;
&lt;br /&gt;
** cbcb&lt;br /&gt;
  /fs/szdevel/dpuiu/filter_illumina_data_quals.pl&lt;br /&gt;
  /fs/szdevel/dpuiu/k_unitig-0.0.1/bin/&lt;br /&gt;
&lt;br /&gt;
  cat *fastq | /fs/szdevel/dpuiu/filter_illumina_data_quals.pl &#039;#&#039; &amp;gt; genome.seq&lt;br /&gt;
 &lt;br /&gt;
  /fs/szdevel/dpuiu/k_unitig-0.0.1/bin/jellyfish count -s 200000000 -m 31 -t 16 -r -C genome.seq&lt;br /&gt;
  /fs/szdevel/dpuiu/k_unitig-0.0.1/bin/error_correct_reads -d mer_counts.hash_0 -t 16 -m 3 frag_1.fastq&lt;br /&gt;
&lt;br /&gt;
== Jellyfish ==&lt;br /&gt;
 &lt;br /&gt;
  jellyfish count -m 18 -s 1073741824 --both-strands -t 16 -o frag.k18.jf frag_?.*fastq&lt;br /&gt;
  jellyfish merge frag.k18.jf_*  -o frag.k18.jf              &lt;br /&gt;
  rm frag.k18.jf_* &lt;br /&gt;
&lt;br /&gt;
  jellyfish histo frag.k18.jf &amp;gt; frag.k18.histo&lt;br /&gt;
&lt;br /&gt;
  jellyfish stats     frag.k18.jf &amp;gt; frag.k18.stats&lt;br /&gt;
  jellyfish stats --f frag.k18.jf &amp;gt; frag.k18.fasta&lt;br /&gt;
  jellyfish stats --c frag.k18.jf | awk &#039;{print $2,$1}&#039; &amp;gt; frag.k18.tab&lt;br /&gt;
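  The .histo output is two columns (multiplicity, count); a rough sketch to locate the main coverage peak, skipping the low-multiplicity error k-mers (the cutoff of 3 is a guess to tune per data set):&lt;br /&gt;

```shell
# Find the k-mer multiplicity with the highest count past the error zone
histo_peak() {
  awk '$1 >= 3 { if ($2 > best) { best = $2; peak = $1 } }
       END { print peak }'
}
# usage: histo_peak frag.k18.histo  (or pipe the file in)
```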
&lt;br /&gt;
= Scaffolding =&lt;br /&gt;
== Bambus ==&lt;br /&gt;
* [http://www.cs.jhu.edu/~genomics/Bambus/Manual.html#run Manual]&lt;br /&gt;
&lt;br /&gt;
= Comparative assemblers =&lt;br /&gt;
&lt;br /&gt;
Performance measures:&lt;br /&gt;
* ctg/scf/singleton stats&lt;br /&gt;
* a related finished genome(s) must be available&lt;br /&gt;
** Q: which is the closest relative?  A: align the reads to the refs and identify the one with the max number of hits&lt;br /&gt;
** Q: how many repeats are in the ref &amp;amp; how well are they assembled?&lt;br /&gt;
** Q: reference coverage: how many gaps are left?&lt;br /&gt;
** comparative assembly to a very close/more distant relative&lt;br /&gt;
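  The closest-relative question above can be answered mechanically from an alignment of the reads against the candidate references, e.g. (assuming SAM input; column 3 is the reference name):&lt;br /&gt;

```shell
# Tally mapped reads per reference; the top line is the closest-relative candidate
hits_per_ref() {
  awk '$0 !~ /^@/ { if ($3 != "*") n[$3]++ }
       END { for (r in n) print n[r], r }' | sort -rn
}
# usage: hits_per_ref reads_vs_refs.sam | head -1
```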
&lt;br /&gt;
== AMOScmp-shortReads ==&lt;br /&gt;
&lt;br /&gt;
* [http://sourceforge.net/apps/mediawiki/amos/index.php?title=AMOScmp Web site]&lt;br /&gt;
* Version: &lt;br /&gt;
* Input: 454, Illumina&lt;br /&gt;
* Format: amos  &lt;br /&gt;
* 454 trimming: &lt;br /&gt;
** alignment based trimming =&amp;gt; longer/fewer contigs&lt;br /&gt;
** OBT trimming =&amp;gt; shorter/more contigs, fewer snps &amp;amp; breaks&lt;br /&gt;
** lucy quality based trimming does not help much&lt;br /&gt;
* Illumina trimming&lt;br /&gt;
** Trim n bases at the 3&#039;&lt;br /&gt;
** Trim all bases with quality &amp;lt; n at the 3&#039; &amp;amp; 5&#039;&lt;br /&gt;
** Delete all reads which contain N&#039;s&lt;br /&gt;
&lt;br /&gt;
* Commands:&lt;br /&gt;
  AMOScmp-shortReads prefix                           # 454    : use nucmer aligner&lt;br /&gt;
  AMOScmp-shortReads-alignmentTrimmed  prefix         # 454    : use nucmer aligner&lt;br /&gt;
  AMOScmp-shortReads -D DOSOAP=1 prefix               # Solexa : use soap aligner&lt;br /&gt;
* Can use nucmer/soap for read alignment &lt;br /&gt;
  nucmer --maxmatch -l 16 -c 32&lt;br /&gt;
 &lt;br /&gt;
  soap -p 2 -r 2 -v 3 -g 3&lt;br /&gt;
  # p: parallel&lt;br /&gt;
  # r: 2 = report multiple occurrences&lt;br /&gt;
  # v: max mismatch&lt;br /&gt;
  # g: max gap len&lt;br /&gt;
* Breaks&lt;br /&gt;
** Q: When the assembly is aligned back to the ref =&amp;gt; how many breaks,snps ?&lt;br /&gt;
** If several reads map partially to 2 different locations within a few hundred bp of one another &amp;amp; the CONTAINED reads in between align to multiple regions =&amp;gt; excision of the repeat&lt;br /&gt;
** by decreasing the &amp;quot;casm-layout -g&amp;quot; param from 1000 to 20(?) we can solve some of the problems&lt;br /&gt;
&lt;br /&gt;
== Maq ==&lt;br /&gt;
&lt;br /&gt;
* Version: 0.7.1&lt;br /&gt;
* [http://maq.sourceforge.net/ Web site]&lt;br /&gt;
* [http://lh3lh3.users.sourceforge.net/download/maq-20090501.pdf Heng Li&#039;s presentation]&lt;br /&gt;
* [http://bioinfo.genopole-toulouse.prd.fr/fileadmin/ToDownload/Formation/F10c/F10c2_slides_full.pdf  bioinfo.genopole-toulouse.prd.fr]&lt;br /&gt;
* [http://sourceforge.net/projects/maq/files/maqview/0.2.3/ Maqview]&lt;br /&gt;
* [http://sourceforge.net/projects/breakdancer/ BreakDancer-0.0.1] is a Perl package that provides genome-wide detection of structural variants from next generation paired-end sequencing reads. &lt;br /&gt;
* Input: Illumina&lt;br /&gt;
* Format: fastq&lt;br /&gt;
* Commands:&lt;br /&gt;
  maq.pl easyrun prefix.1con prefix.fastq                                    # single reads         &lt;br /&gt;
  maq.pl easyrun prefix.1con prefix.fwd.fastq prefix.rev.fastq -p -a [250]   # paired reads; max insert size defaults to 250bp ; NOT WORKING CORRECTLY !!! (0 paired reads reported)&lt;br /&gt;
&lt;br /&gt;
* Performs very well for close ref (&amp;gt;99% id); not that well for further ref (&amp;lt;98% id)&lt;br /&gt;
&lt;br /&gt;
== Newbler ==&lt;br /&gt;
&lt;br /&gt;
* See the previous section&lt;br /&gt;
* Commands: &lt;br /&gt;
  # reference based&lt;br /&gt;
  runMapping -o . prefix.1con *.sff&lt;br /&gt;
&lt;br /&gt;
= Other assemblers =&lt;br /&gt;
&lt;br /&gt;
== ABBA ==&lt;br /&gt;
* http://sourceforge.net/apps/mediawiki/amos/index.php?title=ABBA&lt;br /&gt;
&lt;br /&gt;
== minimus2 ==&lt;br /&gt;
  #!/bin/tcsh&lt;br /&gt;
  ln -s ../&lt;br /&gt;
  cat ref.seq qry.seq &amp;gt; prefix.seq&lt;br /&gt;
  toAmos -s prefix.seq -o prefix.afg&lt;br /&gt;
  set REFCOUNT=`grep -c &amp;quot;&amp;gt;&amp;quot; ref.seq`&lt;br /&gt;
  minimus2 -D REFCOUNT=$REFCOUNT -D MINID=94 -D OVERLAP=20 prefix&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== minimusN ==&lt;br /&gt;
&lt;br /&gt;
  cat ref.asm/ref.scf.fasta | ~/bin/assignFastaIds.pl -p ref | tee ref.scf.fasta | ~/bin/scafffasta2fasta.pl  &amp;gt; ref.ctg.fasta&lt;br /&gt;
  cat qry.asm/qry.scf.fasta | ~/bin/assignFastaIds.pl -p qry | tee qry.scf.fasta | ~/bin/scafffasta2fasta.pl  &amp;gt; qry.ctg.fasta&lt;br /&gt;
 &lt;br /&gt;
  ~/bin/nucmer.amos -D REFN=18000 -D REF=ref.ctg.fasta -D SEQS=qry.ctg.fasta -D MINMATCH=20 -D MINCLUSTER=40 -D BREAKLEN=20 -D MAXGAP=20 ref-qry&lt;br /&gt;
 &lt;br /&gt;
  delta-filter -i 94 genome.filter-1.delta | /nfshomes/dpuiu/szdevel/SourceForge/AMOS/bin/delta2tab.pl | sort -nk1 -nk6 -nk7 &lt;br /&gt;
 &lt;br /&gt;
  ~/bin/AMOS/preMinimus2                 # to get qry breaks&lt;br /&gt;
  minimusN -D REFCOUNT=$REFCOUNT&lt;br /&gt;
&lt;br /&gt;
= Sequence aligners =&lt;br /&gt;
&lt;br /&gt;
== Bowtie ==&lt;br /&gt;
&lt;br /&gt;
* [http://bowtie-bio.sourceforge.net/index.shtml Web site (CBCB)]&lt;br /&gt;
&lt;br /&gt;
* No gapped alignments (yet)&lt;br /&gt;
&lt;br /&gt;
  -n : max mismatches in seed (can be 0-3, default: -n 2)              (Soap??)&lt;br /&gt;
  -l : seed length for -n (default: 28)&lt;br /&gt;
  --best   : hits guaranteed best stratum; ties broken by quality&lt;br /&gt;
  --strata : hits in sub-optimal strata aren&#039;t reported (requires --best)&lt;br /&gt;
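The --best/--strata reporting policy can be sketched in Python (an illustration of the policy only, not bowtie code; the hit tuples are hypothetical):&lt;br /&gt;

```python
# Illustration of bowtie's --best/--strata reporting semantics:
# a "stratum" is the number of mismatches in the seed; with --strata,
# only hits in the best (lowest) populated stratum are reported.

def best_strata(hits):
    """hits: list of (position, seed_mismatches, quality) tuples."""
    if not hits:
        return []
    best = min(m for _, m, _ in hits)           # lowest populated stratum
    kept = [h for h in hits if h[1] == best]    # --strata: drop sub-optimal strata
    # --best: ties within a stratum are ordered by quality (higher first)
    return sorted(kept, key=lambda h: -h[2])

hits = [(100, 2, 30), (250, 0, 40), (300, 0, 35), (900, 1, 42)]
print(best_strata(hits))   # [(250, 0, 40), (300, 0, 35)]
```

As noted above, --strata requires --best; the final sort mirrors --best's quality tie-breaking.&lt;br /&gt;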
&lt;br /&gt;
* Create an index&lt;br /&gt;
  bowtie-build ref.fasta ref.fasta &lt;br /&gt;
&lt;br /&gt;
* Align mated &amp;amp; unmated reads&lt;br /&gt;
  Example:&lt;br /&gt;
  # qry_1.fastq  &amp;amp; qry_2.fastq : mated reads&lt;br /&gt;
  # qry_0.fastq                : unmated reads&lt;br /&gt;
 &lt;br /&gt;
  bowtie  ref.fasta  -1 qry_1.fastq -2 qry_2.fastq -l 28 -n 2 -I 6000 -X 10000 --best  -p 8 &amp;gt;  qry_12.bowtie&lt;br /&gt;
  bowtie  ref.fasta     qry_0.fastq                -l 28 -n 2                  --best  -p 8 &amp;gt; qry_0.bowtie&lt;br /&gt;
&lt;br /&gt;
* --chunkmbs &amp;lt;int&amp;gt;   max megabytes of RAM for best-first search frames (def: 64)&lt;br /&gt;
  increase it to 512 or 1024 in case of error&lt;br /&gt;
&lt;br /&gt;
== Maq / Bwa ==&lt;br /&gt;
* [http://bio-bwa.sourceforge.net/bwa.shtml bwa]&lt;br /&gt;
* index reference&lt;br /&gt;
  bwa index ref.fasta&lt;br /&gt;
&lt;br /&gt;
* short single reads&lt;br /&gt;
  bwa aln ref.fasta qry.fastq &amp;gt; qry.sai&lt;br /&gt;
  bwa samse ref.fasta qry.sai qry.fastq &amp;gt; qry.sam&lt;br /&gt;
&lt;br /&gt;
* short paired reads&lt;br /&gt;
  bwa aln ref.fasta qry_1.fastq &amp;gt; qry_1.sai&lt;br /&gt;
  bwa aln ref.fasta qry_2.fastq &amp;gt; qry_2.sai&lt;br /&gt;
  bwa sampe ref.fasta qry_1.sai qry_2.sai qry_1.fastq qry_2.fastq &amp;gt; qry.sam&lt;br /&gt;
&lt;br /&gt;
* long reads&lt;br /&gt;
  bwa bwasw ref.fasta qry.fastq &amp;gt; qry.sam&lt;br /&gt;
&lt;br /&gt;
* Example&lt;br /&gt;
&lt;br /&gt;
  bwa bwasw cChloroplast.fa test.fastq | sed &#039;s/555//g&#039; | pretty&lt;br /&gt;
  @SQ          SN:cChloroplast  LN:120481     &lt;br /&gt;
  match.fwd    0                cChloroplast  1  18  100M        *  0  0  TTATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATTA    ...     AS:i:100  XS:i:86  XF:i:2  XE:i:3  XN:i:0  &lt;br /&gt;
  5.fwd        0                cChloroplast  3  36  2S98M       *  0  0  NNATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATTA    ...     AS:i:98   XS:i:84  XF:i:3  XE:i:2  XN:i:0  &lt;br /&gt;
  3.fwd        0                cChloroplast  1  15  98M2S       *  0  0  TTATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATNN    ...     AS:i:98   XS:i:86  XF:i:2  XE:i:3  XN:i:0  &lt;br /&gt;
  5.rev        16               cChloroplast  2  36  1S99M       *  0  0  NNATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATTA    ...     AS:i:99   XS:i:85  XF:i:3  XE:i:2  XN:i:0  &lt;br /&gt;
  3.rev        16               cChloroplast  1  15  98M2S       *  0  0  TTATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATNN    ...     AS:i:98   XS:i:86  XF:i:2  XE:i:3  XN:i:0  &lt;br /&gt;
  ins.fwd      0                cChloroplast  3  40  2S48M2I50M  *  0  0  TAATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTNNTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATTA  ...     AS:i:89  XS:i:75   XF:i:3   XE:i:2  XN:i:0  &lt;br /&gt;
  del.fwd      0                cChloroplast  3  41  2S47M2D49M  *  0  0  TAATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATTA      ...     AS:i:87   XS:i:73  XF:i:3  XE:i:2  XN:i:0  &lt;br /&gt;
&lt;br /&gt;
  bwa bwasw -H cChloroplast.fa test.fastq | pretty  &lt;br /&gt;
  @SQ          SN:cChloroplast  LN:120481     &lt;br /&gt;
  match.fwd    0                cChloroplast  1  18  100M        *  0  0  TTATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATTA  ...       AS:i:100  XS:i:86  XF:i:2  XE:i:3  XN:i:0  &lt;br /&gt;
  5.fwd        0                cChloroplast  3  36  2H98M       *  0  0  ATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATTA    ...       AS:i:98   XS:i:84  XF:i:3  XE:i:2  XN:i:0  &lt;br /&gt;
  3.fwd        0                cChloroplast  1  15  98M2H       *  0  0  TTATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGAT    ...       AS:i:98   XS:i:86  XF:i:2  XE:i:3  XN:i:0  &lt;br /&gt;
  5.rev        16               cChloroplast  2  36  1H99M       *  0  0  NATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATTA   ...       AS:i:99   XS:i:85   XF:i:3   XE:i:2  XN:i:0  &lt;br /&gt;
  3.rev        16               cChloroplast  1  15  98M2H       *  0  0  TTATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGAT    ...       AS:i:98   XS:i:86  XF:i:2  XE:i:3  XN:i:0  &lt;br /&gt;
  ins.fwd      0                cChloroplast  3  40  2H48M2I50M  *  0  0  ATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGTNNTGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATTA  ...       AS:i:89   XS:i:75  XF:i:3  XE:i:2  XN:i:0  &lt;br /&gt;
  del.fwd      0                cChloroplast  3  41  2H47M2D49M  *  0  0  ATCCACCTATTGAAATAGATTCAACAGCGGCTAGATCCAGAGGAAAGGTGAGCATTACGTTCGTGCATAACTTCCATACCTAGGTTAGCACGATTA      ...       AS:i:87   XS:i:73   XF:i:3   XE:i:2  XN:i:0  &lt;br /&gt;
&lt;br /&gt;
  bwa bwasw -H cChloroplast.fa test.fastq | ~/bin/AMOS/sam2posmap.pl | pretty&lt;br /&gt;
  @SQ          SN:cChloroplast  LN:120481     &lt;br /&gt;
  match.fwd    cChloroplast  0  100  f  0  100  100  100.00  &lt;br /&gt;
  5.fwd        cChloroplast  2  100  f  2  100  98   100.00  &lt;br /&gt;
  3.fwd        cChloroplast  0  98   f  0  98   98   100.00  &lt;br /&gt;
  5.rev        cChloroplast  1  100  r  1  100  99   100.00  &lt;br /&gt;
  3.rev        cChloroplast  0  98   r  0  98   98   100.00  &lt;br /&gt;
  ins.fwd      cChloroplast  2  102  f  2  100  98   97.96   &lt;br /&gt;
  del.fwd      cChloroplast  2  98   f  2  100  98   97.96&lt;br /&gt;
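The CIGAR strings in the two listings above (soft clips as S, hard clips as H with -H) determine how many bases each alignment consumes on the reference and on the query; a minimal sketch:&lt;br /&gt;

```python
import re

# Reference- and query-consumed lengths from a SAM CIGAR string:
# M/=/X consume both, D/N consume reference only, I/S consume query only,
# H consumes neither (hard-clipped bases are absent from SEQ).
def cigar_spans(cigar):
    ref = qry = 0
    for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(n)
        if op in "M=X":
            ref += n; qry += n
        elif op in "DN":
            ref += n
        elif op in "IS":
            qry += n
    return ref, qry

print(cigar_spans("2S98M"))        # (98, 100): soft-clipped bases stay in SEQ
print(cigar_spans("2H98M"))        # (98, 98):  hard clips are removed from SEQ
print(cigar_spans("2S48M2I50M"))   # (98, 102): insertion consumes query only
print(cigar_spans("2S47M2D49M"))   # (98, 98):  deletion consumes reference only
```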
&lt;br /&gt;
== Nucmer ==&lt;br /&gt;
* Run in parallel on big data sets&lt;br /&gt;
  #!/bin/tcsh&lt;br /&gt;
  set REFN=`grep -c &amp;quot;&amp;gt;&amp;quot; ref.fasta`&lt;br /&gt;
  @ REFN/=20&lt;br /&gt;
 &lt;br /&gt;
  /nfshomes/dpuiu/bin/nucmer.amos -D REFSEQ=$PWD/ref.fasta -D QRYSEQ=$PWD/qry.fasta -D REFN=$REFN ref-qry&lt;br /&gt;
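nucmer.amos is a local wrapper; the chunking idea behind REFN (split the reference into batches so each batch runs as an independent nucmer job) can be sketched as follows, with simplified file handling and hypothetical helper names:&lt;br /&gt;

```python
# Sketch of the REFN chunking idea behind the nucmer.amos wrapper:
# split a multi-FASTA into batches of REFN records so that each batch
# can be aligned by a separate nucmer process.

def fasta_records(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line.rstrip("\n"), []
        else:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)

def chunk(records, refn):
    """Group records into batches of at most refn entries."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == refn:
            yield batch
            batch = []
    if batch:
        yield batch

fasta = ">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n"
batches = list(chunk(fasta_records(fasta.splitlines(True)), 2))
print(len(batches))   # 2 batches: [r1, r2] and [r3]
```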
&lt;br /&gt;
== SOAP2 ==&lt;br /&gt;
&lt;br /&gt;
* [http://soap.genomics.org.cn/ Web site (China)] &lt;br /&gt;
&lt;br /&gt;
* No gapped alignments by default (at most one continuous gap per read can be enabled with -g)&lt;br /&gt;
&lt;br /&gt;
* Create an index&lt;br /&gt;
  2bwt-builder ref.fasta &lt;br /&gt;
&lt;br /&gt;
* Align mated &amp;amp; unmated reads&lt;br /&gt;
  -l : seed length for alignment&lt;br /&gt;
  -s : minimal alignment length&lt;br /&gt;
  -n : filter low-quality reads containing &amp;gt;n Ns before alignment [5]&lt;br /&gt;
  -r : how to report repeat hits: 0=none; 1=random one; 2=all [1]&lt;br /&gt;
  -v : maximum number of mismatches allowed on a read [5]&lt;br /&gt;
  -g : length of one continuous gap allowed on a read [0] bp&lt;br /&gt;
&lt;br /&gt;
  Example: don&#039;t forget &amp;quot;-2&amp;quot;&lt;br /&gt;
  soap2 -D ref.index -a qry_1.fastq -b qry_2.fastq -o qry_12.mated.soap2 -2 qry_12.single.soap2  -l 28 -v 2 -n 0 -m 6000 -x 10000 -p 8&lt;br /&gt;
  soap2 -D ref.index -a qry_0.fastq                -o qry_0.soap2                                -l 28 -v 2 -n 0                  -p 8&lt;br /&gt;
&lt;br /&gt;
= Alignment processing =&lt;br /&gt;
&lt;br /&gt;
== Samtools ==&lt;br /&gt;
* [http://samtools.sourceforge.net/SAM1.pdf SAM format]&lt;br /&gt;
* [http://samtools.sourceforge.net/samtools.shtml manual]&lt;br /&gt;
* [http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_FAQ SAM FAQ]&lt;br /&gt;
* [http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_protocol SAM Protocol]&lt;br /&gt;
* [http://lh3lh3.users.sourceforge.net/download/maq-20090501.pdf Heng Li presentation]&lt;br /&gt;
&lt;br /&gt;
http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit&lt;br /&gt;
http://www.broadinstitute.org/gsa/wiki/index.php/Converting_Illumina_output_to_BAM_files&lt;br /&gt;
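The FLAG values in the SAM output above (0 = forward, 16 = reverse strand) are bit fields defined in the SAM spec; a minimal decoder:&lt;br /&gt;

```python
# Decoding the SAM FLAG field (bit meanings from the SAM specification);
# the bwa bwasw examples above use FLAG 0 (forward) and 16 (reverse).
SAM_FLAGS = {
    0x1:   "paired",
    0x2:   "proper_pair",
    0x4:   "unmapped",
    0x8:   "mate_unmapped",
    0x10:  "reverse",
    0x20:  "mate_reverse",
    0x40:  "first_in_pair",
    0x80:  "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
}

def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if flag & bit]

print(decode_flag(16))   # ['reverse']
print(decode_flag(99))   # ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
```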
&lt;br /&gt;
= Data sets = &lt;br /&gt;
&lt;br /&gt;
Sources&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&amp;amp;f=main&amp;amp;m=main&amp;amp;s=main NCBI SRA]&lt;br /&gt;
* [http://www.genomeweb.com/informatics/ncbi-end-support-sequence-read-archive-federal-purse-strings-tighten SRA to close]&lt;br /&gt;
* [http://www.ebi.ac.uk/embl/Documentation/ENA-Reads.html ENA] European Nucleotide Archive  (?)&lt;br /&gt;
&lt;br /&gt;
== Escherichia coli str. K-12 substr. MG1655  ==&lt;br /&gt;
&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP001087 SRA Study] E coli Whole Genome Sequencing on 454 and Illumina  (454 unpaired &amp;amp; Illumina reads)&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP000196 SRA Study] 454 sequencing of Escherichia coli str. K-12 substr. MG1655 genomic paired-end library (454 paired reads)&lt;br /&gt;
* 29 complete strains: NCBI search:  &amp;quot;Escherichia coli&amp;quot;[Organism] AND pt_default[prop] AND seqstat_complete[PROP]&lt;br /&gt;
* /fs/szdata/ncbi/genomes/Bacteria/Escherichia_coli_*/&lt;br /&gt;
* chromosome lengths 4.5-5.5Mbp&lt;br /&gt;
* plasmids?&lt;br /&gt;
* Ref (same species): &lt;br /&gt;
                 len     gc%   &lt;br /&gt;
  NC_000913      4639675 50.79  Escherichia coli str. K-12 substr. MG1655, complete genome  (same substr?)  2.9% repeats&lt;br /&gt;
  NC_010468      4746218 50.87  Escherichia coli ATCC 8739, complete genome (99% id)&lt;br /&gt;
  NC_008253      4938920 50.52  Escherichia coli 536, complete genome (97% id)&lt;br /&gt;
* Ref (same genus): &lt;br /&gt;
  NC_011740      4588711 49.94  Escherichia fergusonii ATCC 35469, complete genome (94% id)&lt;br /&gt;
&lt;br /&gt;
* Ecoli K12 vs ATCC alignment : 99% identity in aligned regions ; 0.926% identity overall [[Media:Ecoli.K12-ATCC.png]]&lt;br /&gt;
  .                    elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  0cvg                 130        6      153    756    2085   18927      2307       7560       299947  (6.4% of the K12 genome not present in ATCC)        &lt;br /&gt;
  %id                  4929904    68.73  98.95  99.21  99.38  100.00     98.52      99         .   &lt;br /&gt;
&lt;br /&gt;
* Ecoli K12 vs 536 alignment : 97% identity =&amp;gt; more distant&lt;br /&gt;
  .                    elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  0cvg                 375        2      134    461    1549   30645      1854.35    7094       695383  (14.98% of the K12 genome not present in 536)               &lt;br /&gt;
  %id                  4143854    79.11  96.81  97.48  97.88  100.00     97.17      97         .    &lt;br /&gt;
&lt;br /&gt;
* Ecoli K12 Repeats/Uniq regions (repeat-match based) : max uniq region is 138Kbp (the &amp;quot;best&amp;quot; we can do with short unmated reads) [[Media:Ecoli.K12-K12.png]]&lt;br /&gt;
&amp;lt;pre style=&amp;quot;background:yellow&amp;quot;&amp;gt;&lt;br /&gt;
  .                    elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  repeats.36bp+        701        36     42     57     132    3903       197.46     848        138420(2.9%)           &lt;br /&gt;
  repeats.100bp+       202        100    155    271    756    3903       550        1207       111011(2.4%) &lt;br /&gt;
  repeats.200bp+       126        200    290    541    1208   3903       794.88     1221       100155(2.1%)&lt;br /&gt;
  uniq                 646        1      10     84     8366   138200*    6970.60    28028      4503010  &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
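The n50 column used throughout these stats tables is the length L such that sequences of length at least L cover half of the total bases; a minimal sketch:&lt;br /&gt;

```python
# n50 as reported in the stats tables: the length L such that sequences
# of length >= L account for at least half of the summed length.
def n50(lengths):
    total = sum(lengths)
    running = 0
    for L in sorted(lengths, reverse=True):
        running += L
        if running * 2 >= total:
            return L

print(n50([100, 200, 300, 400]))   # 300 (400 + 300 = 700 >= 1000 / 2)
```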
&lt;br /&gt;
* Longest repeat: 3903 bp : 3 complete + 8 partial (33%+) copies&lt;br /&gt;
  223979    227889  | 3903 2    | 3911 3902 | 99.23  | 4639675 3903 | 0.08 99.97  | NC_000913 NC_000913.3422674-3426576 [CONTAINS] &lt;br /&gt;
  2725076   2727319 | 2    2245 | 2244 2244 | 99.60  | 4639675 3903 | 0.05 57.49  | NC_000913 NC_000913.3422674-3426576 &lt;br /&gt;
  2727589   2728971 | 2521 3903 | 1383 1383 | 100.00 | 4639675 3903 | 0.03 35.43  | NC_000913 NC_000913.3422674-3426576 &lt;br /&gt;
  3422674   3426576 | 1    3903 | 3903 3903 | 100.00 | 4639675 3903 | 0.08 100.00 | NC_000913 NC_000913.3422674-3426576 [CONTAINS] &lt;br /&gt;
  3940039   3941416 | 3903 2526 | 1378 1378 | 99.64  | 4639675 3903 | 0.03 35.31  | NC_000913 NC_000913.3422674-3426576 &lt;br /&gt;
  3941605   3943857 | 2245 2    | 2253 2244 | 99.33  | 4639675 3903 | 0.05 57.49  | NC_000913 NC_000913.3422674-3426576 &lt;br /&gt;
  4033762   4037673 | 3903 2    | 3912 3902 | 99.41  | 4639675 3903 | 0.08 99.97  | NC_000913 NC_000913.3422674-3426576 [CONTAINS] &lt;br /&gt;
  4164890   4166276 | 3903 2518 | 1387 1386 | 99.57  | 4639675 3903 | 0.03 35.51  | NC_000913 NC_000913.3422674-3426576 &lt;br /&gt;
  4166542   4168794 | 2245 2    | 2253 2244 | 99.11  | 4639675 3903 | 0.05 57.49  | NC_000913 NC_000913.3422674-3426576 &lt;br /&gt;
  4206378   4207764 | 3903 2518 | 1387 1386 | 99.57  | 4639675 3903 | 0.03 35.51  | NC_000913 NC_000913.3422674-3426576 &lt;br /&gt;
  4207944   4210196 | 2245 2    | 2253 2244 | 99.20  | 4639675 3903 | 0.05 57.49  | NC_000913 NC_000913.3422674-3426576 &lt;br /&gt;
&lt;br /&gt;
* Ecoli K12 Repeats (NCBI annotation)&lt;br /&gt;
  .                    elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  repeats              399        14     32     82     111    1443       191.69     980        76485          # much fewer than the ones I found&lt;br /&gt;
  genes                4485       14     465    813    1218   7077       922.17     1188       4135942        # ~ 62 genes (42Kb) CONTAINED by 42 annotated repeats&lt;br /&gt;
&lt;br /&gt;
* Ecoli K12 kmers&lt;br /&gt;
  23mers: most frequent occurrence is 77&lt;br /&gt;
  31mers:                             46 &lt;br /&gt;
  41mers:                             11 &lt;br /&gt;
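The per-k maximum occurrence counts above come from straightforward k-mer counting; a toy sketch on a made-up sequence:&lt;br /&gt;

```python
from collections import Counter

# k-mer counting as in the "Ecoli K12 kmers" note above: the maximum
# count over all k-mers grows as k shrinks, because shorter k-mers are
# more likely to recur inside repeats.
def kmer_counts(seq, k):
    return Counter(seq[i:i+k] for i in range(len(seq) - k + 1))

seq = "ACGTACGTACGT"                      # toy sequence, not E. coli
counts = kmer_counts(seq, 4)
print(counts.most_common(1))              # [('ACGT', 3)]
```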
&lt;br /&gt;
* Ecoli K12 unique regions (Example)&lt;br /&gt;
  1745081-1952189&lt;br /&gt;
                                                                                                              # ~ 132 genes (75K) CONTAINED by 79 repeat-match repeats      &lt;br /&gt;
* Location on disk:&lt;br /&gt;
  /fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/      &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Read stats ===&lt;br /&gt;
&lt;br /&gt;
* 454 SRA&lt;br /&gt;
  .                    elem       min    max   mean    n50   sum         cvg     mea;std  q20Pos   SRA accession(s)&lt;br /&gt;
  454                  257,477    52     943   262     263   67640102    14.57   .                [http://www.ncbi.nlm.nih.gov/sra?term=SRX012358&amp;amp;report=full SRX012358],[http://www.ncbi.nlm.nih.gov/sra?term=SRX012359&amp;amp;report=full SRX012359]&lt;br /&gt;
  454 paired           322,386    4      831   185     248   59677025    12.86   3500             [http://www.ncbi.nlm.nih.gov/sra?term=SRX000348&amp;amp;report=full SRX000348] (different study,only run) &lt;br /&gt;
&lt;br /&gt;
* Illumina SRA &lt;br /&gt;
  .                    elem       min    max   mean    n50   sum         cvg     mea;std  q20Pos   SRA accession(s)&lt;br /&gt;
  Illumina             6,725,378  36     36    36      36    242113608   52.18   .        30      [http://www.ncbi.nlm.nih.gov/sra?term=SRX007752&amp;amp;report=full SRX007752] [[Media:SRX007752.qual.summary.png|qual plot]]&lt;br /&gt;
  Illumina paired(37)  9,928,004  37     37    37      37    367336148   79.17   178      30;13   [http://www.ncbi.nlm.nih.gov/sra?term=SRX012850&amp;amp;report=full SRX012850] [[Media:SRX012850.qual.summary.png|qual plot]]   &lt;br /&gt;
  Illumina paired(101) 20,635,060 101    101   101     101   .           449     168      50;13   [http://www.ncbi.nlm.nih.gov/sra/SRX016044?report=full SRX016044] [[Media:SRR034509.qual.summary.png|SRR034509 qual plot]]  # based on alignments to ref, mea=~64&lt;br /&gt;
&lt;br /&gt;
  Illumina paired(101) ?       101    101   101     101   .              449     150      50;13    [http://www.ncbi.nlm.nih.gov/sra/ERX008638?report=full ERX008638] &lt;br /&gt;
                                                                                                &lt;br /&gt;
  Illumina paired(36) ?                                                           500             [http://www.ncbi.nlm.nih.gov/sra/SRX000430?report=full SRX000430]&lt;br /&gt;
  Illumina paired(36) ?                                                           200             [http://www.ncbi.nlm.nih.gov/sra/SRX000429?report=full SRX000429]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Illumina company: data described in Illumina Technical Notes; complete dataset received from Amy Bouck-Knight (techsupport@illumina.com)&lt;br /&gt;
  .                    elem       min    max   mean    n50   sum         cvg     mea;std  q20Pos  min;max  &lt;br /&gt;
  Illumina paired(75)  20,304,466 75     75    75      75    1522834950  328     210;10   75;75   180;240    # Oct2010   *0.15 =&amp;gt; 50X     # s_6 ; 53% of reads are error free &lt;br /&gt;
  Illumina paired(35)  6,211,692  35     35    35      35    217409220   46      6054;280 35;35   5214;6894  # Oct2010   *0.60 =&amp;gt; 28X     # s_7 ; 78% of reads are error free &lt;br /&gt;
  Illumina paired(35)  4,851,382  35     35    35      35    169798370   36      8917;700 35;35   6817;11017 # Oct2010   *0.77 =&amp;gt; 28X     # s_4 ; 78% of reads are error free&lt;br /&gt;
&lt;br /&gt;
* Illumina simulated &lt;br /&gt;
  .                    elem       min    max   mean    n50   sum         cvg     mea;std  q20Pos   &lt;br /&gt;
  Illumina paired(75)                                                            210&lt;br /&gt;
  Illumina paired(75)                                                            6054&lt;br /&gt;
&lt;br /&gt;
=== Read alignments === &lt;br /&gt;
&lt;br /&gt;
  # ref: Ecoli K12&lt;br /&gt;
                       total         aligned                          aligned%  program&lt;br /&gt;
  454                  257477        257372                           99.96     nucmer -l 16 -c 32&lt;br /&gt;
  454 paired           322386        286115                           88.75     nucmer -l 16 -c 32&lt;br /&gt;
  Illumina             6725378       4109521                          61.10     soap -g 5 -v 5 &lt;br /&gt;
  Illumina paired      9928004       3819582(3663441 fwd+156141 rev)  38.47     soap -g 5 -v 5&lt;br /&gt;
&lt;br /&gt;
=== 454 (14.57X read cvg) ===&lt;br /&gt;
* DeNovo assembly &lt;br /&gt;
  # ctg stats&lt;br /&gt;
                              ctgs   min   q1      q2      q3       max      mean    n50      sum         seqs    singl   0cvgSum  snps  breaks     comments&lt;br /&gt;
  CA.bog**                    176    1005  6116    17855   37563    169694   25899   44290    4558385     257477  1145    59607    1003  12         only contigs used for alignments; if deg used 0cvgSum=32455&lt;br /&gt;
  newbler.deNovo              274    100   1738    8645    25332    125168   16651   35805    4562444     257477  20255   68423    1242  5          &lt;br /&gt;
  velvet                      41859  45    54      67      96       9132     156     404      6566010(?)  257477  802     200706   45017 37         most uncovered regions &lt;br /&gt;
&lt;br /&gt;
* DeNovo merge assembly (minimus2; singletons were merged in the contigs file)&lt;br /&gt;
  # ctg stats &lt;br /&gt;
                        ctgs+sing    min   q1      q2      q3       max      mean    n50      sum         seqs    singl   0cvgSum  snps  breaks     comments&lt;br /&gt;
  CA.bog-newbler.deNovo       135    100   2036    19279   52988    224358   35628   79466    4809817     450     0       47585    1011  15&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Reference based assembly (Ecoli K12)&lt;br /&gt;
  # ctg stats &lt;br /&gt;
                              ctgs   min   q1      q2      q3       max      mean    n50      sum         seqs    singl   0cvgSum  snps  breaks     comments&lt;br /&gt;
  AMOScmp.alignmentTrimmed**  8      721   178178  548382  1208791  1574740  579935  1208791  4639482     257477  11472   48       1155  0          nucmer alignment trimmed reads&lt;br /&gt;
  AMOScmp.orig                269    243   4062    10640   23598    146792   17230   31488    4635098     257477  59369   9508     3338  1          original reads  &lt;br /&gt;
  AMOScmp.lucy                181    256   8180    17089   34961    161650   25608   40906    4635189     257477  45663   5839     2393  1          Lucy quality trimmed reads&lt;br /&gt;
  AMOScmp.OBT                 59     386   25713   51940   127336   280148   78607   135820   4637865     257477  723     1524     926   0          OBT trimmed reads&lt;br /&gt;
 &lt;br /&gt;
  newbler.refMapper           70     196   770     33983   88976    327077   65412   178569   4578847     257477  143     25805    1985  0          most uncovered regions ???&lt;br /&gt;
&lt;br /&gt;
* Reference based K12 &amp;amp; deNovo merge (minimus2; singletons were merged in the contigs file) : not much help&lt;br /&gt;
  # ctg stats&lt;br /&gt;
                        ctgs+sing    min   q1      q2      q3       max      mean    n50      sum         seqs    singl   0cvgSum  snps  breaks     comments&lt;br /&gt;
  CA.bog-AMOScmp.at.K12       24     694   27931   52358   62326    2123245  218921  1554570  5254104     184     0       122      1346  13&lt;br /&gt;
&lt;br /&gt;
* Reference based assembly stats (Ecoli ATCC) &lt;br /&gt;
                              ctgs   min   q1      q2      q3       max      mean    n50      sum         seqs    singl   0cvgSum  snps  breaks     comments&lt;br /&gt;
  AMOScmp.alignmentTrimmed*   199    34    695     5745    26504    292814   21784   75690    4335019     257477  28327   293438   2304  23         breaks are adjacent (tandem repeats?)&lt;br /&gt;
  AMOScmp.OBT                 314    77    914     6225    16704    166680   13756   37619    4319673     257477  20556   307393   1645  8          breaks are adjacent (tandem repeats?)&lt;br /&gt;
 &lt;br /&gt;
  newbler.refMapper           213    105   615     6653    25594    211153   19987   57867    4257433     257477  15725   333122   2385  46         breaks are adjacent (tandem repeats?)&lt;br /&gt;
 &lt;br /&gt;
* Reference based assembly (Ecoli 536)&lt;br /&gt;
  # ctg stats &lt;br /&gt;
                              ctgs   min   q1      q2      q3       max      mean    n50      sum         seqs    singl   0cvgSum  snps  breaks  comments&lt;br /&gt;
  AMOScmp.alignmentTrimmed    601    77    1080    3820    8977     61585    6525    12044    3921648     257477  46572   701339   1406  5        &lt;br /&gt;
  newbler.refMapper           476    101   747     4112    10299    97951    8184    18241    3895796     257477  35796   729086   3004  77&lt;br /&gt;
&lt;br /&gt;
* Read trimming&lt;br /&gt;
  .                    elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  CA.ORIG,CLR          252939     68     252    261    272    823        263.10     263        66547020       &lt;br /&gt;
  CA.OBTINITIAL        252939     67     250    260    271    724        260.52     261        65894538       &lt;br /&gt;
  CA.OBTCHIMERA,LATEST 252939     64     220    244    257    610        231.39     247        58528503       &lt;br /&gt;
 &lt;br /&gt;
  AMOScmp.K12          257372     33     241    252    263    342        246.44     254        63425475&lt;br /&gt;
  AMOScmp.ATCC         242665     32     239    252    262    337        244.32     253        59286793&lt;br /&gt;
  AMOScmp.536          223903     32     237    250    261    342        241.07     252        53977199&lt;br /&gt;
&lt;br /&gt;
* Repeat stats &lt;br /&gt;
                                  elem min  q1  q2  q3   max   mean    n50  sum&lt;br /&gt;
  genome                           701  36  42  57  132  3903  197.46  848  138420&lt;br /&gt;
 &lt;br /&gt;
  CA.bog                           516  36  40  49  78   848   72.84   79   37585   # many repeats end up in degenerates&lt;br /&gt;
  newbler.deNovo*                  576  36  41  52  88   1409  106.82  162  61528   # &amp;quot;long&amp;quot; repeats have only 1 copy in the assembly&lt;br /&gt;
  velvet                           123  36  39  43  56   128   51.82   48   6374    &lt;br /&gt;
 &lt;br /&gt;
  AMOScmp.alignmentTrimmed.K12*    680  36  42  59  133  3903  200.92  855  136627  &lt;br /&gt;
  AMOScmp.OBT.K12                  675  36  42  58  132  3903  201.00  926  135676  &lt;br /&gt;
  newbler.refMapper.K12            598  36  41  52  88   569   82.42   95   49288   # surprisingly &amp;quot;bad&amp;quot; for a comparative assembler&lt;br /&gt;
 &lt;br /&gt;
  AMOScmp.alignmentTrimmed.ATCC    601  36  42  58  121  3903  179.94  575  108146  &lt;br /&gt;
  AMOScmp.OBT.ATCC                 583  36  42  58  123  3903  182.45  581  106370  &lt;br /&gt;
  newbler.refMapper.ATCC           536  36  41  53  95   1208  106.75  156  57217   &lt;br /&gt;
&lt;br /&gt;
* Comments:&lt;br /&gt;
** all repeats are CONTAINED &lt;br /&gt;
** Many gaps match the repeats&lt;br /&gt;
** Ref based assemblies to Ecoli ATCC &amp;amp; 536 contain many breaks &lt;br /&gt;
** velvet assembly very fragmented; many regions covered by multiple contigs&lt;br /&gt;
&lt;br /&gt;
* Viewers&lt;br /&gt;
[[Media:454.Ecoli.gff.png|Apollo]]&lt;br /&gt;
&lt;br /&gt;
=== 454 paired (12.86X cvg) on FASTQ files === &lt;br /&gt;
&lt;br /&gt;
* DeNovo assembly&lt;br /&gt;
  # contig stats&lt;br /&gt;
                              ctgs   min   q1      q2      q3       max      mean    n50      sum         seqs    singl   0cvgSum  snps  breaks     comments&lt;br /&gt;
  CA.bog*                     351    96    4154    9544    16800    96947    13097   21709    4597092     322386  4318    30577    1227  21         # error in the orig .frg files (FWD-FWD SRA lib =&amp;gt; outie mates)  &lt;br /&gt;
  newbler.deNovo              801    100   1402    3701    7530     57354    5645    9907     4522066     322386  1094    86176    640   4&lt;br /&gt;
 &lt;br /&gt;
 # scaffold stats&lt;br /&gt;
                              scf    min   q1      q2      q3       max      mean    n50      sum&lt;br /&gt;
  CA.bog*                     15     292   90673   198619  447960   882752   308205  707016   4623076&lt;br /&gt;
  newbler.deNovo              23     2062  2637    100587  210519   1824862  203995  598401   4691889                &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  [[Media:Ecoli.454-454p.CA.bog.qc| 454 vs 454 paired CA.bog assembly qc]]&lt;br /&gt;
&lt;br /&gt;
* Reference based assembly(Ecoli K12)  &lt;br /&gt;
  # contig stats&lt;br /&gt;
                              ctgs   min   q1      q2      q3       max      mean    n50      sum         seqs    singl   0cvgSum  snps  breaks     comments&lt;br /&gt;
  AMOScmp.alignmentTrimmed*   94     106   10818   31335   80022    303035   49345   90573    4638465     322386  40727   2552     848   2&lt;br /&gt;
  AMOScmp                     506    53    2633    6051    12524    61278    9156    16014    4633422     &lt;br /&gt;
  newbler.refMapper           114    100   7026    31322   55148    203278   40126   80905    4574379                                                # error: annotation missing from FASTA headers &lt;br /&gt;
  newbler.refMapper.redo      801    100   1402    3701    7530     57354    5645    9907     4522066     322382  1094    86176    633   4           # much worse than newbler.refMapper ; 51% mates same ctg, 36% link , 11% FalsePair !!!&lt;br /&gt;
&lt;br /&gt;
* Repeat stats &lt;br /&gt;
                                  elem min  q1  q2  q3   max   mean    n50  sum&lt;br /&gt;
  genome                          701  36  42  57  132  3903  197.46  848  138420&lt;br /&gt;
 &lt;br /&gt;
  CA.bog.redo*                    643  36  42  56  140  3903  202.61  926  130277   !!!  longest repeat(3903 bp)     : 1 out of 3 copies present in assembly; &lt;br /&gt;
                                                                                         2nd longest repeat(3240 bp) : 6 out of 7 copies present in assembly; &lt;br /&gt;
  newbler.deNovo.redo             534  36  41  51  85   1811  116.04  245  61965   &lt;br /&gt;
  velvet                          194  36  39  47  69   848   67.57   69   13108&lt;br /&gt;
 &lt;br /&gt;
  AMOScmp.alignmentTrimmed.K12*   681  36  42  59  132  3903  200.68  855  136664  &lt;br /&gt;
  newbler.refMapper.K12.redo      534  36  41  51  85   1811  116.04  245  61965&lt;br /&gt;
&lt;br /&gt;
=== 454 paired (12.86X cvg) on SFF files === &lt;br /&gt;
&lt;br /&gt;
  # contig stats&lt;br /&gt;
                                  elem  min   q1    q2    q3     max    mean   n50    sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+  &lt;br /&gt;
  CA.6.1                          351   6     3713  8594  17364  83842  13135  23470  4610499  47286   28839   54065   658     31            2             &lt;br /&gt;
  newbler.2.3                     477   1069  3507  6708  13167  61181  9287   14368  4429856  212847  160916  212847  268     13            3             &lt;br /&gt;
  newbler.2.5                     549   526   2619  5701  11142  72211  8176   13560  4488866  157609  116612  157609  226     11            1       &lt;br /&gt;
&lt;br /&gt;
  # scaff stats&lt;br /&gt;
                                  elem  min    q1     q2      q3      max      mean    n50     sum      q.0cvg  r.0cvg  1.0cvg  1.snps  1.breaks.all  1.breaks.1k+  &lt;br /&gt;
  CA.6.1                          14    30345  67631  110633  659844  1216418  331422  686116  4639911  47272   28832   54051   660     366           10            &lt;br /&gt;
  newbler.2.3                     17    2023   5481   126476  204123  1958493  275782  705166  4688295  212847  160916  212847  268     474           101           &lt;br /&gt;
  newbler.2.5                     13    2060   36701  144055  517319  2057185  364272  711945  4735538  157609  116612  157609  226     548           61&lt;br /&gt;
&lt;br /&gt;
=== Illumina.36len.52X ===&lt;br /&gt;
&lt;br /&gt;
* Input reads&lt;br /&gt;
  6,725,378 x 36bp&lt;br /&gt;
&lt;br /&gt;
* DeNovo assembly&lt;br /&gt;
  # ctg stats&lt;br /&gt;
                              ctgs  min  q1      q2      q3       max      mean    n50      sum      seqs     singl    0cvgSum snps  breaks&lt;br /&gt;
  edena                       7165  100  224     417     744      4375     568     819      4073189  6725378  4983572  591013  95    1&lt;br /&gt;
 &lt;br /&gt;
  velvet(k=21)*               1694  41   78      988     3856     35320**  2692    7060*    4561195  6725378  2662898  10859   973   8 &lt;br /&gt;
  velvet1.cor(k=21)**         1499  41   71      905     4390     34836*   3043    8038**   4562221  6725378                               # Oct2010 :longer&lt;br /&gt;
 &lt;br /&gt;
  SOAPdenovo                  5743  24   27      129     1046     13161    812     2503     4668239  6725378  3688149  99800   692   1&lt;br /&gt;
  SOAPdenovo.cor              3755  24   27      69      1390     30790    1232    4941     4628933  6725378&lt;br /&gt;
&lt;br /&gt;
* Reference based assembly(Ecoli K12)  &lt;br /&gt;
                              ctgs  min  q1      q2      q3       max      mean    n50      sum      seqs     singl    0cvgSum snps  breaks &lt;br /&gt;
  AMOScmp                     14    674  68206   245583  547319   1199735  331493  548739   4640911  6725378  4099119  16      1417  4          # only 3664073(54%) of seqs aligned to ref =&amp;gt; 28X cvg&lt;br /&gt;
  maq**                       6     765  145073  727066  1444956  1976524  773274  1444956  4639645  6725378  4116513  23      23    0&lt;br /&gt;
&lt;br /&gt;
* Reference based assembly(Ecoli ATCC) : sharp decrease in assembly quality  &lt;br /&gt;
                              ctgs  min  q1      q2      q3       max      mean    n50      sum      seqs     singl    0cvgSum snps  breaks&lt;br /&gt;
  AMOScmp*                    1530  36   95      532     2933     50248    2828    10506    4328273  6725378  4362159  329400  2837  29         # only 3281823(48%) of seqs aligned to ref =&amp;gt; 25X cvg&lt;br /&gt;
  maq                         1600  14   100     512     2670     51703    2694    10283    4311361  6725378  4415635  349521  1047  37&lt;br /&gt;
&lt;br /&gt;
* Merged assembly&lt;br /&gt;
&amp;lt;pre style=&amp;quot;background:yellow&amp;quot;&amp;gt;&lt;br /&gt;
  .                           elem  min  q1      q2     q3     max        mean       n50       sum &lt;br /&gt;
  velvet                      1694  41   78      988    3856   35320      2692      7060       4561195        &lt;br /&gt;
  AMOScmp.ATCC                1530  36   95      532    2933   50248      2828      10506      4328273        &lt;br /&gt;
  velvet-AMOScmp.ATCC         418   41   100     1110   9918   208338     11292     51437      4720083    &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Repeat stats &lt;br /&gt;
                 elem min  q1  q2  q3   max   mean    n50  sum&lt;br /&gt;
  genome          701  36  42  57  132  3903  197.46  848  138420&lt;br /&gt;
 &lt;br /&gt;
  edena           57   36  53  75  156  756   139.07  244  7927    &lt;br /&gt;
  SOAPdenovo      130  36  48  67  139  570   118.14  202  15358   &lt;br /&gt;
  velvet*         182  36  48  65  139  855   120.25  229  21885   &lt;br /&gt;
 &lt;br /&gt;
  AMOScmp.K12     681  36  42  59  132  3903  200.68  855  136664  &lt;br /&gt;
  maq.K12         670  36  42  60  140  3903  203.30  855  136213  &lt;br /&gt;
 &lt;br /&gt;
  AMOScmp.ATCC    590  36  42  57  112  3026  164.80  501  97230   ??? max repeat not assembled &lt;br /&gt;
  maq.ATCC        579  36  42  58  122  2029  164.34  479  95154&lt;br /&gt;
&lt;br /&gt;
* Comments&lt;br /&gt;
** ACGATGTGACGTACGCGTATGCTCGTATACACACGC appears more than 231K times in the input; does not align to Ecoli or any other genomes&lt;br /&gt;
&lt;br /&gt;
=== Illuminap.37len.178ins.79X  ===&lt;br /&gt;
&lt;br /&gt;
* DeNovo assembly&lt;br /&gt;
  # ctg stats&lt;br /&gt;
                ctgs   min q1   q2   q3   max   mean    n50  sum      seqs     singl   0cvgSum snps  breaks&lt;br /&gt;
  SOAPdenovo    33346  24  24   25   124  6125  155.08  596  5171181  9619686  7111896  84198  1929  3    &lt;br /&gt;
  velvet        8080   45  178  364  721  7544  561.01  908  4532926  9619686  6199264  95868  1214  2&lt;br /&gt;
&lt;br /&gt;
=== Illuminap.101len.150ins.173X ===&lt;br /&gt;
&lt;br /&gt;
* INPUT: &lt;br /&gt;
  reads: 8,000,000&lt;br /&gt;
  mates: 4,000,000&lt;br /&gt;
  clr:   101bp&lt;br /&gt;
&lt;br /&gt;
* OUTPUT:&lt;br /&gt;
  #ctg stats:&lt;br /&gt;
  .                    elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  CA(OBT)              747        207    2204   4114   7508   35940      5962.97    8812       4454336&lt;br /&gt;
  SOAPdenovo           2789825    100    101    101    101    1116       101.32     101        282665113            &lt;br /&gt;
  velvet               73160      45     45     45     88     2282       87.06      117        6369253        &lt;br /&gt;
&lt;br /&gt;
  #scf stats&lt;br /&gt;
                       elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  CA                   601        1016   2533   4817   9024   38235      7416.40    11317      4457256 &lt;br /&gt;
  SOAPdenovo           2789825    100    101    101    101    1116       101.32     101        282665113&lt;br /&gt;
  velvet&lt;br /&gt;
&lt;br /&gt;
=== Illuminap.101len.150ins.59X.trimmed ===&lt;br /&gt;
&lt;br /&gt;
* Quality trimming:&lt;br /&gt;
** N trimming:   reads containing N&#039;s were discarded&lt;br /&gt;
** Q20 trimming: leading/trailing bases below Q20 were deleted&lt;br /&gt;
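A minimal sketch of the two rules above (assumed behavior; the actual trimming script is not shown here):&lt;br /&gt;

```python
def trim_read(seq, quals, min_q=20):
    """Apply the two trimming rules: drop reads containing N,
    then clip leading/trailing bases with quality below min_q."""
    if "N" in seq:
        return None  # N trimming: whole read discarded
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1  # clip low-quality prefix
    while end > start and quals[end - 1] < min_q:
        end -= 1    # clip low-quality suffix
    return seq[start:end]
```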
 &lt;br /&gt;
* INPUT: &lt;br /&gt;
  reads: 3,013,500&lt;br /&gt;
  mates: 1,506,750&lt;br /&gt;
  avgClr:   91bp&lt;br /&gt;
 &lt;br /&gt;
  #ctg stats:&lt;br /&gt;
  .                    elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  CA                   916        211    1989   3357   5990   44653      4712.96    6317       4317070 &lt;br /&gt;
  SOAPdenovo           64582      100    100    101    104    2209       147.10     101        9500101 &lt;br /&gt;
  velvet(k31)          915        61     238    2191   7008   48470**    4970.98    12083**    4548449&lt;br /&gt;
&lt;br /&gt;
  #scf stats&lt;br /&gt;
                       elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  CA                   738        1023   2278   4135   7301   44653      5854.51    8275       4320630&lt;br /&gt;
  SOAPdenovo           60942      100    100    101    101    5705       168.40     125        10262880&lt;br /&gt;
  velvet(k31)          151        633    6518   10088  18197  48338**    13072.10   18197**    1973887&lt;br /&gt;
&lt;br /&gt;
=== Illuminap.101len.150ins.150X.cor ===&lt;br /&gt;
&lt;br /&gt;
* INPUT &lt;br /&gt;
  reads:  8,835,962           #  20,635,060 reads originally&lt;br /&gt;
  mates:  4,417,981&lt;br /&gt;
  avgClr: 79bp&lt;br /&gt;
  mea:    150&lt;br /&gt;
&lt;br /&gt;
  #ctg stats&lt;br /&gt;
  .                    elem       min    q1     q2     q3     max        mean       n50        sum         0cvgSum snps  breaks rearrangements   &lt;br /&gt;
  CA                   470        573    2986   6181   12285  68568      9614.45    15688      4518792     100041  348   20     3 &lt;br /&gt;
  SOAPdenovo100+       596        100    260    3233   10238  73870      7609.43    18141      4535220     91689   25    1         &lt;br /&gt;
  velvet100+           262        103    374    5487   26564  221091**   17346.07   46409**    4544670     14018   405   5&lt;br /&gt;
&lt;br /&gt;
  # scf stats&lt;br /&gt;
  .                    elem       min    q1     q2     q3     max        mean       n50        sum       rearrangements&lt;br /&gt;
  CA                   378        1017   3334   7108   15206  70497      11959.34   21457      4520632   3          &lt;br /&gt;
  SOAPdenovo           165        100    252    2713   35162  410074     27923.05   94982      4607303   0          &lt;br /&gt;
  velvet100+           174        103    198    1919   30793  413929**   26175.74   95373**    4554578   4         &lt;br /&gt;
&lt;br /&gt;
* Reference based assembly(Ecoli K12)  &lt;br /&gt;
                           ctgs  min  q1      q2      q3       max      mean    n50      sum      seqs     singl    0cvgSum snps  breaks &lt;br /&gt;
  AMOSCmp.soap             8     769  299915  345273  1100345  1444961  579962  1100345  4639697                                            # soap2  -r 2  -v 2 -n 0 -l 28 &lt;br /&gt;
&lt;br /&gt;
* Reference based assembly(Ecoli ATCC)  &lt;br /&gt;
                            ctgs  min  q1      q2      q3       max      mean    n50      sum      seqs     singl    0cvgSum snps  breaks &lt;br /&gt;
  AMOSCmp.soap100+          1024  100  431     1348    4811     56348    4199    11899    4300027                                            # soap2  -r 2  -v 2 -n 0 -l 28    =&amp;gt; 0cvg: 1033 regions; 419859 bp&lt;br /&gt;
  AMOSCmp.nucmer100+        428   100  210     1246    12466    233619   10142   35998    4341162                                            # nucmer -c 40 -l 20 -b 200 -g 90 =&amp;gt; 0cvg: 241  regions; 408109 bp&lt;br /&gt;
&lt;br /&gt;
* Merged assembly&lt;br /&gt;
&amp;lt;pre style=&amp;quot;background:yellow&amp;quot;&amp;gt;&lt;br /&gt;
  .                             elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  velvet100+                    262        103    374    5487   26564  221091**   17346      46409**    4544670        &lt;br /&gt;
  AMOSCmp-ATCC.nucmer100+       428        100    210    1246   12466  233619     10142      35998      4341162  &lt;br /&gt;
 &lt;br /&gt;
  velvet-AMOScmp.ATCC.nucm100+  104        114    1382   14570  52256  413910     45545      121678     4736715&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== IlluminaTechnicalNotes (Illumina results) ===&lt;br /&gt;
&lt;br /&gt;
* Metrics: N50 , Max, Genome coverage &lt;br /&gt;
* Factors&lt;br /&gt;
** Cvg:     no improvement beyond 50X&lt;br /&gt;
** ReadLen: reads beyond 36bp (100bp exp.) don&#039;t help&lt;br /&gt;
** Inserts: inserts should be longer than the longest repeat&lt;br /&gt;
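The N50 reported throughout these tables can be computed as follows (a generic sketch, not the exact script used here):&lt;br /&gt;

```python
def n50(lengths):
    """Smallest contig length L such that contigs of length >= L
    together cover at least half of the total assembly span."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```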
&lt;br /&gt;
  insertLen    readLen    readCvg  elem                                   maxScf*               n50Scf**              genomeCvg***       #readsSampled   #mateSampled&lt;br /&gt;
  200          2*75       50X      ?                                      223,000               97,000                99.58              3,045,862(49X)  1,522,931&lt;br /&gt;
  200+6K       2*75+2*35  50X+28X  ?                                      2,100,000             1,300,000             99.07              4,099,798(31X)  2,049,899&lt;br /&gt;
  200+9K       2*75+2*35  50X+28X  ?                                      4,500,000             4,500,000             99.69              3,736,240(28X)  1,868,120&lt;br /&gt;
&lt;br /&gt;
** Quality: best results for N/Failed_Chastity/s35 filter&lt;br /&gt;
&lt;br /&gt;
=== IlluminaTechnicalNotes (CBCB results) ===&lt;br /&gt;
  &lt;br /&gt;
* Compare given qualities with the computed ones&lt;br /&gt;
&lt;br /&gt;
** sample reads&lt;br /&gt;
  ~/bin/sampleFastq12.pl -r 0.0070 -noN -count 10000 ../s_6_1_export.sample.fastq ../s_6_2_export.sample.fastq Ecoli_1.fastq Ecoli_2.fastq&lt;br /&gt;
  fastq2seq.pl &amp;lt; Ecoli_1.fastq &amp;gt;! Ecoli_1.seq&lt;br /&gt;
&lt;br /&gt;
  cat original.count &lt;br /&gt;
  #fwd                            rev                             mates&lt;br /&gt;
  s_2_1_3kb_sequence.txt          s_2_2_3kb_sequence.txt          21563283&lt;br /&gt;
  s_2_1_8kb_sequence.txt          s_2_2_8kb_sequence.txt          198377&lt;br /&gt;
  ...&lt;br /&gt;
  cat original.count | p &#039;next if(/^#/); $c=10000; $r=(($c*1.05)/$F[2]); print &amp;quot;~/bin/sampleFastq12.pl -r $r -count $c  original/$F[0] original/$F[1] original_sample/$F[0] original_sample/$F[1]\n&amp;quot;;&#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
** given qualities =&amp;gt; avg error&lt;br /&gt;
  cat Ecoli_1.fastq | fastx_quality_stats | ~/bin/quality2error.pl -i 5 | nl &amp;gt;!  Ecoli_1.fastq.pos_qual&lt;br /&gt;
  cat Ecoli_1.fastq.pos_qual | getSummary.pl -i 1 | perl ~/bin/transpose.pl | grep mean  =&amp;gt; 0.0019493&lt;br /&gt;
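`quality2error.pl` is not shown, but the conversion it presumably applies is the standard Phred relation:&lt;br /&gt;

```python
def phred_to_error(q):
    """Standard Phred relation: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10.0)

# e.g. Q20 corresponds to 1% expected error, Q30 to 0.1%
```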
&lt;br /&gt;
** estimated qualities: based on alignment to reference: ~0.2% (probably an underestimate)&lt;br /&gt;
&lt;br /&gt;
** alignments snps =&amp;gt; avg error&lt;br /&gt;
  nucmer -maxmatch -l 12 -c 12  ~/Escherichia_coli/Data/1con/Ecoli.K12.1con Ecoli_1.seq -p Ecoli_1&lt;br /&gt;
  cat Ecoli_1.delta | delta-filter-max.pl | ~/bin/deltaextend.pl | show-snps.pl -1 -H  | count.pl -p 10000 -i 3 | sort -n &amp;gt;! Ecoli_1.align.pos_qual&lt;br /&gt;
  perl -e &#039;$n =`wc -l Ecoli_1.filter-q.extended.snps | cut -f1 -d &amp;quot; &amp;quot;`; $e=($n/(75*10000)); print $e,&amp;quot;\n&amp;quot;;&#039;   =&amp;gt; 0.022&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Ecoli.210.png]] [[Image:Ecoli.6k.png]] [[Image:Ecoli.9k.png]]&lt;br /&gt;
&lt;br /&gt;
* Estimate library insert sizes &lt;br /&gt;
          total    mated     single    single.both.aligned   single.both.compressed  single.both.expand&lt;br /&gt;
  210     20000    17780     783       30                    12                      18&lt;br /&gt;
  6k      20000    14360     3222      1434                  560                     856&lt;br /&gt;
  9k      20000    14130     3508      1906                  744                     1156&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Error correction:&lt;br /&gt;
  quake.py -f Ecoli.ls -k 18 -p 16&lt;br /&gt;
  cat Ecoli.ls&lt;br /&gt;
     Ecoli_1.fastq Ecoli_2.fastq &lt;br /&gt;
     Ecoli_1.6k.fastq Ecoli_2.6k.fastq &lt;br /&gt;
     Ecoli_1.9k.fastq Ecoli_2.9k.fastq &lt;br /&gt;
 &lt;br /&gt;
  original(mates)  corrected(mates)  corrected(fwd)  corrected(rev)  deleted&lt;br /&gt;
  1,522,931        1,241,547         138,323         88,984          335,461  &lt;br /&gt;
  2,049,899        1,665,847         196,868         171,832         399,404&lt;br /&gt;
  1,868,120        1,528,727         173,005         146,659         359,122&lt;br /&gt;
&lt;br /&gt;
* velvet &lt;br /&gt;
&lt;br /&gt;
  #scf100+ stats&lt;br /&gt;
  .                                elem       min    q1     q2     q3     maxScf*    mean       n50Scf**   sum        genomeCvg*** &lt;br /&gt;
  200                              177        100    168    1427   31804  325,889    25729      95,446     4554080    99.91           &lt;br /&gt;
  200+6K                           79         100    121    182    623    1,591,085  59027      1,590,440  4663117&lt;br /&gt;
  200+9k                           78         100    130    190    734    1,637,906  59820      1,093,506  4665962&lt;br /&gt;
&lt;br /&gt;
  #ctg100+ stats&lt;br /&gt;
  insertLen                        elem       min    q1     q2     q3     maxCtg     mean       n50Ctg     sum            &lt;br /&gt;
  200                              312        100    596    5818   22311  110,215    14579      36,329     4548539        &lt;br /&gt;
  200+6K                           259        100    475    7990   28357  165,821    17911      42,905     4638953&lt;br /&gt;
  200+9k                           248        100    537    8388   25761  186,539    18648      45,601     4624717&lt;br /&gt;
&lt;br /&gt;
  # cvg computed using : nucmer -maxmatch Ecoli.1con Ecoli.ctg.fasta -p 1con-ctg  ; delta2cvg.pl &amp;lt; 1con-ctg.delta -M 0 | getSummary.pl -i 4  &lt;br /&gt;
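The genome-coverage figure amounts to merging alignment intervals on the reference and taking the covered fraction (a sketch of the idea behind the `delta2cvg.pl` step; coordinates are illustrative):&lt;br /&gt;

```python
def covered_fraction(intervals, genome_len):
    """Merge half-open [start, end) alignment intervals and return
    the fraction of the reference covered by at least one interval."""
    covered = 0
    cur_s = cur_e = None
    for s, e in sorted(intervals):
        if cur_e is None or s > cur_e:   # gap: flush the previous run
            if cur_e is not None:
                covered += cur_e - cur_s
            cur_s, cur_e = s, e
        else:                            # overlap: extend the run
            cur_e = max(cur_e, e)
    if cur_e is not None:
        covered += cur_e - cur_s
    return covered / genome_len
```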
&lt;br /&gt;
* CA &lt;br /&gt;
  &lt;br /&gt;
  #scf100+ stats&lt;br /&gt;
  .                                elem       min    q1     q2     q3     maxScf     mean       n50        sum            &lt;br /&gt;
  200                              389        1037   3683   7477   13796  71,906     10484      15837      4078105       # minRead=64;minOvl=40;minUtg=1000 (default)&lt;br /&gt;
  200.redo                         420        208    2443   4372   8100   39,715     6343       9202       2664254       # minRead=32;minOvl=30;minUtg=500&lt;br /&gt;
  200+6k                           ?                                                                                     # minRead=32;minOvl=30;minUtg=500  : run for a very long time ; killed&lt;br /&gt;
  200+6k.redo2                     166        75     1298   1904   8579   101,429    6257       14354      1038602       # minRead=32;minOvl=30;minUtg=500;utgErrorRate=0.03 ; lots of degenerates&lt;br /&gt;
&lt;br /&gt;
  #ctg100+ stats&lt;br /&gt;
  .                                elem       min    q1     q2     q3     maxCtg     mean       n50        sum            &lt;br /&gt;
  200                              657        104    2570   4773   8167   39,986     6199       8626       4072659 &lt;br /&gt;
  200.redo                         454        75     2322   4153   7505   35,904     5866       8744       2663112 &lt;br /&gt;
  200+6k                           ?                                                                                     # minRead=32;minOvl=30;minUtg=500  : run for a very long time ; killed&lt;br /&gt;
  200+6k.redo2                     612        35     215    1132   1827   4,994      1195       1904       731055        # minRead=32;minOvl=30;minUtg=500;utgErrorRate=0.03  ; lots of degenerates&lt;br /&gt;
&lt;br /&gt;
* Note&lt;br /&gt;
** Shorter velvet ctgs/scf on corrected reads: the input reads were already high quality!&lt;br /&gt;
&lt;br /&gt;
=== IlluminaTechnicalNotes sim data err=0.0 ===&lt;br /&gt;
&lt;br /&gt;
  insertLen    readLen    readCvg     &lt;br /&gt;
  200+6K       2*75+2*35  50X+28X     &lt;br /&gt;
  200+6K       2*75+2*75  50X+28X     &lt;br /&gt;
&lt;br /&gt;
* velvet &lt;br /&gt;
  #scf100+ stats&lt;br /&gt;
  .               elem       min    q1     q2     q3     max        mean       n50        sum        &lt;br /&gt;
  velvet.35       72         100    133    182    531    4636779    64823      4636779    4667231&lt;br /&gt;
  velvet.75       72         100    129    182    531    4636870    64822      4636870    4667211    &lt;br /&gt;
&lt;br /&gt;
  #ctg100+ stats&lt;br /&gt;
  .               elem       min    q1     q2     q3     max        mean       n50        sum        &lt;br /&gt;
  velvet.35       154        100    191    1919   42330  263156     30158      93526      4644318&lt;br /&gt;
  velvet.75       145        100    178    1331   45218  263151     32092      98728      4653398&lt;br /&gt;
&lt;br /&gt;
* CA&lt;br /&gt;
** CA.35 : cgw would take days to complete; Solution: delete 526,644 reads that assemble into single fragment unitigs (7% of total reads)&lt;br /&gt;
&lt;br /&gt;
  #scf100+ stats&lt;br /&gt;
  .               elem       min    q1     q2     q3     max        mean       n50        sum   &lt;br /&gt;
  CA.35           20         7383   81163  122901 359340 729797     222118     409560     4442362     &lt;br /&gt;
  CA.75           1          4640849                     4640849                                     &lt;br /&gt;
&lt;br /&gt;
  #ctg100+ stats&lt;br /&gt;
  CA.35           97         6687   14090  29769  57870  326996     44421      67366      4308861&lt;br /&gt;
  CA.75           47         102    14731  68369  128240 515019     98614      176651     4634873&lt;br /&gt;
&lt;br /&gt;
=== Illuminap.64len.475ins.50X ===&lt;br /&gt;
&lt;br /&gt;
* Simulated data : maq simulator : &lt;br /&gt;
  maq simulate -d 475 -s 47 -1 64 -2 64 -N 1812500 -r 0.001 Ecoli_1.fastq Ecoli_2.fastq Ecoli.1con Ecoli.simupars.dat&lt;br /&gt;
  soap2 -D Ecoli.1con.index -a Ecoli_1.fastq -b Ecoli_2.fastq -M 0 -o Ecoli_12.mated.soap2 -2 Ecoli_12.single.soap2 -m 375 -x 575 -p 20 -r 2&lt;br /&gt;
  cat *soap2 | ~/bin/AMOS/soap2posmap.pl | ~/bin/posmap2cvg.pl -M 0 | getSummary.pl -i 4&lt;br /&gt;
 &lt;br /&gt;
  .       elem  min  q1  q2  q3  max  mean  n50  sum   &lt;br /&gt;
  0cvg    1382  1    2   3   5   84   4.71  5    6512  #due to mutation&lt;br /&gt;
&lt;br /&gt;
  reads:    3,625,000  &lt;br /&gt;
  clr:      64 bp&lt;br /&gt;
  mea;std : 475;47 &lt;br /&gt;
&lt;br /&gt;
  #ctg stats&lt;br /&gt;
  .                          elem       min    q1     q2     q3     max        mean       n50        sum  &lt;br /&gt;
  velvet100+                 261        100    441    7429   25373  152351     17468      41524      4559162&lt;br /&gt;
  CA.ctg100+                 0                                                                                #no ctg/scf just deg max 200; is cvg too low?&lt;br /&gt;
 &lt;br /&gt;
  #scf stats&lt;br /&gt;
  velvet100+                 148        100    158    923    33253  327108     30850      132780     4565870&lt;br /&gt;
&lt;br /&gt;
* Reference based assembly(Ecoli K12)  &lt;br /&gt;
                             elem       min    q1     q2     q3     max        mean       n50        sum  &lt;br /&gt;
  AMOSCmp.nucmer100+         1                                      4639681                          4639681 &lt;br /&gt;
&lt;br /&gt;
* Reference based assembly(Ecoli ATCC)  &lt;br /&gt;
                             elem       min    q1     q2     q3     max        mean       n50        sum      &lt;br /&gt;
  AMOSCmp.nucmer100+         510        100    296    1741   11087  105786     8495       27761      4332605             # nucmer -c 28 -l 14              =&amp;gt; 0cvg: 334 regions; 401485 bp&lt;br /&gt;
&lt;br /&gt;
* Merged assembly&lt;br /&gt;
                             elem       min    q1     q2     q3     max        mean       n50        sum      &lt;br /&gt;
  velvet-AMOSCmp.nucmer100+  138        114    1861   17205  50382  268321     39660      105786     5473097             # total length is &amp;gt;&amp;gt; genomeLength (some duplication going on)&lt;br /&gt;
&lt;br /&gt;
=== Illuminap.64len.475ins.50X-64len.8000ins.10X (60X cvg) ===&lt;br /&gt;
&lt;br /&gt;
  reads:    3,625,000  + 725,000&lt;br /&gt;
  clr:      64 bp&lt;br /&gt;
  mea;std : 475;47       8000;800&lt;br /&gt;
&lt;br /&gt;
  #ctg stats&lt;br /&gt;
  .                          elem       min    q1     q2     q3     max        mean       n50        sum  &lt;br /&gt;
  SOAPdenovo100+             576        100    249    3249   9909   73,093     7893       21083      4546797&lt;br /&gt;
  velvet100+                 179        100    223    2955   35078  240,045    25933      83796      4642027 &lt;br /&gt;
&lt;br /&gt;
  #scf stats&lt;br /&gt;
  SOAPdenovo100+             241        100    151    730    6746   492338     19669      219830     4740402  # pair_num_cutoff=default&lt;br /&gt;
  SOAPdenovo100+             213        100    114    151    296    2,142,392  22266      1430222    4742568  # pair_num_cutoff=50&lt;br /&gt;
  velvet100+                 75         100    133    204    623    4,406,037  62160      4406037    4661969&lt;br /&gt;
&lt;br /&gt;
== Staphylococcus aureus  ==&lt;br /&gt;
&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP001086 SRA Study] Staphylococcus aureus Sequencing on Illumina  &lt;br /&gt;
* 22 complete strains: NCBI search:  &amp;quot;Staphylococcus aureus&amp;quot;[Organism] AND pt_default[prop] AND seqstat_complete[PROP]&lt;br /&gt;
* /fs/szdata//ncbi/genomes/Bacteria/Staphylococcus_aureus_*/&lt;br /&gt;
* chromosome lengths 2.7-2.9Mbp&lt;br /&gt;
* plasmids? (2-3)&lt;br /&gt;
* Ref: &lt;br /&gt;
  id                len     gc%    desc                    &lt;br /&gt;
  NC_010079         2872915 32.76  Staphylococcus aureus subsp. aureus USA300_TCH1516, complete genome  (most related to Illumina data set)&lt;br /&gt;
  NC_003923         2820462 32.83  Staphylococcus aureus subsp. aureus MW2, complete genome&lt;br /&gt;
  NC_007622         2742531 32.78  Staphylococcus aureus RF122, complete genome&lt;br /&gt;
 &lt;br /&gt;
  NZ_AASB00000000   2810505 32     Staphylococcus aureus subsp. aureus USA300_TCH959, 256 contigs&lt;br /&gt;
&lt;br /&gt;
  USA300-MW2 &lt;br /&gt;
  .                 elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  %id               2873390    79.28  98.93  99.30  99.54  100.00     98         99         284357849.9    &lt;br /&gt;
  0cvg              124        2      60     430    1357   18282      1149       2539       142568         &lt;br /&gt;
&lt;br /&gt;
* Repeats (Saureus USA300)&lt;br /&gt;
  .                    elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  36bp+                667        36     44     62     98     5016       125        191        83788         &lt;br /&gt;
  100bp+               158        100    119    158    254    5016       348        746        55049         &lt;br /&gt;
  200bp+               56         201    253    324    746    5016       735        1506       41185&lt;br /&gt;
&lt;br /&gt;
* Uniq (Saureus USA300)&lt;br /&gt;
  .                   elem       min     q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  36bp+               623        1       10     75     820    126818     4467       34787      2783139        &lt;br /&gt;
  100bp+              142        1       76     1512   32659  180152     19801      74137      2811878        &lt;br /&gt;
  200bp+              48         1       221    35708  88217  328320     58812      145211     2823011        &lt;br /&gt;
&lt;br /&gt;
=== Read stats ===&lt;br /&gt;
  .                      reads       min    max    mean       n50        sum              cvg      SRA accession(s)&lt;br /&gt;
  454                    391,495     51     857    273        275        107151210        37.29    [http://www.ncbi.nlm.nih.gov/sra?term=SRX002328&amp;amp;report=full SRX002328] strain USA300_TCH959 HMP0023 &lt;br /&gt;
 &lt;br /&gt;
  454 paired             490,520     1      1323   134        182        65822848         22.95    [http://www.ncbi.nlm.nih.gov/sra?term=SRX002327&amp;amp;report=full SRX002327] strain USA300_TCH959 HMP0023 ; NOMINAL_LENGTH=2750 NOMINAL_STDEV=250&lt;br /&gt;
 &lt;br /&gt;
  Illumina               2,739,566   36     36     36         36         98624376         34.32    [http://www.ncbi.nlm.nih.gov/sra?term=SRX007668&amp;amp;report=full SRX007668] [[Media:SRX007668.qual.summary.png| qual plot]]&lt;br /&gt;
  &lt;br /&gt;
  Illumina paired end    8,982,084   36     36     36         36         323355024        112.55   [http://www.ncbi.nlm.nih.gov/sra?term=SRX007696&amp;amp;report=full SRX007696] [[Media:SRX007696.qual.summary.png| qual plot]]  #  NOMINAL_LENGTH=175&lt;br /&gt;
 &lt;br /&gt;
  Illumina jumping lib   17,678,908  37     37     37         37         654119596        227      [http://www.ncbi.nlm.nih.gov/sra/SRX007711?report=full SRX007711] # NOMINAL_LENGTH=300, ORIENTATION=5&#039;3&#039;-3&#039;5&#039;; soap estimate 3538&lt;br /&gt;
  Illumina paired        30,597,352  101    101    101        101        3090332552       1075     [http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full SRX007714] # NOMINAL_LENGTH=300, NOMINAL_SDEV=22.253, ORIENTATION=5&#039;3&#039;-3&#039;5&#039; ; CA estimate mea/std=170/21; strain USA300_TCH1516&lt;br /&gt;
&lt;br /&gt;
=== 454 stats (37.29X read cvg) === &lt;br /&gt;
&lt;br /&gt;
* DeNovo&lt;br /&gt;
  .                       ctgs min  q1     q2      q3      max     mean       n50     sum      &lt;br /&gt;
  CA.6.0.bog              31   1471 16316  51985   142506  318772  89171      157559  2764330  &lt;br /&gt;
  CA.6.1.bog              47   1066 6623   26392   73408   309139  58908      145306  2768666    # fewer degenerates compared to  CA.6.0.bog&lt;br /&gt;
&lt;br /&gt;
  newbler.2.3.deNovo      85   104  203    1176    19635   295958  32888      157201  2795541  &lt;br /&gt;
  newbler.2.5p1.deNovo*   46   105  231    3087    56818   546353* 60730      296002* 2793599&lt;br /&gt;
&lt;br /&gt;
* Reference based(Saureus USA300_TCH1516 (diff strain))&lt;br /&gt;
  .                       ctgs min  q1     q2      q3      max     mean       n50     sum   &lt;br /&gt;
  newbler.2.3.refMapper   183  103  630    3737    18738   117468  14374      42301   2630491&lt;br /&gt;
  newbler.2.5p1.refMapper 186  103  589    3740    17184   117464  14139      43130   2629763&lt;br /&gt;
&lt;br /&gt;
=== 454 paired stats (22.95X read cvg) ===&lt;br /&gt;
&lt;br /&gt;
* Data&lt;br /&gt;
                        reads   min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  sff(original)         334241  36     201    256    277    362        235        264        78432227      &lt;br /&gt;
  sffToCA(adaptor free) 325555  51     103    148    208    358        157        185        51225615   # 58723 mates&lt;br /&gt;
&lt;br /&gt;
* DeNovo&lt;br /&gt;
  # ctg stats&lt;br /&gt;
  .                    ctgs min  q1     q2      q3      max     mean       n50     sum      &lt;br /&gt;
  CA.6.0.bog           11   278  47961  194288  508135  590074* 255272     508135* 2808001  &lt;br /&gt;
  CA.6.1.bog           18   238  21014  135971  239466  567548  155836     277888  2805055&lt;br /&gt;
 &lt;br /&gt;
  newbler.2.3.deNovo   544  100  810    2652    7672    33434   5126       10865   2788673  &lt;br /&gt;
  newbler.2.5p1.deNovo 100  103  295    3287    39467   229053  27879      78379   2787870&lt;br /&gt;
&lt;br /&gt;
  # scf stats&lt;br /&gt;
  .                    scfs min  q1     q2      q3      max     mean       n50     sum            &lt;br /&gt;
  CA.6.0.bog           6    278  47961  194288  1035529 1410221 468016     1410221 2808101        &lt;br /&gt;
  CA.6.1.bog           6    284  21014  173065  1032129 1458733 467554     1458733 2805325&lt;br /&gt;
  &lt;br /&gt;
  newbler.2.3          16   2042 2173   2767    110212  1409442 176373     1037981 2821966&lt;br /&gt;
  newbler.2.5p1.deNovo 8    2475 20731  110137  1030785 1408642 349895     1408642 2799157&lt;br /&gt;
&lt;br /&gt;
* Reference based(Saureus USA300)&lt;br /&gt;
  .                    ctgs min  q1     q2      q3      max     mean       n50     sum      &lt;br /&gt;
  newbler.2.3.refMapper 206 103  556    3098    15366   117487  12749      40687   2626469&lt;br /&gt;
&lt;br /&gt;
=== Illumina stats (34.32X read cvg) ===&lt;br /&gt;
&lt;br /&gt;
* DeNovo &lt;br /&gt;
  .                 ctgs  min  q1     q2     q3      max     mean   n50     sum      seqs     singl    0cvgSum snps  breaks&lt;br /&gt;
  edena             3956  100  207    422    851     13908   656    1040    2597617  2739566  ?        315399  18    1&lt;br /&gt;
  SOAPdenovo*       5814  24   45     164    624     13087   503    1331    2925517  2739566  ?        69547   298   1&lt;br /&gt;
  velvet            3477  45   166    464    1054    10468   812    1528    2825440  2739566  ?        100722  638   17&lt;br /&gt;
&lt;br /&gt;
* Reference based(Saureus USA300)&lt;br /&gt;
  .                 ctgs  min  q1     q2     q3      max     mean   n50     sum      seqs     singl    0cvgSum snps  breaks&lt;br /&gt;
  AMOScmp           81    36   5859   20119  52879   196907  35474  75670   2873401  2739566  1003517  376     649   3        # only 2126008(77%) of seqs aligned to ref =&amp;gt; 26X cvg&lt;br /&gt;
  maq**             41    36   10188  33799  114655  341537  70064  154157  2872626  2739566  955336   369     43    0   &lt;br /&gt;
&lt;br /&gt;
* Reference based(Saureus MW2)&lt;br /&gt;
  .                 ctgs  min  q1     q2     q3      max     mean   n50     sum      seqs     singl    0cvgSum snps  breaks&lt;br /&gt;
  AMOScmp           1199  36   88     366    2000    48748   2268   8866    2720299  2739566  1118637  152285  2369  19      # only 1965762(71%) of seqs aligned to ref =&amp;gt; 24X cvg&lt;br /&gt;
  maq*              1052  36   91     364    2111    64379   2576   10270   2710270  2739566  1100695  167184  2357  21&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Illumina paired stats (112.55X read cvg; avgInsert=~60bp) ===&lt;br /&gt;
* Read alignments&lt;br /&gt;
  73% of the reads align by &amp;quot;soap -g 5 -v 5 -r 2 -f 0&amp;quot;&lt;br /&gt;
  3368102 fwd &lt;br /&gt;
  3210408 rev&lt;br /&gt;
  6578510 total&lt;br /&gt;
&lt;br /&gt;
* DeNovo &lt;br /&gt;
  #ctg stats&lt;br /&gt;
  .                 ctgs  min  q1     q2     q3      max     mean   n50     sum      seqs     singl    0cvgSum snps  breaks&lt;br /&gt;
  edena             4636  100  216    388    707     6754    544    782     2526187  2739566  ?        385985  88    2       # most 0cvg regions&lt;br /&gt;
  SOAPdenovo        6038  24   26     34     47      46791   498    11310   3009291  8982084  ?        48023   38    0  &lt;br /&gt;
  velvet**          705   45   60     84     584     142306  4046   33811   2852996  8982084  2898483  39096   854   12&lt;br /&gt;
&lt;br /&gt;
  #scf stats&lt;br /&gt;
  .                 scfs  min  q1     q2     q3      max     mean   n50     sum            &lt;br /&gt;
  SOAPdenovo        482   100  270    1532   6719    58192   5849   18859   2819512  &lt;br /&gt;
  velvet            12    333  1111   27392  53828   142405  31488  78092   377866   # 2+ctg scaffolds        &lt;br /&gt;
&lt;br /&gt;
* Reference based(Saureus USA300) &lt;br /&gt;
  .                 ctgs  min     q1       q2       q3       max      mean    n50      sum      seqs     singl    0cvgSum snps  breaks&lt;br /&gt;
  maq**             1     2872913 2872913  2872913  2872913  2872913  2872913 2872913  2872913  8982084  2614519  2       3     0   &lt;br /&gt;
 &lt;br /&gt;
  Comments:&lt;br /&gt;
  * assembles in one piece&lt;br /&gt;
  * first &amp;amp; last bases in ref not covered&lt;br /&gt;
  * there is no read = ref.1-36 or ref.2872880-2872915&lt;br /&gt;
&lt;br /&gt;
* Reference based(Saureus MW2)&lt;br /&gt;
  .                ctgs  min  q1     q2     q3      max     mean   n50     sum      seqs     singl    0cvgSum snps  breaks&lt;br /&gt;
  maq              695   36   89     399    3304    77110   3914   17814   2720683  8982084  2801829  158246  4132  57&lt;br /&gt;
&lt;br /&gt;
=== Illumina101 paired stats (cvg 78X; givenInsert=300 ; computedInsert=170) ===&lt;br /&gt;
&lt;br /&gt;
* DeNovo &lt;br /&gt;
  # ctg stats&lt;br /&gt;
  .                    ctgs       min     q1       q2      q3      max        mean      n50        sum     &lt;br /&gt;
  CA.bog               73         1716    14503    26431   52086   148524*    39043     58038*     2850157 &lt;br /&gt;
  SOAPdenovo           9382       32      32       52      63      85850      347       16726      3259522&lt;br /&gt;
  velvet               453        61      85       224     3279    137163     6297      36496      2852552 &lt;br /&gt;
&lt;br /&gt;
  # scf stats&lt;br /&gt;
  .                    scf        min      q1      q2      q3      max        mean      n50        sum             &lt;br /&gt;
  CA.bog               67         1716     14869   32929   58038   148524*    42541     65383*     2850277&lt;br /&gt;
  SOAPdenovo           186        100      333     1528    17623   144079     15625     55558      2906207&lt;br /&gt;
  velvet               427        61       82      177     2982    137163     6685      37874      2854649&lt;br /&gt;
&lt;br /&gt;
* Comparative&lt;br /&gt;
&lt;br /&gt;
== Pseudomonas aeruginosa ==&lt;br /&gt;
&lt;br /&gt;
* PLoS article &lt;br /&gt;
  From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, &lt;br /&gt;
  including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides.&lt;br /&gt;
&lt;br /&gt;
=== Read stats ===&lt;br /&gt;
&lt;br /&gt;
  .                    elem       min    q1     q2     q3     max        mean       n50        sum        cvg     &lt;br /&gt;
  original             8627900    33     33     33     33     33         33.00      33         284720700  43.55X cvg    &lt;br /&gt;
  cor(quake)           7401074    30     33     33     33     33         32.75      33         242393133  37X    cvg&lt;br /&gt;
&lt;br /&gt;
=== Assembly stats ===&lt;br /&gt;
&lt;br /&gt;
* DeNovo&lt;br /&gt;
&lt;br /&gt;
                          elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  velvet(k=23)            7821       45     214    502    1108   16239      862.12     1541       6742621    !!! previous &lt;br /&gt;
  velvet1.cor(k=21)**     5946       41     179    574    1472   20233      1129.77    2362       6717612    !!! new best&lt;br /&gt;
&lt;br /&gt;
* Comparative PA14&lt;br /&gt;
  .                       elem       min    q1     q2     q3     max        mean       n50        sum            &lt;br /&gt;
  AMOScmp.nucmer          1843       20     41     63     325    127984     3350.39    30660      6174765    !!! previous best : nucmer aligner&lt;br /&gt;
  AMOScmp.soap            1929       31     59     223    3395   79432      3182.93    13233      6139867    !!! new           : soap2  aligner&lt;br /&gt;
  AMOScmp.bowtie          1701       31     56     215    3535   79443      3610.23    15916      6141006    !!! new           : bowtie  aligner&lt;br /&gt;
&lt;br /&gt;
* Merged&lt;br /&gt;
&lt;br /&gt;
  velvet1.cor(k=21)**     5946       41     179    574    1472   20233      1129.77    2362       6717612    !!! new best&lt;br /&gt;
  AMOScmp.nucmer          1843       20     41     63     325    127984     3350.39    30660      6174765    !!! previous best : nucmer aligner&lt;br /&gt;
  AMOScmp.nucmer-velvet   983        41     174    796    2859   309985     7390.19    53631      7264555&lt;br /&gt;
&lt;br /&gt;
== SRA projects ==&lt;br /&gt;
&lt;br /&gt;
=== Whole genome based sequencing of a Escherichia fergusonii strain using 454 Technology. ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP001132 SRP001132]&lt;br /&gt;
* 1 complete strain&lt;br /&gt;
* Use 454 single/paired , titanium&lt;br /&gt;
&lt;br /&gt;
=== E coli Whole Genome Sequencing on 454 and Illumina ===&lt;br /&gt;
* Mentioned by allpaths-lg&lt;br /&gt;
* [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP001087 SRP001087]&lt;br /&gt;
** [http://www.ncbi.nlm.nih.gov/sra/SRX016044?report=full SRX016044] --&amp;gt; [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&amp;amp;m=data&amp;amp;s=viewer&amp;amp;run=SRR034509 SRR034509] readLen=101 ; insMea=180&lt;br /&gt;
** [http://www.ncbi.nlm.nih.gov/sra/SRX007757?report=full SRX007757] --&amp;gt; [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&amp;amp;m=data&amp;amp;s=viewer&amp;amp;run=SRR022911 SRR022911] readLen=37 ; insMea=~2620&lt;br /&gt;
** [http://www.ncbi.nlm.nih.gov/sra/SRX007759?report=full SRX007759] --&amp;gt; [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&amp;amp;m=data&amp;amp;s=viewer&amp;amp;run=SRR022913 SRR022913] readLen=37 ; insMea=~3666&lt;br /&gt;
&lt;br /&gt;
=== Genome Sequencing of Mycobacterium tuberculosis PGG1 ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&amp;amp;cmd=Retrieve&amp;amp;dopt=Overview&amp;amp;list_uids=49647 SRP003679] &lt;br /&gt;
&lt;br /&gt;
=== Plasmodium falciparum Whole Genome Assembly Development ===&lt;br /&gt;
&lt;br /&gt;
* Mentioned by allpaths-lg&lt;br /&gt;
* [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP001775 SRP001775]&lt;br /&gt;
** [http://www.ncbi.nlm.nih.gov/sra/SRX033454?report=full SRX033454] readLen=101&lt;br /&gt;
** [http://www.ncbi.nlm.nih.gov/sra/SRX016058?report=full SRX016058] readLen=101&lt;br /&gt;
* ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/SRP001/SRP001775/&lt;br /&gt;
&lt;br /&gt;
=== Rhodobacter sphaeroides Sequencing on Illumina ===&lt;br /&gt;
&lt;br /&gt;
* Mentioned by allpaths-lg&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP001079 SRP001079]&lt;br /&gt;
** [http://www.ncbi.nlm.nih.gov/sra/SRX033397?report=full SRX033397] --&amp;gt; [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&amp;amp;m=data&amp;amp;s=viewer&amp;amp;run=SRR081522 SRR081522] readLen=101 ; insMea=180 &lt;br /&gt;
** [http://www.ncbi.nlm.nih.gov/sra/SRX016063?report=full SRX016063] --&amp;gt; [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&amp;amp;m=data&amp;amp;s=viewer&amp;amp;run=SRR034528 SRR034528] readLen=101 ; insMea~=3455; ~15% of the mates are short inserts (~250bp)&lt;br /&gt;
&lt;br /&gt;
* 4 complete strains: NCBI search:  &amp;quot;Rhodobacter sphaeroides&amp;quot;[Organism] AND pt_default[prop] AND seqstat_complete[PROP]&lt;br /&gt;
* /fs/szdata//ncbi/genomes/Bacteria/Rhodobacter_sphaeroides_*/&lt;br /&gt;
* chromosome lengths ~ 4Mbp&lt;br /&gt;
* several plasmids&lt;br /&gt;
* largest repeat : 5348 at 99.93%id; other repeats &amp;lt; 3k&lt;br /&gt;
&lt;br /&gt;
=== Staphylococcus aureus Sequencing on Illumina ===&lt;br /&gt;
&lt;br /&gt;
* [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP001086 SRP001086]&lt;br /&gt;
** [http://www.ncbi.nlm.nih.gov/sra/SRX007714?report=full SRX007714] --&amp;gt; SRR022868 readLen=101  insMea=3538&lt;br /&gt;
** [http://www.ncbi.nlm.nih.gov/sra/SRX007713?report=full SRX007713] --&amp;gt; SRR022867 readLen=37   ?&lt;br /&gt;
** [http://www.ncbi.nlm.nih.gov/sra/SRX007712?report=full SRX007712] --&amp;gt; SRR022866 readLen=76   insMea=180&lt;br /&gt;
&lt;br /&gt;
=== Whole Genome Sequencing of Streptomyces roseosporus NRRL 11379 ===&lt;br /&gt;
&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP001786 SRP001786]&lt;br /&gt;
* [http://www.ncbi.nlm.nih.gov/genomeprj?Db=genomeprj&amp;amp;cmd=ShowDetailView&amp;amp;TermToSearch=32281] Actinobacteria 7.7Mbp genome&lt;br /&gt;
&lt;br /&gt;
=== Streptococcus pyogenes pathogenicity ===&lt;br /&gt;
&lt;br /&gt;
* [http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP000775 SRP000775] Broad 38.5% gc&lt;br /&gt;
&lt;br /&gt;
=== Vibrio cholerae : PacBio long reads !!! ===&lt;br /&gt;
&lt;br /&gt;
* [http://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP004712 SRP004712]&lt;br /&gt;
&lt;br /&gt;
=== Simulated reads ===&lt;br /&gt;
&lt;br /&gt;
* use MetaSim&lt;br /&gt;
* MetaSim limitations:&lt;br /&gt;
** Illumina reads (Empirical error model) always length==36&lt;br /&gt;
** Exact reads : always unmated &lt;br /&gt;
* get quality values &amp;amp; mate pairs:&lt;br /&gt;
  ~/bin/fasta2qual.pl prefix.seq &amp;gt; prefix.qual&lt;br /&gt;
  cat prefix.seq | grep &amp;quot;^&amp;gt;&amp;quot; | awk &#039;{print $1}&#039; | grep 1$ | p &#039;/&amp;gt;(.+).1/; print &amp;quot;$1.1 $1.2\n&amp;quot;;&#039; &amp;gt; prefix.mates&lt;br /&gt;
  convert-fasta-to-v2.pl [ -454 ] -l prefix -mean 5000 -stddev 500 -s prefix.seq -q prefix.qual -m prefix.mates &amp;gt; prefix.frg&lt;br /&gt;
&lt;br /&gt;
= SRA =&lt;br /&gt;
&lt;br /&gt;
== Download ==&lt;br /&gt;
&lt;br /&gt;
* SRRs (runs)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  cat SRR.txt | perl -ane &#039;next unless /(SRR...)/; print &amp;quot;wget ftp://ftp.ncbi.nih.gov/sra/sra-instant/reads/ByRun/litesra/SRR/$1/$F[0]/$F[0].lite.sra\n&amp;quot;;&#039;  =&amp;gt; SRR001355.lite.sra&lt;br /&gt;
  cat SRR.txt | perl -ane &#039;next unless /(SRR...)/; print &amp;quot;wget ftp://ftp.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/$1/$F[0]/$F[0].sra\n&amp;quot;;&#039;           =&amp;gt; SRR001355.sra&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* SRXs (experiments)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  cat SRX.txt | perl -ane &#039;next unless /(SRX...)/; system &amp;quot;wget -r ftp://ftp.ncbi.nih.gov/sra/sra-instant/reads/ByExp/litesra/SRX/$1/$F[0]/\n&amp;quot;;&#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* SRPs (studies)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
  cat SRP.txt | perl -ane &#039;next unless /(SRP...)/; system &amp;quot;wget -r ftp://ftp.ncbi.nih.gov/sra/sra-instant/reads/ByStudy/litesra/SRP/$1/$F[0]/\n&amp;quot;;&#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Aspera&lt;br /&gt;
** [http://www.ncbi.nlm.nih.gov/viewvc/v1/trunk/sra/doc/SRA/SRA_Download_Guide.pdf?view=co Manual]&lt;br /&gt;
** [http://download.asperasoft.com/download/sw/connect/2.4/aspera-connect-2.4.5.30478-linux-64.sh Linux64bit install download]&lt;br /&gt;
&lt;br /&gt;
== Convert ==&lt;br /&gt;
&lt;br /&gt;
  cat SRR.txt | perl -ane &#039;print &amp;quot;fastq-dump -A $F[0] -D $F[0].*sra\n&amp;quot;;&#039;   =&amp;gt; SRR??????_[12].fastq&lt;br /&gt;
  cat SRR.txt | perl -ane &#039;print &amp;quot;sff-dump   -A $F[0] -D $F[0].*sra\n&amp;quot;;&#039;   =&amp;gt; SRR??????.sff&lt;br /&gt;
&lt;br /&gt;
* sff-dump does not work on litesra&lt;br /&gt;
&lt;br /&gt;
== Download and convert ==&lt;br /&gt;
&lt;br /&gt;
* Example:&lt;br /&gt;
  ~/bin/SRRget.pl [-lite] SRR075011 | sh             # =&amp;gt; SRR075011.fastq&lt;br /&gt;
&lt;br /&gt;
== Insert Size Estimate ==&lt;br /&gt;
  &lt;br /&gt;
  soap2-index ../1con/genome.1con&lt;br /&gt;
  touch ../1con/genome.1con.index&lt;br /&gt;
&lt;br /&gt;
  cat SRR.txt | perl -ane &#039;print &amp;quot;soap2 -D ../1con/*.index -a $F[0]_1.fastq -b $F[0]_2.fastq -o $F[0]_12.mated.soap2 -2 $F[0]_12.single.soap2 -p 20\n&amp;quot;;&#039;&lt;br /&gt;
  cat SRR.txt | perl -ane &#039;print &amp;quot;cat $F[0]_12.*.soap2 | ~/bin/AMOS/soap2sameRef.pl -dist -max 10000 | getSummary.pl -i 1\n&amp;quot;;&#039;                               # =&amp;gt; q2 ; add q2 vals as 2nd col in SRR.txt&lt;br /&gt;
&lt;br /&gt;
  cat SRR.txt | perl -ane &#039;$m=$F[1]*0.7; $x=$F[1]*1.3; print &amp;quot;soap2 -D ../1con/*.index -a $F[0]_1.fastq -b $F[0]_2.fastq -o $F[0]_12.mated.soap2 -2 $F[0]_12.single.soap2 -p 20 -m $m -x $x &amp;quot; ; print &amp;quot;-R&amp;quot; if($F[1]&amp;gt;1000); print &amp;quot;\n&amp;quot;;&#039;&lt;br /&gt;
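  A quick stand-in for the getSummary.pl call above (a site-local script): order statistics of the distance column in plain sort/awk. The read ids and distances below are made-up demo values, not real soap2 output.&lt;br /&gt;

```shell
# Hypothetical stand-in for "getSummary.pl -i 1": min/median/max of
# column 2 (mate distance); demo values replace real alignment output.
printf '%s\n' 'r1 180' 'r2 200' 'r3 170' 'r4 190' 'r5 210' |
  awk '{print $2}' | sort -n |
  awk '{v[NR]=$1} END{print "min=" v[1], "median=" v[int((NR+1)/2)], "max=" v[NR]}'
# prints: min=170 median=190 max=210
```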
&lt;br /&gt;
= 454 =&lt;br /&gt;
&lt;br /&gt;
== Format ==&lt;br /&gt;
&lt;br /&gt;
* sff&lt;br /&gt;
* linkers&lt;br /&gt;
  L454.flx          44     45.45   (palindrome)&lt;br /&gt;
  L454.titanium.fwd 42     30.95  &lt;br /&gt;
  L454.titanium.rev 42     30.95&lt;br /&gt;
* fasta &lt;br /&gt;
* fastq &lt;br /&gt;
&lt;br /&gt;
== Data sets == &lt;br /&gt;
&lt;br /&gt;
* Sff&lt;br /&gt;
  /nfshomes/dpuiu/methanobrevibacter_smithii/Data/454/*sff          # single ; FLX ?&lt;br /&gt;
  /nfshomes/dpuiu/F_tularensis/SRA_FTP/*/*sff                       # single &amp;amp; paired ; GS20 &amp;amp; FLX&lt;br /&gt;
  /nfshomes/dpuiu/Brugia_malayi.new/Data/brugia-sequencing/*/*sff   # paired ; FLX &amp;amp; Ti&lt;br /&gt;
  /fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Data/454/*fastq      # single from SRA !!!&lt;br /&gt;
  /fs/szattic-asmg4/dpuiu/HTS/Escherichia_coli/Data/454p/*fastq     # paired from SRA !!!&lt;br /&gt;
&lt;br /&gt;
== Converters ==&lt;br /&gt;
&lt;br /&gt;
* sff -&amp;gt; sff&lt;br /&gt;
  sffinfo -a in.sff | head -100 &amp;gt; out.acc&lt;br /&gt;
  sfffile -o out.sff -i out.acc in.sff&lt;br /&gt;
&lt;br /&gt;
* sff -&amp;gt; seq/qual (454 package)&lt;br /&gt;
  sffinfo -s prefix.sff &amp;gt; prefix.seq&lt;br /&gt;
  sffinfo -q prefix.sff &amp;gt; prefix.qual&lt;br /&gt;
&lt;br /&gt;
* sff -&amp;gt; frg2  (wgs package)&lt;br /&gt;
  sffToCA -insertsize mean std -linker [flx|titanium] -trim chop -libraryname prefix -output prefix.frg2 prefix.sff    # =&amp;gt; prefix.frg2&lt;br /&gt;
  -trim chop: the read is chopped down to just the clear bases !!! should try it&lt;br /&gt;
&lt;br /&gt;
* fastq -&amp;gt; seq/qual -&amp;gt; frg2 (dpuiu &amp;amp; wgs)&lt;br /&gt;
  fastq2seq.pl  &amp;lt; prefix.fastq &amp;gt; prefix.seq&lt;br /&gt;
  fastq2qual.pl &amp;lt; prefix.fastq &amp;gt; prefix.qual&lt;br /&gt;
  convert-fasta-to-v2.pl -454 -l prefix -s prefix.seq -q prefix.qual &amp;gt; prefix.frg2&lt;br /&gt;
&lt;br /&gt;
* fastq -&amp;gt; frg2 (dpuiu &amp;amp; wgs)&lt;br /&gt;
  fastq2frg.sh prefix     # =&amp;gt; prefix.frg2&lt;br /&gt;
&lt;br /&gt;
* seq/qual -&amp;gt; clean seq/qual -&amp;gt; frg2 (linker removal for 454p) (dpuiu)&lt;br /&gt;
 &lt;br /&gt;
  454p2seqqual.amos prefix    # =&amp;gt; prefix.cseq,  prefix.cqual&lt;br /&gt;
  convert-fasta-to-v2.pl -454 -l prefix -s prefix.cseq -q prefix.cqual -mean 3000 -stddev 300 -m prefix.mates &amp;gt; prefix.frg&lt;br /&gt;
  &lt;br /&gt;
See also&lt;br /&gt;
* http://sourceforge.net/apps/mediawiki/wgs-assembler/index.php?title=SFF_SOP&lt;br /&gt;
&lt;br /&gt;
= Illumina =&lt;br /&gt;
&lt;br /&gt;
== Formats ==&lt;br /&gt;
&lt;br /&gt;
* Raw : the intensity values (int) and noise (nse) values. &lt;br /&gt;
* Processed : the processed intensity values (sig2) and four-channel quality values (prb). &lt;br /&gt;
* Base : the base calls (the quality value is taken from the prb of the called base).&lt;br /&gt;
&lt;br /&gt;
== Converters ==&lt;br /&gt;
&lt;br /&gt;
* illumina &amp;lt;-&amp;gt; srf (illumina package) &lt;br /&gt;
  illumina2srf -o lane_no.srf BustardDir/s_lane_no_*_qseq.txt&lt;br /&gt;
  srf2illumina file.srf&lt;br /&gt;
&lt;br /&gt;
* split joined paired reads from SRA&lt;br /&gt;
  # seq/qual&lt;br /&gt;
  cat prefix.seq  | perl ~/bin/splitSeqMiddle.pl    &amp;gt; prefix.cseq&lt;br /&gt;
  cat prefix.qual | perl ~/bin/splitSeqMiddle.pl -q &amp;gt; prefix.cqual&lt;br /&gt;
 &lt;br /&gt;
  # fastq&lt;br /&gt;
  cat prefix.fastq | perl ~/bin/splitFastqMiddle.pl &amp;gt;  prefix.cfastq&lt;br /&gt;
  cat prefix.fastq | perl ~/bin/splitFastqMiddle.pl -f &amp;gt;  prefix.fwd.cfastq &lt;br /&gt;
  cat prefix.fastq | perl ~/bin/splitFastqMiddle.pl -r &amp;gt;  prefix.rev.cfastq&lt;br /&gt;
&lt;br /&gt;
= SOLiD =&lt;br /&gt;
&lt;br /&gt;
== Converters ==&lt;br /&gt;
&lt;br /&gt;
* SOLiD -&amp;gt; srf (SOLiD package)  &lt;br /&gt;
  /fs/szdevel/dpuiu/solid2srf_v0.6.6/&lt;br /&gt;
&lt;br /&gt;
= Converters =&lt;br /&gt;
&lt;br /&gt;
== SRF ==&lt;br /&gt;
&lt;br /&gt;
* [http://www.politigenomics.com/2008/06/whats-in-an-srf.html SRF]&lt;br /&gt;
* [http://srf.sourceforge.net/ SRF]&lt;br /&gt;
* [http://srf.sourceforge.net/ShortSequenceFormatDec18th_v_1_3.htm SRF 1.3]&lt;br /&gt;
&lt;br /&gt;
Commands:&lt;br /&gt;
  srf2fasta&lt;br /&gt;
  srf2fastq&lt;br /&gt;
  srf_dump_all&lt;br /&gt;
  srf_extract_hash&lt;br /&gt;
  srf_extract_linear&lt;br /&gt;
  srf_filter&lt;br /&gt;
  srf_index_hash&lt;br /&gt;
  srf_info&lt;br /&gt;
&lt;br /&gt;
== FASTA/QUAL ==&lt;br /&gt;
* seq/qual =&amp;gt; frg2&lt;br /&gt;
  convert-fasta-to-v2.pl -l prefix -s prefix.seq -q prefix.qual &amp;gt; prefix.frg2&lt;br /&gt;
  &lt;br /&gt;
* seq/qual =&amp;gt; amos&lt;br /&gt;
  toAmos -m prefix.mates -s prefix.seq -q prefix.qual -o prefix.afg&lt;br /&gt;
&lt;br /&gt;
== FASTQ ==&lt;br /&gt;
* http://en.wikipedia.org/wiki/FASTQ_format&lt;br /&gt;
* [http://nar.oxfordjournals.org/content/38/6/1767.full The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants]&lt;br /&gt;
&lt;br /&gt;
  format                                   offset  qualityRanges&lt;br /&gt;
                                           0         1         2         3         4&lt;br /&gt;
                                           01234567890123456789012345678901234567890&lt;br /&gt;
  ===================================================================&lt;br /&gt;
  Sanger (cont. numbers)           33      !&amp;quot;#$%&amp;amp;&#039;()*+,-./0123456789:;&amp;lt;=&amp;gt;?@ABCDEFGHI&lt;br /&gt;
  Illumina (cont. lower case)      64      @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh&lt;br /&gt;
 &lt;br /&gt;
  CA frg packed                    48      0123456789:;&amp;lt;=&amp;gt;?@ABCDEFGHIJKLMNOPQRSTUVWX&lt;br /&gt;
  ===================================================================&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* fastqToCA&lt;br /&gt;
  fastqToCA -insertsize 3000 300 -libraryname short -type sanger -fastq $PWD/short_1.fastq,$PWD/short_2.fastq &amp;gt; short_12.frg&lt;br /&gt;
  runCA -d . -p genome stopAfter=initialStoreBuilding short.frg &lt;br /&gt;
  gatekeeper -dumpfrg -allreads -donotfixmates -format2 genome.gkpStore &amp;gt; genome.frg&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* fastq : input reads; id positions don&#039;t necessarily match; fwd read names end in 1, rev read names end in 2 &lt;br /&gt;
  =&amp;gt; fastq (_0: unmated; _[12]: mated) &lt;br /&gt;
&lt;br /&gt;
  fastq2seqs.pl &amp;lt; input_1.fastq | sed &#039;s/1$//&#039; &amp;gt; prefix_1.mate&lt;br /&gt;
  fastq2seqs.pl &amp;lt; input_2.fastq | sed &#039;s/2$//&#039; &amp;gt; prefix_2.mate&lt;br /&gt;
  intersect.pl prefix_1.mate prefix_2.mate     &amp;gt; prefix.mate&lt;br /&gt;
 &lt;br /&gt;
  cat prefix.mate | perl -ane &#039;print &amp;quot;$F[0]1 $F[0]2\n&amp;quot;;&#039; &amp;gt; prefix.mates&lt;br /&gt;
 &lt;br /&gt;
  difference.pl prefix_1.mate prefix.mate | perl -ane &#039;print &amp;quot;$F[0]1\n&amp;quot;;&#039; &amp;gt;  prefix.unmated&lt;br /&gt;
  difference.pl prefix_2.mate prefix.mate | perl -ane &#039;print &amp;quot;$F[0]2\n&amp;quot;;&#039; &amp;gt;&amp;gt; prefix.unmated&lt;br /&gt;
 &lt;br /&gt;
  extractfromfastqnames -i 0 -f prefix.unmated &amp;lt; prefix.fastq &amp;gt; prefix_0.fastq&lt;br /&gt;
  extractfromfastqnames -i 0 -f prefix.mates   &amp;lt; prefix.fastq &amp;gt; prefix_1.fastq&lt;br /&gt;
  extractfromfastqnames -i 1 -f prefix.mates   &amp;lt; prefix.fastq &amp;gt; prefix_2.fastq&lt;br /&gt;
&lt;br /&gt;
  #or &lt;br /&gt;
  fastq22fastq3.amos -D FWDSEQ=input_1.cor.txt -D REVSEQ=input_2.txt prefix&lt;br /&gt;
&lt;br /&gt;
* fastq2fastq&lt;br /&gt;
  maq sol2sanger illumina.fastq sanger.fastq&lt;br /&gt;
  ~/bin/fastq2afstq -in illumina -out sanger   &amp;lt; illumina.fastq &amp;gt; sanger.fastq&lt;br /&gt;
  ~/bin/fastq2afstq -in sanger   -out illumina &amp;lt; sanger.fastq   &amp;gt; illumina.fastq &lt;br /&gt;
&lt;br /&gt;
* fastq =&amp;gt; frg2&lt;br /&gt;
  fastq2seq.pl  &amp;lt; prefix.fastq &amp;gt; prefix.seq&lt;br /&gt;
  fastq2qual.pl &amp;lt; prefix.fastq &amp;gt; prefix.qual&lt;br /&gt;
  convert-fasta-to-v2.pl -l prefix -s prefix.seq -q prefix.qual &amp;gt; prefix.frg2&lt;br /&gt;
 &lt;br /&gt;
  ~/bin/fastq2frg.sh prefix&lt;br /&gt;
&lt;br /&gt;
== FRG ==&lt;br /&gt;
&lt;br /&gt;
* frg =&amp;gt; frg2&lt;br /&gt;
  convert-v1-to-v2.pl [-v vector-clear-file] [-noobt] [-unmated] &amp;lt; prefix.frg &amp;gt; prefix.frg2&lt;br /&gt;
 &lt;br /&gt;
* frg =&amp;gt; amos&lt;br /&gt;
  toAmos -f prefix.frg -o prefix.afg&lt;br /&gt;
&lt;br /&gt;
* frg2 =&amp;gt; frg&lt;br /&gt;
  gatekeeper -o prefix.gkpStore -T prefix.frg2&lt;br /&gt;
  gatekeeper -dumpfrg prefix.gkpStore &amp;gt; prefix.frg&lt;br /&gt;
&lt;br /&gt;
* frg =&amp;gt; mates&lt;br /&gt;
  frg2mates.pl &amp;lt; prefix.frg &amp;gt; prefix.mates&lt;br /&gt;
 &lt;br /&gt;
  frg2acc.pl &amp;lt; prefix.frg &amp;gt; prefix.acc&lt;br /&gt;
  cat prefix.acc | ~/bin/difference1or2.pl -j1 1 -f prefix.mates &amp;gt; prefix.unmated&lt;br /&gt;
&lt;br /&gt;
  extractfromfastqnames -i 0 -f prefix.unmated &amp;lt; prefix.fastq &amp;gt; prefix_0.fastq&lt;br /&gt;
  extractfromfastqnames -i 0 -f prefix.mates   &amp;lt; prefix.fastq &amp;gt; prefix_1.fastq&lt;br /&gt;
  extractfromfastqnames -i 1 -f prefix.mates   &amp;lt; prefix.fastq &amp;gt; prefix_2.fastq&lt;br /&gt;
&lt;br /&gt;
== AMOS ==&lt;br /&gt;
&lt;br /&gt;
* amos =&amp;gt; frg&lt;br /&gt;
  amos2frg -i prefix.afg -o prefix.frg [-a accession] &lt;br /&gt;
  removefromfrgmsg.pl prefix.frg ADT                      # in case convert-v1-to-v2.pl is run afterwards&lt;br /&gt;
 &lt;br /&gt;
* amos =&amp;gt; bank&lt;br /&gt;
  bank-transact -c -z -b prefix.bnk -m prefix.afg&lt;br /&gt;
&lt;br /&gt;
  # for velvet assemblies&lt;br /&gt;
  grep &amp;quot;^&amp;gt;&amp;quot; Sequences | sed &#039;s/&amp;gt;//&#039; | awk &#039;{print $1,$2}&#039; &amp;gt; prefix.seqs&lt;br /&gt;
  &lt;br /&gt;
  bank2contig prefix.bnk/ | grep ^# | grep -v ^## | p &#039;/(\d+)/; print $1,&amp;quot;\n&amp;quot;;&#039; &amp;gt; prefix.assembled&lt;br /&gt;
  difference.pl -i 1 prefix.seqs prefix.assembled &amp;gt; prefix.singletons&lt;br /&gt;
&lt;br /&gt;
== TA (seq,qual,xml) ==&lt;br /&gt;
&lt;br /&gt;
* seq,qual,xml =&amp;gt; frg&lt;br /&gt;
  tarchive2ca   -o prefix -c prefix.clr -l prefix.libinfo prefix.seq   # =&amp;gt; prefix.frg&lt;br /&gt;
 &lt;br /&gt;
  tracedb-to-frg.pl (?)&lt;br /&gt;
* seq,qual,xml =&amp;gt; amos&lt;br /&gt;
  tarchive2amos -o prefix -c prefix.clr -l prefix.libinfo prefix.seq   # =&amp;gt; prefix.afg&lt;br /&gt;
&lt;br /&gt;
== ACE ==&lt;br /&gt;
&lt;br /&gt;
* ace =&amp;gt; amos&lt;br /&gt;
  toAmos -ace prefix.ace -m prefix.mates -o prefix.afg&lt;br /&gt;
&lt;br /&gt;
= Input processing  =&lt;br /&gt;
&lt;br /&gt;
== Quality assessment ==&lt;br /&gt;
&lt;br /&gt;
* Daniela&#039;s scripts&lt;br /&gt;
  cat prefix_1.fastq | ~/bin/getQualSummary.pl -type illumina &amp;gt; prefix_1.qual.summary&lt;br /&gt;
  cat prefix_2.fastq | ~/bin/getQualSummary.pl -type illumina &amp;gt; prefix_2.qual.summary&lt;br /&gt;
 &lt;br /&gt;
  cat ~/bin/plotQualSummary.gp | sed &#039;s/prefix/Ecoli/g&#039; &amp;gt;! plotQualSummary.gp&lt;br /&gt;
  gnuplot plotQualSummary.gp&lt;br /&gt;
  display Ecoli.qual.summary.png&lt;br /&gt;
&lt;br /&gt;
* Galaxy FASTX scripts&lt;br /&gt;
  ~/bin/fastx.sh prefix&lt;br /&gt;
  or&lt;br /&gt;
  fastx_quality_stats                     -i prefix.fastq -o prefix.stats&lt;br /&gt;
  fastq_quality_boxplot_graph.sh          -i prefix.stats -o prefix.quality_boxplot.png&lt;br /&gt;
  fastx_nucleotide_distribution_graph.sh  -i prefix.stats -o prefix.nucleotide_distribution_graph.png&lt;br /&gt;
  display *.png&lt;br /&gt;
&lt;br /&gt;
  #the counts should be pretty uniform&lt;br /&gt;
  ls s*stats | p &#039;print &amp;quot;getSummary.pl -i 12 -t $F[0] -nh &amp;lt;$F[0]\n&amp;quot;;&#039; | sh | pretty &amp;gt;  A_count.stats&lt;br /&gt;
  ls s*stats | p &#039;print &amp;quot;getSummary.pl -i 13 -t $F[0] -nh &amp;lt;$F[0]\n&amp;quot;;&#039; | sh | pretty &amp;gt;  C_count.stats&lt;br /&gt;
  ls s*stats | p &#039;print &amp;quot;getSummary.pl -i 14 -t $F[0] -nh &amp;lt;$F[0]\n&amp;quot;;&#039; | sh | pretty &amp;gt;  G_count.stats&lt;br /&gt;
  ls s*stats | p &#039;print &amp;quot;getSummary.pl -i 15 -t $F[0] -nh &amp;lt;$F[0]\n&amp;quot;;&#039; | sh | pretty &amp;gt;  T_count.stats&lt;br /&gt;
&lt;br /&gt;
== Coverage assessment ==&lt;br /&gt;
&lt;br /&gt;
 soap2 ....&lt;br /&gt;
 cat prefix*soap2 | ~/bin/AMOS/soap2posmap.pl | ~/bin/posmap2cvg.pl &amp;gt; prefix.cvg&lt;br /&gt;
 plot &amp;quot;prefix.cvg&amp;quot; u 2:6 w l&lt;br /&gt;
&lt;br /&gt;
== Read trimming ==&lt;br /&gt;
&lt;br /&gt;
* Delete all reads that contain an N (or a quality of 0)&lt;br /&gt;
* Compute Q20 &amp;amp; trim all reads to that length&lt;br /&gt;
* Compute nucleotide bias. Trim 5&#039;,3&#039; ends that show bias&lt;br /&gt;
* Trim 5&#039;,3&#039; ends &lt;br /&gt;
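
The first rule above (delete reads containing an N) can be sketched as a 4-line-record awk filter; the inline reads are demo input standing in for a real fastq file.&lt;br /&gt;

```shell
# Drop any FASTQ record whose sequence line contains an N.
# Demo reads stand in for a real prefix.fastq.
printf '@a\nACGT\n+\nIIII\n@b\nACNT\n+\nIIII\n' |
  awk '{l[NR%4]=$0; if (NR%4==0) if (l[2] !~ /N/) print l[1] "\n" l[2] "\n" l[3] "\n" l[0]}'
# prints the @a record only; @b (contains N) is dropped
```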
&lt;br /&gt;
== Adaptor removal ==&lt;br /&gt;
* fastx_clipper -a adaptorSeq  !!! only trims the 3&#039; end&lt;br /&gt;
* https://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py&lt;br /&gt;
* check high frequency kmers&lt;br /&gt;
* sometimes there is adaptor in short insert size libs&lt;br /&gt;
&lt;br /&gt;
* Using regular expressions (tested)&lt;br /&gt;
  ~/bin/cleanTitaniumAdaptorFromFastq.pl&lt;br /&gt;
  ~/bin/grep.pl -f ~/bin/titanium.fgrep &lt;br /&gt;
  ~/bin/cleanAdaptorFastq.pl -tita -min 32&lt;br /&gt;
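
One way to act on &amp;quot;check high frequency kmers&amp;quot; above: tally read-start kmers and look for a single dominant one, since adaptor-contaminated reads all begin the same way. The sequences below are demo input, k=8:&lt;br /&gt;

```shell
# Count 8-mers at read starts; an adaptor shows up as one dominant
# kmer at the top of the list. Demo sequences, not real reads.
printf 'ACGTACGTAA\nACGTACGTTT\nGGGGCCCCAA\n' |
  awk '{print substr($1,1,8)}' | sort | uniq -c | sort -rn | head
```

Here the shared prefix ACGTACGT tops the counts.&lt;br /&gt;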
&lt;br /&gt;
== Read correction ==&lt;br /&gt;
&lt;br /&gt;
* quake &lt;br /&gt;
   echo prefix_1.fastq prefix_2.fastq  &amp;gt; prefix.ls&lt;br /&gt;
   quake.py -f prefix.ls -k 18 -p 16&lt;br /&gt;
or &lt;br /&gt;
  cat prefix*fastq | count-qmers -k 18 &amp;gt; prefix.ls.qcts&lt;br /&gt;
  gzip prefix.ls.qcts&lt;br /&gt;
  cov_model.py prefix.ls.qcts  | tail -1  | p &#039;print $F[1]&#039; &amp;gt; prefix.cutoff                            # takes pretty long to run&lt;br /&gt;
  cov_model.py 18mers.qcts &amp;gt;&amp;amp; /dev/null&lt;br /&gt;
  tail -1 cutoff.txt | perl -ane &#039;print $F[0]&#039; &amp;gt; prefix.cutoff&lt;br /&gt;
  correct -f prefix.ls -k 18 -c `cat prefix.cutoff` -m prefix.ls.qcts -p 4&lt;br /&gt;
&lt;br /&gt;
* Aleksey&#039;s &lt;br /&gt;
  set SIZE=100000000 # 100M&lt;br /&gt;
  jellyfish count -s 1000000 -m 18 --both-strands -t 16 -o frag.k18.jf frag_?.*fastq&lt;br /&gt;
  jellyfish merge frag.k18.jf_* -o frag.k18.jf&lt;br /&gt;
  jellyfish histo frag.k18.jf | head -20  =&amp;gt; bottom=cutoff&lt;br /&gt;
  correct -f prefix.ls -k 18 -c cutoff -m 18mers.qcts -p 4&lt;br /&gt;
&lt;br /&gt;
* SOAP correction &lt;br /&gt;
  ls -1 prefix_1.fastq prefix_2.fastq  &amp;gt; prefix.ls&lt;br /&gt;
  ~/szdevel/correction/KmerFreq  -i prefix.ls -o prefix.cor      -s 18      &lt;br /&gt;
  ~/szdevel/correction/Corrector -i prefix.ls -r prefix.cor.freq -s 18 -t 16&lt;br /&gt;
&lt;br /&gt;
  Issue: too many reads are discarded&lt;br /&gt;
&lt;br /&gt;
* hybrid-shrec : multiple platform correction&lt;br /&gt;
  http://www.cs.helsinki.fi/u/lmsalmel/hybrid-shrec/&lt;br /&gt;
&lt;br /&gt;
== Lucy ==&lt;br /&gt;
  lucy -debug prefix.lucy.info prefix.seq prefix.qual &lt;br /&gt;
  cat prefix.lucy.info | awk &#039;{print $1,$3,$4}&#039; &amp;gt; prefix.lucy.clr&lt;br /&gt;
 &lt;br /&gt;
  ~/bin/clrFasta.pl -f prefix.lucy.clr    &amp;lt; prefix.seq  &amp;gt; prefix.lucy.seq&lt;br /&gt;
  ~/bin/clrFasta.pl -f prefix.lucy.clr -q &amp;lt; prefix.qual &amp;gt; prefix.lucy.qual&lt;br /&gt;
&lt;br /&gt;
== CA ==&lt;br /&gt;
* 454 reads&lt;br /&gt;
  runCA -stopAfter OBT ...&lt;br /&gt;
 &lt;br /&gt;
  gatekeeper -dumpfragments                   -tabular asm.gkpStore | p &#039;print $F[-3],&amp;quot;\n&amp;quot; unless($F[6]);&#039;        | getSummary.pl -t ORIG&lt;br /&gt;
  gatekeeper -dumpfragments -clear CLR        -tabular asm.gkpStore | p &#039;print $F[-1]-$F[-2],&amp;quot;\n&amp;quot; unless($F[6]);&#039; | getSummary.pl -t CLR&lt;br /&gt;
  gatekeeper -dumpfragments -clear OBTINITIAL -tabular asm.gkpStore | p &#039;print $F[-1]-$F[-2],&amp;quot;\n&amp;quot; unless($F[6]);&#039; | getSummary.pl -t OBTINITIAL&lt;br /&gt;
  gatekeeper -dumpfragments -clear OBTCHIMERA -tabular asm.gkpStore | p &#039;print $F[-1]-$F[-2],&amp;quot;\n&amp;quot; unless($F[6]);&#039; | getSummary.pl -t OBTCHIMERA&lt;br /&gt;
  gatekeeper -dumpfragments -clear LATEST      -tabular asm.gkpStore | p &#039;print $F[-1]-$F[-2],&amp;quot;\n&amp;quot; unless($F[6]);&#039; | getSummary.pl -t LATEST&lt;br /&gt;
&lt;br /&gt;
= Simulators =&lt;br /&gt;
&lt;br /&gt;
* Maq: max read length can be changed by recompiling; must be trained on a data set&lt;br /&gt;
  ~/db/Illumina.simutrain.dat : trained on 75 bp Illumina &amp;quot;high qual&amp;quot; reads (q2&amp;gt;30 for all pos)&lt;br /&gt;
&lt;br /&gt;
  maq simutrain train.dat train.fastq&lt;br /&gt;
  maq simulate     -d libMea -s libStd -r 0  -N noMates -1 read1Len -2 read2Len read1Out read2Out ref.fasta  train.dat&lt;br /&gt;
&lt;br /&gt;
* Samtools (wgsim)&lt;br /&gt;
  wgsim -e 0       -d libMea -s libStd -r 0 -N noReads -1 read1Len -2 read2Len ref.fasta read1.fastq read2.fastq -C   # error free reads&lt;br /&gt;
  wgsim -e readErr -d libMea -s libStd -r 0 -N noReads -1 read1Len -2 read2Len ref.fasta read1.fastq read2.fastq -C  # error in reads&lt;br /&gt;
&lt;br /&gt;
* Others:&lt;br /&gt;
  dwgsim&lt;br /&gt;
  [https://github.com/jstjohn/SimSeq simseq] Used by Assemblathon !!!&lt;br /&gt;
&lt;br /&gt;
= Links =&lt;br /&gt;
&lt;br /&gt;
* http://xyala.cap.ed.ac.uk/Gene_Pool/Nextgen_assembly_workshop/&lt;br /&gt;
* http://en.wikipedia.org/wiki/FASTQ_format&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=8921</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=8921"/>
		<updated>2011-08-16T18:27:44Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 9: Generate summary tables */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
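The list format lends itself to quick triage; for example, a throwaway shell sketch (hypothetical read names and batch name, not part of the pipeline scripts) that tallies the two rejection reasons:&lt;br /&gt;

```shell
# Toy [batch].BAD.list: read name, then either a cycle count (too short)
# or the first 8 bases (contains Ns).  All names and values are made up.
printf 'FX7PQ001ABC 62\nFX7PQ001DEF NNACGTAC\nFX7PQ001GHI 70\n' > batch1.BAD.list

# A purely numeric second column means "too short"; otherwise it holds bases.
short=$(awk '$2 ~ /^[0-9]+$/ { n++ } END { print n+0 }' batch1.BAD.list)
withN=$(awk '$2 ~ /N/ { n++ } END { print n+0 }' batch1.BAD.list)
echo "short=$short withN=$withN"
```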
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files produced for the BAD/NONE partitions so that these rejected sequences do not enter the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch the Sample ID -&amp;gt; filename mapping can be assumed to be unique.  In the 454.csv file in the top directory the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
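To make the merge semantics concrete, here is a toy illustration in plain awk (this is not the actual merge_csv.pl; the field layout is invented):&lt;br /&gt;

```shell
# Old record has an empty third field; the new record for the same key fills it in.
printf 'S001\tplateA\t\n'           > old.tsv
printf 'S001\tplateA\t2011-08-16\n' > new.tsv

merged=$(awk -F'\t' '
  NR == FNR { for (i = 1; i != NF + 1; i++) rec[i] = $i; next }   # load old record
  { for (i = 1; i != NF + 1; i++) if (rec[i] == "") rec[i] = $i } # fill empty fields only
  END { print rec[1] "," rec[2] "," rec[3] }
' old.tsv new.tsv)
echo "$merged"
```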
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
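The renaming scheme is easy to picture; a minimal awk sketch (toy reads and a hypothetical file number 3; the real work is done by combinefa.pl):&lt;br /&gt;

```shell
# Toy fasta assigned file number 3; each header becomes 3_(running index).
printf '%s\n' '>read_a' 'ACGT' '>read_b' 'GGCC' > f3.fa
renamed=$(awk -v n=3 '/^>/ { printf(">%d_%d\n", n, ++i); next } { print }' f3.fa)
echo "$renamed"
```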
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
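Given that separator convention, per-cluster counts can be pulled straight from the .align file; a quick sketch on toy data:&lt;br /&gt;

```shell
# Toy .align file: two clusters, terminated by "#2" and "#1" size markers.
printf '%s\n' '>1_1' 'ACGT' '>1_7' 'ACGT' '#2' '>2_3' 'GGCC' '#1' > toy.align
nclust=$(awk '/^#/ { n++ } END { print n+0 }' toy.align)
nreads=$(awk '/^#/ { t += substr($0, 2) } END { print t+0 }' toy.align)
echo "clusters=$nclust reads=$nreads"
```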
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect -f Run[date].fna -c &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and taxonomic names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells contain the number of sequences assigned to the specific group.  When looking at taxonomic levels, sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/clusters2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through a command line use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using users/password combo &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: [[Media:Schema.pdf]]&lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files produced for the BAD/NONE partitions so that these rejected sequences do not enter the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from 454 directory&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Upload OTU information into the database ===&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Then upload the resulting partition to the database&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/upload_otus.pl [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Generate summary tables ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/taxpart2summary-db.pl [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
See above in the non-db version for outputs.&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=8920</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=8920"/>
		<updated>2011-08-16T18:27:02Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 9: Generate summary tables */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files produced for the BAD/NONE partitions so that these rejected sequences do not enter the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch the Sample ID -&amp;gt; filename mapping can be assumed to be unique.  In the 454.csv file in the top directory the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect -f Run[date].fna -c &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and taxonomic names at various levels to individual samples.  The columns are the samples and the rows are the respective units; each cell holds the number of sequences assigned to that group/sample pair. When summarizing at a taxonomic level, sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
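The counting idea behind these tables can be sketched on toy data (the file name and the two-column format here are illustrative only; the real tables are produced by taxpart2summary.pl):&lt;br /&gt;

```shell
# Toy "sample<TAB>group" assignments (made up); count how many sequences
# fall in each group/sample pair - the cell values of the summary tables.
printf 'sampleA\tBacteroides\nsampleA\tBacteroides\nsampleB\tPrevotella\n' > toy.tsv
sort toy.tsv | uniq -c | awk '{ print $3, $2, $1 }'   # group sample count
```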
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/clusters2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through a command line use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the username/password combo &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: [[Media:Schema.pdf]]&lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
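The manual cleanup in the first bullet can be sketched as follows (toy data only; the real input is the Excel/OpenOffice export for the batch):&lt;br /&gt;

```shell
# Toy Excel-style export: quoted fields, CRLF line endings, unsorted rows.
printf '"Sample ID"\t"Well"\r\n"S2"\t"A2"\r\n"S1"\t"A1"\r\n' > toy.csv
tr -d '"\r' < toy.csv > toy.clean.csv        # strip quotes and CRs (dos2unix)
# Keep the header first, sort the remaining rows by Sample ID.
{ head -n 1 toy.clean.csv; tail -n +2 toy.clean.csv | sort; } > toy.sorted.csv
cat toy.sorted.csv
```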
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
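A quick way to inspect those lists can be sketched on toy data (the real lists are produced by code2part.pl; per the description above, a numeric second field marks a too-short read):&lt;br /&gt;

```shell
# Toy QC list (made up): short reads carry a cycle count in the second
# field, other failures carry the first characters of the sequence.
cat > toy.BAD.list <<'EOF'
seq1 42
seq2 NNNNACGT
seq3 17
EOF
wc -l < toy.BAD.list                          # total failed reads
awk '$2 ~ /^[0-9]+$/' toy.BAD.list | wc -l    # failed for being too short
```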
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files derived from the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from the 454 directory&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Upload OTU information into the database ===&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Then upload the resulting partition to the database&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/upload_otus.pl [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Generate summary tables ===&lt;br /&gt;
${DB_SCRIPTS}/taxpart2summary-db.pl [batch]&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=8919</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=8919"/>
		<updated>2011-08-16T18:26:48Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 9: Generate summary tables */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) an .sff file; (ii) a fasta and a quality file; or (iii) just a fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files derived from the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID - within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect -f Run[date].fna -c &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and taxonomic names at various levels to individual samples.  The columns are the samples and the rows are the respective units; each cell holds the number of sequences assigned to that group/sample pair. When summarizing at a taxonomic level, sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/clusters2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through a command line use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the username/password combo &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: [[Media:Schema.pdf]]&lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files derived from the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from the 454 directory&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Upload OTU information into the database ===&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Then upload the resulting partition to the database&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/upload_otus.pl [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Generate summary tables ===&lt;br /&gt;
${DB_SCRIPTS}/taxpart2summary-db.pl PREFIX&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=8918</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=8918"/>
		<updated>2011-08-16T18:26:30Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 9: Generate summary tables */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
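The manual cleanup described in the first bullet above can be sketched in shell. This is a toy illustration, not part of the pipeline: batch.raw.csv and its contents are invented stand-ins for an exported sheet, and real sheets may still need hand inspection.&lt;br /&gt;

```shell
# Toy sketch of the cleanup step: strip Excel quoting and CRLF line
# endings, then sort the data rows by Sample ID (assumed to be column 1).
# File names and contents are invented for illustration.
printf 'Sample ID\tWell\r\n"S002"\tA2\r\n"S001"\tA1\r\n' > batch.raw.csv
tr -d '\r"' < batch.raw.csv | { read -r header; echo "$header"; sort; } > batch.csv
cat batch.csv
```

The header line is passed through untouched so that only the data rows are sorted.&lt;br /&gt;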
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles, while sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
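As a toy illustration of the list format described above (names and values invented), the failure reasons can be tallied by checking whether the second column is a cycle count:&lt;br /&gt;

```shell
# Invented example of a [batch].BAD.list file: sequence name, then either
# a cycle count (too short) or the first 8 characters of the sequence.
printf 'seq001\t42\nseq002\tNNNNNNNN\nseq003\t60\n' > example.BAD.list
# Count sequences rejected for being too short (numeric second column):
awk -F'\t' '$2 ~ /^[0-9]+$/ { n++ } END { print n+0 }' example.BAD.list
```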
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files generated from the BAD/NONE partitions in order to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file listed in 454.csv will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time; a different protocol is necessary for adding new sequences to an already existing analysis.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
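The naming scheme can be illustrated with a toy renaming pass (the file number, sequence names, and file contents below are invented; the actual renaming is done by combinefa.pl):&lt;br /&gt;

```shell
# Rename FASTA headers to <filenum>_<index>, mimicking the scheme above.
printf '>origA\nACGT\n>origB\nGGCC\n' > example.fa
awk -v filenum=3 '/^>/ { print ">" filenum "_" (++i); next } { print }' example.fa
```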
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
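As a toy illustration of the .cluster format described above (identifiers invented), cluster sizes can be read off with awk:&lt;br /&gt;

```shell
# One cluster per line; the center is the first identifier on the line.
printf '1_1 1_7 2_3\n1_2\n2_1 1_4\n' > example.fna.cluster
# Print each cluster center together with its cluster size:
awk '{ print $1, NF }' example.fna.cluster
```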
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect -f Run[date].fna -c &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
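As a quick sanity check on this output (toy file below; sequence names and taxids invented), the number of distinct taxids among the centers can be counted:&lt;br /&gt;

```shell
# Invented example of a .centers.taxid file: sequence name <TAB> taxid.
printf '1_1\t816\n1_2\t838\n2_1\t816\n' > example.centers.taxid
# Count distinct taxids assigned to cluster centers:
cut -f2 example.centers.taxid | sort -u | wc -l
```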
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10, we construct a collection of tables linking OTUs, taxIDs, and taxonomy names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells are the numbers of sequences assigned to the specific group. When looking at taxonomic levels, sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
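The table layout can be sketched with an invented example (the file name, taxa, and counts are all made up, purely to show rows, columns, and the &amp;quot;No Assignment&amp;quot; bin):&lt;br /&gt;

```shell
# Invented example of a genus-level count table: rows are taxa, columns
# are samples, cells are numbers of sequences assigned to each pair.
cat > example.genus.count.csv <<'EOF'
taxon,SampleA,SampleB
Bacteroides,120,4
Prevotella,0,77
No Assignment,13,9
EOF
head -n 1 example.genus.count.csv
```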
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/clusters2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through a command line, use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the user/password combo &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: [[Media:Schema.pdf]]&lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles, while sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files generated from the BAD/NONE partitions in order to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from 454 directory&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time; a different protocol is necessary for adding new sequences to an already existing analysis.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Upload OTU information into the database ===&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Then upload the resulting partition to the database&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/upload_otus.pl [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Generate summary tables ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/taxpart2summary-db.pl PREFIX&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7896</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7896"/>
		<updated>2010-10-23T01:09:58Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 9: Run clustering tool */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles, while sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files generated from the BAD/NONE partitions in order to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file listed in 454.csv will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time; a different protocol is necessary for adding new sequences to an already existing analysis.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect -f Run[date].fna -c &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10, we construct a collection of tables linking OTUs, taxIDs, and taxonomy names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells are the numbers of sequences assigned to the specific group. When looking at taxonomic levels, sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/clusters2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages with respect to the total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through a command line use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using users/password combo &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: [[Media:Schema.pdf]]&lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files also contain information useful for troubleshooting the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should delete the .nbc files generated for the BAD/NONE partitions so that these rejected sequences do not enter the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from 454 directory&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
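The n_nn renaming scheme described above can be sketched with awk on a toy FASTA file; the file number 7 and the file names here are assumptions for illustration only:

```shell
# Toy FASTA input (hypothetical sequence names).
printf '>origA\nACGT\n>origB\nGGCC\n' > example.fa
# Rewrite each header to n_nn, where n is the per-file number (here 7)
# and nn is the 1-based index of the sequence within the file.
awk -v n=7 '/^>/ { i++; print ">" n "_" i; next } { print }' example.fa > renamed.fa
cat renamed.fa
```

The actual pipeline does this via combinefa_db.pl; the sketch only shows the naming convention.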
&lt;br /&gt;
=== Step 6: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/dnaclust -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/dnaclust/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Upload OTU information into the database ===&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/clusters2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Then upload the resulting partition to the database&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/upload_otus.pl [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Generate summary tables ===&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Metagenomics_Reading_Group_(Wed_2pm)&amp;diff=7591</id>
		<title>Metagenomics Reading Group (Wed 2pm)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Metagenomics_Reading_Group_(Wed_2pm)&amp;diff=7591"/>
		<updated>2010-09-29T17:59:24Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;++ Metagenomic Reading Group (Wed 2pm)&lt;br /&gt;
&lt;br /&gt;
+ 9/22/2010&lt;br /&gt;
Presenter: Mihai Pop&lt;br /&gt;
Topic: Gut microbiome research&lt;br /&gt;
Papers:&lt;br /&gt;
    * Gill, S. R., M. Pop, et al. (2006). &amp;quot;Metagenomic analysis of the human distal gut microbiome.&amp;quot; Science 312(5778): 1355-1359. http://www.sciencemag.org/cgi/reprint/312/5778/1355.pdf&lt;br /&gt;
    * Qin, J., R. Li, et al. (2010). &amp;quot;A human gut microbial gene catalogue established by metagenomic sequencing.&amp;quot; Nature 464(7285): 59-65. http://www.nature.com/nature/journal/v464/n7285/pdf/nature08821.pdf&lt;br /&gt;
&lt;br /&gt;
+ 9/29/2010&lt;br /&gt;
Presenter: Chris Hill&lt;br /&gt;
Topic: Viral metagenomics&lt;br /&gt;
Papers:&lt;br /&gt;
* Viral metagenomics.  Robert A. Edwards and Forest Rohwer.  Nature Reviews Microbiology, Vol. 3, June 2005, 504.&lt;br /&gt;
* Genomic analysis of uncultured marine viral communities.  Mya Breitbart, Peter Salamon, Bjarne Andresen, Joseph M. Mahaffy, Anca M. Segall, David Mead, Farooq Azam, and Forest Rohwer. Proc Natl Acad Sci U S A. 2002 Oct 29;99(22):14250-5.&lt;br /&gt;
&lt;br /&gt;
+ 10/6/2010&lt;br /&gt;
Presenter:&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:Metagenomic_Reading_Group_(Wed_2pm).wiki&amp;diff=7590</id>
		<title>File:Metagenomic Reading Group (Wed 2pm).wiki</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:Metagenomic_Reading_Group_(Wed_2pm).wiki&amp;diff=7590"/>
		<updated>2010-09-29T17:58:28Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=7589</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=7589"/>
		<updated>2010-09-29T17:57:54Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Projects */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== IMPORTANT (Read First) ==&lt;br /&gt;
The CBCB computational infrastructure is a shared resource and we all need to pitch in in order to make sure it works well for all of us.  Most importantly, we need to ensure that our disk space and computational resources are used responsibly.  The disk space, in particular, is a valuable commodity and thus it is important to pay attention to the following:&lt;br /&gt;
* There are three types of disk space available (a full list of volumes is available at [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage CBCB Storage]):&lt;br /&gt;
** Local hard drives (usually mounted as /scratch on the machines that have this resource).  These are not backed up and are quite fast (they live physically close to the processor), but they can only be &#039;seen&#039; by the machine where they are mounted and thus require staging data in and out (which can take a while).&lt;br /&gt;
** Shared 3Par storage (/fs/szasmg*, /fs/szdata/*, etc.).  This is very fast and very expensive disk, and thus a limited resource. Please use this space only to store data temporarily, while you are running analyses on it.  As a rule of thumb, if a file or collection of files of any considerable size has lived on this space for more than 1-2 weeks, it should probably be moved to the attic space (see below).&lt;br /&gt;
** Attic storage (/fs/szattic*).  This is cheaper and ample, but slow and brittle storage.  Your data-sets should primarily live here.  Due to its brittleness, the IT department recommends that you not run analyses directly on this volume; instead, copy the files to a local hard drive or the 3Par, run the analyses there, and copy the results back when done.&lt;br /&gt;
* You should remove any temporary results you don&#039;t need in the long term as soon as you&#039;ve generated them, and compress all of the large files.  Bzip2 compresses better than gzip, but either should dramatically reduce the space used, especially for text files such as fasta or fastq.&lt;br /&gt;
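A rough illustration of the compression advice on a toy repetitive file (actual ratios depend on the data):

```shell
# Build a small, highly repetitive toy file standing in for a FASTA file.
yes 'ACGTACGTACGTACGT' | head -n 5000 > toy.fa
# Compress with gzip and bzip2, keeping the original for comparison.
gzip -c toy.fa > toy.fa.gz
bzip2 -c toy.fa > toy.fa.bz2
ls -l toy.fa toy.fa.gz toy.fa.bz2
```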
* For any file stored on the 3Par, especially those in /fs/szscratch, you should ensure that it is owned by the cbcb group and has group write permissions.  This will allow your colleagues to remove files if the disk runs out of space while you are, for example, on vacation (in which case you shouldn&#039;t have any major files sitting around on the 3Par).&lt;br /&gt;
&lt;br /&gt;
For more information on the CBCB resources see [[Getting Started in CBCB]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Seminars ==&lt;br /&gt;
* [http://www.cbcb.umd.edu/seminars Regular CBCB seminars (during academic year)] &amp;lt;br&amp;gt;&lt;br /&gt;
* [[Cbcb:Works-In-Progress]] - Works in progress seminar schedule (Summer 2008) &amp;lt;br&amp;gt;&lt;br /&gt;
* [[short_read_sequencing|Short read sequencing Meeting]] (Fridays at 3pm)&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
* [[Project:Pop-Lab|Pop-Lab]]&lt;br /&gt;
* [[Project:Kingsford-Group|Kingsford Group]]&lt;br /&gt;
* [[Project:Cloud-Computing|Cloud Computing]]&lt;br /&gt;
* [[Project:SummerInternships|Summer Internship Projects]]&lt;br /&gt;
* [[Metagenomics Reading Group (Wed 2pm)]]&lt;br /&gt;
&lt;br /&gt;
== People ==&lt;br /&gt;
 &lt;br /&gt;
* [[User:ayres|Daniel Ayres]]&lt;br /&gt;
* [[User:pknut777|Adam Bazinet]] &lt;br /&gt;
* [[User:amp|Adam M Phillippy]] &lt;br /&gt;
* [[User:adelcher|Arthur L. Delcher]] &lt;br /&gt;
* [[User:carlk|Carl Kingsford]]  &lt;br /&gt;
* [[User:dpuiu|Daniela Puiu]] &lt;br /&gt;
* [[User:dsommer|Dan Sommer]] &lt;br /&gt;
* [[User:gpertea|Geo Pertea]] &lt;br /&gt;
* [[User:jeallen|Jonathan Edward All]] &lt;br /&gt;
* [[User:ayanbule|Kunmi Ayanbule]]&lt;br /&gt;
* [[User:mschatz|Michael Schatz]]&lt;br /&gt;
* [[User:jpaulson|Joseph Paulson]]  &lt;br /&gt;
* [[User:mpertea|Mihaela Pertea]] &lt;br /&gt;
* [[User:mpop|Mihai Pop]] &lt;br /&gt;
* [[User:nelsayed|Najib El-Sayed]] &lt;br /&gt;
* [[User:nedwards|Nathan Edwards]]&lt;br /&gt;
* [[User:niranjan|Niranjan Nagarajan]] &lt;br /&gt;
* [[User:saket|Saket Navlakha]]&lt;br /&gt;
* [[User:angiuoli|Samuel V Angiuoli]] &lt;br /&gt;
* [[User:salzberg|Steven Salzberg]]&lt;br /&gt;
* [[User:tgibbons | Ted Gibbons]]&lt;br /&gt;
* [[User:treangen | Todd J. Treangen]]&lt;br /&gt;
* [[User:whitej|James Robert White]]&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
If you have just received a new UMIACS account through CBCB, follow the instructions on this page to get the basic information you&#039;ll need to start working:&amp;lt;br&amp;gt;&lt;br /&gt;
*[[Getting Started in CBCB]]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage CBCB Storage]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Compute CBCB Computers]&lt;br /&gt;
*[[Communal Software]]&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7313</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7313"/>
		<updated>2010-08-31T13:32:37Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 11: Build summary tables */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files also contain information useful for troubleshooting the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should delete the .nbc files generated for the BAD/NONE partitions so that these rejected sequences do not enter the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch, it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key but then the tables need to be sorted  by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
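This merge behaviour can be illustrated on two toy tab-delimited records; the column layout below is an assumption for illustration, and merge_csv.pl itself is not shown:

```shell
# Old record: key id1, first field filled, second field empty.
printf 'id1\tx\t\n' > old.tsv
# New record for the same key, with both fields filled.
printf 'id1\ty\tz\n' > new.tsv
# Fill only the empty fields of the old record from the new one:
# the existing value x is kept, while the empty field is filled with z.
awk -F '\t' 'NR==FNR { f1[$1]=$2; f2[$1]=$3; next }
  { if ($2 == "") $2 = f1[$1]; if ($3 == "") $3 = f2[$1]; print $1 "\t" $2 "\t" $3 }' new.tsv old.tsv
```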
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10, we construct a collection of tables linking OTUs, taxIDs, and taxonomic names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units; each cell holds the number of sequences assigned to that group/sample pair. At a given taxonomic level, sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/clusters2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages with respect to the total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through a command line use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using users/password combo &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: [[Media:Schema.pdf]]&lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions so that they do not enter the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from the 454 directory&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Upload OTU information into the database ===&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Then upload the resulting partition to the database&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/upload_otus.pl [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Generate summary tables ===&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=7265</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Main_Page&amp;diff=7265"/>
		<updated>2010-07-14T20:01:42Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== IMPORTANT (Read First) ==&lt;br /&gt;
The CBCB computational infrastructure is a shared resource and we all need to pitch in in order to make sure it works well for all of us.  Most importantly, we need to ensure that our disk space and computational resources are used responsibly.  The disk space, in particular, is a valuable commodity and thus it is important to pay attention to the following:&lt;br /&gt;
* There are three types of disk space available (a full list of volumes is available at [https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage CBCB Storage]) :&lt;br /&gt;
** Local hard drives (usually mounted as /scratch on the machines that have this resource).  These are not backed up and are quite fast (they live physically close to the processor), but can only be &#039;seen&#039; by the machine where they are mounted and thus require data staging in/out (which can take a while)&lt;br /&gt;
** Shared 3Par storage (/fs/szasmg*, /fs/szdata/*, etc.).  This is very fast but very expensive disk, and thus a limited resource. Please only use this space to store data temporarily, while you are running analyses on it.  As a rule of thumb, if a file or collection of files of any considerable size has lived on this space for more than 1-2 weeks, it should probably be moved to the attic space (see below)&lt;br /&gt;
** Attic storage (/fs/szattic*).  This is cheaper and ample, but slow and brittle, storage.  Your data-sets should primarily live here.  Due to its brittleness, the IT department does not recommend running analyses directly on this volume; instead, copy the files over to a local hard drive or the 3Par, and copy the results back when done.&lt;br /&gt;
* You should remove any temporary results you don&#039;t need in the long term as soon as you&#039;ve generated them, and compress all large files.  Bzip2 compresses better than gzip, but either should dramatically reduce space requirements, especially for text files such as fasta or fastq.&lt;br /&gt;
* Any file stored on the 3Par, especially in /fs/szscratch, should be owned by the cbcb group and have group write permissions.  This will allow your colleagues to remove files in case the disk runs out of space while you are, for example, on vacation (in which case you shouldn&#039;t have any major files sitting around on the 3Par anyway).&lt;br /&gt;
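As a sketch, the compression and permission conventions above can be applied like this (the cbcb group exists only on UMIACS hosts, and the directory and file names here are hypothetical):&lt;br /&gt;

```shell
mkdir -p results
printf '>read1\nACGTACGT\n' > results/reads.fasta   # stand-in data file
bzip2 results/reads.fasta                           # bzip2 does well on text; gzip also works
chgrp -R cbcb results 2>/dev/null || true           # no-op outside UMIACS hosts
chmod -R g+w results                                # let colleagues clean up if space runs out
```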
&lt;br /&gt;
For more information on the CBCB resources see [[Getting Started in CBCB]].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Seminars ==&lt;br /&gt;
* [http://www.cbcb.umd.edu/seminars Regular CBCB seminars (during academic year)] &amp;lt;br&amp;gt;&lt;br /&gt;
* [[Cbcb:Works-In-Progress]] - Works in progress seminar schedule (Summer 2008) &amp;lt;br&amp;gt;&lt;br /&gt;
* [[short_read_sequencing|Short read sequencing Meeting]] (Fridays at 3pm)&lt;br /&gt;
&lt;br /&gt;
== Projects ==&lt;br /&gt;
&lt;br /&gt;
* [[Project:Pop-Lab|Pop-Lab]]&lt;br /&gt;
* [[Project:Kingsford-Group|Kingsford Group]]&lt;br /&gt;
* [[Project:Cloud-Computing|Cloud Computing]]&lt;br /&gt;
* [[Project:SummerInternships|Summer Internship Projects]]&lt;br /&gt;
&lt;br /&gt;
== People ==&lt;br /&gt;
 &lt;br /&gt;
* [[User:ayres|Daniel Ayres]]&lt;br /&gt;
* [[User:pknut777|Adam Bazinet]] &lt;br /&gt;
* [[User:amp|Adam M Phillippy]] &lt;br /&gt;
* [[User:adelcher|Arthur L. Delcher]] &lt;br /&gt;
* [[User:carlk|Carl Kingsford]]  &lt;br /&gt;
* [[User:dpuiu|Daniela Puiu]] &lt;br /&gt;
* [[User:dsommer|Dan Sommer]] &lt;br /&gt;
* [[User:gpertea|Geo Pertea]] &lt;br /&gt;
* [[User:jeallen|Jonathan Edward Allen]] &lt;br /&gt;
* [[User:ayanbule|Kunmi Ayanbule]]&lt;br /&gt;
* [[User:mschatz|Michael Schatz]]&lt;br /&gt;
* [[User:jpaulson|Joseph Paulson]]  &lt;br /&gt;
* [[User:mpertea|Mihaela Pertea]] &lt;br /&gt;
* [[User:mpop|Mihai Pop]] &lt;br /&gt;
* [[User:nelsayed|Najib El-Sayed]] &lt;br /&gt;
* [[User:nedwards|Nathan Edwards]]&lt;br /&gt;
* [[User:niranjan|Niranjan Nagarajan]] &lt;br /&gt;
* [[User:saket|Saket Navlakha]]&lt;br /&gt;
* [[User:angiuoli|Samuel V Angiuoli]] &lt;br /&gt;
* [[User:salzberg|Steven Salzberg]]&lt;br /&gt;
* [[User:tgibbons | Ted Gibbons]]&lt;br /&gt;
* [[User:whitej|James Robert White]]&lt;br /&gt;
&lt;br /&gt;
== Getting started ==&lt;br /&gt;
If you have just received a new umiacs account through CBCB, follow the instructions on this page to get the basic information you&#039;ll need to start working:&amp;lt;br&amp;gt;&lt;br /&gt;
*[[Getting Started in CBCB]]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Storage CBCB Storage]&lt;br /&gt;
*[https://wiki.umiacs.umd.edu/cbcb-private/index.php/Compute CBCB Computers]&lt;br /&gt;
*[[Communal Software]]&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Members&amp;diff=7264</id>
		<title>Cbcb:Pop-Lab:Members</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:Members&amp;diff=7264"/>
		<updated>2010-07-08T14:09:59Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Pop-lab group members */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Pop-lab group members ==&lt;br /&gt;
[http://www.cbcb.umd.edu/~mpop Mihai Pop] - the big Kahuna &amp;lt;br&amp;gt;&lt;br /&gt;
Dan Sommer - bioinformatics engineer &amp;lt;br&amp;gt;&lt;br /&gt;
[http://www.cs.umd.edu/~ghodsi MohammadReza Ghodsi] - graduate student ([http://www.cs.umd.edu CS]) &amp;lt;br&amp;gt;&lt;br /&gt;
[http://www.cs.umd.edu/~sergek Sergey Koren] - graduate student ([http://www.cs.umd.edu CS]) &amp;lt;br&amp;gt;&lt;br /&gt;
[http://www.cbcb.umd.edu/~boliu Bo Liu] - graduate student ([http://cbmg.umd.edu CBMG]) &amp;lt;br&amp;gt;&lt;br /&gt;
Ted Gibbons - graduate student &amp;lt;br&amp;gt;&lt;br /&gt;
Chris Hill - graduate student ([http://www.cs.umd.edu CS])&amp;lt;br&amp;gt;&lt;br /&gt;
Joseph Paulson - graduate student (AMSC)&amp;lt;br&amp;gt;&lt;br /&gt;
Brianna Lindsay - graduate student (Epidemiology)&amp;lt;br&amp;gt;&lt;br /&gt;
==Alumni==&lt;br /&gt;
[http://www.cbcb.umd.edu/~niranjan Niranjan Nagarajan] - postdoctoral fellow &amp;lt;br&amp;gt;&lt;br /&gt;
[http://www.cbcb.umd.edu/~whitej James White] - graduate student ([http://amsc.umd.edu AMSC]) &amp;lt;br&amp;gt;&lt;br /&gt;
[http://www.cbcb.umd.edu/~langmead/index.html Ben Langmead] - graduate student ([http://www.cs.umd.edu CS]) &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:TODO&amp;diff=7259</id>
		<title>Cbcb:Pop-Lab:TODO</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:TODO&amp;diff=7259"/>
		<updated>2010-06-16T00:36:41Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AMOS==&lt;br /&gt;
* Create &amp;quot;group&amp;quot; object to sub-cluster sequences/contigs/scaffolds/etc&lt;br /&gt;
* Remove iid concept - keep bid and let message files use eids for cross-referencing.  This will allow the merging of files&lt;br /&gt;
* Remove the layout objects - equivalent with contig without a sequence&lt;br /&gt;
* New way to store sequence data targeted at short sequences for which it&#039;s too verbose to store names and such&lt;br /&gt;
* Figure out a way to handle 454 mate-pairs, esp. in hawkeye&lt;br /&gt;
* Build a better &#039;toAmos&#039;&lt;br /&gt;
&lt;br /&gt;
Note: some progress in these directions has been made but the changes need to be cleaned up.&lt;br /&gt;
&lt;br /&gt;
==AMOScmp==&lt;br /&gt;
* Integrate other aligners - Daniela has a preliminary pipeline that works with SOAP, but why not Bowtie? A solution would be to allow inputs in BAM format.&lt;br /&gt;
&lt;br /&gt;
== Bambus 2 ==&lt;br /&gt;
* Handle scaffolds better than currently done (now the code uses &amp;quot;contigs&amp;quot; instead of scaffolds)&lt;br /&gt;
* Write good documentation and put together pipelines for various types of data and underlying assemblers&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:TODO&amp;diff=7258</id>
		<title>Cbcb:Pop-Lab:TODO</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:TODO&amp;diff=7258"/>
		<updated>2010-06-16T00:35:15Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==AMOS==&lt;br /&gt;
* Create &amp;quot;group&amp;quot; object to sub-cluster sequences/contigs/scaffolds/etc&lt;br /&gt;
* Remove iid concept - keep bid and let message files use eids for cross-referencing.  This will allow the merging of files&lt;br /&gt;
* Remove the layout objects - equivalent with contig without a sequence&lt;br /&gt;
* New way to store sequence data targeted at short sequences for which it&#039;s too verbose to store names and such&lt;br /&gt;
* Figure out a way to handle 454 mate-pairs, esp. in hawkeye&lt;br /&gt;
* Build a better &#039;toAmos&#039;&lt;br /&gt;
&lt;br /&gt;
Note: some progress in these directions has been made but the changes need to be cleaned up.&lt;br /&gt;
&lt;br /&gt;
==AMOScmp==&lt;br /&gt;
* integrate other aligners - Daniela has a preliminary pipeline that works with SOAP but why not Bowtie? A solution would be to allow inputs in BAM format.&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab&amp;diff=7257</id>
		<title>Cbcb:Pop-Lab</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab&amp;diff=7257"/>
		<updated>2010-06-16T00:33:28Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Important pages */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Important pages =&lt;br /&gt;
* Group members [[Cbcb:Pop-Lab:Members]] &lt;br /&gt;
* Blog: [http://genomes.blogspot.com Genomes.blogspot.com]&lt;br /&gt;
* How-tos: [[Cbcb:Pop-Lab:How-to]] - List of documentation on how to perform certain analyses, use different software, etc.&lt;br /&gt;
* Software developed by us: [[Cbcb:Pop-Lab:Software]]&lt;br /&gt;
* Metagenomics papers: [[Cbcb:Pop-Lab:Papers]]&lt;br /&gt;
* TODO items (please volunteer to take some of these on): [[Cbcb:Pop-Lab:TODO]]&lt;br /&gt;
* Grand Challenges: [[Cbcb:Pop-Lab:Challenges]] - research challenges&lt;br /&gt;
&lt;br /&gt;
= Meetings/seminars =&lt;br /&gt;
* [[Pop_group_meeting|Group meeting]]: every Monday at 2:30pm in the corner conference room (3120 C)&lt;br /&gt;
* [[seminars|CBCB seminar]]: Thursdays 2-3pm during the semester in main conference room&lt;br /&gt;
* [[Summer_Reading_Group_2009|Summer Reading Group 2009 (Computational Metabolic Pathway Analysis)]]: Fridays from 2ish-5ish in 3120C in the SE corner of the Bioscience Research Building (BSB) from June 5-August 28&lt;br /&gt;
&lt;br /&gt;
= Progress Reports =&lt;br /&gt;
* [[Cbcb:Pop-Lab:Ben-Report|Ben]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Chris-Report|Chris]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:James-Report|James]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Serge-Report|Serge]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Bo-Report|Bo]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Dan-Report|Dan]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Mohammad-Report|Mohammad]]&lt;br /&gt;
* [[Cbcb:Pop-Lab:Ted-Report|Ted]]&lt;br /&gt;
&lt;br /&gt;
= Pop Lab Presentations =&lt;br /&gt;
* [[Media:Xoo_Cripsr.odg|Xanthomonas Crispr Slides (openoffice format)]] &lt;br /&gt;
* Listeria Monocytogenes Slides (powerpoint)&lt;br /&gt;
* Metagenome pipelines overview [[Media:Pipeline_outline.pdf|Presentation(pdf)]]&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:Schema.pdf&amp;diff=7256</id>
		<title>File:Schema.pdf</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=File:Schema.pdf&amp;diff=7256"/>
		<updated>2010-06-10T20:32:42Z</updated>

		<summary type="html">&lt;p&gt;Mpop: GEMS database schema&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;GEMS database schema&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7255</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7255"/>
		<updated>2010-06-10T20:32:11Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Database information */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
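The spreadsheet cleanup in the first bullet above can be sketched as follows (a minimal sketch with hypothetical file names and column layout; dos2unix is emulated with sed, and Sample ID is assumed to be column 1):&lt;br /&gt;

```shell
# Toy stand-in for an exported batch sheet (CRLF line endings, quoted fields).
printf 'Sample ID\tWell\r\n"S002"\tA2\r\n"S001"\tA1\r\n' > batch.csv

sed -i 's/\r$//' batch.csv   # dos2unix: strip carriage returns
sed -i 's/"//g'  batch.csv   # drop the quotes added by Excel/OOffice
# keep the header row, sort the remaining rows by Sample ID (column 1)
{ head -n 1 batch.csv; tail -n +2 batch.csv | sort -t "$(printf '\t')" -k1,1; } > batch.sorted.csv
mv batch.sorted.csv batch.csv
```

The cleaned file then serves as the first argument to add_barcode.pl.&lt;br /&gt;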
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
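The concatenation can be as simple as the following (hypothetical file names standing in for the per-run .seq files):&lt;br /&gt;

```shell
# Toy per-run FASTA files standing in for the real sff_extract output.
printf '>read1\nACGTACGT\n' > run1.seq
printf '>read2\nTTGGCCAA\n' > run2.seq
cat run1.seq run2.seq > batch.all.seq   # one multi-FASTA for the whole batch
```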
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions so that they do not enter the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID to a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID - within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
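A couple of quick sanity checks on the .cluster format described above (toy data below, not real clusterk7 output):&lt;br /&gt;

```shell
# Toy .cluster file: one cluster per line, cluster center listed first.
printf '1_1 1_2 1_3\n2_1 2_2\n' > Run.fna.cluster
awk '{print $1}' Run.fna.cluster                       # the cluster centers
awk '{print NF}' Run.fna.cluster | sort -n | uniq -c   # cluster-size histogram
```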
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
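A quick way to tabulate how often each taxid was assigned (toy file contents below; the real input is the Run[date].centers.taxid produced above):&lt;br /&gt;

```shell
# Toy stand-in for the tab-delimited findtaxid output (name TAB taxid).
printf 'seq1\t123\nseq2\t123\nseq3\t456\n' > Run.centers.taxid
cut -f2 Run.centers.taxid | sort | uniq -c | sort -rn   # most common taxids first
```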
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10, we construct a collection of tables linking OTUs, taxids, and taxonomic names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells are the numbers of sequences assigned to the specific group. If looking at taxonomic levels, the sequences without an assignment at that level are assigned to a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain the number of OTUs assigned to each taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through a command line use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the username/password combo &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: [[Media:Schema.pdf]]&lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Clean up meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should remove the .nbc versions of the BAD/NONE files to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from the 454 directory:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, with the cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Upload OTU information into the database ===&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Then upload the resulting partition to the database&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/upload_otus.pl [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Generate summary tables ===&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7254</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7254"/>
		<updated>2010-06-10T15:49:30Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either (i) an .sff file, (ii) a fasta file plus a quality file, or (iii) just a fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about each sample, including the &amp;quot;Sample ID&amp;quot;, the well on the plate, and additional information regarding sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Clean up meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
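The manual cleanup above can be sketched as a small shell pipeline. This is a minimal illustration, not one of the pipeline scripts; the file name and the position of the Sample ID column (assumed to be column 1) are invented for the demo.

```shell
# A minimal sketch of the cleanup, assuming the Sample ID is the first column
# of the tab-delimited export (demo data, not a real batch file).
printf 'Sample ID\tWell\r\n"S2"\tA2\r\n"S1"\tA1\r\n' > batch_demo.csv
# Strip DOS line endings (as dos2unix would) and the quotes added by Excel/OOffice.
cat batch_demo.csv | tr -d '\r' | sed 's/"//g' > batch_demo.noquote.csv
# Keep the header row in place and sort the data rows by Sample ID.
head -n 1 batch_demo.noquote.csv > batch_demo.clean.csv
tail -n +2 batch_demo.noquote.csv | sort -t "$(printf '\t')" -k1,1 >> batch_demo.clean.csv
cat batch_demo.clean.csv
```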
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
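The troubleshooting information can be tallied with a one-liner. This is an illustrative sketch only, using the two-column layout described above (read name, then either the cycle count for a short read or the first 8 characters for a read with Ns); the file contents are invented.

```shell
# Illustrative sketch: tally rejection reasons in a [batch].BAD.list-style file.
# Second column is a cycle count for short reads, or an 8-character prefix otherwise.
printf 'seq001\t62\nseq002\tNNNNACGT\nseq003\t48\n' > demo.BAD.list
awk '{ if ($2 ~ /^[0-9]+$/) short++; else ns++ }
     END { printf "short=%d ns=%d\n", short, ns }' demo.BAD.list
```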
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should remove the .nbc versions of the BAD/NONE files to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top level ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID - within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: &amp;quot;merge&amp;quot; means that when record keys conflict, empty fields in the existing record are filled in with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
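The renaming scheme can be illustrated with a short awk sketch. This is hypothetical, not the combinefa.pl implementation; the File # value (3) and the record names are invented.

```shell
# Hypothetical illustration of the naming scheme: each fasta record is renamed
# to "n_nn", where n is the File # value (3 here, invented) and nn is the
# 1-based index of the sequence within the file.
printf '%s\n' '>origA' 'ACGT' '>origB' 'GGCC' > demo_input.fa
awk -v n=3 '/^>/ { printf ">%d_%d\n", n, ++i; next } { print }' demo_input.fa
```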
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, with the cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
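A quick sanity check on an .align file can be done from the separator lines. This sketch follows the format described above ("#n" lines, with n the cluster size, preceding each cluster); the demo data is invented.

```shell
# Sketch: a "#n" line (n = cluster size) precedes each cluster of aligned records.
printf '%s\n' '#2' '>s1' 'ACGT' '>s2' 'ACGA' '#1' '>s3' 'TTTT' > demo.align
# Count the clusters and sum the member counts from the separator lines.
awk '/^#/ { c++; total += substr($0, 2) } END { printf "clusters=%d seqs=%d\n", c, total }' demo.align
```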
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
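Given the tab-delimited layout described above, individual assignments can be pulled out with awk. This is a sketch on invented demo values, not real pipeline output.

```shell
# Assumed from the description above: the .centers.taxid file is TAB-delimited
# "sequence name, taxid". Names and taxids below are invented for the demo.
printf '3_17\t562\n5_2\t1280\n' > demo.centers.taxid
# Look up the taxid assigned to one cluster center.
awk -F'\t' '$1 == "3_17" { print $2 }' demo.centers.taxid
```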
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and taxonomic names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  Each cell contains the number of sequences assigned to the specific group. When summarizing at a taxonomic level, sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through a command line use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the username/password combo &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: &lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Clean up meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should remove the .nbc versions of the BAD/NONE files to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from the 454 directory:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, with the cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Upload OTU information into the database ===&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Then upload the resulting partition to the database&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/upload_otus.pl [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Generate summary tables ===&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7253</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7253"/>
		<updated>2010-06-10T15:32:05Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either (i) an .sff file, (ii) a fasta file plus a quality file, or (iii) just a fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about each sample, including the &amp;quot;Sample ID&amp;quot;, the well on the plate, and additional information regarding sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Clean up meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
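&lt;br /&gt;
The manual cleanup described in the first bullet can be sketched as follows (the filenames and the assumption that Sample ID is the first column are illustrative, not part of the original protocol):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
dos2unix [batch].txt                        # fix DOS line endings from Excel&lt;br /&gt;
sed &#039;s/&amp;quot;//g&#039; [batch].txt &amp;gt; [batch].tmp       # strip quotes added by Excel/OOffice&lt;br /&gt;
head -1 [batch].tmp &amp;gt; [batch].csv            # keep the header row in place&lt;br /&gt;
tail -n +2 [batch].tmp | sort -t$&#039;\t&#039; -k1,1 &amp;gt;&amp;gt; [batch].csv   # sort data rows by Sample ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;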
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
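&lt;br /&gt;
For illustration only (sequence names and values invented), a [batch].BAD.list might contain entries like the following - a too-short read followed by its cycle count, and an N-containing read followed by its first 8 characters:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
FXA01B001   58&lt;br /&gt;
FXA01B002   TACGNNGT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;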
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID - within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
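&lt;br /&gt;
A hypothetical illustration of the merge semantics (the key and field values are invented): given two records with the same Sample ID, only the empty field is filled in by the new data:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
samples.csv.bak:  S001  [empty]     plate1&lt;br /&gt;
454.csv:          S001  2010-06-01  plate9&lt;br /&gt;
samples.csv:      S001  2010-06-01  plate1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;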
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, with the cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
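&lt;br /&gt;
For illustration (sequence names and taxids invented), the output might look like:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
3_17    816&lt;br /&gt;
3_104   1301&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;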
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10, we construct a collection of tables linking OTUs, taxIDs, and taxonomy names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells are the numbers of sequences assigned to the specific group. When looking at taxonomic levels, the sequences without an assignment at that level are assigned to a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through the command line, use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the username/password combo &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: &lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from the 454 directory:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, with the cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7252</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7252"/>
		<updated>2010-06-10T15:28:03Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
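&lt;br /&gt;
The manual cleanup described in the first bullet can be sketched as follows (the filenames and the assumption that Sample ID is the first column are illustrative, not part of the original protocol):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
dos2unix [batch].txt                        # fix DOS line endings from Excel&lt;br /&gt;
sed &#039;s/&amp;quot;//g&#039; [batch].txt &amp;gt; [batch].tmp       # strip quotes added by Excel/OOffice&lt;br /&gt;
head -1 [batch].tmp &amp;gt; [batch].csv            # keep the header row in place&lt;br /&gt;
tail -n +2 [batch].tmp | sort -t$&#039;\t&#039; -k1,1 &amp;gt;&amp;gt; [batch].csv   # sort data rows by Sample ID&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;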
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
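&lt;br /&gt;
For illustration only (sequence names and values invented), a [batch].BAD.list might contain entries like the following - a too-short read followed by its cycle count, and an N-containing read followed by its first 8 characters:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
FXA01B001   58&lt;br /&gt;
FXA01B002   TACGNNGT&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;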
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file but with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID - within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, with the cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10, we construct a collection of tables linking OTUs, taxIDs, and taxonomy names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells are the numbers of sequences assigned to the specific group. When looking at taxonomic levels, the sequences without an assignment at that level are assigned to a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through the command line, use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the username/password combo &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: &lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode was not recognized (NONE). The files also contain information useful for troubleshooting the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., one per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should delete the .nbc files generated for the BAD/NONE partitions to prevent those sequences from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from 454 directory&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time; a different protocol is necessary for adding new sequences to an already existing analysis.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7251</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7251"/>
		<updated>2010-06-08T20:34:18Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Directory structure */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)&lt;br /&gt;
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
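The cleanup in the first bullet can be sketched on a toy file as follows; the file and column contents are made up, and dos2unix is replaced by an equivalent sed expression:&lt;br /&gt;

```shell
# Toy batch sheet with Excel-style quotes and CRLF line endings
printf '"Sample ID"\tWell\r\n"S2"\tA2\r\n"S1"\tA1\r\n' > batch.raw.csv
# strip quotes and carriage returns (dos2unix), keep header, sort by Sample ID
sed -e 's/"//g' -e 's/\r$//' batch.raw.csv > batch.tmp
{ head -n 1 batch.tmp
  tail -n +2 batch.tmp | sort -t "$(printf '\t')" -k1,1
} > batch.clean.csv
cat batch.clean.csv
```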
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode was not recognized (NONE). The files also contain information useful for troubleshooting the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., one per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should delete the .nbc files generated for the BAD/NONE partitions to prevent those sequences from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done batch by batch, as multiple files might refer to the same Sample ID; within each batch, the Sample ID -&amp;gt; Filename mapping can be assumed to be unique.  In the 454.csv file in the top directory, the unique key is the file name.  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
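A toy illustration of this merge semantics (merge_csv.pl itself handles headers and arbitrary columns; the two-column files below are made up):&lt;br /&gt;

```shell
# old table: S1 is missing its value; new table supplies values for both keys
printf 'S1\t\nS2\tx\n' > old.tsv
printf 'S1\ty\nS2\tz\n' > new.tsv
# fill empty fields in old.tsv from new.tsv; existing values win
awk -F'\t' -v OFS='\t' '
  NR==FNR { v[$1]=$2; next }            # first pass: remember new values
  { if ($2=="" && ($1 in v)) $2=v[$1]; print }
' new.tsv old.tsv > merged.tsv
cat merged.tsv
```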
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
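The numbering idea can be illustrated on a toy table (add_filenum.pl's real input has more columns; this sketch only shows how existing numbers are extended):&lt;br /&gt;

```shell
# a.fa is already numbered; b.fa and c.fa get the next free integers
printf 'File\tFile #\na.fa\t1\nb.fa\t\nc.fa\t\n' > 454.demo.csv
awk -F'\t' -v OFS='\t' '
  NR==1 { print; next }                 # keep the header untouched
  { if ($2!="") { if ($2+0>max) max=$2+0 } else { $2=++max }; print }
' 454.demo.csv > 454.numbered.csv
cat 454.numbered.csv
```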
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time; a different protocol is necessary for adding new sequences to an already existing analysis.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
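The &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; renaming for a single file can be sketched as below; the File # value 7 and the reads are made up, and combinefa.pl additionally handles the 454.csv lookup and concatenation across files:&lt;br /&gt;

```shell
# Two dummy reads; rename headers to <filenum>_<index>
printf '>readA\nACGT\n>readB\nTTGG\n' > demo.fna
awk -v n=7 '/^>/ { printf(">%d_%d\n", n, ++i); next } { print }' demo.fna > demo.renamed.fna
cat demo.renamed.fna
```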
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, with the cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and tax names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells are the numbers of sequences assigned to the specific group. When looking at taxonomic levels, sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages of the total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain the number of OTUs assigned to each taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database from the command line, use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read data from the database using the username/password combination &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: &lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode was not recognized (NONE). The files also contain information useful for troubleshooting the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., one per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should delete the .nbc files generated for the BAD/NONE partitions to prevent those sequences from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run from 454 directory&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7250</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7250"/>
		<updated>2010-06-08T20:33:03Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 4: Upload file names */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode was not recognized (NONE). The files also contain information useful for troubleshooting the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., one per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should delete the .nbc files generated for the BAD/NONE partitions to prevent those sequences from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done batch by batch, as multiple files might refer to the same Sample ID; within each batch, the Sample ID -&amp;gt; Filename mapping can be assumed to be unique.  In the 454.csv file in the top directory, the unique key is the file name.  &lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time; a different protocol is necessary for adding new sequences to an already existing analysis.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, with the cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
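The taxid file lends itself to quick summaries with standard tools. For example, this sketch tallies how many cluster centers were assigned to each taxid, assuming only the two-column, tab-delimited layout described above:

```shell
# Count cluster centers per taxid, most frequent taxids first.
# Column 2 of the tab-delimited .taxid file holds the taxid.
cut -f2 Run[date].centers.taxid | sort | uniq -c | sort -rn | head
```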
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10, we construct a collection of tables linking OTUs, taxIDs, and taxonomic names at various levels to individual samples.  The columns are the samples and the rows are the respective units; each cell holds the number of sequences assigned to the corresponding group.  At a given taxonomic level, sequences without an assignment are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database from the command line, use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the username/password combination &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: &lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
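The `expr` incantation above only strips the `.sff` extension; in any POSIX shell the same loop can be written with parameter expansion instead. This is an equivalent sketch, not a change in behavior:

```shell
# Same loop, using ${i%.sff} to drop the trailing ".sff" instead
# of calling out to expr.
for i in *.sff; do
  name=${i%.sff}
  sff_extract -c -s "$name.seq" -q "$name.qual" "$i"
done
```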
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
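The manual cleanup in the first bullet (DOS line endings, Excel quotes, sorting by Sample ID) can be approximated with standard tools. This is a sketch that assumes a tab-delimited file whose first column is the Sample ID and whose first row is the header; `raw.csv` and `sorted.csv` are hypothetical file names used for illustration:

```shell
# Strip DOS carriage returns and Excel-style double quotes, then
# sort the data rows by the first (Sample ID) column, keeping the
# header row in place.
tr -d '\r' < raw.csv | sed 's/"//g' > clean.csv
{ head -n 1 clean.csv; tail -n +2 clean.csv | sort -t "$(printf '\t')" -k1,1; } > sorted.csv
```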
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
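For quick triage of the failures, the two-column layout of the BAD list (name plus either a cycle count or the first 8 bases) can be split apart. This is a sketch that assumes exactly that layout:

```shell
# Separate "too short" entries (second field is a number: the cycle
# count) from "contains Ns" entries (second field is sequence text).
awk '$2 ~ /^[0-9]+$/ { short++; next }
     { ns++ }
     END { printf "too_short=%d with_Ns=%d\n", short, ns }' [batch].BAD.list
```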
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with an added .nbc suffix.  Remove the .nbc files generated for the BAD/NONE partitions so they are not picked up by the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;br /&gt;
Run this from the 454 directory:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7249</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7249"/>
		<updated>2010-06-08T19:50:59Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with an added .nbc suffix.  Remove the .nbc files generated for the BAD/NONE partitions so they are not picked up by the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file listed in 454.csv will be assigned an integer (if one is not already present).  This number is used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
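The numbering can be sketched in awk. This is an illustration of the idea rather than the actual add_filenum.pl, and it assumes, hypothetically, that the "File #" value sits in the last tab-delimited column:

```shell
# Two passes over the table (illustrative stand-in for add_filenum.pl;
# assumes the "File #" value is the last tab-delimited column):
# pass 1 finds the largest number already assigned, pass 2 fills
# empty cells with the next unused integers.
awk -F'\t' -v OFS='\t' '
  NR == FNR { if (FNR > 1 && $NF != "" && $NF+0 > max) max = $NF+0; next }
  FNR == 1  { print; next }
  $NF == "" { $NF = ++max }
  { print }' 454.csv.bak 454.csv.bak > 454.csv
```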
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time; a different protocol is necessary for adding new sequences to an already existing analysis.&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
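The renaming scheme can be illustrated for a single input file. This is a sketch only (combinefa.pl handles the bookkeeping across all files); the value 7 stands in for a hypothetical "File #" entry and input.fa for one partitioned file:

```shell
# Rewrite FASTA headers to the <n>_<nn> scheme: n is the file's
# "File #" (passed in as n=7 for illustration), nn is the sequence's
# 1-based index within the file.
awk -v n=7 '/^>/ { printf ">%d_%d\n", n, ++i; next } { print }' input.fa
```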
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Run[date].fna.cluster, with one cluster per line and the cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10, we construct a collection of tables linking OTUs, taxIDs, and taxonomic names at various levels to individual samples.  The columns are the samples and the rows are the respective units; each cell holds the number of sequences assigned to the corresponding group.  At a given taxonomic level, sequences without an assignment are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database from the command line, use the command shown below with password &amp;quot;access&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the username/password combination &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: &lt;br /&gt;
* The commands listed below assume you have write access to the database.&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with an added .nbc suffix.  Remove the .nbc files generated for the BAD/NONE partitions so they are not picked up by the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Upload file names ===&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7248</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7248"/>
		<updated>2010-06-08T19:49:29Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
=== Assumptions ===&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
    [batch1].csv     - meta-information about the batch &lt;br /&gt;
    [fasta1]         - fasta files containing the batch&lt;br /&gt;
     ... &lt;br /&gt;
    [fastan]&lt;br /&gt;
    [batch1].part    - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
     part/           - directory where all the partitioned files live &lt;br /&gt;
     ... &lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with an added .nbc suffix.  Remove the .nbc files generated for the BAD/NONE partitions so they are not picked up by the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
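The numbering idea can be sketched with a toy stand-in (NOT add_filenum.pl); for simplicity it assumes already-numbered entries precede unnumbered ones, and the file names are invented.&lt;br /&gt;

```shell
# Toy sketch of file numbering: keep any existing number, and hand each
# unnumbered file name the next free integer. Simplifying assumption:
# numbered entries come first in the input.
printf 'fileA 1\nfileB\nfileC\n' |
awk '{ if (NF == 2) { print; if ($2 > max) max = $2 }   # already numbered
       else print $1, ++max }'                          # assign next integer
# prints: fileA 1 / fileB 2 / fileC 3
```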
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time; a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file, all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt;, where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot; field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence within the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
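The renaming scheme can be sketched as follows; this is a toy stand-in for combinefa.pl, using a made-up file number (3) and a two-record FASTA.&lt;br /&gt;

```shell
# Toy sketch of the n_nn naming scheme: replace each FASTA header with
# "file number"_"running sequence index". File number 3 is made up.
printf '%s\n' '>seqA' 'ACGT' '>seqB' 'GGCC' |
awk -v n=3 '/^>/ { printf(">%d_%d\n", n, ++i); next } { print }'
# prints: >3_1 / ACGT / >3_2 / GGCC
```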
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
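Given the one-cluster-per-line format, cluster sizes can be tallied directly; the identifiers below are invented.&lt;br /&gt;

```shell
# Report each cluster center (first field) and its cluster size from a
# .cluster-style file: one cluster per line, center listed first.
printf '3_1 3_7 3_9\n5_2 5_4\n' |
awk '{ print $1, NF }'
# prints: 3_1 3 / 5_2 2
```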
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10, we construct a collection of tables linking OTUs, tax IDs, and taxonomic names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells contain the number of sequences assigned to the specific group.  At a given taxonomic level, sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Database-based approach ==&lt;br /&gt;
&lt;br /&gt;
=== Database information === &lt;br /&gt;
* To access the database through a command line use:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mysql -u access -p -h cbcbmysql00 gems&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
* A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl&lt;br /&gt;
* All users can read the data from the database using the username/password combo &amp;quot;access&amp;quot;/&amp;quot;access&amp;quot;.&lt;br /&gt;
* Database schema: &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
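The expr invocation above strips the .sff extension from each file name; shown here on a made-up name.&lt;br /&gt;

```shell
# expr prints the \(...\) capture group: everything before ".sff".
# "GQY1XT001.sff" is a hypothetical file name.
name=`expr GQY1XT001.sff : '\(.*\)\.sff'`
echo "$name"   # prints: GQY1XT001
```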
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure that the quotes added by Excel/OpenOffice are removed, that the date is added (if not already in the file), and that the file is sorted by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE).  The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
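The barcode-driven split can be sketched with a toy example; this is an illustration of the idea only, not code2part.pl, and the 8-base barcodes and sample names are invented.&lt;br /&gt;

```shell
# Toy sketch of barcode partitioning: reads whose first 8 bases match a
# known barcode are assigned to that sample; everything else goes to
# NONE. The barcode table here is made up.
printf 'ACGTACGTTTTT\nGGGGCCCCAAAA\nNNNNNNNNAAAA\n' |
awk 'BEGIN { bc["ACGTACGT"] = "sample1"; bc["GGGGCCCC"] = "sample2" }
     { key = substr($0, 1, 8)
       s = (key in bc) ? bc[key] : "NONE"
       print s, $0 }'
# prints: sample1 ACGTACGTTTTT / sample2 GGGGCCCCAAAA / NONE NNNNNNNNAAAA
```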
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., one per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original files, with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions to prevent their sequences from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: upload file names ===&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7247</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=7247"/>
		<updated>2010-06-08T19:43:53Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= 16S analysis pipeline =&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
  Main/&lt;br /&gt;
    samples.csv      - information about all the samples available to us&lt;br /&gt;
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
    phylochip.csv    - information about all Phylochip runs&lt;br /&gt;
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
    scripts/         - scripts used to process the data&lt;br /&gt;
  454/               - here&#039;s where all 454 sequences live&lt;br /&gt;
    [batch1]/        - ... each batch in a separate directory&lt;br /&gt;
      [batch1].csv   - meta-information about the batch&lt;br /&gt;
      [fasta1]       - fasta files containing the batch&lt;br /&gt;
      ...&lt;br /&gt;
      [fastan]&lt;br /&gt;
      [batch1].part  - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
      part/          - directory where all the partitioned files live&lt;br /&gt;
    ...&lt;br /&gt;
    [batchn]&lt;br /&gt;
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== File-based approach == &lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure that the quotes added by Excel/OpenOffice are removed, that the date is added (if not already in the file), and that the file is sorted by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE).  The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., one per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original files, with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions to prevent their sequences from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top level ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: the merge option means that when record keys conflict, empty fields are filled in with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating that 454 sequences are available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time; a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file, all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt;, where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot; field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence within the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10, we construct a collection of tables linking OTUs, tax IDs, and taxonomic names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells contain the number of sequences assigned to the specific group.  At a given taxonomic level, sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages wrt total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6125</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6125"/>
		<updated>2009-12-04T15:53:25Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 11: Build summary tables */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure that the quotes added by Excel/OpenOffice are removed, that the date is added (if not already in the file), and that the file is sorted by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE).  The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e., one per individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original files, with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions to prevent their sequences from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top level ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: the merge option means that when record keys conflict, empty fields are filled in with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating that 454 sequences are available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time; a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file, all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt;, where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot; field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence within the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Run[date].fna.cluster, with one cluster per line; the cluster center is listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
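A hypothetical helper (not part of the pipeline) for inspecting the .cluster output, using a two-line demo file in place of Run[date].fna.cluster:&lt;br /&gt;

```shell
# Hypothetical helper: print each cluster's size and its center
# (the first identifier on the line). A two-line demo file stands
# in for Run[date].fna.cluster.
demo=$(mktemp)
printf '1_1 1_2 2_1\n2_2\n' > "$demo"
awk '{print NF, $1}' "$demo"   # columns: cluster size, center id
rm -f "$demo"
```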
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and tax names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells contain the number of sequences assigned to the specific group.  When looking at taxonomic levels, the sequences without an assignment at that level are assigned to a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
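The relationship between the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; tables can be sketched as follows (an illustration only; the demo table and awk script are not part of the pipeline scripts):&lt;br /&gt;

```shell
# Illustration only: derive "percent" entries from "count" entries by
# dividing each cell by its sample's (column) total. The demo table
# stands in for [batch].otus.count.csv (tab-delimited for simplicity).
demo=$(mktemp)
printf 'OTU1\t8\t1\nOTU2\t2\t3\n' > "$demo"
awk -F'\t' '
  NR==FNR { for (i = 2; i <= NF; i++) tot[i] += $i; next }  # pass 1: column totals
  { printf "%s", $1                                         # pass 2: percentages
    for (i = 2; i <= NF; i++) printf "\t%.1f", 100 * $i / tot[i]
    printf "\n" }
' "$demo" "$demo"
rm -f "$demo"
```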
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following outputs will be created:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
[batch].stats.txt - overall statistics for the data-set&lt;br /&gt;
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent&lt;br /&gt;
       # of sequences in each OTU/Sample pair&lt;br /&gt;
[batch].otus.percent.csv - same as &amp;quot;count&amp;quot; except that entries are percentages with respect to the total sequences&lt;br /&gt;
       in each sample&lt;br /&gt;
[batch].[tax].[otu|count|percent] - same as the &amp;quot;otus&amp;quot; file except at varying taxonomic levels.&lt;br /&gt;
       [tax] is one of &amp;quot;strain&amp;quot;, &amp;quot;species&amp;quot;, &amp;quot;genus&amp;quot;, &amp;quot;family&amp;quot;, &amp;quot;order&amp;quot;, &amp;quot;class&amp;quot;, &amp;quot;phylum&amp;quot;&lt;br /&gt;
       the &amp;quot;count&amp;quot; and &amp;quot;percent&amp;quot; entries are the same as for the &amp;quot;otus&amp;quot; files&lt;br /&gt;
       the &amp;quot;otu&amp;quot; entries contain number of OTUs assigned to the taxonomic group/Sample pair.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6115</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6115"/>
		<updated>2009-12-01T15:52:18Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 11: Build summary tables */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
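The concatenation itself is not scripted above; a minimal sketch follows, assuming the batch&#039;s sequence files can be globbed as *.seq (the file names here are illustrative):&lt;br /&gt;

```shell
# Minimal sketch of the concatenation step; the plate*.seq names are
# illustrative, and the real files would live in the batch directory.
batchdir=$(mktemp -d)
printf '>a\nACGT\n' > "$batchdir/plate1.seq"
printf '>b\nTTTT\n' > "$batchdir/plate2.seq"
cat "$batchdir"/*.seq > "$batchdir/batch.all.seq"
grep -c '^>' "$batchdir/batch.all.seq"   # 2 sequences total
rm -rf "$batchdir"
```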
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
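A hypothetical troubleshooting one-liner: counting the entries in the two .list files gives a quick view of how many sequences failed each check (demo files stand in for the real outputs):&lt;br /&gt;

```shell
# Hypothetical troubleshooting one-liner: count the failures recorded
# in the two .list files. Demo files stand in for the real
# [batch].BAD.list and [batch].NONE.list.
tmp=$(mktemp -d)
printf 'seq1 60\nseq2 42\n' > "$tmp/batch.BAD.list"
printf 'seq3 ACGTACGT\n' > "$tmp/batch.NONE.list"
wc -l < "$tmp/batch.BAD.list"    # too short / contain Ns
wc -l < "$tmp/batch.NONE.list"   # unrecognized barcode
rm -rf "$tmp"
```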
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: &amp;quot;merge&amp;quot; means that when record keys conflict, empty fields in the existing record are filled in with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating that 454 sequences are available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file listed in 454.csv will be assigned an integer (if it does not already have one).  This number is used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Run[date].fna.cluster, with one cluster per line; the cluster center is listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and tax names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells contain the number of sequences assigned to the specific group.  When looking at taxonomic levels, the sequences without an assignment at that level are assigned to a generic &amp;quot;No Assignment&amp;quot; bin.&lt;br /&gt;
&lt;br /&gt;
* First create a partition file that contains all the clusters and associated taxids (if they exist)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid &amp;gt; [batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Using this partition, construct summary tables at various taxonomic levels&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6114</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6114"/>
		<updated>2009-12-01T15:23:50Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 9: Run clustering tool */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: &amp;quot;merge&amp;quot; means that when record keys conflict, empty fields in the existing record are filled in with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating that 454 sequences are available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file listed in 454.csv will be assigned an integer (if it does not already have one).  This number is used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.cluster&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna &amp;gt; Run[date].fna.align&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Run[date].fna.cluster, with one cluster per line; the cluster center is listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
The .align file contains aligned FASTA records for all the sequences in each cluster.  Clusters are separated by #&amp;lt;number&amp;gt; where &amp;lt;number&amp;gt; is the number of sequences in the cluster.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.cluster &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and tax names at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  The cells contain the number of sequences assigned to the specific group.  When looking at taxonomic levels, the sequences without an assignment at that level are assigned to a generic &amp;quot;No Assignment&amp;quot; bin.&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6113</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6113"/>
		<updated>2009-12-01T02:23:37Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 9: Run clustering tool */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row information is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline:  sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating that 454 sequences are available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
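The naming scheme can be illustrated with a short awk sketch; sample.nbc.fa, its headers, and the file number 3 are hypothetical, and combinefa.pl is what actually performs the renaming in the pipeline.

```shell
# Sketch of the "File #"_index naming scheme applied to one fasta file.
cd "$(mktemp -d)"
printf '>origA\nACGT\n>origB\nTTGG\n' > sample.nbc.fa
# Replace each header with the file number (here 3) and a running index.
awk -v n=3 '/^>/ { print ">" n "_" (++i); next } { print }' sample.nbc.fa > renamed.fa
cat renamed.fa
```

The headers in renamed.fa come out as >3_1 and >3_2.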
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&#039;&#039;&#039;Note: This part assumes we&#039;re running the whole set of sequences as one batch.&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna &amp;gt; Run[date].fna.clusters&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.clusters, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&#039;&#039;&#039;From here on the code runs the same in both full-run and batch modes&#039;&#039;&#039;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.clusters &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and taxnames at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  Each cell holds the number of sequences assigned to the specific group.  If looking at taxonomic levels, the sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6108</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6108"/>
		<updated>2009-11-25T02:55:04Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* 16S analysis pipeline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either (i) an .sff file, (ii) fasta and quality files, or (iii) just a fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (running dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
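The manual cleanup described in the first bullet can be sketched as a small shell pipeline. Everything here is illustrative: batchA.csv, its two columns, and the intermediate file names are all made up, and the real batch sheets will have more columns.

```shell
# Sketch of the Step 1 cleanup under assumed file names (batchA.csv is hypothetical).
cd "$(mktemp -d)"
# Fake Excel export: CRLF line endings and quoted fields.
printf '"Sample ID"\t"Well"\r\n"S002"\t"A2"\r\n"S001"\t"A1"\r\n' > batchA.csv
# Strip CR characters (the dos2unix step) and the quotes added by Excel/OpenOffice.
cat batchA.csv | tr -d '\r"' > batchA.unix.csv
# Keep the header row, then sort the data rows by Sample ID (first column).
head -1 batchA.unix.csv > batchA.clean.csv
tail -n +2 batchA.unix.csv | sort -t"$(printf '\t')" -k1,1 >> batchA.clean.csv
```

batchA.clean.csv now has the header first and the data rows ordered by Sample ID; adding the date column, if needed, is still a manual step.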
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
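The concatenation can be as simple as cat over the batch's sequence files. The file names below (run1.seq, run2.seq, batchA) are made up for illustration; substitute the real batch name.

```shell
# Sketch: concatenate per-run fasta files into a single [batch].all.seq.
cd "$(mktemp -d)"
printf '>r1\nACGT\n' > run1.seq
printf '>r2\nTTGG\n' > run2.seq
# Note: the output name also ends in .seq, so remove any old copy before re-running,
# or the glob will pick it up and the file will include itself.
cat *.seq > batchA.all.seq
grep -c '' batchA.all.seq   # 4 lines: two headers plus two sequence lines
```

The matching .qual files, if present, should be concatenated the same way and in the same order.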
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE).  The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating that 454 sequences are available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: this assumes we&#039;re running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;br /&gt;
&lt;br /&gt;
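The naming scheme can be illustrated with a short awk sketch; sample.nbc.fa, its headers, and the file number 3 are hypothetical, and combinefa.pl is what actually performs the renaming in the pipeline.

```shell
# Sketch of the "File #"_index naming scheme applied to one fasta file.
cd "$(mktemp -d)"
printf '>origA\nACGT\n>origB\nTTGG\n' > sample.nbc.fa
# Replace each header with the file number (here 3) and a running index.
awk -v n=3 '/^>/ { print ">" n "_" (++i); next } { print }' sample.nbc.fa > renamed.fa
cat renamed.fa
```

The headers in renamed.fa come out as >3_1 and >3_2.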
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir Analysis/Run[date]&lt;br /&gt;
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output will be in Analysis/Run[date]/Run[date].fna&lt;br /&gt;
&lt;br /&gt;
=== Step 9: Run clustering tool ===&lt;br /&gt;
&lt;br /&gt;
* First generate clusters&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
cd Analysis/Run[date]&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output will be in Run[date].fna.clusters, one cluster per line, cluster center listed as the first identifier.&lt;br /&gt;
&lt;br /&gt;
* Then extract the cluster centers&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna &amp;lt; Run[date].fna.clusters &amp;gt; Run[date].centers.fna&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 10: Assign putative taxonomic labels to clusters ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna &amp;gt; Run[date].centers.taxid&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Output is tab-delimited: sequence name &amp;lt;TAB&amp;gt; taxid&lt;br /&gt;
Note: using &amp;quot;findtax&amp;quot; instead of &amp;quot;findtaxid&amp;quot; will retrieve actual taxonomy names.&lt;br /&gt;
&lt;br /&gt;
=== Step 11: Build summary tables ===&lt;br /&gt;
&lt;br /&gt;
Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, and taxnames at various taxonomic levels to individual samples.  The columns are the samples and the rows are the respective units.  Each cell holds the number of sequences assigned to the specific group.  If looking at taxonomic levels, the sequences without an assignment at that level are placed in a generic &amp;quot;No Assignment&amp;quot; bin.&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6107</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6107"/>
		<updated>2009-11-25T02:21:13Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* 16S analysis pipeline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either (i) an .sff file, (ii) fasta and quality files, or (iii) just a fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (running dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
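The manual cleanup described in the first bullet can be sketched as a small shell pipeline. Everything here is illustrative: batchA.csv, its two columns, and the intermediate file names are all made up, and the real batch sheets will have more columns.

```shell
# Sketch of the Step 1 cleanup under assumed file names (batchA.csv is hypothetical).
cd "$(mktemp -d)"
# Fake Excel export: CRLF line endings and quoted fields.
printf '"Sample ID"\t"Well"\r\n"S002"\t"A2"\r\n"S001"\t"A1"\r\n' > batchA.csv
# Strip CR characters (the dos2unix step) and the quotes added by Excel/OpenOffice.
cat batchA.csv | tr -d '\r"' > batchA.unix.csv
# Keep the header row, then sort the data rows by Sample ID (first column).
head -1 batchA.unix.csv > batchA.clean.csv
tail -n +2 batchA.unix.csv | sort -t"$(printf '\t')" -k1,1 >> batchA.clean.csv
```

batchA.clean.csv now has the header first and the data rows ordered by Sample ID; adding the date column, if needed, is still a manual step.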
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
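The concatenation can be as simple as cat over the batch's sequence files. The file names below (run1.seq, run2.seq, batchA) are made up for illustration; substitute the real batch name.

```shell
# Sketch: concatenate per-run fasta files into a single [batch].all.seq.
cd "$(mktemp -d)"
printf '>r1\nACGT\n' > run1.seq
printf '>r2\nTTGG\n' > run2.seq
# Note: the output name also ends in .seq, so remove any old copy before re-running,
# or the glob will pick it up and the file will include itself.
cat *.seq > batchA.all.seq
grep -c '' batchA.all.seq   # 4 lines: two headers plus two sequence lines
```

The matching .qual files, if present, should be concatenated the same way and in the same order.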
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE).  The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix.  You should remove the .nbc files generated for the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and also record the number of sequences in that file.  Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch it can be assumed that the Sample ID -&amp;gt; Filename mapping is unique.  In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename.  Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that if record keys conflict, the empty fields will be updated with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating that 454 sequences are available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the 454.csv file will be assigned an integer (if one is not already available).  This number will be used to prefix the sequences in the combined file for the project.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv 454.csv 454.csv.bak&lt;br /&gt;
${SCRIPTS}/add_filenum.pl 454.csv.bak &amp;gt; 454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 8: Combine all fasta files into a single one ===&lt;br /&gt;
&lt;br /&gt;
In the new file all sequences will be named &amp;lt;n&amp;gt;_&amp;lt;nn&amp;gt; where &amp;lt;n&amp;gt; is the value in the &amp;quot;File #&amp;quot;&lt;br /&gt;
field in 454.csv and &amp;lt;nn&amp;gt; is the index of the sequence in the file.&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6106</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6106"/>
		<updated>2009-11-25T01:45:53Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* 16S analysis pipeline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either (i) an .sff file, (ii) fasta and quality files, or (iii) just a fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program from the Staden package (if I&#039;m not mistaken).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff; do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (running dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
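The manual cleanup described in the first bullet can be sketched as a small shell pipeline. Everything here is illustrative: batchA.csv, its two columns, and the intermediate file names are all made up, and the real batch sheets will have more columns.

```shell
# Sketch of the Step 1 cleanup under assumed file names (batchA.csv is hypothetical).
cd "$(mktemp -d)"
# Fake Excel export: CRLF line endings and quoted fields.
printf '"Sample ID"\t"Well"\r\n"S002"\t"A2"\r\n"S001"\t"A1"\r\n' > batchA.csv
# Strip CR characters (the dos2unix step) and the quotes added by Excel/OpenOffice.
cat batchA.csv | tr -d '\r"' > batchA.unix.csv
# Keep the header row, then sort the data rows by Sample ID (first column).
head -1 batchA.unix.csv > batchA.clean.csv
tail -n +2 batchA.unix.csv | sort -t"$(printf '\t')" -k1,1 >> batchA.clean.csv
```

batchA.clean.csv now has the header first and the data rows ordered by Sample ID; adding the date column, if needed, is still a manual step.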
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
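The concatenation can be as simple as cat over the batch's sequence files. The file names below (run1.seq, run2.seq, batchA) are made up for illustration; substitute the real batch name.

```shell
# Sketch: concatenate per-run fasta files into a single [batch].all.seq.
cd "$(mktemp -d)"
printf '>r1\nACGT\n' > run1.seq
printf '>r2\nTTGG\n' > run2.seq
# Note: the output name also ends in .seq, so remove any old copy before re-running,
# or the glob will pick it up and the file will include itself.
cat *.seq > batchA.all.seq
grep -c '' batchA.all.seq   # 4 lines: two headers plus two sequence lines
```

The matching .qual files, if present, should be concatenated the same way and in the same order.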
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode is not recognized (NONE).  The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (&amp;lt; 75 454 cycles) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix. Remove the .nbc files generated from the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and record the number of sequences in that file. Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top level ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch, however, the Sample ID -&amp;gt; Filename mapping can be assumed to be unique. In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename. Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
From the top directory:&lt;br /&gt;
&lt;br /&gt;
* First add all new samples to the samples.csv file&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; unique merge &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note: merge means that when record keys conflict, empty fields in the existing record are filled in with the new data.&lt;br /&gt;
&lt;br /&gt;
* Update the tag indicating 454 sequences available for this sample&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mv samples.csv samples.csv.bak&lt;br /&gt;
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv &amp;quot;Sample ID&amp;quot; 454 Y &amp;gt; samples.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file listed in 454.csv will be assigned an integer (if one is not already assigned). This number will be used to prefix the sequences in the combined file for the project.&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6105</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6105"/>
		<updated>2009-11-25T01:35:55Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* 16S analysis pipeline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program, from the Staden package (if I&#039;m not mistaken):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix on it), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID. At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode was not recognized (NONE). The files also contain information useful for troubleshooting the pipeline: sequences that are too short (fewer than 75 cycles on the 454 instrument) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix. Remove the .nbc files generated from the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 4: Add file names to sample tables ===&lt;br /&gt;
&lt;br /&gt;
The following needs to be run from the root of the 454 directory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part &amp;gt; [batch]/[batch]_names.csv &amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As a result, the file [batch]/[batch]_names.csv will associate each Sample ID with a file name and record the number of sequences in that file. Note that only files ending in .nbc.fa are processed.&lt;br /&gt;
&lt;br /&gt;
=== Step 5: Merge all the batch meta-info files into a single file at the top level ===&lt;br /&gt;
&lt;br /&gt;
Note: the addition of file names must be done on a batch-by-batch basis, as multiple files might refer to the same Sample ID; within each batch, however, the Sample ID -&amp;gt; Filename mapping can be assumed to be unique. In the 454.csv file in the top directory, the unique key is the file name.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
mv ../454.csv ../454.csv.bak&lt;br /&gt;
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv &amp;quot;Sample ID&amp;quot; non-unique &amp;gt; ../454.csv&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This should probably be run with the filename as the key, but then the tables would need to be sorted by filename. Ultimately, this will all work better once the data are in a relational database.&lt;br /&gt;
&lt;br /&gt;
=== Step 6: Update sample file ===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Step 7: Assign numbers to all filenames ===&lt;br /&gt;
&lt;br /&gt;
Each file in the .csv&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6098</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6098"/>
		<updated>2009-11-23T21:21:28Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* 16S analysis pipeline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 0: Get the sequence information ===&lt;br /&gt;
* From .SFF files (assuming these are 454 sequences)&lt;br /&gt;
&lt;br /&gt;
This step uses the sff_extract program, from the Staden package (if I&#039;m not mistaken):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.sff ;do&lt;br /&gt;
  name=`expr $i : &#039;\(.*\)\.sff&#039;`&lt;br /&gt;
  sff_extract -c -s $name.seq -q $name.qual $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix on it), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID. At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode was not recognized (NONE). The files also contain information useful for troubleshooting the pipeline: sequences that are too short (fewer than 75 cycles on the 454 instrument) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;br /&gt;
&lt;br /&gt;
* Remove barcodes &lt;br /&gt;
(still in the part/ subdirectory)&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
for i in *.seq; do&lt;br /&gt;
  ${SCRIPTS}/unbarcode.pl $i&lt;br /&gt;
done&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output files will have the same name as the original file, with the addition of the .nbc suffix. Remove the .nbc files generated from the BAD/NONE partitions to prevent them from entering the pipeline downstream.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; rm *.BAD.nbc.fa *.NONE.nbc.fa &amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6092</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6092"/>
		<updated>2009-11-21T19:51:40Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* 16S analysis pipeline */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix on it), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID. At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Create partition file ===&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode was not recognized (NONE). The files also contain information useful for troubleshooting the pipeline: sequences that are too short (fewer than 75 cycles on the 454 instrument) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
=== Step 3: Break up the fasta file into separate batches by partition ===&lt;br /&gt;
&lt;br /&gt;
* Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6091</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6091"/>
		<updated>2009-11-21T19:50:30Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 1: Cleanup meta-information */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix on it), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID. At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Create partition file&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode was not recognized (NONE). The files also contain information useful for troubleshooting the pipeline: sequences that are too short (fewer than 75 cycles on the 454 instrument) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
* Break up the fasta file into separate batches by partition&lt;br /&gt;
&lt;br /&gt;
** Create partition directory&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir part&lt;br /&gt;
cd part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
** Partition main file into sub-parts&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The result is one fasta file per sub-partition (i.e. individual subject).&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6090</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6090"/>
		<updated>2009-11-21T16:28:32Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 1: Cleanup meta-information */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix on it), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID. At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* Create partition file&lt;br /&gt;
&lt;br /&gt;
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks, either because they are too short or contain Ns (BAD), or because the barcode was not recognized (NONE). The files also contain information useful for troubleshooting the pipeline: sequences that are too short (fewer than 75 cycles on the 454 instrument) are followed by the number of cycles; sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.&lt;br /&gt;
&lt;br /&gt;
* Break up the fasta file into separate batches by partition&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6089</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6089"/>
		<updated>2009-11-21T16:25:11Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Step 1: Cleanup meta-information */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix on it), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID. At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&amp;lt;/pre&amp;gt;&lt;br /&gt;
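The spreadsheet cleanup described in the first step above can be sketched in shell as follows. The file names and the two-column layout are illustrative assumptions only, not the project's actual files, and `tr` stands in for dos2unix:

```shell
# Illustrative only: a tiny two-column sheet with Excel-style quotes and DOS line endings.
printf '"Sample ID"\t"Well"\r\n"S2"\t"A2"\r\n"S1"\t"A1"\r\n' > batch1.csv

# Strip DOS carriage returns and Excel quotes in one pass (stands in for dos2unix).
tr -d '\r"' < batch1.csv > batch1.noquotes.csv

# Keep the header row; sort the remaining rows by Sample ID (column 1, tab-delimited).
{ head -n 1 batch1.noquotes.csv
  tail -n +2 batch1.noquotes.csv | sort -t "$(printf '\t')" -k1,1
} > batch1.clean.csv
```

Keeping the header out of the `sort` preserves the canonical header row while still ordering the data rows by Sample ID.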
&lt;br /&gt;
* Create partition file&lt;br /&gt;
&lt;br /&gt;
First, concatenate all of a batch's sequence files (if there is more than one) into a single file, [batch].all.seq.&lt;br /&gt;
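A minimal sketch of that concatenation step, using placeholder fasta file names (the real per-batch file names will differ):

```shell
# Illustrative per-batch fasta fragments (placeholder names and reads).
printf '>read1\nACGTACGT\n' > batch1.f1.seq
printf '>read2\nTTGGCCAA\n' > batch1.f2.seq

# Concatenate them into the single file expected by the partition step.
cat batch1.f1.seq batch1.f2.seq > batch1.all.seq
```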
&lt;br /&gt;
&amp;lt;pre&amp;gt;${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] &amp;gt; [batch].part&amp;lt;/pre&amp;gt;&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6088</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6088"/>
		<updated>2009-11-21T16:22:45Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* Directory structure */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
Gates_SOM/&lt;br /&gt;
   Main/&lt;br /&gt;
     samples.csv  - information about all the samples available to us&lt;br /&gt;
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)&lt;br /&gt;
     phylochip.csv - information about all Phylochip runs&lt;br /&gt;
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs&lt;br /&gt;
     scripts/     - scripts used to process the data&lt;br /&gt;
     454/         - here&#039;s where all 454 sequences live&lt;br /&gt;
       [batch1]/  - ... each batch in a separate directory&lt;br /&gt;
          [batch1].csv - meta-information about the batch &lt;br /&gt;
          [fasta1]- fasta files containing the batch&lt;br /&gt;
          ... &lt;br /&gt;
          [fastan]&lt;br /&gt;
          [batch1].part - partition file describing how the sequences get split by barcode/sample&lt;br /&gt;
          part/   - directory where all the partitioned files live &lt;br /&gt;
       ... &lt;br /&gt;
       [batchn]&lt;br /&gt;
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
add_barcode.pl [batch].csv IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6078</id>
		<title>Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:16S-pipeline_16S_analysis_pipeline_(for_Gates_project)&amp;diff=6078"/>
		<updated>2009-11-18T21:03:37Z</updated>

		<summary type="html">&lt;p&gt;Mpop: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== 16S analysis pipeline ==&lt;br /&gt;
&lt;br /&gt;
Assumptions:&amp;lt;br&amp;gt;&lt;br /&gt;
* 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file&lt;br /&gt;
* each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the &amp;quot;Sample ID&amp;quot;, well on the plate, and additional information regarding the sample quality and DNA concentration&lt;br /&gt;
* we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).&lt;br /&gt;
&lt;br /&gt;
=== Directory structure ===&lt;br /&gt;
Gates_SOM/&amp;lt;br&amp;gt;&lt;br /&gt;
   Main/&amp;lt;br&amp;gt;&lt;br /&gt;
     samples.csv &amp;lt;br&amp;gt;&lt;br /&gt;
     454.csv &amp;lt;br&amp;gt;&lt;br /&gt;
     phylochip.csv &amp;lt;br&amp;gt;&lt;br /&gt;
     scripts/ &amp;lt;br&amp;gt;&lt;br /&gt;
     454/ &amp;lt;br&amp;gt;&lt;br /&gt;
       [batch1]/ &amp;lt;br&amp;gt;&lt;br /&gt;
          [batch1].csv &amp;lt;br&amp;gt;&lt;br /&gt;
          [fasta1] &amp;lt;br&amp;gt;&lt;br /&gt;
          ... &amp;lt;br&amp;gt;&lt;br /&gt;
          [fastan] &amp;lt;br&amp;gt;&lt;br /&gt;
       ... &amp;lt;br&amp;gt;&lt;br /&gt;
       [batchn] &amp;lt;br&amp;gt;&lt;br /&gt;
     Phylochip/ &amp;lt;br&amp;gt;&lt;br /&gt;
=== Step 1: Cleanup meta-information ===&lt;br /&gt;
* Convert the Excel sheet containing the batch information into a tab-delimited file (and run dos2unix), making sure the quotes added by Excel/OpenOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID.  At this stage, also check that the header row is in canonical format.&lt;br /&gt;
* Add barcode information using add_barcode.pl&lt;br /&gt;
&lt;br /&gt;
add_barcode.pl [batch].csv IGS_Barcodes.csv &amp;gt; [batch]_barcode.csv&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
	<entry>
		<id>https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:How-to&amp;diff=6077</id>
		<title>Cbcb:Pop-Lab:How-to</title>
		<link rel="alternate" type="text/html" href="https://wiki.umiacs.umd.edu/cbcb/index.php?title=Cbcb:Pop-Lab:How-to&amp;diff=6077"/>
		<updated>2009-11-18T20:47:58Z</updated>

		<summary type="html">&lt;p&gt;Mpop: /* How-To repository */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==How-To repository==&lt;br /&gt;
[http://www.cbcb.umd.edu/intranet/resources.shtml Getting started at the CBCB] &amp;lt;br&amp;gt;&lt;br /&gt;
[[Cbcb:Pop-Lab:AMOS-CVS How to use AMOS through CVS]] &amp;lt;br&amp;gt;&lt;br /&gt;
[[Cbcb:Pop-Lab:AMOScmp-SR How to use AMOScmp with short read data ]] &amp;lt;br&amp;gt;&lt;br /&gt;
How do I annotate a genome at the CBCB?&amp;lt;br&amp;gt;&lt;br /&gt;
[[Cbcb:Pop-Lab:How do I run the new Bambus | How do I run the new Bambus?]] &amp;lt;br&amp;gt;&lt;br /&gt;
How do I use the antibiotic resistance database?&amp;lt;br&amp;gt;&lt;br /&gt;
How do I use the antibiotic resistance database locally? &amp;lt;br&amp;gt;&lt;br /&gt;
How do I run jobs on the grid? &amp;lt;br&amp;gt;&lt;br /&gt;
[[Cbcb:Pop-Lab:OTUs How do I create OTUs from 16S rRNA sequence data? | How do I create OTUs from 16S rRNA sequence data?]] &amp;lt;br&amp;gt;&lt;br /&gt;
How do I compare metagenomic datasets through the metastats website?&amp;lt;br&amp;gt;&lt;br /&gt;
How do I compare metagenomic datasets using R directly? &amp;lt;br&amp;gt;&lt;br /&gt;
How do I find CRISPRs in a new genome?&amp;lt;br&amp;gt;&lt;br /&gt;
[[Cbcb:Pop-Lab:SOMA How do I scaffold a genome using optical maps (both locally and through the web) ]] &amp;lt;br&amp;gt;&lt;br /&gt;
How do I generate graph information out of Minimus?&amp;lt;br&amp;gt;&lt;br /&gt;
What tools are available for doing &amp;lt;i&amp;gt;in silico&amp;lt;/i&amp;gt; finishing at the CBCB? &amp;lt;br&amp;gt;&lt;br /&gt;
How do I generate a scaffold graph starting from a 454 .ace file?&amp;lt;br&amp;gt;&lt;br /&gt;
How do I draw a pretty picture of a scaffold stored in an AMOS bank?&amp;lt;br&amp;gt;&lt;br /&gt;
[[Cbcb:Pop-Lab:How to use the partition software? | How do I use the partition software?]]&amp;lt;br&amp;gt;&lt;br /&gt;
[[Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)]]&lt;/div&gt;</summary>
		<author><name>Mpop</name></author>
	</entry>
</feed>