Difference between revisions of "Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)"

Revision as of 16:22, 21 November 2009

16S analysis pipeline

Assumptions:

  • 16S rRNA sequences are generated by barcoded 454 sequencing and are received as one of: (i) an .sff file; (ii) a fasta file plus a quality file; (iii) a fasta file only
  • each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the "Sample ID", well on the plate, and additional information regarding the sample quality and DNA concentration
  • we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).

Directory structure

Gates_SOM/
   Main/
     samples.csv  - information about all the samples available to us
     454.csv      - information about all 454 runs (essentially concatenation of .csvs from 454 dir)
     phylochip.csv - information about all Phylochip runs
     IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs
     scripts/     - scripts used to process the data
     454/         - here's where all 454 sequences live
       [batch1]/  - ... each batch in a separate directory
          [batch1].csv - meta-information about the batch 
          [fasta1] - fasta files containing the batch
          ... 
          [fastan]
          [batch1].part - partition file describing how the sequences get split by barcode/sample
          part/   - directory where all the partitioned files live 
       ... 
       [batchn]
     Phylochip/   - all the CEL files and auxiliary information on the Phylochip runs
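The layout above describes 454.csv as essentially a concatenation of the per-batch .csv files. A minimal sketch of that concatenation follows, assuming every [batch].csv is tab-delimited and carries the same header row; the function name concatenate_batch_tables is hypothetical and is not one of the pipeline's scripts.

```python
import glob
import os


def concatenate_batch_tables(dir_454, out_path):
    """Concatenate every 454/[batch]/[batch].csv into one table.

    Assumption: all batch files share an identical header row, which is
    written once at the top of the output.
    Returns the number of data rows written.
    """
    header = None
    rows = []
    # The trailing "" makes the pattern end in a separator, so only
    # directories (one per batch) are matched.
    for batch_dir in sorted(glob.glob(os.path.join(dir_454, "*", ""))):
        batch = os.path.basename(os.path.normpath(batch_dir))
        table = os.path.join(batch_dir, batch + ".csv")
        if not os.path.isfile(table):
            continue  # batch directory without meta-information yet
        with open(table) as handle:
            lines = [line.rstrip("\n") for line in handle if line.strip()]
        if not lines:
            continue
        if header is None:
            header = lines[0]
        elif lines[0] != header:
            raise ValueError("header mismatch in " + table)
        rows.extend(lines[1:])
    with open(out_path, "w") as out:
        if header is not None:
            out.write(header + "\n")
        for row in rows:
            out.write(row + "\n")
    return len(rows)
```

Under these assumptions, a call such as concatenate_batch_tables("Gates_SOM/Main/454", "Gates_SOM/Main/454.csv") would rebuild the combined table. Raising an error on a header mismatch is a design choice here: it surfaces batches whose header row is not yet in the canonical format rather than silently mixing columns.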

Step 1: Cleanup meta-information

  • Convert the Excel sheet containing the batch information into a tab-delimited file and run dos2unix on it. Make sure the quotes added by Excel/OpenOffice are removed, add the date (if not already in the file), and sort the file by Sample ID. At this stage also check that the header row is in canonical format.
  • Add barcode information using add_barcode.pl

add_barcode.pl [batch].csv IGS_Barcodes.csv > [batch]_barcode.csv
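The manual cleanup in the first bullet of Step 1 could be scripted roughly as follows. This is only a sketch under stated assumptions: the sheet has already been exported as tab-delimited text, the date goes into a column named "Date" (a hypothetical name), and Sample IDs are sorted as plain strings. The function clean_batch_table is illustrative and not part of scripts/; the canonical header format is not spelled out on this page, so the sketch only checks that a "Sample ID" column exists.

```python
import csv
import datetime


def clean_batch_table(in_path, out_path, date=None,
                      id_column="Sample ID", date_column="Date"):
    """Strip quotes and DOS line endings, add a date column, sort by sample."""
    with open(in_path, newline="") as handle:
        # csv.reader removes the quotes that Excel/OpenOffice put around
        # fields and accepts both \r\n and \n line endings.
        rows = [[field.strip() for field in row]
                for row in csv.reader(handle, delimiter="\t") if row]
    if not rows:
        raise ValueError("empty table: " + in_path)
    header, body = rows[0], rows[1:]
    if id_column not in header:
        raise ValueError("header row has no '%s' column" % id_column)
    # Pad short rows so every row has one field per header column.
    body = [row + [""] * (len(header) - len(row)) for row in body]
    if date_column not in header:
        # Assumed convention: one ISO date for the whole batch.
        stamp = date or datetime.date.today().isoformat()
        header = header + [date_column]
        body = [row + [stamp] for row in body]
    key = header.index(id_column)
    body.sort(key=lambda row: row[key])  # plain string sort (assumption)
    with open(out_path, "w", newline="\n") as out:
        for row in [header] + body:
            out.write("\t".join(row) + "\n")
    return len(body)
```

The cleaned file would then be the [batch].csv passed to add_barcode.pl as shown above.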