Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)

Revision as of 21:03, 18 November 2009 by Mpop (talk | contribs)

16S analysis pipeline

Assumptions:

  • 16S rRNA sequences are generated with barcoded 454 sequencing and are received in one of three forms: (i) an .sff file; (ii) a fasta file plus a quality file; (iii) a fasta file only
  • each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about each sample, including the "Sample ID", the well on the plate, and additional information regarding the sample quality and DNA concentration
  • we have a file that specifies the barcode and sequencing adapter used for each well (these assignments do not change).
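
As a minimal sketch of how the three delivery forms listed above might be told apart automatically, the following Python function looks only at the file names in a batch. The function name `classify_delivery` and the extensions it recognizes (`.fasta`, `.fna`, `.fa` for sequences and `.qual` for qualities) are assumptions made for illustration; they are not part of the pipeline described here.

```python
from pathlib import Path

# Assumed extensions; adjust to whatever the sequencing center actually delivers.
FASTA_EXTS = {".fasta", ".fna", ".fa"}
QUAL_EXTS = {".qual"}


def classify_delivery(filenames):
    """Return 'sff', 'fasta+qual', or 'fasta' for a list of file names.

    Raises ValueError when no sequence file is recognized.
    """
    # Collect the lower-cased extensions present in the batch.
    exts = {Path(name).suffix.lower() for name in filenames}
    if ".sff" in exts:
        return "sff"
    if exts & FASTA_EXTS:
        # A quality file alongside the fasta file gives case (ii).
        return "fasta+qual" if exts & QUAL_EXTS else "fasta"
    raise ValueError("no .sff or fasta file found in batch")
```

For example, `classify_delivery(["plate1.fna", "plate1.qual"])` would report the fasta-plus-quality case, which a driver script could use to decide which downstream steps apply.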

Directory structure

Gates_SOM/
  Main/
    samples.csv
    454.csv
    phylochip.csv
    scripts/
  454/
    [batch1]/
      [batch1].csv
      [fasta1]
      ...
      [fastan]
    ...
    [batchn]
  Phylochip/

Step 1: Cleanup meta-information

  • Convert the Excel sheet containing the batch information into a tab-delimited file and run dos2unix on it. While doing so, make sure the quotes added by Excel/OpenOffice are removed, add the date (if it is not already in the file), and sort the file by Sample ID. At this stage, also check that the header row is in canonical format.
  • Add barcode information using add_barcode.pl

add_barcode.pl [batch].csv IGS_Barcodes.csv > [batch]_barcode.csv
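
The manual cleanup described in the first bullet of Step 1 could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name `clean_batch_sheet` is hypothetical, the date column is assumed to be headed "Date", and Sample IDs are sorted as plain strings. It does not reproduce what add_barcode.pl does, and it does not check the header against the canonical format.

```python
import csv
import io


def clean_batch_sheet(text, run_date, id_column="Sample ID", date_column="Date"):
    """Normalize a tab-delimited batch sheet exported from a spreadsheet.

    - converts DOS/Mac line endings to Unix ones (the job of dos2unix)
    - strips the double quotes added by the spreadsheet export
    - appends a date column if the header lacks one (column name is assumed)
    - sorts the data rows by Sample ID (string order)
    Returns the cleaned sheet as one string.
    """
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # csv.reader drops the surrounding quotes while splitting on tabs.
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t")
            if any(cell.strip() for cell in r)]
    if not rows:
        raise ValueError("batch sheet is empty")
    header = [cell.strip() for cell in rows[0]]
    if id_column not in header:
        raise ValueError("header has no %r column" % id_column)
    # Pad short rows so every row has one cell per header column.
    body = [[cell.strip() for cell in r] + [""] * (len(header) - len(r))
            for r in rows[1:]]
    if date_column not in header:
        header.append(date_column)
        body = [r + [run_date] for r in body]
    key = header.index(id_column)
    body.sort(key=lambda r: r[key])
    return "\n".join("\t".join(r) for r in [header] + body) + "\n"
```

A driver script might read the exported sheet, pass its contents through this function, and write the result as [batch].csv before running add_barcode.pl on it.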