16S analysis pipeline (for Gates project)
Assumptions:
- 16S rRNA sequences are generated with barcoded 454 sequencing and are received in one of three forms: (i) an .sff file; (ii) a fasta file and a quality file; (iii) a fasta file only
- each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about each sample, including the "Sample ID", the well on the plate, and additional information regarding sample quality and DNA concentration
- we have a file that specifies the barcode and sequencing adapter used for each well (these assignments do not change).
Directory structure
Gates_SOM/
    Main/
        samples.csv - information about all the samples available to us
        454.csv - information about all 454 runs (essentially concatenation of .csvs from 454 dir)
        phylochip.csv - information about all Phylochip runs
        IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs
        scripts/ - scripts used to process the data
    454/ - here's where all 454 sequences live
        [batch1]/ - ... each batch in a separate directory
            [batch1].csv - meta-information about the batch
            [fasta1] - fasta files containing the batch
            ...
            [fastan]
            [batch1].part - partition file describing how the sequences get split by barcode/sample
            part/ - directory where all the partitioned files live
        ...
        [batchn]
    Phylochip/ - all the CEL files and auxiliary information on the Phylochip runs
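A minimal sketch for creating an empty skeleton of this layout when a new batch arrives. The function name make_batch_dirs is hypothetical and not part of the pipeline's scripts; it only creates the directories named above and does not populate them.

```shell
# Hypothetical helper (not one of the pipeline scripts): create the
# directory skeleton for one batch under the given root.
# Usage: make_batch_dirs <root> <batch>
make_batch_dirs() {
    root=$1
    batch=$2
    # mkdir -p creates missing parents and is a no-op for existing directories.
    mkdir -p "$root/Main" \
             "$root/454/$batch/part" \
             "$root/Phylochip"
}
```

For example, make_batch_dirs Gates_SOM batch1 would create Gates_SOM/454/batch1/part along with the Main/ and Phylochip/ directories.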
Step 0: Get the sequence information
- From .SFF files (assuming these are 454 sequences)
This step uses the sff_extract program from the Staden package (if I'm not mistaken)
for i in *.sff; do
    # strip the .sff extension to get the base name
    name=`expr "$i" : '\(.*\)\.sff'`
    sff_extract -c -s "$name.seq" -q "$name.qual" "$i"
done
Step 1: Cleanup meta-information
- Convert the Excel sheet containing the batch information into a tab-delimited file and run dos2unix on it. Make sure the quotes added by Excel/OpenOffice are removed, add the date (if not already in the file), and sort the file by Sample ID. At this stage, also check that the header row is in canonical format.
- Add barcode information using add_barcode.pl
${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}/IGS_Barcodes.csv > [batch]_barcode.csv
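The cleanup described in the first bullet of this step can be partly scripted. The sketch below is an illustration only: the function name clean_batch_sheet is hypothetical, and it assumes the sheet has already been exported as tab-delimited text, that the first line is the header, and that Sample ID is the first column. It does not add the date or validate the header.

```shell
# Hypothetical helper (not one of the pipeline scripts): clean a
# tab-delimited export of the batch sheet and write the result to stdout.
# Assumes line 1 is the header and "Sample ID" is column 1.
clean_batch_sheet() {
    in=$1
    # Strip DOS carriage returns and the double quotes added by the spreadsheet.
    tr -d '\r"' < "$in" | {
        # Pass the header through unchanged, then sort the rest by column 1.
        IFS= read -r header
        printf '%s\n' "$header"
        sort -t "$(printf '\t')" -k1,1
    }
}
```

A call such as clean_batch_sheet export.txt > [batch].csv would produce the cleaned file; export.txt is a placeholder name.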
Step 2: Create partition file
First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.
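A minimal sketch of the concatenation, assuming the batch's sequence files end in .seq and sit in the current directory (the function name concat_batch is hypothetical). Because the output name also ends in .seq, the sketch removes any earlier output and writes to a temporary name first so the output is never read as one of its own inputs.

```shell
# Hypothetical helper (not one of the pipeline scripts): concatenate all
# .seq files in the current directory into <batch>.all.seq.
concat_batch() {
    batch=$1
    out="$batch.all.seq"
    # Remove a previous result so the *.seq glob below does not pick it up.
    rm -f "$out"
    for f in *.seq; do
        # Skip the unexpanded pattern when no .seq files exist.
        [ -e "$f" ] || continue
        cat "$f"
    done > "$out.tmp"
    mv "$out.tmp" "$out"
}
```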
${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] > [batch].part
The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that failed quality checks: BAD lists sequences that are too short or contain Ns, and NONE lists sequences whose barcode is not recognized. Each entry carries additional information for troubleshooting the pipeline: sequences that are too short (fewer than 75 454 cycles) are followed by the number of cycles, while sequences that contain Ns or have an unknown barcode are followed by the first 8 characters of the sequence.
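When troubleshooting, it can help to tally the entries in these lists. The sketch below assumes each line has the form "name extra", separated by whitespace, where extra is either a cycle count or the first 8 characters of the read; the exact file layout is an assumption, and the function name summarize_rejects is hypothetical.

```shell
# Hypothetical helper (not one of the pipeline scripts): tally the entries
# of a .BAD.list or .NONE.list file.
# Assumed line format: "<sequence name> <extra>", whitespace-separated.
summarize_rejects() {
    awk '
        $2 ~ /^[0-9]+$/ { short++; next }   # numeric field: a cycle count
        NF >= 2         { prefix[$2]++ }    # otherwise: leading characters of the read
        END {
            printf "too_short\t%d\n", short
            for (p in prefix) printf "%s\t%d\n", p, prefix[p]
        }
    ' "$1"
}
```

A prefix that recurs many times in the NONE list may point to a barcode missing from the barcode file, while a high too_short count suggests a problem with the run itself.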
Step 3: Break up the fasta file into separate batches by partition
- Create partition directory
mkdir part
cd part
- Partition main file into sub-parts
${SCRIPTDIR}/partitionFasta.pl ../[batch].all.seq ../[batch].part
The result is one fasta file per sub-partition (i.e. individual subject).
- Remove barcodes
(still in the part/ subdirectory)
for i in *.seq; do
    ${SCRIPTDIR}/unbarcode.pl "$i"
done
The output files have the same name as the original files, with the .nbc suffix added. Remove the .nbc files that were generated from the BAD and NONE partitions so that they are not fed into the downstream steps of the pipeline.
rm *.BAD.nbc.fa *.NONE.nbc.fa
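As a sanity check after partitioning and barcode removal, the number of reads in each per-sample file can be counted. This is a minimal sketch, not part of the pipeline's scripts; the function name count_reads is hypothetical, and it assumes standard fasta files in which every record starts with a ">" header line.

```shell
# Hypothetical helper (not one of the pipeline scripts): print
# "<file><TAB><number of fasta records>" for each file given.
count_reads() {
    for f in "$@"; do
        # grep -c prints the number of header lines (those starting with ">").
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}
```

For example, count_reads *.seq run in the part/ directory lists the read count per partition; samples with unexpectedly few reads can then be traced back through the BAD and NONE lists.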