Cbcb:Pop-Lab:16S-pipeline 16S analysis pipeline (for Gates project)

From Cbcb
Revision as of 15:32, 10 June 2010 by Mpop (talk | contribs)
Jump to navigation Jump to search

16S analysis pipeline

Overview

Assumptions

  • 16S rRNA sequences are generated with barcoded 454 and are received as either: (i) .sff file; (ii) fasta and quality file; (iii) just fasta file
  • each batch is a single 96-well plate and is accompanied by a tab-delimited file containing information about the sample, including the "Sample ID", well on the plate, and additional information regarding the sample quality and DNA concentration
  • we have a file that specifies the barcode and sequencing adapter used for each well (these entities do not change).

Directory structure

Gates_SOM/
  Main/
    samples.csv      - information about all the samples available to us
    454.csv          - information about all 454 runs (essentially concatenation of .csvs from 454 dir)
    phylochip.csv    - information about all Phylochip runs
    IGS_Barcodes.csv - information about barcodes used to multiplex 454 runs
    scripts/         - scripts used to process the data (referred to as ${SCRIPTS} below)
    DB/              - scripts used to access the database (referred to as ${DB_SCRIPTS} below)
  454/               - here's where all 454 sequences live
    [batch1]/        - ... each batch in a separate directory
    [batch1].csv     - meta-information about the batch 
    [fasta1]         - fasta files containing the batch
     ... 
    [fastan]
    [batch1].part    - partition file describing how the sequences get split by barcode/sample
     part/           - directory where all the partitioned files live 
     ... 
    [batchn]
  Phylochip/         - all the CEL files and auxiliary information on the Phylochip runs

File-based approach

Step 0: Get the sequence information

  • From .SFF files (assuming these are 454 sequences)

This step uses the sff_extract program from the Staden package (if I'm not mistaken)

for i in *.sff ;do
  name=`expr $i : '\(.*\)\.sff'`
  sff_extract -c -s $name.seq -q $name.qual $i
done

Step 1: Cleanup meta-information

  • Convert the Excel sheet containing the batch information into tab-delimited file (and run dos2unix) making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID. At this stage also check that the header row information is in canonical format.
  • Add barcode information using add_barcode.pl
${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv > [batch]_barcode.csv

Step 2: Create partition file

First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.

${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] > [batch].part

The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (< 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.

Step 3: Break up the fasta file into separate batches by partition

  • Create partition directory
mkdir part
cd part
  • Partition main file into sub-parts
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part

The result is one fasta file per sub-partition (i.e. individual subject).

  • Remove barcodes

(still in the part/ subdirectory)

for i in *.seq; do
  ${SCRIPTS}/unbarcode.pl $i
done

The output files will have the same name as the original file but with the addition of the .nbc suffix. You should remove the .nbc files from the BAD/NONE files in order to prevent their addition to the pipeline downstream.

 rm *.BAD.nbc.fa *.NONE.nbc.fa 

Step 4: Add file names to sample tables

The following needs to be run from the root of the 454 directory.

 ${SCRIPTS}/add_file.pl [batch]/[batch]_barcode.csv [batch]/part > [batch]/[batch]_names.csv 

As a result, the file [batch]/[batch]_names.csv will associate each Sample ID to a file name and also record the number of sequences in that file. Note that only files ending in .nbc.fa are procesed.

Step 5: Merge all the batch meta-info files into a same file at the top

Note: the addition of file names must be done on a batch by batch basis as multiple files might refer to a same Sample ID - within each batch it can be assumed that the Sample ID -> Filename mapping is unique. In the 454.csv file in the top directory the unique key is the file name.

 
mv ../454.csv ../454.csv.bak
${SCRIPTS}/merge_csv.pl ../454.csv.bak [batch]/[batch]_names.csv "Sample ID" non-unique > ../454.csv

This should probably be run with the filename as the key but then the tables need to be sorted by filename. Ultimately, this will all work better once the data are in a relational database.

Step 6: Update sample file

From the top directory:

  • First add all new samples to the samples.csv file
 
mv samples.csv samples.csv.bak
${SCRIPTS}/merge_csv.pl samples.csv.bak 454.csv "Sample ID" unique merge > samples.csv

Note: merge means that if record keys conflict, the empty fields will be updated with the new data.

  • Update the tag indicating 454 sequences available for this sample
mv samples.csv samples.csv.bak
${SCRIPTS}/update_field_csv.pl samples.csv.bak 454.csv "Sample ID" 454 Y > samples.csv

Step 7: Assign numbers to all filenames

Each file in the 454.csv file will be assigned an integer (if one is not already available). This number will be used to prefix the sequences in the combined file for the project.

mv 454.csv 454.csv.bak
${SCRIPTS}/add_filenum.pl 454.csv.bak > 454.csv

Step 8: Combine all fasta files into a single one

Note: this assumes we're running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis

In the new file all sequences will be named <n>_<nn> where <n> is the value in the "File #" field in 454.csv and <nn> is the index of the sequence in the file.

mkdir Analysis/Run[date]
${SCRIPTS}/combinefa.pl -c Analysis/Run[date]/Run[date] -i 454.csv 454

The output will be in Analysis/Run[date]/Run[date].fna

Step 9: Run clustering tool

  • First generate clusters

Note: This part assumes we're running the whole set of sequences as one batch.

cd Analysis/Run[date]/Run[date].fna
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna > Run[date].fna.cluster
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna > Run[date].fna.align

Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.

The .align file contains aligned FASTA records for all the sequences in each cluster. Clusters are separated by #<number> where <number> is the number of sequences in the cluster.


  • Then extract the cluster centers

From here on the code runs the same in both full-run and batch modes

/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna < Run[date].fna.cluster > Run[date].centers.fna

Step 10: Assign putative taxonomic labels to clusters

/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna > Run[date].centers.taxid

Output is tab-delimited: sequence name <TAB> taxid Note: using "findtax" instead of "findtaxid" will retrieve actual taxonomy names.

Step 11: Build summary tables

Using the output from steps 9 and 10 we construct a collection of tables linking OTUs, taxIDs, taxnames at various taxonomic levels to individual samples. The colums are the samples and the rows are the respective units. The cells are numbers of sequences assigned to the specific group. If looking at taxonomic levels, the sequences without an assignment at that level are assigned to a generic "No Assignment" bin.

  • First create a partition file that contains all the clusters and associated taxids (if they exist)
${SCRIPTS}/cluster2part.pl [batch].fna.cluster [batch] [batch].centers.taxid > [batch].part
  • Using this partition, construct summary tables at various taxonomic levels
${SCRIPTS}/taxpart2summary.pl [batch].part ${MAIN}/454.csv [batch]

The following outputs will be created:

[batch].stats.txt - overall statistics for the data-set
[batch].otus.count.csv - table containing OTUs as rows, samples as columns, and entries represent
       # of sequences in each OTU/Sample pair
[batch].otus.percent.csv - same as "count" except that entries are percentages wrt total sequences
       in each sample
[batch].[tax].[otu|count|percent] - same as the "otus" file except at varying taxonomic levels.
       [tax] is one of "strain", "species", "genus", "family", "order", "class", "phylum"
       the "count" and "percent" entries are the same as for the "otus" files
       the "otu" entries contain number of OTUs assigned to the taxonomic group/Sample pair.

Database-based approach

Database information

  • To access the database through a command line use the command shown below with password "access":
mysql -u access -p -h cbcbmysql00 gems
  • A stub Perl script for playing with the database is provided at /fs/szasmg2/Gates_SOM/Main/454/DB/stub.pl
  • All users can read the data from the database using users/password combo "access"/"access".
  • Database schema:
  • The commands listed below assume you have write access to the database.

Step 0: Get the sequence information

  • From .SFF files (assuming these are 454 sequences)

This step uses the sff_extract program from the Staden package (if I'm not mistaken)

for i in *.sff ;do
  name=`expr $i : '\(.*\)\.sff'`
  sff_extract -c -s $name.seq -q $name.qual $i
done

Step 1: Cleanup meta-information

  • Convert the Excel sheet containing the batch information into tab-delimited file (and run dos2unix) making sure the quotes added by Excel/OOffice are removed, adding the date (if not already in the file), and sorting the file by Sample ID. At this stage also check that the header row information is in canonical format.
  • Add barcode information using add_barcode.pl
${SCRIPTDIR}/add_barcode.pl [batch].csv ${MAINDIR}IGS_Barcodes.csv > [batch]_barcode.csv

Step 2: Create partition file

First concatenate all the sequence files (if multiple) from a batch into a single file [batch].all.seq.

${SCRIPTDIR}/code2part.pl [batch].csv [batch].all.seq [batch] > [batch].part

The files [batch].BAD.list and [batch].NONE.list contain the names of all sequences that have failed quality checks either because they are too short, or contain Ns (BAD), or because the barcode is not recognized (NONE). The files contain additional information that can be used to troubleshoot the pipeline: sequences that are too short (< 75 454 cycles) are followed by the number of cycles, sequences that either contain Ns or have an unknown barcode are followed by the first 8 characters in the sequence.

Step 3: Break up the fasta file into separate batches by partition

  • Create partition directory
mkdir part
cd part
  • Partition main file into sub-parts
${SCRIPTS}/partitionFasta.pl ../[batch].all.seq ../[batch].part

The result is one fasta file per sub-partition (i.e. individual subject).

  • Remove barcodes

(still in the part/ subdirectory)

for i in *.seq; do
  ${SCRIPTS}/unbarcode.pl $i
done

The output files will have the same name as the original file but with the addition of the .nbc suffix. You should remove the .nbc files from the BAD/NONE files in order to prevent their addition to the pipeline downstream.

 rm *.BAD.nbc.fa *.NONE.nbc.fa 

Step 4: Upload file names

Run from 454 directory

${DB_SCRIPTS}/add_file_db.pl [batch]/[batch]_barcode.csv [batch]/part

Step 5: Combine all fasta files into a single one

Note: this assumes we're running the pipeline for the first time - a different protocol is necessary for adding new sequences to an already existing analysis

In the new file all sequences will be named <n>_<nn> where <n> is the value in the "File #" field in 454.csv and <nn> is the index of the sequence in the file.

mkdir Analysis/Run[date]
${DB_SCRIPTS}/combinefa_db.pl -c Analysis/Run[date]/Run[date] 454

The output will be in Analysis/Run[date]/Run[date].fna

Step 6: Run clustering tool

  • First generate clusters

Note: This part assumes we're running the whole set of sequences as one batch.

cd Analysis/Run[date]/Run[date].fna
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -i Run[date].fna > Run[date].fna.cluster
/fs/szasmg2/ghodsi/Src/clusterk/clusterk7 -r 2 -m -i Run[date].fna > Run[date].fna.align

Output will be in Run[date].fna.cluster, one cluster per line, cluster center listed as the first identifier.

The .align file contains aligned FASTA records for all the sequences in each cluster. Clusters are separated by #<number> where <number> is the number of sequences in the cluster.


  • Then extract the cluster centers

From here on the code runs the same in both full-run and batch modes

/fs/szasmg2/ghodsi/Src/clusterk/fastaselect Run[date].fna < Run[date].fna.cluster > Run[date].centers.fna

Step 7: Assign putative taxonomic labels to clusters

/fs/szasmg2/ghodsi/rdp/findtax/findtaxid.sh Run[date].centers.fna > Run[date].centers.taxid

Output is tab-delimited: sequence name <TAB> taxid Note: using "findtax" instead of "findtaxid" will retrieve actual taxonomy names.