EULER-SR Readme.  Updated 12/4/08.

Contents.
  1.  Installing EULER-SR.
    1.1.  Required software/hardware.
		1.2.  Environment variables.
		1.3.  Building EULER-SR.
		1.4.  Running programs.
    1.5.  Sample data

  2.  Cleaning up data for assemblies.
	  2.1.  Translating data / quality filtering.
			2.1.1.  FASTQ -> Fasta (Sanger)
			2.1.2.  FASTQ -> Fasta (Illumina FASTQ)
			2.1.3.  sff -> Fasta (454 binary output)
    2.2.  Quality trimming.
    2.3.  Filtering bad Illumina reads.
    2.4.  Preparing 454 mate-paired reads.
		2.5.  Cleaning vector and masked sequences from SANGER
		      reads.
  3. Running assemblies.
	  3.1. Preparing your input file.
	  3.2. Running a default (unpaired) assembly.
		3.3. Running a paired-end assembly.
		3.4. Nonstandard assemblies.
		  3.4.1.  Detecting indel variations.
			3.4.2.  Fast pooling of reads. 

  4. Examining output and refining assemblies.
    4.1. Detecting fragmented assemblies.
    4.2. Joining fragmented assemblies.

  5.  Release notes.
	  5.1 Known issues.
	

1.  Installing EULER-SR.

The installation of EULER-SR has been made more straightforward from
the previous versions. You probably will only need to set one
environment variable.  The first step is to unpack euler-sr.tgz,
probably already done since you are reading this.

> tar zxvf euler-sr.tgz.

1.1. Required software.
  
  Software: You must have g++ and gmake installed on your system.  

  Hardware: EULER-SR is only tested on an x86_64 linux system.  It
  should run on the intel macs, and has limited support for running on
  a powerbook.

1.2.  Environment variables.

You need to set EUSRC to the full path to where euler-sr was unzipped,
and MACHTYPE to one of x86_64, i686, or powerpc:

If you are running bash:

> tar zxvf euler-sr.tgz
> cd euler-sr
> export  EUSRC=`pwd`
> export  MACHTYPE=x86_64

If you are running tcsh:

> tar zxvf euler-sr.tgz
> cd euler-sr
> setenv EUSRC  `pwd`
> setenv MACHTYPE x86_64


1.3. Build euler-sr.  

If all environment variables are set, this is a straightforward make.

>cd euler-sr
>gmake


1.4 Running programs.

	 When ran without arguments (or when there is an error on the
	 command line), all programs produce a brief help file.


1.5. Sample data.
  Provided with the assembly is a sample paired end reads file
  (reads.fasta), a rule file for this (readtitles.rules), and a sample
  file that contains reads from two haplotypes (reads.variants.fasta).

	To assemble the reads file:
	${EUSRC}/assembly/Assemble.pl reads.fasta 25 readtitle.rules.

	Instructions to extract the indel variants from the haplotypes file
	are given below.

	
2. CLEANING UP DATA

The assembly quality is strongly dependent on how clean the data is.
Quality values may be used to trim/remove low quality reads.  If you
are assembling Illumina reads, particularly those produced by the
first generation sequencer, some extra programs are provided to filter
reads that have errors particular to the system (discussed below).

2.1 Translating data.

  EULER-SR takes as input a single fasta file with ALL reads, an
  optional mate file that specifies mate-pairing (which may be
  generated), and a rule file that describes how to map a fasta title
  to a clone type.

	Utilities are provided to translate from:
	  .sff (the 454 binary output file).
		.fastq (Sanger format)
		.fastq (Illumina quality values).

2.1.1 FASTQ -> Fasta (Sanger)

  To translate fastq to fasta, use the quality trimmer, given reads.fastq

	${EUSRC}/assembly/x86_64/qualityTrimmer	-fasta reads.fastq -outFasta reads.fasta

	Options may be given to quality trimmer to tune the trimming, and
	are described in the output help.

2.1.2 FASTQ -> Fasta (Illumina)
	
  Illumina uses a different scaling of quality values than the normal
  phred scores.  Although qualityTrimmer attempts to guess the type
  (Sanger vs. Illumina), this may be forced on the command line:

	${EUSRC}/assembly/x86_64/qualityTrimmer	-fasta reads.fastq -outFasta reads.fasta -type illumina
	

2.1.3 sff -> fasta

  The 454 output files by default are in a binary format that contains
  both the fasta, quality, and flow values.  Newbler can take
  advantage of the flow values, but euler-sr does not.  To translate
  from sff to fasta, use sff2fasta

	${EUSRC}/assembly/x86_64/sff2fasta reads.sff reads.fasta

	Some diagnostic information will be printed regarding the parsing.
	If the parser is able to read the binary format, the diagnostic
	information will be sensical (below). If 454 has updated the sff
	format, and sff2fasta is out of date, the following sample
	diagnostic output will look like junk:

magic number: 779314790
version: 1
index offset: 535227192
index length: 6438410
number of reads: 321891
header length: 440
key length: 4
flowsPerRead: 400
format code: 1
flow chars: TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG

2.2  Quality filtering.
	
  If you have quality values, you can trim low quality sequences:

	${EUSRC}/assembly/${MACHTYPE}/qualityTrimmer -fasta reads.fasta	-qual reads.fasta.qual -outFasta reads.trimmed 

	The sensitivity of the trimmer may be tuned with -span S -minQual Q,
	where reads are trimmed so that the entire read is of min quality Q
	for each consecutive S nucleotides.
	

2.3. Filtering bad illumina reads:

	Illumina reads, particularly ones from the first generation machine,
	have particular output values.  An example bad read is:

>read1
ACGGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

  These may be filtered using filterIlluminaReads


${EUSRC}/assembly/${MACHTYPE}/filterIlluminaReads reads.fasta reads.filt.fasta

2.4. Preparing paired-end 454 reads.

 454 Paired-ends are prepared by linking ends of a clone with a
 linker, shearing the clones, pulling out the linked sequences
 (hopefully linked), then sequencing: READ1-linker-READ2.  

 To split these, you will need to know the sequence of the linker,
 here called linker.fasta

 ${EUSRC}/assembly/${MACHTYPE}/splitLinkedClones reads.454.fasta linker.fasta reads.454.fasta.split


 To print reads that are not mated to a separate file, use:

 ${EUSRC}/assembly/${MACHTYPE}/splitLinkedClones reads.454.fasta linker.fasta reads.454.fasta.split -singletons reads.454.fasta.singletons
 
 
2.5. Cleaning vector and masked sequences from SANGER reads.

		 The vector and masked sequences in Sanger reads can cause
	   problems for error correction and assembly.  These can be
	   filtered by piping them through two utilities

cat reads.fasta ${EUSRC}/assembly_utils/RemoveMaskedSequence.pl |
${EUSRC}/assembly_utils/RemoveVectorSequence.pl > reads.trimmed

3. Running assemblies.

  3.1  Preparing your input file.
     
		 If you have multiple input files, concatenate them into one file:

>cat file1.fasta file2.fasta file3.fasta > reads.fasta
  
 
  3.2. Running a default assembly.
  
  You can run a default assembly using:

  ${EUSRC}/assembly/Assemble.pl reads.fa 25


  3.3. Running a paired-end assembly.

	You will have to generate a "rule file" that describes the mates you
	are using.  This is a file with a regular expression in quotes,
	followed by two required keywords, CloneLength and CloneVar.

  An example line is:

"([^/]*)/([12])" CloneLength=200 CloneVar=50

	In general, the format of a pair of reads	should be:
  
	CLONE_NAME.DIR1
	CLONE_NAME.DIR2

	where CLONE_NAME is the unique name of each clone, and DIR1 and DIR2
	name what side from the clone the reads are from.

	The regular expressions follow the extended regular expression
	format, described at:
http://opengroup.org/onlinepubs/007908775/xbd/re.html#tag_007_004

  Each regular expression needs two sub matches, one for the clone
  name, and the other for the direction.  No more than two reads
  should have the same clone name.
	
  The CloneLength keyword is the expected gap from the end of the
  first read to the beginning of the second read, or the length of the
  DNA fragment from which both reads are sequenced - the sum of the
  two read lengths.

	The CloneVar keyword is expected window of variation of clone
	lengths.  The gap separating two reads in an assembly is estimated,
	and if it is greater than CloneLength + CloneVar, or less than
	CloneLength - CloneVar, the mate-pair is considered invalid (it may
	be a chimeric clone), and it is not used.

  
 Examples of mate-files are:


454 clones:
	
>000556_0922_1963.a
ACTGGCGAGAGCCCAGACGT...
>000556_0922_1963.b
GCCCGAGACCGGACTGGGAT...

The rule line for this is:

"([0-9]+_[0-9]+_[0-9]+)\.([ab])" CloneLength=2500 CloneVar=500 Type=6

Illumina clones:

>SLXA-EAS1_89:3:1:715:750/1
GTCTTGAAAGCTATGATGTCAAGATTAATTTAATC
>SLXA-EAS1_89:3:1:715:750/2
GTGTATTGCTCAATCTTCGAACGGGGGGAGGATTG

The rule line for this is:
"([^/]*)/([12])" CloneLength=200 CloneVar=50 Type=1      # Illumina mate-pair


	3.4. Nonstandard assemblies.
	
	3.4.1 Detecting variation.

	To detect variants, not all the steps to run an assembly need to be
	performed.  Since variant detection is still being developed, you
	have to manually run most assembly steps, however a final
	post-processing tool exists that prints likely variants.

	To detect variants using a file reads.fasta:
	1. Fix errors:
	
	${EUSRC}/assembly/FixErrors.pl reads.fasta 25

	2. Build the de bruijn graph:
	
	${EUSRC}/assembly/assemblesec.pl reads.fasta.fixed -vertexSize 25
	
	3. Remove some errors from the graph

	mkdir simple;
	${EUSRC}/assembly/${MACHTYPE}/simplifyGraph reads.fasta.fixed
	simple/reads.fasta -minEdgeLength 75 -removeLowCoverage 5 3

	4. Run post-processing to detect variants.

	${EUSRC}/assembly/${MACHTYPE}/printVariants simple/reads.fasta
	reads.variants.fasta

	This will produce a fasta file, reads.variants.fasta.  You can use
	the fasta titles to collect the variants.  The format is:

	>SHARED_EDGE INDEX (length, read support) VARIANT_INDEX (length,
   support) [... VARIANT_INDEX2 (length2, support2)] SHARED_EDGE2(length,support)
	
This outputs a variant and the flanking sequences so that the variant
may be easily aligned to a reference.  The numbers in the FASTA titles
are: EDGE_INDEX(LENGTH,SUPPORT), where EDGE_INDEX is the index of the
edge in the ".edge" file (the same as a contig, but edges exist for
both the forward and reverse orientation), the length of the edge (the
middle lengths represent the lengths of the variants), and finally the
support, the number of reads used to build the edge in the assembly.
Very low support indicates that the variant is probably erroneous.  In
the example below, there is an indel of length 6 in the assembly.  The
alignment below is the bl2seq alignment of the two variants.


>0(1855, 2957) 21(49, 44) 8(2274, 3692) 
AGAAAGGGATTTTAGTTTGTAATATCGCAGCAAGTCGATTGATTTTACCGTCTCCCAATGCATTCAAAGATAGTATTGTAAAAATCTCAGTTGGTGAAGAATATGATCAACACGCGTTTATCCATCAGTTAAAGGAAAATGGCTATCGAAAAGTTACTCAAGTACAAACTCAGGGCGAATTTAGTCTTCGAGGAGATATTATTTTTGAAATATCCCAGTTAGAACCTTGTCGAATTGAGTTTTTTGGTGATGAAATTGATGGTATCAGGTCATTTGAAGTAGAAACACAATTATCGAAAGAAAATAAGACAGAACTCACTATCTTTCCAGCTAGTGATATGCTTTTGAGAGAAAAGGATTATCAACGAGGACAGTCAGCTTTAGAAAAACAAATTTCAA
>0(1855, 2957) 22(55, 39) 8(2274, 3692) 
AGAAAGGGATTTTAGTTTGTAATATCGCAGCAAGTCGATTGATTTTACCGTCTCCCAATGCATTCAAAGATAGTATTGTAAAAATCTCAGTTGGTGAAGAATATGATCAACACGCGTTTATCCATCAGTTAAAGGAAAATGGCTATCGAAAAGTTACTCAAGTACAAACTCAGGGCGAATTTAGTCTTCGAGGAGATATTTTAGATATTTTTGAAATATCCCAGTTAGAACCTTGTCGAATTGAGTTTTTTGGTGATGAAATTGATGGTATCAGGTCATTTGAAGTAGAAACACAATTATCGAAAGAAAATAAGACAGAACTCACTATCTTTCCAGCTAGTGATATGCTTTTGAGAGAAAAGGATTATCAACGAGGACAGTCAGCTTTAGAAAAACAAATTTCAA

Query  1    AGAAAGGGATTTTAGTTTGTAATATCGCAGCAAGTCGATTGATTTTACCGTCTCCCAATG  60
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1    AGAAAGGGATTTTAGTTTGTAATATCGCAGCAAGTCGATTGATTTTACCGTCTCCCAATG  60

Query  61   CATTCAAAGATAGTATTGTAAAAATCTCAGTTGGTGAAGAATATGATCAACACGCGTTTA  120
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  61   CATTCAAAGATAGTATTGTAAAAATCTCAGTTGGTGAAGAATATGATCAACACGCGTTTA  120

Query  121  TCCATCAGTTAAAGGAAAATGGCTATCGAAAAGTTACTCAAGTACAAACTCAGGGCGAAT  180
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  121  TCCATCAGTTAAAGGAAAATGGCTATCGAAAAGTTACTCAAGTACAAACTCAGGGCGAAT  180

Query  181  TTAGTCTTCGAGGAGATAT------TATTTTTGAAATATCCCAGTTAGAACCTTGTCGAA  234
            |||||||||||||||||||      |||||||||||||||||||||||||||||||||||
Sbjct  181  TTAGTCTTCGAGGAGATATTTTAGATATTTTTGAAATATCCCAGTTAGAACCTTGTCGAA  240

Query  235  TTGAGTTTTTTGGTGATGAAATTGATGGTATCAGGTCATTTGAAGTAGAAACACAATTAT  294
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  241  TTGAGTTTTTTGGTGATGAAATTGATGGTATCAGGTCATTTGAAGTAGAAACACAATTAT  300

Query  295  CGAAAGAAAATAAGACAGAACTCACTATCTTTCCAGCTAGTGATATGCTTTTGAGAGAAA  354
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  301  CGAAAGAAAATAAGACAGAACTCACTATCTTTCCAGCTAGTGATATGCTTTTGAGAGAAA  360

Query  355  AGGATTATCAACGAGGACAGTCAGCTTTAGAAAAACAAATTTCAA  399
            |||||||||||||||||||||||||||||||||||||||||||||
Sbjct  361  AGGATTATCAACGAGGACAGTCAGCTTTAGAAAAACAAATTTCAA  405


   3.4.2  Fast pooling of reads.

	 If the error rate of the reads is low, and all that is required is
	 a clustering of reads that have shared k-mers, simply running
	 assemblesec is sufficient.

	 
	 ${EUSRC}/assembly/assemblesec.pl reads.fasta -vertexSize 25

	 You can increase specificity by increasing the vertex size,
	 although running with vertex sizes greater than 32 is VERY slow.

	 You will have to parse the file "reads.fasta.intv" file to
	 determine the reads that overlap.

	 The format of this file is:

	 EDGE E Length L Multiplicity M
	 INTV 1 0 50 135
	 INTV 8 0 50 140
	 INTV a b c  d
	 ...
	 EDGE ...
	 INTV ...

	 The edge E is an edge in the assembly, corresponding to either the
	 forward or reverse contig (edges for both are created).  Length is
	 the length of the edge, and M is the number of reads included in
	 the assembly of the edge.

	 each INTV corresponds to a portion of a read that maps to the
	 edge.  In INTV a b c d, "a" is the index of the read.  Odd numbered
	 indices are the reverse complement of a read included in the
	 data. So read 0 is a read in your input file, and read 1 is the
	 reverse complement of 0, 2 is a read, 3 is 2's reverse complement,
	 and so on.  "b" is the starting position in the read mapped to the
	 edge, "c" is the length of the portion of the read mapped to the
	 edge, and "d" is the position along the edge where the read maps.

	 When "c" is the length of the read, the full read maps to the
	 edge, when it is less, the read either maps to two edges, or part
	 of the read does not map to the assembly, usually when there are
	 errors in the read.


4. Examining output and refining assemblies.
	 Some assemblies may be improved by iteratively re-simplifying the
	 graph, or examined to determine if the data are insufficient for
	 producing a valid assembly.

    4.1. Detecting fragmented assemblies.

		If you have mate-pairs, the assembly will be in the
		'matetransformed' directory, otherwise the 'transformed'
		directory.  For examples here the 'transformed' directory is used.
		
		Print all 'dead-end' edges:
		 ${EUSRC}/assembly/${MACHTYPE}/printGraphSummary transformed/reads.fasta -sources

		This will print edge indices followed by their lengths. If there
		is a very high number of dead-end edges, the coverage is probably
		low causing a fragmented assembly.  Some sequencing platforms such
		as Illumina have a lot of "coverage deserts", such as in GC-rich
		regions, where the coverage is very low, causing fragmentation of
		the assembly.


    4.2. Joining fragmented assemblies.

		If there are a lot of dead end edges, you can try and fix them by
		joining them when the overlap is less than the vertex size, but
		greater than some minimal length, say 10.

		${EUSRC}/assembly/${MACHTYPE}/joinsas transformed/reads.fasta 10 transformed/reads.j
		${EUSRC}/assembly/${MACHTYPE}/printContigs transformed/reads.j
		cp transformed/reads.j.contig .


		This will look for unambiguous joins of edges that are at least 10
		exactly matching nucleotides.


5. Release notes.
  5.1 Known issues.
	   When using mate-pairs to resolve repeats, the sequence of the
  repeat will be a consensus, so there will likely be many mismatches
  in repeat regions.  The overall accuracy of the layout of the
  alignment is usually correct though.

	Support for running on a rocks cluster has been removed.  Since
	EULER-SR is 100X faster than the first release, running on a cluster
	is less important.  Still, a new release will use multiple cores for
	error correction.


  5.2 Notes.

	This release contains most new functionality described in a recently
	accepted paper: De novo fragment assembly with short mate-paired
	reads: does the read length matter?  The significant improvements
	over the previous release are:

  - Mate-paired assembly is used.
     - Multiple clone libraries are allowed.
  - Repeats are resolved using longer reads by default (previously
     this was just done as part of a "beta" assembly step).
  - Speed.  While there are some areas for improvement, the speed of
     the method is much faster than before.

		 If speed is still an issue, more improvements will be made,
		 however they will take lower priority to new functionality.

  - Easier installation.  Now only one (possibly two) environment
    variables need to be set to install and run EULER-SR.  Before,
    dynamic linking was used to save disk space with executables,
    however since most data sets are in the 10's of gigabytes now, a
    few megs of executables will probably fit anywhere.

  - Easier running. Previously different scripts were used to assemble
    illumina and 454 (or any non illumina) reads.  This is now merged
    into one script, although pre-processing is done by hand at an
    earlier step.  This is necessary for heterogeneous data sets.