Repeat search: Difference between revisions

From Cbcb
 


= Software Packages =
* [http://mummer.sourceforge.net/manual/#repeat MUMmer repeat-match]:
** does not classify the repeats
** works fine for small genomes
   ~/bin/RepeatSearch.amos prefix
   ...
   40: $(BINPATH)/show-coords  -c -l  -r -o -H $(REPEATS).delta  | awk '{print $18,$19}' | ~/bin/cluster.pl > $(REPEATS).cluster
   50: $(SCRIPTPATH)/extractfromfastanames.pl -f $(REPEATS).cluster < $(REPEATS).fasta > $(REPEATS).cluster.fasta
** does not work on large genomes
  split the genome into smaller pieces (10 Mbp?)
  align them to one another:
  nucmer -l 20 -c 65  -g 90  -b 200  (default): takes too long
  nucmer -l 35 -c 2500 -g 100 


* [http://www.repeatmasker.org/RepeatModeler.html RepeatModeler]

Latest revision as of 18:25, 18 February 2010

Mobile elements

  • plasmids
  • bacteriophages:
    • up to 20% of the genome
    • most common transporters of virulence genes in bacteria
    • have site specificity
  • transposable elements
    • up to 2 Kbp
    • no site specificity

Tandem repeats

  • satellites (spanning megabases of DNA, associated with heterochromatin)
  • minisatellites (repeat units in the range 6-100 bp, spanning hundreds of base-pairs)
  • microsatellites (repeat units in the range 1-5 bp, spanning a few tens of nucleotides)
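
The unit-length ranges above can be turned into a small lookup. This is only a sketch in Python: the function name `classify_tandem_repeat` is made up for illustration, and because the list describes satellites by their total span rather than by unit length, anything with a unit longer than 100 bp is reported as "unclassified" (an assumption, not a definition from this page).

```python
def classify_tandem_repeat(unit_len: int) -> str:
    """Classify a tandem repeat by its repeat-unit length in bp.

    Thresholds follow the list above: 1-5 bp is a microsatellite,
    6-100 bp is a minisatellite.
    """
    if unit_len < 1:
        raise ValueError("unit length must be at least 1 bp")
    if unit_len <= 5:
        return "microsatellite"
    if unit_len <= 100:
        return "minisatellite"
    # Assumption: satellites are described above by total span, not unit
    # length, so longer units are left unclassified here.
    return "unclassified"


if __name__ == "__main__":
    for unit in (2, 15, 171):
        print(unit, classify_tandem_repeat(unit))
```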

Insertion Elements (IS)

  • 0.7-2.5 Kbp
  • small, genetically compact (1-2 ORFs): transposase and/or reverse transcriptase
  • end in short terminal inverted repeat sequences (IR), 10-40 bp
  • ISfinder
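
As an illustration of the terminal inverted repeat property, the sketch below measures how far the start of a sequence matches the reverse complement of its end. It is a simplification in Python: real IS inverted repeats are often imperfect, while this version only accepts exact matches, and the function names and the 40 bp default cap (taken from the range above) are choices made for this example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(seq: str, max_len: int = 40) -> int:
    """Length of the longest exact terminal inverted repeat, capped at max_len.

    The 5' end of seq is compared base by base with the reverse complement
    of the 3' end; the count stops at the first mismatch.
    """
    seq = seq.upper()
    limit = min(max_len, len(seq) // 2)
    rc = reverse_complement(seq)  # rc[:n] is the reverse complement of seq[-n:]
    n = 0
    while n < limit and seq[n] == rc[n]:
        n += 1
    return n


if __name__ == "__main__":
    # Toy element: 6 bp inverted repeats flanking a 9 bp core.
    element = "GGCTTG" + "ATGATGATG" + "CAAGCC"
    print(terminal_inverted_repeat_length(element))
```

A tolerant version would score mismatches instead of stopping at the first one; that is left out here to keep the example short.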

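Software Packages

  • MUMmer repeat-match (http://mummer.sourceforge.net/manual/#repeat)
    • does not classify the repeats
    • works fine for small genomes
  ~/bin/RepeatSearch.amos prefix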
 ...
 10: $(BINPATH)/repeat-match -n $(REPEATLEN) $(PREFIX).fasta | $(SCRIPTPATH)/repeat-match2gff.pl > $(REPEATS).gff
 20: $(SCRIPTPATH)/extractfromfastagff.pl $(PREFIX).fasta $(REPEATS).gff > $(REPEATS).fasta
 30: $(BINPATH)/nucmer -maxmatch $(REPEATS).fasta $(REPEATS).fasta -p $(REPEATS)
 40: $(BINPATH)/show-coords  -c -l  -r -o -H $(REPEATS).delta  | awk '{print $18,$19}' | ~/bin/cluster.pl > $(REPEATS).cluster
 50: $(SCRIPTPATH)/extractfromfastanames.pl -f $(REPEATS).cluster < $(REPEATS).fasta > $(REPEATS).cluster.fasta
    • does not work on large genomes
 split the genome into smaller pieces (10 Mbp?)
 align them to one another:
 nucmer -l 20 -c 65   -g 90  -b 200  (default): takes too long
 nucmer -l 35 -c 2500 -g 100  
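
The splitting step could look roughly like the Python sketch below. The 10 Mbp piece size is the tentative value from the note above, and the `name:start-end` piece naming is a hypothetical convention for this example, not something the AMOS scripts define. Pieces do not overlap, so a repeat copy that straddles a piece boundary would be cut in two; overlapping pieces would be one way around that.

```python
from typing import Iterator, Tuple


def split_sequence(name: str, seq: str,
                   piece_len: int = 10_000_000) -> Iterator[Tuple[str, str]]:
    """Yield (piece_name, piece_sequence) for consecutive non-overlapping pieces.

    Piece names carry 1-based inclusive coordinates so that alignment hits
    can be mapped back to the original sequence.
    """
    if piece_len < 1:
        raise ValueError("piece_len must be positive")
    for start in range(0, len(seq), piece_len):
        end = min(start + piece_len, len(seq))
        yield f"{name}:{start + 1}-{end}", seq[start:end]


def write_fasta(pieces, handle, width: int = 60) -> None:
    """Write (name, sequence) pairs to an open text handle in FASTA format."""
    for name, seq in pieces:
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


if __name__ == "__main__":
    import sys
    # Tiny demonstration with a 4 bp piece size instead of 10 Mbp.
    write_fasta(split_sequence("chr1", "ACGTACGTAC", piece_len=4), sys.stdout)
```

Each resulting piece file can then be passed to nucmer pairwise with the faster settings listed above.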


  • RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html)

 Library:
   $ ls /fs/szdevel/dpuiu/RepeatMasker/Libraries/RepeatMaskerLib.embl 
 
   $ ~/bin/readseq.sh -f Fasta -o RepeatMaskerLib.fasta RepeatMaskerLib.embl
 
   $ infoseq RepeatMaskerLib.fasta | getSummary.pl -c 1 -t Len
             #elem   min     max     mean    median  n50     sum
     Len     9055    4       35042   2205    890     4846    19966330
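
For readers without `getSummary.pl`, the same columns can be approximated with a few lines of Python. This is a sketch under assumptions: the script itself is not shown here, so its exact N50 convention is unknown, and the version below uses the common definition (the length L such that sequences of length at least L contain at least half of the total bases).

```python
from statistics import mean, median
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Return N50: the largest length L such that sequences of length >= L
    account for at least half of the summed length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 of an empty list is undefined")


def length_summary(lengths: List[int]) -> Dict[str, float]:
    """Compute the columns shown above: #elem, min, max, mean, median, n50, sum."""
    return {
        "elem": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "n50": n50(lengths),
        "sum": sum(lengths),
    }


if __name__ == "__main__":
    print(length_summary([2, 3, 4, 5, 6]))
```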

Articles