Repeat search
= Mobile elements =
* plasmids
* bacteriophages:
** up to 20% of the genome
** most common transporters of virulence genes in bacteria
** have site specificity
* transposable elements
** up to 2 kbp
** no site specificity
= Tandem repeats =
* satellites (spanning megabases of DNA, associated with heterochromatin)
* minisatellites (repeat units in the range 6-100 bp, spanning hundreds of base pairs)
* microsatellites (repeat units in the range 1-5 bp, spanning a few tens of nucleotides)
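The microsatellite definition above (repeat units of 1-5 bp) can be illustrated with a small scanner. This is a minimal sketch, not how any of the tools listed on this page work; the function name <code>find_microsatellites</code> and the <code>min_copies</code> threshold are assumptions made for the example.

```python
import re


def find_microsatellites(seq, min_copies=4):
    """Return (start, end, unit) for tandem runs of a 1-5 bp unit.

    start/end are 0-based, end-exclusive. min_copies is an illustrative
    threshold (an assumption), not a value taken from the notes.
    """
    if min_copies < 2:
        raise ValueError("min_copies must be at least 2")
    # Lazy {1,5}? makes the regex prefer the shortest unit that repeats.
    pattern = re.compile(r"([ACGT]{1,5}?)\1{%d,}" % (min_copies - 1))
    return [(m.start(), m.end(), m.group(1))
            for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    print(find_microsatellites("GGCACACACACATT"))
```

Widening the unit range to 6-100 bp would give a rough minisatellite scan, but exact-match regular expressions miss degenerate copies, which is why dedicated tools such as TRF exist.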
= Insertion Elements (IS) =
* 0.7-2.5 kbp
* small, genetically compact (1-2 ORFs): transposase and/or reverse transcriptase
* end in short terminal inverted repeat sequences (IR), 10-40 bp
* [http://www-is.biotoul.fr/ ISFinder]
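The terminal inverted repeat property can be checked directly on a candidate element: the first k bases should match the reverse complement of the last k bases. The sketch below assumes the candidate has already been trimmed to its exact boundaries; the function names and the <code>max_mismatches</code> parameter are hypothetical and unrelated to ISFinder.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (non-ACGT symbols are kept)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(seq, min_len=10, max_len=40,
                                    max_mismatches=0):
    """Length of the longest terminal inverted repeat, or 0 if none.

    Tries k from max_len down to min_len (the 10-40 bp range from the
    notes) and compares seq[:k] with the reverse complement of seq[-k:].
    """
    seq = seq.upper()
    longest = min(max_len, len(seq) // 2)
    for k in range(longest, min_len - 1, -1):
        left = seq[:k]
        right_rc = reverse_complement(seq[-k:])
        mismatches = sum(a != b for a, b in zip(left, right_rc))
        if mismatches <= max_mismatches:
            return k
    return 0


if __name__ == "__main__":
    ir = "GGAAGGTGCGAA"
    element = ir + "ACGTACGTAC" + reverse_complement(ir)
    print(terminal_inverted_repeat_length(element))
```

Real IS ends are often imperfect, so a nonzero <code>max_mismatches</code> is usually needed in practice.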
= Software Packages =
* [http://mummer.sourceforge.net/manual/#repeat MUMmer repeat-match]:
** does not classify the repeats
** works fine for small genomes
 ~/bin/RepeatSearch.amos prefix
 ...
 10: $(BINPATH)/repeat-match -n $(REPEATLEN) $(PREFIX).fasta | $(SCRIPTPATH)/repeat-match2gff.pl > $(REPEATS).gff
 20: $(SCRIPTPATH)/extractfromfastagff.pl $(PREFIX).fasta $(REPEATS).gff > $(REPEATS).fasta
 30: $(BINPATH)/nucmer -maxmatch $(REPEATS).fasta $(REPEATS).fasta -p $(REPEATS)
 40: $(BINPATH)/show-coords -c -l -r -o -H $(REPEATS).delta | awk '{print $18,$19}' | ~/bin/cluster.pl > $(REPEATS).cluster
 50: $(SCRIPTPATH)/extractfromfastanames.pl -f $(REPEATS).cluster < $(REPEATS).fasta > $(REPEATS).cluster.fasta
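In step 40, fields 18 and 19 of the <code>show-coords -c -l -r -o -H</code> output appear to be the reference and query sequence names, so the input to <code>cluster.pl</code> is a list of name pairs that align to each other. How <code>cluster.pl</code> groups them is not shown here; the sketch below assumes simple single-linkage clustering (connected components) and uses a hypothetical function name.

```python
def cluster_pairs(pairs):
    """Group names into clusters, given pairs of names that align.

    Assumption: two repeats belong to the same cluster if they are
    linked by any chain of pairwise alignments (single linkage).
    """
    parent = {}

    def find(name):
        # Union-find lookup with path halving.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for ref, qry in pairs:
        root_ref, root_qry = find(ref), find(qry)
        if root_ref != root_qry:
            parent[root_qry] = root_ref

    clusters = {}
    for name in list(parent):
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r4"), ("r5", "r6")]
    for members in cluster_pairs(pairs):
        print(" ".join(members))
```

Self-alignments such as <code>("r4", "r4")</code> yield singleton clusters, which matters because <code>nucmer</code> is run here with the repeat set aligned against itself.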
** does not work on large genomes
 split the genome into smaller pieces (10 Mbp?)
 align them to one another:
 nucmer -l 20 -c 65 -g 90 -b 200 (default): takes too long
 nucmer -l 35 -c 2500 -g 100
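The splitting step can be sketched as follows. The 10 Mbp piece size is the tentative value from the note above; the function name and the <code>name:start-end</code> naming scheme are assumptions made for this example.

```python
def split_sequence(name, seq, piece_len=10_000_000):
    """Yield (piece_name, piece_seq) for consecutive pieces of seq.

    Piece names record 1-based inclusive coordinates so that hits on a
    piece can be mapped back to the original sequence.
    """
    if piece_len < 1:
        raise ValueError("piece_len must be positive")
    for start in range(0, len(seq), piece_len):
        end = min(start + piece_len, len(seq))
        yield f"{name}:{start + 1}-{end}", seq[start:end]


if __name__ == "__main__":
    for piece_name, piece in split_sequence("chr1", "ACGTACGTAC", piece_len=4):
        print(f">{piece_name}\n{piece}")
```

Pieces produced this way do not overlap, so a repeat copy that straddles a piece boundary is cut in two; overlapping pieces would avoid that at the cost of redundant alignments.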
* [http://www.repeatmasker.org/RepeatModeler.html RepeatModeler]
* [http://bix.ucsd.edu/repeatscout/ RepeatScout]
* [http://www.repeatmasker.org/ RepeatMasker]; [http://www.girinst.org/server/RepBase RepBase]: mostly eukaryotic genomes
 Library:
 $ ls /fs/szdevel/dpuiu/RepeatMasker/Libraries/RepeatMaskerLib.embl
 $ ~/bin/readseq.sh -f Fasta -o RepeatMaskerLib.fasta RepeatMaskerLib.embl
 $ infoseq RepeatMaskerLib.fasta | getSummary.pl -c 1 -t Len
 #     elem  min  max    mean  median  n50   sum
 Len   9055  4    35042  2205  890     4846  19966330
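The summary columns above can be reproduced from a list of sequence lengths. The sketch below is not <code>getSummary.pl</code>; it assumes the usual N50 definition (the length L such that sequences of length at least L cover half of the total), and the function name is hypothetical.

```python
from statistics import median


def length_summary(lengths):
    """Return elem, min, max, mean, median, n50 and sum for lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths given")
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        # First length at which the cumulative sum reaches half the total.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "elem": len(ordered),
        "min": ordered[-1],
        "max": ordered[0],
        "mean": total / len(ordered),
        "median": median(ordered),
        "n50": n50,
        "sum": total,
    }


if __name__ == "__main__":
    print(length_summary([2, 3, 4, 5, 6]))
```

In the library summary above, the N50 (4846) is far larger than the median (890), which indicates that a minority of long entries accounts for most of the library's bases.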
* [http://minisatellites.u-psud.fr/GPMS/ Microorganisms Tandem Repeats Database (Online,FR)]
* [http://crispr.u-psud.fr/Server/CRISPRfinder.php/ CRISPRfinder (Online,FR)]; [http://nar.oxfordjournals.org/cgi/content/full/gkm360v2 Article]
* [http://tandem.bu.edu/trf/trf.html TRF]
= Articles =
* [http://www.nature.com/nrmicro/journal/v3/n9/full/nrmicro1233.html Mobile DNA in obligate intracellular bacteria]
Latest revision as of 18:25, 18 February 2010