Our method for repeat protein analysis published

May 8, 2020

Self-analysis of repeat proteins reveals evolutionarily conserved patterns

Merski M, Młynarczyk K, Ludwiczak J, Skrzeczkowski J, Dunin-Horkawicz S, Górna MW.

BMC Bioinformatics 21, 179 (2020). https://doi.org/10.1186/s12859-020-3493-y


In our newest publication (PDF) in BMC Bioinformatics, we report a new method for the analysis and clustering of repeat proteins, based on their sequence self-similarity as visualized by the DOTTER tool.

Comparing these "dot plots" with a simple model (a Jaccard similarity score) allowed us to estimate the evolutionary distance between pairs of proteins. Analyzing relationships between repeat proteins is typically a non-trivial task because of sequence degeneracy and problems with sequence alignments. A comparison of known repeat and non-repeat proteins from the PDB suggested that the information content of dot plots could be used to identify repeat proteins from sequence alone, with no requirement for structural information. Analysis of the UniRef90 database suggested that 13.3 million proteins could be classified as repeat proteins. These putative repeat protein chains were clustered, and a large majority (82.9%) of the clusters containing between 5 and 200 members were of a single functional type.
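
To give a feel for the idea, here is a minimal Python sketch of comparing two self dot plots with a Jaccard score. It is an illustration under simplifying assumptions, not the published pipeline: the dot plot below marks only exact matches between short sequence windows (DOTTER itself uses scored, windowed comparisons), and the function names, the window size and the toy sequences are all hypothetical.

```python
def self_dot_plot(seq, window=3):
    """Return the set of (i, j) positions with i < j where the
    length-`window` substrings starting at i and j are identical.

    Simplifying assumption: exact identity stands in for the scored,
    windowed comparison that a tool such as DOTTER performs.
    """
    n = len(seq) - window + 1
    dots = set()
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; the plot is symmetric
            if seq[i:i + window] == seq[j:j + window]:
                dots.add((i, j))
    return dots


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets of dot positions."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty plots
    return len(a & b) / len(union)


# Toy sequences (hypothetical): two repeats with the same period but
# different residues, and a sequence with no internal repeats.
repeat_a = "ABCDE" * 4
repeat_b = "ABCDF" * 4
non_repeat = "ACDEFGHIKLMNPQRSTVWY"

plot_a = self_dot_plot(repeat_a)
plot_b = self_dot_plot(repeat_b)
plot_n = self_dot_plot(non_repeat)

print(jaccard(plot_a, plot_b))  # same repeat pattern despite different residues
print(jaccard(plot_a, plot_n))  # no shared pattern
```

In this toy case the two repeat sequences differ in their residues but share the same repeat period, so their dot plots coincide, while the non-repeat sequence produces an empty plot. This is the intuition behind comparing patterns of self-similarity rather than the sequences themselves.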

Identifying dot plot patterns can help to cluster related repeat proteins, discover parts of a protein that are essential for its structure and function, and reveal relationships between repeat proteins that may be remarkably difficult to detect by purely sequence-based analysis.

This work was the result of a collaboration between our laboratory (MM, KM, JS, MWG) and the Structural Bioinformatics Group (JL, SD-H). Our work was supported by funding from the National Science Centre, Poland (#2014/15/D/NZ1/00968) and the European Union’s Horizon 2020 research and innovation programme (Marie Skłodowska-Curie grant agreement #655075). Computational research was supported in part by PL-Grid Infrastructure (Poland).

The programs developed by our group are accessible in our software section.