Software

This page will contain software developed by members of our group along with usage instructions.

ThDP repeats

LAST UPDATE: 2021/03/22

The files used for calculations can be found here.

Questions and comments: please contact m.merski[at]uw.edu.pl

Hydrogen-mediated interactions

LAST UPDATE: 2020/10/27

Here you can find the source code and other files used for our calculations of hydrogen-mediated interactions in proteins: HMI_code.zip

The supplementary file accompanying the publication can be downloaded here.

Questions and comments: please contact m.merski[at]uw.edu.pl


Slider

LAST UPDATE: 2020/03/04

General description

The Slider pipeline compares dot plots of repeat proteins.

The input is a single multi-sequence FASTA file. The first word of each sequence header is taken as the sequence identifier.
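The identifier rule can be illustrated with a short Python sketch. This is not the pipeline's own parser: the function name `read_fasta_ids` and the example headers are hypothetical, and the code only shows what "first word of the header" means in practice.

```python
def read_fasta_ids(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines.

    The identifier is the first whitespace-delimited word after '>'.
    Hypothetical helper for illustration; not part of the Slider sources.
    """
    seq_id = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Made-up example records
example = """>sp|P12345|TEST_HUMAN Example repeat protein
MKVLAAGIVG
LLLAQ
>seq2 another header
MSTNPKPQRK
""".splitlines()

records = list(read_fasta_ids(example))
for seq_id, seq in records:
    print(seq_id, len(seq))
```

Two headers that share the same first word would yield the same identifier, so it is worth checking that identifiers are unique before running the pipeline.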

The steps of the procedure are as follows:

  1. A modified dotter binary is run for every sequence. In this step the number of pixels per residue is calculated. The calculated value, along with other sequence-specific data, is written to an output text file. Each dotter run produces a binary file with dot plot data prepared for the sliding procedure. Job distribution and result parsing are handled by a Python script.
  2. dotc compiles the files generated in the previous step into a single binary input file.
  3. The sliding procedure is performed in a single step by slider. Note: for a large data set, more steps will be required due to memory constraints.
  4. sparsify generates a human-readable file containing graph edges. For a large data set, this program is also used to filter the data in order to make clustering feasible.

The resulting graph edges can then be clustered with separate software.

To learn more, please refer to the following paper: Merski M. et al. BMC Bioinformatics 21(1):179.

Software availability & testing

Docker container

The pipeline is provided as a Docker image based on the phusion/baseimage:latest-amd64 (Ubuntu 18.04) image. It works with Intel Westmere processors and newer. The entry point is a simple helper shell script that runs the pipeline stages. Note: clustering software is not included in the container.

How to use the container image

  1. Set up Docker on your Linux/Mac machine (please refer to the documentation of your OS or cloud provider, or to the Docker website; if you use CentOS 7, please use the repository provided by Docker).
  2. Download the latest Docker image of our pipeline: slider-pipeline.tar.bz2
    $ wget https://gorna.uw.edu.pl/download_file/view/334/303 -O slider-pipeline.tar.bz2
  3. Verify SHA256 sum
    $ sha256sum slider-pipeline.tar.bz2
    cc44a4ef12f25131e24bd7b197c6175d16736c935f1f8a6a31e5c14a8d4f378d slider-pipeline.tar.bz2
  4. Load the image
    $ docker load -i slider-pipeline.tar.bz2
  5. Verify that slider-pipeline:latest is present on the image list by running
    $ docker images
  6. Prepare a multi-sequence FASTA file. We recommend keeping it small, i.e. no more than 1000 sequences of 200-2000 amino acids each.
  7. Run the calculations. To share data with the container, the run command mounts <local_path> at /share/data inside the container (please note the colon between the local path and its mount point). The input file must be located in <local_path>, and all output produced by the pipeline is written to that location.
    $ docker run --rm -v <local_path>:/share/data -it slider-pipeline:latest <your_fasta_file>
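Before starting a run, the size recommendation from step 6 can be checked with a few lines of Python. The sketch below is only an illustration and is not part of the pipeline: the function names are hypothetical, and the limits are passed as parameters that default to the values recommended above.

```python
def sequence_lengths(lines):
    """Return {identifier: length} for an iterable of FASTA lines.

    The identifier is the first word of each header. If two headers
    share an identifier, the later record overwrites the earlier one.
    """
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def check_recommendations(lengths, max_sequences=1000, min_len=200, max_len=2000):
    """Return a list of messages for inputs outside the recommended size."""
    problems = []
    if len(lengths) > max_sequences:
        problems.append(
            f"{len(lengths)} sequences exceed the recommended maximum of {max_sequences}"
        )
    for seq_id, n in lengths.items():
        if not min_len <= n <= max_len:
            problems.append(f"{seq_id}: length {n} outside {min_len}-{max_len}")
    return problems


# Made-up input: one sequence that is too short, one of acceptable length
demo = [">short_one demo", "MKV", ">ok_one", "A" * 250]
problems = check_recommendations(sequence_lengths(demo))
for message in problems:
    print(message)
```

For a real file, pass an open file handle instead of the `demo` list, for example `sequence_lengths(open("test.fasta"))`. An empty result means the input is within the recommended size.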

For an input file named test.fasta, the final output file is test-abc-dump.txt. Each of its lines contains two sequence IDs and a Jaccard index value.
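A file of this kind can be read back for downstream analysis with a short Python sketch. It assumes that the three fields on each line are separated by whitespace, which should be verified against your own output file. The function names, the threshold parameter and the sample lines are hypothetical. The `jaccard` helper shows the textbook set definition of the index only; it does not reproduce how slider derives the value from dot plots.

```python
def jaccard(a, b):
    """Textbook Jaccard index of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def parse_edges(lines, min_jaccard=0.0):
    """Parse 'id_a id_b value' lines into (id_a, id_b, float) tuples.

    Assumes whitespace-separated fields; lines that do not have
    exactly three fields are skipped.
    """
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # blank or unexpected line
        id_a, id_b, value = fields
        score = float(value)
        if score >= min_jaccard:
            edges.append((id_a, id_b, score))
    return edges


# Made-up sample lines in the assumed format
sample = ["protA protB 0.75", "protA protC 0.10", ""]
strong = parse_edges(sample, min_jaccard=0.5)
print(strong)
```

For a real run, replace `sample` with an open handle to test-abc-dump.txt. Filtering on a minimum value is only one possible way to reduce the edge list before clustering.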


Source code

Source code (patch for seqtools, Python & C++ sources) is available here. Please check the SHA256 sum:
$ sha256sum slider-suite-sources-20200304.tar.bz2
ab2c8461d31849fd32cf8c424690ed0b1dc12223319b5cc77c4c4fca7cfa9a0e slider-suite-sources-20200304.tar.bz2
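Where the sha256sum utility is not available, the same check can be done with Python's standard hashlib module. The sketch below is one possible way to do it; the function names are hypothetical, and the expected digest is the one listed above for the source archive.

```python
import hashlib

# Digest published above for slider-suite-sources-20200304.tar.bz2
EXPECTED_SHA256 = "ab2c8461d31849fd32cf8c424690ed0b1dc12223319b5cc77c4c4fca7cfa9a0e"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected.strip().lower()
```

For example, `matches_expected("slider-suite-sources-20200304.tar.bz2")` should return True for an intact download. The same function can be used for the Docker image by passing its published digest as `expected`.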

You will also need this development version of seqtools. Below you will find rudimentary instructions on how to build the pipeline from scratch. They are intended for users with experience in building software from source.

  1. Patch, build and install seqtools. Please refer to the seqtools installation manual.
  2. Install dependencies for slider and Python scripts. The following list should suffice for an Ubuntu 18.04 installation:
    cmake curl libopenmpi-dev openmpi-bin openmpi-common python python-numpy python-psutil python-biopython g++
  3. Build and install the slider suite. It is built with CMake, a widely used build system.
  4. Run the pipeline in order: dotrun.py, then dotc, slider and sparsify. Each of them prints help with the -h option.


Sample data set

If you want to test our pipeline but do not have a multi-sequence FASTA file ready, we have prepared one containing 79 proteins, DOTTER_standard_set.fasta, along with the corresponding output file.

Questions and comments

Please contact m.merski[at]uw.edu.pl