DDGun

Untrained predictor of protein stability change upon mutation



Methods


DDGun is an untrained method for predicting the change in unfolding free energy upon mutation (ΔΔG). It is based on evolutionary information and predicts the unfolding ΔΔG for single and multiple variations. Predictions are obtained through a linear combination of scores derived from protein sequence and structural features. The following three scores are based purely on sequence data:

  • the difference between the wild-type and mutant residues in the BLOSUM62 substitution matrix (sBl);

  • the difference in the interaction energy (Skolnick statistical potential) between the wild-type and substituted residue with their sequence neighbours within a 2-residue window (sSk);

  • the difference in hydrophobicity between the wild-type and mutant residues according to the Kyte-Doolittle scale (sHp).
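As an illustration of the hydrophobicity term, the raw Kyte-Doolittle difference for a single substitution could be computed as in the sketch below. It is not taken from the DDGun code: the function name is hypothetical, the sign convention (mutant minus wild type) is an assumption, and the profile weighting applied by DDGun is not included.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_difference(wild_type: str, mutant: str) -> float:
    """Raw hydropathy difference for one substitution.

    Assumption: the sign is mutant minus wild type; the convention used
    by DDGun may be the opposite. No profile weighting is applied here.
    """
    return KYTE_DOOLITTLE[mutant] - KYTE_DOOLITTLE[wild_type]


# Example: isoleucine (hydrophobic) replaced by aspartate (charged).
delta = hydrophobicity_difference("I", "D")
```

A lookup table keeps the sketch free of external dependencies; an unknown residue letter raises a KeyError rather than silently returning a score.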

We also developed a structure-based version of DDGun (DDGun3D) by adding two structure-based terms to the input features. The first structural term represents the difference in the interaction energy (Bastolla statistical potential) between the wild-type and mutant residue with its structural neighbours (sBV). The second structural term is the relative solvent accessibility of the residue (ac), computed as the residue's accessibility divided by its maximum value. The first four scores (sBl, sSk, sHp and sBV) are linearly combined, while the accessibility is used to modulate the mutation effect: the total score is multiplied by (1 - ac). To better tune the predictions for fully accessible residues (ac = 1), which would otherwise always receive a zero score, the modulation factor was set to (1 - ac + ε), where ε was arbitrarily set to 0.1. The four scores described above were weighted through the profile built from the multiple sequence alignment of the protein and its homologues. The multiple sequence alignment is built using the hhblits program from the hh-suite, run on the UniRef30 database (uniclust30_2018_08). In detail, the weighted scores are calculated using the following equations:
   

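The accessibility modulation described in the text, multiplying the combined score by (1 - ac + ε) with ε = 0.1, can be sketched as follows. The function and argument names are hypothetical, and the combined score is taken as an input because the weights of the linear combination are not reproduced here.

```python
def modulate_by_accessibility(total_score: float, ac: float,
                              epsilon: float = 0.1) -> float:
    """Scale a combined score by the factor (1 - ac + epsilon).

    ac is the relative solvent accessibility, expected in [0, 1].
    epsilon = 0.1 is the value stated in the text; it keeps the factor
    non-zero for fully accessible residues (ac = 1).
    """
    if not 0.0 <= ac <= 1.0:
        raise ValueError("relative accessibility must lie in [0, 1]")
    return total_score * (1.0 - ac + epsilon)


# A buried residue (ac = 0) keeps slightly more than the full score,
# while a fully exposed one (ac = 1) keeps one tenth of it.
buried = modulate_by_accessibility(2.0, 0.0)
exposed = modulate_by_accessibility(2.0, 1.0)
```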
The scores described above are combined using different weights for the sequence-based and structure-based methods. The equations below report the coefficients of each score for the sequence-based (sseq) and structure-based (s3DA) methods:
   

This method can be adapted to predict the ΔΔG of multiple-site variations. For each multiple-site variation, we compute the score of each single-site variation it comprises. Given a multiple-site variation with multiplicity M (that is, composed of M single-site variations), let ss be the vector of the M single-site scores, ss = (s1, s2, …, sM). We compute the score for a multiple-site variant as:
   

For single-point variations, the following data sets were considered: the most commonly used S2648; the high-quality VariBench, which was integrated with the 605 manually curated variations selected in Broom et al. for a total of 1900 high-quality variations; and data sets of variations in the P53 protein and in myoglobin. The data set for multiple-site variations was derived from ProTherm. A total of 914 protein multiple-site variations, with the number of simultaneous variants ranging from 2 to 10, were collected. We called this set of multiple-site variations PTmul.

The performance of DDGun and DDGun3D was tested on different data sets by calculating the Pearson correlation coefficient (PCC) and the root mean square error (RMSE) between the experimental and predicted ΔΔG values. The table below summarizes the performance of DDGun and DDGun3D on different data sets of single-point protein variants, reporting the PCC and RMSE values for each of them.



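For readers who want to reproduce the two evaluation measures on their own predictions, the PCC and RMSE defined above can be computed as in this minimal sketch. The function names are hypothetical and the ΔΔG values in the usage example are made up for illustration; they are not DDGun results.

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def root_mean_square_error(x: Sequence[float], y: Sequence[float]) -> float:
    """Root mean square error between two equal-length series."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty series of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Made-up experimental and predicted values (kcal/mol), for illustration.
experimental = [-1.2, 0.3, -2.5, 0.8]
predicted = [-0.9, 0.1, -1.8, 0.2]
pcc = pearson_correlation(experimental, predicted)
rmse = root_mean_square_error(experimental, predicted)
```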
The performance of both methods is compared with that achieved by a selected group of algorithms. For this comparison we calculated the performance of all the methods on the Ssym data set, which collects structures of both wild-type and mutant proteins. The results of this comparison are reported in the following table.



A similar test was performed on a set of 914 multiple-site variants. In this case the mutant structure was obtained using Modeller. The results of this analysis are summarized in the following table.



All the predictions and the structures of the mutants generated for the PTmul data set are available on Zenodo.



Reference


Montanucci L, Capriotti E, Frank Y, Ben-Tal N, Fariselli P. (2019). DDGun: an untrained method for the prediction of protein stability changes upon single and multiple point variations. BMC Bioinformatics. 20 (Suppl 14): 335.



Standalone Package Installation


DDGun is available for download on GitHub. After cloning the scripts to your own machine, please execute the following instructions for the installation. The minimum requirements for the installation are git, wget, tar, cmake and biopython. The correct version of biopython should include the NCBIStandalone object imported from Bio.SearchIO._legacy.
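The Biopython requirement can be checked before installation with a short script such as the sketch below. It simply attempts the import named above; the function name is hypothetical and the check is not part of the DDGun package.

```python
def has_ncbi_standalone() -> bool:
    """Return True if the installed Biopython provides NCBIStandalone
    under Bio.SearchIO._legacy, as required by the text above."""
    try:
        from Bio.SearchIO._legacy import NCBIStandalone  # noqa: F401
    except ImportError:
        # Raised when Biopython is missing or the object is not available.
        return False
    return True


if __name__ == "__main__":
    if has_ncbi_standalone():
        print("Biopython provides Bio.SearchIO._legacy.NCBIStandalone")
    else:
        print("NCBIStandalone not found: install a suitable Biopython version")
```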


      Run:
        git clone https://github.com/biofold/ddgun
        cd ddgun
        python setup.py


        # Manual installation
        git clone https://github.com/biofold/ddgun

        cd ddgun/utils
        git clone https://github.com/soedinglab/hh-suite.git
        mkdir -p hh-suite/build && cd hh-suite/build
        cmake -DCMAKE_INSTALL_PREFIX=.. ..
        make -j 4 && make install

        cd ../../../data
        wget http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz

        tar -xzvf uniclust30_2018_08_hhsuite.tar.gz
        cd ../
      

The program can be tested by running the following commands:


      Structure-based prediction:
        ./ddgun_3d.py test/1aar.pdb A test/1aar.muts

      Sequence-based prediction:
        ./ddgun_seq.py test/1aar.pdb.A.fasta test/1aar.muts
      

DDGun can also be run with Docker.


      The full DDGun image, including uniclust30, can be downloaded using the following command:
         
         docker pull biofold/ddgun:full

      Warning: Before installing the image, make sure that your Docker environment
      has more than 25 GB of free disk space.
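The disk-space requirement can be verified before pulling the image, for example with the sketch below. The function name and the checked path are assumptions: Docker's storage location depends on the installation (on many Linux systems it is /var/lib/docker), so the path should be adjusted accordingly.

```python
import shutil


def has_free_space(path: str = ".", required_gb: float = 25.0) -> bool:
    """Return True if the filesystem holding `path` has more than
    `required_gb` gigabytes free (1 GB taken as 10**9 bytes)."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes > required_gb * 10**9


if __name__ == "__main__":
    # Assumption: "." is on the same filesystem as Docker's storage;
    # replace it with the actual Docker data directory if it differs.
    if has_free_space("."):
        print("More than 25 GB free")
    else:
        print("Not enough free space for biofold/ddgun:full")
```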