ThermoScan

Scan biomedical publications to retrieve protein thermodynamic data.



Methods


ThermoScan is a semi-automatic method for retrieving protein thermodynamic data from the literature. For each article in the PubMed Central (PMC) subset, the method scans the full-text HTML page and selects all paragraphs that contain significant words related to protein thermodynamics (Turina et al. Frontiers in Molecular Biosciences, 2021). The significant words belong to the following four classes:

  • Thermodynamic Concepts (TC): Important words frequently mentioned in protein thermodynamic studies (two-state, unfolding, denaturant, midpoint, dichroism).

  • Thermodynamic Measures (TM): Words identified by a regular expression matching the abbreviations of the main thermodynamic measures (ΔG, ΔH, ΔTm, etc.).

  • Units of Measure (UM): Words identified by a regular expression matching the main units of measure used in thermodynamic experiments (kcal/mol, kJ/mol, etc.).

  • Computational Concepts (CC): Words referring to computational studies (simulation, molecular dynamics, force field, predict, etc.).

For each scanned article, ThermoScan calculates an empirical score based on the words from these four classes, returning the total score and the individual paragraph/table scores, which are calculated for each paragraph and table found in the paper. Items from the first three word classes receive a positive partial score, while items from the fourth class (Computational Concepts) receive a negative one.


Data Processing

ThermoScan processes the full-text article in HTML format using the BeautifulSoup parser. The HTML file is parsed and the text inside the paragraph (<p>) and table (<table>) tags is extracted. The text corresponding to each tag is searched for each of the Thermodynamic Concepts reported above. If one of the five words (two-state, unfolding, denaturant, midpoint, dichroism) is found, the significant terms are extracted using the following four regular expressions:

  •  TC: u'(?:\W|^)(two-state|unfolding|denaturant|midpoint|dichroism)'

  •  TM: u'(?:\W|^)((?:(?:\u2206|\u0394){1,2}(?:Cp|Tm|UG|GU|G|H|T))|(?:Cp|Tm)

  •  UM: u'(?:(?:(?:kcal|kj)(?:\/mole?(?:\/[\u00b0|\u00b4]C)?|[\s\*\.\u00b7\u22c5]?(?:mole?[\-\u2212]1)|\/M\/mol|\/\(mol\s[MK]\)|\s[MK][\-\u2212]1)?)|(?:[\u00b0|\u00b4]C))'

  •  CC: u'(?:\W|^)(md simulation|simulation|molecular dynamics|force field|charmm|gromacs|amber|PBSA|GBSA|predict)'
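The extraction step described above can be sketched as follows. This is a minimal illustration, not the ThermoScan implementation: it swaps BeautifulSoup for Python's standard-library html.parser so that the example has no dependencies, and the names BlockTextExtractor and candidate_blocks are hypothetical. It also assumes that <p> and <table> tags are explicitly closed and that the TC pattern is applied case-insensitively.

```python
import re
from html.parser import HTMLParser

# TC pattern copied from the list above; re.IGNORECASE is an assumption.
TC_RE = re.compile(
    r'(?:\W|^)(two-state|unfolding|denaturant|midpoint|dichroism)',
    re.IGNORECASE,
)


class BlockTextExtractor(HTMLParser):
    """Collect the text found inside top-level <p> and <table> elements."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # (tag, text) pairs in document order
        self._tag = None    # tag of the block currently being read
        self._depth = 0     # nesting depth of that tag (tables can nest)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._tag is None and tag in ("p", "table"):
            self._tag, self._depth, self._chunks = tag, 1, []
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._tag is None:
            return
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Join the pieces and normalise whitespace.
                text = " ".join("".join(self._chunks).split())
                self.blocks.append((self._tag, text))
                self._tag = None
        elif tag in ("td", "th"):
            # Keep neighbouring table cells from running together.
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)


def candidate_blocks(html):
    """Return the (tag, text) blocks that mention at least one TC word."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return [(tag, text) for tag, text in parser.blocks if TC_RE.search(text)]
```

Only the blocks returned by candidate_blocks would then be passed on to the TM, UM and CC patterns.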
         


Score Details

The score is calculated by summing all matches found, excluding repetitions, as follows:


   •  two-state = unfolding = denaturant = midpoint = dichroism = 1

   •  Cp = Tm = 1; ΔX = 2; ΔΔX = 3  (X = Cp, Tm, UG, GU, G, H, T, U).

   •  °C = 1; E/C = 2  (E = kcal, kJ; C = mol, mole, mole/°C, mol/°C, mol/K, mol/M)

   •  simulation = molecular dynamics = force field = charmm = gromacs = amber = PBSA = GBSA = predict = -1; md simulation = -2
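As a rough illustration of the weights listed above, the sketch below scores a single paragraph or table text. It is written under several assumptions and is not the ThermoScan code: the function name block_score is hypothetical, the TM pattern is the one shown in Data Processing with its capture group closed, case-insensitive matching is assumed for TC, UM and CC, and repetitions are removed by collecting matches into sets.

```python
import re

# TC, UM and CC patterns follow the Data Processing list; IGNORECASE is assumed.
TC_RE = re.compile(
    r'(?:\W|^)(two-state|unfolding|denaturant|midpoint|dichroism)', re.I)
# Assumption: the TM pattern with its outer capture group closed.
TM_RE = re.compile(
    r'(?:\W|^)((?:(?:\u2206|\u0394){1,2}(?:Cp|Tm|UG|GU|G|H|T))|(?:Cp|Tm))')
UM_RE = re.compile(
    r'(?:(?:(?:kcal|kj)(?:\/mole?(?:\/[\u00b0|\u00b4]C)?'
    r'|[\s\*\.\u00b7\u22c5]?(?:mole?[\-\u2212]1)|\/M\/mol'
    r'|\/\(mol\s[MK]\)|\s[MK][\-\u2212]1)?)|(?:[\u00b0|\u00b4]C))', re.I)
CC_RE = re.compile(
    r'(?:\W|^)(md simulation|simulation|molecular dynamics|force field'
    r'|charmm|gromacs|amber|PBSA|GBSA|predict)', re.I)


def block_score(text):
    """Score one paragraph/table text, counting each distinct match once."""
    concepts = {m.lower() for m in TC_RE.findall(text)}
    measures = set(TM_RE.findall(text))
    units = {m.lower() for m in UM_RE.findall(text)}
    computational = {m.lower() for m in CC_RE.findall(text)}

    score = len(concepts)                      # each TC word = 1
    for term in measures:
        deltas = sum(ch in "\u2206\u0394" for ch in term)
        score += 1 + deltas                    # Cp/Tm = 1, ΔX = 2, ΔΔX = 3
    for unit in units:
        # Energy units = 2, temperature (°C) = 1.
        score += 2 if unit.startswith(("kcal", "kj")) else 1
    for term in computational:
        score -= 2 if term == "md simulation" else 1
    return score
```

Applied to a sentence that mentions unfolding, midpoint, ΔG, ΔΔG, kcal/mol (twice) and °C, the function gives 2 + (2 + 3) + (2 + 1) = 10, since the repeated unit is counted once.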
          

The total score assigned to the article is obtained by summing all paragraph/table scores.
In addition, ThermoScan searches for thermodynamic data related to binding processes by considering the following terms: binding, affinity, dissociation, interaction, ppi, protein-protein, kcat/Km.
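A search for the binding-related terms could look like the sketch below. How ThermoScan uses these matches is not stated here, so the example only reports which terms occur in a text; the function name binding_terms, the word-boundary prefix and the case-insensitive matching are assumptions.

```python
import re

BINDING_TERMS = ("binding", "affinity", "dissociation", "interaction",
                 "ppi", "protein-protein", "kcat/Km")

# Same prefix convention as the other patterns; IGNORECASE is an assumption.
BINDING_RE = re.compile(
    r'(?:\W|^)(' + "|".join(re.escape(t) for t in BINDING_TERMS) + r')',
    re.IGNORECASE,
)


def binding_terms(text):
    """Return the distinct binding-related terms found in text, lowercased."""
    return sorted({m.lower() for m in BINDING_RE.findall(text)})
```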


Method Benchmarking

The score described above is then used to classify manuscripts according to whether they report protein thermodynamic data (positives) or not (negatives). For classification, we considered two alternative measures, corresponding to the maximum (Max) or to the average (Mean) paragraph/table score in each paper.
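The two per-paper measures can be illustrated with a short sketch. The function names, the handling of papers with no scored blocks, and the use of ">=" against the threshold are assumptions; the threshold itself is a free parameter here.

```python
from statistics import mean


def paper_measures(block_scores):
    """Summarise the paragraph/table scores of one paper."""
    if not block_scores:
        # Assumption: a paper with no scored blocks gets zero everywhere.
        return {"total": 0, "max": 0, "mean": 0.0}
    return {
        "total": sum(block_scores),
        "max": max(block_scores),
        "mean": mean(block_scores),
    }


def is_positive(block_scores, threshold, measure="max"):
    """Classify a paper as positive when the chosen measure reaches the threshold."""
    return paper_measures(block_scores)[measure] >= threshold
```

For block scores of 10, -5 and 4, the Max measure is 10 and the Mean measure is 3, so with a hypothetical threshold of 8 the paper would be positive under Max and negative under Mean.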
The performance of ThermoScan was tested using different sets of publications. We initially collected a set of positives by considering the Open Access PMC articles referenced in the ProTherm database. Different types of negative sets were selected from the PMC Open Access repository using different keywords. In detail, we generated the following sets:


Information about the datasets and the PMC articles is made available through this link. ThermoScan was optimized by maximizing its performance on a set composed of the 157 publications from ProTherm and an equal number of manuscripts randomly selected from the Not-PS and Not-PU datasets (Not-PS/Not-PU). We averaged the performances over 10 replicates of the negative subset (Not-PS/Not-PU) while considering different classification thresholds. The table below reports the performances that maximized the Matthews Correlation Coefficient (MCC) for both the Max and Mean scores. The performance scores (ACC, TPR, PPV, NPR, NPV, MCC and AUC) were calculated as defined in Wikipedia.
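Threshold selection by MCC can be sketched as below, using the standard definition MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function names are hypothetical, a paper is assumed to be predicted positive when its score is greater than or equal to the threshold, and the averaging over 10 negative replicates is left out for brevity.

```python
from math import sqrt


def confusion(scores, labels, threshold):
    """Count TP, FP, TN, FN when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, is_pos in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_pos:
            tp += 1
        elif predicted:
            fp += 1
        elif is_pos:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; 0.0 when the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(scores, labels, candidates):
    """Return the first candidate threshold with the highest MCC."""
    return max(candidates, key=lambda t: mcc(*confusion(scores, labels, t)))
```

Here scores would hold one value per paper (the Max or the Mean measure) and labels the corresponding positive/negative annotation.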


Based on the optimized thresholds reported in the table above, we tested ThermoScan by calculating its performance on two new manually curated datasets collected from more recently published papers (New-PSU, and the closely related Snew-PSU, obtained by removing from New-PSU 37 manuscripts that were manually classified as ‘difficult’ negative cases; see below). Finally, we considered the classification performance based on the maximum paragraph score and compared ThermoScan against BioReader (Rollins et al. BMC Bioinformatics, 2006) and MedlineRanker (Fontaine et al. NAR, 2009). The results of the comparison are reported below:


As evident from the data in the table above, ThermoScan performed better than both BioReader and MedlineRanker. This difference is most likely due to the fact that BioReader and MedlineRanker only use information extracted from the paper abstracts. The higher performance observed on the Snew-PSU dataset with respect to New-PSU is due to the manual removal of 38 difficult negative cases. Most of them were manually classified as difficult negatives because they report thermodynamic data related to protein binding rather than protein unfolding.


References

Turina P, Fariselli P, Capriotti E. (2021) ThermoScan: Semi-automatic identification of protein stability data from PubMed. Frontiers in Molecular Biosciences. DOI:10.3389/fmolb.2021.620475

Rollins DK, Zhai D, Joe AL, Guidarelli JW, Murarka A, Gonzalez R. (2006) A novel data mining method to identify assay-specific signatures in functional genomic studies. BMC Bioinformatics. 7:377. PMID:16907975

Fontaine JF, Barbosa-Silva A, Schaefer M, Huska MR, Muro EM, Andrade-Navarro MA. (2009) MedlineRanker: flexible ranking of biomedical literature. Nucleic Acids Res. 37(Web Server issue):W141-6. PMID:19429696