Protein pKa Prediction with Machine Learning
We developed DeepKa, to our knowledge the first deep-learning-based protein pKa predictor. It is trained on data sets generated by continuous constant pH molecular dynamics (CpHMD) simulations of 279 soluble proteins, comprising 12809 pKa values for four residue types: Asp, Glu, His, and Lys. Notably, to avoid discontinuities at the boundary, grid charges are proposed to represent protein electrostatics. We show that the prediction accuracy of DeepKa is close to that of the CpHMD benchmarking simulations, validating DeepKa as an efficient protein pKa predictor. In addition, the training and validation sets created in this study can be used to develop future machine-learning-based protein pKa predictors. Finally, the grid charge representation is general and applicable to other problems, such as protein-ligand binding affinity prediction. (Data and code can be downloaded for free from https://gitlab.com/yandonghuang/deepka.)
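The abstract does not say how the grid charges are computed, so the following is only a minimal sketch of one way to obtain a boundary-continuous charge grid. It assumes trilinear weighting, in which each atomic partial charge is shared among the eight surrounding grid points; the function name `spread_charges`, the `spacing` parameter, and the input layout are hypothetical and are not taken from the DeepKa code.

```python
import math
from collections import defaultdict


def spread_charges(atoms, spacing=1.0):
    """Spread point charges onto a regular grid by trilinear weighting.

    Assumption: this weighting scheme is illustrative, not the DeepKa method.
    atoms: iterable of (x, y, z, q) tuples (coordinates and partial charge).
    Returns a dict mapping integer grid indices (i, j, k) to accumulated charge.
    """
    grid = defaultdict(float)
    for x, y, z, q in atoms:
        # Fractional grid coordinates and the lower corner of the enclosing cell.
        fx, fy, fz = x / spacing, y / spacing, z / spacing
        i0, j0, k0 = math.floor(fx), math.floor(fy), math.floor(fz)
        dx, dy, dz = fx - i0, fy - j0, fz - k0
        # Each of the 8 cell corners receives a weight that varies
        # continuously with the atom position and sums to 1 over the cell.
        for di, wx in ((0, 1.0 - dx), (1, dx)):
            for dj, wy in ((0, 1.0 - dy), (1, dy)):
                for dk, wz in ((0, 1.0 - dz), (1, dz)):
                    w = wx * wy * wz
                    if w > 0.0:
                        grid[(i0 + di, j0 + dj, k0 + dk)] += q * w
    return dict(grid)


# A charge of -1 halfway between two grid points is split equally.
print(spread_charges([(0.5, 0.0, 0.0, -1.0)]))
```

With this weighting the total charge on the grid equals the sum of the atomic charges, and a small displacement of an atom changes the grid values only slightly, whereas assigning each charge to its nearest voxel would produce a jump when the atom crosses a voxel boundary.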
Protein secondary structure prediction with a reductive deep learning method
Protein secondary structures have been identified as the links in the physical process by which primary sequences, typically random coils, fold into functional tertiary structures that enable proteins to take part in a variety of biological events. An efficient protein secondary structure predictor is therefore important, especially when the structure of an amino acid sequence fragment has not been solved by high-resolution experiments such as X-ray crystallography, cryo-electron microscopy, and nuclear magnetic resonance spectroscopy, which are usually time-consuming and expensive. In this paper, a reductive deep learning model, MLPRNN, is proposed to predict either 3-state or 8-state protein secondary structures. The prediction accuracy of MLPRNN on the publicly available CB513 benchmark data set is comparable to that of other state-of-the-art models. More importantly, given its reductive architecture, MLPRNN could serve as a baseline for future developments. Data and code can be downloaded for free from https://gitlab.com/yandonghuang/mlpbgru.
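To illustrate how the 3-state and 8-state labels relate, the sketch below maps DSSP-style 8-state strings to 3 states and scores a prediction by per-residue accuracy. It assumes the widely used reduction (H, G, I to helix; E, B to strand; everything else to coil); the abstract does not state which reduction MLPRNN uses, and the names `DSSP8_TO_3`, `reduce_to_3_state`, and `q_accuracy` are hypothetical.

```python
# Assumed reduction: helices (H, G, I) -> H, strands (E, B) -> E, rest -> C.
# "L", "-" and "C" are all accepted as spellings of the coil class.
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "L": "C", "-": "C", "C": "C",
}


def reduce_to_3_state(ss8):
    """Convert an 8-state secondary structure string to 3 states."""
    return "".join(DSSP8_TO_3[s] for s in ss8)


def q_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed one.

    Applied to 3-state strings this is usually called Q3, and to
    8-state strings Q8.
    """
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


print(reduce_to_3_state("HGIEBTSL"))  # HHHEECCC
print(q_accuracy("HHEC", "HHCC"))     # 0.75
```

Because several 8-state classes collapse into one 3-state class, a prediction can be wrong at the 8-state level yet correct after reduction, so 3-state accuracy on the same data is never lower than 8-state accuracy under this mapping.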