Kabir, A., Bhattarai, M., Peterson, S., Najman-Licht, Y., Rasmussen, K. Ø., Shehu, A., Bishop, A. R., Alexandrov, B., & Usheva, A. (2024). DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors. Nucleic Acids Research, gkae783. https://doi.org/10.1093/nar/gkae783
@article{kabirbhattarai2024epbdbert,
author = {Kabir, Anowarul and Bhattarai, Manish and Peterson, Selma and Najman-Licht, Yonatan and Rasmussen, Kim Ø and Shehu, Amarda and Bishop, Alan R and Alexandrov, Boian and Usheva, Anny},
title = {{DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors}},
journal = {Nucleic Acids Research},
pages = {gkae783},
year = {2024},
month = sep,
issn = {0305-1048},
doi = {10.1093/nar/gkae783},
url = {https://doi.org/10.1093/nar/gkae783},
eprint = {https://academic.oup.com/nar/advance-article-pdf/doi/10.1093/nar/gkae783/59112098/gkae783.pdf},
impact = {16.6}
}
It was previously shown that DNA breathing, thermodynamic stability, transcriptional activity, and transcription factor (TF) binding are functionally correlated. To ascertain the precise relationship between TF binding and DNA breathing, we developed the multi-modal deep learning model EPBDxDNABERT-2, which is based on the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA dynamics model. To train EPBDxDNABERT-2, we used chromatin immunoprecipitation sequencing (ChIP-Seq) data comprising 690 ChIP-Seq experiments encompassing 161 distinct TFs and 91 human cell types. EPBDxDNABERT-2 significantly improves the prediction of over 660 TF-DNA binding interactions, with an increase in the area under the receiver operating characteristic curve (AUROC) of up to 9.6% compared with a baseline model that does not leverage DNA biophysical properties. We expanded our analysis to an in vitro high-throughput Systematic Evolution of Ligands by Exponential Enrichment (HT-SELEX) dataset of 215 TFs from 27 families, comparing EPBD with established frameworks. Integrating the DNA breathing features with the DNABERT-2 foundational model greatly enhanced TF-binding predictions. Notably, EPBDxDNABERT-2, trained on large-scale multi-species genomes with a cross-attention mechanism, improved predictive power and sheds light on the mechanisms underlying disease-related non-coding variants discovered in genome-wide association studies.
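For a concrete picture of the fusion step, here is a minimal sketch of cross-attention between DNA language-model token embeddings and per-position breathing features. All dimensions, the feature set, and the pooling/classification head are illustrative assumptions, not the published EPBDxDNABERT-2 architecture.

import torch
import torch.nn as nn

class BreathingCrossAttention(nn.Module):
    """Sequence token embeddings attend over per-position biophysical features.

    Layer sizes are illustrative, not those of EPBDxDNABERT-2.
    """
    def __init__(self, d_model: int = 768, d_feat: int = 4, n_heads: int = 8):
        super().__init__()
        self.feat_proj = nn.Linear(d_feat, d_model)  # lift EPBD features to model width
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, 1)      # binary: bound vs. not bound

    def forward(self, seq_emb, breathing):
        # seq_emb:   (B, L, d_model) token embeddings from a DNA language model
        # breathing: (B, L, d_feat) e.g. opening, flipping, bubble probabilities
        feats = self.feat_proj(breathing)
        fused, _ = self.cross_attn(query=seq_emb, key=feats, value=feats)
        return self.classifier(fused.mean(dim=1)).squeeze(-1)  # one logit per sequence

model = BreathingCrossAttention()
logits = model(torch.randn(2, 200, 768), torch.rand(2, 200, 4))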
Kabir, A., Moldwin, A., Bromberg, Y., & Shehu, A. (2024). In the twilight zone of protein sequence homology: do protein language models learn protein structure? Bioinformatics Advances, 4(1), vbae119. https://doi.org/10.1093/bioadv/vbae119
@article{kabirshehu2023remhombioadv,
author = {Kabir, Anowarul and Moldwin, Asher and Bromberg, Yana and Shehu, Amarda},
title = {{In the twilight zone of protein sequence homology: do protein language models learn protein structure?}},
journal = {Bioinformatics Advances},
volume = {4},
number = {1},
pages = {vbae119},
year = {2024},
month = aug,
issn = {2635-0041},
doi = {10.1093/bioadv/vbae119},
url = {https://doi.org/10.1093/bioadv/vbae119},
eprint = {https://academic.oup.com/bioinformaticsadvances/article-pdf/4/1/vbae119/58914492/vbae119.pdf},
impact = {4.4}
}
Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether, and to what extent, sequence representations learned by protein language models encode structural information. We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the “twilight zone” of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. We believe this opens the way for further research on both remote homology prediction and the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.
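As a concrete illustration of the zero-shot protocol, the sketch below embeds two sequences with a small pretrained ESM-2 checkpoint via Hugging Face transformers and scores a candidate pair by cosine similarity of mean-pooled residue embeddings. The checkpoint, pooling choice, and toy sequences are assumptions for illustration, not the paper's exact setup.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
plm = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D").eval()

@torch.no_grad()
def embed(seq: str) -> torch.Tensor:
    # Mean-pool residue positions, dropping the BOS/EOS special tokens.
    hidden = plm(**tok(seq, return_tensors="pt")).last_hidden_state
    return hidden[0, 1:-1].mean(dim=0)

# Toy sequences; a real evaluation ranks all query-candidate pairs.
q = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
c = embed("MKSAYVAKQRQISFVKNHFSRQLEERLGLIEVQ")
print(torch.cosine_similarity(q, c, dim=0).item())  # higher = predicted homolog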
Bromberg, Y., Prabakaran, R., Kabir, A., & Shehu, A. (2024). Variant Effect Prediction in the Age of Machine Learning. Cold Spring Harbor Perspectives in Biology, 16(7), a041467. http://dx.doi.org/10.1101/cshperspect.a041467
@article{brombergshehu2024,
title = {Variant Effect Prediction in the Age of Machine Learning},
volume = {16},
issn = {1943-0264},
url = {http://dx.doi.org/10.1101/cshperspect.a041467},
doi = {10.1101/cshperspect.a041467},
number = {7},
journal = {Cold Spring Harbor Perspectives in Biology},
publisher = {Cold Spring Harbor Laboratory},
author = {Bromberg, Yana and Prabakaran, R. and Kabir, Anowarul and Shehu, Amarda},
year = {2024},
month = apr,
pages = {a041467},
impact = {6.9}
}
Over the years, many computational methods have been created to analyze the impact of single amino acid substitutions resulting from single-nucleotide variants in genome coding regions. Historically, all methods have been supervised and thus limited by the inadequate sizes of experimentally curated datasets and by the lack of a standardized definition of variant effect. The emergence of unsupervised, deep learning (DL)-based methods raised an important question: Can machines learn the language of life from unannotated protein sequence data well enough to identify significant errors in the protein “sentences”? Our analysis suggests that some unsupervised methods perform as well as, or better than, existing supervised methods. Unsupervised methods are also faster and can thus be useful in large-scale variant evaluations. For all other methods, however, performance varies by both the evaluation metric and the type of variant effect being predicted. We also note that evaluation of method performance is still lacking on less-studied, nonhuman proteins, where unsupervised methods hold the most promise.
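One common unsupervised scheme of this kind scores a substitution by the log-odds a masked protein language model assigns to the mutant versus the wildtype residue. Below is a minimal sketch assuming a small ESM-2 checkpoint through Hugging Face transformers; it illustrates the general approach, not any specific method reviewed here.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D").eval()

def variant_score(sequence: str, pos: int, wt: str, mut: str) -> float:
    """Log-odds of mutant vs. wildtype amino acid at a masked position (0-based)."""
    assert sequence[pos] == wt
    tokens = tokenizer(sequence, return_tensors="pt")
    tokens["input_ids"][0, pos + 1] = tokenizer.mask_token_id  # +1 skips BOS
    with torch.no_grad():
        logits = model(**tokens).logits
    log_probs = logits[0, pos + 1].log_softmax(dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(wt)
    mut_id = tokenizer.convert_tokens_to_ids(mut)
    return (log_probs[mut_id] - log_probs[wt_id]).item()

# More negative scores suggest the variant is less likely under the model,
# a common proxy for deleterious effect.
print(variant_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 3, "A", "P"))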
Kabir, A., Bhattarai, M., Rasmussen, K. Ø., Shehu, A., Usheva, A., Bishop, A. R., & Alexandrov, B. (2023). Examining DNA breathing with pyDNA-EPBD. Bioinformatics, 39(11), btad699. https://doi.org/10.1093/bioinformatics/btad699
@article{kabir2023pydnaepbd,
author = {Kabir, Anowarul and Bhattarai, Manish and Rasmussen, Kim Ø and Shehu, Amarda and Usheva, Anny and Bishop, Alan R and Alexandrov, Boian},
title = {{Examining DNA breathing with pyDNA-EPBD}},
journal = {Bioinformatics},
volume = {39},
number = {11},
pages = {btad699},
year = {2023},
month = nov,
issn = {1367-4811},
doi = {10.1093/bioinformatics/btad699},
url = {https://doi.org/10.1093/bioinformatics/btad699},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/11/btad699/53863029/btad699.pdf},
impact = {4.4}
}
The two strands of the DNA double helix locally and spontaneously separate and recombine in living cells due to inherent thermal DNA motion. These dynamics result in transient openings in the double helix and are referred to as “DNA breathing” or “DNA bubbles.” The propensity to form local transient openings is important in a wide range of biological processes, such as transcription, replication, and transcription factor binding. However, modeling and computer simulation of these phenomena have remained a challenge due to the complex interplay of numerous factors, such as temperature, salt content, DNA sequence, hydrogen bonding, base stacking, and others. We present pyDNA-EPBD, a parallel software implementation of the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA model that allows us to describe some features of DNA dynamics in detail. Using a Markov Chain Monte Carlo algorithm, pyDNA-EPBD generates genome-scale profiles of average base-pair openings, base-flipping probability, and DNA bubble probability, and calculates the characteristic dynamic length, indicating the number of base pairs statistically significantly affected by a single point mutation. pyDNA-EPBD is supported across most operating systems and is freely available at https://github.com/lanl/pyDNA_EPBD. Extensive documentation can be found at https://lanl.github.io/pyDNA_EPBD/.
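To convey the flavor of the simulation, the sketch below runs Metropolis Monte Carlo over base-pair openings under the classic homogeneous Peyrard-Bishop-Dauxois energy (a Morse on-site potential plus anharmonic stacking). The parameter values are illustrative textbook-style choices; this is not the pyDNA-EPBD code, which implements the extended, sequence-specific EPBD model.

import numpy as np

# Illustrative homogeneous PBD parameters (not pyDNA-EPBD's).
D, a = 0.04, 4.45                 # Morse depth (eV) and inverse width (1/A)
k, rho, alpha = 0.04, 0.5, 0.35   # stacking constant, anharmonicity, range
kT = 0.026                        # ~300 K, in eV

def energy(y: np.ndarray) -> float:
    """PBD energy of a base-pair displacement profile y (open chain)."""
    morse = D * (np.exp(-a * y) - 1.0) ** 2
    dy = np.diff(y)
    stack = 0.5 * k * (1 + rho * np.exp(-alpha * (y[1:] + y[:-1]))) * dy ** 2
    return morse.sum() + stack.sum()

def metropolis(y, n_steps=100_000, step=0.05, rng=np.random.default_rng(0)):
    """Single-site Metropolis updates; returns mean displacement per base pair."""
    e, mean_y = energy(y), np.zeros_like(y)
    for _ in range(n_steps):
        i = rng.integers(len(y))
        y_new = y.copy()
        y_new[i] += rng.normal(0.0, step)
        e_new = energy(y_new)
        if e_new < e or rng.random() < np.exp((e - e_new) / kT):
            y, e = y_new, e_new
        mean_y += y
    return mean_y / n_steps

profile = metropolis(np.zeros(60))  # average openings for a 60-bp homopolymer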
Kabir, A., & Shehu, A. (2022). GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction. Biomolecules, 12(11). https://www.mdpi.com/2218-273X/12/11/1709
@article{kabirshehu2022goproformer,
author = {Kabir, Anowarul and Shehu, Amarda},
title = {GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction},
journal = {Biomolecules},
volume = {12},
year = {2022},
number = {11},
article-number = {1709},
url = {https://www.mdpi.com/2218-273X/12/11/1709},
pubmedid = {36421723},
issn = {2218-273X},
doi = {10.3390/biom12111709},
impact = {5.8}
}
Protein Language Models (PLMs) have been shown to be capable of learning sequence representations useful for various prediction tasks, including subcellular localization, evolutionary relationships, family membership, and more. They have yet to be demonstrated useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open. This paper makes two key contributions. It debuts a novel method that leverages the transformer architecture in two ways: a sequence transformer encodes protein sequences in a task-agnostic feature space, and a graph transformer learns a representation of GO terms while respecting their hierarchical relationships. The learned sequence and GO-term representations are combined and utilized for multi-label classification, with the labels corresponding to GO terms. The method is shown to be superior to recent representative GO prediction methods. The second major contribution of this paper is a deep investigation of different ways of constructing training and testing datasets. The paper shows that existing approaches under- or over-estimate the generalization power of a model. A novel approach is proposed to address these issues, resulting in a new benchmark dataset to rigorously evaluate and compare methods and advance the state of the art.
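The two-transformer design can be sketched compactly: a sequence encoder yields a protein vector, a learned embedding per GO term stands in for the graph transformer over the hierarchy, and a dot product produces one logit per term. The sizes, pooling, and scoring rule are illustrative assumptions, not the published GOProFormer.

import torch
import torch.nn as nn

class SeqGOClassifier(nn.Module):
    """Scores a protein against every GO term; sizes are illustrative."""
    def __init__(self, n_terms: int, d: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.seq_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Stand-in for a graph transformer over the GO hierarchy:
        # one learned embedding per GO term.
        self.term_emb = nn.Embedding(n_terms, d)

    def forward(self, residues):                          # (B, L, d) embedded residues
        protein = self.seq_encoder(residues).mean(dim=1)  # (B, d) protein vector
        return protein @ self.term_emb.weight.T           # (B, n_terms) logits

# Multi-label training pairs these logits with nn.BCEWithLogitsLoss.
model = SeqGOClassifier(n_terms=500)
scores = model(torch.randn(4, 128, 256))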
Peer-Reviewed Conference Proceedings
Kabir, A., Moldwin, A., & Shehu, A. (2023). A Comparative Analysis of Transformer-based Protein Language Models for Remote Homology Prediction. Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. https://doi.org/10.1145/3584371.3612942
@inproceedings{kabirshehu2023remhomcsbw,
author = {Kabir, Anowarul and Moldwin, Asher and Shehu, Amarda},
title = {A Comparative Analysis of Transformer-based Protein Language Models for Remote Homology Prediction},
year = {2023},
isbn = {9798400701269},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3584371.3612942},
doi = {10.1145/3584371.3612942},
booktitle = {Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics},
articleno = {97},
numpages = {9},
keywords = {large language model, transformer, remote homology},
location = {Houston, TX, USA},
series = {BCB '23}
}
Protein language models based on the transformer architecture are increasingly shown to learn rich representations from protein sequences that improve performance on a variety of downstream protein prediction tasks. These tasks encompass a wide range of predictions, including secondary structure, subcellular localization, evolutionary relationships within protein families, and superfamily and family membership. There is recent evidence that such models also implicitly learn structural information. In this paper we put this to the test on a hallmark problem in computational biology, remote homology prediction. We employ a rigorous setting in which, by progressively lowering sequence identity, we clarify whether the problem of remote homology prediction has been solved. Among various interesting findings, we report that current state-of-the-art large models still underperform in the "twilight zone" of very low sequence identity.
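The protocol hinges on pairwise sequence identity. A minimal helper over a precomputed alignment is shown below for concreteness; real pipelines compute alignments with a dedicated aligner, and the 25% threshold and toy pairs here are illustrative.

def percent_identity(aln_a: str, aln_b: str) -> float:
    """Identity over aligned columns, ignoring columns where both are gaps."""
    matches = cols = 0
    for x, y in zip(aln_a, aln_b):
        if x == "-" and y == "-":
            continue
        cols += 1
        matches += (x == y and x != "-")
    return 100.0 * matches / cols

# Twilight-zone filtering: keep only test pairs below a chosen threshold.
pairs = [("MKT-AYIAK", "MKSWAYLAK"), ("MKTAYIAKQ", "MKTAYIAKQ")]
twilight = [(a, b) for a, b in pairs if percent_identity(a, b) < 25.0]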
Kabir, A., Inan, T., & Shehu, A. (2022). Analysis of AlphaFold2 for Modeling Structures of Wildtype and Variant Protein Sequences. In H. Al-Mubaid, T. Aldwairi, & O. Eulenstein (Eds.), Proceedings of 14th International Conference on Bioinformatics and Computational Biology (Vol. 83, pp. 53–65). EasyChair.
@inproceedings{kabirshehu2022af2mutanalysis,
author = {Kabir, Anowarul and Inan, Toki and Shehu, Amarda},
title = {Analysis of AlphaFold2 for Modeling Structures of Wildtype and Variant Protein Sequences},
booktitle = {Proceedings of 14th International Conference on Bioinformatics and Computational Biology},
editor = {Al-Mubaid, Hisham and Aldwairi, Tamer and Eulenstein, Oliver},
series = {EPiC Series in Computing},
volume = {83},
publisher = {EasyChair},
bibsource = {EasyChair, https://easychair.org},
issn = {2398-7340},
doi = {10.29007/5g4v},
pages = {53-65},
year = {2022}
}
ResNet and, more recently, AlphaFold2 have demonstrated that deep neural networks can now predict the tertiary structure of a given protein amino-acid sequence with high accuracy. This seminal development will allow molecular biology researchers to advance various studies linking sequence, structure, and function. Many studies will undoubtedly focus on the impact of sequence mutations on stability, fold, and function. In this paper, we evaluate the ability of AlphaFold2 to predict accurate tertiary structures of wildtype and mutated sequences of protein molecules. We do so on a benchmark dataset used in mutation modeling studies. Our empirical evaluation utilizes global and local structure analyses and yields several interesting observations. It shows, for instance, that AlphaFold2 performs similarly on wildtype and variant sequences, and that the placement of the main chain of a protein molecule is highly accurate. However, while AlphaFold2 reports similar confidence in its predictions over wildtype and variant sequences, its placement of side chains suffers in comparison to its main-chain predictions. The analysis overall supports the premise that AlphaFold2-predicted structures can be utilized in further downstream tasks, but that further refinement of these structures may be necessary.
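Global structure comparisons of this kind typically reduce to rigid superposition followed by RMSD. A minimal Kabsch-alignment RMSD is sketched below as a reference point; the paper's exact metrics and tooling may differ.

import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between Nx3 coordinate sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))   # guard against reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))

# E.g., compare predicted vs. experimental C-alpha traces (random stand-ins here).
print(kabsch_rmsd(np.random.rand(100, 3), np.random.rand(100, 3)))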
Kabir, A., & Shehu, A. (2022). Sequence-Structure Embeddings via Protein Language Models Improve on Prediction Tasks. 2022 IEEE International Conference on Knowledge Graph (ICKG), 105–112.
@inproceedings{kabirshehu2022protoformer,
author = {Kabir, Anowarul and Shehu, Amarda},
booktitle = {2022 IEEE International Conference on Knowledge Graph (ICKG)},
title = {Sequence-Structure Embeddings via Protein Language Models Improve on Prediction Tasks},
year = {2022},
pages = {105-112},
keywords = {Location awareness;Soft sensors;Semantics;Training data;Predictive models;Transformers;Protein sequence;Protein language model;Transformer;Sequence structure transformer;Protein function;superfamily},
doi = {10.1109/ICKG55886.2022.00021}
}
Building on the transformer architecture and its revolutionizing of language models in natural language processing, protein language models (PLMs) are emerging as a powerful tool for learning over the large numbers of sequences in protein databases and for linking protein sequence to function. PLMs are shown to learn useful, task-agnostic sequence representations that allow predicting protein secondary structure, protein subcellular localization, and evolutionary relationships within protein families. However, existing models are trained strictly on protein sequences and miss an opportunity to leverage and integrate the information present in heterogeneous data sources. In this paper, inspired by the intrinsic role of three-dimensional/tertiary protein structure in determining a broad range of protein properties, we propose a PLM that integrates and attends to both protein sequence and tertiary structure. In particular, this paper posits that learning joint sequence-structure representations yields better representations for function-related prediction tasks. A detailed experimental evaluation shows that such joint sequence-structure representations are more powerful than sequence-based representations alone, yield better performance on superfamily membership across various metrics, and capture interesting relationships in the PLM-learned embedding space.
Du, Y., Kabir, A., Zhao, L., & Shehu, A. (2020). From Interatomic Distances to Protein Tertiary Structures with a Deep Convolutional Neural Network. Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. https://doi.org/10.1145/3388440.3414699
@inproceedings{dushehu2020protstruct,
author = {Du, Yuanqi and Kabir, Anowarul and Zhao, Liang and Shehu, Amarda},
title = {From Interatomic Distances to Protein Tertiary Structures with a Deep Convolutional Neural Network},
year = {2020},
isbn = {9781450379649},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3388440.3414699},
doi = {10.1145/3388440.3414699},
booktitle = {Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics},
articleno = {101},
numpages = {8},
keywords = {coordinate reconstruction, deep learning, protein modeling, tertiary structure},
location = {Virtual Event, USA},
series = {BCB '20}
}
Elucidating biologically active protein structures remains a daunting task in both the wet and the dry laboratory, and many proteins lack structural characterization. This lack of knowledge continues to motivate the development of computational methods for protein structure prediction. Methods are diverse in their approaches, and recent efforts have debuted deep learning-based methods for various sub-problems within the larger problem of protein structure prediction. In this paper, we focus on one such sub-problem: the reconstruction of three-dimensional structures consistent with given inter-atomic distances. Inspired by a recent architecture put forward in the larger context of generative frameworks, we design and evaluate a deep convolutional network model on experimentally and computationally obtained tertiary structures. Comparison with convex and stochastic optimization-based methods shows that the deep model is faster and similarly or more accurate, opening up several avenues of further research to advance the larger problem of protein structure prediction.
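For context on this sub-problem, the classical non-learned route from a complete distance matrix to coordinates is multidimensional scaling, sketched below. The paper's convolutional model learns the same mapping; this baseline also assumes exact, noise-free distances.

import numpy as np

def mds_coordinates(D: np.ndarray, dim: int = 3) -> np.ndarray:
    """Recover coordinates (up to rigid motion) from an NxN distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    G = -0.5 * J @ (D ** 2) @ J           # Gram matrix of centered coordinates
    w, V = np.linalg.eigh(G)
    idx = np.argsort(w)[::-1][:dim]       # top-`dim` eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Round trip on random 3D points: distances in, coordinates out
# (recovered up to rotation, translation, and reflection).
X = np.random.rand(50, 3)
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
X_rec = mds_coordinates(D)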
Khan, T. S., Kabir, A., Pfoser, D., & Züfle, A. (2019). CrowdZIP: A System to Improve Reverse ZIP Code Geocoding using Spatial and Crowdsourced Data (Demo Paper). Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 588–591. https://doi.org/10.1145/3347146.3359362
@inproceedings{khanandreas2019crowdzip,
author = {Khan, Tunaggina Subrina and Kabir, Anowarul and Pfoser, Dieter and Z\"{u}fle, Andreas},
title = {CrowdZIP: A System to Improve Reverse ZIP Code Geocoding using Spatial and Crowdsourced Data (Demo Paper)},
year = {2019},
isbn = {9781450369091},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3347146.3359362},
doi = {10.1145/3347146.3359362},
booktitle = {Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems},
pages = {588--591},
numpages = {4},
keywords = {ZIP Codes, ZIP Code Classification, Reverse Geocoding, Microblog Data, Location Based Services, Geocoding},
location = {Chicago, IL, USA},
series = {SIGSPATIAL '19},
impact = {4.6}
}
Zone Improvement Plan (ZIP) Codes provide a sub-division of space. Interestingly, the ZIP code area polygons from different data sources do not match, resulting in uncertainty for a range of services that rely on such data. This paper presents a system that employs traditional classification methods to map a given spatial coordinate to a distribution of ZIP codes, using various publicly available ZIP-code maps as predictors and the (not publicly available) United States Postal Service (USPS) map as an authoritative ground truth. We show that large sets of microblog data, from which we extract potential ZIP codes, can significantly improve classification accuracy despite the noise in such data. The demonstrator allows users to select locations on a map of Orlando, FL, view the resulting distribution of ZIP codes predicted for each location, compare the results to the ground truth, and view the microblogs that enriched the result. A focus is on showing that the signal present in large, noisy, and 99.99% unrelated microblog data can indeed be used to improve reverse ZIP code geocoding.
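The classification setup can be sketched with off-the-shelf tools: per-location features from several public ZIP-code maps plus microblog-derived mention counts, with the USPS assignment as the label. The data below are synthetic stand-ins, and the model choice is an assumption rather than necessarily CrowdZIP's.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, n_zips = 1000, 5
X = np.column_stack([
    rng.integers(0, n_zips, size=(n, 3)),   # ZIP predicted by 3 public maps
    rng.poisson(2.0, size=(n, n_zips)),     # microblog mentions per candidate ZIP
])
y = rng.integers(0, n_zips, size=n)         # authoritative USPS ZIP (ground truth)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict_proba(X[:1]))             # a distribution over ZIP codes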
Book Chapters
Kabir, A., & Shehu, A. (2022). Graph Neural Networks in Predicting Protein Function and Interactions. In L. Wu, P. Cui, J. Pei, & L. Zhao (Eds.), Graph Neural Networks: Foundations, Frontiers, and Applications (pp. 541–556). Springer Nature Singapore. https://doi.org/10.1007/978-981-16-6054-2_25
@inbook{Kabirshehu2022gnnbookchapter,
author = {Kabir, Anowarul and Shehu, Amarda},
editor = {Wu, Lingfei and Cui, Peng and Pei, Jian and Zhao, Liang},
title = {Graph Neural Networks in Predicting Protein Function and Interactions},
booktitle = {Graph Neural Networks: Foundations, Frontiers, and Applications},
year = {2022},
publisher = {Springer Nature Singapore},
address = {Singapore},
pages = {541--556},
isbn = {978-981-16-6054-2},
doi = {10.1007/978-981-16-6054-2_25},
url = {https://doi.org/10.1007/978-981-16-6054-2_25}
}
Graph Neural Networks (GNNs) are becoming increasingly popular and powerful tools in molecular modeling research due to their ability to operate over non-Euclidean data, such as graphs. Because they can embed a graph's inherent structure while preserving its semantic information, GNNs are advancing diverse molecular structure-function studies. In this chapter, we focus on GNN-aided studies that bring together one or more protein-centric sources of data with the goal of elucidating protein function. We provide a short survey of GNNs and their most successful, recent variants designed to tackle the related problems of predicting the biological function and molecular interactions of protein molecules. We review the latest methodological advances and discoveries, as well as open challenges promising to spur further research.
Preprints
Kabir, A., & Shehu, A. (2022). Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks. https://arxiv.org/abs/2206.11057
@unpublished{kabirshehu2022protoformerpreprint,
title = {Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks},
author = {Kabir, Anowarul and Shehu, Amarda},
year = {2022},
eprint = {2206.11057},
archiveprefix = {arXiv},
primaryclass = {cs.LG},
url = {https://arxiv.org/abs/2206.11057}
}
The increasing number of protein sequences decoded from genomes is opening up new avenues of research on linking protein sequence to function with transformer neural networks. Recent research has shown that the number of known protein sequences supports learning useful, task-agnostic sequence representations via transformers. In this paper, we posit that learning joint sequence-structure representations yields better representations for function-related prediction tasks. We propose a transformer neural network that attends to both sequence and tertiary structure. We show that such joint representations are more powerful than sequence-only representations, and that they yield better performance on superfamily membership across various metrics.