Track Awesome Computational Biology Updates Weekly

Awesome list of computational biology.

inoue0426/awesome-computational-biology · ⭐ 120 · 🏷️ Miscellaneous

Mar 09 - Mar 15, 2026

Benchmarks & Datasets

Therapeutics Data Commons (TDC) — Unified benchmark suite covering ADMET, drug-target interaction, drug response, and more.

BindingDB Curated Sets — Curated binding affinity datasets for protein–ligand interaction benchmarking.

Cancer Therapeutics Response Portal (CTRP) — Drug sensitivity profiles across ~900 cancer cell lines for >400 compounds.

GuacaMol (⭐500) — Benchmark suite for generative molecular design models.

MOSES (⭐957) — Benchmarking platform for molecular generation models.

Database

AlphaFold Protein Structure Database — 3D protein structure predictions powered by AlphaFold.

scRNA

CZ CELLxGENE — Single-cell dataset repository and interactive explorer from the Chan Zuckerberg Initiative.

Human Cell Atlas — Open global atlas of all cells in the human body.

Genome

ENCODE — Encyclopedia of DNA Elements; regulatory and functional genomic elements across the genome.

Ensembl — Genome browser and annotation database for vertebrate and other eukaryotic genomes.

gnomAD — Genome Aggregation Database; genetic variation from large-scale sequencing projects.

Rfam — Database of RNA families with sequence alignments and consensus structures.

Model

AlphaFold 3 (⭐7.7k) — Deep learning model from Google DeepMind that predicts the joint 3-D structure of proteins, nucleic acids, small molecules, and their complexes.

Genomics Foundation Models

Enformer (⭐15k) — Transformer model predicting gene expression from DNA sequence.

Nucleotide Transformer (⭐830) — Foundation model for genomic sequences across multiple species.

DNABERT (⭐743) — Pre-trained bidirectional encoder for DNA sequence analysis.

DNABERT-2 (⭐459) — Improved genome foundation model with efficient tokenization.

Basenji (⭐466) — Sequential regulatory activity prediction from DNA sequences.

Caduceus (⭐226) — Bidirectional, reverse-complement-equivariant long-range DNA sequence model built on Mamba.

Evo (⭐1.5k) — Long-context genomic foundation model (up to 1M tokens).

HyenaDNA (⭐764) — Long-range genomic foundation model handling sequences up to 1M tokens, using sub-quadratic Hyena operators in place of attention.

Toolkit

Scanpy — Scalable Python toolkit for analyzing single-cell gene expression data, covering preprocessing, visualization, clustering, and trajectory inference.

Compound

HMDB (Human Metabolome Database) — Comprehensive database of small molecule metabolites found in the human body.

DrugCentral — Online drug compendium with drug mode of action and indication information.

Protein

SAbDab — Structural Antibody Database containing all antibody structures in the PDB.

OAS (Observed Antibody Space) — Database of antibody sequences from immune repertoire sequencing.

Disease

DisGeNET — Database of gene-disease associations integrating expert-curated and GWAS data.

OMIM (Online Mendelian Inheritance in Man) — Comprehensive database of human genes and genetic disorders.

Protein-Protein Interaction

IntAct — Open-source molecular interaction database and analysis system from EMBL-EBI.

Preprocessing Tools

Biopython — Collection of Python tools for biological computation including sequence analysis, structure parsing, and database access.

DeepChem (⭐6.6k) — Deep learning library for drug discovery, quantum chemistry, and materials science.

scvi-tools — Probabilistic models for single-cell omics data analysis.

CellTypist (⭐457) — Automated cell type annotation for scRNA-seq.

GROMACS — Molecular dynamics simulation package for biochemical molecules.

MDAnalysis — Python library for analyzing and altering molecular dynamics simulation trajectories.

OpenMM — High-performance toolkit for molecular simulation and GPU-accelerated MD.

Molecular Generation

REINVENT (⭐370) — Reinforcement learning for de novo drug design.

MolGPT (⭐169) — Transformer-based model for molecular generation.

Molecular Transformer (⭐413) — Sequence-to-sequence model for chemical reaction outcome prediction.

TargetDiff (⭐323) — 3D equivariant diffusion model for structure-based drug design.

LLM for Biology

ClawBio (⭐101) — Bioinformatics-native AI agent skill library with local-first pharmacogenomics, ancestry PCA, semantic similarity, nutrigenomics, and metagenomics skills.

Single-cell Foundation Models / Transcriptomics Foundation Models

Geneformer — Context-aware, attention-based deep learning model pretrained on a large corpus of single-cell transcriptomes.

scBERT (⭐347) — BERT-based foundation model pretrained on large-scale scRNA-seq data for cell type annotation.

CellPLM (⭐101) — Cell pre-trained language model with inter-cell transformer architecture for diverse single-cell analysis tasks.

Single-cell Foundation Models / Spatial Foundation Models

GigaPath (⭐578) — Slide-level digital pathology foundation model pretrained on 1.3 billion image tiles from whole-slide images.

UNI (⭐681) — General-purpose self-supervised pathology foundation model trained on 100K+ whole-slide images for diverse computational pathology tasks.

CONCH (⭐472) — Vision-language foundation model for computational pathology trained with contrastive captioning on pathology image–text pairs.

Phikon — ViT-based pathology foundation model pretrained with iBOT self-supervision on TCGA whole-slide images.

Single-cell Foundation Models / Multi-Omics Foundation Models

scMulan (⭐62) — Single-cell multi-omic language model pretrained on ~10M cells spanning transcriptomics, epigenomics, and proteomics for cross-omics transfer tasks.

totalVI (⭐1.6k) — Probabilistic framework for joint analysis of paired scRNA-seq and protein (CITE-seq) data enabling multi-modal cell state representation across single-cell datasets.

MultiVI (⭐1.6k) — Multi-modal variational autoencoder for integrating paired and unpaired single-cell RNA-seq and ATAC-seq measurements into a unified latent space.

MIRA (⭐67) — Probabilistic multimodal topic model jointly modeling single-cell transcriptomics and chromatin accessibility for regulatory network inference.

GLUE (⭐455) — Graph-Linked Unified Embedding framework for unpaired single-cell multi-omics data integration across RNA, ATAC, methylation, and protein modalities.

BABEL (⭐47) — Cross-modality translation model enabling prediction between scRNA-seq and scATAC-seq profiles without requiring paired single-cell measurements.

Multigrate (⭐31) — Asymmetric multi-omics variational autoencoder for integrating single-cell data across RNA, ATAC, and protein modalities with missing-modality support.

MOFA+ (⭐384) — Multi-Omics Factor Analysis framework identifying shared axes of variation across bulk and single-cell datasets including RNA, ATAC, proteomics, methylation, and copy number.

GeneCompass (⭐111) — Large-scale foundation model integrating DNA regulatory sequences and single-cell transcriptomics from 120M+ cells across multiple species for gene regulation prediction.

UnitedNet (⭐52) — Interpretable multi-task deep neural network for single-cell multi-omics integration spanning transcriptomics, chromatin accessibility, and proteomics.

SpatialGlue — Graph attention network for spatial multi-omics integration jointly embedding spatial transcriptomics with chromatin accessibility or proteomics.

MIDAS (⭐62) — Mosaic integration and knowledge transfer model for single-cell multi-omics data that handles arbitrary missing-modality combinations across transcriptomics, chromatin accessibility, and proteomics.

Single-cell Foundation Models / Domain Alignment

scArches (⭐399) — Transfer learning framework for mapping new single-cell datasets onto pre-trained reference atlases across batches, conditions, and modalities.

TOSICA — Transformer-based framework for one-stop interpretable cell-type annotation supporting cross-dataset and cross-species transfer.

Protein Foundation Models / Protein Structure Prediction and Design

AlphaFold 3 (⭐7.7k) — Predicts structures of proteins, nucleic acids, small molecules, and their complexes.

Boltz-1 (⭐3.8k) — Open-source all-atom biomolecular structure prediction model for proteins, nucleic acids, small molecules, and their complexes achieving AlphaFold3-level accuracy.

Chai-1 (⭐1.9k) — Unified molecular structure prediction model covering proteins, nucleic acids, small molecules, and complexes.

ESM3 (⭐2.3k) — Multimodal protein language model that jointly reasons over sequence, structure, and function for generative protein design and engineering.

ESMFold (⭐4k) — Fast protein structure prediction using language model embeddings.

RFdiffusion (⭐2.8k) — Generative model for protein backbone design using diffusion.

ProteinMPNN (⭐1.6k) — Deep learning model for protein sequence design given backbone structure.

OmegaFold (⭐612) — High-resolution de novo protein structure prediction from sequence.

RoseTTAFold (⭐2.2k) — Three-track neural network for protein structure prediction.

Multi-Modal Foundation Models

CHIEF (⭐688) — Clinical Histopathology Imaging Evaluation Foundation model integrating histology images and clinical context for pan-cancer analysis.

BiomedCLIP — CLIP-based vision-language foundation model for biomedical images and text trained on PubMed figure–caption pairs.

Feb 02 - Feb 08, 2026

API

ChEMBL Web Services — REST API for bioactive molecules, targets, and bioassays.

PubMed E-utilities (esearch/efetch) — APIs for searching and retrieving biomedical literature from PubMed.

NCBI E-utilities — Unified APIs for accessing NCBI databases (Gene, GEO, SRA, PubChem, etc).

UniProt REST API — Programmatic access to protein sequence and functional annotation data.

Ensembl REST API — API for genomic annotations, variants, genes, and comparative genomics.

KEGG REST API — API for accessing KEGG pathways, compounds, genes, and reactions.

Open Targets Platform API — API for target–disease associations integrating genetics, genomics, and drug data.

ClinicalTrials.gov API — API for querying clinical trial metadata and results.
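As a minimal sketch of how the E-utilities entries above are typically queried, the snippet below builds an `esearch` request URL against the public E-utilities endpoint. The helper name `build_esearch_url` and its defaults are hypothetical and chosen only for illustration; no request is sent, so it runs offline.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Return an esearch URL for `term` in database `db`.

    Hypothetical helper for illustration: it only assembles the URL
    and does not contact NCBI.
    """
    params = {
        "db": db,           # Entrez database to search
        "term": term,       # query string
        "retmode": "json",  # ask for a JSON response
        "retmax": retmax,   # maximum number of IDs to return
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url("single-cell foundation model"))
```

The resulting URL can be fetched with any HTTP client; the IDs it returns would then be passed to `efetch` to retrieve the records themselves.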

Benchmarks & Datasets

Genomics of Drug Sensitivity in Cancer (GDSC) — Drug sensitivity for ~1000 human cancer cell lines and hundreds of compounds.

CrossDocked2020 — Large-scale dataset for structure-based virtual screening.

OpenBioLink (⭐158) — Benchmark datasets for biological knowledge graph completion.

Knowledge Graph

PrimeKG (⭐706) — Multi-modal precision medicine knowledge graph integrating clinical, genetic, and drug data.

DRKG (⭐671) — Large-scale biological knowledge graph for drug discovery.

Hetionet (⭐343) — Heterogeneous network integrating genes, diseases, drugs, pathways, and more.

Pathway

Reactome — Expert-curated, peer-reviewed pathway database with detailed reaction mechanisms.

BioCyc — Collection of pathway/genome databases across thousands of organisms.

SIGNOR — Database of causal signaling interactions and pathways.

MSigDB (Molecular Signatures Database) — Curated gene sets derived from pathways and biological processes.

Protein

Protein Data Bank (PDB) — 3D structures of proteins, nucleic acids, and complexes.

RCSB Protein Data Bank — Repository for structural data of biological molecules.

Disease

DrugBank — Database of drugs and targets (University of Alberta).

Drug-Gene Interaction

Comparative Toxicogenomics Database — Chemical-gene interactions, chemical-disease and gene-disease associations, chemical-phenotype associations.

SNAP — Dataset of drug-gene interactions.

Drug (Cell Line) Response

Cancer Cell Line Encyclopedia — Database of ~1000 cancer cell lines.

CellMiner Cross Database (CellMinerCDB) — Integrates multiple cancer cell line databases.

Chemical-Protein Interaction

BindingDB — Database of measured binding affinities between compounds and protein targets.

PDBBind — Binding affinity data for biomolecular complexes.

Protein-Protein Interaction

BioGRID — Protein, genetic, and chemical interactions.

HIPPIE — Human protein-protein interaction database.

Drug Target Interaction

DTINet (⭐185) — Network-based framework integrating heterogeneous biological data for DTI prediction.

DeepDTA (⭐293) — Deep learning model using CNNs on protein sequences and drug SMILES.

GraphDTA (⭐293) — Graph neural network–based DTI prediction using molecular graphs.

MolTrans (⭐225) — Transformer-based DTI model leveraging molecular substructures.

DrugBAN (⭐138) — Bilinear attention network for interpretable DTI prediction.

Protein Foundation Models / Pre-trained Embedding

Evolutionary Scale Modeling (ESM) (⭐4k) — Pretrained protein language models that produce per-residue and per-sequence embeddings.

Jan 12 - Jan 18, 2026

Single-cell Foundation Models / Transcriptomics Foundation Models

scGPT (⭐1.5k) — Transformer-based foundation model pretrained on millions of single-cell profiles.

scFoundation (⭐392) — Large-scale foundation model for single-cell gene expression, enabling multiple downstream tasks.

BulkFormer (⭐42) — Foundation model for bulk RNA-seq data; learns general transcriptomic representations.

Preprocessing Tools

ChatSpatial (⭐18) — MCP server for spatial transcriptomics analysis via natural language.

Jan 05 - Jan 11, 2026

Preprocessing Tools

Squidpy — Python library for spatial single-cell analysis.

FlashDeconv (⭐13) — High-performance spatial transcriptomics deconvolution (~1M spots in ~3 min).

Nov 11 - Nov 17, 2024

Drug Response Prediction

MOFGCN (⭐6) — GCN + heterogeneous network.

DeepDSC — Autoencoder + fully connected NN.

DGDRP (⭐0) — Multi-view embedding neural network.

DeepAEG (⭐3) — GNN embedding + attention mechanism.

Aug 26 - Sep 01, 2024

Benchmarks & Datasets

MoleculeNet — Benchmark datasets for molecular machine learning.

NCI60 — Drug sensitivity benchmark across 60 diverse human cancer cell lines.

Genome

Dependency Map (DepMap) — CRISPR-Cas9 screens in cancer cell lines.

10x Genomics Dataset — Collection of single-cell datasets.

The Genotype-Tissue Expression (GTEx) — Human gene expression and regulation resource.

Catalogue Of Somatic Mutations In Cancer (COSMIC) — Resource on somatic mutations in cancers.

MGnify — Resource for metagenomic and metatranscriptomic data.

JASPAR — Database of transcription factor binding profiles.

Compound

ZINC ligand discovery database — Free database of commercially-available compounds for virtual screening.

Protein

Critical Assessment of Structure Prediction (CASP) — Assessing methods for protein structure prediction.

Uniclust — Clustered protein sequence databases.

CATH database — Hierarchical classification of protein domain structures.

Chemical-Protein Interaction

STITCH — Chemical-protein interactions.

Clinical Trial

ClinicalTrials.gov — Privately and publicly funded clinical studies.

ICD-10 — International Classification of Diseases, 10th revision.

EU Drug Regulating Authorities Clinical Trials DB (EudraCT) — European clinical trial database.

MIMIC-IV — Freely accessible critical care database.

Aug 05 - Aug 11, 2024

Compound

Therapeutic Target Database — Drug-target, target-disease, and drug-disease datasets.

Knowledge Graph

Drug Mechanism Database (DrugMechDB) (⭐69) — Mechanisms of action from drug to disease.

Drug Repurposing

DeepPurpose (⭐1.1k) — Deep learning library for drug repurposing.

LLM for Biology

scPRINT (⭐142) — Pretrained on 50M cells for scRNA-seq denoising & zero imputation.

Jul 15 - Jul 21, 2024

Drug Response Prediction

drGAT (⭐1) — Attention-based model for drug response prediction with gene explainability.

LLM for Biology

GeneGPT (⭐423) — LLM for biomedical information, integrated with various APIs.

GenePT (⭐310) — Foundation LLM for single-cell data.

Mar 11 - Mar 17, 2024

Compound-Protein Interaction

TransformerCPI (⭐153) — CPI prediction using Transformer.

LLM for Biology

AI4Chem/ChemLLM-7B-Chat — LLM for chemical & molecular science.

BioGPT (⭐4.5k) — LLM for biomedical text generation.

Protein Foundation Models / Pre-trained Embedding

ChemBERTa-2 (⭐487) — Chemical embeddings & prediction.

Nov 27 - Dec 03, 2023

Compound

Drug Repurposing Hub — Collections of drug repurposing data (drug, MoA, target, etc).

Protein

AlphaFold Protein Structure Database — 3D protein structure predictions.

Protein-Protein Interaction

STRING — PPI networks for multiple organisms.

Sep 04 - Sep 10, 2023

Compound

Rhea — Database of chemical reactions.

Jun 12 - Jun 18, 2023

scRNA

Single Cell Expression Atlas — Public database of single-cell RNA-seq data.

Pathway

PathwayCommons — Database of pathways and interactions.

Genome

cBioPortal — Cancer genomics database; aggregating many patient datasets.

Drug-Gene Interaction

DGIdb — Drug-gene interactions and the druggable genome.

Preprocessing Tools

Scanpy — Python library for scRNA-seq analysis.

Seurat — R library for scRNA-seq analysis.

Apr 10 - Apr 16, 2023

scRNA

Gene Expression Omnibus — Public functional genomics database.

Single Cell Portal — Public database of single-cell RNA-seq data.

May 16 - May 22, 2022

Compound

ChEMBL — Bioactive molecules with drug-like properties.

PubChem — One of the largest chemical databases (compounds, genes, and proteins).

ChEBI — Database focused on small chemical compounds.

ChemSpider — Chemical structure database.

KEGG COMPOUND — Collection of small molecules and biopolymers.

LIPID MAPS — Database of lipids.

Mass Spectra

MassBank — Open source databases and tools for mass spectrometry reference spectra.

MoNA MassBank of North America — Meta-database of metabolite mass spectra, metadata, and associated compounds.

Protein

UniProt — Functional information on proteins.

The Human Protein Atlas — Comprehensive human protein database (cells, tissues, organs).

Pathway

KEGG PATHWAY — Collection of pathway maps.

WikiPathways — Database of biological pathways.

Genome

Human Genome Resources at NCBI — Database for genomics, proteomics, transcriptomics, and systems biology.

GenBank — NCBI's database of genetic sequences.

UCSC Genome Browser — Interactive browser for genome assemblies and annotation tracks.

Disease

KEGG DRUG — Comprehensive, approved drug information.

May 09 - May 15, 2022

Preprocessing Tools

Chemistry Development Kit (⭐571) — Open-source Java library for cheminformatics.

RDKit (⭐3.3k) — Cheminformatics software & machine learning toolkit.

Drug Target Interaction

NeoDTI (⭐77) — Library for drug-target interaction prediction.

Compound-Protein Interaction

MCPINN (⭐3) — Drug discovery via compound-protein interaction and machine learning.