The accelerating crisis of antimicrobial resistance (AMR) demands new computational methods to stay ahead of evolving pathogens. ApexOracle is a unified ML platform designed to both predict the activity of candidate compounds against specific bacterial strains and generate novel molecules de novo, proactively targeting future superbugs.

Motivation and Scope

  • Global Impact: Bacterial AMR was associated with an estimated 4.95 million deaths worldwide in 2019.
  • Traditional Challenges: Standard drug discovery pipelines are slow, resource-intensive, and reactive.
  • ApexOracle Goal: Integrate genomic context and molecular design into one end-to-end framework.

ApexOracle Architecture

Layman’s Explanation: Imagine you have three sets of clues: the genetic code of a bacterium (its genome), a short description of its traits (like a basic fact sheet), and the building blocks of a potential drug (a molecular recipe). ApexOracle acts like a super-smart detective that reads all three clues at once. It combines them, works out which molecules are most likely to work, and even drafts entirely new molecular recipes that could stop the bacterium in its tracks.

  1. Inputs

    • Genome Embedding: Evo2, a DNA language model pretrained on millions of microbial genomes.
    • Trait Description: Me-LLaMA embeddings fine-tuned for phenotypic metadata (taxonomy, morphology, resistance profile).
    • Molecular Representation: SELFIES-encoded small molecules or peptides, processed by a discrete diffusion language model (DLM).
  2. Cross-Attention Fusion

    • Cross-attention fuses the genome, trait, and molecule embeddings into a shared representation that feeds both the prediction and generation heads.
  3. Task Heads

    • Prediction Head: Regression for Minimum Inhibitory Concentration (MIC) and binary activity classification.
    • Generation Head: Guided SELFIES synthesis conditioned on pathogen embedding and prediction feedback.
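
As a rough illustration of the fusion step, the sketch below applies scaled dot-product cross-attention in plain Python. Molecule-token vectors act as queries, and the pathogen context (one genome vector and one trait vector) supplies the keys and values. The 4-dimensional embeddings are made-up toy values, all function and variable names are hypothetical, and the learned projection matrices a real model would use are omitted. It is not taken from the ApexOracle implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: list of query vectors (here: molecule tokens)
    keys, values: lists of context vectors (here: pathogen embeddings)
    Returns (outputs, weights), one row per query.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every context vector.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted average of the value vectors.
        fused = [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        outputs.append(fused)
        weights.append(w)
    return outputs, weights


# Toy embeddings (assumed values, for illustration only).
genome_emb = [1.0, 0.0, 0.0, 0.0]
trait_emb = [0.0, 1.0, 0.0, 0.0]
molecule_tokens = [
    [2.0, 0.0, 0.0, 0.0],  # aligned with the genome vector
    [0.0, 2.0, 0.0, 0.0],  # aligned with the trait vector
    [0.0, 0.0, 1.0, 0.0],  # aligned with neither
]

context = [genome_emb, trait_emb]
fused, attn = cross_attention(molecule_tokens, context, context)

for token_weights in attn:
    print([round(w, 3) for w in token_weights])
```

Each row of `attn` sums to 1. The first token puts more weight on the genome vector, the second on the trait vector, and the third splits its attention evenly because it is orthogonal to both.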

Training Paradigm

  • Dual DLM Tasks:

    1. Mask & Reconstruct: BERT-like masked token recovery for SELFIES.
    2. Multi-Descriptor Regression: Predict 209 physicochemical properties via RDKit descriptors.
  • Datasets:

    • Public small-molecule MIC libraries for Gram-positive and Gram-negative bacteria.
    • Antimicrobial peptide datasets with non-canonical residues.
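
The Mask & Reconstruct task listed above can be sketched as follows. The snippet splits a SELFIES string into its bracketed symbols, replaces a random subset with a mask symbol, and records the original tokens as reconstruction targets. The `[MASK]` symbol, the fixed mask rate, and the helper names are assumptions for illustration; masked discrete diffusion models typically vary the mask rate across training examples, and the source does not specify ApexOracle's schedule.

```python
import random
import re

MASK = "[MASK]"  # hypothetical mask symbol, not part of the SELFIES alphabet


def tokenize_selfies(selfies_string):
    """Split a SELFIES string into its bracketed symbols."""
    return re.findall(r"\[[^\]]*\]", selfies_string)


def mask_tokens(tokens, mask_rate, rng):
    """Independently replace each token with MASK with probability mask_rate.

    Returns (corrupted, targets). targets[i] holds the original token at
    masked positions and None elsewhere, so a loss would only be computed
    on the positions the model has to reconstruct.
    """
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            targets.append(tok)
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets


# Example SELFIES string (benzene).
tokens = tokenize_selfies("[C][=C][C][=C][C][=C][Ring1][=Branch1]")

rng = random.Random(0)  # seeded for repeatability
corrupted, targets = mask_tokens(tokens, mask_rate=0.3, rng=rng)

print("input  :", "".join(corrupted))
print("targets:", targets)
```

A model trained on this objective receives `corrupted` and is scored on how well it recovers the non-`None` entries of `targets`.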

Results

  • Predictive Performance:

    • Outperforms strain-specific baselines on unseen pathogens, with R² gains up to 15% on challenging strains like Pseudomonas aeruginosa.
    • AUC-ROC improvements in binary activity classification.
  • De Novo Generation:

    • Generated compounds show predicted MIC values comparable to clinical standards.
    • Preliminary in vitro assays confirm activity of select molecules.
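
For readers unfamiliar with the two metrics reported above, here is a minimal sketch of how R² (for MIC regression) and AUC-ROC (for binary activity) are computed. The numbers are toy values, not ApexOracle results, and treating MIC on a log2 scale is an assumption that the source does not state.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def auc_roc(labels, scores):
    """AUC-ROC via its pairwise definition.

    It equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one, with
    ties counted as one half.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy regression targets: log2(MIC) values (assumed scale, made-up data).
true_log2_mic = [1.0, 2.0, 3.0, 4.0]
pred_log2_mic = [1.5, 2.0, 2.5, 4.0]
r2 = r_squared(true_log2_mic, pred_log2_mic)

# Toy classification data: 1 = active, 0 = inactive (made-up data).
activity_labels = [1, 1, 0, 0]
activity_scores = [0.9, 0.4, 0.5, 0.1]
auc = auc_roc(activity_labels, activity_scores)

print(f"R^2 = {r2:.2f}, AUC-ROC = {auc:.2f}")  # R^2 = 0.90, AUC-ROC = 0.75
```

The pairwise AUC computation is quadratic in the number of examples, which is fine for illustration; larger evaluations usually rely on a rank-based implementation.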

Discussion and Future Directions

  • Proactive Discovery: Enables anticipatory design against emerging pathogens.
  • Extensibility: Potential to integrate protein structure, clinical metadata, and active learning loops.
  • Challenges: Scalability across thousands of species, toxicity and pharmacokinetics prediction, and automated wet-lab integration.

Conclusion

ApexOracle illustrates how multi-modal ML can revolutionize antibiotic discovery by closing the loop between prediction and generation, providing a blueprint for combating future AMR threats.