Apr 1, 2025 | 10 min read
Welcome to Revilico's first product digest and case study. We'll walk you through one Hit Identification use case for a well-defined, well-established target: EGFR, which is highly relevant to cancer treatment. We welcome you to our community and encourage you to join our webinar on April 2nd at 12pm PST. Sign up here!
Join our Slack Community here!
See our Step-by-Step Walkthrough on our LinkedIn Page
See a direct platform use case of Revilico's Discovery Engine
Understanding the Core Biology
Epidermal Growth Factor Receptor (EGFR) is a critical protein in the cell signaling pathways that regulate cell growth, survival, proliferation, and differentiation. Dysregulation or mutation of EGFR is a hallmark of various cancers, including lung, colorectal, head and neck, and pancreatic cancers. Mutations, particularly in the kinase domain of EGFR, lead to aberrant activation, promoting uncontrolled cellular growth and cancer progression. The crystal structure of the EGFR kinase domain (PDB: 2GS2) reveals structural details that underpin its activity and its interactions with therapeutic compounds. Understanding this structural landscape is essential for the rational design of effective EGFR inhibitors.
Revilico Inc. is pioneering advancements across drug discovery, especially Hit Identification (Hit ID). Hit ID is a crucial phase in drug discovery, traditionally assessed using key metrics such as IC50, Kd, and Ki. These metrics quantify how effectively a molecule binds to or inhibits a target protein, helping researchers pinpoint potential therapeutic hits. IC50 is the concentration at which a molecule inhibits a biological activity by 50%; Kd (the dissociation constant) quantifies binding affinity; and Ki (the inhibition constant) quantifies how tightly an inhibitor binds its target. Our models can also predict EC50 for agonists, but given the biology targeted here, we focus primarily on IC50 values for hit qualification. With these metrics, drug hunters can quickly identify compounds with promising therapeutic profiles.
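To make the relationship between these metrics concrete, here is a minimal sketch of two standard conversions: IC50 in nM to pIC50 (the negative log10 of the molar IC50), and IC50 to Ki via the Cheng-Prusoff equation, which holds for competitive inhibitors. The function names and example numbers are hypothetical illustrations, not part of Revilico's platform.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 M, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)


def ki_from_ic50(ic50_nm: float, substrate_conc: float, km: float) -> float:
    """Cheng-Prusoff for a competitive inhibitor: Ki = IC50 / (1 + [S] / Km).

    substrate_conc and km must share the same concentration unit.
    The result is in the same unit as ic50_nm.
    """
    if substrate_conc < 0 or km <= 0:
        raise ValueError("substrate_conc must be >= 0 and km must be > 0")
    return ic50_nm / (1.0 + substrate_conc / km)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    print(pic50_from_nm(1000.0))           # a 1 uM inhibitor
    print(ki_from_ic50(100.0, 10.0, 10.0))  # assay run at [S] = Km
```

Because IC50 depends on assay conditions such as substrate concentration while Ki does not, IC50 values are most comparable when they come from the same assay setup.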
Priming our Case Study
For this case study, Revilico conducted a high-throughput screen (HTS) of 125,000 compounds. Among these were nine clinically validated, FDA-approved drugs (Erlotinib, Osimertinib, Neratinib, Gefitinib, Mobocertinib, Lapatinib, Dacomitinib, Lazertinib, and Vandetanib) embedded within a pool of randomly selected ChEMBL compounds. This screen allowed Revilico to benchmark the AI model's predictive accuracy against established drugs. Before assessing activity, we analyzed the chemical space and dispersity of the set using Revilico's Chemical Space Analysis engine. This clustering algorithm, based on structural embeddings, was evaluated on two independent random samples of 1,500 compounds each, drawn from the 125,000-compound dataset.

These results show t-SNE projections of the molecular feature embeddings, derived from Revilico’s Discovery Engine.
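As an illustration of the sampling step only, the sketch below draws two independent random samples of 1,500 compound indices from a 125,000-compound set. The function names and the seed are hypothetical; computing the embeddings and the t-SNE projection themselves is not shown, since the source does not describe how the Discovery Engine does this.

```python
import random
from typing import List


def draw_independent_samples(
    n_compounds: int, sample_size: int, n_samples: int, seed: int = 0
) -> List[List[int]]:
    """Draw n_samples random samples of compound indices.

    Each sample is drawn without replacement, but the samples are drawn
    independently of one another, so two samples may share some compounds.
    """
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    return [
        sorted(rng.sample(range(n_compounds), sample_size))
        for _ in range(n_samples)
    ]


def overlap(sample_a: List[int], sample_b: List[int]) -> int:
    """Number of compound indices that appear in both samples."""
    return len(set(sample_a) & set(sample_b))


if __name__ == "__main__":
    first, second = draw_independent_samples(125_000, 1_500, 2, seed=42)
    print(len(first), len(second), overlap(first, second))
```

If the two projections look similar despite a small overlap between the samples, that suggests the sampled structure reflects the full dataset rather than one particular draw.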
In our evaluation of 125,000 compounds, the FDA-approved drugs ranked among the top-performing compounds, consistent with their established clinical efficacy. Using Revilico's Chemical Space Analysis Engine, we ensured a diverse and representative molecular sampling across chemical space. Eight of the nine FDA-approved compounds (88.9%) ranked in the top quartile, falling between the top 0.42% and the top 13.8% by predicted IC50. The one outlier, Vandetanib, ranked in the top 27.27%, still performing well but below the others.

Extract from Discovery Engine: an analysis of the embedded hits. The Y-axis shows log-scaled IC50 activity. The X-axis shows the top percentile by IC50, with the best performers nearest the origin and the weakest performers at 100%. Lower IC50 values indicate higher potency. All reported activity values are in nM.
Performance Breakdown (Percentile of Compounds Identified):
Lazertinib – Top 0.42% (Elite-tier efficacy, among the absolute highest-ranked)
Neratinib – Top 1.01% (Exceptional potency, near the very top)
Dacomitinib – Top 3.28% (Highly effective, premier performer)
Mobocertinib – Top 4.09% (Strong, high-ranking inhibitor)
Osimertinib – Top 5.05% (Best in class, top-tier performance)
Gefitinib – Top 5.11% (Outstanding, closely comparable to osimertinib)
Erlotinib – Top 11.52% (Moderate efficacy, still a strong performer)
Lapatinib – Top 13.8% (Competitive, ranking among the strongest)
Vandetanib – Top 27.27% (Outlier, moderate efficacy relative to the set)
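The "Top X%" figures above can be read as a rank-based percentile. The sketch below shows one way to compute it, assuming that compounds are ranked by predicted IC50 with lower values ranking higher. The function name and the toy data are hypothetical, and Revilico's exact ranking and tie-handling rules are not stated in the source.

```python
from typing import Mapping


def top_percentile(predicted_ic50_nm: Mapping[str, float], compound: str) -> float:
    """Percent of the library ranked at or above `compound`.

    Lower predicted IC50 ranks higher. Ties share the best rank in the tie.
    """
    target = predicted_ic50_nm[compound]
    # Rank = 1 + number of compounds with a strictly lower predicted IC50.
    rank = 1 + sum(1 for value in predicted_ic50_nm.values() if value < target)
    return 100.0 * rank / len(predicted_ic50_nm)


if __name__ == "__main__":
    # Toy library with made-up predicted IC50 values in nM.
    toy_library = {"cpd_a": 1.0, "cpd_b": 10.0, "cpd_c": 100.0, "cpd_d": 1000.0}
    for name in toy_library:
        print(name, top_percentile(toy_library, name))
```

Under this definition, a compound in the "top 0.42%" of a 125,000-compound screen would sit at roughly rank 525.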
Optimizing Performance: FDA-Approved Compounds Dominate the Top Quartiles
As a follow-up to the previous screen, Revilico's on-premise, readily synthesizable library of 70 billion compounds was searched, and an orthogonal assessment was run on the Enamine REAL library's Discovery Diversity Set, which contains 50,240 structurally diverse compounds and is a good foundation for exploratory hit-finding screens. Chemical diversity was analyzed with a different approach this time: random samples of 250 and 500 compounds were drawn to test diversity across regions of the dataset. Results are provided below:

These results show t-SNE projections of the molecular feature embeddings, derived from Revilico’s Discovery Engine. The set of 250 randomly sampled compounds is on the left, and the set of 500 randomly sampled compounds is on the right.
The results of this screen against the same EGFR sequence showed markedly improved performance. With a more chemically diverse compound library spanning a broader range of structures, the model identified higher-efficacy hits and outperformed the previous screen. The FDA-approved drugs in this dataset ranked between the top 0.014% and the top 2.61%, a substantial improvement over the prior screen that underscores the role of chemical diversity in early-stage hit identification. Revilico's recommendation of advancing 10-15% of compounds to orthogonal validation mitigates false negatives and helps ensure that the highest-quality candidates proceed for further study. This screen highlights the tangible benefits of strategic library selection: all FDA-approved drugs were identified within the top 3% of performers, which translates into projected cost and time savings in the Design-Make-Test-Analyze (DMTA) cycle. These results were expected given the efficacy needed to meet FDA regulatory standards. For business and R&D planning, the downstream compound set required for further analysis can be reduced by over 97%, dramatically streamlining hit-to-lead optimization.
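The arithmetic behind these cutoffs is simple to check. The sketch below assumes a library size of 50,240 (the Discovery Diversity Set, ignoring the handful of spiked-in reference drugs) and uses hypothetical function names. It computes how many compounds a given "top X%" cutoff advances and how much of the library is set aside.

```python
import math


def shortlist_size(library_size: int, top_percent: float) -> int:
    """Number of compounds advanced when keeping the top `top_percent` percent."""
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    # Round up so the compound sitting exactly at the cutoff is included.
    return math.ceil(library_size * top_percent / 100)


def reduction_percent(library_size: int, top_percent: float) -> float:
    """Percent of the library that does NOT advance past the cutoff."""
    kept = shortlist_size(library_size, top_percent)
    return 100.0 * (1 - kept / library_size)


if __name__ == "__main__":
    library = 50_240  # assumption: size of the Discovery Diversity Set
    for cutoff in (2.61, 10, 15):
        print(cutoff, shortlist_size(library, cutoff),
              round(reduction_percent(library, cutoff), 2))
```

A cutoff at the lowest-ranked reference drug (top 2.61%) keeps about 1,300 compounds, while the more conservative 10-15% recommendation keeps roughly 5,000 to 7,500.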
Performance Breakdown (Percentile of Compounds Identified):
Neratinib – Top 0.014% (Elite-tier efficacy, among the absolute highest-ranked)
Lazertinib – Top 0.037% (Exceptional potency, near the very top)
Osimertinib – Top 0.07% (Best in class, top-tier performance)
Dacomitinib – Top 0.09% (Highly effective, premier performer)
Lapatinib – Top 0.14% (Competitive, ranking among the strongest)
Mobocertinib – Top 0.15% (Strong, high-ranking inhibitor)
Gefitinib – Top 0.44% (Outstanding, closely comparable to osimertinib)
Erlotinib – Top 1.87% (Moderate efficacy, still a strong performer)
Vandetanib – Top 2.61% (Outlier, moderate efficacy relative to the set)

Extract from Discovery Engine: an analysis of the embedded hits. The Y-axis shows log-scaled IC50 activity. The X-axis shows the top percentile by IC50, with the best performers nearest the origin and the weakest performers at 100%. Lower IC50 values indicate higher potency. All reported activity values are in nM. Nine FDA-approved compounds were assessed against Enamine REAL's Discovery Diversity Set.
Extracting Potential Quantitative Structure-Activity Relationships
All hits were processed through Revilico's QSAR modeling algorithm to identify key structural motifs for scaffold decoration, lead expansion, and activity optimization. The analysis, detailed in the platform walkthrough, highlights compound motifs selected for further investigation in downstream case studies. Only the top 10% of IC50 performers (roughly 5,000 compounds) were advanced to QSAR modeling, where they were clustered by structural and physicochemical relationships. The top clusters had the lowest IC50 values, indicating high activity among compounds sharing a motif. After clustering, scaffold search algorithms identified relevant similarities to guide downstream optimization.
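The source does not describe the QSAR or clustering algorithm itself, so the following is only a generic sketch of the workflow's shape: keep the top 10% of compounds by predicted IC50, then group them by structural similarity. It swaps in Tanimoto similarity over fingerprint feature sets and a simple greedy leader clustering. All names, the threshold, and the toy fingerprints are hypothetical.

```python
from typing import FrozenSet, List, Mapping


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint features."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def top_fraction(ic50_nm: Mapping[str, float], fraction: float = 0.10) -> List[str]:
    """Names of the best `fraction` of compounds, lowest predicted IC50 first."""
    ranked = sorted(ic50_nm, key=lambda name: ic50_nm[name])
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def leader_cluster(
    names: List[str],
    fingerprints: Mapping[str, FrozenSet[int]],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Greedy leader clustering.

    Each compound joins the first cluster whose leader (first member) is at
    least `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters: List[List[str]] = []
    for name in names:
        for cluster in clusters:
            leader = cluster[0]
            if tanimoto(fingerprints[name], fingerprints[leader]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    # Toy fingerprints: each integer stands for one structural feature.
    fps = {
        "cpd_a": frozenset({1, 2, 3, 4}),
        "cpd_b": frozenset({1, 2, 3, 5}),
        "cpd_c": frozenset({7, 8, 9}),
    }
    print(leader_cluster(["cpd_a", "cpd_b", "cpd_c"], fps))
```

Feeding the compounds in IC50 order means each cluster's leader is its most potent member, which makes it easy to see which shared motifs are associated with the lowest IC50 values.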
Validating and Benchmarking the Model
Revilico uses artificial intelligence models to translate complex molecular and protein information into computationally meaningful representations. Our AI platform transforms molecular structures (SMILES strings) into rich representations that capture essential chemical features, connectivity patterns, and spatial arrangements. Protein sequences are likewise converted into representations that encode structural features, amino acid interactions, and evolutionary information. This allows our models to predict key affinity metrics effectively, significantly accelerating the hit identification process. Our models were trained on 2 million small molecule-protein interaction pairs and validated on 400,000 interactions to support robust predictive accuracy.
To benchmark the model, we compared its predicted values against experimental values. For IC50, the model achieved a Pearson correlation coefficient of 0.872, indicating a strong correlation between predicted and experimental data. Kd, Ki, and EC50 also performed strongly, with Pearson correlation coefficients of 0.802, 0.873, and 0.866, respectively. Downstream analytics can be tailored toward different insights depending on the intended biology and outcomes.
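For readers who want to reproduce this kind of benchmark on their own data, here is a minimal sketch of the Pearson correlation coefficient between predicted and experimental values. The function name and toy numbers are hypothetical. Whether the correlation is computed on raw or log-scaled values is an assumption left to the user, since the source does not say which Revilico used.

```python
import math
from typing import Sequence


def pearson_r(predicted: Sequence[float], experimental: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    if n != len(experimental) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, experimental))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    if var_p == 0 or var_e == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_e)


if __name__ == "__main__":
    # Made-up pIC50-style values, for illustration only.
    predicted = [6.1, 7.4, 5.2, 8.0, 6.8]
    experimental = [6.0, 7.9, 5.5, 7.6, 6.5]
    print(round(pearson_r(predicted, experimental), 3))
```

Pearson correlation measures linear agreement only, so it is often reported alongside an error metric or a rank correlation to show how well the ordering of compounds is preserved.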

An overview of model examinations for the primary IC50 evaluations. The plot shows predicted versus actual IC50 values, with an emphasis on performance deviations across different percentile ranges of the test set.
Advancing the Case Study Further
With these initial results supporting the validity of one of Revilico's breakthrough activity models, next steps include an iterative lead optimization process, atomistic examinations using physics-based modeling, and assessments of molecular properties, novelty, safety, and ease of synthesis. Over the next few weeks, we will close out this case study by showing you how to run a computationally driven drug discovery campaign using Revilico's Discovery Engine.
Thank you for supporting Revilico! Join our community down below!
Webinar April 2nd, 12pm PST Sign up here: https://lnkd.in/g33iNSuC
Join our Slack community: https://shorturl.at/ul0gi