• Search Menu
  • Sign in through your institution
  • Advance articles
  • Author Guidelines
  • Submission Site
  • Open Access
  • Why publish with this journal?
  • About Bioinformatics
  • Journals Career Network
  • Editorial Board
  • Advertising and Corporate Services
  • Self-Archiving Policy
  • Dispatch Dates
  • Journals on Oxford Academic
  • Books on Oxford Academic

Issue Cover

Article Contents

1 introduction, 2 materials and methods, 4 conclusion, supplementary data, conflict of interest, data availability.

  • < Previous

CODEX: COunterfactual Deep learning for the in silico EXploration of cancer cell line perturbations

  • Article contents
  • Figures & tables
  • Supplementary Data

Stefan Schrod, Helena U Zacharias, Tim Beißbarth, Anne-Christin Hauschild, Michael Altenbuchinger, CODEX: COunterfactual Deep learning for the in silico EXploration of cancer cell line perturbations, Bioinformatics , Volume 40, Issue Supplement_1, July 2024, Pages i91–i99, https://doi.org/10.1093/bioinformatics/btae261

  • Permissions Icon Permissions

High-throughput screens (HTS) provide a powerful tool to decipher the causal effects of chemical and genetic perturbations on cancer cell lines. Their ability to evaluate a wide spectrum of interventions, from single drugs to intricate drug combinations and CRISPR-interference, has established them as an invaluable resource for the development of novel therapeutic approaches. Nevertheless, the combinatorial complexity of potential interventions makes a comprehensive exploration intractable. Hence, prioritizing interventions for further experimental investigation becomes of utmost importance.

We propose CODEX (COunterfactual Deep learning for the in silico EXploration of cancer cell line perturbations) as a general framework for the causal modeling of HTS data, linking perturbations to their downstream consequences. CODEX relies on a stringent causal modeling strategy based on counterfactual reasoning. As such, CODEX predicts drug-specific cellular responses, comprising cell survival and molecular alterations, and facilitates the in silico exploration of drug combinations. This is achieved for both bulk and single-cell HTS. We further show that CODEX provides a rationale to explore complex genetic modifications from CRISPR-interference in silico in single cells.

Our implementation of CODEX is publicly available at https://github.com/sschrod/CODEX . All data used in this article are publicly available.

Large-scale perturbation experiments in human cancer cell lines offer a powerful approach to connect genetic or chemical interventions with downstream effects. Such high-throughput screens (HTS) aid the identification of new drug compounds and more effective cancer treatments, and provide a way to study genomic susceptibilities in cancer ( Ling et al. 2018 ).

This motivated several data collections and technical advancements. The Genomics of Drug Sensitivity in Cancer (GDSC) database ( Iorio et al. 2016 ) provides response measurements of 1001 cancer cell lines to 265 anticancer drugs. In comparison to individual drugs, drug combinations can have increased efficacy, and reduced toxicity and adverse side effects as a consequence of reduced dosages ( Al-Lazikani et al. 2012 , Csermely et al. 2013 ). This motivated drug combination databases such as DrugComb ( Zagidullin et al. 2019 ), gathering data of more than 700 000 drug combinations for more than 8000 unique compounds. Investigated downstream effects are not limited to measures of drug response. The Connectivity Map ( Subramanian et al. 2017 ) offers more than 3 000 000 perturbed gene expression profiles using the L1000 technology (measuring the expression of 978 genes) ( https://clue.io/ ), containing diverse genetic (shRNA, CRISPR, and overexpression), chemical, and physiological perturbations. Advances in barcoding strategies enabled bulk RNA sequencing at comparatively low costs ( Bush et al. 2017 , Ye et al. 2018 ) and allowed the generation of perturbation profiles without restricting the gene space [see, e.g. PANACEA ( Douglass et al. 2022 )]. A further essential development are high-throughput single-cell perturbation screens, providing measurements of perturbed transcriptomes of individual cells for diverse interventions. For instance, Sci-Plex was used to screen cancer cell lines exposed to different compounds and dosages at single-cell resolution ( Srivatsan et al. 2020 ). Genetic perturbations on a single-cell level were approached by Perturb-Seq, which uses barcoding techniques and CRISPR interference to perform genome-scale perturbation screens covering more than 1000 gene knockouts on RPE-1 and K562 cells ( Replogle et al. 2022 ). Further, Norman et al. (2019) explored the effects of 131 two-gene knockouts on K562 cells. Even though more than 100 000 single cells are recorded, only a small fraction of the combinatorial space could be experimentally covered, highlighting the need for computational approaches to infer novel drug combination effects and guide experimental studies.

The outlined resources differ with respect to the employed techniques, and the investigated downstream effects and interventions. However, different data types typically require tailored solutions. For instance, drug sensitivity predictions in cell lines were addressed in an NCI-DREAM challenge ( Costello et al. 2014 ) and identified a Bayesian multitask multiple kernel learning approach to perform best. Subsequent methods built on this idea and additionally addressed the model adaptation to real tumor specimens to provide treatment response predictions in individual patients ( Sharifi-Noghabi et al. 2021 , He et al. 2022 ). The prediction of drug synergies was pioneered by DeepSynergy ( Preuer et al. 2018 ). DeepSynergy predicts drug synergisms from cell-line transcriptomic data in combination with features representing compound structures. Drug response predictions in single cells were addressed by the Compositional Perturbation Autoencoder (CPA) ( Lotfollahi et al. 2023 ), which also addressed drug combinations, by perturbing the latent representation of a Variational Autoencoder (VAE). The prediction of unseen genetic interventions and combinations thereof was addressed by GEARS ( Roohani et al. 2023 ). GEARS utilizes Graph Neural Networks (GNNs) to incorporate prior knowledge of gene-gene relationships into the model architecture. The generative approach, however, limits applications to a specific cell type. In summary, all outlined approaches are tailored to the problem at hand, even though they describe the common problem of predicting causal consequences of a specified intervention.

The inference of many causal actions necessitates counterfactual reasoning. For instance, in a typical clinical trial, patient outcome is recorded for a treated and a control group. By comparing the outcomes of both groups, the average treatment effect can be derived. However, a treatment that is beneficial on average may not be helpful for every individual patient. The computational challenge in predicting individual patients’ treatment responses arises from the fact that patients’ outcomes can only be observed for the specific treatment they received, not for the alternative treatments they did not receive (here, the control) ( Rubin 1974 ). Consequently, algorithms cannot directly learn rules for selecting the “better” treatment for each patient. Instead, they must infer the alternative, or “counterfactual,” outcomes in order to adequately assess the treatment’s relative benefit. In HTS, alternative interventions can be observed, assuming that cultures of the same cell line can be considered as technical replicates. Nevertheless, it is impractical to test every possible intervention, especially for combinations of interventions; the combinatorial complexity makes a comprehensive exploration intractable. Thus, in silico approaches become necessary to prioritize interventions for further experimental investigations. To address this task, we will build on counterfactual machine learning approaches to extrapolate the space of interventions to the yet unseen “counterfactual” perturbations and cell lines.

Counterfactual deep-learning (DL) approaches turned out to be particularly promising due to their enormous flexibility. The typical strategy is to construct networks which facilitate joint representation learning across all investigated interventions, and to account for treatment specific effects via dedicated network branches ( Johansson et al. 2016 , Shalit et al. 2017 , Yao et al. 2018 , Schrod et al. 2023 ). In the context of HTS data, molecular information is first aggregated in a treatment-agnostic manner to encode features of unperturbed control cells, and then, separated into treatment-specific representations to capture the treatment underlying molecular mechanisms. Note that a dedicated representation has to be trained for each intervention or intervention combination, and as such, downstream effects of novel combinations cannot be inferred. Alternative approaches that only consider the intervention as a predictor variable typically capture only average treatment effects and may overlook individual treatment effects. This can be explicitly seen from the comparison of an ordinary regression model, which uses a treatment variable, and a T-learner, which consists of a set of individual treatment-specific models ( Shalit et al. 2017 , Schrod et al. 2023 ). Notably, most recent counterfactual DL approaches also take into account potential treatment selection biases as might be present in observational studies, for which regularization techniques for distributional balancing are used or adversarial learning techniques ( Johansson et al. 2016 , Shalit et al. 2017 , Yao et al. 2018 , Yoon et al. 2018 , Schrod et al. 2022 ). Confounding effects due to treatment biases, however, might be of less relevance in the context of HTS, where interventions are a priori unbiased.

Here, we will introduce CODEX (COunterfactual Deep learning for the in silico EXploration of cancer cell line perturbations), which builds on counterfactual DL approaches to provide a general framework to model HTS data. In contrast to existing counterfactual DL approaches, CODEX facilitates the prediction of unseen perturbation combinations by learning from individually applied interventions and complementary combinations. CODEX can account for nonlinear combinatorial effects and can incorporate prior knowledge about gene-gene relationships, such as provided by Gene Ontologies (GO). We demonstrate that CODEX can extrapolate the space of interventions to new cell lines and new treatment combinations, and via prior knowledge even to completely unseen perturbations and combinations thereof. This is illustrated for both bulk and single-cell transcriptomics data, and for the predictions of drug responses and perturbed gene-expression profiles.

2.1.1 Model architecture

The CODEX approach ( Fig. 1 ) is based on deep neural network architectures for counterfactual reasoning ( Johansson et al. 2016 , Shalit et al. 2017 , Yoon et al. 2018 , Schrod et al. 2022 ). Let x i represent a vector of unperturbed molecular features that characterizes a bulk cell line specimen i , or, in the context of single-cell HTS, an unperturbed single cell i . Throughout our analyses, gene expression levels are used as input features, but in principle, any omics data could be used. Note, however, that transcriptomics data are most readily available and have been shown to be most relevant for drug sensitivity predictions ( Costello et al. 2014 ). Further, let the vector t i = ( t i 1 , … , t i K ) represent the perturbations applied in experiment i , where each element t i k ∈ 0 , 1 indicates whether intervention k was applied (1) or not 0. Thus, t i encodes the complete set of perturbations applied in experiment i . In case of chemical interventions, an additional vector d i = ( d i 1 , … , d i K ) encodes drug dosages, respectively. The prediction target will be denoted as y i ∈ Y ⁠ , which can be any variable or set of variables characterizing the outcome of the perturbation, such as measures of drug efficacy ( Costello et al. 2014 ), drug synergies ( Preuer et al. 2018 ), as well as perturbation profiles ( Lotfollahi et al. 2023 , Roohani et al. 2023 ). Thus, CODEX requires the triplet of input information consisting of a control (the unperturbed transcriptome), the intervention (potentially associated with dosages), and the recorded downstream effects.

CODEX architecture used for causal representation learning: an unperturbed profile X is transformed to map the effect of specific interventions T, which might be associated with a dosage D, to the perturbed outcome Y. The unperturbed state is first encoded in a latent state e, which is passed through the respective treatment specific representations ft. Treatment combinations are naturally incorporated by simultaneously propagating profiles through respective treatment arms. Further, if dosage specific interventions are used, the dosage is incorporated as input variable to the intervention specific layers. Finally, the individual effects are aggregated and combined by a shared decoder d to model the perturbed outcome.

CODEX architecture used for causal representation learning: an unperturbed profile X is transformed to map the effect of specific interventions T , which might be associated with a dosage D , to the perturbed outcome Y . The unperturbed state is first encoded in a latent state e , which is passed through the respective treatment specific representations f t . Treatment combinations are naturally incorporated by simultaneously propagating profiles through respective treatment arms. Further, if dosage specific interventions are used, the dosage is incorporated as input variable to the intervention specific layers. Finally, the individual effects are aggregated and combined by a shared decoder d to model the perturbed outcome.

2.1.2 CODEX for drug-synergy prediction

2.1.3 codex for the prediction of single-cell perturbation profiles, 2.1.4 codex for the prediction of drug dosage effects.

Different drug dosages add to the combinatorial complexity of intervention experiments and can introduce complex nonlinear dependencies. In principle, dosage effects can be incorporated into CODEX via different treatment branches. This, however, would prohibit the extrapolation to unseen drug dosages and would substantially increase model complexity. Therefore, rather than introducing additional dose representations, CODEX incorporates dosage information as an additional categorical feature d i k which adds to the first layer of each treatment specific branch k ( Fig. 1 ).

2.1.5 CODEX for the prediction of unobserved perturbations

CODEX infers a distinct model branch for each individual perturbation, effectively reducing the number of model representations to the number of distinct perturbations. However, it does not require that those perturbations were independently observed. Rather, the training data can comprise perturbations observed in combinations. Consequently, each combination adds information to decipher the individual model branches. We make use of this feature and introduce a weighting scheme to share information among perturbations.

We illustrate this concept for CRISPR interference screens, where the individual perturbations correspond to silenced genes. In this case, we use gene similarities derived by ( Roohani et al. 2023 ), which aggregate information from GO ( Gene Ontology Consortium 2004 ). The basic idea is to compute the Jaccard index between a pair of genes j , j ′ as J j , j ′ = | N j ∩ N j ′ | | N j ∪ N j ′ | ⁠ , where N u is the set of pathways containing gene u . This is the fraction of shared pathways between two genes.

Consider an unobserved gene knockout of gene j ′ ⁠ . Then, we can construct the treatment proxy model by setting t i ( j ) = C J j , j ′ ⁠ , where C is determined by normalization C = ∑ j = 1 K t i ( j ) ⁠ . In the case of single unobserved perturbations, the proxy vector is used as is, otherwise, respective proxy vectors and observed treatment vectors are added up. For instance, let ( 1 , 0 , 0 , 0 , 0 ) be a treatment vector, where the observed interventions are indicated by the 1 at the first position. Further, let ( 0 , 0.4 , 0.35 , 0.25 , 0 ) be a normalized proxy treatment vector, determined as outlined. Then, we evaluated CODEX with the sum of both, t i = ( 1 , 0.4 , 0.35 , 0.25 , 0 ) ⁠ . Thus, in summary, CODEX passes the data through treatment branches which likely resemble the unobserved perturbation. It is noteworthy that the proposed weighting scheme is one specific choice to incorporate prior knowledge to share information among perturbations. Other, viable options could rely, e.g. on measurements of protein–protein interactions or sequential similarity.

For more details about the implementation and the used network architecture refer to the Supplementary Materials .

2.2 Competing methods

2.2.1 algorithms for drug-synergy predictions in cell lines.

We compared CODEX to

TreeCombo ( Janizek et al. 2018 ), which is a tree-based approach using Extreme Gradient Boosting (XGBoost),

DeepSynergy ( Preuer et al. 2018 ), a dense feed forward neural network to predict synergy scores,

MatchMaker ( Kuru et al. 2022 ), which is inspired by DeepSynergy but additionally splits the network into two parts representing the two different interventions, and

MARSY ( El Khili et al. 2023 ), which uses a drug-drug representation network combined with the latent representation learner of DeepSynergy.

In contrast to CODEX, the competing models additionally use chemical drug encodings as input.

2.2.2 Algorithms for the prediction of post-perturbation profiles

We compared CODEX to:

Random Baseline: in line with Lotfollahi et al. (2023) , we implemented a random baseline to assess the relative benefit of CODEX in the context of dosage extrapolations.

Linear baseline: we implemented a linear baseline which simply averages the downstream predictions of individual perturbations to predict the effect of perturbation combinations.

Gene Regulatory Network (GRN): GRN, as implemented by Roohani et al. (2023) , infers a GRN to linearly propagate the effect of gene perturbations.

Compositional Autoencoder (CPA) ( Lotfollahi et al. 2023 ): CPA is based on a Variational Autoencoder (VAE) architecture trained to encode both control and perturbation profiles. It encodes different perturbations and dosages in a latent space. This space is made indistinguishable with respect to the different interventions using an adversarial discriminator. Downstream effects are predicted by (1) encoding control cells and (2) decoding them with respective interventions activated in the latent space.

Graph-Enhanced gene Activation and Repression Simulator (GEARS) ( Roohani et al. 2023 ): GEARS is a generative approach assuming a single type of cell. It uses two separate Graph Neural Networks (GNNs) to encode additional prior knowledge about gene–gene relationships and perturbation relationships. The first GNN embeds the unperturbed state using a gene co-expression knowledge graph, and the second GNN learns perturbation embeddings using a graph derived from GO. The states are combined using the respective perturbations, and a feed-forward decoder is used to reconstruct the post-perturbation gene expression.

Linear CODEX: lin-CODEX removes the nonlinear effect combination from CODEX. This baseline is included to illustrate the benefit of the effect decoder, and thus, serves as an ablation study. For model inference, the trained CODEX model is evaluated only for individual model branches and subsequently averaged to receive predictions for respective treatment combinations.

3.1 CODEX improves drug-synergy predictions in cancer cell lines

DrugComb ( Zagidullin et al. 2019 ) provides more than 700 000 recorded drug combinations for 8379 different drugs on 2320 different tissues, making it an invaluable resource to explore drug combinations. However, the combinatorial complexity prohibits a comprehensive experimental exploration and asks for computational solutions. For instance, exploring all combinations of two drugs in all tissues would correspond to ∼10 11 experiments. Following El Khili et al. (2023) , we extracted 670 unique drugs and a set of 2353 corresponding drug pairs on 75 selected cancer cell lines. Synergy effects were extracted from DrugComb ( Zagidullin et al. 2019 ), and the normalized untreated cancer cell lines were retrieved from CellMiner ( Reinhold et al. 2012 ). Lowly expressed genes with   log   2 ( RPKM + 1 ) < 1 and a variance smaller than 0.8 were excluded, resulting in a final set of 4639 features. For validation, we followed El Khili et al. (2023) , and performed two different settings:

a 5-fold cross-validation, where random triplets of cell line-drug–drug combinations were selected for testing,

a stratified 5-fold cross-validation strategy, where individual treatment combinations were removed from the training, meaning that the test fold contains unseen drug combinations.

This validation strategy does not prohibit that similar cell lines and perturbations are captured by the training and test data, which enables meaningful extrapolations among cell lines and perturbations. For performance comparison, we evaluated the Zero Interaction Potency (ZIP) ( Yadav et al. 2015 ), which measures the change in potency of the dose–response curves for individual drugs and their combinations. The main assumption is that noninteracting drugs do not change the underlying dose–response curves. ZIP combines the Bliss independence model ( Bliss 1939 ) and Loewes Additivity ( Loewe 1953 ), using fitted dose–response curves rather than experimentally observed points. All baselines were extracted from ( El Khili et al. 2023 ).

3.1.1 Performance comparison

The results are given in Table 1 . We evaluated three performance measures: the Spearman Correlation Coefficient (SCC), the Pearson Correlation Coefficient (PCC), and the MSE between ground truth and predictions. In the first scenario, the test set contained drug–drug combinations which were already part of the training data, although they were not seen in the same cell line. CODEX yielded the highest PCC and lowest MSEs between predictions and ground truth values and was only out-competed by MARSY with respect to SCC. The remaining competitors performed substantially worse than both approaches.

Mean 5-fold cross-validation performance for the prediction of ZIP synergy scores, in terms of Spearman Correlation Coefficient (SCC), Pearson Correlation Coefficient (PCC), and Mean Squared Error (MSE) for unseen cell lines (1) and unseen drug combinations (2). a

ModelSetting (1) Setting (2)
SCC ↑PCC ↑RMSE ↓SCC ↑PCC ↑RMSE ↓
CODEX0.752 (±0.004) 0.725 (±0.005)
MARSY 0.886 (±0.005)5.36 (±0.19) 0.875 (±0.005)5.62 (±0.15)
MatchMaker0.742 (±0.004)0.873 (±0.006)6.11 (±0.32)0.720 (±0.006)0.864 (±0.007)6.23 (±0.12)
TreeCombo0.737 (±0.004)0.870 (±0.003)5.73 (±0.17)0.689 (±0.005)0.856 (±0.006)6.00 (±0.15)
DeepSynergy0.701 (±0.003)0.869 (±0.004)5.78 (±0.18)0.676 (±0.006)0.860 (±0.008)5.95 (±0.16)
ModelSetting (1) Setting (2)
SCC ↑PCC ↑RMSE ↓SCC ↑PCC ↑RMSE ↓
CODEX0.752 (±0.004) 0.725 (±0.005)
MARSY 0.886 (±0.005)5.36 (±0.19) 0.875 (±0.005)5.62 (±0.15)
MatchMaker0.742 (±0.004)0.873 (±0.006)6.11 (±0.32)0.720 (±0.006)0.864 (±0.007)6.23 (±0.12)
TreeCombo0.737 (±0.004)0.870 (±0.003)5.73 (±0.17)0.689 (±0.005)0.856 (±0.006)6.00 (±0.15)
DeepSynergy0.701 (±0.003)0.869 (±0.004)5.78 (±0.18)0.676 (±0.006)0.860 (±0.008)5.95 (±0.16)

The best performing scores were highlighted in bold, and the arrow indicates whether the measure is maximized or minimized.

Next, we evaluated the more challenging scenario of validation experiment (2), where the test fold contains only unseen drug combinations. As expected, all approaches performed worse than in the validation experiment (1). However, both CODEX and MARSY still achieve reasonable performance with PCCs > 0.87. As previously, CODEX performed best with respect to PCC and MSE and MARSY with respect to SCC. The same holds true considering an alternative synergy score S mean ⁠ , where MARSY slightly out-competed CODEX, while the other approaches showed inferior performance (refer to Supplementary Table S1 ).

3.2 CODEX improves dose–response predictions in single-cell data

We studied CODEX’s ability to predict dose specific responses in single-cell perturbation data. We used the Sciplex2 data ( Srivatsan et al. 2020 ), which contain 12 656 post-perturbation transcriptomic profiles of A549 human lung adenocarcinoma cells measured for four different perturbations (Vorinostat, BMS-34 554, Dexamethasone and Nutlin3a) and seven different concentrations (in total 28 drug–dose combinations). We used the preprocessed data provided by Lotfollahi et al. (2023) , which were normalized and log transformed and limited to the 5000 most variable genes. Accordingly, we left out the second highest treatment concentration and tested CODEX ability to reproduce the dose–response curves and to interpolate to unseen concentrations.

3.2.1 Performance comparison

Figure 2 shows dose–response curves (response versus dosage) for the top three Differentially Expressed Genes (DEGs) for each treatment. Ground truth values (solid lines and error bars), with dots representing individual measurements, are contrasted with predictions by CPA (dotted lines) and CODEX (dashed lines). The second highest dosage was hold out for testing and is highlighted by a vertical dashed line. CODEX achieves a substantially better description of both training and test dosages, capturing also nonlinear dependencies ( Fig. 2 ). This is also supported by considering the reconstruction performance in terms of the coefficient of determination R 2 ( Fig. 3 ), for all genes (blue) and for the top 50 DEGs (orange). CODEX performed best for Dexamethasone, Nutlin-3a, Vorinostat and the linear baseline for BMS-34554, where substantial improvements on the top 50 DEGs were observed for the drugs Vorinostat (0.94 vs. 0.85 for CODEX vs. CPA), and Nutlin-3a (0.84 vs. 0.81 for CODEX vs. CPA). On average, we observed a median performance gain of 9.1% for the top 50 DEGs compared to CPA ( Supplementary Fig. S1A ). Furthermore, considering all dosages (both validation and out-of-distribution data), CODEX showed significantly improved R 2 values compared to CPA, and a median performance increase of 11.5% for the top 50 DEGs ( Supplementary Fig. S1B ). We further observed that the linear baseline performs well on BMS-34554, while it failed for Dexamethasone and Vorinostat. This suggests that linear effects are sufficient to model some but not all interventions.

Dose-dependent reconstruction of the gene expression levels of the top 3 DEGs for each of the treatments. Ground truth values (solid lines and error bars), with the size of the dots representing the number of available samples for each measurement, are contrasted with predictions by CPA (dotted lines) and CODEX (dashed lines). The dashed vertical line indicates the dosage left out for testing.

Dose-dependent reconstruction of the gene expression levels of the top 3 DEGs for each of the treatments. Ground truth values (solid lines and error bars), with the size of the dots representing the number of available samples for each measurement, are contrasted with predictions by CPA (dotted lines) and CODEX (dashed lines). The dashed vertical line indicates the dosage left out for testing.

R2 reconstruction performance of the mean post perturbation gene expression of all genes (blue) and the top 50 DEGs (orange) obtained for the second highest dose (left out for testing) on Sciplex2.

R 2 reconstruction performance of the mean post perturbation gene expression of all genes (blue) and the top 50 DEGs (orange) obtained for the second highest dose (left out for testing) on Sciplex2.

3.3 CODEX can predict molecular downstream effects of drug combinations in single cells

We further investigated molecular downstream effects in terms of perturbed transcriptomes of 13 anticancer drugs on A549 cells obtained from the combinatorial sci-Plex (Combosciplex) assay ( Lotfollahi et al. 2023 ). The data comprise 18 different anticancer medications evaluated on 63 430 single cells, including a total of 25 unique drug combinations and seven individually observed drug perturbations (for an overview, see Supplementary Table S5 ). Similar to the Sciplex2 data, Combosciplex was normalized, log-transformed, and restricted to the 5000 most variable genes. To study the ability of CODEX to infer the effects of unseen drug combinations, we excluded four combinations during model development and evaluated the reconstruction of perturbed transcriptomes using R 2 scores. We compared CODEX to the linear baseline, which averages the observed single effects, CPA ( Lotfollahi et al. 2023 ), and lin-CODEX.

3.3.1 Performance comparison

Considering median R 2 values, we observed that CODEX performs best with an increase of 5.5% on the top 50 DEGs compared to CPA ( Supplementary Fig. S2 ). The performance resolved for the individual hold-out treatment combinations is shown in Fig. 4 . There, already lin-CODEX is able to improve predictions compared to the linear baseline. However, it is outperformed by both CPA and CODEX, suggesting a crucial role of nonlinear decodings (see also Supplementary Fig. S2 ). The performance gains of CODEX were mainly attributed to the combinations including Alvespimicin ( Fig. 4 ). Those were weakly supported by the training data with significantly different effects than observed during training ( Lotfollahi et al. 2023 ). This is further illustrated in a UMAP representation, where the left-out treatment combinations containing Alvespimin (purple triangles, with blue circles) are separated from the majority of the training combinations (red circles), both on the latent space and the final predictions ( Fig. 5A and B , see also Supplementary Fig. S3 for increased resolution).

R2 reconstruction performance of the mean post perturbation gene expression profiles for all genes (blue) and top 50 DEGs (orange) obtained for the four held out treatment combinations on the Combosciplex data.

R 2 reconstruction performance of the mean post perturbation gene expression profiles for all genes (blue) and top 50 DEGs (orange) obtained for the four held out treatment combinations on the Combosciplex data.

UMAP of the combined latent representation z of CODEX (A) and of its final predictions (B) for all possible treatment combinations of the Combosciplex data. Each colored triangle represents a treatment component, and the full squares represent their respective combinations. Training combinations are circled in red, test combinations in blue, and the ground truth for the unobserved combinations in green (only for final predictions).

UMAP of the combined latent representation z of CODEX (A) and of its final predictions (B) for all possible treatment combinations of the Combosciplex data. Each colored triangle represents a treatment component, and the full squares represent their respective combinations. Training combinations are circled in red, test combinations in blue, and the ground truth for the unobserved combinations in green (only for final predictions).

The Combosciplex data comprise only a subset of all possible drug combinations. We next used CODEX to infer all remaining unobserved drug combinations and visualized the results using a UMAP representation on the latent linear effects ( z i ) and on the final predictions (top and bottom of Fig. 5 , respectively). UMAP on the latent representation reveals distinct clusters associated with the dominant effects attributed to Pirarubicin, Danusertib, Tanespimycin, and Sorafenib. Those are not revealed on the final predictions (bottom Fig. 5 ), suggesting that the nonlinear adjustments of the decoder regulate the treatment effect sizes. We further confirmed that out-of-sample predictions (blue circles) are in near vicinity of respective ground truths (green circles), with one-to-one correspondence indicated by connecting black lines.

3.4 CODEX facilitates predictions of combined gene knock-out perturbations from CRISPRi in single cells

Finally, we used CODEX to explore genetic perturbations implemented via CRISPRi. CRISPRi facilitates the targeted silencing of genes and has become increasingly feasible in large-scale single-cell perturbation screens in recent years. For instance, Norman et al. (2019) proposed Perturb-seq to perform single-cell pooled CRISPRi screens and provided data containing a total of 284 unique knock-out conditions, comprising 131 unique two-gene knockouts, on 108 000 single-cells from the K562 cancer cell line ( Norman et al. 2019 ). In the first experiment, we selected test combinations where the individual perturbations were part of the training data. To further guarantee a fair comparison to state-of-the-art competitors, we followed Roohani et al. (2023) and implemented the same 5 test-training splits using the same feature set of 5045 genes. We assessed the performance in reconstructing perturbed profiles in terms of normalized MSE on the top 20 DEGs (normalized using the reconstruction error of the random baseline) and PCC, where we evaluated the reconstructed effect on all genes (not considering the control background) and the reconstruction of the top 20 DEGs.

3.4.1 Performance comparisons

Also, considering this application, CODEX substantially improves the recently suggested state-of-the-art solutions CPA and GEARS and by far out-competes GRN. Considering PCC for the top 20 DEGs ( Fig. 6C ), we observed a median value of 0.98 for CODEX compared to 0.96 for lin-CODEX, 0.91 for CPA, 0.92 for GEARS, and 0.82 for GRN. This is consistent with the performance in terms of normalized MSEs ( Fig. 6A ) and by considering PCC for all genes ( Fig. 6B ). To further substantiate these findings, we performed an additional validation experiment corresponding to the setting of ( Lotfollahi et al. 2023 ), where each combination of perturbations is left out in one of 13 splits, and evaluated the reconstruction performance in terms of the R 2 score on all genes and on the top 50 DEGs, in line with ( Lotfollahi et al. 2023 ). These results demonstrate that in this comparison, CODEX out-competes the other methods significantly ( Supplementary Fig. S4 ).

Reconstruction performance in terms of normalied MSE for the top 20DEGs (A), PCC of the effect on all genes (B) and PCC of the top 20 DEGS (C) of unseen perturbation combinations on the Norman et al. (2019) data.

Reconstruction performance in terms of normalied MSE for the top 20DEGs (A), PCC of the effect on all genes (B) and PCC of the top 20 DEGS (C) of unseen perturbation combinations on the Norman et al. (2019) data.

3.5 CODEX facilitates predictions of unobserved gene knock-outs from CRISPRi in single cells

CODEX can predict downstream effects of unseen perturbations, combinations thereof with observed perturbations, as well as combinations of unseen perturbations only. This is achieved by proxy models resembling unseen perturbations. These proxies were established by a weighting scheme summarizing related gene perturbations (see Methods). To assess CODEX’s capability to predict unseen perturbations, we performed three additional experiments using the Norman et al. (2019) dataset, where we inferred the effect of left-out perturbations for a single missing perturbation (0/1 seen), a missing perturbation in a pair (1/2 seen), and two missing perturbations in a pair (0/2 seen). Due to the limited number of single gene perturbations (105), we tested CODEX on two additional genetic perturbation screens generated by Replogle et al. (2022) , comprising a total of 1543 RPE-1 and 1092 K562 single genetic perturbations, with 175 398 and 192 648 measured single cells, respectively. We again adapted the experimental setup of Roohani et al. (2023) and compared the reconstruction error based on five identical sets of held-out perturbations. Performance was again assessed using normalized MSE on the top 20 DEGs, PCC of the effect on all genes, and PCC on the top 20 genes.

3.5.1 Performance comparison

For the Norman et al. (2019) data, we observed that GEARS yielded the lowest median MSE in all three settings ( Fig. 7A ). However, CODEX consistently out-competes all other methods with respect to PCC ( Fig. 7B and C ). There, CODEX improves the median PCC on the top 20 DEGs from 0.90 to 0.92 for 0/1 seen, 0.83 to 0.93 for 1/2 seen, and 0.79 to 0.89 for 0/2 seen compared to the second-best performing model (GEARS).

Reconstruction performance in terms of normalied MSE for the top 20DEGs (A), PCC of the effect on all genes (B) and PCC of the top 20 DEGS (C) of unseen perturbations for varying degrees of difficulty. The inferred effect of perturbations is evaluated for a single missing perturbation (0/1 seen), a missing perturbation in a pair (1/2 seen), and two missing perturbations in a pair (0/2 seen) on the Norman et al. (2019) data.

Reconstruction performance in terms of normalied MSE for the top 20DEGs (A), PCC of the effect on all genes (B) and PCC of the top 20 DEGS (C) of unseen perturbations for varying degrees of difficulty. The inferred effect of perturbations is evaluated for a single missing perturbation (0/1 seen), a missing perturbation in a pair (1/2 seen), and two missing perturbations in a pair (0/2 seen) on the Norman et al. (2019) data.

On the Replogle et al. (2022) data, CODEX performs best on all compared measures for both K562 and RPE-1 cells ( Fig. 8A–C ). On K569 cells, CODEX improves the median PCC from 0.31 to 0.49, and on RPE-1 cells from 0.53 to 0.61, compared to GEARS ( Fig. 8B ). In comparison, the performance of CPA and GRN is highly compromised with respect to MSE ( Fig. 8A ) and PCC on all genes ( Fig. 8B ). These findings suggest that when many single perturbations are known, gene graph proxies can be highly efficient. This holds true for both CODEX and GEARS. For CODEX, however, this aspect seems to be even more beneficial.

Reconstruction performance in terms of normalied MSE for the top 20DEGs (A), PCC of the effect on all genes (B) and PCC of the top 20 DEGS (C) for an unseen single perturbation on single cells from K562 and RPE-1 cell lines (Replogle et al. 2022).

Reconstruction performance in terms of normalied MSE for the top 20DEGs (A), PCC of the effect on all genes (B) and PCC of the top 20 DEGS (C) for an unseen single perturbation on single cells from K562 and RPE-1 cell lines ( Replogle et al. 2022 ).

We proposed CODEX as a general framework to model high-throughput perturbation experiments. CODEX naturally facilitates diverse causal prediction tasks, can learn nonlinear effect combinations, and can model different intervention types. This was shown for both chemical and genetic perturbation combinations. Moreover, we suggested a weighting scheme to perform predictions for completely unseen perturbations. Thus, applications of CODEX are rich and the outlined performance comparisons suggest that CODEX offers highly competitive performance across diverse applications.

CODEX builds on counterfactual reasoning and implements different perturbations via distinct model representations, while most state-of-the-art approaches include perturbations as distinct variables ( Lotfollahi et al. 2023 , Roohani et al. 2023 ) or represent them using chemical embeddings ( Preuer et al. 2018 , Kuru et al. 2022 , El Khili et al. 2023 ). Causal modeling of representations increases model complexity, but has the benefit that highly complex downstream effects can be captured through the model architecture. This might be particularly relevant for the prediction of complex phenotypes, as our empirical results for the prediction of perturbed single-cell transcriptomes suggest. To enable out-of-distribution predictions for unseen drug combinations, different perturbation branches are combined. This has two advantages. First, subsequent network layers can account for potential nonlinearities. Those were repeatedly shown to be key to the outlined prediction tasks as supported by our comparison to the linear CODEX baseline (lin-CODEX). Second, and most importantly, CODEX can learn the individual perturbation from combinations of perturbations. Given the combinatorial complexity of high-throughput perturbation screens, it is not tractable to comprehensively explore the space of perturbations and their downstream effects experimentally. However, CODEX makes use of redundancy, and might further improve extrapolation performance as perturbations are captured in more and more diverse combinations, not even limited to pairs of perturbations. This will be crucial to maximize the use of upcoming, potentially more comprehensive perturbation screens.

CODEX has a number of limitations. First, the main issue that afflicts all prediction algorithms of such a kind is that in vitro cell line experiments provide only a very limited picture of the mechanisms taking place in vivo . It will be key to further consider the in vivo model transfer. Multiple approaches were already suggested, comprising Velodrome ( Sharifi-Noghabi et al. 2021 ) and CODE-AE ( He et al. 2022 ). However, they mainly rely on a measure of distributional similarity and do not address that biological processes between human and cell lines differ, with the former comprising, e.g. complex cellular interactions taking place in the tumor microenvironment. Second, nowadays perturbation experiments are destructive, meaning that we never observe an individual cell before and after the intervention. Observing the latter might substantially deepen our understanding of molecular downstream effects. In this direction, e.g. Bunne et al. (2023) and Dong et al. (2023) , attempt to identify counterfactual cell pairs using optimal transport. However, a combination of those methods with CODEX requires future research, addressing, e.g. potential biases from cell-pair selection.

In summary, CODEX provides a highly potent framework to extrapolate the space of interventions in high throughput perturbation experiments to unseen interventions and combinations thereof. Empirical results suggest that it substantially enhances out-of-distribution predictions and applies to diverse prediction tasks, suggesting rich applications in pharmacogenomics.

Supplementary data are available at Bioinformatics online.

None declared.

The work was supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the e: Med research and funding concept [01ZX1912A, 01ZX1912C to H.U.Z. and M.A.]; the BMBF [01KD2209D] and the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) [AL 2355/1–1 “Digital Tissue Deconvolution—Aus Einzelzelldaten lernen” to S.S. and M.A.]; the BMBF projects FAIrPaCT [01KD2208A to T.B. and A.C.H.]; the BMBF projects PerMiCCion [01KD2101C], MATCH [01KU1910A], DFG TRR274 [408885537], and KFO5002 [426671079] to T.B.

All data used in this article are publicly available. The Gene Expression Omnibus accession numbers to obtain the raw datasets are: Srivatsan et al. (2020) GSM4150378, Lotfollahi et al. (2023) GSE206741, Norman et al. (2019) GSE133344. The data from Replogle et al. (2022) are available from https://doi.org/10.25452/figshare.plus.20022944 and the prepossessed drug-synergy data from https://github.com/Emad-COMBINE-lab/MARSY/tree/main/data ( El Khili et al. 2023 ). Further information and code to reproduce the experiments are provided in the CODEX repository at https://github.com/sschrod/CODEX .

Al-Lazikani B , Banerji U , Workman P. Combinatorial drug therapy for cancer in the post-genomic era . Nat Biotechnol 2012 ; 30 : 679 – 92 .

Google Scholar

Bliss CI. The toxicity of poisons applied jointly 1 . Ann Appl Biol 1939 ; 26 : 585 – 615 .

Bunne C , Stark SG , Gut G et al.  Learning single-cell perturbation responses using neural optimal transport . Nat Methods 2023 ; 20 : 1759 – 68 .

Bush EC , Ray F , Alvarez MJ et al.  Plate-seq for genome-wide regulatory network analysis of high-throughput screens . Nat Commun 2017 ; 8 : 105 .

Gene Ontology Consortium . The gene ontology (GO) database and informatics resource . Nucleic Acids Res 2004 ; 32 : D258 – 61 .

Costello JC , Heiser LM , Georgii E et al.  A community effort to assess and improve drug sensitivity prediction algorithms . Nat Biotechnol 2014 ; 32 : 1202 – 12 .

Csermely P , Korcsmáros T , Kiss HJ et al.  Structure and dynamics of molecular networks: a novel paradigm of drug discovery: a comprehensive review . Pharmacol Ther 2013 ; 138 : 333 – 408 .

Dong M , Wang B , Wei J et al.  Causal identification of single-cell experimental perturbation effects with CINEMA-OT . Nat Methods 2023 ; 20 : 1769 – 79 .

Douglass EF , Allaway RJ , Szalai B et al.  A community challenge for a pancancer drug mechanism of action inference from perturbational profile data . Cell Rep Med 2022 ; 3 : 100492 .

El Khili MR , Memon SA , Emad A. Marsy: a multitask deep-learning framework for prediction of drug combination synergy scores . Bioinformatics 2023 ; 39 : btad177 .

He D , Liu Q , Wu Y et al.  A context-aware deconfounding autoencoder for robust prediction of personalized clinical drug response from cell-line compound screening . Nat Mach Intell 2022 ; 4 : 879 – 92 .

Iorio F , Knijnenburg TA , Vis DJ et al.  A landscape of pharmacogenomic interactions in cancer . Cell 2016 ; 166 : 740 – 54 .

Janizek JD , Celik S , Lee S-I. Explainable machine learning prediction of synergistic drug combinations for precision cancer medicine. bioRxiv, 2018 ; 331769 , preprint: not peer reviewed.

Johansson F , Shalit U , Sontag D. Learning representations for counterfactual inference. In: International Conference on Machine Learning. PMLR, 2016 ; 48 : 3020 – 29 .

Kuru HI , Tastan O , Cicek AE. Matchmaker: a deep learning framework for drug synergy prediction . IEEE/ACM Trans Comput Biol Bioinform 2022 ; 19 : 2334 – 44 .

Ling A , Gruener RF , Fessler J et al.  More than fishing for a cure: the promises and pitfalls of high throughput cancer cell line screens . Pharmacol Ther 2018 ; 191 : 178 – 89 .

Loewe S. The problem of synergism and antagonism of combined drugs . Arzneimittelforschung 1953 ; 3 : 285 – 90 .

Lotfollahi M , Klimovskaia Susmelj A , De Donno C et al.  Predicting cellular responses to complex perturbations in high-throughput screens . Mol Syst Biol 2023 ; 19 : e11517 . page

Norman TM , Horlbeck MA , Replogle JM et al.  Exploring genetic interaction manifolds constructed from rich single-cell phenotypes . Science 2019 ; 365 : 786 – 93 .

Preuer K , Lewis RP , Hochreiter S et al.  Deepsynergy: predicting anti-cancer drug synergy with deep learning . Bioinformatics 2018 ; 34 : 1538 – 46 .

Reinhold WC , Sunshine M , Liu H et al.  Cellminer: a web-based suite of genomic and pharmacologic tools to explore transcript and drug patterns in the nci-60 cell line set . Cancer Res 2012 ; 72 : 3499 – 511 .

Replogle JM , Saunders RA , Pogson AN et al.  Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq . Cell 2022 ; 185 : 2559 – 75.e28 .

Roohani Y , Huang K , Leskovec J. Predicting transcriptional outcomes of novel multigene perturbations with gears . Nat Biotechnol 2023 ; 1 – 9 .

Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies . J Educ Psychol 1974 ; 66 : 688 – 701 .

Schrod S , Schäfer A , Solbrig S et al.  Bites: balanced individual treatment effect for survival data . Bioinformatics 2022 ; 38 : i60 – 7 .

Schrod S , Sinz F , Altenbuchinger M. Adversarial distribution balancing for counterfactual reasoning. arXiv, arXiv:2311.16616, 2023 , preprint: not peer reviewed.

Shalit U , Johansson FD , Sontag D. Estimating individual treatment effect: generalization bounds and algorithms. In: International Conference on Machine Learning . PMLR, 2017 , 3076 – 85 .

Sharifi-Noghabi H , Harjandi PA , Zolotareva O et al.  Out-of-distribution generalization from labelled and unlabelled gene expression data for drug response prediction . Nat Mach Intell 2021 ; 3 : 962 – 72 .

Srivatsan SR , McFaline-Figueroa JL , Ramani V et al.  Massively multiplex chemical transcriptomics at single-cell resolution . Science 2020 ; 367 : 45 – 51 .

Subramanian A , Narayan R , Corsello SM et al.  A next generation connectivity map: l 1000 platform and the first 1,000,000 profiles . Cell 2017 ; 171 : 1437 – 52.e17 .

Yadav B , Wennerberg K , Aittokallio T et al.  Searching for drug synergy in complex dose–response landscapes using an interaction potency model . Comput Struct Biotechnol J 2015 ; 13 : 504 – 13 .

Yao L , Li S , Li Y et al.  Representation learning for treatment effect estimation from observational data . Adv Neural Inform Process Syst 2018 ; 31 :11.

Ye C , Ho DJ , Neri M et al.  Drug-seq for miniaturized high-throughput transcriptome profiling in drug discovery . Nat Commun 2018 ; 9 : 4307 .

Yoon J , Jordon J , Van Der Schaar M. 2018 . Ganite: estimation of individualized treatment effects using generative adversarial nets. In: International Conference on Learning Representations .

Zagidullin B , Aldahdooh J , Zheng S et al.  Drugcomb: an integrative cancer drug combination data portal . Nucleic Acids Res 2019 ; 47 : W43 – 51 .

Month: Total Views:
June 2024 78
July 2024 450
August 2024 191
September 2024 100

Email alerts

Citing articles via, looking for your next opportunity.

  • Recommend to your Library

Affiliations

  • Online ISSN 1367-4811
  • Copyright © 2024 Oxford University Press
  • About Oxford Academic
  • Publish journals with us
  • University press partners
  • What we publish
  • New features  
  • Open access
  • Institutional account management
  • Rights and permissions
  • Get help with access
  • Accessibility
  • Advertising
  • Media enquiries
  • Oxford University Press
  • Oxford Languages
  • University of Oxford

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide

  • Copyright © 2024 Oxford University Press
  • Cookie settings
  • Cookie policy
  • Privacy policy
  • Legal notice

This Feature Is Available To Subscribers Only

Sign In or Create an Account

This PDF is available to Subscribers Only

For full access to this pdf, sign in to an existing account, or purchase an annual subscription.

in silico perturbation experiments

In-silico perturbations: leveraging single cell seq to perform perturbation experiments

Omnia mutantur: how researchers are using computational models to predict cell fates, and their regulatory networks..

in silico perturbation experiments

Dear Community,

Like the post last week , We will look at two methods that have been published (One a preprint and another reviewed) recently. I am calling the 76 members of this community to share your curiosity and enthusiasm about science and the field of bioinformatics by just doing one simple thing. Talk about our newsletter to your friends or colleagues, and show these posts to them and let them join our exciting adventure into the unknown. You will find the buttons to do this along the way as you read through. Also, I would recommend using the browser for this read, its a bit longer than the previous one.

Let’s Go!!!

in silico perturbation experiments

I am a machine, possibly an improbable one but yet I exist. Through careful orchestration of phenomena that I have evolved over millennia, I serve to exist. I evolve, change, adapt and mould myself. I am made of interesting components, Each of them does not know why they are there by themselves but I encapsulate them into a neat little package a robot of sorts to survive is my prerogative; For I am a cell.

The biological world and its “umwelt” (my new favourite german word) is as complex as any star system or galaxy that exist. I often marvel at the complexities in the packages they come in; Not only that they are usually tiny (of course we have extreme outliers as well) but they are especially well-tuned machines that perform complex tasks within their tolerances to what I would consider near perfection.

It is likely that you had taken a biology lesson in the past to have come across two of the plethora of these sub-machines that are of interest in this week's edition of the newsletter. They both are in a way consequent steps of the process by which your cell’s DNA is converted into a final product the protein. Which in turn provides and contributes to the cell's function, identity and in-built mechanisms for overall production and regulation of the cell itself. Confusing?

Cells contain a nucleus: This package has your genetic information the DNA.

Through a complex orchestration of events, DNA is then converted into RNA and then the protein.

The proteins in turn can dictate the fate of the cell and the triggering of events that might lead to the propagation of the cell.

So with this out of our way, let's dive into two recently published methods that aim towards a computational approach to perturb steps of the “cellular information highway”.

Dissecting cell identity via network inference and in silico gene perturbation.

Compressed Perturb-seq: highly efficient screens for regulatory circuits using random composite perturbations.

The original cells are but two, the sperm and the egg and the series of events that occur when they meet is an epic in the making. A careful orchestration of events, one such pivotal event is the activation of transcription factors. They are the gatekeepers of sorts that allow for the expression of specific genes, they fine-tune the expression of the genes allowing the exact specification of the cell fate i.e., which cell belongs where and its function.

Classical experiments that try to study the functions of transcription factors utilize one or many of the perturbation methods such as Genetic knockouts, Chemical inhibition, and Overexpression to name a few and observe the outcome of these experiments on the cell behaviour. But as one can imagine, these methods can be tedious and require thorough investigation so that the off-target effects of these interventions are minimized and our interpretation can be robust.

With the advent of single-cell technologies especially when it comes in conjunction with CRISPRscreens could be used to analyze the functions of transcription factors in the context of cell identity. Many computational methods that try to address the cell fate problem have their limitations.

The current study approaches this problem in the context of gene-regulatory networks (GRN). GRNs are models of how genes interact with each other and regulate the expression of other genes in a cell. I would compare this with your smart home network; Each and every instrument interacts with this network and thereby consumes bandwidth but by turning on and off certain devices it is possible to prioritize tasks on one device or the other. Similarly, GRNs comprise of a network of transcription factors(TF), proteins and other molecules that interact in order to control gene expression.

The use of GRNs inferred from single-cell multi-omics data to perform in silico transcription factor perturbations is the main aim of the paper, and therefore, one can simulate the consequent changes in cell identity using only unperturbed wild-type data. So how do they do this?

CellOracle, the tool they created uses a custom GRN modelling that stimulates the global downstream changes in gene expression following a TF knockout or over-expression. The insilico perturbation involves four key steps:

Construction of Cell-type or cell-state-specific GRN configurations using cluster-wise regularized linear regression models with multi-omics data.

By using these GRN models, shits in target gene expression in response to TF perturbations are calculated. This shift propagates the shift in expression of the gene rather than a change in absolute expression, ie this seems to be a sort of a proxy estimation of by how much the gene moves in a direction when TF is perturbed. Giving us hints to calculate the broad downstream effects of TF perturbation.

The cell-identity transition probability is converted into a weighted local average vector. This simulates the directionality of cell-state transition for each cell following the perturbation of candidate TFs.

Then the multi-dimensional gene expression shift vector is reduced to a two-dimensional vector for robust predictions.

The clear aim is to determine for a given transcription factor X in cell state C when perturbed what is the change ∂X on

The clear aim is to determine for a given transcription factor X in cell state C when perturbed what is the change ∂X on ∂C and not measuring the induced gene expression of the genes in cell state C.

Inference and benchmarking of the model:

Overview of step 1 of inference:.

CellOracle uses scATAC-seq data to identify accessible promoter and enhancer DNA sequences. This data allows them to construct a base GRN model.

The Transcription start sites are obtained from HOMER and by using Cicero to identify the co-accessible scATAC-seq peaks the authors are able to distinguish accessible promoters and enhancers.

The DNA sequences of these elements are scanned for TF-binding motifs, generating a base GRN structure of all the potential regulatory interactions in the species of interest.

The benefit of the process is to narrow the scope of possible regulatory candidate genes before model fitting, This also helps to identify and define the directionality of regulatory edges in the GRN.

All base GRNS are built into CellOracle library and provide an alternative solution when scATAC-seq is not available. This is the key advantage of using this tool when you don't have access to scATAC seq experiments.

in silico perturbation experiments

Overview of Step 2 of inference:

The authors use scRNA-seq data to identify active connections in the base GRN.

This generates cell-type or cell-state-specific GRN configurations for each cluster in scSEQ data.

Since CellOracle uses GRNs from the genomic sequence information and infers the structure and directionality, it does not need to know the cause of the directionality based on expression data.

It uses a relatively simple modelling method for inference: Regularized linear machine-learning model (see concepts in brief for an explanation)

This strategy allows signals above the transcription chain to simulate the TF perturbation.

The performance of this model was then assessed by Area under the receiver operating characteristics

The key claim is that by using CellOracle they can effectively interrogate a network within cells and therefore introduce perturbations in-silico that can identify cell-identity dynamics.

GRN analysis and TF KO in hematopoesis

Any method once proposed has to be tested. In this case, the authors do test it on a well-characterized differentiation paradigm. That is mouse haematopoiesis. The authors apply CellOracle to a 2730-cell scRNA-seq atlas of myeloid progenitors differentiation. To perform in-silico perturbations they constructed GRN models for each of the 24 myeloid clusters representing megakaryocyte and erythroid progenitors (MEPs) and granulocyte-monocyte progenitors (GMPs) which are differentiating toward erythrocytes, megakaryocytes, monocytes, and granulocytes.

By using CellOracle they perform an in-silico perturbation analysis of Spi1 and Gata1 (both are well-studied transcription factors.

They devise a ‘perturbation score’ metric, which compares the directionality of the perturbation vector to the natural differentiation vector.

A negative score implies that TF KO delays or blocks differentiation, whereas a positive score implies loss of TF function promotes differentiation.

Spi1: the KO of this transcription factor yeilded positive perturbation scores for MEPs and a negative score for GMPs. These scores suggest that Spi1 KO inhibits GMP differentiation and promotes MEP differentiation.

Overall these results align with previously reported findings in experimental setups

in silico perturbation experiments

  • Overview of CellOracle and its application in Heamatopoiesis. The figure broadly describes how CellOracle constructs the GRNS followed by mapping the clusters into six main cell types from scSEQ data. Overall we can see the differentiation vectors and how synthetic perturbation can be used to study the cell fate.

Similarly, they perform a wide array of in-silico transcription factor perturbations in different models and an array of transcription factors to show that their method is able to effectively simulate cell-state-specific TF function and corroborates with previously known knowledge of mechanisms that regulate cell fate in haematopoiesis and ground-truth in vivo phenotypes. The results also indicate that CellOracle enables objective and scalable in silico gene perturbation analysis. It has been used in an independent study in the context of human development; Where researchers applied CellOracle to understand medium spiny neuron maturation in the human fetal striatum. These are the cases where in-silico approaches can aid us in understanding

Not every method is one ring to rule them all, so every method has its own shortcomings. Similarly, CellOracle has the following limitations;

It visualizes the simulation vector within the existing trajectory space; therefore cell states that do not exist in input scRNA-seq data cannot be analysed. This makes me think, are there methods that impute unknown cell states in scSEQ data, if so can these methods be combined to unravel the Blackbox of transitional states?

TF simulation is limited by input data availability and quality, therefore a perturbation cannot be simulated if a TF-binding motif is unknown or TF expression is too sparse. This is challenging especially for less studied organisms which have low-quality data along with a non-robust description of TFs and their binding sites.

in silico perturbation experiments

Compressed Perturb-seq: highly efficient screens for regulatory circuits using random composite perturbations

As we have seen above, perturbation experiments are quite important for understanding the processes that occur in the cell. Through perturbation, the functional changes in the cellular transcriptional direction and therefore the cell fate have been determined through a computational framework.

But what if you are an immunologist, and are interested to learn about cellular processes such as innate immune response and are planning to perform an interference screen on a library of targets? Here is where perturb-Seq comes to play.

Perturb-Seq (I am going to call this P-Seq in this article), is a relatively new single-cell sequencing method that enables screening in an effective way for regulatory circuits. This method is in turn a convergence of scATAC-seq, random composite perturbations, and scRNA-seq. It measures the effects of multiple perturbations in a single cell. Instead of multiple individual experiments which are often the case in traditional screening methods, P-seq leverages the methods aforementioned to efficiently measure the effects of multiple perturbations in the cells. By using single-cell versus the traditional bulk methods, we as researchers can employ this in several rare cell types for example instead of relying on established cell lines as a proxy.

The authors of the current paper, propose a new framework for P-seq, which aims to address the challenges of large-scale P-seq since they are quite expensive and limited by the number of available cells, especially for primary cell systems. By leveraging the technique of compressed sensing from signal processing, they provide us with an essential tool kit to be able to study perturbations in a more effective manner.

Why compressed sensing you might ask:

Generally, perturbation effects tend to be sparse in that most perturbations only affect a small number of genes or go-regulated gene programs (similar to GRNs above). Instead of studying and measuring individual perturbations, the authors suggest measuring a much smaller number of random combinations of perturbations and accurately learn the effects of individual perturbations from a composite sample using sparsity-[romoting algorithms. This might also allow us to efficiently learn both the single gene perturbations and what is the consequence and also the network interactions.

The authors propose two experimental strategies and an inference method to learn individual perturbation responses from composite samples. Using this framework they are able to give insights into immune regulatory functions and illustrate their connection to human disease mechanisms by integrating data from GWAS and eQTL studies.

For the sake of brevity of the article, and I have been tired the past few days (I am human after all) I will keep this section relatively short. I might come and revisit this topic along with satija lab’s new paper when they are out and published after peer review.

The framework:

In the original perturb-seq paper from 2016, the authors had identified the core challenge for profiling cells subjected to genetic perturbations; Then there existed no viable method that was high throughput and are able to do high content screens without any labour-intensive process. The experimental workflow uses pooled CRISPR screens, where guide RNAs are used to target the genes of interest. By using lentiviral infection of cells for example they are able to generate single-cell RNAseq of the infected cells, the inherent nature of the guide RNA and using a unique cell barcode and a unique molecular identifier; the researchers were able to exactly pinpoint which perturbation arise from which cell and affected which genes or a network of genes.

in silico perturbation experiments

As one you might question, well what if there are two perturbations and how would you explain which perturbation leads to an effect? This is where using linear models and decomposing epistasis aid us in understanding the effects of perturbations.

Now with that out of the way, Let's see what the authors modified in compressed Perturb-seq. The goal of a P-seq experiment is to infer the effect sizes of the perturbations on the phenotype. Mathematically speaking, The gene expression profile can be visualized as a matrix n*m where n is the effect size of the perturbation and m is the expression level of the genes. By using the constant factor O based on sample replicates they can learn the effects of n perturbations, these scale linearly with the number of perturbations (tbh, I am slightly lost here as well since I don't have a strong math background).

They then base their hypothesis on compressed sensing theory, If the perturbation effects are sparse i.e., few perturbations affect the phenotype, perturbations tend to affect few factors that can be combined to “explain” the phenotype. They can then measure a small number of random composite samples and decompress those measurements to infer the effects of individual perturbations.

in silico perturbation experiments

Given a single-valued phenotype, O(k log n) composite samples were obtained from n perturbations. These are enough to accurately recover enough information about the effects of the perturbations. k is the number of non-zero elements among the perturbations.

When genes do not affect the phenotype k is small than n.

The phenotype itself is an m- dimensional gene expression profile.

They use a Factorize-recover algorithm which requires O((q+r)log n) composite samples where r is the rank of the n x m perturbation effect size matrix. The rank is the number of distinct groups of co-regulated genes whose expression changes in response to any perturbation.

q is the maximum number of co-functional perturbations with nonzero effects on any individual module.

The number of required composite samples scale logarithmically or at worst sub-linearly with n, leading to much fewer required samples than the conventional approach.

The factorize-recover algorithm:

Factorizes the expression count matrix with sparse factorization followed by sparse recovery. Sparse factorization is performed by sparse PCA and recovery by LASSO, both are often used in single-cell data.

The algorithm then computes perturbation effects on individual genes as the product of two matrices generated from the recovery step and the gene weights in each latent factor.

Okay, I understand this might be getting out as “complicated stuff” but really they are just multiplying different values generated to get a value for perturbation effects on the genes involved.

The comparison:

in silico perturbation experiments

As in any experimental procedure proposing a new method, comparisons are a common way of convincing researchers and you that there is a use case for the proposed new method. In the case of compressed Perturb-Seq, the authors suggest the following advantages:

Increase in power of estimation of the perturbation effect

Including approaches such as cell pooling and guide pooling in combination with compressed perturb-seq leads to cost reductions over conventional P-seq.

The accuracy overall remains the same.

Guide pooling reduces the number of cells needed for screening while having increased power to detect genetic interaction effects.

FR-Perturb substantially improves out-of-sample validation accuracy over conventional gene-by-gene comparison methods.

These are the few advantages proposed by the authors, It is to be seen how effective this method is when it is implemented for performing large-scale perturbation experiments in the future.

A brief summary,

This has been a relatively long post, I if you made it till this; I thank you for your time and patience, Also, I appreciate your curiosity and applaud it. So overall, we looked into two methods which have a computational aspect for understanding perturbations. Perturbation experiments are powerful tools in a researchers tool kit, they allow us to understand processes that need to be elucidated in order to have a thorough understanding of cellular processes. Also, we use perturbations to see for example, what effect a particular drug is exhibiting in a cell line or an animal model. Scientists have been using these sorts of methods historically; One example is Patient H.M. who had severe seizures as a kid and to control these seizures they had to remove the amygdala and the hippocampus. This was a medical intervention that led to key insights to understand memory formation and retention.

We looked at two modern methods to study perturbation, These are much more begin than that HM’s intervention. But give us a better understanding on cellular regulatory processes.

I hope you had some fun reading this.

References:

Lab socials: Morris Lab Twitter , Kenji Kamimoto

Compressed Perturb-seq: highly efficient screens for regulatory circuits using random composite perturbations .

Cleary lab website , Regev lab website

in silico perturbation experiments

Concepts briefly explained:

A vector map of transitions is a graph that depicts the transition of a gene or protein from one state to another. It is typically used to visualize the effects of gene or protein perturbations, such as knockouts or overexpression, and can help researchers better understand the interactions between genes and proteins in the cell.

Clusterwise linear regression aims to find the partition of the data into k clusters, such that linear regressions fitted to each cluster minimise overall mean squared error on the whole data. A research paper on this is linked here . The algorithm uses a combination of data clustering and linear regression to identify patterns in gene expression data and to identify interactions among genes and proteins. By assigning each gene expression state to a cluster, the algorithm is able to identify relationships between genes and proteins, and can then be used to predict the effects of gene perturbations on gene expression. The regularization term helps to reduce the complexity of the model and improve the accuracy of the results.

scATAC-seq (single-cell Assay for Transposase-Accessible Chromatin with sequencing) is a single-cell sequencing technique used to measure the accessibility of chromatin in individual cells. This method uses transposase to cut open the chromatin of individual cells, allowing the DNA within to be sequenced. This technique is used to study gene expression and epigenetic modifications and can provide insights into gene regulation and cell identity.

Review Papers on Data Analysis:

Assessment of computational methods for analysis of single-cell ATAC-seq data.

From reads to insight: a hitchiker’s guide to ATAC-seq data analysis

Regularized linear machine-learning models are a type of machine-learning algorithm that is used to predict outcomes from a set of data. This type of model is used to make predictions about future data based on patterns found in existing data. The model uses a regularization term to reduce the complexity of the model and improve the accuracy of the results. The regularization term penalizes the model for using too many variables, thus preventing the model from overfitting the data. This helps to improve the accuracy of the model and makes it more robust and reliable. The regularization term can also be used to identify important variables that are important for predicting the outcome.

The Area under the Receiver Operating Characteristic (AUC-ROC) is a measure of how accurately a model can predict the outcome of a given data set. AUC-ROC is used to evaluate the performance of a binary classifier, such as a machine learning algorithm. The AUC-ROC measures the ability of a classifier to distinguish between two classes, by plotting the true positive rate against the false positive rate. A higher AUC-ROC score indicates that the model is more accurate at predicting the outcome of the data set. AUC-ROC is typically used to compare different models and can be used to select the best model for a given task.

Haematopoiesis is the process by which the body produces new blood cells. In mice, it is the process of generating all the different types of blood cells from a few progenitor cells. The process of haematopoiesis begins in the embryo and continues throughout the life of the mouse. It is regulated by a number of hormones and growth factors, as well as environmental cues. During haematopoiesis, progenitor cells differentiate into various types of blood cells, such as white blood cells, red blood cells, and platelets.

Myeloid progenitors are a type of stem cell that can differentiate into a variety of different blood cells. These cells can be found in the bone marrow, and they are important for the production of red blood cells, white blood cells, and platelets.

Compressed sensing is a technique used to efficiently compress and store data. It uses mathematical algorithms to reduce the number of measurements required to accurately reconstruct a signal. This technique is often used in image processing, audio and video compression, and medical imaging. In image and video processing, compressed sensing can reduce the amount of data that needs to be transmitted and stored, resulting in faster transmission times and lower storage costs. Additionally, this technique can be used to improve the quality of data by reducing noise and artefacts.

Epistasis is the phenomenon where the effect of one gene is modified by another gene. It is a type of gene-gene interaction, where the expression of one gene alters the expression of another gene.

The factorize-recover algorithm is a mathematical method used to infer the effect sizes of perturbations on a phenotype. The algorithm uses data clustering and linear regression to identify patterns in gene expression data and to identify interactions among genes and proteins. The regularization term reduces the complexity of the model and improves the accuracy of the results. By assigning each gene expression state to a cluster, the algorithm is able to identify relationships between genes and proteins, and can then be used to predict the effects of gene perturbations on gene expression. Additionally, the regularization term can be used to identify important variables that are important for predicting the outcome.

Discussion about this post

in silico perturbation experiments

Ready for more?

  • Research article
  • Open access
  • Published: 30 April 2009

Using in silico models to simulate dual perturbation experiments: procedure development and interpretation of outcomes

  • Neema Jamshidi 1 &
  • Bernhard O Palsson 1  

BMC Systems Biology volume  3 , Article number:  44 ( 2009 ) Cite this article

5231 Accesses

8 Citations

Metrics details

A growing number of realistic in silico models of metabolic functions are being formulated and can serve as 'dry lab' platforms to prototype and simulate experiments before they are performed. For example, dual perturbation experiments that vary both genetic and environmental parameters can readily be simulated in silico . Genetic and environmental perturbations were applied to a cell-scale model of the human erythrocyte and subsequently investigated.

The resulting steady state fluxes and concentrations, as well as dynamic responses to the perturbations were analyzed, yielding two important conclusions: 1) that transporters are informative about the internal states (fluxes and concentrations) of a cell and, 2) that genetic variations can disrupt the natural sequence of dynamic interactions between network components. The former arises from adjustments in energy and redox states, while the latter is a result of shifting time scales in aggregate pool formation of metabolites. These two concepts are illustrated for glucose-6 phosphate dehydrogenase (G6PD) and pyruvate kinase (PK) in the human red blood cell.

Dual perturbation experiments in silico are much more informative for the characterization of functional states than single perturbations. Predictions from an experimentally validated cellular model of metabolism indicate that the measurement of cofactor precursor transport rates can inform the internal state of the cell when the external demands are altered or a causal genetic variation is introduced. Finally, genetic mutations that alter the clinical phenotype may also disrupt the 'natural' time scale hierarchy of interactions in the network.

In silico models of complex biological processes are now being built. The scope of such models can range from genetic circuits [ 1 , 2 ], to organelles [ 1 ], to whole cells [ 2 , 3 ], to whole organs [ 4 ]. Computational models are increasingly being recognized as important investigational tools for the analysis of complex biological systems [ 5 ]. There are now suggestions that even simulations of a whole human being may one day become possible; the virtual human [ 6 ]. Such models are being used to accelerate discovery [ 7 , 8 ], develop understanding of complex physiological processes [ 9 ], and for prospective biological design [ 10 ].

Some in silico whole cell models can now represent cellular functions mechanistically with a reasonable degree of accuracy [ 11 , 12 ]. Understanding and properly characterizing the function of a biological network includes characterization of how the network responds to different types of perturbations; environmental and/or genetic. Critical to eliciting the differences between two seemingly similar systems is the application of a stress on the systems, i.e. the systems need to be perturbed in order to determine whether their functional capabilities have changed. Thus, dual perturbation experiments are used to interrogate the functional abilities of cells. Just as for experimental models in biology, relative predictions by in silico models may be more useful than absolute predictions. This use of in silico models can now be prototyped at the single cell level, and in silico models of single cells should be used to predict outcomes of dual perturbation experiments (that cross genetic and environmental perturbations) before they are performed in the laboratory. Kinetic network models can be particularly useful for perturbation experiments, since they 1) enable predictions to be made for steady state fluxes as well as concentrations, 2) enable investigation of the dynamic properties when moving from one steady state to another, 3) allow perturbations to be made through alteration of enzyme parameters, initial conditions for concentrations, or the application of various 'load' functions or alteration of enzyme rate laws, 4) enable analysis of dynamics when moving from one steady state to another, 5) enable analysis of non-linear properties of networks.

We adopted the approach of using perturbation experiments in an effort to better understand the changes that occur at metabolic network steady states. A set of genetic variants were analyzed following environmental perturbations and compared to normal cells undergoing the same environmental perturbations. The well established cell-scale kinetic model of human red cell metabolism was used for these studies [ 13 , 14 ]. This network accounts for glycolysis, the pentose phosphate pathway, the Rapoport-Luebering Shunt, nucleotide salvage pathways, as well as sodium and potassium transport channels and the sodium potassium ATPase, described by 34 ODEs comprised of 44 enzyme rate expressions with allosteric influences when appropriate, in addition to magnesium complexing reactions [ 15 ]. The detailed description of the allosterically regulated enzymes, such as glucose-6 phosphate dehydrogenase (G6PD), pyruvate kinase (PK), and phosphofructokinase (PFK) enabled the direct application of causal SNP mutations to study well-known genetic and environmental variations. Since fluxes describe the functional states of cells and 'tie' the network together and metabolite concentrations are the most direct descriptions of biochemical phenotypes, we used this kinetic model of metabolism to investigate changes in fluxes as well as concentrations. It has previously been demonstrated that this experimentally validated dynamic model of red cell metabolism can be used to differentiate among phenotypes with a range of clinical severity [ 16 ]. In the present study we focused on well-known genetic variations in G6PD and PK and changes in environmental parameters that reflect changes in redox and energy loads on the cell and that these changes can be inferred based on transporter flux states and cofactor concentration ratios and without relying on extensive simulations.

The red blood cell's primary physiological function is to deliver oxygen that enters the body through the lungs, to tissues and to return carbon dioxide back to the lungs, to be expelled from the body. In order to carry out these respiratory functions, basic but critical metabolic functionality in the red cell must be maintained; including maintaining adequate ATP levels to maintain cell shape and the preservation of reduced glutathione to counteract oxidative stresses. These capabilities are directly affected to varying degrees in patients with metabolic enzyme deficiencies. The two most common of such deficiencies in the human red cell are in G6PD and PK [ 17 , 18 ]. We considered dual perturbations in silico where genetic variations in the properties of these two enzymes are crossed with changing environmental conditions.

1. Baseline responses of the 'normal' red cell: elucidating the role of pooled variables and transporters

The 'normal' or 'nominal' set of parameters in the dynamic model of red cell metabolism (abbreviated here as the nRBC) formed a reference case from which parameters can be perturbed. The dynamic responses nRBC (Figure 1 , top) were analyzed under conditions of increased redox and energy loads to provide a baseline, or the 'normal' response to such stresses. The nRBCunder increased redox vs. increased energy loads resulted in different flux states, notably shifting between reliance on the glycolytic pathway versus the pentose phosphate pathway. The computed steady state concentrations and fluxes are provided in Additional file 1 .

figure 1

Simplification of a network to aggregate pooled variables . Schematic diagram of the human red cell network that includes 34 metabolites and 44 reaction fluxes. The network can be simplified into biologically relevant pooled variables and metabolite ratios, such as cofactor pools. Further simplification will result in classification of the network functional states based on changes in transport fluxes. The network is coarse-grained here in order to highlight the ability to characterize different functional states based only on the lumped aggregate variables representing physiological states (redox potential and phosphate potential) and key transporters (HX, ADE, and ADO). When the cell needs to increase its redox potential in order to counteract an oxidative stress, ADE uptake and HX secretion are increased, while ADO and INO uptake are decreased. The reverse trends occur with increased energy loads. The relative magnitudes of the changes tend to be larger with ADE and ADO than INO and HX. Abbreviations: GLC: glucose, PYR: pyruvate, LAC: lactate, Na: sodium, K: potassium, HX: hypoxanthine, INO: inosine, ADE: adenine, ADO: adenosine.

The evaluation of the computed nRBC metabolic network responses and the resulting internal changes from the imposition of redox or energy loads led to the observation that changes in transport fluxes reflected changes in internal functional states (Additional file 2 , Figure 1 ):

As the redox load increased, flux through glycolysis decreased and flux through the pentose phosphate pathway increased. At very high redox loads, the flux through phosphoglucoisomerase (PGI) even reversed direction to achieve maximal total flux through the oxidative branch of the pentose phosphate pathway. These expected shifts in pathway uses were accompanied by some unexpected changes in transport fluxes. There was a dramatic decrease in adenosine (ADO) uptake and a moderate increase in the adenosine deaminase (ADA) flux. Hypoxanthine (HX) excretion increased and inosine (INO) uptake decreased. Internally, the fluxes through adenine phosphoribosyltransferase (ADPRT) and phosphoribosylpyrophosphate synthetase (PRPPSYN) increased significantly.

As the energy load is increased, flux through the pentose phosphate pathway decreased and flux increased through glycolysis. The adenosine kinase (AK) flux decreased, with a relatively small decrease in AMP deaminase (AMPDA), adenine phosphoribosyltransferase (ADPRT), IMP nucleotidase (IMPase) and purine nucleoside phosphorylase (PNPase). The phosphoribosylpyrophosphate synthetase (PRPPSYN) flux decreased. These changes in the flux state change the state of transporters. HX excretion decreased and INO uptake increased slightly. There was a reduction in adenine (ADE) uptake and a parallel increase in adenosine (ADO) uptake.

Taken together, these in silico results showed that the ratio of the adenosine to adenine uptake reflected whether the cell was responding to an increased redox load or an energy load.

Interpretation

Red cell metabolic functions have previously been interpreted [ 18 , 19 ] in terms of pooled concentration variables and ratios thereof; such as redox and phosphate capacities and potentials (Figure 1 , middle). This approach led to a functional (or physiological) description of the state of red cell metabolism, rather than a detailed biochemical description. The computations of changes in fluxes and concentrations at varying redox and energy loads in the nRBC can be interpreted within this functional network description.

The alterations in the computed internal function state of the nRBC in response to environmental parameters can be reduced to changes in the states of transport fluxes (Figure 1 , bottom). Increases in energy demands will shift the internal pathways to maintain the phosphate potential and ADO uptake will increase, while ADE uptake decreases and HX secretion decreases. Conversely, when increased oxidative stresses are experienced, the fluxes shift towards the redox potential with decrease in ADO uptake and an increase in ADE uptake and HX secretion.

These simulated responses provide the baseline case for the nRBC and they can be characterized in terms of a functional description of the state of nRBC metabolism (Figure 1 ). The key result here was that the environmental loads shift the use of pathways and the redox/energy pools. The state of these pools (their charge = occupancy/capacity [ 18 , 19 ]) was adjusted by alterations in key transport fluxes that are accessible to measurement.

Dual perturbations in silico

Since the simulated flux uptake patterns reflected changes in the internal states of the nRBC to environmental perturbations, it suggested the possibility to characterize the responses different genetic variants (nRBC versus vRBC; 'v' for 'variant') in the same fashion. The responses of the nRBC were again used as a reference. This led to the notion of a dual perturbation experiment in silico as a way to analyze and design experiments to determine the effects of genetic variation. Differences in functional capabilities were assessed by comparisons between different states in two different environmental conditions for the nRBC vs. a vRBC (i.e. any one of the 6 G6PD variants), see Figure 2 . Calculating the differences between the rows in Figure 2 and then comparing the nRBC to the vRBC resulted in a dual perturbation comparison, in which the one perturbation reflected an environmental change and the other results from a genetic mutation. Such in silico studies were carried out for G6PD variants and PK variants whose altered kinetic parameters were measured from patients [ 17 – 20 ].

figure 2

2 × 2 design of the in silico dual perturbation experiments . The condition change for the G6PD variant was an altered redox load and the condition change for the PK variant was an altered energy load.

The steady state flux distributions for the nRBC along with the six G6PD variants (vRBC i , i = 1, 6) were calculated at a high and a low redox load (see METHODS). The comprehensive set of steady state fluxes and concentrations calculated for the cell under different conditions led to the definition of a sensitivity parameter relating the intracellular redox state of the cell and extracellular transporters,

where R high and R low reflect the redox state of the cell (R = NADPH/(NADP + NADPH)) under high and low redox loads, respectively, and ν is a corresponding transport flux. A logarithmic sensitivity measure was used since this proved to be a sensitive relationship. The computational results illustrated how the change in the redox state of the cell (summarized as the relative size of the reduced NADP pool), an intracellular quantity, was tracked by transport fluxes, an extracellular quantity (see Figure 3 ). Hence, changes in the internal states of cells were detected by measuring the exchange rates.

figure 3

Sensitivity of intracellular cofactor pools on extracellular transporters . The sensitivity parameter is a logarithmic ratio between the changes in redox potential between two states over the relative change in a particular transport flux between two states. Changes in the intracellular redox state can be reflected in the extracellular transport of related metabolites.

To further elaborate the relationship between the cofactor pools and transport fluxes in the cell, the differences between the high and low exchange fluxes were calculated for each variant and then normalized to the corresponding flux difference in the nRBC. A value of 1 indicates no difference compared to the nRBC, whereas values greater than or less than one reflect differences between the vRBCs from the nRBC. The computational results showed that the transport fluxes alone are enough to distinguish between the less severe (A+ and A-, vRBC 1 and vRBC 2 , respectively) and more severe (Iwate, Niigata, Yamaguchi, and Portici, vRBC 3 through vRBC 6 , respectively) SNP variants (Figure 4 ). This finding was interesting because it suggested that changes in transport fluxes could identify changes in internal functional states, differentiating between normal and pathophysiological situations, and that changes were identified using relative changes in the fluxes.

figure 4

Characterization of G6PD genetic variants as functions of transport fluxes . The steady state flux differences were calculated for the nRBC and the vRBC variants considered at high and low redox loads for each case, respectively. The SNP variants were then normalized with respect to the flux differences for the normal case. There is a range of clinical phenotypes exhibited in these patients, determined by the severity of their hemolytic disorders. The A minus and A plus variants exhibit a more mild phenotype with non-chronic hemolytic anemia. The transport fluxes for these cases are effectively the same as the nRBC. In contrast, the other variants, with more severe phenotypes, have altered steady state flux differences compared to the normal case. Abbreviations: LEX: lactate transport, HXEX: hypoxanthine transport, INOEX: inosine transport, ADEEX: adenine transport, ADO: adenosine transport.

PK dual perturbation studies were carried out by applying different energy loads instead of redox loads (Figure 5 ). All of the vRBCs considered differed from the nRBC, but to varying degrees. Decreased lactate production by all of the variants reflected a decreased glycolytic flux due to the decreased activity of PK. The increased transport of the other metabolites was a result of the need to balance all carbon uptake and secretion. The Mantova variant was predicted to be less severe than the other vRBCs. In these cases, the altered transport fluxes reflected a decreased total pool size of ATP internally.

figure 5

Characterization of PK genetic variants in terms of transport fluxes . Calculations analogous to those in Figure 2 were carried out for the PK variants with high and low energy loads. All of these variants showed altered uptake/secretion patterns for the transport fluxes compared to the normal red cell.

2. Dynamic response of vRBCs to redox loads: elucidation of altered time-scale hierarchy

In order to further characterize the differences that occur between the nRBC and the vRBCs, the dynamic states of the vRBCs were analyzed by comparing changes in the dynamic sequence of events that occur when they respond to a perturbation. Different biochemical inter-conversions, illustrated in Figure 1B , occured on different time scales [ 21 ] when the nRBC was exposed to an environmental perturbation. This sequence of events can be different in a vRBC and how much they differ in silico from the nRBC could be a measure of the pathological severity of the genetic variation.

Dynamic responses of the nRBC

The dynamic response of a network can be characterized through the comprehensive identification of the time-sequence of pool formation between the metabolites in a network [ 22 ]. This approach for determining the temporal structure in metabolic network responses was used as a basis for the analysis how it responds to perturbations. As the nRBC responded to an increased redox load, the shift in flux from glycolysis to the pentose pathway was also reflected in changes in the time scales for pooling of metabolites (see Additional file 2 , Figure 1 ). The shifts in pooling time scales were observed throughout the network, however the most pronounced shifts involved the interactions between GL6P and the hexose-phosphates, the pentose phosphates and glutathione/NADPH, and the adenosine phosphates (AMP, ADP, ATP) with the nucleotide precursor and degradation products.

Metabolite pooling over progressive time scales involves a complex series of interactions. Pool formation on fast time scales generally reflect fast achievement of equilibrium between isomers and other metabolites whose steady state concentration ratios are close to their equilibrium ratios [ 25 , 26 ]. Pooling on slower time scales involves interactions between aggregate pools between different pathways in the network. These interactions are illustrated in Figure 6 , in order to highlight the distinction between pooling within pathways and interactions between pathways.

figure 6

Schematic overview of pooling interactions occurring over progressive time scales . Generally pooling occurs within particular pathways on the fast time scales and these pools of metabolites from the different pathways then interact with one another on the slower time scales. Changes in pooling time scales that occur within pathways can affect overall interactions between larger modules in the network. The pooling interactions do not happen strictly in this manner and there are overlapping interactions between pathways over time.

Responses of Glucose-6-Phosphate Dehydrogenase Deficient RBCs

The G6PD variants due to causal SNP mutations were analyzed in terms of their effect on the time progression of the pooling process. Modal decomposition [ 23 ] was carried out for each of these cases under low and high redox loads. The pooling between metabolites over time was then determined for the normal cell (Additional file 2 , Figure 1 ) and compared to each of the genetic variants under high redox loads (see Additional file 2 , Figures 2 , 3 , 4 , 5 , 6 , 7 ) [ 22 ].

figure 7

Disruptions of the global pooling structure in G6PD variants . The size of the boxes reflects the size of the redox and phosphate potentials. The NADPH redox state (NADPH/(NADP + NADPH)) is used as the surrogate for the overall redox potential. The A+ and A- variants are almost identical to the normal cell. The chronic hemolytic variants however have significant changes in the size of their redox pools. Additionally, the dynamics of the interactions between the adenosine and pentose phosphate pathways are disrupted in the more severe variants (slanted lines through arrows).

Changes in metabolite pooling over progressive time scales for the non-chronic hemolytic variants (A+ and A-; these patients exhibit a relatively mild form of hemolytic anemia [ 17 ]) were very similar to the normal cell (Additional file 2 , Figures 1 , 2 , 3 ), with the significant differences in pooling between metabolites occurring between members of the pentose phosphate pathway and some of the nucleotide precursors. The chronic hemolytic anemia variants in contrast, exhibited more pronounced changes in the dynamic sequence of interactions. The largest differences involved; 1) interactions between the oxidative branch of the pentose phosphate pathway and the first half of glycolysis, and 2) interactions among members within the non-oxidative branch of the pentose phosphate pathway and the nucleotide salvage pathway metabolites. The alterations in dynamics of the oxidative branch in the pentose phosphate pathway resulted in disruptions of the interactions between the different pathways on slower time scales (Figure 6 ).

Considering network functions in the context of the functional pools described in the middle panel in Figure 1 , a summary of the alterations in metabolic states under increased redox loads of the six genetic variants considered are shown in Figure 7 . The areas of the boxes reflect the size of the respective metabolite pools at the steady state. It is immediately apparent that the chronic hemolytic variants have substantially reduced abilities to counteract oxidative stresses.

Additionally there were changes in the dynamics of the chronic hemolytic variants, primarily among the metabolites that contribute to the adenosine phosphate potential and the pentose phosphate potential. The last four variants exhibited altered dynamic interactions among members of the nucleotide salvage pathways and the non-oxidative branch of the pentose phosphate pathway. These changes are denoted by slanted arrows in Figure 7 .

3. Dynamic response of the nRBC to energy loads

Similar computations of interactions among metabolites across time scales were applied to characterize the dynamic response of the nRBC to changes in energy loads (see Additional file 2 , Figure 8 and METHODS). As discussed above, when the nRBC responds to an increased energy load, flux through glycolysis is increased at the cost of a reduced flux through the pentose pathway. When the dynamic pooling among metabolites for the nRBC at a normal versus an increased load were compared, there was broad shifting of pooling between metabolites, with significant changes occurring between glycolytic intermediates and adenosine metabolites, and between nucleotide precursors and pentose phosphates.

Pyruvate Kinase Deficient Cases

In order to characterize the time hierarchical functionality of PK genetic variants, the dynamic pooling correlations were calculated for each vRBC under high-energy load conditions. Each vRBC was then compared to the nRBC under high-energy load conditions (see Additional file 2 , Figures 9–15). The salient observations from these computations were: 1) the vRBCs all exhibited comparatively similar patterns of variation from the responses of nRBC (unlike the G6PD variants considered above), and 2) the changes between high and low energy load in the variants were all similar to the changes observed for the nRBC. Since the energy charge remained fairly constant for all of the variants as well, the changes resulting from PK deficiency were reflected primarily as changes in the pool size of available ATP.

The utility of in silico models for elucidating the genotype-phenotype relationship through analysis of data from patients with two enzymopathies resulting from causal SNPs has been illustrated [ 16 ]. It has been established that intracellular cofactor ratios reflect intracellular functional states [ 24 , 25 ]. Cofactor secretion has previously been identified in bacteria such as E. coli [ 26 ]. Starting with an understanding of metabolic states in terms of a pooling structure (Figure 1 ), the present study focused on an in depth in silico investigation of the changes that occur in a 'normal' and a genetically 'variant' human erythrocyte. The chief results from the analysis herein were: 1) transport fluxes of bases can be used to characterize changes in the intracellular cofactor pools and dominant metabolic pathways, 2) internal charge ratios can be reflected in cofactor exchange rate response signatures, both of which are affected by genetic variants that have pathological consequences, and 3) the dynamic sequence of interactions that occur within a cell over time can be affected by pathological genetic variants.

Dual perturbation experiments were critical in eliciting the different capabilities among the different genetic variants, thus an external perturbation was necessary. Since the enzymopathies considered involved the redox and energy producing capabilities of the RBC and these are two of its essential metabolic functions, the relevant perturbation conditions are altered redox and energy loads. When a cell changes from one steady state to another, there may be changes in the steady state concentrations, steady state fluxes, and/or the hierarchical interactions that occur over time. We investigated these changes in the red cell under various conditions, including clinically relevant disease states.

Steady State Analyses

Calculation of the steady state concentrations and fluxes of the nRBC at various redox and energy loads illustrated that the two functional states can be distinguished according to the ADE, ADO, HX, and INO transport fluxes. This observation led to a reduced and experimentally accessible view of the functional metabolic red cell (Figure 1 ) that maintains information about its functional metabolic state.

An analysis comparing causal SNP variants and the normal red blood cell (Figures 3 and 4 ) demonstrated that the simplified view of the RBC could distinguish a range of pathophysiological states. The less severe G6PD variants could be differentiated from the more pathological chronic hemolytic patients based on changes in the transport fluxes, as depicted in Figure 3 . These results suggested that knowing the state of transporters can directly inform the intracellular metabolic state of the cell. The findings also highlight the importance of applying perturbations to cells (in the wet lab and dry lab alike) in order to understand the functional capabilities and differences between cells that have genetic differences or environmental stresses. The important role of transporters for

Dynamic Analyses

In order to investigate the dynamic changes of the network under different conditions we calculated the time scales at which different metabolites pool together [ 22 ]. The resulting dynamic pooling plots allowed global assessment of the network pooling structure. The responses of the cell to each type of fixed perturbation resulted in different alterations in the resulting pooling structure in the metabolic network, however both types of perturbations resulted in changes in pooling of nucleotide salvage metabolites such as adenosine, adenine, hypoxanthine, inosine, and ribose-1-phosphate. The significance of this result can be appreciated in consideration of the steady state changes (and the resulting distillation of the network shown in Figure 1B ) and topological considerations. The two objectives, ATP production and NADPH production, compete for available G6P. Additionally, the nucleotide salvage pathway connects the two by virtue of the importance of ATP/ADP for glycolysis and the ribose phosphates in the non-oxidative branch of the pentose phosphate pathway. Thus although NADPH and ATP generation are linked at the top and bottom of their related pathways in the red cell, the dynamic response to an ATP load however will not be the direct opposite of a load on NADPH because the response times of enzymes in the different pathways differ significantly.

The non-chronic and chronic G6PD variants exhibited marked differences in the pooling of metabolites on progressively slower time scales. In general, the more severe, chronic G6PD variants exhibited a 'disrupted' pooling structure of the network. The complex series of interactions and aggregate pool formation over progressive time scales are difficult to contemplate individually. As highlighted in Figure 6 , metabolite pooling typically occured within pathways on faster time scales and between pathways on slower time scales. Analysis of the nRBC and vRBC dynamics under redox loads illustrated that disruptions altering the pooling behavior between metabolites within a pathway, subsequently disrupted interactions between pathways on the slower time scales. For the G6PD variants, this was reflected by changes in the interactions between the pentose phosphate pathway and the nucleotide salvage pathway. Thus, not all G6PD variants had the same metabolic disruptions in their pathways and furthermore, a red cell with G6PD deficiency was not the same as a red cell with a decreased ability to tolerate an oxidative load.

It was observed that even PK variants were able to maintain their energy charge [ 24 ] in the 85–90% range [ 16 ], however it was not clear if the dynamics of the PK variants under higher energy loads were similar to that of the normal red cell. Investigations of the pooling structure suggested that indeed the dynamics of pool formation in PK variants, in contrast to the G6PD variants, were effectively the same as normal red cells, but with a diminished ability to respond to energy loads.

Models can be used to simplify what appears complex and in some cases reveal complexity in areas that appear to be simple. Wet lab experiments regularly make use of dual perturbation designs, surprisingly however, such designs have not been employed in dry lab experiments extensively. Herein we have shown that dual perturbations in silico are much more informative that single perturbation computational experiments. This was illustrated when the cell was exposed to altered external demands as well as altered rate expression parameters resulting from causal genetic mutations. Analysis of the time scale hierarchy of network dynamics in the context of genetic mutations also revealed a disruption of the normal interactions between metabolites on different time scales. These results suggest that in rough analogy to heterochronic mutants and organogenesis [ 27 – 29 ], the dynamics within metabolic networks may be disrupted in pathological states. Future experiments investigating and testing these hypotheses are worth investigation. It is conceivable that detecting altered pathway dynamics could be an approach for identifying sub-clinical disease or pathological changes in health that are not yet clinically identifiable.

There has been a growing literature concerned with computationally driven experimental design approaches in systems biology [ 10 , 30 – 33 ]. These wide ranging approaches have been used to address different aspects of biological complexity and have been employed to make predictions which may be used to test hypotheses, engineer microbes, or even build synthetic networks. Here we used a kinetic model of a biochemical network for perturbational analysis of genetic and environmental factors. The use of kinetic models to assist in the characterization and understanding of the genotype-phenotype relationship and disease states is likely to become a more valuable in the future, as models play a more active role in providing predictions and hypotheses to be tested experimentally, particularly by employing dual perturbation designs. Here we suggest that measurements of uptake rates of cofactor precursors and disruption of time scale hierarchy could be used to estimate the internal state cells and how genetic variants differ from a normal (or a reference) cell in response to external perturbations. The predictions put forth can serve as hypotheses to be tested experimentally in future studies.

The kinetic model of human red cell metabolism accounting for small metabolite allosteric regulation as originally described by [ 14 , 15 ] and implemented in Mathematica ® (Wolfram Research, Chicago, IL) [ 13 ] was used for the analyses. Steady state concentrations were calculated using Mathematica's numerical root solving functions and solving the mass balance equations at steady state (error for each equation < 10 -15 ) for the set of metabolite concentrations. Steady state fluxes were calculated by substituting the steady state concentrations into the corresponding reaction rate law. The implementation of the model in Mathematica is freely available for download at systemsbiology.ucsd.edu.

Application of redox and energy loads

Redox loads were applied by increasing the rate constant for the first order load on glutathione (reduced form) (GSHR reaction; a drain on glutathione) that is already part of the model. The maximum coefficient for the normal erythrocyte was 7.5/hr. Most of the variants could not tolerate a load this high, although some of the variants could tolerate an even higher load. The high and low redox loads between the normal cell and the G6PD variants for the comparative study were 2.5 and 0.5 mM/hr, respectively (the maximum applicable load was limited by some of the variants).

Energy loads were applied by adjusting the rate constant for the first order load on ATP (ATPase reaction; a drain on ATP) that is already in the model. The maximum coefficient was 1.6/hr, although many of the variants could not tolerate this high of a load. The high and low energy loads between the normal cell and the PK variants for the comparative study were 0.6 and 0.2 mM/hr, respectively.

Implementation of kinetic variants

The SNP variants were implemented in silico by varying the appropriate kinetic parameters (e.g. V max , K i , etc) that were measured for individual patients [ 18 – 20 ] in the rate equations for G6PDH and PK, as appropriate. In order to account for any systematic measurement errors between different labs, the values of the patient variants were scaled with respect to the 'normal' control values measured by the particular investigators. This was carried out in the same manner as described previously [ 16 ].

Analysis of dynamics

Decomposition of the dynamics and subsequent characterization of the pooling within the network has been developed and previously described in detail [ 22 ] and is described briefly here. Following common practice in dynamical systems analysis, linearization of the mass conservation equations for a chemical reacting system around the steady state yields,

where J is the nxn Jacobian matrix (n = the number of dynamic concentration variables), and x' (= x - x ss ) is the deviation vector of the concentration variables from the steady state ( x ss ). Temporal decomposition is then carried out, and the modal matrix is constructed by applying a similarity transformation to the Jacobian, J = MΛM -1 , and redefining the variables as modal variables, m = M -1 x, so that Equation ( 2 ) becomes,

as described previously [ 23 ]. The modal variables, m, are dynamically independent. Aggregate pool formation among metabolites was determined through the calculation of the angles between columns in the modal matrix,

In order to identify the time scales at which pool formation occurs, we compute the angle between two columns as a function of an index k that runs from 1 to n time scales. As each row of the modal matrix is removed ( k increases by one) the angle is recomputed to form a series of angles as a function of k ; i.e., ϑ ij (k) , k = 1, 2, ..., n. If the angle ϑ ij (k) is close to zero, the two columns are correlated at and above that k value and the two corresponding concentrations will move in tandem for the subsequent time scales, thus forming an aggregate variable or a pool.

Vo TD, Greenberg HJ, Palsson BO: Reconstruction and functional characterization of the human mitochondrial metabolic network based on proteomic and biochemical data. The Journal of biological chemistry. 2004, 279 (38): 39532-39540. 10.1074/jbc.M403782200

Article   CAS   PubMed   Google Scholar  

Feist AM, Palsson BO: The growing scope of applications of genome-scale metabolic reconstructions using Escherichia coli. Nature biotechnology. 2008, 26 (6): 659-667. 10.1038/nbt1401

Article   PubMed Central   CAS   PubMed   Google Scholar  

Saucerman JJ, Brunton LL, Michailova AP, McCulloch AD: Modeling beta-adrenergic control of cardiac myocyte contractility in silico. The Journal of biological chemistry. 2003, 278 (48): 47997-48003. 10.1074/jbc.M308362200

Noble D: Modeling the heart – from genes to cells to the whole organ. Science. 2002, 295 (5560): 1678-1682. 10.1126/science.1069881

Kell D: Metabolomics and systems biology: making sense of the soup. Curr Opin Microbiol. 2004, 7 (3): 296-307. 10.1016/j.mib.2004.04.012

Kell DB: The virtual human: towards a global systems biology of multiscale, distributed biochemical network models. IUBMB life. 2007, 59 (11): 689-695. 10.1080/15216540701694252

Cakir T, Efe C, Dikicioglu D, Hortacsu A, Kirdar B, Oliver SG: Flux balance analysis of a genome-scale yeast model constrained by exometabolomic data allows metabolic system identification of genetically different strains. Biotechnology progress. 2007, 23 (2): 320-326. 10.1021/bp060272r

Reed JL, Patel TR, Chen KH, Joyce AR, Applebee MK, Herring CD, Bui OT, Knight EM, Fong SS, Palsson BO: Systems approach to refining genome annotation. Proceedings of the National Academy of Sciences of the United States of America. 2006, 103 (46): 17480-17484. 10.1073/pnas.0603364103

Calzone L, Thieffry D, Tyson JJ, Novak B: Dynamical modeling of syncytial mitotic cycles in Drosophila embryos. Molecular systems biology. 2007, 3: 131- 10.1038/msb4100171

Article   PubMed Central   PubMed   Google Scholar  

Burgard AP, Pharkya P, Maranas CD: Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnology and bioengineering. 2003, 84 (6): 647-657. 10.1002/bit.10803.

Chen K, Csikasz-Nagy A, Gyorffy B, Val J, Novak B, Tyson J: Kinetic analysis of a molecular model of the budding yeast cell cycle. Mol Biol Cell. 2000, 11 (1): 369-391.

Klipp E, Nordlander B, Kruger R, Gennemark P, Hohmann S: Integrative model of the response of yeast to osmotic shock. Nature biotechnology. 2005, 23 (8): 975-982. 10.1038/nbt1114

Jamshidi N, Edwards JS, Fahland T, Church GM, Palsson BO: Dynamic simulation of the human red blood cell metabolic network. Bioinformatics. 2001, 17 (3): 286-287. 10.1093/bioinformatics/17.3.286

Joshi A, Palsson BO: Metabolic dynamics in the human red cell. Part I – A comprehensive kinetic model. J Theor Biol. 1989, 141 (4): 515-528. 10.1016/S0022-5193(89)80233-4

Joshi A, Palsson BO: Metabolic dynamics in the human red cell. Part III – Metabolic reaction rates. J Theor Biol. 1990, 142 (1): 41-68. 10.1016/S0022-5193(05)80012-8

Jamshidi N, Wiback SJ, Palsson BB: In silico model-driven assessment of the effects of single nucleotide polymorphisms (SNPs) on human red blood cell metabolism. Genome Res. 2002, 12 (11): 1687-1692. 10.1101/gr.329302

Beutler E: The genetics of glucose-6-phosphate dehydrogenase deficiency. Seminars in hematology. 1990, 27 (2): 137-164.

CAS   PubMed   Google Scholar  

Zanella A, Bianchi P: Red cell pyruvate kinase deficiency: from genetics to clinical manifestations. Bailliere's best practice & research. 2000, 13 (1): 57-81.

Article   CAS   Google Scholar  

Beutler E, Vulliamy TJ: Hematologically important mutations: glucose-6-phosphate dehydrogenase. Blood cells, molecules & diseases. 2002, 28 (2): 93-103. 10.1006/bcmd.2002.0490

Article   Google Scholar  

Fiorelli G, Martinez di Montemuros F, Cappellini MD: Chronic non-spherocytic haemolytic disorders associated with glucose-6-phosphate dehydrogenase variants. Baillieres Best Pract Res Clin Haematol. 2000, 13 (1): 39-55.

Kauffman KJ, Pajerowski JD, Jamshidi N, Palsson BO, Edwards JS: Description and analysis of metabolic connectivity and dynamics in the human red blood cell. Biophys J. 2002, 83 (2): 646-662. 10.1016/S0006-3495(02)75198-9

Jamshidi N, Palsson BO: Top-down analysis of temporal hierarchy in biochemical reaction networks. PLoS computational biology. 2008, 4 (9): e1000177- 10.1371/journal.pcbi.1000177

Palsson BO, Lightfoot EN: Mathematical modelling of dynamics and control in metabolic networks. I. On Michaelis-Menten kinetics. J Theor Biol. 1984, 111 (2): 273-302. 10.1016/S0022-5193(84)80211-8

Atkinson DE: The energy charge of the adenylate pool as a regulatory parameter. Interaction with feedback modifiers. Biochemistry. 1968, 7 (11): 4030-4034. 10.1021/bi00851a033

Reich J, Selkov E: Energy metabolism of the cell: a theoretical treatise. 1981, London; New York: Academic Press

Google Scholar  

Matin A, Matin MK: Cellular levels, excretion, and synthesis rates of cyclic AMP in Escherichia coli grown in continuous culture. J Bacteriol. 1982, 149 (3): 801-807.

PubMed Central   CAS   PubMed   Google Scholar  

Ambros V: A hierarchy of regulatory genes controls a larva-to-adult developmental switch in C. elegans. Cell. 1989, 57 (1): 49-57. 10.1016/0092-8674(89)90171-2

Ambros V, Horvitz HR: Heterochronic mutants of the nematode Caenorhabditis elegans. Science. 1984, 226 (4673): 409-416. 10.1126/science.6494891

Moss EG: Heterochronic genes and the nature of developmental time. Curr Biol. 2007, 17 (11): R425-434. 10.1016/j.cub.2007.03.043

Banga JR: Optimization in computational systems biology. BMC Syst Biol. 2008, 2: 47- 10.1186/1752-0509-2-47

Dasika MS, Maranas CD: OptCircuit: an optimization based method for computational design of genetic circuits. BMC Syst Biol. 2008, 2: 24- 10.1186/1752-0509-2-24

Zak DE, Gonye GE, Schwaber JS, Doyle FJ: Importance of input perturbations and stochastic gene expression in the reverse engineering of genetic regulatory networks: insights from an identifiability analysis of an in silico network. Genome Res. 2003, 13 (11): 2396-2405. 10.1101/gr.1198103

Gadkar KG, Gunawan R, Doyle FJ: Iterative approach to model identification of biological networks. BMC Bioinformatics. 2005, 6: 155- 10.1186/1471-2105-6-155

Download references

Acknowledgements

We thank the anonymous peer reviewers for their constructive comments. The authors would also like to thank Mr. Jan Schellenberger for his reading of the manuscript and comments. This work was partially supported by NIGMS grant number GM68837.

Author information

Authors and affiliations.

Department of Bioengineering, 9500 Gilman Drive, University of California, San Diego, La Jolla, CA, 92093-0412, USA

Neema Jamshidi & Bernhard O Palsson

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Bernhard O Palsson .

Additional information

Authors' contributions.

NJ conceived of the study, carried out the calculations, and drafted the initial draft of the manuscript. NJ and BOP analyzed the results and revised the manuscript.

Neema Jamshidi contributed equally to this work.

Electronic supplementary material

12918_2008_312_moesm1_esm.xls.

Additional file 1: Steady state concentrations and fluxes. The steady state concentrations of the metabolites and the steady state fluxes of the reactions are provided in the tables. (XLS 16 KB)

12918_2008_312_MOESM2_ESM.pdf

Additional file 2: Tiled pooled plots of the red cell. Tiled pooled plots of the normal and variant red cell models under different stress conditions (Figure 1–15). Maps of the qualitative changes in fluxes in response to redox and energy loads (Figure 16). (PDF 314 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2, authors’ original file for figure 3, authors’ original file for figure 4, authors’ original file for figure 5, authors’ original file for figure 6, authors’ original file for figure 7, rights and permissions.

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article.

Jamshidi, N., Palsson, B.O. Using in silico models to simulate dual perturbation experiments: procedure development and interpretation of outcomes. BMC Syst Biol 3 , 44 (2009). https://doi.org/10.1186/1752-0509-3-44

Download citation

Received : 06 October 2008

Accepted : 30 April 2009

Published : 30 April 2009

DOI : https://doi.org/10.1186/1752-0509-3-44

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Pentose Phosphate Pathway
  • Pyruvate Kinase
  • Purine Nucleoside Phosphorylase
  • Transport Flux
  • Slow Time Scale

BMC Systems Biology

ISSN: 1752-0509

in silico perturbation experiments

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings
  • My Bibliography
  • Collections
  • Citation manager

Save citation to file

Email citation, add to collections.

  • Create a new collection
  • Add to an existing collection

Add to My Bibliography

Your saved search, create a file for external citation management software, your rss feed.

  • Search in PubMed
  • Search in NLM Catalog
  • Add to Search

Using in silico models to simulate dual perturbation experiments: procedure development and interpretation of outcomes

Affiliation.

  • 1 Department of Bioengineering, 9500 Gilman Drive, University of California, San Diego, La Jolla, CA 92093-0412, USA. [email protected]
  • PMID: 19405968
  • PMCID: PMC2689188
  • DOI: 10.1186/1752-0509-3-44

Background: A growing number of realistic in silico models of metabolic functions are being formulated and can serve as 'dry lab' platforms to prototype and simulate experiments before they are performed. For example, dual perturbation experiments that vary both genetic and environmental parameters can readily be simulated in silico. Genetic and environmental perturbations were applied to a cell-scale model of the human erythrocyte and subsequently investigated.

Results: The resulting steady state fluxes and concentrations, as well as dynamic responses to the perturbations were analyzed, yielding two important conclusions: 1) that transporters are informative about the internal states (fluxes and concentrations) of a cell and, 2) that genetic variations can disrupt the natural sequence of dynamic interactions between network components. The former arises from adjustments in energy and redox states, while the latter is a result of shifting time scales in aggregate pool formation of metabolites. These two concepts are illustrated for glucose-6 phosphate dehydrogenase (G6PD) and pyruvate kinase (PK) in the human red blood cell.

Conclusion: Dual perturbation experiments in silico are much more informative for the characterization of functional states than single perturbations. Predictions from an experimentally validated cellular model of metabolism indicate that the measurement of cofactor precursor transport rates can inform the internal state of the cell when the external demands are altered or a causal genetic variation is introduced. Finally, genetic mutations that alter the clinical phenotype may also disrupt the 'natural' time scale hierarchy of interactions in the network.

PubMed Disclaimer

Similar articles

  • Hemolytic anemias due to erythrocyte enzyme deficiencies. Jacobasch G, Rapoport SM. Jacobasch G, et al. Mol Aspects Med. 1996 Apr;17(2):143-70. doi: 10.1016/0098-2997(96)88345-2. Mol Aspects Med. 1996. PMID: 8813716
  • In silico model-driven assessment of the effects of single nucleotide polymorphisms (SNPs) on human red blood cell metabolism. Jamshidi N, Wiback SJ, Palsson B BØ. Jamshidi N, et al. Genome Res. 2002 Nov;12(11):1687-92. doi: 10.1101/gr.329302. Genome Res. 2002. PMID: 12421755 Free PMC article.
  • Erythrocyte pyruvate kinase- and glucose phosphate isomerase deficiency: perturbation of glycolysis by structural defects and functional alterations of defective enzymes and its relation to the clinical severity of chronic hemolytic anemia. Lakomek M, Winkler H. Lakomek M, et al. Biophys Chem. 1997 Jun 30;66(2-3):269-84. doi: 10.1016/s0301-4622(97)00057-4. Biophys Chem. 1997. PMID: 9362562
  • [Erythrocyte pyruvate kinase--an enzyme that may have an influence on oxygen transport to tissues]. Dabrowska A. Dabrowska A. Postepy Hig Med Dosw. 1997;51(3):305-18. Postepy Hig Med Dosw. 1997. PMID: 9333782 Review. Polish.
  • Glucose-6-phosphate dehydrogenase--from oxidative stress to cellular functions and degenerative diseases. Ho HY, Cheng ML, Chiu DT. Ho HY, et al. Redox Rep. 2007;12(3):109-18. doi: 10.1179/135100007X200209. Redox Rep. 2007. PMID: 17623517 Review.
  • Analysis of genetic variation and potential applications in genome-scale metabolic modeling. Cardoso JG, Andersen MR, Herrgård MJ, Sonnenschein N. Cardoso JG, et al. Front Bioeng Biotechnol. 2015 Feb 16;3:13. doi: 10.3389/fbioe.2015.00013. eCollection 2015. Front Bioeng Biotechnol. 2015. PMID: 25763369 Free PMC article. Review.
  • Network reconstruction of platelet metabolism identifies metabolic signature for aspirin resistance. Thomas A, Rahmanian S, Bordbar A, Palsson BØ, Jamshidi N. Thomas A, et al. Sci Rep. 2014 Jan 29;4:3925. doi: 10.1038/srep03925. Sci Rep. 2014. PMID: 24473230 Free PMC article.
  • In-silico models of stem cell and developmental systems. Setty Y. Setty Y. Theor Biol Med Model. 2014 Jan 8;11:1. doi: 10.1186/1742-4682-11-1. Theor Biol Med Model. 2014. PMID: 24401000 Free PMC article. Review.
  • Annotating individual human genomes. Torkamani A, Scott-Van Zeeland AA, Topol EJ, Schork NJ. Torkamani A, et al. Genomics. 2011 Oct;98(4):233-41. doi: 10.1016/j.ygeno.2011.07.006. Epub 2011 Aug 2. Genomics. 2011. PMID: 21839162 Free PMC article. Review.
  • Insight into human alveolar macrophage and M. tuberculosis interactions via metabolic reconstructions. Bordbar A, Lewis NE, Schellenberger J, Palsson BØ, Jamshidi N. Bordbar A, et al. Mol Syst Biol. 2010 Oct 19;6:422. doi: 10.1038/msb.2010.68. Mol Syst Biol. 2010. PMID: 20959820 Free PMC article.
  • Vo TD, Greenberg HJ, Palsson BO. Reconstruction and functional characterization of the human mitochondrial metabolic network based on proteomic and biochemical data. The Journal of biological chemistry. 2004;279:39532–39540. doi: 10.1074/jbc.M403782200. - DOI - PubMed
  • Feist AM, Palsson BO. The growing scope of applications of genome-scale metabolic reconstructions using Escherichia coli. Nature biotechnology. 2008;26:659–667. doi: 10.1038/nbt1401. - DOI - PMC - PubMed
  • Saucerman JJ, Brunton LL, Michailova AP, McCulloch AD. Modeling beta-adrenergic control of cardiac myocyte contractility in silico. The Journal of biological chemistry. 2003;278:47997–48003. doi: 10.1074/jbc.M308362200. - DOI - PubMed
  • Noble D. Modeling the heart – from genes to cells to the whole organ. Science. 2002;295:1678–1682. doi: 10.1126/science.1069881. - DOI - PubMed
  • Kell D. Metabolomics and systems biology: making sense of the soup. Curr Opin Microbiol. 2004;7:296–307. doi: 10.1016/j.mib.2004.04.012. - DOI - PubMed

Publication types

  • Search in MeSH

Related information

  • PubChem Compound
  • PubChem Substance

Grants and funding

  • R01 GM068837/GM/NIGMS NIH HHS/United States
  • T32 HL007089/HL/NHLBI NIH HHS/United States
  • GM68837/GM/NIGMS NIH HHS/United States

LinkOut - more resources

Full text sources.

  • BioMed Central
  • Europe PubMed Central
  • PubMed Central

Miscellaneous

  • NCI CPTAC Assay Portal

full text provider logo

  • Citation Manager

NCBI Literature Resources

MeSH PMC Bookshelf Disclaimer

The PubMed wordmark and PubMed logo are registered trademarks of the U.S. Department of Health and Human Services (HHS). Unauthorized use of these marks is strictly prohibited.

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

The PMC website is updating on October 15, 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • Nature Portfolio

Logo of npgopen

Dissecting cell identity via network inference and in silico gene perturbation

Kenji kamimoto.

1 Department of Developmental Biology, Washington University School of Medicine in St Louis, St Louis, MO USA

2 Department of Genetics, Washington University School of Medicine in St Louis, St Louis, MO USA

3 Center of Regenerative Medicine, Washington University School of Medicine in St Louis, St Louis, MO USA

Blerta Stringa

Christy m. hoffmann, kunal jindal, lilianna solnica-krezel, samantha a. morris, associated data.

All data, including sequencing reads and single-cell expression matrices, are available from the GEO under accession codes GSE72859 (ref. 16 ), GSE112824 (ref. 32 ) and GSE145298 for the zebrafish profiling from this study; and from ArrayExpress under accession codes E-MTAB-7325 ( Tal1 −/− chimeras) and E-MTAB-7324 (wild-type chimeras). Simulations can be explored at https://celloracle.org .

CellOracle code, documentation and tutorials are available at GitHub: https://github.com/morris-lab/CellOracle .

Cell identity is governed by the complex regulation of gene expression, represented as gene-regulatory networks 1 . Here we use gene-regulatory networks inferred from single-cell multi-omics data to perform in silico transcription factor perturbations, simulating the consequent changes in cell identity using only unperturbed wild-type data. We apply this machine-learning-based approach, CellOracle, to well-established paradigms—mouse and human haematopoiesis, and zebrafish embryogenesis—and we correctly model reported changes in phenotype that occur as a result of transcription factor perturbation. Through systematic in silico transcription factor perturbation in the developing zebrafish, we simulate and experimentally validate a previously unreported phenotype that results from the loss of noto , an established notochord regulator. Furthermore, we identify an axial mesoderm regulator, lhx1a . Together, these results show that CellOracle can be used to analyse the regulation of cell identity by transcription factors, and can provide mechanistic insights into development and differentiation.

A machine-learning-based strategy called CellOracle combines computational perturbation with modelling of gene-regulatory networks to analyse how cell identity is regulated by transcription factors, and correctly predicts phenotypic changes after transcription factor perturbation in the developing zebrafish.

The expansion of single-cell technologies into perturbational omics is enabling the development of methods to characterize cell identity. For example, single-cell RNA sequencing (scRNA-seq) coupled with pooled CRISPR screens offers much promise for analysing the genetic regulation of cell identity 2 – 4 , but cannot be readily used in many biological contexts. Computational methods to simulate single-cell phenotypes after perturbation are emerging, although many approaches still require experimental perturbation data for model training, and thus their scale and application are limited 5 . Moreover, previous deep-learning-based models represent a ‘black box’, which restricts the interpretation of gene-regulatory mechanisms that underlie the simulated biological events. In this respect, gene-regulatory network (GRN) modelling approaches are promising as they reconstruct systematic gene–gene associations from unperturbed single-cell omics data 6 – 11 . However, previous methods for analysing GRNs largely focus on the static network structure, and determining how a static GRN governs cell identity during dynamic biological processes therefore remains a challenge. Scalable and interpretable approaches are required to understand how gene-regulatory mechanisms relate to observed complex single-cell phenotypes.

Here we present a strategy that overcomes these limitations by combining computational perturbation with GRN modelling. CellOracle integrates multimodal data to build custom GRN models that are specifically designed to simulate shifts in cell identity following transcription factor (TF) perturbation, providing a systematic and intuitive interpretation of context-dependent TF function in regulating cell identity. We apply CellOracle to well-characterized biological systems: haematopoiesis in mice and humans; and the differentiation of axial mesoderm into notochord and prechordal plate in zebrafish. In haematopoiesis, we show that CellOracle recapitulates well-known cell fate regulation governed by TFs. Furthermore, we apply CellOracle to systematically perturb TFs across zebrafish development, recovering known and putative regulators of cell identity. Focusing on axial mesoderm, we predict and validate a prechordal plate phenotype after loss of function (LOF) of the prototypical notochord regulator, noto . Moreover, we also simulate and validate a role for the TF lhx1a in the development of axial mesoderm. Together, these results show that CellOracle can be used to infer and interpret cell-type-specific GRN configurations at high resolution, enabling mechanistic insights into the regulation of cell identity. CellOracle code and documentation are available at https://github.com/morris-lab/CellOracle and data can be explored at https://celloracle.org .

In silico gene perturbation using CellOracle

To gain mechanistic insight into the regulation of cell identity, we developed an in silico strategy to simulate changes in cell identity upon TF perturbation. CellOracle uses custom GRN modelling (Extended Data Fig. ​ Fig.1a) 1a ) to simulate global downstream shifts in gene expression following knockout (KO) or overexpression of TFs. These simulated values are converted into a vector map of transitions in cell identity, which enables simulated changes in cell identity to be intuitively visualized within a low-dimension space (Fig. ​ (Fig.1a 1a and Methods ). In silico perturbation involves four steps. (1) Cell-type- or cell-state-specific GRN configurations are constructed using cluster-wise regularized linear regression models with multi-omics data. (2) Using these GRN models, shifts in target gene expression in response to TF perturbation are calculated. This step applies the GRN model as a function to propagate the shift in gene expression rather than the absolute gene expression value, representing the signal flow from TF to target gene. This signal is propagated iteratively to calculate the broad, downstream effects of TF perturbation, allowing the global transcriptional ‘shift’ to be estimated (Extended Data Fig. 1b–d ). (3) The cell-identity transition probability is estimated by comparing this shift in gene expression to the gene expression of local neighbours. (4) The transition probability is converted into a weighted local average vector to represent the simulated directionality of cell-state transition for each cell following perturbation of candidate TFs. In the final calculation step, the multi-dimensional gene expression shift vector is reduced to a two-dimensional (2D) vector, allowing for more robust predictions against noise (Extended Data Fig. ​ Fig.1e). 1e ). We purposefully limit the simulation output data to a 2D vector representing the predicted shift in cell identity because our goal is to model changes in identity rather than predicting absolute changes in gene expression levels. Further details of the CellOracle algorithm are provided in the Methods , including validation of the range of simulated values; null or randomized model analysis; and hyperparameter evaluation (Supplementary Figs. 2 – 10 ).

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig1_HTML.jpg

a , Simulation of cell-state transitions in response to TF perturbation. First, CellOracle constructs custom transcriptional GRNs using scRNA-seq and scATAC-seq data (left). Accessible promoter and enhancer peaks from scATAC-seq data are then combined with scRNA-seq data to generate cluster-specific GRN models (middle). CellOracle simulates the change in cell state in response to a TF perturbation, projecting the results onto the cell trajectory map (right). b , Force-directed graph of 2,730 myeloid progenitor cells from Paul et al. 16 . Twenty-four cell clusters (Louvain clustering) were organized into six main cell types. Mk, megakaryocytes. c , Differentiation vectors for each cell projected onto the force-directed graph. d , CellOracle simulation of cell-state transition in Spi1 KO simulation. Summarized cell-state transition vectors projected onto the force-directed graph. Vectors for each cell are shown in the inset. e , Spi1 KO simulation vector field with perturbation scores (PSs). f , Gata1 KO simulation with perturbation scores. g , Schematic of Spi1 – Gata1 lineage switching. MPP, multipotent progenitor. h , Detail of Gata1 simulation for the granulocyte branch. Left, cell-state transition vectors for each cell. Right, summarized vectors. i , Systematic KO simulation result of 90 TFs in the GM and ME lineage is summarized as a scatter plot of the sum of negative perturbation scores (shown in log scale). Dashed lines represent cut-off values corresponding to false-positive rate (FPR) = 0.01. Genes are classified into four categories on the basis of their previously reported functions (Supplementary Table 2 ). The asterisk refers to Supplementary Fig. 11, where we expand on the predicted phenotype. All scores can be explored through our web application ( https://celloracle.org ).

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig6_ESM.jpg

(a) Overview of the CellOracle context-dependent GRN model construction method. First, genomic DNA sequence and TF-binding-motif information provide all potential regulatory links to construct a ‘base GRN.’ CellOracle uses scATAC-seq data to identify accessible promoter and enhancer DNA sequences in this step. The DNA sequence of these regulatory elements is scanned for TF-binding motifs, generating a list of potential regulatory connections between a TF and its target genes (left). Next, active connections (described below), dependent on cell state or cell type, are identified from all potential connections in the base GRN. CellOracle builds machine-learning (ML) models for this step that predict the quantitative relationship between the TF and the target gene. The ML model fitting results present the certainty of connection as a distribution, enabling the identification of GRN configurations by removing inactive connections from the base GRN structure. (b—d) Overview of signal propagation simulation. CellOracle leverages an inferred GRN model to simulate how target gene expression changes in response to the changes in regulatory gene expression. (b) The input TF perturbation (shown in yellow) is propagated side-by-side within the network model. (c) Input data and GRN coefficient matrix format used in the signal propagation calculation. (d) Leveraging the linear predictive ML algorithm features, CellOracle uses the GRN model as a function to perform the signal propagation calculation. Iterative matrix multiplication steps enable the estimation of indirect and global downstream effects resulting from the perturbation of a single TF. (e) After signal propagation, the simulated gene expression shift vector is converted into a 2D vector and projected onto the low-dimensional space. Details are described in the Methods.

GRN inference and benchmarking with CellOracle

The CellOracle GRN model must represent regulatory connections as a directed network edge to support signal propagation in response to TF perturbation. Thus, we developed a custom GRN modelling method motivated by previous approaches that incorporate promoter and TF-binding information with scRNA-seq data to infer a directional GRN 7 (Extended Data Fig. ​ Fig.1a 1a and Methods ). First, using single-cell chromatin accessibility data (single-cell assay for transposase-accessible chromatin using sequencing; scATAC-seq), we incorporate flexible promoter and enhancer regions, encompassing proximal and distal regulatory elements. This initial step uses the transcriptional start site (TSS) database ( http://homer.ucsd.edu/ ) and Cicero, an algorithm that identifies co-accessible scATAC-seq peaks, to distinguish accessible promoters and enhancers 12 . The DNA sequence of these elements is then scanned for TF-binding motifs, generating a ‘base GRN structure’ of all potential regulatory interactions in the species of interest (Extended Data Fig. ​ Fig.1a, 1a , left). This process is beneficial as it narrows the scope of possible regulatory candidate genes before model fitting (below) and helps define the directionality of regulatory edges in the GRN. To support GRN inference without requiring sample-specific scATAC-seq datasets, we have assembled a base GRN from a mouse scATAC-seq atlas 13 . We have also created general promoter base GRNs for ten commonly studied species (Supplementary Table 1 and Methods ). These base GRNs are built into the CellOracle library and provide an alternative solution when scATAC-seq data are unavailable.

In the second step of CellOracle GRN inference, we use scRNA-seq data to identify active connections in the base GRN, generating cell-type- or cell-state-specific GRN configurations for each cluster. In this step, we build a machine-learning model to predict the expression of target genes on the basis of TF expression (Extended Data Fig. ​ Fig.1a, 1a , right). Because CellOracle uses genomic sequences and information on TF-binding motifs to infer the base GRN structure and directionality, it does not need to infer the causality or directionality of the GRN from expression data. This approach allows CellOracle to adopt a relatively simple modelling method for GRN inference—a regularized linear machine-learning model. Crucially, this strategy enables the above signal propagation to simulate TF perturbation. To support the use of a linear model, the gene expression matrix of scRNA-seq data is divided into several clusters in advance so that a single data unit for each fitting process represents a linear relationship rather than non-linear or mixed regulatory relationships. Furthermore, a Bayesian or bagging strategy enables the certainty of connection to be presented as a distribution; this allows weak or insignificant connections to be removed from the base GRN (Extended Data Fig. ​ Fig.1a, 1a , right), producing a cell-type- or cell-state-specific GRN configuration.

To benchmark our GRN inference method, we generated a comprehensive transcriptional ground-truth GRN using 1,298 chromatin immunoprecipitation followed by sequencing (ChIP–seq) datasets for 80 regulatory factors across 5 different tissues 14 . In addition to benchmarking against diverse GRN inference algorithms, we also assessed the performance of our approach using different base GRNs, data sources and cell downsampling (Extended Data Fig. ​ Fig.2). 2 ). Inference performance as assessed by the area under the receiver operating characteristic (AUROC) ranged from 0.66 to 0.85 for the promoter base GRN and 0.73 to 0.91 for the scATAC-seq base GRN. Altogether, this benchmarking demonstrates the accuracy of our transcriptional GRN modelling method with a diverse range of data sources. Combined with our signal propagation strategy, CellOracle can effectively interrogate network biology and cell-identity dynamics through in silico perturbation.

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig7_ESM.jpg

(a,b) We benchmarked the CellOracle GRN modelling method against pre-existing GRN inference algorithms: WGCNA, DCOL, GENIE3, and SCENIC. Details of input data and ground-truth data are described in the Methods. We generated a base GRN using the Cusanovich mouse sci-ATAC-seq atlas dataset 13 or UCSC mm9 promoter DNA sequence data. CellOracle scored better than or comparable to other algorithms. CellOracle results with a promoter base GRN received lower but comparable scores than the scATAC-seq base GRN results. In addition, we tested the CellOracle GRN method using two impaired base GRN datasets (Scrambled motif base GRN and no base GRN) to investigate how the base GRN data contributes to its performance. (a) AUROC (Area Under the Receiver Operating Characteristic curve) heat map. The top score in each condition is highlighted with a red rectangle. (b) EPR (Early Precision Ratio) heat map. EPR represents the EP ratio relative to the random model ER score. An EPR of less than 1 indicates that the GRN inference results are no better than random prediction. (c,d) The performance of CellOracle was tested after downsampling cells. GRN models were made after downsampling to 400, 200, 100, 50, 25, and 10 cells. We recommend at least 50 cells for GRN inference based on these results. CellOracle used the same mouse scATAC-seq base GRN as a and b . The Liver_2 sample contains less than 400 cells. (e,f) GRN inference performance comparison between different base GRN data generated from various tissue types. The top score in each condition is highlighted with a red rectangle.

GRN analysis and TF KO in haematopoiesis

For validation, we aimed to reproduce known TF regulation of mouse haematopoiesis, a well-characterized differentiation paradigm 15 , by applying CellOracle to a 2,730-cell scRNA-seq atlas of myeloid progenitor differentiation 16 (Fig. ​ (Fig.1b 1b and Extended Data Fig. ​ Fig.3a). 3a ). We constructed GRN models for each of the 24 myeloid clusters identified, representing megakaryocyte and erythroid progenitors (MEPs) and granulocyte–monocyte progenitors (GMPs), differentiating toward erythrocytes, megakaryocytes, monocytes and granulocytes (Fig. ​ (Fig.1c). 1c ). To test whether the CellOracle simulation could recapitulate known TF regulation of cell identity, we performed in silico gene perturbation using the inferred GRNs, and compared the CellOracle KO simulation results with previous biological knowledge and ground-truth KO data.

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig8_ESM.jpg

(a) Force-directed graph of 2,730 myeloid progenitor cells from Paul et al. 16 with all clusters labelled. DC = Dendritic Cell; Ery = Erythrocyte; GMP = Granulocyte–Monocyte Progenitor; Gran = Granulocyte; Lym = Lymphoid; MEP = Megakaryocyte–Erythrocyte Progenitor; Mk = Megakaryocyte; Mo = Monocyte. We removed the DC and Lymphoid cell clusters to focus on myeloid cell differentiation. (b) Degree distribution of the MEP_0 cluster GRN model. After making the GRN model for each cluster, network edges were pruned. Then, we counted the network degree (k), representing the number of network edges for each gene. P(k) is the frequency of network degree k. The relationship between k and P(k) was visualized after log transformation to test whether the data follow a power law, in which there is a linear relationship between log(k) and log(P(k)). The R-squared value (R 2 ) was calculated to quantify the degree of the linear relationship. The same analysis was performed on the randomized GRN (lower panel). (c) Top 30 genes ranked by degree centrality in the MEP_0 cluster GRN. (d) Gata1 gene expression (log-transformed UMI) projected onto the force-directed graph (left) and violin plot grouped by cell-type annotation (right). (e) Spi1 gene expression (log-transformed UMI) projected onto the force-directed graph (left) and violin plot grouped by cell-type annotation (right). (f) Systematic KO simulation of TFs in the GM (Granulocyte–Monocyte) and ME (Megakaryocyte–Erythrocyte) lineages. The sum of the negative perturbation scores is calculated for each TF to quantify the perturbation effect along each lineage. (g) Negative PS sum cut-off value calculation. Cut-off values were calculated for GM and ME lineage simulations based on the distribution of PS sum score calculated from the randomized simulation result (false-positive rate (FPR) = 0.01).

First, Spi1 (also known as PU.1 ) and Gata1 KO simulation is used to illustrate the CellOracle in silico perturbation analysis. The TF perturbation simulation is visualized as a vector map on the 2D trajectory space (Fig. ​ (Fig.1d 1d and Supplementary Video  1 ), representing a potential shift in cell identity in response to TF perturbation. To enable the simulation results to be assessed systematically and objectively, we also devised a ‘perturbation score’ metric, which compares the directionality of the perturbation vector to the natural differentiation vector (Extended Data Fig. ​ Fig.4). 4 ). A negative perturbation score suggests that TF KO delays or blocks differentiation (Extended Data Fig. 4b–d , purple). Conversely, a positive perturbation score suggests that the differentiation and KO simulation vectors share the same direction, indicating that loss of TF function promotes differentiation (Extended Data Fig. 4b–d , green). Spi1 KO simulation yielded positive perturbation scores for MEPs, whereas GMPs had negative perturbation scores (Fig. ​ (Fig.1e), 1e ), suggesting that Spi1 KO inhibits GMP differentiation and promotes MEP differentiation. Inverse perturbation score distributions were produced for the Gata1 KO simulation (Fig. ​ (Fig.1f). 1f ). Comparing these predictions to previous reports 17 , 18 : PU.1 directs commitment to the neutrophil and monocyte lineages 19 , 20 , whereas GATA1 promotes the differentiation of erythroid cells 21 and eosinophil granulocytes 22 – 24 . Overall, CellOracle accurately simulated the myeloid lineage switching governed by Gata1 and Spi1 (refs. 15 , 25 – 27 ; Fig. ​ Fig.1g), 1g ), including a relatively mild Gata1 KO phenotype in early granulocyte differentiation (Fig. ​ (Fig.1h), 1h ), which cannot be inferred from the low levels of Gata1 expression in granulocytes (Extended Data Fig. ​ Fig.3d). 3d ). However, CellOracle did not detect a previously reported depletion of erythrocyte progenitors after Spi1 KO 27 , 28 , probably owing to changes in cell proliferation that are not predicted by the method.

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig9_ESM.jpg

(a—d) Schematic for perturbation score (PS) calculation. CellOracle calculates a PS by comparing the direction of the simulated cell state transition with the direction of cell differentiation. (a) Schematic for differentiation vector calculations. First, the pseudotime data are summarized by grid points. Then, CellOracle calculates a 2D gradient vector of the pseudotime data representing the directionality of differentiation pseudotime. (b) Calculation of the inner-product value between the differentiation vector and gene perturbation vectors. First, the results of the perturbation simulation are converted into the same vector field format as the differentiation vector field, and the inner product of these vectors is calculated to produce a PS. (c) A positive PS (magenta) suggests the perturbation vector and differentiation vector share a similar direction, thus, suggesting the TF perturbation would promote differentiation. In contrast, a negative PS (green) represents inhibited differentiation. (d) Schematic for perturbation score interpretation. A positive perturbation score (green) predicts that the perturbation promotes differentiation. A negative perturbation score (purple) represents inhibited differentiation.

We next evaluated eight additional TFs that have established roles in myeloid differentiation: Klf1 (also known as Eklf) , Gfi1b , Fli1 , Gfi1 , Gata2 , Lmo2 , Runx1 and Irf8 (refs. 15 , 29 ). CellOracle also correctly reproduced their reported KO phenotypes (Extended Data Figs. ​ Figs.5 5 and ​ and6), 6 ), which we extended to two additional datasets of mouse and human haematopoiesis (Extended Data Figs. ​ Figs.7 7 and ​ and8 8 and Supplementary Figs. 13 and 14 ). In addition, we scaled up our simulation to all TFs that passed filtering ( Methods ) to systematically perturb 90 TFs in the dataset in the context of granulocyte–monocyte (GM) and megakaryocyte–erythroid (ME) differentiation. The reported cell-fate-regulatory functions of these TFs fall into three major categories: (1) ME lineage differentiation; (2) GM lineage differentiation; and (3) ME and GM lineage differentiation and maintenance of haematopoietic stem cell (HSC) identity (Supplementary Table 2 ). We ranked the TFs on the basis of the sum of the negative perturbation score in the KO simulation, representing the potential of a TF potential to promote differentiation ( Methods and Extended Data Fig. ​ Fig.3f 3f ).

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig10_ESM.jpg

a – d , CellOracle KO simulation for four key TF regulators of haematopoiesis: Klf1 ( a ), Gfi1b ( b ), Gfi1 ( c ) and Irf8 ( d ) reported in 15 , 29 . The simulated cell state transition vector field is visualized with perturbation scores (PS; magenta: negative score; green: positive score). The right column shows a summary of the TF role based on the CellOracle simulation results, cell transition vector, and PS. For example, a positive PS in the TF KO simulation (green) implies that the TF has a role in cell state maintenance or inhibiting cell differentiation. In contrast, a negative PS in the KO simulation (magenta) implies that the TF normally promotes cell differentiation.

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig11_ESM.jpg

a – d , CellOracle KO simulation results for Gata2 ( a ), Runx1 ( b ), Fli1 ( c ) and Lmo2 ( d ). The simulated cell state transition vector field is visualized with perturbation scores (PS; magenta: negative score; green: positive score). The right column shows a summary of the TF role based on the CellOracle simulation results, cell transition vector, and PS. For example, a positive PS in the TF KO simulation (green) implies that the TF has a role in cell state maintenance or inhibiting cell differentiation. In contrast, a negative PS in the KO simulation (magenta) implies that the TF normally promotes cell differentiation.

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig12_ESM.jpg

a , Force-directed graph of 44,082 myeloid progenitor cells from Dahlin et al. 58 with all clusters labelled. MPP = Multipotent Progenitor; GMP = Granulocyte–Monocyte Progenitor; Gran = Granulocyte; LP = Lymphoid progenitor; MEP = Megakaryocyte–Erythrocyte Progenitor; Mk = Megakaryocyte; Mo = Monocyte; Baso = Basophil. (b) Marker gene expression (log-transformed UMI) projected onto the force-directed graph. Procr = MPP marker; Epor = Erythrocyte marker; Itga2b = Mk marker; Flt3 = LP marker; Mpo = Gran/Mo marker; Ms4a2 = Baso marker. (c) Pseudotime values projected onto the force-directed graph. (d) Differentiation vector calculated from the pseudotime gradient. ME and GM lineages are highlighted. (e) Csf1r and Cebpε gene expression projected onto the force-directed graph. The right panel is a magnified area of the GM lineage. Csf1r is a monocyte marker, and Cebpε is a granulocyte marker. (f) Early lineage bifurcation between monocytes and granulocytes is observed on the force-directed graph.

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig13_ESM.jpg

(a) Force-directed graph of 5,610 myeloid progenitor cells from Setty et al. 67 with all clusters labelled. MPP = Multipotent Progenitor; GMP = Granulocyte–Monocyte Progenitor; Gran = Granulocyte; MEP = Megakaryocyte–Erythrocyte Progenitor; Mk = Megakaryocyte. (b) Marker gene expression (log-transformed UMI) projected onto the force-directed graph. PROCR = MPP marker; EPOR = Erythrocyte marker; ITGA2B = Mk marker; FLT3 = LP marker; MPO = Gran/Mo marker; MS4A2 = Baso marker. (c) Pseudotime values projected onto the force-directed graph. (d) Differentiation vector calculated from the pseudotime gradient. ME and GM lineages are highlighted. (e) CSF1R and CEBPE gene expression projected onto the force-directed graph. The right panel is a magnified area of the GM lineage. The CSF1R is a monocyte marker, and CEBPE is a granulocyte marker. (f) Early lineage bifurcation between monocytes and granulocytes is observed on the force-directed graph.

To summarize this systematic TF perturbation, the summed negative perturbation scores are shown on a scatter plot (Fig. ​ (Fig.1i). 1i ). The dashed lines represent cut-off values calculated with a randomized vector (Extended Data Fig. ​ Fig.3g). 3g ). The distribution of negative perturbation score sums for all TF KOs was highly consistent with known TF functions in differentiation. For example, TFs involved in ME lineage differentiation are enriched on the top left side of the scatter plot. By contrast, GM differentiation factors are found at the bottom right. TFs that regulate both lineages are located on the top right side, whereas the lower-ranked factors are enriched for TFs that have not been reported to regulate blood differentiation (Fig. ​ (Fig.1i). 1i ). Overall, 85% of the top 30 TFs ranked by this objective, systematic perturbation strategy are reported regulators of myeloid differentiation (Supplementary Table 2 ). Of the remaining TFs, several have no reported phenotypes in haematopoiesis at present, and therefore represent putative regulators. We note that the negative perturbation score metric does not always convey all information of the vector field, which might oversimplify the role of a TF. For example, Elf1 has a negative perturbation score in both the ME and the GM lineage, and its function is unclear on the summarized perturbation score plot; however, closer inspection of the vector reproduced its reported phenotype in the ME lineage, highlighting the importance of investigating the simulation output (Supplementary Fig. 11 ). Finally, we directly compared the output of CellOracle to existing methods for identifying regulatory TFs using gene expression and chromatin accessibility, demonstrating the unique insights into context-dependent TF regulation that CellOracle can provide (Supplementary Figs. 15 and 16 ).

We further validated CellOracle simulation by focusing on several genes for which experimental KO scRNA-seq data are available: Cebpa , Cebpe and Tal1 (refs. 16 , 30 ). Cebpa is necessary for the initial differentiation of GMPs, and its loss leads to a marked decrease in differentiated myeloid cells, accompanied by an increase in erythroid progenitors. By contrast, Cebpe is not required for initial GMP differentiation, but it is essential for the subsequent maturation of GMPs into granulocytes 16 . Notably, when we compare the simulation results to the experimental KO cell distribution, we must again consider the effects of TF perturbation in the context of natural cell differentiation (Fig. ​ (Fig.2a). 2a ). Thus, we performed a Markov random walk simulation based on the differentiation and simulation vectors to estimate how TF perturbation leads to changes in cell distribution (Supplementary Fig. 17 and Methods ). For Cebpa , CellOracle simulation predicted that differentiation is inhibited at GMP–late GMP clusters, whereas early erythroid differentiation is promoted (Fig. ​ (Fig.2b). 2b ). The simulation recapitulates the experimental cell distribution (Fig. 2b,d ). For Cebpe , CellOracle again correctly modelled the inhibition of differentiation at the entry stage of granulocyte differentiation (Fig. ​ (Fig.2c), 2c ), consistent with experimental KO data (Fig. ​ (Fig.2d 2d ).

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig2_HTML.jpg

a , Biological interpretation of perturbation scores (estimation of cell density based on perturbation score). Case 1: the differentiation and perturbation simulation vectors share the same direction, indicating a population shift towards a more differentiated identity. Case 2: the two vectors are opposed, suggesting that differentiation is inhibited. Case 3: predicted inhibition precedes promotion; thus, cells will be likely to accumulate. b , c , CellOracle Cebpa KO ( b ) and Cebpe KO ( c ) simulations showing cell-state transition vectors, perturbation scores and estimated cell density (Markov simulation). Right, schematics of simulated phenotype. Ery, erythrocyte. d , Ground-truth experimental cell density plot of wild-type (WT) cells, Cebpa KO cells and Cebpe KO cells in the force-directed graph embedding space. Estimated kernel density data are shown as a contour line on a scatter plot to depict cell density. e , Cell-type proportions in the WT and ground-truth KO samples. Gra, granulocyte; KDE, kernel density estimation; Mo, monocyte.

We also analysed a single-cell atlas of mouse organogenesis 30 to simulate the loss of Tal1 function (Extended Data Fig. 9a–d). CellOracle reproduced the inhibited differentiation of haematoendothelial progenitors in the Tal1 KO 30 (Extended Data Fig. 9e–h ). In addition, CellOracle showed that loss of Tal1 in later stages of erythroid differentiation does not block cell differentiation (Extended Data Fig. 9i,j ), consistent with previous conditional Tal1 KO experiments at equivalent stages 31 . Together, these results show that CellOracle effectively simulates cell-state-specific TF function, corroborating previous knowledge of the mechanisms that regulate cell fate in haematopoiesis and ground-truth in vivo phenotypes. Furthermore, systematic KO simulations demonstrate that CellOracle enables objective and scalable in silico gene perturbation analysis.

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig14_ESM.jpg

(a) UMAP plot of chimeric E8.5 embryos of wild-type (WT) and Tal1 KO cells (25,307 cells and 26,311 cells, respectively) from a published scRNA-seq atlas of mouse gastrulation and organogenesis 30 . (b) Tal1 gene expression (log-transformed UMI) projected onto the UMAP plot. (c) Pseudotime gradient vector field used in the perturbation score (PS) calculations. Developmental pseudotime was calculated using the DPT method with WT chimera scRNA-seq data and then converted into a 2D gradient vector field. (d) PS and cell transition vector field of the Tal1 KO simulation. (e) The magnified area of erythrocyte differentiation predicts inhibition or arrest of cell differentiation at the haematoendothelial progenitor stage. (f) The Markov random walk simulation result predicts high cell density in the haematoendothelial progenitor cluster and lower cell density at later stages, indicating that Tal1 KO would induce differentiation arrest at the haematoendothelial progenitor stage. (g) Experimentally measured Tal1 KO data. The kernel cell density of whole chimera (left), WT (middle), and Tal1 KO cells (right) were calculated after downsampling each condition (25,307 cells) to control for sample size. A scatter plot of whole chimera cells is shown as background (light grey) to highlight the overall cell trajectory structure. (h) The bar plot shows the cell type composition in each sample (right panel). Overall, the experimental result aligns with the simulated predictions. The relative fold change between WT and KO samples is also shown in Supplementary Table 4 . (i) Perturbation score and cell transition vector field of the Tal1 conditional KO simulation in the erythroid lineage. Tal1 expression was set to zero in the Blood progenitor and Erythrocyte clusters; CellOracle simulates KO effects in later erythroid differentiation stages. (j) The Markov simulation result shows uniform cell density, predicting that Tal1 KO would not induce differentiation arrest in a conditional KO targeting later stages of erythroid differentiation.

Systematic TF KO simulations in zebrafish

Next, we applied CellOracle to systematically perturb TFs across zebrafish development. We made use of a 38,731-cell atlas of zebrafish embryogenesis published in a study by Farrell et al. 32 , comprising 25 developmental trajectories that span zygotic genome activation to early somitogenesis. We first inferred GRN configurations for the 38 cell types and states identified in the Farrell et al. study 32 , splitting the main branching trajectory into four sub-branches: ectoderm; axial mesoderm; other mesendoderm; and germ layer branching point (Extended Data Fig. 10a,b ). In the absence of scATAC-seq data, we constructed a base GRN using promoter information from the UCSC database, obtaining information on TF-binding motifs from the Danio rerio CisBP motif database ( Methods ). Our benchmarking has shown that this approach produces reliable GRN inference (Extended Data Fig. ​ Fig.2). 2 ). After preprocessing and GRN inference, we performed KO simulations for all TFs with inferred connections to at least one other gene ( n  = 232 ‘active’ TFs; Methods ). The results of these simulations across all developmental trajectories can be explored at https://www.celloracle.org .

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig15_ESM.jpg

(a) 2D force-directed graph of a published atlas 32 of zebrafish embryogenesis (n = 25,711 cells). (b) Main trajectory partitioned into four sub-branches. (c) Bar plots depicting the number of TFs after variable gene selection (black), the number of TFs with >1 network edge in the inferred GRN model (dark grey), and the number of TFs expressed in >1% of cells (light grey). (d) CellOracle noto LOF simulation result (left) and simulation results with a randomized GRN model (right) for the notochord lineage. Simulated cell state transitions for each cell were converted to a vector field and visualized with a scatter plot (shown in grey). (e) Noto LOF simulation for the prechordal plate lineage. (f) CellOracle noto LOF simulation vector is shown at single-cell resolution. Cells in the Notochord cluster are shown in orange, while the Prechordal Plate cells are shown in blue. The right panel is the magnified area. (g) Force-directed graph of the Other mesendoderm sub-branch with cell cluster annotations from the Farrell et al. study 32 (n = 10,265 cells). (h) Pseudotime data are projected onto the force-directed graph. (i) The Somite lineage, defined in the previous Farrell et al. study 32 , is in red. (j) Pseudotime gradient vector field calculated for the Somite lineage. (k) Noto LOF simulation vector field in the cells of the Somite lineage are shown with perturbation scores.

Our systematic TF KO simulation provides a valuable resource for identifying regulators of early zebrafish development and enables candidates to be prioritized for experimental validation. To further examine this comprehensive perturbation atlas, we focused on axial mesoderm differentiation, spanning 4.3 to 12 h post-fertilization (hpf) (Fig. 3a,b and Extended Data Fig. 10a,b ). This midline structure bifurcates into notochord and prechordal plate lineages, representing a crucial patterning axis 33 , and has been extensively characterized, in part through large-scale genetic screens 34 . For these lineages, we performed systematic TF KO simulation and network analysis for 232 candidate TFs (Extended Data Fig. 10c ). CellOracle ranked noto , a well-characterized TF regulator of notochord development, as the top TF on the basis of degree centrality, along with other known regulators of notochord development (Fig. ​ (Fig.3c). 3c ). Degree centrality is a straightforward measure that reports how many edges (genes) are directly connected to a node (TF); highly connected nodes are likely to be essential for a biological process 35 , 36 . In zebrafish floating head n1/n1 ( flh n1/n1 ) mutants, which lack a functional noto gene ( noto is also known as flh ) 37 , axial mesoderm does not differentiate into notochord, and assumes a somitic mesoderm fate instead 38 . Noto LOF simulation correctly reproduced the loss of notochord (Fig. 3d–f and Extended Data Fig. 10d–f ), in addition to enhanced somite differentiation (Extended Data Fig. 10g–k ). Moreover, CellOracle predicted a previously unknown (to our knowledge) consequence of noto LOF: enhanced prechordal plate differentiation (Fig. 3e,f ). We also noted that later stages of notochord differentiation received a positive perturbation score, indicating that continued expression of noto is not required for notochord differentiation. Alternatively, this finding could suggest that downregulation of noto is required for notochord maturation.

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig3_HTML.jpg

a , Two-dimensional force-directed graph of the axial mesoderm (AM) sub-branch ( n  = 1,669 cells) in a published zebrafish embryogenesis atlas (Farrell et al. 32 ). Arrows indicate notochord cell differentiation (top) and prechordal plate differentiation (bottom). b , Conversion of URD-calculated pseudotime (left) into a 2D pseudotime gradient vector field (right). c , Degree centrality scores were used to rank the top 30 TFs in notochord (left) and prechordal plate (right). Black text denotes TFs. Grey text denotes non-TFs. d , Expression of noto projected onto the axial mesoderm sub-branch. e , Noto KO simulation vector and perturbation scores. f , Markov simulation to estimate cell density in the noto KO sample. The simulation predicted inhibited early notochord differentiation and promotion of prechordal plate differentiation, indicating a potential lineage switch.

Experimental validation of noto LOF

Next, we experimentally validated the predicted expansion of prechordal plate after noto LOF. First, we generated a 38,606-cell wild-type (WT) reference atlas from dissociated WT embryos at 6, 8 and 10 hpf (2 technical replicates per stage) and used Seurat’s label transfer function 39 to cluster and label the WT reference cells according to the annotations in Farrell et al. 32 (Extended Data Fig. ​ Fig.11). 11 ). Subsetting the axial mesoderm clusters showed the expected bifurcation of cells into notochord and prechordal plate, accompanied by upregulation of marker genes (Fig. 4a,b ). For visualization of axial mesoderm cells, we used a uniform manifold approximation and projection (UMAP) transfer function to enable comparable data visualization between different samples ( Methods ).

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig4_HTML.jpg

a , UMAP plot of WT reference data for axial mesoderm (6, 8 and 10 hpf): notochord, early notochord, early axial mesoderm and prechordal plate clusters ( n  = 2,012 cells). Arrows indicate notochord differentiation (top) and prechordal plate differentiation (bottom). b , Gene expression (log-transformed unique molecular identifier (UMI) count) and developmental stage are projected onto the axial mesoderm UMAP plot. Noto and twist2 are expressed in notochord, whereas gsc marks the prechordal plate. c , Bar plots comparing cell cluster compositions between treatments and controls (left, flh n1/n1 mutants (10 hpf) and controls; right, noto crispants (10 hpf) and tyr crispants). Cluster compositions are presented as the proportion of each group normalized to the whole cell number. In both flh n1/n1 mutants and noto crispants, the notochord is significantly depleted ( flh n1/n1 : P  = 5.55 × 10 −52 ; noto: P  = 1.39 × 10 −33 , chi-square test) and the prechordal plate is significantly expanded ( flh n1/n1 : P  = 1.07 × 10 −4 ; noto : P  = 5.01 × 10 −18 , chi-square test. *** P < 0.001; **** P < 0.0001). d – g , flh n1/n1 mutant or noto crispant data projected onto the WT axial mesoderm UMAP plot. d , Cluster annotation and sample label projected onto the UMAP plot. e , Kernel cell density contour plot shows control cell density (left) and flh n1/n1 mutant cell density (right). f , Cluster annotation and sample label projected onto the UMAP plot. g , tyr crispant cell density (left) and noto crispant cell density (right) shown on the kernel cell density contour plot.

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig16_ESM.jpg

(a) Schematic illustration of zebrafish scRNA-seq experiments. (1) The reference dataset was generated using cells from 6, 8, and 10 hpf wild-type (WT) embryos. To assess noto LOF, we also assayed (2) flh n1/n1 mutants and (3) noto/flh crispants at 10 hpf (~25 embryos per sample; Methods). (b) Cell cluster composition comparing tyr crispant (control) with WT cells, showing similar cell distributions. After data integration, cell-type labels were transferred from the whole WT 6, 8, and 10 hpf reference data (see Methods). (c) Sample label projected onto the t- SNE plot. flh n1/n1 mutant and control sample (left, n = 57,175 cells, 2 independent biological replicates for each sample), and t- SNE plot of noto crispant and tyr crispant samples (right, n = 9,185 cells, 2 biological, 3 technical replicates for noto crispant; n = 46,440 cells, n = 3 independent biological, 5 technical replicates for tyr crispant). (d) Cluster annotation label projected onto the t- SNE plot. WT zebrafish cells (left, n = 38,606 cells, two technical replicates per stage), flh n1/n1 mutant and control sample (middle), noto crispant and tyr crispant samples (right). (e) Cell cluster composition comparing LOF samples with the control samples.

For experimental perturbation of noto , we generated and dissociated pools of 25 flh n1/n1 mutant embryos, recognized at 10 hpf by the lack of notochord boundaries, and sibling controls ( flh n1/+ and flh +/+ ) for scRNA-seq. We integrated these datasets and projected them onto the WT axial mesoderm reference atlas. In agreement with previous observations, we observed a significant depletion of cells labelled as notochord in flh n1/n1 mutants (−98%, relative to control, P  = 5.55 × 10 −52 , chi-square test; Fig. ​ Fig.4c, 4c , left), concomitant with an expansion of the somite cluster (+41.3%; P  = 5.90 × 10 −29 ; Extended Data Fig. 11e , left). Furthermore, as predicted by noto LOF simulation, we observed a significant expansion of the prechordal plate cluster in flh n1/n1 mutants (+38.6%; P  = 1.07 × 10 −4 ; Fig. ​ Fig.4c, 4c , left). Plotting cell density revealed stalled notochord differentiation and bifiurcation of the mid axial mesoderm, with excess prechordal plate cells (Fig. 4d,e ), consistent with the noto LOF simulation (Fig. 3e,f ). To orthogonally validate these results, we produced noto LOF with a modified CRISPR–Cas9 protocol that we have previously used to achieve near-complete gene disruption in F 0 embryos injected with two noto -targeting guide RNAs (gRNAs) 40 ( Methods ). The resulting noto ‘crispants’ were dissociated at 10 hpf (9,185 cells, n  = 2 biological and n  = 3 technical replicates) and compared by single-cell analysis to controls that targeted the tyrosinase gene ( tyr) , which is not expressed until later in development ( n  = 46,440 single cells, n  = 3 biological and n  = 5 technical replicates; Extended Data Fig. 11b ). Analysis of cell-type composition confirmed a significant depletion of notochord, with an expansion of somitic mesoderm and prechordal plate (Fig. ​ (Fig.4c, 4c , right, Fig. 4f,g and Extended Data Fig. 11e , right) in noto crispants, highly consistent with our flh n1/n1 mutant analysis. Together, in addition to further validating the performance of CellOracle, these results highlight the ability of this approach to identify experimentally quantifiable phenotypes in well-characterized mutants that may have been previously overlooked owing to a reliance on gross morphology. We next sought to identify new LOF phenotypes in axial mesoderm development.

Discovery of axial mesoderm regulators

To identify novel TFs required for axial mesoderm differentiation, we prioritized TFs according to predicted KO phenotypes, focusing on early-stage differentiation before evident lineage specification (Extended Data Fig. 12a ). The resulting ranked list contains several known notochord regulators, including noto (Fig. ​ (Fig.5a, 5a , red and Supplementary Table 2 ), confirming CellOracle’s capacity to model known developmental regulation. However, it is important to note that some known notochord regulators, such as foxa3 (ref. 41 ), were not identified as they are filtered out in the first steps of data processing owing to low expression. Systematic perturbation simulations for all lineages can be found at https://celloracle.org . As well as the axial mesoderm, we also performed an in-depth analysis of the adaxial mesoderm, which gives rise to somites. Overall, more than 80% of the top 30 TFs in this analysis were associated with somite differentiation (Supplementary Table 3 ).

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig5_HTML.jpg

a , Top 30 TFs according to predicted KO effects. Red and *: previously reported notochord regulators (Supplementary Table 2 ). lhx1a , sebox and irx3a were selected for experimental validation. b , lhx1a LOF simulation in the axial mesoderm sub-branch, predicting an inhibition of axial mesoderm differentiation from early stages. c , scRNA-seq validation of experimental LOF: cell cluster composition of the axial mesoderm clusters normalized to the whole cell number in lhx1a and tyr (control) crispant samples. Early notochord is significantly expanded ( P  = 1.34 x 10 −35 , chi-square test) and differentiated axial mesoderm populations are significantly depleted (notochord: P  = 3.83 x 10 −3 ; prechordal plate: P  = 1.28 x 10 −7 , chi-square test) in lhx1a crispants. d , lhx1a and tyr crispant axial mesoderm cells at 10 hpf. Left, cell type annotation of lhx1 and tyr crispant cells. Right, lhx1a and control crispant data projected onto the WT UMAP. e , Control cell density (left, n  = 2,342 cells) and lhx1a crispant cell density (right, n  = 2,502 cells). f , Rug plot showing the difference in averaged NMF module scores between lhx1a and tyr crispants in notochord lineage cells. Black, cell-type-specific modules. Light grey, broad cluster modules. CM, cephalic mesoderm. g , Violin plot of NMF module score in notochord lineage cells ( n  = 1,918 lhx1a crispant and n  = 2,616 tyr crispant cells. h , Violin plots of gene expression in the notochord (NC) lineage cells. **** P  < 0.0001, two-tailed Wilcoxon rank-sum test with Bonferroni correction. i , Quantification (number of spots in flattened HCR image) normalized to WT. Mean ± s.e.m. n  = 2 independent biological replicates, 8 embryos per replicate. nog1 : P  = 0.0022 (WT versus lhx1a crispant), P  = 0.0052 ( tyr versus lhx1a crispant); gsc : P  = 0.00042 (WT versus lhx1a crispant), P  = 0.0018 ( tyr versus lhx1a crispant); twist2 : P  = 0.0011 (WT versus lhx1a crispant), P  = 0.0012 ( tyr versus lhx1a crispant); two-sided t -test. j , Representative HCR images for nog1 expression (yellow) in whole embryos at 10 hpf. k , Representative flattened HCR images of 10 hpf embryos stained with probes against gsc (yellow) and twist2 (red); nuclei are stained with DAPI (blue). Scale bars, 300 μm.

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig17_ESM.jpg

(a) Overview of the systematic LOF simulation and quantification method. CellOracle LOF simulation was performed for 232 TFs in the Notochord lineage to calculate the perturbation score (PS). The sum of the negative PS was calculated for each TF in the selected area between digitized pseudotime 0 to 3, before lineage specification. (b) Degree centrality score in the Notochord cluster GRN (left), gene expression specificity score in the Axial mesoderm sub-branch (middle), and mean expression value in the Axial mesoderm sub-branch (right) were calculated for the top 30 TFs selected in the systematic simulation to further prioritize candidate genes for experimental validation. We selected genes in the top 50% of these scores. Please refer to the Methods for the detailed selection procedure. We selected three candidates for experimental validation: lhx1a , sebox , and irx3a . ( c , e , g ) Cell cluster composition in axial mesoderm cells, comparing LOF ( lhx1a , sebox , and irx3a) samples with control samples. Cell cluster composition comparison was performed with a Chi-square test, Two-tailed Bonferroni correction. lhx1a experiment: Early axial mesoderm p = 0.000229717, Early Notochord p = 1.08×10 −21 , Notochord p = 4.38×10 −6 , Prechordal Plate p = 1.42×10 −10 . Sebox experiment: Early axial mesoderm p = 3.01×10 −6 , Early Notochord p = 2.87×10 −6 , Notochord p = 4.38×10 −6 , Prechordal Plate p = 4.17×10 −9 . The left panels show cluster composition in the merged data, and the right panels show individual scRNA-seq batch. lhx1a LOF produced the most significant changes in cell composition. ( d , f , h ) Comparison of notochord marker gene expression between LOF and control samples. scRNA-seq gene expression in the Notochord lineage clusters is shown as a violin plot. Late-stage notochord markers, twist2 and nog1 , or broad/early notochord markers, noto and tbxta , are visualized. Statistical tests: Wilcoxon rank-sum test, two-tailed with Bonferroni p-value correction. lhx1a experiment: twist2 p = 7.118×10 −64 , nog1 p = 7.757×10 −67 , noto p = 7.718×10 −11 . sebox experiment: twist2 p = 8.022×10 −10 , nog1 p = 3.184×10 −3 , tbxta p = 1.551×10 −3 . irx3a experiment: twist2 p = 0.000012. (c) n = 720 cells and 1,686 cells for lhx1a crispant and tyr crispant, respectively. (e) n = 1,216 cells and 1,703 cells for sebox crispant and tyr crispant, respectively. (g) n = 1,176 cells and 1,651 cells for irx3a crispant and tyr crispant, respectively.

In addition to known TFs, we identified several TFs with no previously reported role in axial mesoderm differentiation (Fig. ​ (Fig.5a, 5a , black). We further prioritized candidate genes for experimental validation by GRN degree centrality, gene enrichment score in axial mesoderm and average gene expression value, selecting lhx1a , sebox and irx3a (Extended Data Fig. 12b ). CellOracle predicts impaired notochord differentiation for all three genes after their LOF (Fig. ​ (Fig.5b 5b and Supplementary Fig. 19 ). However, no LOF studies describing axial mesoderm phenotypes that relate to these genes have, to our knowledge, been reported in zebrafish. Mouse Lhx1 (Lim1) KO embryos lack anterior head structures and kidneys 42 . In zebrafish, sebox ( mezzo ) has been implicated in mesoderm and endoderm specification 43 , whereas irx3a ( ziro3 ) morphants exhibit changes in the composition of pancreatic cell types 44 .

We generated lhx1a , sebox and irx3a crispants (Supplementary Fig. 20b–d ). We performed initial single-cell analyses at 10 hpf, integrating crispant scRNA-seq datasets with the control gRNA reference atlas described above. We observed significant changes in cell-type composition and notochord marker expression in lhx1a crispants (Extended Data Fig. 12c,d and Supplementary Table 4 ). Notably, we found a more considerable reduction in the expression of late notochord genes relative to broad notochord markers, suggesting that loss of lhx1a function inhibits the differentiation and maturation of notochord cells. We observed a slight yet significant reduction in the expression of the notochord markers twist2 , nog1 and tbxta in sebox crispants (Extended Data Fig. 12e,f and Supplementary Table 4 ), confirming CellOracle’s predictions that lhx1a and sebox are regulators of axial mesoderm development. Irx3a crispants showed no significant phenotype in cell-type composition but exhibited a slight reduction in twist2 expression in the notochord (Extended Data Fig. 12g,h ).

We extended lhx1a LOF characterization by performing four independent biological replicates for lhx1a crispants ( n  = 45,582 cells) and tyr crispants ( n  = 76,163 cells, 5 biological and 7 technical replicates). CellOracle predicted inhibition of early axial mesoderm differentiation after lhx1a disruption, depleting both notochord and prechordal plate lineages (Fig. ​ (Fig.5b). 5b ). Indeed, the lhx1a crispants exhibited inhibition of axial mesoderm differentiation (Fig. 5c–e ): a significant expansion of the early notochord cluster (+70.2%; P  = 1.34 × 10 −35 ), with a concomitant reduction of later notochord (−15.3%; P  = 3.83 × 10 −3 ) and prechordal plate clusters (−24.7%; P  = 1.28 × 10 −7 ). These phenotypes were reproducible across independent biological replicates (Extended Data Fig. 13e ), validating the predicted inhibition of early axial mesoderm differentiation (Fig. 5a,b ).

An external file that holds a picture, illustration, etc.
Object name is 41586_2022_5688_Fig18_ESM.jpg

( a , b ) t- SNE plot of lhx1a crispant (n = 45,582 cells, 4 biological replicates) and tyr control crispant samples (n = 76,163 cells, 5 biological, 7 technical replicates). (a) Cluster annotation labels transferred from WT reference data projected onto the t- SNE plot. (b) Sample label projected onto the t- SNE plot. (c) Cell cluster composition comparing lhx1a crispant and tyr control crispant samples as a proportion of cells from the whole embryo. (d) Cell density in the axial mesoderm is visualized as a kernel cell density contour plot. The cell number is downsampled to match the cell number before kernel cell density calculation (n = 260, 290, 336, and 367 for lhx1a crispant 1~4, n = 248, 234, 344, 316, 213, 286, and 350 for tyr crispant 1~7). The same contour threshold values are used for the visualization. (e) Cell cluster composition in the axial mesoderm clusters comparing lhx1a crispant and tyr control crispant samples. The left panels show cluster composition in the merged data, while the right panels show the individual scRNA-seq batch. (f) The top 30 NMF module weights for the Early notochord module (left) and the Late notochord module (right) are shown as a bar plot.

To further analyse the lhx1a LOF axial mesoderm phenotype, we investigated global changes in gene expression across all cell types using non-negative matrix factorization (NMF), a method to quantify gene module activation 45 (Supplementary Table 5 and Methods ). We observed that a module corresponding to the early notochord was significantly activated in lhx1a crispants ( P  = 2.62 × 10 −32 ; Fig. 5f,g ). The top gene in this module is admp (Extended Data Fig. 13f , left), which is significantly upregulated in lhx1a crispant cells ( P  = 6.69 × 10 −46 ; Fig. ​ Fig.5h) 5h ) and encodes a known negative regulator of notochord and prechordal plate development 46 . By contrast, the late notochord module received a significantly lower score in the lhx1a crispant cells ( P  = 1.04 × 10 −5 ; Fig. ​ Fig.5g, 5g , bottom). This module comprises late notochord marker genes, such as twist2 and nog1 (Extended Data Fig. 13f , right), which showed significantly lower expression in lhx1a crispant cells ( P  = 4.52 × 10 −105 and P  = 4.95 × 10 −105 , respectively; Fig. ​ Fig.5h). 5h ). Further, lhx1a crispant cells exhibited a higher somite module score ( P  = 5.19 × 10 −25 and Supplementary Table 5 ), suggesting that notochord cells may be redirected towards a somitic identity after lhx1a LOF. Overall, the NMF analysis supports the hypothesis that loss of lhx1a function induces global changes in gene expression that are related to inhibited notochord differentiation.

Finally, we confirmed the lhx1a LOF phenotype using orthogonal approaches. Hybridization chain reaction (HCR) RNA fluorescence in situ hybridization for nog1 (late notochord) and for gsc and twist2 (prechordal plate and notochord, respectively) showed that these genes were significantly downregulated in lhx1a crispants (Fig. 5i–k ). These results were further confirmed by quantitative reverse transcription PCR (qRT-PCR) and whole-mount in situ hybridization against nog1 (Supplementary Fig. 22 ). Together, this experimental validation confirms the significant and consistent disruption of axial mesoderm development after loss of lhx1a function. In summary, these results demonstrate the ability of CellOracle to accurately predict known TF perturbation phenotypes, provide insight into previously characterized mutants and reveal regulators of established developmental processes in well-studied model organisms.

The emerging discipline of perturbational single-cell omics enables regulators of cell identity and behaviour to be modelled and predicted 5 . For example, scGen combines variational autoencoders with latent space vector arithmetic to predict cell infection response. However, this approach requires experimentally perturbed training data, which limits its scalability 47 . More importantly, it remains challenging to interpret the gene program behind the simulated outcome using these previous computational perturbation approaches because they rely on complex black-box models; thus, the simulations lack any means to interpret how gene regulation relates to cellular phenotype. On the other hand, previous GRN analyses relied largely on static graph theory and could not consider cell identity as a dynamic property. Here we present a strategy that overcomes these limitations by integrating computational perturbation with GRN modelling. CellOracle uses GRN models to yield mechanistic insights into the regulation of cell identity; simulation and vector visualization based on the custom network model enables the interpretable, scalable and broadly applicable analysis of dynamic TF function.

We validated CellOracle using various in vivo differentiation models, verifying its efficacy and its robustness to complex and noisy biological data. CellOracle simulates shifts in cell identity by considering systematic gene-to-gene relationships for each cell state using multimodal data, generating a complex context-dependent vector representation that is not possible using differential gene expression or chromatin accessibility alone. For example, the role of Gata1 in granulocyte differentiation would probably not be predicted given its low expression in this cell type. However, CellOracle could corroborate this relatively mild Gata1 phenotype. Furthermore, CellOracle correctly reproduced the reported early-stage-specific cell-fate-regulatory role of Tal1 in erythropoiesis, which is impossible to uncover on the basis of the constitutive expression of Tal1 throughout all erythroid stages. This capacity of CellOracle means that it could identify previously unreported phenotypes. For example, the LOF simulation of a well-characterized regulator of zebrafish axial mesoderm development, noto , predicted a previously unreported expansion of the prechordal plate, which we experimentally validated. This case suggests that noto has a role in suppressing alternate fates, which could only be predicted by the integrative simulation using the GRN and cell differentiation trajectory together. Finally, although we focus on TF KO and LOF in this study, we have also recently demonstrated that CellOracle can be used to simulate TF overexpression 48 .

We note some limitations of the method. First, CellOracle visualizes the simulation vector within the existing trajectory space; thus, cell states that do not exist in the input scRNA-seq data cannot be analysed. Nevertheless, existing single-cell data collected after severe developmental disruption do not report the emergence of new transcriptional states in the context of loss of gene function, which suggests extensive canalization even during abnormal development 32 , supporting the use of CellOracle to accurately simulate TF perturbation effects. Second, we emphasize that TF simulation is limited by input data availability and data quality. For example, a perturbation cannot be simulated if a TF-binding motif is unknown or TF expression is too sparse, as we note in the case of foxa3 in zebrafish 41 .

Our application of CellOracle to systematically simulate TF perturbation has revealed regulators of a well-characterized developmental paradigm: the formation of axial mesoderm in zebrafish. Although zebrafish axial mesoderm has been well-characterized through mutagenesis screens, a role for Lhx1a in these developmental stages is likely to have gone unreported owing to the absence of gross morphological phenotypical changes at 10 hpf after disruption of lhx1a (ref. 49 ). However, our ability to predict and validate such a phenotype showcases the power of single-cell computational and experimental approaches, enabling finer-resolution dissection of gene regulation even in well-characterized systems. Moreover, CellOracle provides information at intermediate steps in a given developmental pathway, obviating the need for gross morphological end-points. Indeed, each simulation can be thought of as many successive predictions along a lineage, although we stress that experimental validation is essential to validate CellOracle’s predictions where possible. However, applying these approaches to emerging systems or where experimental intervention is not feasible promises to accelerate our understanding of how cell identity is regulated. For example, in the context of human development, we have recently applied CellOracle to predict candidate regulators of medium spiny neuron maturation in human fetal striatum 50 , demonstrating the power of in silico perturbation where experimental approaches cannot be deployed.

CellOracle algorithm overview

The CellOracle workflow consists of several steps: (1) base GRN construction using scATAC-seq data or promoter databases; (2) scRNA-seq data preprocessing; (3) context-dependent GRN inference using scRNA-seq data; (4) network analysis; (5) simulation of cell identity following TF perturbation; and (6) calculation of the pseudotime gradient vector field and the inner-product score to generate perturbation scores. We implemented and tested CellOracle in Python (versions 3.6 and 3.8) and designed it for use in the Jupyter notebook environment. CellOracle code is open source and available on GitHub ( https://github.com/morris-lab/CellOracle ), along with detailed descriptions of functions and tutorials.

Base GRN construction using scATAC-seq data

In the first step, CellOracle constructs a base GRN that contains unweighted, directional edges between a TF and its target gene. CellOracle uses the regulatory region’s genomic DNA sequence and TF-binding motifs for this task. CellOracle identifies regulatory candidate genes by scanning for TF-binding motifs within the regulatory DNA sequences (promoter and enhancers) of open chromatin sites. This process is beneficial as it narrows the scope of possible regulatory candidate genes in advance of model fitting and helps to define the directionality of regulatory edges in the GRN. However, the base network generated in this step may still contain pseudo- or inactive connections; TF regulatory mechanisms are not only determined by the accessibility of binding motifs but may also be influenced by many context-dependent factors. Thus, scRNA-seq data are used to refine this base network during the model fitting process in the next step of base GRN assembly.

Base GRN assembly can be divided into two steps: (i) identification of promoter and enhancer regions using scATAC-seq data; and (ii) motif scanning of promoter and enhancer DNA sequences.

Identification of promoter and enhancer regions using scATAC-seq data

CellOracle uses genomic DNA sequence information to define candidate regulatory interactions. To achieve this, the genomic regions of promoters and enhancers first need to be designated, which we infer from ATAC-seq data. We designed CellOracle for use with scATAC-seq data to identify accessible promoters and enhancers (Extended Data Fig. ​ Fig.1a, 1a , left panel). Thus, scATAC-seq data for a specific tissue or cell type yield a base GRN representing a sample-specific TF-binding network. In the absence of a sample-specific scATAC-seq dataset, we recommend using scATAC-seq data from closely related tissue or cell types to support the identification of promoter and enhancer regions. Using broader scATAC-seq datasets produces a base GRN corresponding to a general TF-binding network rather than a sample-specific base GRN. Nevertheless, this base GRN network will still be tailored to a specific sample using scRNA-seq data during the model fitting process. The final product will consist of context-dependent (cell-type or state-specific) GRN configurations.

To identify promoter and enhancer DNA regions within the scATAC-seq data, CellOracle first identifies proximal regulatory DNA elements by locating TSSs within the accessible ATAC-seq peaks. This annotation is performed using HOMER ( http://homer.ucsd.edu/homer/ ). Next, the distal regulatory DNA elements are obtained using Cicero, a computational tool that identifies cis -regulatory DNA interactions on the basis of co-accessibility, as derived from ATAC-seq peak information 12 . Using the default parameters of Cicero, we identify pairs of peaks within 500 kb of each other and calculate a co-accessibility score. Using these scores as input, CellOracle then identifies distal cis -regulatory elements defined as pairs of peaks with a high co-accessibility score (≥0.8), with the peaks overlapping a TSS. The output is a bed file in which all cis -regulatory peaks are paired with the target gene name. This bed file is used in the next step. CellOracle can also use other input data types to define cis -regulatory elements. For example, a database of promoter and enhancer DNA sequences or bulk ATAC-seq data can serve as an alternative if available as a .bed file.

For the analysis of mouse haematopoiesis that we present here, we assembled the base GRN using a published mouse scATAC-seq atlas consisting of around 100,000 cells across 13 tissues, representing around 400,000 differentially accessible elements and 85 different chromatin patterns 13 . This base GRN is built into the CellOracle library to support GRN inference without sample-specific scATAC-seq datasets. In addition, we have generated general promoter base GRNs for several key organisms commonly used to study development, including 10 species and 23 reference genomes (Supplementary Table 1 ).

Motif scan of promoter and enhancer DNA sequences

This step scans the DNA sequences of promoter and enhancer elements to identify TF-binding motifs. CellOracle internally uses gimmemotifs ( https://gimmemotifs.readthedocs.io/en/master/ ), a Python package for TF motif analysis. For each DNA sequence in the bed file obtained in step (i) above, motif scanning is performed to search for TF-binding motifs in the input motif database.

For mouse and human data, we use gimmemotifs motif v.5 data. CellOracle also provides a motif dataset for ten species generated from the CisBP v.2 database ( http://cisbp.ccbr.utoronto.ca ).

CellOracle exports a binary data table representing a potential connection between a TF and its target gene across all TFs and target genes. CellOracle also reports the TF-binding DNA region. CellOracle provides pre-built base GRNs for ten species (Supplementary Table 1 ), which can be used if scATAC-seq data are unavailable.

scRNA-seq data preprocessing

CellOracle requires standard scRNA-seq preprocessing in advance of GRN construction and simulation. The scRNA-seq data need to be prepared in the AnnData format ( https://anndata.readthedocs.io/en/latest/ ). For data preprocessing, we recommend using Scanpy ( https://scanpy.readthedocs.io/en/stable/ ) or Seurat ( https://satijalab.org/seurat/ ). Seurat data must be converted into the AnnData format using the CellOracle function, seuratToAnndata, preserving its contents. In the default CellOracle scRNA-seq preprocessing step, zero-count genes are first filtered out by UMI count using scanpy.pp.filter_genes(min_counts=1). After normalization by total UMI count per cell using sc.pp.normalize_per_cell(key_n_counts=‘n_counts_all’), highly variable genes are detected by scanpy.pp.filter_genes_dispersion(n_top_genes=2000~3000). The detected variable gene set is used for downstream analysis. Gene expression values are log-transformed, scaled and subjected to dimensional reduction and clustering. The non-log-transformed gene expression matrix (GEM) is also retained, as it is required for downstream GRN calculation and simulation.

Context-dependent GRN inference using scRNA-seq data

In this step of CellOracle GRN inference, a machine-learning model is built to predict target gene expression from the expression levels of the regulatory genes identified in the previous base GRN refinement step. By fitting models to sample gene expression data, CellOracle extracts quantitative gene–gene connection information. For signal propagation, the CellOracle GRN model must meet two requirements: (1) the GRN model needs to represent transcriptional connections as a directed network edge; and (2) the GRN edges need to be a linear regression model. Because of this second constraint, we cannot use pre-existing GRN inference algorithms, such as GENIE3 and GRNboost (refs. 7 , 51 ). CellOracle leverages genomic sequences and information on TF-binding motifs to infer the base GRN structure and directionality, and it does not need to infer the causality or directionality of the GRN from gene expression data. This allows CellOracle to adopt a relatively simple machine-learning model for GRN inference—a regularized linear machine-learning model. CellOracle builds a model that predicts the expression of a target gene on the basis of the expression of regulatory candidate genes:

where x j is single target gene expression and x i is the gene expression value of the regulatory candidate gene that regulates gene x j . b i,j is the coefficient value of the linear model (but b i,j  = 0 if i  =  j ), and c is the intercept for this model. Here, we use the list of potential regulatory genes for each target gene generated in the previous base GRN construction step (ii).

The regression calculation is performed for each cell cluster in parallel after the GEM of scRNA-seq data is divided into several clusters. The cluster-wise regression model can capture non-linear or mixed regulatory relationships. In addition, L2 weight regularization is applied by the Ridge model. Regularization not only helps distinguish active regulatory connections from random, inactive, or false connections in the base GRN but also reduces overfitting in smaller samples.

The Bayesian Ridge or Bagging Ridge model provides the coefficient value as a distribution, and we can analyse the reproducibility of the inferred gene–gene connection (Extended Data Fig. ​ Fig.1a, 1a , right). In both models, the output is a posterior distribution of coefficient value b :

where μ b is the centre of the distribution of b , and σ b is the standard deviation of b . The user can choose the model method depending on the availability of computational resources and the aim of the analysis; CellOracle’s Bayesian Ridge requires fewer computational resources, whereas the Bagging Ridge tends to produce better inference results than Bayesian Ridge. Using the posterior distribution, we can calculate P values of coefficient b ; one-sample t -tests are applied to b to estimate the probability (the centre of b  = 0). The P value helps to identify robust connections while minimizing connections derived from random noise. In addition, we apply regularization to coefficient b for two purposes: (i) to prevent coefficient b from becoming extremely large owing to overfitting; and (ii) to identify informative variables through regularization. In CellOracle, the Bayesian Ridge model uses regularizing prior distribution of b as follows: b ∼ N o r m a l ( 0 , σ b )

σ b is selected to represent non-informative prior distributions. This model uses data in the fitting process to estimate the optimal regularization strength. In the Bagging Ridge model, custom regularization strength can be manually set.

For the computational implementation of the above machine-learning models, we use a Python library, scikit-learn ( https://scikit-learn.org/stable/ ). For Bagging Ridge regression, we use the Ridge class in the sklearn.linear_model and BaggingRegressor in the sklearn.ensemble module. The number of iterative calculations in the bagging model can be adjusted depending on the computational resources and available time. For Bayesian Ridge regression, we use the BayesianRidge class in sklearn.linear_module with the default parameters.

Simulation of cell identity following perturbation of regulatory genes

The central purpose of CellOracle is to understand how a GRN governs cell identity. Toward this goal, we designed CellOracle to make use of inferred GRN configurations to simulate how cell identity changes following perturbation of regulatory genes. The simulated gene expression values are converted into 2D vectors representing the direction of cell-state transition, adapting the visualization method previously used by RNA velocity 52 . This process consists of four steps: (i) data preprocessing; (ii) signal propagation within the GRN; (iii) estimation of transition probabilities; and (iv) analysis of simulated transition in cell identity.

For simulation of cell identity, we developed our code by modifying Velocyto.py, a Python package for RNA-velocity analysis ( https://velocyto.org ). Consequently, CellOracle preprocesses the scRNA-seq data per Velocyto requirements by first filtering the genes and imputing dropout. Dropout can affect Velocyto’s transition probability calculations; thus, k -nearest neighbour (KNN) imputation must be performed before the simulation step.

This step aims to estimate the effect of TF perturbation on cell identity. CellOracle simulates how a ‘shift’ in input TF expression leads to a ‘shift’ in its target gene expression and uses a partial derivative ∂ x j ∂ x i . As we use a linear model, the derivative ∂ x j ∂ x i is a constant value and already calculated as b i , j in the previous step if the gene j is directly regulated by gene i :

And we calculate the shift of target gene Δ x j in response to the shift of regulatory gene Δ x i :

As we want to consider the gene-regulatory ‘network’, we also consider indirect connections. The network edge represents a differentiable linear function shown above, and the network edge connections between indirectly connected nodes is a composite function of the linear models, which is differentiable accordingly. Using this feature, we can apply the chain rule to calculate the partial derivative of the target genes, even between indirectly connected nodes.

For example, when we consider the network edge from gene 0 to 1 to 2, the small shift of gene 2 in response to gene 0 can be calculated using the intermediate connection with gene 1 (Supplementary Fig. 1 ).

In summary, the small shift of the target gene can be formulated by the multiplication of only two components, GRN model coefficient b i , j and input TF shift Δ x i . In this respect, we focus on the gradient of gene expression equations rather than the absolute expression values so that we do not model the error or the intercept of the model, which potentially includes unobservable factors within the scRNA-seq data.

The calculation above is implemented as vector and matrix multiplication. First, the linear regression model can be shown as follows.

where the X ∈ R 1 × N is a gene expression vector containing N genes, C ∈ R 1 × N is the intercept vector, B ∈ R N × N is the network adjacency matrix, and each element b i,j is the coefficient value of the linear model from regulatory gene i to target gene j .

First, we set the perturbation input vector Δ X input ∈ R 1 × N , a sparse vector consisting of zero except for the perturbation target gene i . For the TF perturbation target gene, we set the shift of the TF to be simulated. The CellOracle function will produce an error if the user enters a gene shift corresponding to an out-of-distribution value.

Next, we calculate the shift of the first target gene:

However, we fix the perturbation target gene i value, and the Δ x i retains the same value as the input state. Thus, the following calculation will correspond to both the first and the second downstream gene shift calculations.

Likewise, the recurrent calculation is performed to propagate the shift from gene to gene in the network. Repeating this calculation for n iterations, we can estimate the effects on the first to the n th indirect target gene (Extended Data Fig. 1b–d ):

CellOracle performs three iterative cycles in the default setting, sufficient to predict the directionality of changes in cell identity (Supplementary Figs. 4 and 5 ). We avoid a higher number of iterative calculations as it might lead to unexpected behaviour. Of note, CellOracle performs the calculations cluster-wise after splitting the whole GEM into gene expression submatrices on the basis of the assumption that each cluster has a unique GRN configuration. Also, gene expression values are checked between each iterative calculation to confirm whether the simulated shift corresponds to a biologically plausible range. If the expression value for a gene is negative, this value is adjusted to zero. The code in this step is implemented from scratch, specifically for CellOracle perturbations using NumPy, a python package for numerical computing ( https://numpy.org ).

From the previous steps, CellOracle produces a simulated gene expression shift vector Δ X simulated ∈ R 1 × N representing the simulated initial gene expression shift after TF perturbation. Next, CellOracle aims to project the directionality of the future transition in cell identity onto the dimensional reduction embedding (Fig. ​ (Fig.1a, 1a , right and Extended Data Fig. ​ Fig.1e). 1e ). For this task, CellOracle uses a similar approach to Velocyto ( https://github.com/velocyto-team/velocyto.py ). Velocyto visualizes future cell identity on the basis of the RNA-splicing information and calculated vectors from RNA synthesis and degradation differential equations. CellOracle uses the simulated gene expression vector Δ X simulated instead of RNA-velocity vectors.

First, CellOracle estimates the cell transition probability matrix P ∈ R M × M ( M is number of cells): p i , j , the element in the matrix P , is defined as the probability that cell i will adopt a similar cell identity to cell j after perturbation. To calculate p i , j , CellOracle calculates the Pearson’s correlation coefficient between d i and r i , j :

where d i is the simulated gene expression shift vector Δ X simulated ∈ R 1 × N for cell i , and r i j ∈ R 1 × N is a subtraction of the gene expression vector X ∈ R 1 × N between cell i and cell j in the original GEM. The value is normalized by the Softmax function (default temperature parameter T is 0.05). The calculation of p i.j uses neighbouring cells of cell i. The KNN method selects local neighbours in the dimensional reduction embedding space ( k  = 200 as default).

The transition probability matrix P is converted into a transition vector V i , simulated ∈ R 1 × 2 , representing the relative cell-identity shift of cell i in the 2D dimensional reduction space, as follows: CellOracle calculates the local weighted average of vector V i , j ∈ R 1 × 2 , V i , j denotes the 2D vector obtained by subtracting the 2D coordinates in the dimensional reduction embedding between cell i and cell j ( cell j ∈ G ).

The single-cell resolution vector V i , simulated is too fine to interpret the results in a large dataset consisting of many cells. We calculate the summarized vector field using the same vector averaging strategy as Velocyto. The simulated cell-state transition vector for each cell is grouped by grid point to get the vector field, V vector field = R 2 × L × L , ( L is grid number, default L is 40). v grid ∈ R 2 , an element in the V vector field , is calculated by the Gaussian kernel smoothing.

where the g ∈ R 2 denotes grid point coordinates, H is the neighbour cells of g and K σ is the Gaussian kernel weight:

Calculation of pseudotime gradient vector field and inner-product score to generate a perturbation score

To aid the interpretation of CellOracle simulation results, we quantify the similarity between the differentiation vector fields and KO simulation vector fields by calculating their inner-product value, which we term the perturbation score (PS) (Extended Data Fig. ​ Fig.4). 4 ). Calculation of the PS includes the following steps:

Differentiation pseudotime is calculated using DPT, a diffusion-map-based pseudotime calculation algorithm, using the scanpy.tl.dpt function (Extended Data Fig. ​ Fig.4a, 4a , left). CellOracle also works with other pseudotime data, such as Monocle pseudotime and URD pseudotime data. For the Farrell et al. 32 zebrafish scRNA-seq data analysis, we used pseudotime data calculated by the URD algorithm, as described previously 32 .

The pseudotime data are transferred to the n by n 2D grid points ( n  = 40 as default) (Extended Data Fig. ​ Fig.4a, 4a , centre). For this calculation, we implemented two functions in CellOracle: KNN regression and polynomial regression for the data transfer. We choose polynomial regression when the developmental branch is a relatively simple bifurcation, as is the case for the Paul et al. 16 haematopoiesis data. We used KNN regression for a more complex branching structure, such as the Farrell et al. 32 zebrafish development data. Then, CellOracle calculates the gradient of pseudotime data on the 2D grid points using the numpy.gradient function, producing the 2D vector map representing the direction of differentiation (Extended Data Fig. ​ Fig.4a, 4a , right).

Then, CellOracle calculates the inner-product score (perturbation score (PS)) between the pseudotime gradient vector field and the perturbation simulation vector field (Extended Data Fig. ​ Fig.4b). 4b ). The inner product between the two vectors represents their agreement (Extended Data Fig. ​ Fig.4c), 4c ), enabling a quantitative comparison of the directionality of the perturbation vector and differentiation vector with this metric.

  • (iv) PS calculation with randomized GRN model to calculate PS cut-off value

CellOracle also produces randomized GRN models. The randomized GRNs can be used to generate dummy negative control data in CellOracle simulations. We calculated cut-off values for the negative PS analysis in the systematic KO simulation. First, the negative PS is calculated for all TFs using either a normal or a randomized vector. The score distribution generated from the randomized vector was used as a null distribution. We determined the cut-off value corresponding to a false-positive rate of 0.01 by selecting the 99th percentile value of PSs generated with randomized results (Extended Data Fig. ​ Fig.3g 3g ).

Network analysis

In addition to CellOracle’s unique gene perturbation simulation, CellOracle’s GRN model can be analysed with general network structure analysis methods or graph theory approaches. Before this network structure analysis, we filter out weak or insignificant connections. GRN edges are initially filtered on the basis of P values and absolute values of edge strength. The user can define a custom value for the thresholding according to the data type, data quality and aim of the analysis. After filtering, CellOracle calculates several network scores: degree centrality, betweenness centrality and eigenvector centrality. It also assesses network module information and analyses network cartography. For these processes, CellOracle uses igraph ( https://igraph.org ).

Validation and benchmarking of CellOracle GRN inference

To test whether CellOracle can correctly identify cell-type- or cell-state-specific GRN configurations, we benchmarked our new method against diverse GRN inference algorithms: WGCNA, DCOL, GENIE3 and SCENIC. WGCNA is a correlation-based GRN inference algorithm, which is typically used to generate a non-directional network 53 ; DCOL is a ranking-based non-linear network modelling method 54 ; and GENIE3 uses an ensemble of tree-based regression models, and aims to detect directional network edges. GENIE3 emerged as one of the best-performing algorithms in a previous benchmarking study 55 . The SCENIC algorithm integrates a tree-based GRN inference algorithm with information on TF binding 7 .

Preparation of input data for GRN inference

We used the Tabula Muris scRNA-seq dataset for GRN construction input data 56 . Cells were subsampled for each tissue on the basis of the original tissue-type annotation: spleen, lung, muscle, liver and kidney. Data for each tissue were processed using the standard Seurat workflow, including data normalization, log transformation, finding variable features, scaling, principal component analysis (PCA) and Louvain clustering. The data were downsampled to 2,000 cells and 10,000 genes using highly variable genes detected by the corresponding Seurat function. Cell and gene downsampling were necessary to run the GRN inference algorithms within a practical time frame: we found that some GRN inference algorithms, especially GENIE3, take a long time with a large scRNA-seq dataset, and GENIE3 could not complete the GRN inference calculation even after several days if the whole dataset was used.

GRN inference method

After preprocessing, the exact same data were subjected to each GRN inference algorithm to compare results fairly. We followed the package tutorial and used the default hyperparameters unless specified otherwise. Details are as follows. WGCNA: we used WGCNA v.1.68 with R 3.6.3. WGCNA requires the user to select a ‘power parameter’ for GRN construction. We first calculate soft-thresholding power using the ‘pickSoftThreshold’ function with networkType=“signed”. Other hyperparameters were set to default values. Using the soft-thresholding power value, the ‘adjacency’ function was used to calculate the GRN adjacency matrix. The adjacency matrix was converted into a linklist object by the ‘getLinkLis’ function and used as the inferred value of the WGCNA algorithm. DCOL: we used nlnet v.1.4 with R 3.6.3. The ‘nlnet’ function was used with default parameters to make the DCOL network. The edge list was extracted using the ‘as_edgelist’ function. DCOL infers an undirected graph without edge weights. We assigned the value 1.0 for the inferred network edge and 0.0 for other edges. The assigned value was used as the output of the DCOL algorithm. GENIE3: we used GENIE3 v.1.8.0 with R 3.6.3. The GRN weight matrix was calculated with the processed scRNA-seq data using the ‘GENIE3’ function and converted into a GRN edge and weight list by the ‘getLinkList’ function. GENIE3 provides a directed network with network weight. The weight value was directly used as the inferred value of the GENIE3 algorithm. SCENIC: we used SCENIC v.1.2.2 with R 3.6.3. The SCENIC GRN calculation involves multiple processes. The calculation was performed according to SCENIC’s tutorial ( https://rdrr.io/github/aertslab/SCENIC/f/vignettes/SCENIC_Running.Rmd ). First, we created the initialize settings configuration object with ‘initializeScenic’. Then we calculated the co-expression network using the ‘runGenie3’ function, following the GRN calculation with several SCENIC functions; runSCENIC_1_coexNetwork2modules, runSCENIC_2_createRegulons and runSCENIC_2_createRegulons. We used the ‘10kb’ dataset for the promoter information range. The calculated GRN information was loaded with the ‘loadInt’ function, and the ‘CoexWeight’ value was used as the inferred value of the SCENIC algorithm.

Ground-truth data preparation for GRN benchmarking

Cell-type-specific ground-truth GRNs were generated in the same manner as in a previous benchmarking study 55 . Here, we selected tissues commonly available in the Tabula Muris scRNA-seq dataset, mouse sci-ATAC-seq atlas data and ground-truth datasets: heart, kidney, liver, lung and spleen. The ground-truth data were constructed as follows. (i) Download all mouse TF ChIP–seq data as bed files from the ChIP-Atlas database ( https://chip-atlas.org ). (ii) Remove datasets generated under non-physiological conditions. For example, we removed ChIP–seq data from gene KOs or adeno-associated virus treatment. (iii) Remove data that include fewer than 50 peaks. (iv) Select peaks detected in multiple studies. (v) Group data by TF and remove TFs if the number of detected target genes is less than ten peaks. (vi) Convert data into a binary network: each network edge is labelled either 0 or 1, representing its ChIP–seq binding between genes. These steps yielded tissue- or cell-type-specific ground-truth data for 80 TFs, corresponding to 1,298 experimental datasets.

GRN benchmarking results

GRN inference performance was evaluated by the AUROC and the early precision ratio (EPR), following the evaluation method used in a previous benchmarking study 55 . CellOracle and SCENIC outperformed WGCNA, DCOL and GENIE3 based on AUROC (Extended Data Fig. ​ Fig.2a). 2a ). This is because CellOracle and SCENIC filter out non-transcriptional connections (that is, non-TF–target gene connections) and other methodologies detect many false-positive edges between non-TFs. CellOracle with a scATAC-seq atlas base GRN performed better than CellOracle with a promoter base GRN and SCENIC. This difference was mainly derived from sensitivity (or true-positive rate). With scATAC-seq data, CellOracle captures a higher number of regulatory candidate genes. Considering EPR, representing inference accuracy for top k network edges ( k = number of network edges with the label ‘1’ in the ground-truth data), CellOracle performed well compared to other approaches (Extended Data Fig. ​ Fig.2b): 2b ): GENIE3 and WGCNA assigned a high network edge weight to many non-transcriptional connections, resulting in many false-positive edges for the highly ranked inferred genes.

The CellOracle GRN construction method was analysed further to assess the contribution of the base GRN. We performed the same GRN benchmarking with a scrambled motif base GRN or no base GRN. For the scrambled motif base GRN, we used scrambled TF-binding-motif data for the base GRN construction. For the no base GRN analysis, selection of regulatory candidate genes was skipped, and all genes were used as regulatory candidate genes. As expected, the AUROC scores decreased when we used the scrambled motif base GRN (ranked 12/13 in AUROC, 11/13 in EPR; Extended Data Fig. 2a,b ), decreasing even further in the no base GRN model (13/13; Extended Data Fig. 2a,b ). The scrambled motif base GRN did not detect many regulatory candidate TFs, producing lower sensitivity. However, the scrambled motif base GRN can still work positively by removing connections from non-TF genes to TFs, functioning to filter out false-positive edges, and resulting in a better score relative to the no base GRN model. In summary, the base GRN is primarily important to achieve acceptable specificity, and the scATAC-seq base GRN increases sensitivity.

Next, we used CellOracle after downsampling cells to test how cell number affects GRN inference results. Cells were downsampled to 400, 200, 100, 50, 25 and 10 cells and used for GRN analysis with the scATAC-seq base GRN. GRNs generated with 400, 200, 100 and 50 cells received comparable or slightly reduced AUROC scores. The AUROC score decreased drastically for GRNs generated with 25 and 10 cells (Extended Data Fig. ​ Fig.2c). 2c ). EPR was relatively robust even with small cell numbers (Extended Data Fig. ​ Fig.2d 2d ).

We performed additional benchmarking to investigate data compatibility between the base GRN and scRNA-seq data sources. A tissue-specific base GRN was generated separately using bulk ATAC-seq data 57 . We focused on the same five tissue types as above. Unprocessed bulk ATAC-seq data were downloaded from the NCBI database using the SRA tool kit (spleen: SRR8119827; liver: SRR8119839; heart: SRR8119835; lung: SRR8119864; and kidney: SRR8119833). After FASTQC quality check ( https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ ), fastq files were mapped to the mm9 reference genome and converted into bam files. Peak calling using HOMER was used to generate bed files from the bam files. Peak bed files were then annotated with HOMER. Peaks within 10 kb around the TSS were used. Peaks were sorted by the ‘findPeaks Score’ generated by the HOMER peak-calling step, and we used the top 15,000 peaks for base GRN construction. These peaks were scanned with the gimmemotifs v.5 vertebrate motif dataset, which is the same motif set we use for scATAC-seq base GRN construction.

We compared benchmarking scores between GRN inference results generated from different base GRNs. Overall, GRN construction performed best when the same tissue type for ATAC-seq base GRN construction and scRNA-seq was used (10/13 in AUROC, 11/13 in EPR; Extended Data Fig. 2e,f ). The score was lower with different tissue types combined between the base GRN and scRNA-seq data. In summary, benchmarking confirmed that our GRN construction method performs well for the task of transcriptional GRN inference.

CellOracle evaluation

Evaluation of simulation value distribution range.

We investigated a range of simulated values to confirm that the signal propagation step does not generate an out-of-distribution prediction. Specifically, we assessed the distribution of the sum of the simulated shift and original gene expression, which correspond to the simulated expression level (termed ‘simulation gene expression level’ here for explanatory purposes: X simulation gene expression level  =  X original  + Δ X simulated, ). We evaluate all genes, comparing the simulation gene expression level with the original gene expression distribution. To detect out-of-distribution data, we calculated the maximum exceedance percentage, representing the percentage difference of the maximum value of the simulated gene expression level compared to the maximum value of the wild-type gene expression value. The higher maximum exceedance indicates a bigger difference between simulated and wild-type values, identifying out-of-distribution values. For the Spi1 KO simulation with the Paul et al. haematopoiesis dataset 16 , we present the top four genes showing the maximum exceedance values (Supplementary Fig. 2 ). The simulation expression levels of even these genes appear very similar to the original wild-type distributions of gene expression. For example, in the Ly86 simulated value distribution, 99.963% of all cells are within the wild-type gene expression range. Only 0.037% of cells exhibit a Ly86 gene simulation value outside the wild-type distribution, but the maximum difference is only 3.2%. We designed CellOracle to simulate a minimal relative shift vector rather than an out-of-distribution prediction, confirmed by this analysis. The functions we have used for these analyses are implemented in CellOracle. Users can check simulation value distributions, and CellOracle will produce a warning if out-of-distribution simulations occur.

To further explore the minimum number of cells with minor out-of-distribution values, we generated a simulation vector in which the out-of-distribution values are clipped into the wild-type distribution range. The simulated cell-identity shift vector of clipped values is indistinguishable compared to the original results (Supplementary Fig. 2b–e ), confirming that the CellOracle simulation is not relying on these out-of-distribution values. The out-of-distribution value can be clipped if we add ‘clip_delta_X=True’ in the CellOracle signal propagation function. Thus, users can ensure the simulation is not relying on out-of-distribution values.

CellOracle simulation results generated with randomized GRN or no signal propagation

We performed KO simulation with randomized GRN models to clarify the necessity of the GRN signal propagation simulation. In addition, we calculated cell-identity vectors without the signal propagation step; the cell-identity shift vector was calculated solely on the basis of input TF expression loss, thus representing the information from the expression pattern of only a single TF. The vector map in Supplementary Fig. 3 shows Gata1 KO simulation results and Spi1 KO simulation results with an intact GRN coefficient matrix, randomized GRN matrix or no GRN signal propagation. The randomized GRN analysis results and no GRN signal propagation results show only slight cell-identity shift vectors (Supplementary Fig. 3b,c,e,f ). Although very subtle vectors can be observed, most expected simulation results are not obtained. Thus, we confirmed that the GRN signal propagation strategy has an essential role in the CellOracle KO simulation.

Evaluation of signal propagation number

We next tested the number of iterations at the signal propagation step. We performed KO simulations using two independent mouse haematopoiesis datasets: Paul et al. 16 and Dahlin et al. 58 . For several TFs, we tested different numbers of signal propagation rounds in the KO simulations across independent clusters. First, focusing on the Paul dataset, simulation vector fields for Spi1 and Gata1 , with 0, 1 and 3 rounds of signal propagation, were investigated (Supplementary Fig. 4 ). The simulation under hyperparameter n  = 0 shows the vector calculated without any signal propagation within the GRN; that is, the vector is calculated from only the difference of the input TF gene expression shift. This n  = 0 simulation shows almost no phenotype, showing the necessity of the GRN signal propagation process. Next, a comparison of vector fields from n  = 1 and n  = 3 simulations shows similar results. This is not surprising given the following. (1) Most coefficient values in the GRN are small, ranging between −1 and 1 (Supplementary Fig. 4d ). (2) Accordingly, the signal will be attenuated over the propagation process in most cases. (3) This also means that the first signal propagation step will produce the most significant shifts relative to the later steps. However, when scrutinizing the vectors, we observe a more evident shift in cell identity around the late GMP cluster and the early granulocytes in the n  = 3 Gata1 KO vectors compared to n  = 1 vectors. The results suggest that the second and third rounds of signal propagation increase the sensitivity to detect small shifts by adding the second and third rounds of downstream gene effects, respectively.

To quantify these observations and determine whether there is an ideal number of signal propagation rounds, we investigated the L1-norm of Δ X , representing the sum of the magnitudes of each simulated gene expression shift. The L1-norm of Δ X is almost saturated at the n  = 3 in most cases (Supplementary Fig. 4c ). We also performed these analyses with the Dahlin haematopoiesis dataset 58 (Supplementary Fig. 5 ). Overall, the results are consistent with our analysis of the Paul data. Again, we observe that the L1-norm of Δ X is saturated at relatively small n values in most cases. However, Cebpa is an outlier in this analysis, in which the delta X length gradually and continuously increases as n increases. We next examined the vector field of Cebpa with various n (Supplementary Fig. 6 ). Despite such divergence of the L1-norm of Δ X , the vector field of Cebpa showed consistent results, suggesting that the calculation strategy for cell-identity shift is robust using the local neighbour vectors (Extended Data Fig. ​ Fig.1e 1e ).

Altogether, at n  = 3, the simulated shift vectors almost converge, producing consistent results. In rare cases, the L1-norm of Δ X might show divergence with n . However, the n  = 3 simulation results are consistent with higher n values, which might generate unexpected behaviour owing to signal divergence. On the basis of these analyses, we recommend that users perform three iterations for the signal propagation step.

Selection of dimensionality reduction method

Celloracle simulation with umap and t -sne using paul et al. haematopoiesis data.

We used UMAP and t -distributed stochastic neighbour embedding ( t -SNE) for the perturbation simulation analysis to show how the choice of dimensionality reduction affects CellOracle results. We used Scanpy to construct UMAP or t -SNE plots using the Paul et al. haematopoiesis dataset 16 . In the UMAP (Supplementary Fig. 7a ), we observe a similar trajectory that agrees with the force-directed graph (Fig. ​ (Fig.1b). 1b ). However, monocyte and granulocyte branches on the UMAP are relatively less resolved. This notwithstanding, the simulation results using the UMAP (Supplementary Fig. 8 , top) lead to the same conclusion as Fig. ​ Fig.1. 1 . For example, in the Gata1 KO simulation, we correctly predict inhibited differentiation along the MEP lineage whereas GMP differentiation is promoted. Furthermore, we predict inhibited GMP to granulocyte differentiation, consistent with our force-atlas-based presentation in Fig. ​ Fig.1h. 1h . In comparison, the overall structure of the t -SNE graph is consistent with the force-directed and UMAP graphs, although it lacks resolution (Supplementary Fig. 7b ). However, the t -SNE results still agree with Fig. ​ Fig.1, 1 , just at a lower resolution (Supplementary Fig. 8 , bottom). In conclusion, we stress that the choice of the dimensional reduction algorithm is crucial to sensitively analyse the cell differentiation trajectory.

Guidance for selecting the dimensionality reduction method

For the force-directed graph calculation, we recommend using Scanpy’s sc.pl.draw_graph function 59 or SPRING 60 . Both internally use force atlas 2 (ref. 61 ). Compared to UMAP, force-directed graphs can capture more fine-branching structures but can be unstable if the data have many branches that can overlap. To avoid branch overlap, PAGA cell trajectory information can be used to initiate the force-directed graph calculation: https://scanpy.readthedocs.io/en/stable/tutorials.html# https://github.com/theislab/paga .

We recommend using force-directed graphs as a first choice because they generally produce a high-resolution lineage structure. However, we recommend UMAP as a reliable alternative if overlapping branches are observed. In our CellOracle tutorial, we show the detailed guide and code for the dimensionality reduction implementation, including data preprocessing: https://morris-lab.github.io/CellOracle.documentation .

CellOracle KO simulation with unrelated cell-type base GRNs

To assess how base GRN performance relates to scATAC data source, we performed TF KO simulations in haematopoiesis using the ‘general’ mouse scATAC-seq atlas 13 base GRN versus a heart-specific base GRN to represent an unrelated cell type (Supplementary Fig. 9 ). The simulation vectors using the mismatched heart base GRN are weaker, although still in general agreement. We reason that even if the base GRN retains some edges that are not active in the scRNA-seq data, CellOracle can still work robustly. However, simulation with the heart base GRN fails to detect the early granulocyte phenotype in the Gata1 KO and almost all shifts in the Cepba KO, suggesting reduced sensitivity with the mismatched base GRN.

We also assess the mean degree centrality (the number of genes to which a TF is connected) in the inferred GRNs for each of four TFs (Supplementary Fig. 10 ). With the inappropriate heart base GRN, CellOracle fails to build network edges for some genes, resulting in a low degree centrality score and reduced simulation sensitivity. We recommend starting CellOracle analysis with the general GRN and comparing its performance against tailored base GRNs.

Markov simulation based on CellOracle simulation vector

To estimate cell distribution in response to gene perturbation, we need to consider both the differentiation hierarchy and the perturbation vector together. We performed a Markov random walk simulation as described previously 52 ( https://github.com/velocyto-team/velocyto.py ) with some modifications. First, our Markov simulation used the CellOracle cell-identity transition vector in addition to the differentiation vector; the transition probability matrix for these vectors was applied alternatively to consider both effects. Second, cells in early differentiation stages were selected and used for the initial state of our Markov simulation, whereas the previous study used the whole population as the initial state 52 . The Markov simulation analysis with data from another study 59 is shown in Supplementary Fig. 17 to show typical simulation results and their interpretation.

CellOracle analysis with previously published scRNA-seq and scATAC-seq data

Paul et al. mouse haematopoiesis scrna-seq data.

The GEM was downloaded with Scanpy’s data loading function, scanpy.datasets.paul15(). After removing genes with zero counts, the GEM was normalized by total UMI counts ((scanpy.pp.filter_genes(min_counts=1), scanpy.pp.normalize_per_cell(key_n_counts=‘n_counts_all’)). Highly variable genes, including 90 TFs, detected by scanpy.pp.filter_genes_dispersion(flavor=‘cell_ranger’, n_top_genes=2000, log=False), were used for the following downstream analysis: the GEM was log-transformed, scaled and subjected to PCA (scanpy.pp.log1p(), scanpy.pp.scale(), scanpy.tl.pca(svd_solver=‘arpack’)). We calculated the force-directed graph dimensional reduction data based on the PAGA graph 62 for initialization (scanpy.tl.paga(), scanpy.tl.draw_graph(init_pos=‘paga’)). Cells were clustered using the Louvain clustering method (scanpy.tl.louvain (resolution=1.0)). Clusters were annotated manually using marker gene expression and the previous annotations from Paul et al. 16 We removed dendritic cell (DC) and lymphoid cell clusters to focus on myeloid cell differentiation. GRN calculation and simulation were performed as described above, using the default parameters. For the base GRN, we used the base GRN generated from the mouse sci-ATAC-seq atlas dataset 13 .

Cell density was visualized using a kernel density estimation (KDE) plot. First, we performed random downsampling to 768 cells to adjust the cell number between WT and KO samples. KDE was calculated with the scipy.stat.gaussian_kde function. The calculated KDE was visualized with the matplotlib.pyplot.contour function. We used the same contour threshold levels between all samples.

Although we did not focus on the network structure in the main text, we examined CellOracle GRN models using graph theory approaches before the simulation analysis. Graph theory analysis revealed that these inferred GRN configurations resemble a scale-free network the degree distribution of which follows a power law, a characteristic configuration of biological networks 63 (Extended Data Fig. ​ Fig.3b). 3b ). Further, we assess GRNs using degree centrality—a basic measure of how many genes a TF connects to 63 . Using the MEP cluster as an example, 27 out of 30 genes with a high degree centrality score in the MEP_0 GRN are confirmed known regulators of MEP lineage differentiation or stem and progenitor cell function (Extended Data Fig. ​ Fig.3c 3c and Supplementary Table 2 ). Analysis of additional clusters yielded similar agreement with previous literature, confirming that CellOracle GRN inference captures biologically plausible cell-state-specific GRN structures, consistent with previous biological knowledge. All network analysis and simulation results can be explored at https://www.celloracle.org .

Pijuan-Sala et al. mouse early gastrulation and organogenesis scRNA-seq data

We applied CellOracle to a scRNA-seq atlas of mouse gastrulation and organogenesis by Pijuan-Sala et al. 30 . This single-cell profiling of WT cells highlighted a continuous differentiation trajectory across the early development of various cell types (Extended Data Fig. ​ Fig.9a). 9a ). In addition, the developmental effects of Tal1 KO, a TF known to regulate early haematoendothelial development 64 , 65 , were investigated in this study. We validated the CellOracle simulation using these Tal1 KO ground-truth scRNA-seq data. The data were generated from seven chimeric E8.5 embryos of WT and Tal1 KO cells (25,307 cells and 26,311 cells, respectively). We used the R library, MouseGastrulationData ( https://github.com/MarioniLab/MouseGastrulationData ), to download the mouse early gastrulation scRNA-seq dataset. This library provides the GEM and metadata. We used the Tal1 chimera GEM and cell-type annotation, “cell type.mapped”, provided by this library. Data were normalized with SCTransform 66 . The GEM was converted to the AnnData format and processed in the same way as the Paul et al. dataset. For the dimensionality reduction, we used UMAP using the PAGA graph for the initialization (maxiter=500, min_dist=0.6). We removed the extraembryonic ectoderm (ExE), primordial germ cell (PGC) and stripped nuclei clusters which lie outside the main differentiation branch. After removing these clusters, we used the WT cell data for the simulations (24,964 cells). GRN calculations and simulations were performed as described above using the default parameters. We used the base GRN generated from the mouse sci-ATAC-seq atlas dataset. We constructed cluster-wise GRN models for 25 cell states. Then, we simulated Tal1 KO effects using the WT scRNA-seq dataset. For the late-stage-specific Tal1 conditional KO simulation, we set Tal1 expression to be zero in the blood progenitor and erythroid clusters to analyse the role of Tal1 in late erythroid differentiation stages (Extended Data Fig. 9i,j ).

Farrell et al. zebrafish early development scRNA-seq data

GEM, metadata and URD trajectory data were downloaded from the Broad Institute Single Cell Portal ( https://tinyurl.com/7dup3b5k ). We used the cell clustering data and developmental lineage data from Farrell et al. 32 The GEM was already normalized and log 2 -transformed, which we converted to non-log-transformed data before CellOracle analysis. The GEM had human gene symbols, which we converted back to zebrafish gene symbols using gene name data in ZFIN ( https://zfin.org ). We used URD dimensional reduction embedding data. To use the URD differentiation trajectory in the CellOracle simulations, we ran several preprocessing and calculations. We first identified cells with URD coordinate data ( n  = 26,434 cells). The “EPL/periderm and primordial germ cell” cluster, which represents 1.7% of the total population, was also excluded from our analysis because it is located outside the main differentiation trajectory branch. The whole URD structure ( n  = 25,711 cells) was split into four sub-branches to simplify the calculations (Extended Data Fig. 10b ). Then, we converted the original URD coordinates, a 3D matrix, into a 2D matrix using PCA (sklearn.decomposition.PCA) because CellOracle requires 2D dimensional reduction embedding data. The GEM was converted into the AnnData format. At the variable gene detection step, we selected the top 3,000 genes. GRN calculation and simulations were performed as described above using the default parameters. We did not calculate pseudotime because the pseudotime data calculated with URD were available. The pre-calculated pseudotime data were used to calculate the 2D development vector field. For base GRN construction, we used UCSC TSS and promoter data and the zebrafish reference genome ( https://useast.ensembl.org/Danio_rerio/Info/Index ), danRer11 (the bed file is included in the CellOracle package). The promoter DNA sequence was scanned with CisBP version2 motif dataset to generate the base GRN ( http://cisbp.ccbr.utoronto.ca ).

For screening novel regulators of axial mesoderm cell identity, we prioritized candidate genes as follows. First, we performed CellOracle KO simulations for 232 active TFs, which had at least one gene edge in the constructed GRN model in the axial mesoderm branch (Extended Data Fig. 12a , step 1). We focused on the early differentiation stage by selecting cells between digitized pseudotime 0 and 3 (Extended Data Fig. 12a , step 2). For this analysis, we focused on negative perturbation scores to identify candidate TFs. A large negative perturbation score indicates a predicted inhibition or block in differentiation following TF KO; thus, we reasoned that these TFs might have a positive role in differentiation (Extended Data Fig. 12a , step 3). To prioritize TFs according to the predicted differentiation inhibition effects, we ranked TFs according to the sum of their negative perturbation scores, resulting in the 30 genes listed in Fig. ​ Fig.5a. 5a . Next, we surveyed the GRN degree centrality scores of 30 candidate genes in the notochord cluster GRN because we reasoned that those genes with higher GRN degree centrality result in a more reliable simulation. Then, we calculated the gene specificity score comparing the axial mesoderm sub-branch and the other sub-branches using the Scanpy function, sc.tl.rank_genes_groups(). Although gene specificity does not necessarily relate to gene function, we assumed that specific gene expression would simplify the interpretation of experimental results and reduce the likelihood of unexpected phenotypes from clusters other than axial mesoderm. Finally, we analysed mean expression, assuming perturbation experiments with highly expressed genes would be more robust, especially in the CRISPR–Cas9-based F 0 embryo analysis. After removing previously reported genes, we selected candidate genes that exist in the 50th percentile of these scores (Extended Data Fig. 12b , highlighted in a grey rectangle), resulting in lhx1a , sebox , irx3a , creb3l1 and zic2a . We finally selected three candidates, lhx1a , sebox and irx3a , and removed creb3l1 and zic2a from the first LOF experiment list, according to the following rationale: creb3l1 gene sequences are similar to creb3l2 ; thus, it was challenging to design specific sgRNAs to target this gene; creb3l2 was previously reported to regulate axial mesoderm development. Although zic2a narrowly passed the gene specificity threshold described above, we found that zic2a expression was high in the other mesendoderm sub-branch and the ectoderm sub-branches; thus, we excluded this gene from our downstream analyses.

Dahlin et al. mouse haematopoiesis scRNA-seq data

Mouse haematopoiesis scRNA-seq data from Dahlin et al. 58 were downloaded from the PAGA GitHub repository ( https://github.com/theislab/paga ). The GEM was normalized by total UMI counts after removing genes with zero counts ((scanpy.pp.filter_genes(min_counts=1), scanpy.pp.normalize_per_cell(key_n_counts=‘n_counts_all’)). Highly variable genes were detected and used for the following downstream analysis: (scanpy.pp.filter_genes_dispersion(flavor=‘cell_ranger’, n_top_genes=3000, log=False)). The GEM was log-transformed, scaled and subjected to PCA and Louvain clustering (scanpy.pp.log1p(), scanpy.pp.scale(), scanpy.tl.pca(svd_solver=‘arpack’), scanpy.tl.louvain(resolution=1.5)). The original force-directed graph reported in Dahlin et al. 58 was used for the CellOracle simulation. GRN calculation and simulation were performed using the default parameters. For the base GRN, we used the mouse sci-ATAC-seq atlas dataset 13 .

Setty et al. human haematopoiesis scRNA-seq data

Human haematopoiesis scRNA-seq were downloaded from the Human Cell Atlas: https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45 (Setty et al.) 67 . The GEM was normalized by total UMI counts after removing genes with zero counts ((scanpy.pp.filter_genes(min_counts=1), scanpy.pp.normalize_per_cell(key_n_counts=‘n_counts_all’)). Highly variable genes were detected and used for the following downstream analysis: (scanpy.pp.filter_genes_dispersion(flavor=‘cell_ranger’, n_top_genes=3000, log=False)). The GEM was log-transformed, scaled and subjected to PCA and Louvain clustering (scanpy.pp.log1p(), scanpy.pp.scale(), scanpy.tl.pca(svd_solver=‘arpack’), scanpy.tl.louvain(resolution=1.5)). The force-directed graph was calculated with SPRING ( https://kleintools.hms.harvard.edu/tools/spring.html ). We removed DC and lymphoid cell clusters in line with the Paul et al. 16 data analysis. GRN calculation and simulation were performed using the default parameters. For the base GRN, we used the base GRN generated using the Buenrostro et al. human haematopoiesis scATAC-seq data described below 68 .

Buenrostro et al. human haematopoiesis scATAC-seq data

Human haematopoiesis scATAC-seq data from Buenrostro et al. 68 were used to construct a human haematopoiesis base GRN. The scATAC-seq peak data and count matrix was obtained from the Gene Expression Omnibus (GEO), with accession code GSE96769 , and processed with Cicero (v.1.3.4) to obtain co-accessibility scores as follows: After removing peaks with zero counts, cells were filtered by the peak count (min count = 200, max count = 30,000). The data were processed using Cicero functions (detect_genes(), estimate_size_factors(), preprocess_cds(method = "LSI"), reduce_dimension(reduction_method = ‘UMAP’, preprocess_method = "LSI")). Then Cicero co-accessibility scores were calculated using run_cicero() with human chromosome length information imported by data("human.hg19.genome"). Output peak and co-accessibility scores were used for CellOracle base GRN construction. CellOracle annotated the TSS site in the peaks, and the TSS peaks and cis -regulatory peaks with co-accessibility scores ≥ 0.8 were used for motif scanning. We used the gimmemotifs vertebrate v5 motif dataset, which is CellOracle’s default for mouse and human motif scanning.

TF motif enrichment analysis was performed using ChromVar 68 . The ChromVar score matrix, which includes 2,034 cells and 1,764 motif data, was processed with scanpy to generate a force-directed graph and Louvain clustering (scanpy.tl.pca(), scanpy.tl.louvain(resolution=0.5), scanpy.tl.draw_graph()). The cluster was annotated using cell source FACS gate sample labels. The score fold change was calculated and visualized as a volcano plot (Supplementary Fig. 16 ). The statistical test was performed using the two-tailed Wilcoxon rank-sum test with Bonferroni correction.

Comparison between CellOracle haematopoiesis KO simulation results and previous reports

CellOracle KO simulation results for 12 key TFs that regulate myeloid differentiation are shown in Figs. ​ Figs.1 1 and ​ and2, 2 , Extended Data Figs. ​ Figs.5 5 and ​ and6 6 and Supplementary Figs. 13 and 14 . The simulation results were compared with previous reports (summarized in Supplementary Table 2 ). In these figures, the summary of the simulation results is shown in the right column with the mark (*), which indicates that the simulation results agree with the previously reported role or phenotype of the TF. We note that the input haematopoiesis data focus on the myeloid lineage; thus, the simulation results show relative cell-identity shifts within the myeloid lineage only. For example, Spi1 has an important role not only in the myeloid lineage but also in other cell types, such as HSCs and lymphoid lineages 69 . However, we cannot simulate the role in these cell types if they are not present in the input data.

Klf1 promotes differentiation towards the ME lineage, promoting erythroid cell differentiation in particular 15 . CellOracle simulation results agree with this role (Extended Data Fig. ​ Fig.5a 5a and Supplementary Figs. 13e and 14e ).

Gata1 promotes ME lineage differentiation and also promotes granulocyte differentiation 15 , 70 . Both the Paul et al. 16 and Dahlin et al. 58 data simulation results reproduce these Gata1 roles. (Fig. ​ (Fig.1f 1f and Supplementary Fig. 13b ). In the Setty et al. dataset 67 , the ME lineage phenotype is reproduced, but the granulocyte phenotype is not observed (Supplementary Fig. 14b ). We speculate that this is because the Setty dataset includes few mature granulocytes.

Gata2 is a key factor in maintaining stemness in MPPs 15 . Simulation results in all data agree with this role for Gata2 (Extended Data Fig. ​ Fig.6a 6a and Supplementary Figs. 13i and 14g ).

Spi1 promotes GM lineage differentiation. The inhibition of Spi1 shifts cell identity from the GM to the ME lineage 15 , 71 . Simulation results in all datasets agree with this role of Spi1 (Fig. ​ (Fig.1e 1e and Supplementary Figs. 13a and 14a ).

Cebp a promotes GM lineage differentiation while inhibiting ME lineage differentiation 16 , 72 , and promoting granulocyte differentiation in particular 15 . Simulation results using the Paul et al. 16 and Dahlin et al. 58 datasets agree with this role for Cebpa (Fig. ​ (Fig.2b 2b and Supplementary Fig. 13c ). Although the ME lineage phenotype is not detected using the Setty et al. dataset 67 , the GM lineage phenotype is successfully reproduced (Supplementary Fig. 14c ).

Cebpe promotes granulocyte lineage differentiation 15 , 16 . Simulation results in all datasets agree with this role of Cebpe (Fig. ​ (Fig.2c 2c and Supplementary Figs. 13d and 14d ).

Gfi1 promotes granulocyte lineage differentiation 15 , 72 – 74 . Simulation results using the Paul et al. 16 and Dahlin et al. 58 datasets agree with this role of Gfi1 (Extended Data Fig. ​ Fig.5c 5c and Supplementary Fig. 13g ).

Gfi1b promotes ME lineage differentiation 15 . Simulation results in all data suggest that Gfi1b promotes erythroid differentiation (Extended Data Fig. ​ Fig.5b 5b and Supplementary Fig. 13f ). The Mk phenotype is unclear in the simulation, probably owing to the small numbers of Mk cells.

Irf8 promotes GM lineage differentiation. In particular, Irf8 promotes monocyte differentiation as a lineage switch between monocyte and granulocyte bifurcation 29 . Simulation results in all data agree with the role of Irf8 (Extended Data Fig. ​ Fig.5d 5d and Supplementary Figs. 13h and 14f ).

Lmo2 is a central factor in maintaining stemness in the MPP compartment 15 . Simulation results using the Dahlin et al. data 58 agree with this role. (Supplementary Fig. 13l ). However, simulation results using Paul et al. data 16 showed a different phenotype in erythrocyte cells, suggesting that Lmo2 is also crucial for promoting erythroid differentiation (Extended Data Fig. ​ Fig.6d). 6d ). A function of Lmo2 in promoting erythroid differentiation was also reported 75 .

Runx1 is an central factor in maintaining stemness in the MPP compartment 15 Simulation results in all datasets agree with this role of Runx1 (Extended Data Fig. ​ Fig.6b 6b and Supplementary Fig. 13j ).

  • Fli1 ( FLI1 )

Fli1 has context-dependent roles. Fli1 is a key factor for Mk differentiation 15 , and for maintaining stemness in the stem and progenitor comparment 76 . The simulations consistently reproduce these phenotypes (Extended Data Fig. ​ Fig.6c 6c and Supplementary Figs. 13k and 14h ). In addition, a previous study reported that loss of Fli1 causes dysregulation in later differentiation stages 77 , consistent in simulations using the Paul et al. dataset 16 (Extended Data Fig. ​ Fig.6c 6c ).

Zebrafish lines

The zebrafish experiments were approved by the Institutional Animal Care and Use Committees at Washington University in St Louis. All animal experiments followed all relevant guidelines and regulation. The following zebrafish lines were used in this study: AB* and floating head n1/n1 ( flh/noto ) mutants 37 . Sample sizes and developmental stages are stated below. Randomization was not performed as experimental groups were determined by genotype. Blinding was performed for the generation and analysis of the single-cell data.

CRISPR–Cas9-based mutagenesis of F 0 embryos

To generate somatic gene deletions in early zebrafish embryos, we used CRISPR–Cas9 with two or three sgRNAs as described previously 78 . In brief, sgRNAs were designed using CHOPCHOP ( http://chopchop.cbu.uib.no/ ) to target 5′ exons and the functional domain of each selected TF and synthesized (IDT) (Supplementary Fig. 20b ). sgRNA sequences are listed in Supplementary Table 6. Duplex sgRNA was prepared by mixing equimolar amounts of Alt-R crRNA and Alt-tracrRNA (IDT) in IDT Duplex Buffer, heating to 95 °C and slowly cooling to room temperature (RT) for 20 min. For the final mix of ribonucleoprotein complex (RNPs), around 4 μM duplex sgRNA was assembled with around 5 μM CRISPR–Cas nuclease (Alt-R S.p. HiFi Cas9 Nuclease V3) in 3 M KCl 0.025% and phenol red solution. The activity of HiFi Cas9 and selected sgRNAs was confirmed with regular PCR, Sanger sequencing and capillary electrophoresis, as described previously 40 . In brief, DNA from eight embryos for each combination of Cas9 and sgRNAs was extracted at 10 hpf. PCR amplification was performed with primers complementary to sequences 250 bp upstream and downstream of the PAM sequences (Supplementary Table 6 ). In addition, tracking of indels by decomposition (TIDE) 79 analysis was used to predict the percentage of indels at the target locus (Supplementary Fig. 20c ). flh n1/n1 mutant embryos were generated by crossing heterozygotes and selecting mutants on the basis of their morphology at 10 hpf.

Embryo collection and processing

Zebrafish embryos were produced by natural matings and injected at the one-cell stage with around 2–4 nl of RNP solution into the blastodisc. Embryos were incubated at 28 °C after removing those damaged during the injection process. After 9 hpf, embryos were enzymatically dechorionated and deyolked as previously described 32 . In brief, embryos were dechorionated by incubation in 1 mg ml −1 pronase, washed with ‘blue water’ and then transferred into plastic Petri dishes coated with 2% agarose with methylene blue water. Deyolking was performed manually by ‘squeezing’ the yolk out of the blastoderm cap with a closed pair of forceps inserted between the embryonic blastoderm and the yolk. The layer of cells detached from the yolk was transferred to a 1.5-ml Eppendorf tube with 50 μl of DMEM/F12 medium. For each experiment, 40–50 individual CRISPR–Cas9-targeted embryos (crispants) were prepared for dissociation into single-cell suspensions. Cell dissociation was performed according to the previous report (Farrell et al.) 32 . DMEM/F12 medium was added to the Eppendorf tube to bring the total volume to 200 μl. Cells were mechanically dissociated by flicking the tube 15 times and pipetting 3 times. The cell mixture was spun at 300 g for 2 min and twice washed with PBS + 0.1% BSA. The same procedure was followed to collect and dissociate cells from WT and flh n1/n1 mutant embryos.

RNA extraction and qRT-PCR

Total RNA was extracted from approximately 50 embryos for each experimental condition, homogenized in Trizol (Life Technologies) and further purified following Qiagen RNEasy Mini Kit instructions 80 . One microgram of total RNA was used to synthesize cDNA with the iScript kit (BioRad) following the manufacturer’s protocol. SYBR green (BioRad) qRT-PCR reactions were run in a CFX Connect Real-Time PCR detection system (BioRad) with three technical replicates. The primers used are listed in Supplementary Table 6 .

Whole-mount in situ hybridization

An antisense RNA probe for nog1 was generated from plasmid pBSKII 81 , previously linearized with Not1, and used as a template for in vitro transcription using NEB T7 RNA polymerase and RNTPs labelled with digoxygenin (DIG) (Roche). WISH was performed according to a previous report 82 . In brief, embryos were fixed overnight in 4% paraformaldehyde (PFA) in phosphate-buffered saline (PBS), rinsed in PBS + 0.1% Tween 20 (PBT) and dehydrated in methanol. Embryos were then rehydrated in PBT and incubated for at least 2 h in hybridization solution (HM) with 50% formamide (in 0.75 M sodium chloride, 75 mM sodium citrate, 0.1% Tween 20, 50 μg ml −1 heparin (Sigma) and 200 μg ml −1 tRNA) at 70 °C, then hybridized overnight at 70 °C with antisense probes diluted approximately 1 ng μl −1 in hybridization solution. Embryos were washed through a series of 10 min, 70 °C washes in HM diluted with 2× SSC buffer (0.3 M sodium chloride and 30 mM sodium citrate) once in each of the following: 75% HM, 50% HM, 25% HM and 100% 2× SSC. The same gradual change from SSC to PBT was performed for the subsequent washes. Embryos were blocked at RT for several hours in PBT with 2% goat serum and 2 mg ml −1 bovine serum albumin (BSA), then incubated overnight at 4 °C with anti-DIG antibody (Roche 11093274910) at 1:5,000 on a horizontal shaker (40 rpm). Embryos were rinsed six times for 15 min per wash in PBT, and then in staining buffer (PBT+100 mM Tris pH 9.5, 50 mM MgCl 2 and 100 mM NaCl) before staining with BM Purple solution (Roche).

HCR was performed according to the protocols provided by Molecular Instruments ( https://www.molecularinstruments.com ). Embryos were fixed at 10 hpf with 4% PFA, dehydrated with methanol and rehydrated as described for WISH above. Embryos were pre-hybridized in hybridization buffer (Molecular Instruments) for 1 h at 37 °C and subsequently incubated in 200 μl of hybridization solution containing 1 pg of probes overnight at 37 °C. Embryos were then washed four times in wash buffer (Molecular Instruments) followed by two washes in 5× SSCT, containing 5× SSC buffer (Thermo Fisher Scientific) and 0.1% Tween 20 (Sigma). For the pre-amplification step, embryos were incubated in amplification buffer (Molecular Instruments) for more than 1 h. At the same time, hairpin mixtures were prepared by heating 12 pmol of hairpin 1 (H1) and 2 (H2) for each sample to 95 °C for 90 s, followed by cooling in the dark for 30 min at RT. H1 and H2 were mixed and then added to 200 μl amplification buffer. Embryos were incubated in the hairpin mixture at RT overnight in the dark. On the third day, embryos were washed more than 4 times in 5× SSCT and either stored at 4 °C or mounted for microscopy.

Embryos subjected to HCR were mounted in 3% low-melt agarose in glass-bottomed 35-mm Petri dishes. Alternatively, embryos were manually deyolked and flattened on a glass slide with one to two drops of 3% methylcellulose. Images of the anterior and posterior body regions were taken by acquiring around 200-μm z -stacks with a 1-μm step, using a 10× objective lens on a modified Olympus IX81 inverted spinning disc confocal microscope equipped with Voltran and Cobolt steady-state lasers and a Hamamatsu ImagEM EM CCD digital camera.

Image quantification with IMARIS software

Individual confocal 3D datasets were analysed with IMARIS 9.9 software (Bitplane). On the basis of the DAPI signal, radii were determined by taking half of the longest diameter of each nucleus, which was measured as a single spot using the ‘spots’ view in IMARIS. These parameters were applied in all images used for quantification. Nuclei positive for specific probes within a selected area were identified using the ‘spots’ view as spots with a signal in the specific channel that overlapped with DAPI spots. Analysis was performed on eight embryos: four anterior and four posterior per experimental group.

10X Chromium procedure

For single-cell library preparation on the 10X Genomics platform, we used: the Chromium Single Cell 3′ Library & Gel Bead Kit v2 (PN-120237), Chromium Single Cell 3′ Chip kit v2 (PN-120236) and Chromium i7 Multiplex Kit (PN-120262), according to the manufacturer’s instructions in the Chromium Single Cell 3′ Reagents Kits V2 User Guide. Before cell capture, methanol-fixed cells were placed on ice, then spun at 3,000 rpm for 5 min at 4 °C, followed by resuspension and rehydration in PBS, as described previously 83 . A total of 17,000 cells were loaded per lane of the chip, aiming to capture 10,000 single-cell transcriptomes. The resulting cDNA libraries were quantified on an Agilent TapeStation and sequenced on an Illumina NextSeq 550.

10X Chromium scRNA-seq data processing

10x alignment and digital gem generation.

The Cell Ranger v5.0.1 pipeline ( https://support.10xgenomics.com/single-cell-gene-expression/software/downloads/latest ) was used to process data generated using the 10X Chromium platform. Cell Ranger processes, filters and aligns reads generated with the Chromium single-cell RNA sequencing platform. Following this step, the default Cell Ranger pipeline was implemented, and the filtered output data were used for downstream analyses.

Zebrafish scRNA-seq data processing

We used the R package Seurat (v.4.0.1) to process scRNA-seq data. Cells were filtered by RNA count and percentage of mitochondrial genes to remove low-quality cells. Data were normalized using the Seurat NormalizeData() function. Variable genes were identified using the FindVariableFeatures() function with nfeature = 2,000. Data were integrated by applying Seurat functions, SelectIntegrationFeatures(), FindIntegrationAnchors() and IntegrateData() using default parameters. After data scaling, PCA and clustering were performed. The data after cell calling may include cells with very low mRNA counts generated from non-cell GEMs and ambient RNA. To remove such non-cell GEM data, we assessed the RNA count distribution to remove clusters with an abnormal RNA count distribution. Scaling, PCA, clustering and t - SNE were performed again after removing the cells above. t - SNE was calculated using the first 30 principal components. We applied the same pipeline to the WT reference, flh mutant and crispant scRNA-seq data.

After data integration and standard scRNA-seq preprocessing, the whole WT reference scRNA-seq data were annotated as follows. The segmentation labels generated in the Farrell et al. 32 zebrafish scRNA-seq data were transferred to the new scRNA-seq data using the Seurat function, FindTransferAnchors and TransferData, with default parameters. We manually adjusted the cell annotation to account for differences in the timing of cell collection. We generated cell-type annotations for the clustering data generated in the previous step by referring to the Farrell et al. dataset annotation labels. The WT reference cell-type annotations were transferred to the other scRNA-seq data using the same Seurat label transfer functions.

To compare cell identity on the same 2D embedding space, we used UMAP and the UMAP transfer function. We first calculated UMAP with axial mesoderm clusters in WT reference datasets. Using this pre-trained UMAP model, we projected KO and control axial mesoderm data onto the same UMAP 2D embedding space constructed with WT reference data.

Cell density was visualized using a KDE plot. First, we performed random downsampling to adjust the cell number between the LOF control samples. (i) Whole-cell populations were randomly subsampled into a subset to have an equal cell number to the smaller dataset. (ii) Then, axial mesoderm cells were selected and subjected to KDE calculation. KDE was calculated with the scipy.stat.gaussian_kde function. The calculated KDE was visualized with the matplotlib.pyplot.contour function. We used the same contour threshold levels between the LOF and control samples.

In addition to the UMAP transfer analysis above, the WT data, lhx1a crispant and tyr crispant data were analysed with UMAP without data transfer (Supplementary Fig. 21 ). The 10 hpf axial mesoderm cell data were integrated using Seurat functions (SelectIntegrationFeatures(), FindIntegrationAnchors(), and IntegrateData() with default parameters), and then UMAP graph and Louvain cluster were calculated (RunPCA(), FindNeighbors(reduction = "pca", dims = 1:30), RunUMAP(reduction = "pca", dims = 1:30, min.dist = 1), FindClusters(resolution = 1.5)).

We performed NMF with our lhx1a crispants scRNA-seq dataset according to a previous report 32 . The normalized UMI counts were standardized, log-transformed and subjected to NMF calculation with sklearn.decomposition.NMF(n_components=40). Each module was manually annotated by its cluster enrichment and gene ontology calculated with the top 30 genes with high module weight. Gene annotation, weight and ontology are provided in Supplementary Table 3 . Gene ontology was calculated with the g:Profiler API ( https://biit.cs.ut.ee/gprofiler/page/apis ). The background was set to all genes used in the NMF calculation. Clusters governed by a single gene were excluded from our analysis.

Statistical testing

Details of all statistical tests performed are provided in Supplementary Table 4 . Scipy stat module (scipy version 1.7.0) was used for statistical analysis. In summary, we selected the statistical method as follows: (i) chi-square test was used to analyse the relationships of categorical variables; (ii) Wilcoxon rank-sum test (Mann–Whitney U test) was used when the data distribution type was not apparent; (iii) in cases in which the data distribution followed a Gaussian distribution, a t -test was used. Where multiple comparisons were made, the Bonferroni correction was applied. An alternative hypothesis (one-tailed or two-tailed) was selected depending on the aim of the analysis.

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Online content

Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-022-05688-9.

Supplementary information

This file contains Supplementary Figures 1–26, Supplementary Tables 1–6 and Supplementary References

Details of CellOracle pre-built base GRN data. This table details the different species for which we have constructed base GRNs.

TF annotation and previous study references. This table details the previously reported knockout phenotypes in mouse haematopoiesis and zebrafish development.

Negative PS sum scores for zebrafish KO simulation. This table details the ranked list of predicted regulators of somitogenesis, with published phenotypes.

Details of statistical tests. This table details the statistical tests used throughout this study.

Lhx1a crispant NMF data. This table details the NMF analysis of Lhx1a crispants. For statistical tests, Wilcoxon rank-sum test, Two-tailed was used. The p-value was corrected with the Bonferroni correction method.

sgRNA and primer sequences. This table details the sgRNA and primer sequences used in the zebrafish experiments.

Visualization of the differentiation vector field and TF KO vector fields . The same 2D vector field used in Fig. 1c–f is visualized as a movie. The left panel shows the differentiation vector field of Paul et al. haematopoiesis data. The centre and right panels show CellOracle KO simulation vectors for Spi1 and Gata1 transcription factors, respectively. The particle visual effects were generated with Unity (version 2020.3.0f1) and Visual Effect Graph Asset: https://docs.unity3d.com/Packages/[email protected]/manual/VisualEffectGraphAsset.html

Acknowledgements

We thank members of the S.A.M. laboratory for discussions; J. Magee and T. Druley for critical feedback; A. Krezel for help with gRNA design; M. Ryan and T. Tsai for advice on HCR; and D. Klatt Shaw for advice on CRISPR–Cas9 gene disruption. This work was funded by the National Institute of General Medical Sciences grant R01 GM126112 and Silicon Valley Community Foundation, Chan Zuckerberg Initiative grants HCA2-A-1708-02799 and DAF2021-238797 (both to S.A.M.), and by grant R35 GM118179 to L.S.-K. S.A.M. is supported by an Allen Distinguished Investigator Award (through the Paul G. Allen Frontiers Group), a Vallee Scholar Award, a Sloan Research Fellowship and a New York Stem Cell Foundation Robertson Investigator Award; K.K. is supported by a Japan Society for the Promotion of Science Postdoctoral Fellowship; B.S. is supported by a postdoctoral fellowship from the Washington University in St Louis Center of Regenerative Medicine; and C.M.H. is supported by a National Science Foundation Graduate Research Fellowship (DGE-2139839 and DGE-1745038).

Extended data figures and tables

Author contributions.

K.K. and S.A.M. conceived the research. K.K. led the computational and experimental work, assisted by C.M.H. and K.J. and supervised by S.A.M. B.S. performed the zebrafish experimental validations, supervised by L.S.-K. All authors participated in the interpretation of data and writing.

Peer review

Peer review information.

Nature thanks Carl de Boer and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Data availability

Code availability, competing interests.

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

is available for this paper at 10.1038/s41586-022-05688-9.

The online version contains supplementary material available at 10.1038/s41586-022-05688-9.

Control and real-time experiments for a multi-agent aerial transportation system

  • Technical Paper
  • Published: 21 September 2024
  • Volume 46 , article number  616 , ( 2024 )

Cite this article

in silico perturbation experiments

  • Diego A. Mercado Ravell   ORCID: orcid.org/0000-0002-7416-3190 1 , 2 ,
  • Fatima Oliva-Palomo 2 ,
  • Guillaume Sanahuja 3 &
  • Pedro Castillo 3  

This work presents a multi-agent aerial transportation system (MAATS) with various aerial robots carrying a cable suspended load. First, the dynamical model of the coupled system is described using the Newton–Euler formalism, considering a mass connected to the aerial drones by rigid cables. The objective is to control the load’s position to perform trajectory tracking using n drones. Hence, a hierarchical controller is designed using Lyapunov-like energy functions, where the resultant desired cable tension is employed as a virtual control input for the load control which in turn is distributed among the agents to obtain a suitable desired pose of each agent that ensures the trajectory tracking of the load. Then, position and attitude control of each drone is carried out using the desired pose. The stability analysis of the closed-loop system is provided, demonstrating the stability of the coupled system. Finally, the performance of the MAATS control strategy is verified in software-in-the-loop simulations, as well as in real-time experiments, considering 2 and 3 drones, and showing good behavior in spite of significant external perturbations acting on the cables, the load or the agents.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save.

  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime

Price includes VAT (Russian Federation)

Instant access to the full article PDF.

Rent this article via DeepDyve

Institutional subscriptions

in silico perturbation experiments

Ollero A, Tognon M, Suarez A, Lee D, Franchi A (2022) Past, present, and future of aerial robotic manipulators. IEEE Trans Robot 38:626–645. https://doi.org/10.1109/TRO.2021.3084395

Article   Google Scholar  

Saunders J, Saeedi S, Li W (2023) Autonomous aerial robotics for package delivery: a technical review. J Field Robot. https://doi.org/10.1002/rob.22231

Guerrero-Sánchez ME, Mercado-Ravell DA, Lozano R, García CD (2017) Swing-attenuation for a quadrotor transporting a cable-suspended payload. ISA Trans 68:433–449. https://doi.org/10.1016/j.isatra.2017.01.027

Michael N, Fink J, Kumar V (2011) Cooperative manipulation and transportation with aerial robots. Autonom Robots 30(1):73–86. https://doi.org/10.1007/s10514-010-9205-0

Jackson BE, Howell TA, Shah K, Schwager M, Manchester Z (2020) Scalable cooperative transport of cable-suspended loads with uavs using distributed trajectory optimization. IEEE Robot Autom Lett 5(2):3368–3374. https://doi.org/10.1109/LRA.2020.2975956

Zeng J, Kotaru P, Mueller MW, Sreenath K (2020) Differential flatness based path planning with direct collocation on hybrid modes for a quadrotor with a cable-suspended payload. IEEE Robot Autom Lett 5(2):3074–3081. https://doi.org/10.1109/LRA.2020.2972845

Sreenath K, Kumar V (2013) Dynamics, control and planning for cooperative manipulation of payloads suspended by cables from multiple quadrotor robots. In: Proceedings of robotics: science and systems, Berlin, Germany. https://doi.org/10.15607/RSS.2013.IX.011

Mechali O, Xu L (2023) Distributed fixed-time sliding mode control of time-delayed quadrotors aircraft for cooperative aerial payload transportation: theory and practice. Adv Space Res 71(9):3897–3916. https://doi.org/10.1016/j.asr.2022.12.037

Arab F, Shirazi FA, Yazdi MRH (2021) Planning and distributed control for cooperative transportation of a non-uniform slung-load by multiple quadrotors. Aerosp Sci Technol 117:106917. https://doi.org/10.1016/j.ast.2021.106917

Wu G, Sreenath K (2014) Geometric control of multiple quadrotors transporting a rigid-body load. In: 53rd IEEE conference on decision and control, pp 6141–6148. https://doi.org/10.1109/CDC.2014.7040351

Lee T (2014) Geometric control of multiple quadrotor uavs transporting a cable-suspended rigid body. In: 53rd IEEE conference on decision and control, pp 6155–6160. https://doi.org/10.1109/CDC.2014.7040353

Lee T (2018) Geometric control of quadrotor uavs transporting a cable-suspended rigid body. IEEE Trans Control Syst Technol 26(1):255–264. https://doi.org/10.1109/TCST.2017.2656060

Goodarzi FA, Lee T (2015) Dynamics and control of quadrotor uavs transporting a rigid body connected via flexible cables. In: 2015 American control conference (ACC), pp 4677–4682. https://doi.org/10.1109/ACC.2015.7172066

Geng J, Langelaan JW (2020) Cooperative transport of a slung load using load-leading control. J Guidance Control Dyn 43(7):1313–1331. https://doi.org/10.2514/1.G004680

Geng J, Singla P, Langelaan JW (2022) Load-distribution-based trajectory planning and control for a multilift system. J Aerosp Inf Syst 19(5):366–381. https://doi.org/10.2514/1.I011067

Li G, Ge R, Loianno G (2021) Cooperative transportation of cable suspended payloads with mavs using monocular vision and inertial sensing. IEEE Robot Autom Lett 6(3):5316–5323. https://doi.org/10.1109/LRA.2021.3065286

Oliva-Palomo F, Mercado-Ravell D, Castillo P (2024) Aerial transportation control of suspended payloads with multiple agents. J Frankl Inst 361(7):106787. https://doi.org/10.1016/j.jfranklin.2024.106787

https://gitlab.utc.fr/uav-hds/flair . Accessed 13/12/2023

Kuipers JB (1999) Quaternions and rotation sequences: a primer with applications to orbits, aerospace, and virtual reality. Princeton University Press, Princeton

Book   Google Scholar  

Oliva F, Sanchez A, Castillo P, Alazki H (2018) Nonlinear ellipsoid based attitude control for aggressive trajectories in a quadrotor: Closed-loop multi-flips implementation. Control Eng Pract 77:150–161. https://doi.org/10.1016/j.conengprac.2018.05.009

Oliva-Palomo F, Sanchez A, Alazki H, Castillo P, Muñoz-Vázquez A (2021) Robust global observer position-yaw control based on ellipsoid method for quadrotors. Mech Syst Signal Process 158:107721. https://doi.org/10.1016/j.ymssp.2021.107721

Khalil HK (2002) Nonlinear systems. Pearson Education. Prentice Hall. https://books.google.com.mx/books?id=t_d1QgAACAAJ

Download references

This work was partially supported by the Mexican National Council of Humanities, Science and Technology (CONAHCYT).

Author information

Authors and affiliations.

Investigadores CONAHCYT, Mexico City, Mexico

Diego A. Mercado Ravell

Computer Science Department, Center for Research in Mathematics, CIMAT AC, Campus Zacatecas, Avenida Lasec, Manzana 3 Lote 7, 98160, Zacatecas, Mexico

Diego A. Mercado Ravell & Fatima Oliva-Palomo

Université de technologie de Compiègne, CNRS, Heudiasyc UMR 7253, Compiègne, France

Guillaume Sanahuja & Pedro Castillo

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Fatima Oliva-Palomo .

Ethics declarations

Conflict of interest.

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Technical Editor: Rogério Sales Gonçalves.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Mercado Ravell, D.A., Oliva-Palomo, F., Sanahuja, G. et al. Control and real-time experiments for a multi-agent aerial transportation system. J Braz. Soc. Mech. Sci. Eng. 46 , 616 (2024). https://doi.org/10.1007/s40430-024-05166-5

Download citation

Received : 31 January 2024

Accepted : 22 August 2024

Published : 21 September 2024

DOI : https://doi.org/10.1007/s40430-024-05166-5

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Aerial transportation
  • Multi-agent systems
  • Unmanned aerial vehicles
  • Nonlinear control
  • Underactuated systems
  • Find a journal
  • Publish with us
  • Track your research

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • View all journals
  • Explore content
  • About the journal
  • Publish with us
  • Sign up for alerts
  • Research Briefing
  • Published: 17 August 2023

Machine learning predicts cellular response to genetic perturbation

Nature Biotechnology volume  42 ,  pages 858–859 ( 2024 ) Cite this article

4624 Accesses

12 Altmetric

Metrics details

  • Gene regulatory networks
  • Genomic engineering

GEARS, a machine learning model informed by biological knowledge of gene–gene relationships, effectively predicts transcriptional responses to multi-gene perturbations. GEARS can predict the effects of perturbing previously unperturbed genes and detects non-additive interactions, such as synergy, when predicting combinatorial perturbation outcomes. Thus, GEARS expands insights gained from perturbational screens.

This is a preview of subscription content, access via your institution

Access options

Access Nature and 54 other Nature Portfolio journals

Get Nature+, our best-value online-access subscription

24,99 € / 30 days

cancel any time

Subscribe to this journal

Receive 12 print issues and online access

195,33 € per year

only 16,28 € per issue

Buy this article

  • Purchase on SpringerLink
  • Instant access to full article PDF

Prices may be subject to local taxes which are calculated during checkout

in silico perturbation experiments

Dixit, A. et al. Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 167 , 1853–1866 (2016). This paper describes the assay used to measure single-cell transcriptional responses to perturbation.

Article   CAS   PubMed   PubMed Central   Google Scholar  

Norman, T. M. et al. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science 365 , 786–793 (2019). These authors studied genetic interactions using a multi-gene perturbation screen.

Kamimoto, K. et al. Dissecting cell identity via network inference and in silico gene perturbation. Nature 614 , 742–751 (2023). This article presents an alternative in silico gene perturbation model.

Replogle, J. M. et al. Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq. Cell 185 , 2559–2575 (2022). This article presents a genome-wide perturbation screen.

Nelson, M. R. et al. The support of human genetic evidence for approved drug indications. Nat. Genet. 47 , 856–860 (2015). This article presents the importance of genetic information for drug efficacy.

Article   CAS   PubMed   Google Scholar  

Download references

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is a summary of: Roohani, Y., Huang, K. & Leskovec, J. Predicting transcriptional outcomes of novel multigene perturbations with GEARS. Nat. Biotechnol . https://doi.org/10.1038/s41587-023-01905-6 (2023)

Rights and permissions

Reprints and permissions

About this article

Cite this article.

Machine learning predicts cellular response to genetic perturbation. Nat Biotechnol 42 , 858–859 (2024). https://doi.org/10.1038/s41587-023-01907-4

Download citation

Published : 17 August 2023

Issue Date : June 2024

DOI : https://doi.org/10.1038/s41587-023-01907-4

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

in silico perturbation experiments

IMAGES

  1. In-silico perturbations: leveraging single cell seq to perform

    in silico perturbation experiments

  2. In silico perturbation experiments. Left: Experimental perturbation

    in silico perturbation experiments

  3. 2 × 2 design of the in silico dual perturbation experiments

    in silico perturbation experiments

  4. In silico perturbation experiments. Left: Experimental perturbation

    in silico perturbation experiments

  5. in silico perturbation

    in silico perturbation experiments

  6. In silico perturbation of Sema3A to compare with corresponding in vivo

    in silico perturbation experiments

VIDEO

  1. EXP3204 Perceptual Constancy Part 2

  2. Houdini particle perturbation dynamic solution file

  3. The Double Slit Experiment: A Deep Dive into Quantum Physics

  4. Lecture 27: Singular Perturbation for ODE

  5. The Mind-Boggling Double Slit Experiment That Proves We Live in a Simulation

  6. The in silico experiments on Antigen...

COMMENTS

  1. Dissecting cell identity via network inference and in silico gene

    In silico perturbation involves four steps. (1) Cell-type- or cell-state-specific GRN configurations are constructed using cluster-wise regularized linear regression models with multi-omics data ...

  2. Interpreting cis-regulatory interactions from large-scale deep neural

    Inspired by wet laboratory experiments, such as CRISPRi 24,25,72, that perturb genomic loci to uncover how CREs influence gene expression, we devised a suite of in silico perturbation experiments ...

  3. Generative modeling of single-cell time series with PRESCIENT ...

    Introducing and evaluating in silico perturbations. Perturbation experiments were performed similarly to the clonal fate bias experiments, except using perturbed cells as input to the first-order ...

  4. Predicting cellular responses to complex perturbations in high

    In silico prediction of cell behavior in response to a perturbation is critical for optimal experiment design and the identification of effective drugs and treatments. With CPA, we have introduced a versatile and interpretable approach to modeling cell behaviors at single‐cell resolution.

  5. Perturbation Biology: Inferring Signaling Networks in Cellular Systems

    Systematic perturbation experiments. (A) Perturbation experiments with systematic combinations of eight small molecule inhibitors, applied in pairs and as single agents in low (light green) and high (dark green) doses. ... Novel predictions from in silico perturbations. (A) The histogram of phenotypic response profiles to the four most ...

  6. Machine learning for perturbational single-cell omics

    This article reviews the recent advances and challenges of using machine learning, especially deep learning, to model and predict the effects of perturbations on single-cell data. It covers objectives, resources, datasets, and methods for perturbation response modeling, as well as future directions and applications.

  7. GRouNdGAN: GRN-guided simulation of single-cell RNA-seq data using

    We introduce GRouNdGAN, a gene regulatory network (GRN)-guided reference-based causal implicit generative model for simulating single-cell RNA-seq data, in silico perturbation experiments, and benchmarking GRN inference methods. Through the imposition of a user-defined GRN in its architecture, GRouN …

  8. Structural Equation Modeling of In silico Perturbations

    The T-score was employed to project molecular activities of a gene of interest from a model system experiment to human specimens where a perturbation was not directly applicable (Creighton et al., 2008; Creighton et al., 2009; Luo et al., 2009; Qin et al., 2014). In a model system, the subjects are randomly assigned into two groups, where one ...

  9. GRouNdGAN: GRN-guided simulation of single-cell RNA-seq data ...

    In silico perturbation experiments using GRoundGAN. To perform perturbation experiments using GRoundGAN, we put the trained model in a deterministic mode of operation. This is necessary to ensure ...

  10. An Efficient Method for Dynamic Analysis of Gene Regulatory Networks

    An Efficient Method for Dynamic Analysis of Gene Regulatory Networks and in silico Gene Perturbation Experiments. Conference paper; pp 62-76; ... and performed in silico knock out experiments: showing both a reduction in computation time and correct steady state identification. Download to read the full chapter text. Chapter PDF.

  11. CODEX: COunterfactual Deep learning for the in silico EXploration of

    Large-scale perturbation experiments in human cancer cell lines offer a powerful approach to connect genetic or chemical interventions with downstream effects. Such high-throughput screens ... Thus, in silico approaches become necessary to prioritize interventions for further experimental investigations. To address this task, we will build on ...

  12. In-silico perturbations: leveraging single cell seq to perform

    To perform in-silico perturbations they constructed GRN models for each of the 24 myeloid clusters representing megakaryocyte and erythroid progenitors (MEPs) and granulocyte-monocyte progenitors (GMPs) which are differentiating toward erythrocytes, megakaryocytes, monocytes, and granulocytes. ... As we have seen above, perturbation experiments ...

  13. In silico experiments of bone remodeling explore metabolic ...

    To quantitatively show the validity of the in silico model, in silico perturbation of a specific signaling molecule was conducted to compare with corresponding in vivo experiments. After quantitative validation, the in silico model was applied to predict the therapeutic effects of various drugs against osteoporosis.

  14. Using in silico models to simulate dual perturbation experiments

    A growing number of realistic in silico models of metabolic functions are being formulated and can serve as 'dry lab' platforms to prototype and simulate experiments before they are performed. For example, dual perturbation experiments that vary both genetic and environmental parameters can readily be simulated in silico. Genetic and environmental perturbations were applied to a cell-scale ...

  15. Using in silico models to simulate dual perturbation experiments

    Conclusion: Dual perturbation experiments in silico are much more informative for the characterization of functional states than single perturbations. Predictions from an experimentally validated cellular model of metabolism indicate that the measurement of cofactor precursor transport rates can inform the internal state of the cell when the ...

  16. PDF CODEX: COunterfactual Deep learning for the in-silico ...

    CODEX: COunterfactual Deep learning for the in-silico EXploration of cancer cell line perturbations Stefan Schrod∗ 1, Tim Beißbarth , Helena U. Zacharias2, Anne-Christin Hauschild3, and Michael Altenbuchinger1 1Department of Medical Bioinformatics, University Medical Center Göttingen, Germany. 2Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover

  17. Knowledge of the perturbation design is essential for accurate gene

    The perturbation information is stored in a binary N-by-M matrix, where N refers to genes and M experiments, assigning − 1 (for knockdown) to all perturbations and 0 to all other cells.

  18. Predicting cellular responses to complex perturbations in high

    We demonstrate this by imputing in silico 5,329 missing combinations (97.6% of all possibilities) in a single‐cell Perturb‐seq experiment with diverse genetic interactions. We envision CPA will facilitate efficient experimental design and hypothesis generation by enabling in silico response prediction at the single‐cell level and thus ...

  19. Dissecting cell identity via network inference and in silico gene

    In silico perturbation involves four steps. (1) Cell-type- or cell-state-specific GRN configurations are constructed using cluster-wise regularized linear regression models with multi-omics data. (2) Using these GRN models, shifts in target gene expression in response to TF perturbation are calculated.

  20. Cell-type-specific prediction of 3D chromatin organization ...

    The accuracy of C.Origami enables in silico genetic perturbation experiments that assess the impact of cis-elements on chromatin interactions, and, moreover, allows systematic identification of ...

  21. In-Silico Analysis and Genomic Tracking of CaDRRG Gene ...

    In-silico analysis of promoter sequences using bioinformatics tools likes PlantCare (Lescot et al. 2002), PlantProm (Shahmuradov et al. 2003) is an extensive approach to find of known Cis-acting elements as structurally and enhancing conserved DNA motifs in the modular structure of eukaryotic promoters (Kaur et al. 2017).

  22. Control and real-time experiments for a multi-agent aerial

    In order to verify the robustness of the MAATS under the proposed control scheme, strong perturbations were applied along the experiment, with the help of a stick, as appreciated in Fig. 4. Notice from Fig. 6 , that the important errors on the graph are produced by the external perturbations, and observe also the good reaction of the control ...

  23. Learning representations of chromatin contacts using a ...

    They furthermore enable in-silico perturbation experiments that measure the influence of cis-regulatory elements on conformation. Despite the availability of chromatin conformation capture ...

  24. Machine learning predicts cellular response to genetic perturbation

    Kamimoto, K. et al. Dissecting cell identity via network inference and in silico gene perturbation. Nature 614, 742-751 (2023). This article presents an alternative in silico gene perturbation ...