This guide provides a quick introduction to using mLLMCelltype for cell type annotation in single-cell RNA sequencing data. We'll cover the basic workflow, input data requirements, and a simple example to get you started.
The mLLMCelltype workflow consists of these main steps:
First, load the mLLMCelltype package:
library(mLLMCelltype)
Before using mLLMCelltype, you need to set up API keys for the LLM providers you plan to use:
# Set API keys as environment variables
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-api-key")    # For Claude models
Sys.setenv(OPENAI_API_KEY = "your-openai-api-key")          # For GPT models
Sys.setenv(GEMINI_API_KEY = "your-gemini-api-key")          # For Gemini models
Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key")  # For OpenRouter models
You can obtain API keys from:
- Anthropic: https://console.anthropic.com/
- OpenAI: https://platform.openai.com/
- Google (Gemini): https://ai.google.dev/
- OpenRouter: https://openrouter.ai/keys
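Before making any calls, it can help to confirm which keys are actually visible to the current R session. The following is a minimal sketch using only base R; the helper name `check_api_keys` is hypothetical and not part of mLLMCelltype.

```r
# Hypothetical helper (not part of mLLMCelltype): report which of the
# environment variables used in this guide hold a non-empty value.
check_api_keys <- function(vars = c("ANTHROPIC_API_KEY", "OPENAI_API_KEY",
                                    "GEMINI_API_KEY", "OPENROUTER_API_KEY")) {
  # Sys.getenv() returns "" for unset variables; nzchar() flags non-empty ones
  is_set <- nzchar(Sys.getenv(vars, unset = "", names = FALSE))
  names(is_set) <- vars
  is_set
}

key_status <- check_api_keys()
print(key_status)
```

Only the providers you plan to use need to show `TRUE`.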
Alternatively, you can provide API keys directly in function calls:
results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "claude-sonnet-4-6",
  api_key = "your-anthropic-api-key",  # Direct API key
  top_gene_count = 10
)
mLLMCelltype accepts marker gene data in several formats:
A data frame with the following columns:
- cluster: Cluster ID (preserved as-is from your data)
- gene: Gene name/symbol
- avg_log2FC or similar metric: Log fold change
- p_val_adj or similar metric: Adjusted p-value
Example:
# Example marker data frame
markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)
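A quick structural check before annotation can catch a missing column early. The sketch below is an illustration only: `validate_markers` is a hypothetical helper, and it assumes that `cluster` and `gene` are the strictly required columns, since the fold-change and p-value columns are described above as "or similar metric".

```r
# Hypothetical helper (not part of mLLMCelltype): check that a marker data
# frame has the expected columns before passing it on.
# Assumption: only `cluster` and `gene` are treated as strictly required.
validate_markers <- function(df, required = c("cluster", "gene")) {
  stopifnot(is.data.frame(df))
  missing_cols <- setdiff(required, colnames(df))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (anyNA(df$gene) || any(!nzchar(as.character(df$gene)))) {
    stop("Column 'gene' contains missing or empty values")
  }
  invisible(TRUE)
}

markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "CD14", "LYZ", "CST3"),
  avg_log2FC = c(2.5, 2.3, 2.1, 3.1, 2.8, 2.5),
  p_val_adj = c(0.001, 0.001, 0.002, 0.0001, 0.0002, 0.0005)
)
validate_markers(markers_df)
```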
You can directly use the output from Seurat's FindAllMarkers() function:
# Assuming you have a Seurat object named 'seurat_obj'
library(Seurat)
all_markers <- FindAllMarkers(
  seurat_obj,
  only.pos = TRUE,
  min.pct = 0.25,
  logfc.threshold = 0.25
)
A path to a CSV file containing marker gene data:
# Path to your CSV file
markers_file <- "path/to/markers.csv"
A list where each element contains marker genes for a cluster:
# Example marker list
markers_list <- list(
  "0" = c("CD3D", "CD3E", "CD2", "IL7R", "LTB"),
  "1" = c("CD14", "LYZ", "CST3", "MS4A7", "FCGR3A")
)
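The data frame and list formats carry the same information, so one can be derived from the other. As a rough illustration, the sketch below ranks genes by fold change within each cluster and keeps the top few as a named list. The helper `top_genes_by_cluster` is hypothetical, and the package's own gene selection may differ.

```r
# Hypothetical helper (not part of mLLMCelltype): keep the top n genes per
# cluster, ranked by fold change, as a named list keyed by cluster ID.
top_genes_by_cluster <- function(df, n = 10, fc_col = "avg_log2FC") {
  # Sort rows by descending fold change, then split gene names by cluster
  ordered <- df[order(-df[[fc_col]]), ]
  genes_by_cluster <- split(as.character(ordered$gene), ordered$cluster)
  lapply(genes_by_cluster, head, n = n)
}

markers_df <- data.frame(
  cluster = c(0, 0, 0, 1, 1, 1),
  gene = c("CD3E", "CD3D", "CD2", "CST3", "CD14", "LYZ"),
  avg_log2FC = c(2.3, 2.5, 2.1, 2.5, 3.1, 2.8)
)

top2 <- top_genes_by_cluster(markers_df, n = 2)
print(top2)
# Cluster "0" keeps CD3D and CD3E; cluster "1" keeps CD14 and LYZ
```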
The annotate_cell_types function has the following parameters:
| Parameter | Description | Default Value |
|-----------|-------------|---------------|
| input | Marker gene data (data frame, list, or file path) | (required) |
| tissue_name | Tissue name (e.g., "human PBMC", "mouse brain") | NULL |
| model | LLM model to use | "gpt-5.5" |
| api_key | API key (if not set in environment) | NA |
| top_gene_count | Number of top genes per cluster to use | 10 |
| debug | Whether to print debugging information | FALSE |
Note: If api_key is set to NA, the function will return the generated prompt without making an API call, which is useful for reviewing the prompt before sending it to the API.
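A prompt preview along these lines might look like the following sketch. It relies on the behavior described in the note (no API call when `api_key` is `NA`), leaves `model` at its default, and is guarded so that it only runs when the package is installed. The exact shape of the returned prompt is not specified here, so it is simply printed.

```r
# Minimal marker list in the format described earlier
markers_list <- list(
  "0" = c("CD3D", "CD3E", "CD2"),
  "1" = c("CD14", "LYZ", "CST3")
)

# Only attempt the call when mLLMCelltype is installed
if (requireNamespace("mLLMCelltype", quietly = TRUE)) {
  # api_key = NA: per the note above, the generated prompt is returned
  # and no request is sent to any provider
  prompt_text <- mLLMCelltype::annotate_cell_types(
    input = markers_list,
    tissue_name = "human PBMC",
    api_key = NA,
    top_gene_count = 3
  )
  print(prompt_text)
}
```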
Here's a simple example using a single LLM model for annotation:
# Example marker data
markers <- data.frame(
  cluster = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  gene = c("CD3D", "CD3E", "CD2", "IL7R", "LTB",
           "CD14", "LYZ", "CST3", "MS4A7", "FCGR3A"),
  avg_log2FC = c(2.5, 2.3, 2.1, 1.8, 1.7, 3.1, 2.8, 2.5, 2.2, 2.0),
  p_val_adj = c(0.001, 0.001, 0.002, 0.003, 0.005,
                0.0001, 0.0002, 0.0005, 0.001, 0.002)
)

# Run annotation with a single model
results <- annotate_cell_types(
  input = markers,
  tissue_name = "human PBMC",
  model = "claude-sonnet-4-6",
  api_key = Sys.getenv("ANTHROPIC_API_KEY"),
  top_gene_count = 10,
  debug = FALSE  # Set to TRUE for more detailed output
)

# Print results
print(results)
When using a single model like Claude, the output will be a character vector with one annotation per cluster:
> print(results)
[1] "0: T cells"   "1: Monocytes"
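For downstream use it is often convenient to split such strings into separate cluster and label columns. A minimal base R sketch, assuming every element follows the "cluster: label" pattern shown in the printed output above:

```r
# Example output in the "cluster: label" format shown above
results <- c("0: T cells", "1: Monocytes")

# Split each entry at the first colon into a cluster ID and a label
cluster_ids <- trimws(sub(":.*$", "", results))
cell_types <- trimws(sub("^[^:]*:", "", results))

annotation_table <- data.frame(
  cluster = cluster_ids,
  cell_type = cell_types,
  stringsAsFactors = FALSE
)
print(annotation_table)
```

If a model returns labels in another layout, the two patterns would need adjusting.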
For more reliable annotations, you can use multiple models and create a consensus:
# Define models to use
models <- c(
  "claude-sonnet-4-6",       # Anthropic
  "gpt-5.5",                 # OpenAI
  "gemini-3.1-pro-preview"   # Google
)

# API keys for different providers
api_keys <- list(
  anthropic = Sys.getenv("ANTHROPIC_API_KEY"),
  openai = Sys.getenv("OPENAI_API_KEY"),
  gemini = Sys.getenv("GEMINI_API_KEY")
)

# Run annotation with multiple models
results <- list()
for (model in models) {
  provider <- get_provider(model)
  api_key <- api_keys[[provider]]

  results[[model]] <- annotate_cell_types(
    input = markers,
    tissue_name = "human PBMC",
    model = model,
    api_key = api_key,
    top_gene_count = 10
  )
}

# Create consensus
consensus_results <- interactive_consensus_annotation(
  input = markers,
  tissue_name = "human PBMC",
  models = models,  # Use all the models defined above
  api_keys = api_keys,
  controversy_threshold = 0.7,
  entropy_threshold = 1.0,
  consensus_check_model = "claude-sonnet-4-6"
)
The function automatically prints a summary upon completion:
> Consensus Summary:
-----------------
Total clusters: 2
Controversial clusters: 0
Consensus achieved for all clusters

Cluster 0:
  Final annotation: T cells
  Consensus proportion: 1.0
  Entropy: 0.0
  Model predictions:
    - claude-sonnet-4-6: T cells
    - gpt-5.5: T cells
    - gemini-3.1-pro-preview: T cells

Cluster 1:
  Final annotation: Monocytes
  Consensus proportion: 1.0
  Entropy: 0.0
  Model predictions:
    - claude-sonnet-4-6: Monocytes
    - gpt-5.5: Monocytes
    - gemini-3.1-pro-preview: Monocytes
To add the annotations to your Seurat object:
# Assuming you have a Seurat object named 'seurat_obj' and consensus results
library(Seurat)

# Add consensus annotations to Seurat object
seurat_obj$cell_type_consensus <- plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = names(consensus_results$final_annotations),
  to = consensus_results$final_annotations
)

# Extract consensus metrics from the consensus results
# Note: These metrics are available in consensus_results$initial_results$consensus_results
consensus_metrics <- lapply(
  names(consensus_results$initial_results$consensus_results),
  function(cluster_id) {
    metrics <- consensus_results$initial_results$consensus_results[[cluster_id]]
    return(list(
      cluster = cluster_id,
      consensus_proportion = metrics$consensus_proportion,
      entropy = metrics$entropy
    ))
  }
)

# Convert to data frame for easier handling
metrics_df <- do.call(rbind, lapply(consensus_metrics, data.frame))

# Add consensus proportion to Seurat object
seurat_obj$consensus_proportion <- plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = metrics_df$cluster,
  to = metrics_df$consensus_proportion
)

# Add entropy to Seurat object
seurat_obj$entropy <- plyr::mapvalues(
  x = as.character(Idents(seurat_obj)),
  from = metrics_df$cluster,
  to = metrics_df$entropy
)
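The same per-cell mapping can be sketched without Seurat or plyr by indexing a named vector with the cluster IDs. The values below are stand-ins for `Idents(seurat_obj)` and the consensus results, and the sketch assumes the annotations can be flattened to a named character vector with `unlist()`.

```r
# Stand-in for as.character(Idents(seurat_obj)): one cluster ID per cell
cell_clusters <- c("0", "1", "1", "0", "1")

# Stand-ins for the consensus results (illustrative values only)
final_annotations <- unlist(list("0" = "T cells", "1" = "Monocytes"))
consensus_proportion <- c("0" = 1.0, "1" = 0.67)

# Index the named vectors by cluster ID to get one value per cell
cell_type_per_cell <- unname(final_annotations[cell_clusters])
proportion_per_cell <- unname(consensus_proportion[cell_clusters])

print(cell_type_per_cell)
print(proportion_per_cell)
```

With this lookup, numeric metrics remain numeric, which is convenient for plotting or thresholding later.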
Here's a simple visualization of the results using Seurat:
library(ggplot2)  # Provides ggtitle(), theme(), and element_text()

# Plot UMAP with cell type annotations
DimPlot(seurat_obj, group.by = "cell_type_consensus", label = TRUE, repel = TRUE) +
  ggtitle("Cell Type Annotations") +
  theme(plot.title = element_text(hjust = 0.5))
The output of annotate_cell_types() is a vector of cell type annotations, where each element corresponds to a cluster.
The output of interactive_consensus_annotation() is a list containing:
- final_annotations: Final consensus cell type annotations
- initial_results: Initial predictions from each model
- controversial_clusters: List of clusters that required discussion
- discussion_logs: Detailed logs of the discussion process
- session_id: Unique identifier for the annotation session

When using consensus annotation, two key metrics help evaluate the reliability of annotations: consensus proportion and entropy.
Clusters with low consensus proportion or high entropy may require manual review.
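To build intuition for these two numbers, the sketch below computes them for a single cluster from a vector of model predictions. The definitions used here are assumptions for illustration (share of models giving the most common label, and base-2 Shannon entropy of the label distribution). The package's own calculation may differ, for example in how labels are normalized before counting. The review rule reuses the threshold values from the consensus example, and its exact semantics are likewise assumed.

```r
# Illustrative definitions (assumptions, not the package's source code):
#   consensus proportion = share of models that gave the most common label
#   entropy              = Shannon entropy (base 2) of the label distribution
consensus_metrics_sketch <- function(predictions) {
  p <- as.numeric(table(predictions)) / length(predictions)
  p <- p[p > 0]  # guard against unused factor levels
  list(
    consensus_proportion = max(p),
    entropy = -sum(p * log2(p))
  )
}

unanimous <- consensus_metrics_sketch(c("T cells", "T cells", "T cells"))
split_vote <- consensus_metrics_sketch(c("T cells", "T cells", "NK cells"))

# Assumed review rule: flag when agreement is below 0.7 or entropy exceeds 1.0
needs_review <- split_vote$consensus_proportion < 0.7 || split_vote$entropy > 1.0

print(unanimous)    # proportion 1, entropy 0
print(split_vote)   # proportion about 0.667, entropy about 0.918
print(needs_review) # TRUE, because 0.667 < 0.7
```

The unanimous case reproduces the values in the summary above (consensus proportion 1.0, entropy 0.0).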
If you don't have access to paid API keys, you can use OpenRouter's free models:
# Set OpenRouter API key Sys.setenv(OPENROUTER_API_KEY = "your-openrouter-api-key") # Use a free model free_results <- annotate_cell_types( input = markers, tissue_name = "human PBMC", model = "meta-llama/llama-4-maverick:free", # Note the :free suffix api_key = Sys.getenv("OPENROUTER_API_KEY"), top_gene_count = 10 ) # Print results print(free_results)
Available free models (Updated Oct 2025):
- meta-llama/llama-4-maverick:free - Meta Llama 4 Maverick (256K context, best performance)
- deepseek/deepseek-v4-pro:free - DeepSeek V4 Pro
- meta-llama/llama-3.3-70b-instruct:free - Meta Llama 3.3 70B (reliable)
- venice/uncensored:free - Venice Uncensored (new model)
- z-ai/glm-4.5-air:free - GLM 4.5 Air (lightweight)

Important: OpenRouter reduced free tier limits in 2025:
- Free accounts: 50 requests/day (down from 200), 20 requests/minute
- Accounts with $10+ credits: 1000 requests/day
- Some models removed: NVIDIA Nemotron and others have exited the free tier
- For production use: Consider using paid models for better reliability
Error: No auth credentials found
Solution: Ensure you've set the correct API key environment variable or provided it directly in the function call.
Error: Rate limit exceeded
Solution: Wait a few minutes before trying again, or reduce the number of API calls by processing fewer clusters at once.
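Waiting and retrying can be automated. The sketch below shows one common pattern, retrying with exponentially growing pauses. `with_retry` is a hypothetical helper, not part of mLLMCelltype, and it retries on any error rather than only on rate-limit errors. The demonstration uses a stand-in function instead of a real API call.

```r
# Hypothetical helper (not part of mLLMCelltype): call fn() and, on error,
# retry with a wait that doubles after each failed attempt.
with_retry <- function(fn, max_attempts = 4, base_wait = 2) {
  for (attempt in seq_len(max_attempts)) {
    result <- tryCatch(fn(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt == max_attempts) {
      stop(result)  # give up and re-signal the last error
    }
    wait <- base_wait * 2^(attempt - 1)
    message("Attempt ", attempt, " failed: ", conditionMessage(result),
            "; retrying in ", wait, " s")
    Sys.sleep(wait)
  }
}

# Demonstration with a stand-in function that fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Rate limit exceeded")
  "ok"
}
outcome <- with_retry(flaky, base_wait = 0.01)
print(outcome)
```

In practice the wrapped function would be something like `function() annotate_cell_types(...)`, with a `base_wait` of several seconds or more.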
Error: Unsupported model: [model_name]
Solution: Check that you're using a supported model name and that it's spelled correctly.
Error: Could not connect to API
Solution: Check your internet connection and try again. If the problem persists, the API service might be down.
Now that you understand the basics of mLLMCelltype, you can explore:
If you encounter any issues, check the FAQ or open an issue on our GitHub repository.