fennomix_mhc.mhc_binding_retriever module
Classes:
| MHCBindingRetriever | A retriever class to compute peptide-MHC binding metrics including distance, rank, and FDR. |
Functions:
| get_binding_fdr_for_best_allele | Compute FDRs for the best-binding allele per peptide. |
| get_binding_fdrs | Estimate FDRs for peptide-MHC binding distances using either TDA or FMM. |
| get_binding_ranks | Compute binding ranks of peptide-allele distances relative to sorted decoy distances. |
| get_fdrs | Calculate FDRs using the target-decoy approach. |
| get_q_values | Convert FDR estimates into q-values via monotonic minimization. |
- class fennomix_mhc.mhc_binding_retriever.MHCBindingRetriever(hla_encoder, pept_encoder, hla_df, hla_embeds, protein_data, min_peptide_len=8, max_peptide_len=14, device='cuda')
Bases: object
A retriever class to compute peptide-MHC binding metrics including distance, rank, and FDR.
This class wraps trained encoders for peptides and HLAs, enabling fast retrieval of binding predictions through embedding space distance. It supports both single-peptide queries and genome-wide screening against self-proteins.
- hla_encoder
Trained neural network model for encoding HLA sequences.
- pept_encoder
Trained neural network model for encoding peptide sequences.
- device
Computation device (e.g., ‘cuda’ or ‘cpu’).
- Type:
torch.device
- dataset
Dataset handler containing protein digestion and HLA info.
- Type:
- hla_embeds
Precomputed HLA embeddings. Shape: (n_alleles, d_model)
- Type:
np.ndarray
- n_decoy_samples
Number of random decoy peptides to generate for FDR estimation.
- Type:
int
- outlier_threshold
Fraction of strongest decoy binders to exclude.
- Type:
float
- use_fmm_fdr
Whether to use a finite mixture model for FDR calculation.
- Type:
bool
- decoy_rnd_seed
Seed for reproducible decoy generation.
- Type:
int
- d_model
Embedding dimension size.
- Type:
int
- verbose
Enable progress bars and logging.
- Type:
bool
Methods:
| __init__(hla_encoder, pept_encoder, hla_df, ...) | Initialize the MHCBindingRetriever. |
| get_binding_distances(prot_embeds, peptide_list) | Embed peptides and compute their distances to given MHC allele embeddings. |
| get_binding_metrics_for_embeds(prot_embeds, ...) | Compute binding metrics for a list of peptides given their sequences or embeddings. |
| get_binding_metrics_for_peptides(alleles, ...) | Score a list of peptides against specified HLA alleles. |
| get_binding_metrics_for_self_proteins(alleles, ...) | Screen internal proteome for potential self-reactive binders. |
| get_embedding_distances(prot_embeds, pept_embeds) | Compute pairwise Euclidean distances between protein and peptide embeddings. |
- __init__(hla_encoder, pept_encoder, hla_df, hla_embeds, protein_data, min_peptide_len=8, max_peptide_len=14, device='cuda')
Initialize the MHCBindingRetriever.
- Parameters:
hla_encoder – Model to encode HLA alleles into fixed-length vectors.
pept_encoder – Model to encode peptides into fixed-length vectors.
hla_df (DataFrame) – DataFrame containing HLA allele metadata (e.g., names, sequences).
hla_embeds (ndarray) – Precomputed embeddings for all HLA alleles. Shape: (n_alleles, d_model)
protein_data – Protein sequences used for generating decoy/non-self peptides.
min_peptide_len (int) – Minimum length for digested peptides (default: 8).
max_peptide_len (int) – Maximum length for digested peptides (default: 14).
device (str) – Torch device identifier (‘cuda’, ‘cpu’, etc.). Auto-detected if needed.
- Raises:
ValueError – If hla_embeds has incorrect dimensions or incompatible encoder types.
- get_binding_distances(prot_embeds, peptide_list, cdist_batch_size=1000000, embed_batch_size=1024)
Embed peptides and compute their distances to given MHC allele embeddings.
- Parameters:
prot_embeds (ndarray) – Precomputed MHC embeddings. Shape: (n_alleles, d_model)
peptide_list – List or array of peptide sequences (strings).
cdist_batch_size (int) – Batch size for distance computation.
embed_batch_size (int) – Batch size for peptide embedding.
- Returns:
Distance from each peptide to each allele. Shape: (n_peptides, n_alleles)
- Return type:
dist_matrix
- get_binding_metrics_for_embeds(prot_embeds, peptide_list, keep_not_best_alleles=False)
Compute binding metrics for a list of peptides given their sequences or embeddings.
- Parameters:
prot_embeds (ndarray) – Allele embeddings. Shape: (n_alleles, d_model)
peptide_list – Either a list of peptide sequences or a numpy array of embeddings.
keep_not_best_alleles (bool) – If True, include full distance matrix in output.
- Returns:
- DataFrame with columns:
sequence (if input was sequences)
best_allele_id: Index of best-matching allele
best_allele_dist: Minimum distance
best_allele_rank: Percentile rank among decoys (0–100)
- Return type:
df
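The best-allele columns listed above can be pictured as a row-wise minimum over the peptide-by-allele distance matrix. The following is a minimal NumPy sketch of that selection step only, not the library's implementation; the helper name `best_allele_per_peptide` is hypothetical, and the rank and FDR columns are not computed here.

```python
import numpy as np

def best_allele_per_peptide(dist_matrix):
    """Hypothetical helper: pick the closest allele for each peptide.

    dist_matrix has shape (n_peptides, n_alleles); smaller means closer.
    """
    # Column index of the smallest distance in each row.
    best_allele_id = dist_matrix.argmin(axis=1)
    # Gather the corresponding minimum distance for each peptide.
    rows = np.arange(dist_matrix.shape[0])
    best_allele_dist = dist_matrix[rows, best_allele_id]
    return best_allele_id, best_allele_dist

dist_matrix = np.array([[0.9, 0.2, 0.5],
                        [0.1, 0.4, 0.3]])
ids, best = best_allele_per_peptide(dist_matrix)
print(ids, best)
```

The returned indices refer to columns of the distance matrix, which is why get_binding_metrics_for_peptides adds a separate column mapping each ID back to an allele name.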
- get_binding_metrics_for_peptides(alleles, peptide_list, keep_not_best_alleles=False)
Score a list of peptides against specified HLA alleles.
- Parameters:
alleles – Names of HLA alleles to evaluate.
peptide_list – List of peptide sequences.
keep_not_best_alleles (bool) – Whether to retain scores for all alleles.
- Returns:
Binding metrics with added best_allele column mapping ID to name.
- Return type:
df
- get_binding_metrics_for_self_proteins(alleles, dist_threshold=0, fdr=0.02, cdist_batch_size=1000000, embed_batch_size=1024, get_sequence=True)
Screen internal proteome for potential self-reactive binders.
- Parameters:
alleles – List of HLA allele names to consider.
dist_threshold (float) – Maximum allowed embedding distance.
fdr (float) – Maximum allowed false discovery rate.
cdist_batch_size (int) – Batch size for distance computation.
embed_batch_size (int) – Batch size for embedding peptides.
get_sequence (bool) – If True, return actual sequences; else return indices.
- Returns:
DataFrame of qualifying peptides with binding metrics and optionally sequences.
- Return type:
df
- get_embedding_distances(prot_embeds, pept_embeds, batch_size=1000000)
Compute pairwise Euclidean distances between protein and peptide embeddings.
Uses torch.cdist for efficient batched computation on GPU.
- Parameters:
prot_embeds (ndarray) – Embeddings for MHC alleles. Shape: (n_alleles, d_model)
pept_embeds (ndarray) – Embeddings for peptides. Shape: (n_peptides, d_model)
batch_size – Number of peptides processed per batch to avoid memory overflow.
- Returns:
Pairwise distance matrix. Shape: (n_peptides, n_alleles)
- Return type:
dist_matrix
Example
>>> prot_emb = np.random.rand(6, 480).astype(np.float32)
>>> pept_emb = np.random.rand(100, 480).astype(np.float32)
>>> dists = retriever.get_embedding_distances(prot_emb, pept_emb)
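To illustrate what batching over peptides means for a pairwise Euclidean distance matrix, here is a minimal sketch in plain NumPy. It is a stand-in for the torch.cdist call the method is documented to use, not the method itself; the function name `batched_euclidean_distances` is hypothetical.

```python
import numpy as np

def batched_euclidean_distances(prot_embeds, pept_embeds, batch_size=1_000_000):
    """Hypothetical NumPy stand-in for a batched cdist.

    Returns an (n_peptides, n_alleles) matrix of Euclidean distances.
    """
    n_peptides = pept_embeds.shape[0]
    out = np.empty((n_peptides, prot_embeds.shape[0]), dtype=np.float32)
    for start in range(0, n_peptides, batch_size):
        chunk = pept_embeds[start:start + batch_size]
        # Broadcast to (chunk_size, n_alleles, d_model), then reduce over d_model.
        diff = chunk[:, None, :] - prot_embeds[None, :, :]
        out[start:start + batch_size] = np.sqrt((diff ** 2).sum(axis=-1))
    return out

rng = np.random.default_rng(0)
prot_emb = rng.random((6, 480), dtype=np.float32)
pept_emb = rng.random((100, 480), dtype=np.float32)
dists = batched_euclidean_distances(prot_emb, pept_emb, batch_size=32)
print(dists.shape)  # (100, 6)
```

Only one chunk of peptides is expanded against the alleles at a time, so peak memory scales with the batch size rather than with the total number of peptides.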
- fennomix_mhc.mhc_binding_retriever.get_binding_fdr_for_best_allele(distances, rnd_dist, outlier_threshold=0.01, fmm_fdr=False)
Compute FDRs for the best-binding allele per peptide.
For each peptide, finds the allele with the minimum distance and calculates the FDR for that allele from its own decoy distribution.
- Parameters:
distances (ndarray) – Distance matrix between peptides and alleles. Shape: (n_peptides, n_alleles)
rnd_dist (ndarray) – Sorted decoy distance matrix (precomputed), one column per allele. Should be pre-sorted along axis 0. Shape: (n_decoys, n_alleles)
outlier_threshold (float) – Fraction of the closest (smallest-distance) decoys to treat as true binders and exclude from FDR estimation.
fmm_fdr (bool) – Whether to use FMM-based FDR instead of basic TDA.
- Returns:
- FDR value for the best allele of each peptide.
Shape: (n_peptides,)
- Return type:
best_allele_fdrs
Example
>>> dists = np.random.rand(100, 6).astype(np.float32)
>>> decoy_dists = np.sort(np.random.rand(1000, 6), axis=0)
>>> fdrs = get_binding_fdr_for_best_allele(dists, decoy_dists)
- fennomix_mhc.mhc_binding_retriever.get_binding_fdrs(distances_1D, decoys_1D, max_fitting_samples=200000, random_state=1337, outlier_threshold=0.01, fmm_fdr=False)
Estimate FDRs for peptide-MHC binding distances using either TDA or FMM.
- Supports two modes:
Target-Decoy Analysis (TDA): Simple empirical FDR.
Finite Mixture Model (FMM): Probabilistic modeling of binders vs. non-binders.
- Parameters:
distances_1D (ndarray) – Observed distances for real peptides. Shape: (n_peptides,)
decoys_1D (ndarray) – Distances for randomly generated decoy peptides. Shape: (n_decoys,)
max_fitting_samples (int) – Maximum number of samples to use in FMM fitting if dataset is large.
random_state (int) – Random seed for reproducibility during subsampling.
outlier_threshold (float) – Fraction of smallest decoy distances to ignore as strong binders.
fmm_fdr (bool) – If True, use FMM-based FDR estimation; otherwise use standard TDA.
- Returns:
- Estimated FDR for each peptide in distances_1D.
Shape: (n_peptides,)
- Return type:
fdrs
- Raises:
ValueError – If decoys_1D is empty or invalid.
- fennomix_mhc.mhc_binding_retriever.get_binding_ranks(distances, sorted_rnd_dist)
Compute binding ranks of peptide-allele distances relative to sorted decoy distances.
Each distance is compared against the decoy distances of the same allele.
- Parameters:
distances (ndarray) – Distance matrix between peptides and alleles. Shape: (n_peptides, n_alleles)
sorted_rnd_dist – Sorted decoy distance matrix (precomputed), one column per allele. Should be pre-sorted along axis 0. Shape: (n_decoys, n_alleles)
- Returns:
- Rank of each peptide-allele distance among the decoy distances for that allele.
Example
>>> dists = np.random.rand(100, 6).astype(np.float32)
>>> decoy_dists = np.sort(np.random.rand(1000, 6), axis=0)
>>> ranks = get_binding_ranks(dists, decoy_dists)
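A percentile rank among decoys, like the best_allele_rank column (0–100) described for the retriever class, can be obtained by locating each distance within the pre-sorted decoy column of the same allele. The sketch below assumes that interpretation; the function name `percentile_ranks` and the choice to count strictly smaller decoys are assumptions, not the documented behavior of get_binding_ranks.

```python
import numpy as np

def percentile_ranks(distances, sorted_rnd_dist):
    """Hypothetical sketch: percent of decoys with a smaller distance, per allele.

    distances:       (n_peptides, n_alleles)
    sorted_rnd_dist: (n_decoys, n_alleles), each column sorted ascending
    """
    n_decoys = sorted_rnd_dist.shape[0]
    ranks = np.empty(distances.shape, dtype=np.float64)
    for j in range(distances.shape[1]):
        # Binary search: number of decoys strictly closer than each peptide.
        n_closer = np.searchsorted(sorted_rnd_dist[:, j], distances[:, j], side="left")
        ranks[:, j] = 100.0 * n_closer / n_decoys
    return ranks

decoys = np.sort(np.array([[0.4, 1.0],
                           [0.1, 2.0],
                           [0.3, 3.0],
                           [0.2, 4.0]]), axis=0)
distances = np.array([[0.25, 0.5],
                      [0.05, 4.5]])
print(percentile_ranks(distances, decoys))
```

Because the decoy columns are sorted once up front, each lookup is a binary search, which is why the decoy matrix is expected to be pre-sorted along axis 0.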
- fennomix_mhc.mhc_binding_retriever.get_fdrs(dists, rnd_dists, alpha, remove_rnd_top_rank=0.01)
Calculate FDRs using the target-decoy approach.
This function computes false discovery rates (FDRs) for target distances by comparing them against decoy distances, using a simple counting method based on rank comparison.
- Parameters:
dists (ndarray) – 1D array of target distances between peptide and MHC embeddings. Shape: (n_targets,)
rnd_dists (ndarray) – 1D array of decoy distances used for FDR estimation. Shape: (n_decoys,)
alpha (float) – Ratio of number of targets to decoys, i.e., len(dists) / len(rnd_dists).
remove_rnd_top_rank (float) – Fraction of the top-ranked (smallest-distance) decoys to exclude as “binders” when estimating FDR (default: 0.01).
- Returns:
- Array of FDR values corresponding to each entry in dists, unsorted.
Shape: (n_targets,)
- Return type:
fdrs
Example
>>> targets = np.array([0.3, 0.5, 0.7])
>>> decoys = np.random.normal(1.0, 0.2, size=1000)
>>> fdrs = get_fdrs(targets, decoys, alpha=1.0)
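The counting idea behind a target-decoy FDR can be sketched as follows: at each target distance d, the expected number of false targets is alpha times the number of decoys at or below d, divided by the number of targets at or below d. This is an illustrative sketch under that assumption, not the code of get_fdrs; the name `target_decoy_fdrs`, the handling of ties, and the cap at 1.0 are choices made for the example.

```python
import numpy as np

def target_decoy_fdrs(dists, rnd_dists, alpha, remove_rnd_top_rank=0.01):
    """Hypothetical sketch of a counting-based target-decoy FDR."""
    sorted_rnd = np.sort(rnd_dists)
    # Drop the smallest decoy distances, which may be genuine binders.
    n_drop = int(len(sorted_rnd) * remove_rnd_top_rank)
    sorted_rnd = sorted_rnd[n_drop:]
    sorted_targets = np.sort(dists)
    # Counts of decoys and targets with distance <= each target distance.
    n_decoys_le = np.searchsorted(sorted_rnd, dists, side="right")
    n_targets_le = np.searchsorted(sorted_targets, dists, side="right")
    # n_targets_le >= 1 because every target counts itself.
    fdrs = alpha * n_decoys_le / n_targets_le
    return np.minimum(fdrs, 1.0)

targets = np.array([0.3, 0.5, 0.7])
decoys = np.array([0.4, 0.6, 0.8, 1.0])
alpha = len(targets) / len(decoys)
print(target_decoy_fdrs(targets, decoys, alpha, remove_rnd_top_rank=0.0))
```

The result is returned in the same order as the input targets. Raw values from this kind of counting need not be monotonic in distance, which is the reason for a separate q-value step.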
- fennomix_mhc.mhc_binding_retriever.get_q_values(fdrs, distances)
Convert FDR estimates into q-values via monotonic minimization.
Each q-value is the minimum FDR among all peptides with an equal or larger distance, so q-values are non-decreasing with increasing distance.
- Parameters:
fdrs (ndarray) – Input array of FDR values (not necessarily monotonic). Shape: (n_samples,)
distances (ndarray) – Distance values used to sort peptides; larger distances mean weaker binding. Used to reverse-sort for q-value computation. Shape: (n_samples,)
- Returns:
- Monotonic q-values, same shape as input.
With peptides ranked from largest to smallest distance, q[i] <= q[j] for all j < i. Shape: (n_samples,)
- Return type:
qvals
Note
The algorithm traverses from highest to lowest distance, maintaining minimum seen FDR.
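The traversal described in the note amounts to a running minimum over FDRs taken in descending-distance order, then scattered back to the original positions. A minimal sketch, assuming exactly that procedure; the function name `q_values_from_fdrs` is hypothetical.

```python
import numpy as np

def q_values_from_fdrs(fdrs, distances):
    """Hypothetical sketch: running minimum of FDRs from largest to smallest distance."""
    fdrs = np.asarray(fdrs, dtype=np.float64)
    # Indices that visit peptides from the largest distance to the smallest.
    order = np.argsort(distances)[::-1]
    # Minimum FDR seen so far along that traversal.
    running_min = np.minimum.accumulate(fdrs[order])
    # Put each q-value back at the position of its peptide.
    qvals = np.empty_like(running_min)
    qvals[order] = running_min
    return qvals

distances = np.array([0.1, 0.2, 0.3, 0.4])
fdrs = np.array([0.0, 0.3, 0.2, 0.5])
print(q_values_from_fdrs(fdrs, distances))  # [0.  0.2 0.2 0.5]
```

In the example, the peptide at distance 0.2 has a raw FDR of 0.3, but a peptide at the larger distance 0.3 already achieves 0.2, so its q-value is lowered to 0.2.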