fennomix_mhc.pipeline_api module
Classes:
PretrainedModels – Container for pretrained models used in the pipeline.
Functions:
deconvolute_and_predict_peptides – Cluster one set of peptides and predict binders from another set using derived clusters.
deconvolute_peptides – Cluster peptides into groups and assign representative HLA alleles to each cluster.
embed_peptides_from_file – Embed peptides from FASTA or tabular file and save results.
embed_proteins – Embed HLA protein sequences and save embeddings to disk.
load_peptide_embedding_pkl – Load peptide embeddings from a pickle file.
predict_epitopes_for_mhc – Predict peptide binders for given MHC alleles.
predict_mhc_binders_for_epitopes – Predict MHC binders for a given set of epitope peptides.
- class fennomix_mhc.pipeline_api.PretrainedModels(device='cuda', use_pseudo=False)
Bases: object
Container for pretrained models used in the pipeline.
This class lazily downloads required model weights and provides convenience methods for embedding proteins/peptides and predicting peptide-MHC interactions.
- device
Device used for inference (‘cuda’, ‘cpu’, or ‘mps’).
Type: str
- _use_pseudo
Whether to use pseudo-sequence models.
Type: bool
- hla_encoder
Trained HLA encoder.
- pept_encoder
Trained peptide encoder.
- esm2_model
ESM-2 model for HLA embedding (if not pseudo).
Type: esm.ProteinBertModel
- esm2_alphabet
ESM-2 alphabet.
Type: esm.Alphabet
- batch_converter
ESM batch converter.
- background_protein_df
Background protein sequences.
Type: pd.DataFrame
- hla_df
HLA allele information.
Type: pd.DataFrame
- hla_embeddings
Precomputed HLA embeddings.
Type: np.ndarray
Methods:
__init__([device, use_pseudo]) – Initialize pretrained models and load weights.
deconvolute_peptides(peptide_list, ...[, ...]) – Cluster peptides based on embeddings using k-means.
embed_peptides_from_fasta(fasta[, ...]) – Digest proteins in a FASTA file and embed resulting peptides.
embed_peptides_tsv(peptide_tsv[, ...]) – Embed peptides listed in a TSV/CSV file.
embed_proteins(fasta) – Embed HLA protein sequences from a FASTA file.
predict_epitopes_for_mhc(peptide_list, ...) – Predict the most likely allele binder for each peptide.
predict_mhc_binders_for_epitopes(...[, ...]) – Find the best binding epitope for each HLA allele.
- __init__(device='cuda', use_pseudo=False)
Initialize pretrained models and load weights.
- Parameters:
device (str) – Device for inference (‘cuda’, ‘cpu’, or ‘mps’). Defaults to ‘cuda’.
use_pseudo (bool) – Whether to use pseudo-sequence models. Defaults to False.
- Raises:
RuntimeError – If model download fails.
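Because device defaults to ‘cuda’, a script that may run on a machine without a GPU can choose the device before constructing the container. The following is a minimal sketch, assuming the models run on PyTorch (the device names suggest this, but the documentation above does not state it). pick_device and load_models are hypothetical helper names, not part of fennomix_mhc.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu', whichever is usable on this machine."""
    try:
        import torch  # assumption: the pretrained models are PyTorch models
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def load_models(use_pseudo: bool = False):
    """Build PretrainedModels on the detected device (may download weights)."""
    # Imported lazily so that pick_device() works without the package installed.
    from fennomix_mhc.pipeline_api import PretrainedModels

    return PretrainedModels(device=pick_device(), use_pseudo=use_pseudo)
```

Calling load_models() triggers the weight download described above, so it is worth catching RuntimeError around it in unattended jobs.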
- deconvolute_peptides(peptide_list, pept_embeddings, n_centroids=8, outlier_distance=0.2)
Cluster peptides based on embeddings using k-means.
- Parameters:
peptide_list (list) – List of peptide sequences.
pept_embeddings (ndarray) – Embedding matrix for peptide_list.
n_centroids (int) – Number of clusters. Defaults to 8.
outlier_distance (float) – Distance threshold for centroid refinement. Defaults to 0.2.
- Returns:
DataFrame assigning peptides to clusters and centroid embeddings.
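To make the roles of n_centroids and outlier_distance concrete, here is a small pure-Python sketch of k-means followed by one refinement pass. It is an illustration under assumptions, not the library's implementation: Euclidean distance, seeding from the first points, and "refinement" meaning "drop members beyond the threshold and recompute the centroid" are all guesses. The function names kmeans and refine are hypothetical.

```python
import math


def kmeans(points, n_centroids, n_iter=20):
    """Plain Lloyd's k-means; the first n_centroids points seed the centroids."""
    centroids = [list(p) for p in points[:n_centroids]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(n_centroids), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(n_centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def refine(points, labels, centroids, outlier_distance):
    """Relabel points farther than outlier_distance from their centroid as -1,
    then recompute each centroid from the members that remain."""
    kept = [
        lab if math.dist(p, centroids[lab]) <= outlier_distance else -1
        for p, lab in zip(points, labels)
    ]
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, lab in zip(points, kept) if lab == c]
        new_centroids.append(
            [sum(col) / len(members) for col in zip(*members)] if members else old
        )
    return kept, new_centroids


# Toy 2-D "embeddings": two tight groups plus one stray point (the last one).
points = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.0), (1.0, 0.9),
          (0.0, 0.1), (0.9, 1.0), (0.5, 0.6)]
labels, centroids = kmeans(points, n_centroids=2)
kept, centroids = refine(points, labels, centroids, outlier_distance=0.3)
```

In this toy run the stray point is first absorbed into the second cluster and then relabelled -1 by the refinement pass, which pulls that centroid back toward the tight group.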
- embed_peptides_from_fasta(fasta, min_peptide_length=8, max_peptide_length=12)
Digest proteins in a FASTA file and embed resulting peptides.
- Parameters:
fasta (str) – Path to FASTA file containing proteins.
min_peptide_length (int) – Minimum peptide length. Defaults to 8.
max_peptide_length (int) – Maximum peptide length. Defaults to 12.
- Returns:
Tuple of (list of peptide sequences, embeddings array).
- Raises:
ValueError – If no valid peptides are found.
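The digestion step can be pictured as a sliding window over each protein. The sketch below assumes unspecific digestion, that is, every substring within the length range counts as a candidate peptide; the documentation above does not say which digestion rule the library applies. read_fasta and digest are hypothetical helpers and the sequence is toy data.

```python
def read_fasta(text):
    """Parse FASTA text into a {header: sequence} dict."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif line and header is not None:
            records[header] += line
    return records


def digest(sequence, min_peptide_length=8, max_peptide_length=12):
    """Return every distinct substring whose length lies in the given range."""
    peptides = set()
    for length in range(min_peptide_length, max_peptide_length + 1):
        for start in range(len(sequence) - length + 1):
            peptides.add(sequence[start:start + length])
    return sorted(peptides)


# A 12-residue toy protein split over two FASTA lines.
fasta_text = ">toy_protein\nACDEFGHIKL\nMN\n"
proteins = read_fasta(fasta_text)
peptides = digest(proteins["toy_protein"], min_peptide_length=8, max_peptide_length=9)
```

A protein shorter than min_peptide_length yields an empty list here, which corresponds to the "no valid peptides" condition that raises ValueError in the documented method.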
- embed_peptides_tsv(peptide_tsv, min_peptide_length=8, max_peptide_length=12)
Embed peptides listed in a TSV/CSV file.
- Parameters:
peptide_tsv (str) – Path to delimited file with ‘sequence’ column.
min_peptide_length (int) – Minimum allowed peptide length. Defaults to 8.
max_peptide_length (int) – Maximum allowed peptide length. Defaults to 12.
- Returns:
Tuple of (list of peptide sequences, embeddings array).
- Raises:
ValueError – If no valid peptides are found.
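The input contract for tabular files (a ‘sequence’ column plus a length filter) can be checked before handing a file to the embedder. This is a sketch using only the standard library; read_peptides is a hypothetical helper, and how the library itself parses the file is not documented above.

```python
import csv
import io


def read_peptides(handle, delimiter="\t", min_peptide_length=8, max_peptide_length=12):
    """Return peptides from the 'sequence' column that pass the length filter."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    if reader.fieldnames is None or "sequence" not in reader.fieldnames:
        raise KeyError("input needs a 'sequence' column")
    peptides = [
        row["sequence"].strip()
        for row in reader
        if row["sequence"]
        and min_peptide_length <= len(row["sequence"].strip()) <= max_peptide_length
    ]
    if not peptides:
        raise ValueError("no valid peptides found")
    return peptides


# In-memory TSV; the third row is dropped because it is shorter than 8 residues.
tsv = io.StringIO("sequence\tsource\nSIINFEKL\tOVA\nGILGFVFTL\tFlu M1\nAAA\ttoo short\n")
peptides = read_peptides(tsv)
```

For a CSV file, pass delimiter="," instead.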
- embed_proteins(fasta)
Embed HLA protein sequences from a FASTA file.
- Parameters:
fasta (str) – Path to FASTA file containing HLA sequences.
- Returns:
Tuple of (protein DataFrame, embeddings array).
- Raises:
FileNotFoundError – If FASTA file does not exist.
- predict_epitopes_for_mhc(peptide_list, peptide_embeddings, alleles, hla_df=None, hla_embeddings=None, min_peptide_length=8, max_peptide_length=12, outlier_distance=0.4)
Predict the most likely allele binder for each peptide.
- Parameters:
peptide_list (list) – Peptide sequences to evaluate.
peptide_embeddings (ndarray) – Embeddings for peptide_list.
alleles (list) – List of allele names to consider.
hla_df (DataFrame) – DataFrame with HLA sequence info. If None, pretrained DB is used.
hla_embeddings (ndarray) – Embeddings for HLAs in hla_df.
min_peptide_length (int) – Minimum peptide length. Defaults to 8.
max_peptide_length (int) – Maximum peptide length. Defaults to 12.
outlier_distance (float) – Distance threshold to filter outliers. Defaults to 0.4.
- Returns:
Table of peptides with their best matching allele and distance.
- Raises:
ValueError – If no valid peptide sequences are found.
- predict_mhc_binders_for_epitopes(peptide_list, peptide_embeddings, hla_df=None, hla_embeddings=None, min_peptide_length=8, max_peptide_length=12, outlier_distance=0.4)
Find the best binding epitope for each HLA allele.
- Parameters:
peptide_list (list) – List of peptide sequences.
peptide_embeddings (ndarray) – Embeddings corresponding to peptide_list.
hla_df (DataFrame) – DataFrame containing HLA info. If None, builtin embeddings are used.
hla_embeddings (ndarray) – Embeddings for HLAs in hla_df.
min_peptide_length (int) – Minimum peptide length. Defaults to 8.
max_peptide_length (int) – Maximum peptide length. Defaults to 12.
outlier_distance (float) – Distance threshold to filter outliers. Defaults to 0.4.
- Returns:
DataFrame mapping each allele to its closest peptide.
- Raises:
ValueError – If no valid peptide sequences are found.
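The two prediction methods read the same peptide-to-HLA distances in opposite directions: one picks the nearest allele for each peptide, the other the nearest peptide for each allele. The sketch below illustrates that idea with toy 2-D vectors and placeholder names. Euclidean distance and the exact use of outlier_distance as an upper bound are assumptions, and both function names are hypothetical.

```python
import math

# Toy vectors standing in for real embeddings; all names are placeholders.
peptide_vecs = {"PEPTIDE_1": (0.1, 0.0), "PEPTIDE_2": (0.9, 1.0), "PEPTIDE_3": (5.0, 5.0)}
allele_vecs = {"ALLELE_X": (1.0, 1.0), "ALLELE_Y": (0.0, 0.0)}


def best_allele_per_peptide(peptides, alleles, outlier_distance=0.4):
    """For each peptide, keep its nearest allele if it lies within the threshold."""
    rows = []
    for pep, pvec in peptides.items():
        allele, dist = min(
            ((a, math.dist(pvec, avec)) for a, avec in alleles.items()),
            key=lambda pair: pair[1],
        )
        if dist <= outlier_distance:
            rows.append((pep, allele, dist))
    return rows


def best_peptide_per_allele(peptides, alleles, outlier_distance=0.4):
    """For each allele, keep its nearest peptide if it lies within the threshold."""
    rows = []
    for allele, avec in alleles.items():
        pep, dist = min(
            ((p, math.dist(pvec, avec)) for p, pvec in peptides.items()),
            key=lambda pair: pair[1],
        )
        if dist <= outlier_distance:
            rows.append((allele, pep, dist))
    return rows


by_peptide = best_allele_per_peptide(peptide_vecs, allele_vecs)
by_allele = best_peptide_per_allele(peptide_vecs, allele_vecs)
```

PEPTIDE_3 is far from both alleles, so it is filtered out in the first direction and never selected in the second.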
- fennomix_mhc.pipeline_api.deconvolute_and_predict_peptides(peptide_file_path_to_deconv, peptide_file_path_to_predict, n_centroids, out_folder, out_fasta_format, min_peptide_length=8, max_peptide_length=12, outlier_distance=0.2, hla_file_path=None, device='cuda', use_pseudo=False)
Cluster one set of peptides and predict binders from another set using derived clusters.
First, peptides from peptide_file_path_to_deconv are clustered to infer “pseudo-alleles”. Then, peptides from peptide_file_path_to_predict are matched against these pseudo-alleles to identify potential binders.
- Parameters:
peptide_file_path_to_deconv (str|Path) – File path for peptides used in clustering (deconvolution).
peptide_file_path_to_predict (str|Path) – File path for peptides to be tested for binding.
n_centroids (int) – Number of clusters to form during deconvolution.
out_folder (str|Path) – Output directory for results and logs.
out_fasta_format (bool) – If True, saves results in FASTA format; otherwise, TSV.
min_peptide_length (int) – Minimum peptide length. Defaults to 8.
max_peptide_length (int) – Maximum peptide length. Defaults to 12.
outlier_distance (float) – Distance threshold for clustering and prediction. Defaults to 0.2.
hla_file_path (str|Path) – Optional path to custom HLA embeddings or FASTA. Uses default if None.
device (str) – Device for computation (“cuda”, “cpu”, “mps”). Defaults to “cuda”.
use_pseudo (bool) – Whether to use pseudo-sequence embedding model. Defaults to False.
- Returns:
None
- Raises:
FileNotFoundError – If any input file is not found.
ValueError – If file formats are unsupported.
RuntimeError – If embedding, clustering, or prediction fails.
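A call to this function might be prepared as follows. The keyword names come from the signature above; the file names, the build_args and run helpers, and the choice to create out_folder up front are illustrative assumptions. The actual call is left commented out because it needs the package, the model weights, and real input files.

```python
from pathlib import Path


def build_args(deconv_file, predict_file, n_centroids, out_folder,
               out_fasta_format=False, device="cpu"):
    """Validate inputs and return keyword arguments for the pipeline call."""
    if n_centroids < 1:
        raise ValueError("n_centroids must be a positive integer")
    return {
        "peptide_file_path_to_deconv": str(Path(deconv_file)),
        "peptide_file_path_to_predict": str(Path(predict_file)),
        "n_centroids": n_centroids,
        "out_folder": str(Path(out_folder)),
        "out_fasta_format": out_fasta_format,  # no default in the signature
        "device": device,
    }


def run(**kwargs):
    """Call the documented function; needs fennomix_mhc and the input files."""
    from fennomix_mhc import pipeline_api

    # Assumption: creating the folder beforehand is harmless even if the
    # library would create it itself.
    Path(kwargs["out_folder"]).mkdir(parents=True, exist_ok=True)
    pipeline_api.deconvolute_and_predict_peptides(**kwargs)


# Hypothetical file names, used only to show the shape of the call.
args = build_args("immunopeptidome.tsv", "candidates.fasta",
                  n_centroids=6, out_folder="results")
# run(**args)  # uncomment once the package and both input files are in place
```

Since the function returns None, results have to be read back from out_folder afterwards.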
- fennomix_mhc.pipeline_api.deconvolute_peptides(peptide_file_path, n_centroids, out_folder, min_peptide_length=8, max_peptide_length=12, outlier_distance=100, hla_file_path=None, device='cuda', use_pseudo=False)
Cluster peptides into groups and assign representative HLA alleles to each cluster.
This function performs unsupervised clustering of peptide embeddings and maps each cluster centroid to the closest known HLA allele, effectively “deconvoluting” potential allele specificities from a peptide set.
- Parameters:
peptide_file_path (str) – Path to peptide sequences or embeddings (.pkl, .fasta, .tsv, .csv).
n_centroids (int) – Number of clusters to form.
out_folder (str) – Directory to save clustering results and logs.
min_peptide_length (int) – Minimum peptide length. Defaults to 8.
max_peptide_length (int) – Maximum peptide length. Defaults to 12.
outlier_distance (float) – Threshold for refining clusters (100 disables filtering). Defaults to 100.
hla_file_path (str) – Optional path to custom HLA embeddings or FASTA. Uses default if None.
device (str) – Computation device (“cuda”, “cpu”, “mps”). Defaults to “cuda”.
use_pseudo (bool) – Use pseudo-sequence model for embeddings. Defaults to False.
- Returns:
None
- Raises:
FileNotFoundError – If input files are missing.
ValueError – If file format is unsupported.
RuntimeError – If embedding or clustering fails.
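The accepted input formats (.pkl, .fasta, .tsv, .csv) suggest a dispatch on the file extension, with ValueError for anything else. A minimal sketch of such a check follows; classify_input is a hypothetical helper, and the case-insensitive matching is an assumption rather than documented behaviour.

```python
from pathlib import Path

# Extensions named in the parameter description, grouped by how they are read.
KINDS = {".pkl": "embeddings", ".fasta": "fasta", ".tsv": "table", ".csv": "table"}


def classify_input(peptide_file_path):
    """Map a path to 'embeddings', 'fasta', or 'table' by its extension."""
    suffix = Path(peptide_file_path).suffix.lower()
    try:
        return KINDS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file format: {suffix or '(none)'}") from None
```

Running this check before the pipeline call surfaces a wrong file type immediately, before any model weights are loaded.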
- fennomix_mhc.pipeline_api.embed_peptides_from_file(peptide_file_path, out_folder, min_peptide_length=8, max_peptide_length=12, device='cuda', use_pseudo=False)
Embed peptides from FASTA or tabular file and save results.
- Parameters:
peptide_file_path (str) – Input file with peptide sequences (.fasta, .tsv, .csv).
out_folder (str) – Directory to save ‘peptide_embeddings.pkl’.
min_peptide_length (int) – Minimum peptide length. Defaults to 8.
max_peptide_length (int) – Maximum peptide length. Defaults to 12.
device (str) – Device for embedding. Defaults to ‘cuda’.
use_pseudo (bool) – Whether to use pseudo model. Defaults to False.
- Raises:
ValueError – If file format is unsupported.
- fennomix_mhc.pipeline_api.embed_proteins(fasta, out_folder, device='cuda')
Embed HLA protein sequences and save embeddings to disk.
- Parameters:
fasta (str) – Path to FASTA file with HLA sequences.
out_folder (str) – Directory to save ‘hla_embeddings.pkl’.
device (str) – Device for embedding (‘cuda’, ‘cpu’, ‘mps’). Defaults to ‘cuda’.
- fennomix_mhc.pipeline_api.load_peptide_embedding_pkl(fname)
Load peptide embeddings from a pickle file.
- Parameters:
fname – Path to .pkl file containing ‘peptide_list’ and ‘pept_embeds’.
- Returns:
A tuple containing:
peptide_list (list of str): Peptide sequences.
pept_embeds (np.ndarray): Corresponding embedding vectors.
- Raises:
IOError – If file cannot be read.
KeyError – If expected keys are missing in the pickle.
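The pickle layout described above can be exercised with a round trip. This sketch assumes the file holds a dict keyed by ‘peptide_list’ and ‘pept_embeds’, which is consistent with the KeyError in the Raises section but not stated outright. load_embedding_pickle is a hypothetical stand-in for the library function, and nested lists stand in for the np.ndarray.

```python
import os
import pickle
import tempfile


def load_embedding_pickle(fname):
    """Read a pickle holding a dict with 'peptide_list' and 'pept_embeds'."""
    with open(fname, "rb") as handle:
        data = pickle.load(handle)
    # Indexing raises KeyError when an expected key is absent.
    return data["peptide_list"], data["pept_embeds"]


# Round trip with toy data written to a temporary directory.
payload = {
    "peptide_list": ["SIINFEKL", "GILGFVFTL"],
    "pept_embeds": [[0.1, 0.2], [0.3, 0.4]],
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "peptide_embeddings.pkl")
    with open(path, "wb") as handle:
        pickle.dump(payload, handle)
    peptide_list, pept_embeds = load_embedding_pickle(path)
```

Unpickling executes code embedded in the file, so only load embedding files from sources you trust.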
- fennomix_mhc.pipeline_api.predict_epitopes_for_mhc(peptide_file_path, alleles, out_folder, out_fasta_format=False, min_peptide_length=8, max_peptide_length=12, outlier_distance=0.4, hla_file_path=None, device='cuda', use_pseudo=False)
Predict peptide binders for given MHC alleles.
- Parameters:
peptide_file_path (str) – Path to peptide embeddings or sequence file.
alleles (list) – Alleles to consider.
out_folder (str) – Directory to write results.
out_fasta_format (bool) – Whether to output FASTA instead of TSV. Defaults to False.
min_peptide_length (int) – Minimum peptide length. Defaults to 8.
max_peptide_length (int) – Maximum peptide length. Defaults to 12.
outlier_distance (float) – Distance threshold. Defaults to 0.4.
hla_file_path (str) – Optional path to custom HLA embeddings or FASTA.
device (str) – Device for model. Defaults to ‘cuda’.
use_pseudo (bool) – Whether to use pseudo model. Defaults to False.
- fennomix_mhc.pipeline_api.predict_mhc_binders_for_epitopes(peptide_file_path, out_folder, min_peptide_length=8, max_peptide_length=12, outlier_distance=0.4, hla_file_path=None, device='cuda', use_pseudo=False)
Predict MHC binders for a given set of epitope peptides.
This function loads peptide and MHC (HLA) embeddings, then predicts which MHC alleles are likely to bind the input peptides based on embedding similarity. Results are saved in a TSV file.
- Parameters:
peptide_file_path (str) – Path to a file containing peptide sequences or precomputed embeddings. Supported formats: .pkl (embeddings), .fasta, .tsv, .csv.
out_folder (str) – Output directory where results and logs will be saved.
min_peptide_length (int) – Minimum length of peptides to consider. Defaults to 8.
max_peptide_length (int) – Maximum length of peptides to consider. Defaults to 12.
outlier_distance (float) – Distance threshold for filtering binding predictions. Lower values indicate stricter similarity. Defaults to 0.4.
hla_file_path (str) – Optional path to custom HLA embeddings (.pkl) or sequences (.fasta). If None, uses built-in HLA embeddings.
device (str) – Device to run computations on. Options: “cuda”, “cpu”, “mps”. Defaults to “cuda”.
use_pseudo (bool) – Whether to use the pseudo-sequence embedding model. Defaults to False.
- Returns:
None
- Raises:
FileNotFoundError – If the peptide or HLA file does not exist.
ValueError – If an unsupported file format is provided.
RuntimeError – If model loading or prediction fails.