fennomix_mhc.mhc_utils module

Utility helpers for handling peptide and protein sequences.

Classes:

NonSpecificDigest(protein_data[, ...])

Generate peptides from a protein sequence without specific cleavage.

Functions:

load_peptide_df_from_mixmhcpred(mixmhcpred_dir)

Load and filter peptide predictions from MixMHCpred output files.

class fennomix_mhc.mhc_utils.NonSpecificDigest(protein_data, min_peptide_len=8, max_peptide_len=14)

Bases: object

Generate peptides from a protein sequence without specific cleavage.

Methods:

`__init__`(protein_data[, min_peptide_len, ...])	Initialize the digestion object with protein data.
`get_peptide_seqs_from_idxes`(idxes)	Retrieve peptide sequences by their digestion index.
`get_random_pept_df`([n])	Sample a random set of peptides from the precomputed digest.

__init__(protein_data, min_peptide_len=8, max_peptide_len=14)

Initialize the digestion object with protein data.

Concatenates all protein sequences, using ‘$’ as the delimiter, and precomputes the start and stop indices of every possible peptide within the specified length range.

Parameters:

protein_data (DataFrame | str | list[str]) – Input protein data, which can be one of the following:
- A pandas DataFrame with a ‘sequence’ column.
- A path to a FASTA file.
- A list of paths to FASTA files.
min_peptide_len (int) – Minimum length of peptides to generate. Must be >= 1. Default is 8.
max_peptide_len (int) – Maximum length of peptides to generate. Must be >= min_peptide_len. Default is 14.

Raises:

ValueError – If min_peptide_len > max_peptide_len, or if no sequences are found.
TypeError – If protein_data is not a DataFrame, string, or list of strings.
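
The concatenate-and-index scheme described for `__init__` can be illustrated with a minimal, dependency-free sketch. This is not the library's implementation: the function name `nonspecific_digest` and its return layout are hypothetical, chosen only to show how a ‘$’-joined string plus start and stop offsets can represent every peptide without letting any peptide span two proteins.

```python
def nonspecific_digest(sequences, min_len=8, max_len=14):
    """Hypothetical sketch: enumerate all substrings of length
    min_len..max_len that lie entirely within a single protein."""
    if min_len < 1 or min_len > max_len:
        raise ValueError("require 1 <= min_len <= max_len")
    if not sequences:
        raise ValueError("no sequences found")

    # One long string holds every protein; '$' marks the boundaries.
    cat = "$".join(sequences)

    starts, stops = [], []
    offset = 0  # position of the current protein inside `cat`
    for seq in sequences:
        for length in range(min_len, max_len + 1):
            # range() is empty when the protein is shorter than `length`.
            for i in range(len(seq) - length + 1):
                starts.append(offset + i)
                stops.append(offset + i + length)
        offset += len(seq) + 1  # +1 skips the '$' delimiter
    return cat, starts, stops


# Tiny demonstration with short toy sequences and a 2..3 length range.
cat, starts, stops = nonspecific_digest(["ABCDE", "XYZ"], min_len=2, max_len=3)
peptides = [cat[a:b] for a, b in zip(starts, stops)]
```

Because offsets are computed per protein, no start/stop pair ever covers a ‘$’ character, so the delimiter never appears in a peptide.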

get_peptide_seqs_from_idxes(idxes)

Retrieve peptide sequences by their digestion index.

Parameters:

idxes (Sequence[int] | ndarray) – A sequence (e.g., list, tuple) or NumPy array of integer indices corresponding to positions in the precomputed digest.

Returns:

A list of peptide sequences corresponding to the given indices.

Return type:

list[str]

Raises:

IndexError – If any index in idxes is out of bounds.
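
As a rough illustration of index-based lookup, the sketch below slices peptides out of a concatenated string using precomputed start and stop offsets. The helper `peptides_from_indices` and its arguments are hypothetical, and treating negative indices as out of bounds is an assumption; the library's actual bounds handling may differ.

```python
def peptides_from_indices(cat, starts, stops, idxes):
    """Hypothetical sketch: map digest indices to peptide strings."""
    n = len(starts)
    out = []
    for i in idxes:
        # Assumption: negative indices are rejected rather than wrapped.
        if not 0 <= i < n:
            raise IndexError(f"digest index {i} is out of bounds for {n} peptides")
        out.append(cat[starts[i]:stops[i]])
    return out


# Hand-built toy digest: three peptides inside "ABCDE$XYZ".
toy_cat = "ABCDE$XYZ"
toy_starts = [0, 1, 6]
toy_stops = [2, 3, 9]
picked = peptides_from_indices(toy_cat, toy_starts, toy_stops, [2, 0])
```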

get_random_pept_df(n=5000)

Sample a random set of peptides from the precomputed digest.

Parameters:

n (int) – Number of peptides to sample. If n exceeds the number of available peptides, sampling is done with replacement. Default is 5000.

Returns:

A DataFrame with two columns:

‘sequence’: Randomly sampled peptide sequences.
‘allele’: A constant value ‘random’ for all rows.

Return type:

DataFrame
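
The sampling rule can be sketched with the standard library alone. The function below is hypothetical and returns a list of dicts rather than a DataFrame to stay dependency-free; it also assumes that sampling is done without replacement whenever `n` fits within the pool, which the description implies but does not state outright.

```python
import random


def sample_random_peptides(peptides, n=5000, seed=None):
    """Hypothetical sketch: draw n peptides and label each row 'random'."""
    if not peptides:
        raise ValueError("no peptides to sample from")
    rng = random.Random(seed)
    if n <= len(peptides):
        # Enough peptides available: draw distinct positions.
        picked = rng.sample(peptides, n)
    else:
        # More requested than available: fall back to replacement.
        picked = rng.choices(peptides, k=n)
    return [{"sequence": p, "allele": "random"} for p in picked]


pool = ["SIINFEKL", "GILGFVFTL", "NLVPMVATV"]
small = sample_random_peptides(pool, n=2, seed=0)   # without replacement
large = sample_random_peptides(pool, n=10, seed=0)  # with replacement
```

Passing the resulting rows to `pandas.DataFrame(...)` would give a two-column table of the shape described above.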

fennomix_mhc.mhc_utils.load_peptide_df_from_mixmhcpred(mixmhcpred_dir, rank=2)

Load and filter peptide predictions from MixMHCpred output files.

This function reads all TSV result files from a directory generated by MixMHCpred, filters peptides based on the ‘%Rank_bestAllele’ column, and returns a unified DataFrame containing the peptide sequences and their corresponding alleles.

Parameters:

mixmhcpred_dir (str) – Path to the directory containing MixMHCpred output .tsv files.
rank (int) – Maximum allowed value for %Rank_bestAllele. Only peptides with rank less than or equal to this value are included. Default is 2.

Returns:

A DataFrame with two columns:

‘sequence’: The peptide amino acid sequence.
‘allele’: The corresponding MHC allele name derived from the filename.

Return type:

DataFrame

Raises:

FileNotFoundError – If the specified directory does not exist or contains no files.
pd.errors.EmptyDataError – If no valid data is found in the TSV files.
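
The read-filter-combine flow can be approximated as follows. This is a dependency-free sketch under several assumptions, not the library's code: the helper name `load_filtered_peptides` is hypothetical, the peptide column is assumed to be called `Peptide`, lines starting with `#` are assumed to be comments, and the allele is assumed to be the filename without its extension. It returns a list of dicts instead of a DataFrame and does not reproduce the `pd.errors.EmptyDataError` behaviour.

```python
import csv
import tempfile
from pathlib import Path


def load_filtered_peptides(result_dir, rank=2.0):
    """Hypothetical sketch: collect peptides whose %Rank_bestAllele <= rank."""
    result_dir = Path(result_dir)
    files = sorted(result_dir.glob("*.tsv")) if result_dir.is_dir() else []
    if not files:
        raise FileNotFoundError(f"no .tsv files found in {result_dir}")

    rows = []
    for path in files:
        with path.open(newline="") as fh:
            # Assumption: header comments start with '#'; skip them.
            data_lines = (line for line in fh if not line.startswith("#"))
            for rec in csv.DictReader(data_lines, delimiter="\t"):
                if float(rec["%Rank_bestAllele"]) <= rank:
                    # Assumption: the file stem is the allele name.
                    rows.append({"sequence": rec["Peptide"], "allele": path.stem})
    return rows


# Demonstration on a made-up file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "A0201.tsv").write_text(
        "# made-up comment line\n"
        "Peptide\t%Rank_bestAllele\n"
        "SIINFEKL\t0.5\n"
        "AAAAAAAA\t7.0\n"
    )
    rows = load_filtered_peptides(tmp, rank=2)
```

Only the first peptide survives the filter here, since its rank (0.5) is within the threshold and the second (7.0) is not.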