fennomix_mhc.mhc_utils module

Utility helpers for handling peptide and protein sequences.

Classes:

NonSpecificDigest(protein_data[, ...])

Generate peptides from a protein sequence without specific cleavage.

Functions:

load_peptide_df_from_mixmhcpred(mixmhcpred_dir[, rank])

Load and filter peptide predictions from MixMHCpred output files.

class fennomix_mhc.mhc_utils.NonSpecificDigest(protein_data, min_peptide_len=8, max_peptide_len=14)

Bases: object

Generate peptides from a protein sequence without specific cleavage.

Methods:

__init__(protein_data[, min_peptide_len, ...])

Initialize the digestion object with protein data.

get_peptide_seqs_from_idxes(idxes)

Retrieve peptide sequences by their digestion index.

get_random_pept_df([n])

Sample a random set of peptides from the precomputed digest.

__init__(protein_data, min_peptide_len=8, max_peptide_len=14)

Initialize the digestion object with protein data.

Concatenates all protein sequences with ‘$’ as delimiters and precomputes start and stop indices for all possible peptides within the specified length range.

Parameters:
  • protein_data (DataFrame | str | list[str]) – Input protein data, given as one of: a pandas DataFrame with a ‘sequence’ column, a path to a FASTA file, or a list of paths to FASTA files.

  • min_peptide_len (int) – Minimum length of peptides to generate. Must be >= 1. Default is 8.

  • max_peptide_len (int) – Maximum length of peptides to generate. Must be >= min_peptide_len. Default is 14.

Raises:
  • ValueError – If min_peptide_len > max_peptide_len, or if no sequences are found.

  • TypeError – If protein_data is not a DataFrame, string, or list of strings.
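
The concatenate-and-index approach described for __init__ can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name build_digest, the leading and trailing ‘$’, and the use of Python lists for the index arrays are all assumptions of the sketch.

```python
def build_digest(sequences, min_peptide_len=8, max_peptide_len=14):
    """Concatenate sequences with '$' and list every peptide window.

    Hypothetical helper for illustration only; not part of fennomix_mhc.
    """
    if min_peptide_len > max_peptide_len:
        raise ValueError("min_peptide_len must be <= max_peptide_len")
    if not sequences:
        raise ValueError("no sequences found")

    # Assumption: a '$' on both ends, so every window that runs off a
    # protein (including the last one) is guaranteed to hit a delimiter.
    cat_seq = "$" + "$".join(sequences) + "$"

    starts, stops = [], []
    for start, residue in enumerate(cat_seq):
        if residue == "$":
            continue  # a peptide cannot begin on a delimiter
        for length in range(min_peptide_len, max_peptide_len + 1):
            stop = start + length
            if "$" in cat_seq[start:stop]:
                break  # longer windows would also cross the delimiter
            starts.append(start)
            stops.append(stop)
    return cat_seq, starts, stops
```

With this layout a peptide is identified by a single integer: its position in the starts and stops lists. Storing indices rather than peptide strings keeps memory use proportional to the number of windows, not to their total length.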

get_peptide_seqs_from_idxes(idxes)

Retrieve peptide sequences by their digestion index.

Parameters:

idxes (Sequence[int] | ndarray) – A sequence (e.g., list, tuple) or NumPy array of integer indices corresponding to positions in the precomputed digest.

Return type:

list[str]

Returns:

A list of peptide sequences corresponding to the given indices.

Raises:

IndexError – If any index in idxes is out of bounds.
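
Retrieval by digestion index amounts to slicing a concatenated sequence with stored start and stop positions. The sketch below assumes such a layout; the function name, its arguments, and the decision to reject negative indices are hypothetical and may differ from the library's behavior.

```python
def peptides_from_indices(cat_seq, starts, stops, idxes):
    """Return the peptide for each digest index (illustrative sketch)."""
    n_peptides = len(starts)
    peptides = []
    for idx in idxes:
        idx = int(idx)  # also accepts NumPy integer scalars
        # Assumption: negative indices count as out of bounds here.
        if not 0 <= idx < n_peptides:
            raise IndexError(f"digest index {idx} is out of bounds")
        peptides.append(cat_seq[starts[idx]:stops[idx]])
    return peptides
```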

get_random_pept_df(n=5000)

Sample a random set of peptides from the precomputed digest.

Parameters:

n (int) – Number of peptides to sample. If n exceeds the number of available peptides, sampling is done with replacement. Default is 5000.

Returns:

A DataFrame with two columns:

  • ‘sequence’: Randomly sampled peptide sequences.

  • ‘allele’: A constant value ‘random’ for all rows.

Return type:

DataFrame
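
The sampling rule can be illustrated with the standard library alone. This sketch assumes that sampling is without replacement whenever n fits within the pool (the documentation only states the with-replacement case), and it returns a plain dict of columns instead of a DataFrame. The function name and the seed parameter are hypothetical.

```python
import random


def sample_peptides(peptides, n=5000, seed=None):
    """Sample n peptides and label them all as 'random' (sketch)."""
    rng = random.Random(seed)
    if n > len(peptides):
        # Not enough unique peptides: draw with replacement.
        picked = rng.choices(peptides, k=n)
    else:
        # Assumption: draw without replacement when the pool is large enough.
        picked = rng.sample(peptides, k=n)
    return {"sequence": picked, "allele": ["random"] * n}
```

Passing the returned dict to pandas.DataFrame would yield a two-column table with the same layout as the documented return value.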

fennomix_mhc.mhc_utils.load_peptide_df_from_mixmhcpred(mixmhcpred_dir, rank=2)

Load and filter peptide predictions from MixMHCpred output files.

This function reads all TSV result files from a directory generated by MixMHCpred, filters peptides based on the ‘%Rank_bestAllele’ column, and returns a unified DataFrame containing the peptide sequences and their corresponding alleles.

Parameters:
  • mixmhcpred_dir (str) – Path to the directory containing MixMHCpred output .tsv files.

  • rank (int) – Maximum allowed value for %Rank_bestAllele. Only peptides with rank less than or equal to this value are included. Default is 2.

Returns:

A DataFrame with two columns:

  • ‘sequence’: The peptide amino acid sequence.

  • ‘allele’: The corresponding MHC allele name derived from the filename.

Return type:

DataFrame

Raises:
  • FileNotFoundError – If the specified directory does not exist or contains no files.

  • pd.errors.EmptyDataError – If no valid data is found in the TSV files.
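
The load-and-filter logic can be approximated with the standard library. The sketch below is not the library's code and rests on several assumptions: the peptide column is named ‘Peptide’, comment lines start with ‘#’, the allele is taken from the file stem, and a plain ValueError stands in for pd.errors.EmptyDataError. It returns a list of row dicts rather than a DataFrame.

```python
import csv
from pathlib import Path


def load_peptides(mixmhcpred_dir, rank=2):
    """Collect peptides whose %Rank_bestAllele is <= rank (sketch)."""
    directory = Path(mixmhcpred_dir)
    tsv_files = sorted(directory.glob("*.tsv")) if directory.is_dir() else []
    if not tsv_files:
        raise FileNotFoundError(f"no .tsv files found in {mixmhcpred_dir}")

    rows = []
    for path in tsv_files:
        allele = path.stem  # assumption: the filename is '<allele>.tsv'
        with path.open(newline="") as handle:
            # Assumption: lines starting with '#' are comments, not data.
            data_lines = (line for line in handle if not line.startswith("#"))
            for record in csv.DictReader(data_lines, delimiter="\t"):
                if float(record["%Rank_bestAllele"]) <= rank:
                    # 'Peptide' is an assumed column name.
                    rows.append({"sequence": record["Peptide"], "allele": allele})

    if not rows:
        # Stand-in for pd.errors.EmptyDataError in this pandas-free sketch.
        raise ValueError("no peptides passed the rank filter")
    return rows
```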