fennomix_mhc.mhc_utils module
Utility helpers for handling peptide and protein sequences.
Classes:
NonSpecificDigest – Generate peptides from a protein sequence without specific cleavage.
Functions:
load_peptide_df_from_mixmhcpred – Load and filter peptide predictions from MixMHCpred output files.
- class fennomix_mhc.mhc_utils.NonSpecificDigest(protein_data, min_peptide_len=8, max_peptide_len=14)
Bases: object
Generate peptides from a protein sequence without specific cleavage.
Methods:
__init__(protein_data[, min_peptide_len, ...]) – Initialize the digestion object with protein data.
get_peptide_seqs_from_idxes(idxes) – Retrieve peptide sequences by their digestion index.
get_random_pept_df([n]) – Sample a random set of peptides from the precomputed digest.
- __init__(protein_data, min_peptide_len=8, max_peptide_len=14)
Initialize the digestion object with protein data.
Concatenates all protein sequences with ‘$’ as delimiters and precomputes start and stop indices for all possible peptides within the specified length range.
- Parameters:
protein_data (DataFrame | str | list[str]) – Input protein data: a pandas DataFrame with a ‘sequence’ column, a path to a FASTA file, or a list of paths to FASTA files.
min_peptide_len (int) – Minimum length of peptides to generate. Must be >= 1. Default is 8.
max_peptide_len (int) – Maximum length of peptides to generate. Must be >= min_peptide_len. Default is 14.
- Raises:
ValueError – If min_peptide_len > max_peptide_len, or if no sequences are found.
TypeError – If protein_data is not a DataFrame, string, or list of strings.
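The concatenate-and-index approach described for `__init__` can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `digest_nonspecific`, its return values, the leading and trailing ‘$’, and the ordering of the peptides are all hypothetical. Only the ‘$’ delimiter and the length range come from the description above.

```python
def digest_nonspecific(sequences, min_len=8, max_len=14):
    """Return (cat_seq, starts, stops) for every peptide within the length range.

    Hypothetical helper: a peptide is the slice cat_seq[start:stop].
    """
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    if not sequences:
        raise ValueError("no sequences found")

    # Assumption: every protein is wrapped in '$' so that a window can never
    # silently span two neighbouring proteins.
    cat_seq = "$" + "$".join(sequences) + "$"

    starts, stops = [], []
    for start in range(len(cat_seq)):
        if cat_seq[start] == "$":
            continue  # a peptide cannot begin on a delimiter
        for length in range(min_len, max_len + 1):
            stop = start + length
            # Once a window touches a delimiter, every longer window does too.
            if "$" in cat_seq[start:stop]:
                break
            starts.append(start)
            stops.append(stop)
    return cat_seq, starts, stops


cat_seq, starts, stops = digest_nonspecific(["ABCDE", "FGH"], min_len=2, max_len=3)
peptides = [cat_seq[a:b] for a, b in zip(starts, stops)]
print(len(peptides))  # 10
```

Storing only the start and stop positions keeps memory use proportional to the number of peptides rather than to their total length, which is presumably why the indices are precomputed instead of the peptide strings themselves.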
- get_peptide_seqs_from_idxes(idxes)
Retrieve peptide sequences by their digestion index.
- Parameters:
idxes (Sequence[int] | ndarray) – A sequence (e.g., list, tuple) or NumPy array of integer indices corresponding to positions in the precomputed digest.
- Return type:
list[str]
- Returns:
A list of peptide sequences corresponding to the given indices.
- Raises:
IndexError – If any index in idxes is out of bounds.
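As a rough illustration of what an index lookup against a precomputed digest could look like, the sketch below slices a concatenated string using parallel start and stop lists. The names `cat_seq`, `starts`, `stops`, and `peptide_seqs_from_idxes` are hypothetical and the data are made up for the example; the library's internal layout may differ.

```python
# Hypothetical precomputed digest: peptide i is cat_seq[starts[i]:stops[i]].
cat_seq = "$ABCDE$FGH$"
starts = [1, 1, 2, 7]
stops = [3, 4, 4, 10]


def peptide_seqs_from_idxes(idxes):
    """Return the peptide strings for the given digest indices.

    An index past the end of the digest raises IndexError through
    ordinary list indexing.
    """
    return [cat_seq[starts[i]:stops[i]] for i in idxes]


print(peptide_seqs_from_idxes([3, 0]))  # ['FGH', 'AB']
```

Plain Python list indexing also accepts negative indices; whether the library treats those as valid or out of bounds is not stated here.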
- get_random_pept_df(n=5000)
Sample a random set of peptides from the precomputed digest.
- Parameters:
n (int) – Number of peptides to sample. If n exceeds the number of available peptides, sampling is done with replacement. Default is 5000.
- Returns:
A DataFrame with two columns:
‘sequence’: Randomly sampled peptide sequences.
‘allele’: A constant value ‘random’ for all rows.
- Return type:
DataFrame
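The sampling rule for `get_random_pept_df` can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code: the function name `sample_random_peptides` and the `seed` argument are hypothetical, and sampling without replacement when n fits is inferred from the parameter description rather than stated outright.

```python
import random


def sample_random_peptides(peptides, n=5000, seed=None):
    """Return n rows shaped like the documented 'sequence'/'allele' columns."""
    if not peptides:
        raise ValueError("no peptides to sample from")
    rng = random.Random(seed)
    if n > len(peptides):
        # More rows requested than peptides available: draw with replacement.
        chosen = rng.choices(peptides, k=n)
    else:
        # Assumption: otherwise each peptide is drawn at most once.
        chosen = rng.sample(peptides, k=n)
    return [{"sequence": seq, "allele": "random"} for seq in chosen]


rows = sample_random_peptides(["AAAAAAAAA", "CCCCCCCCC", "DDDDDDDDD"], n=5, seed=0)
print(len(rows))  # 5
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame` to obtain the two-column table described above.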
- fennomix_mhc.mhc_utils.load_peptide_df_from_mixmhcpred(mixmhcpred_dir, rank=2)
Load and filter peptide predictions from MixMHCpred output files.
This function reads all TSV result files from a directory generated by MixMHCpred, filters peptides based on the ‘%Rank_bestAllele’ column, and returns a unified DataFrame containing the peptide sequences and their corresponding alleles.
- Parameters:
mixmhcpred_dir (str) – Path to the directory containing MixMHCpred output .tsv files.
rank (int) – Maximum allowed value for %Rank_bestAllele. Only peptides with rank less than or equal to this value are included. Default is 2.
- Returns:
A DataFrame with two columns:
‘sequence’: The peptide amino acid sequence.
‘allele’: The corresponding MHC allele name derived from the filename.
- Return type:
DataFrame
- Raises:
FileNotFoundError – If the specified directory does not exist or contains no files.
pd.errors.EmptyDataError – If no valid data is found in the TSV files.
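The read, filter, and concatenate steps described for this function might look roughly like the pandas sketch below. Several details are assumptions made for illustration only: the helper name `load_ranked_peptides`, the ‘Peptide’ column name, the use of the file stem as the allele name, and the presence of ‘#’ comment lines in the files. Only the ‘%Rank_bestAllele’ column and the output column names come from the description above.

```python
from pathlib import Path

import pandas as pd


def load_ranked_peptides(result_dir, rank=2):
    """Collect peptides with %Rank_bestAllele <= rank from every .tsv file."""
    result_dir = Path(result_dir)
    tsv_files = sorted(result_dir.glob("*.tsv")) if result_dir.is_dir() else []
    if not tsv_files:
        raise FileNotFoundError(f"No .tsv files found in {result_dir}")

    frames = []
    for tsv in tsv_files:
        # Assumption: header comment lines, if any, start with '#'.
        df = pd.read_csv(tsv, sep="\t", comment="#")
        # Assumption: the peptide column is called 'Peptide'.
        kept = df.loc[df["%Rank_bestAllele"] <= rank, ["Peptide"]]
        kept = kept.rename(columns={"Peptide": "sequence"})
        # Assumption: the file stem (name without .tsv) is the allele name.
        kept["allele"] = tsv.stem
        frames.append(kept)
    return pd.concat(frames, ignore_index=True)
```

Filtering each file before concatenation keeps the combined table small, which matters when the prediction files cover many alleles.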