fennomix_mhc.mhc_binding_model module
Classes:
HlaDataSet | Dataset providing paired HLA embeddings and peptides for training.
ModelHlaEncoder | Transformer-based encoder for HLA embeddings.
ModelSeqEncoder | Transformer-based encoder for peptide sequences.
SiameseCELoss | Contrastive Siamese loss for HLA-peptide similarity learning.
Functions:
batchify_hla_esm_list | Converts a list of variable-length HLA ESM embeddings into a padded tensor.
embed_hla_esm_list | Generates fixed-size embeddings for a list of HLA ESM features.
embed_peptides | Encodes a list of peptide sequences into embeddings.
get_ascii_indices | Converts a list of peptide sequences into ASCII-encoded index tensors.
get_cosine_schedule_with_warmup | Creates a learning rate scheduler with linear warmup and cosine decay.
get_hla_dataloader | Creates a DataLoader for HlaDataSet with custom collation.
pept_hla_collate | Collate function for creating batches from HlaDataSet.
test | Evaluates model performance on test alleles using rank-based recall.
train | Train the peptide/HLA encoders.
- class fennomix_mhc.mhc_binding_model.HlaDataSet(hla_df, hla_esm_list, pept_df, protein_data, min_peptide_len=8, max_peptide_len=14)
Bases:
Dataset
Dataset providing paired HLA embeddings and peptides for training.
Methods:
__init__(hla_df, hla_esm_list, pept_df, ...) | Initialize the dataset.
get_allele_embed(index) | Get HLA embedding for a specific peptide.
Sample a negative peptide sequence.
- __init__(hla_df, hla_esm_list, pept_df, protein_data, min_peptide_len=8, max_peptide_len=14)
Initialize the dataset.
- Parameters:
hla_df (DataFrame) – DataFrame with HLA information; must have an ‘allele’ column.
hla_esm_list (list[ndarray]) – List of HLA ESM embeddings corresponding to hla_df rows.
pept_df (DataFrame | None) – Peptide DataFrame with columns ‘sequence’ and ‘allele’.
protein_data (DataFrame | list | str) – Protein FASTA path(s) or DataFrame used to generate negatives.
min_peptide_len (int) – Minimum length for random digestion.
max_peptide_len (int) – Maximum length for random digestion.
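The "random digestion" that the length parameters refer to can be pictured as cutting random substrings out of background proteins to serve as negatives. The sketch below is only an illustration of that idea: the helper name sample_random_peptide, the uniform choice of length and start position, and the toy protein string are all assumptions, not the behaviour documented for HlaDataSet.

```python
import random


def sample_random_peptide(protein, min_len=8, max_len=14, rng=random):
    """Cut one random substring of length min_len..max_len out of a protein."""
    if len(protein) < min_len:
        raise ValueError("protein is shorter than min_len")
    # Never ask for a peptide longer than the protein itself.
    length = rng.randint(min_len, min(max_len, len(protein)))
    # Pick a start so that the whole peptide fits inside the protein.
    start = rng.randint(0, len(protein) - length)
    return protein[start:start + length]


# Made-up toy sequence, used only to exercise the helper.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
peptide = sample_random_peptide(protein, rng=random.Random(0))
print(peptide, len(peptide))
```

Passing a seeded random.Random instance keeps the sampling reproducible, which is convenient when comparing training runs.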
- class fennomix_mhc.mhc_binding_model.ModelHlaEncoder(d_model=480, layer_num=1, dropout=0.2)
Bases:
Module
Transformer-based encoder for HLA embeddings.
Methods:
__init__([d_model, layer_num, dropout]) | Initialize the HLA encoder.
forward(x) | Encodes variable-length HLA embeddings into fixed-size vectors.
- class fennomix_mhc.mhc_binding_model.ModelSeqEncoder(d_model=480, layer_num=4, dropout=0.2)
Bases:
Module
Transformer-based encoder for peptide sequences.
Methods:
__init__([d_model, layer_num, dropout]) | Initialize the sequence encoder.
forward(aa_idxes) | Encode peptide sequences to embeddings.
- class fennomix_mhc.mhc_binding_model.SiameseCELoss
Bases:
object
Contrastive Siamese loss for HLA-peptide similarity learning.
Encourages the model to bring positive pairs closer and push negative pairs apart. Uses margin-based contrastive loss.
Methods:
get_loss(hla_x, x[, y]) | Computes contrastive loss for one pair.
Attributes:
margin
- get_loss(hla_x, x, y=1.0)
Computes contrastive loss for one pair.
- Parameters:
hla_x (Tensor) – HLA embedding tensor.
x (Tensor) – Peptide embedding tensor.
y (float) – Label (1.0 for positive pair, 0.0 for negative).
- Return type:
Tensor
- Returns:
Scalar loss tensor.
- margin: float = 1
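To make "margin-based contrastive loss" concrete, here is the textbook form of such a loss for a single pair. This is a sketch under assumptions: the exact formula and distance measure used by SiameseCELoss.get_loss are not stated in this reference, so the Euclidean distance, the squared terms, and the function name contrastive_loss are illustrative choices only. Plain Python lists stand in for tensors.

```python
import math


def contrastive_loss(hla_vec, pept_vec, y=1.0, margin=1.0):
    """Textbook margin-based contrastive loss for one pair of vectors."""
    # Euclidean distance between the two embeddings (an assumed metric).
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(hla_vec, pept_vec)))
    # Positive pairs (y=1) are penalised by their squared distance.
    positive_term = y * dist ** 2
    # Negative pairs (y=0) are penalised only while closer than the margin.
    negative_term = (1.0 - y) * max(0.0, margin - dist) ** 2
    return positive_term + negative_term


print(contrastive_loss([0.0, 0.0], [0.0, 0.0], y=0.0))  # negative pair at distance 0
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0.0))  # negative pair beyond the margin
```

In this formulation the margin sets how far apart a negative pair must be before it stops contributing to the loss.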
- fennomix_mhc.mhc_binding_model.batchify_hla_esm_list(batch_esm_list)
Converts a list of variable-length HLA ESM embeddings into a padded tensor.
- Parameters:
batch_esm_list (list[ndarray]) – List of arrays, each of shape (1, seq_len, d_model).
- Return type:
Tensor
- Returns:
Padded tensor of shape (batch_size, max_seq_len, d_model).
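The shape change described above, from a list of (1, seq_len, d_model) items to one (batch_size, max_seq_len, d_model) block, can be sketched without any array library. Nested lists stand in for ndarrays and tensors here; the helper name pad_batch and the zero padding value are assumptions, since this reference does not say which value the real function pads with.

```python
def pad_batch(batch, pad_value=0.0):
    """Pad items of shape (1, seq_len, d_model) to (batch_size, max_seq_len, d_model)."""
    # Drop the leading singleton axis of every item.
    seqs = [item[0] for item in batch]
    max_len = max(len(seq) for seq in seqs)
    d_model = len(seqs[0][0])
    padded = []
    for seq in seqs:
        # Copy the real positions, then append constant rows up to max_len.
        rows = [list(row) for row in seq]
        rows += [[pad_value] * d_model for _ in range(max_len - len(seq))]
        padded.append(rows)
    return padded


# Two toy "embeddings" with d_model=3 and sequence lengths 2 and 4.
a = [[[1.0, 1.0, 1.0]] * 2]
b = [[[2.0, 2.0, 2.0]] * 4]
out = pad_batch([a, b])
print(len(out), len(out[0]), len(out[0][0]))  # batch_size, max_seq_len, d_model
```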
- fennomix_mhc.mhc_binding_model.embed_hla_esm_list(hla_encoder, hla_esm_list, batch_size=200, device=None, verbose=False)
Generates fixed-size embeddings for a list of HLA ESM features.
- Parameters:
hla_encoder (ModelHlaEncoder) – Trained HLA encoder model.
hla_esm_list (list[ndarray]) – List of raw ESM embeddings for HLA alleles.
batch_size (int) – Inference batch size.
device (str | device | None) – Device to use. Auto-detected if None.
verbose (bool) – Show progress bar.
- Return type:
ndarray
- Returns:
Array of shape (num_hla, d_model) containing encoded HLA embeddings.
- fennomix_mhc.mhc_binding_model.embed_peptides(pept_encoder, seqs, d_model=480, batch_size=512, device=None, verbose=False)
Encodes a list of peptide sequences into embeddings.
- Parameters:
pept_encoder (ModelSeqEncoder) – Trained peptide encoder model.
seqs (list[str]) – List of peptide strings.
d_model (int) – Expected embedding dimension.
batch_size (int) – Inference batch size.
device (str | device | None) – Device to use (auto-detected if None).
verbose (bool) – Show progress bar.
- Return type:
ndarray
- Returns:
Array of shape (num_peptides, d_model) with peptide embeddings.
- fennomix_mhc.mhc_binding_model.get_ascii_indices(seq_array)
Converts a list of peptide sequences into ASCII-encoded index tensors.
Each character in the peptide string is represented by its ASCII code, reshaped into a 2D tensor.
- Parameters:
seq_array (list[str]) – List of peptide sequence strings (e.g., [‘GLCTLVAML’, …]).
- Return type:
LongTensor
- Returns:
A tensor of shape (batch_size, sequence_length), dtype=torch.long.
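The ASCII encoding described above amounts to mapping every residue letter to its character code. A minimal sketch follows, using nested lists in place of a LongTensor; the helper name ascii_indices and the equal-length check are assumptions made for illustration (a rectangular (batch_size, sequence_length) result implies that peptides in one call share a length).

```python
def ascii_indices(seqs):
    """Return one row of ASCII codes per peptide (all peptides must share a length)."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        # Assumption: a rectangular result needs equal-length peptides per call.
        raise ValueError("all peptides in a batch must have the same length")
    return [[ord(ch) for ch in s] for s in seqs]


rows = ascii_indices(["GLCTLVAML", "NLVPMVATV"])
print(len(rows), len(rows[0]))  # batch_size, sequence_length
print(rows[0][:3])              # codes for 'G', 'L', 'C'
```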
- fennomix_mhc.mhc_binding_model.get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5, last_epoch=-1)
Creates a learning rate scheduler with linear warmup and cosine decay.
The learning rate rises linearly during warmup steps, then follows a cosine decay curve. Useful for stabilizing early training.
- Parameters:
optimizer (Optimizer) – Optimizer to wrap with the scheduler.
num_warmup_steps (int) – Number of steps for linear warmup.
num_training_steps (int) – Total number of training steps.
num_cycles (float) – Number of cosine cycles (default 0.5 for half-cycle).
last_epoch (int) – Index of last epoch (-1 for new training).
- Returns:
A PyTorch learning rate scheduler.
- Return type:
LambdaLR
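The warmup-then-cosine shape can be written as a single multiplier applied to the base learning rate. The sketch below assumes the common formulation used by the Hugging Face transformers function of the same name; whether this module uses exactly that expression is an assumption, and lr_multiplier is a hypothetical helper name.

```python
import math


def lr_multiplier(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Cosine decay; num_cycles=0.5 sweeps half a cosine period, from 1 down to 0.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * 2.0 * num_cycles * progress)))


# Warmup of 10 steps out of 100: start, mid-warmup, peak, mid-decay, end.
for step in (0, 5, 10, 55, 100):
    print(step, round(lr_multiplier(step, 10, 100), 4))
```

A function of this kind is what torch.optim.lr_scheduler.LambdaLR expects as its lr_lambda argument, which is consistent with the LambdaLR return type listed above.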
- fennomix_mhc.mhc_binding_model.get_hla_dataloader(dataset, batch_size, shuffle)
Creates a DataLoader for HlaDataSet with custom collation.
- Parameters:
dataset (HlaDataSet) – The dataset to load.
batch_size (int) – Number of samples per batch.
shuffle (bool) – Whether to shuffle data each epoch.
- Return type:
DataLoader
- Returns:
A DataLoader with pept_hla_collate as collate_fn.
- fennomix_mhc.mhc_binding_model.pept_hla_collate(batch)
Collate function for creating batches from HlaDataSet.
Handles variable-length HLA embeddings and ASCII-encodes peptides.
- Parameters:
batch (list[tuple[ndarray, str, str]]) – List of tuples (hla_embed, pos_peptide, neg_peptide).
- Returns:
A tuple of:
hla_tensor: Padded HLA embeddings.
pos_pept_tensor: ASCII-encoded positive peptides.
neg_pept_tensor: ASCII-encoded negative peptides.
- fennomix_mhc.mhc_binding_model.test(test_df, test_allele_list, hla_encoder, pept_encoder, hla_df, hla_esm_list, fasta_list)
Evaluates model performance on test alleles using rank-based recall.
- Parameters:
test_df (DataFrame) – DataFrame with test peptide-allele pairs.
test_allele_list – List of HLA alleles to evaluate.
hla_encoder (ModelHlaEncoder) – Trained HLA encoder.
pept_encoder (ModelSeqEncoder) – Trained peptide encoder.
hla_df (DataFrame) – HLA metadata DataFrame.
hla_esm_list (list[ndarray]) – List of raw HLA ESM embeddings.
fasta_list (list[str]) – List of protein FASTA file paths.
- Return type:
tuple[float, float, float]
- Returns:
Tuple of mean recall rates at rank < 0.1, < 0.5, and < 2.0.
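As an illustration of what recall at rank thresholds means, the sketch below takes already-computed ranks for the known binders of each allele and reports the averaged fraction falling under each threshold. How test() derives the ranks is not described here; treating them as percentile ranks against background peptides, averaging per allele with equal weight, and the names recall_at_ranks and mean_recall are all assumptions. The rank values are invented.

```python
def recall_at_ranks(ranks, thresholds=(0.1, 0.5, 2.0)):
    """Fraction of true binders whose rank falls below each threshold."""
    return tuple(sum(r < t for r in ranks) / len(ranks) for t in thresholds)


def mean_recall(ranks_by_allele, thresholds=(0.1, 0.5, 2.0)):
    """Average the per-allele recalls so every allele counts equally."""
    per_allele = [recall_at_ranks(r, thresholds) for r in ranks_by_allele.values()]
    return tuple(sum(col) / len(per_allele) for col in zip(*per_allele))


# Hypothetical ranks of known binders for two alleles.
ranks_by_allele = {
    "A*02:01": [0.05, 0.3, 1.0, 5.0],
    "B*07:02": [0.01, 0.02, 0.4, 1.5],
}
print(mean_recall(ranks_by_allele))
```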
- fennomix_mhc.mhc_binding_model.train(hla_encoder, pept_encoder, dataset, batch_size=256, lr=0.0001, epoch=100, warmup_epoch=20, verbose=True, device='cuda', test_bundle=None, neptune_run=None)
Train the peptide/HLA encoders.
- Parameters:
hla_encoder (ModelHlaEncoder) – Encoder for HLA embeddings.
pept_encoder (ModelSeqEncoder) – Encoder for peptide sequences.
dataset (HlaDataSet) – Training dataset.
batch_size (int) – Number of samples per batch.
lr (float) – Learning rate for the optimizer.
epoch (int) – Total number of epochs.
warmup_epoch (int) – Number of warmup epochs for the scheduler.
verbose (bool) – Whether to print training progress.
device (str) – Device identifier for torch.device.
test_bundle (tuple | None) – Optional tuple of test data passed to test().
neptune_run – Optional Neptune experiment for logging.
- Return type:
None