fennomix_mhc.mhc_binding_model module

Classes:

HlaDataSet(hla_df, hla_esm_list, pept_df, ...)

Dataset providing paired HLA embeddings and peptides for training.

ModelHlaEncoder([d_model, layer_num, dropout])

Transformer-based encoder for HLA embeddings.

ModelSeqEncoder([d_model, layer_num, dropout])

Transformer-based encoder for peptide sequences.

SiameseCELoss()

Contrastive Siamese loss for HLA-peptide similarity learning.

Functions:

batchify_hla_esm_list(batch_esm_list)

Converts a list of variable-length HLA ESM embeddings into a padded tensor.

embed_hla_esm_list(hla_encoder, hla_esm_list)

Generates fixed-size embeddings for a list of HLA ESM features.

embed_peptides(pept_encoder, seqs[, ...])

Encodes a list of peptide sequences into embeddings.

get_ascii_indices(seq_array)

Converts a list of peptide sequences into ASCII-encoded index tensors.

get_cosine_schedule_with_warmup(optimizer, ...)

Creates a learning rate scheduler with linear warmup and cosine decay.

get_hla_dataloader(dataset, batch_size, shuffle)

Creates a DataLoader for HlaDataSet with custom collation.

pept_hla_collate(batch)

Collate function for creating batches from HlaDataSet.

test(test_df, test_allele_list, hla_encoder, ...)

Evaluates model performance on test alleles using rank-based recall.

train(hla_encoder, pept_encoder, dataset[, ...])

Train the peptide/HLA encoders.

class fennomix_mhc.mhc_binding_model.HlaDataSet(hla_df, hla_esm_list, pept_df, protein_data, min_peptide_len=8, max_peptide_len=14)

Bases: Dataset

Dataset providing paired HLA embeddings and peptides for training.

Methods:

__init__(hla_df, hla_esm_list, pept_df, ...)

Initialize the dataset.

get_allele_embed(index)

Get the HLA embedding associated with the peptide at the given index.

get_neg_pept()

Sample a negative peptide sequence.

__init__(hla_df, hla_esm_list, pept_df, protein_data, min_peptide_len=8, max_peptide_len=14)

Initialize the dataset.

Parameters:
  • hla_df (DataFrame) – DataFrame with HLA information; must have ‘allele’ column.

  • hla_esm_list (list[ndarray]) – List of HLA ESM embeddings corresponding to hla_df rows.

  • pept_df (DataFrame | None) – Peptide DataFrame with columns ‘sequence’ and ‘allele’.

  • protein_data (DataFrame | list | str) – Protein FASTA path(s) or DataFrame to generate negatives.

  • min_peptide_len (int) – Minimum length for random digestion.

  • max_peptide_len (int) – Maximum length for random digestion.

get_allele_embed(index)

Get the HLA embedding associated with the peptide at the given index.

Parameters:

index (int) – Index of the peptide.

Return type:

ndarray

Returns:

Corresponding HLA embedding.

get_neg_pept()

Sample a negative peptide sequence.

Return type:

str

Returns:

Random peptide string from the dataset or digested proteins.
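
The "random digestion" controlled by min_peptide_len and max_peptide_len can be pictured as cutting a random window out of a protein sequence. The sketch below is only an illustration under that assumption: sample_random_peptide and the example protein string are hypothetical and are not part of fennomix_mhc, and the actual sampling logic of get_neg_pept() may differ.

```python
import random


def sample_random_peptide(protein, min_len=8, max_len=14, rng=random):
    """Return a random substring of `protein` with length in [min_len, max_len].

    Hypothetical helper for illustration; not the library's implementation.
    """
    if len(protein) < min_len:
        raise ValueError("protein is shorter than min_len")
    # Pick a length first, capped by the protein length, then a start position.
    length = rng.randint(min_len, min(max_len, len(protein)))
    start = rng.randint(0, len(protein) - length)
    return protein[start:start + length]


rng = random.Random(0)  # seeded generator so the example is repeatable
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up sequence
peptide = sample_random_peptide(protein, rng=rng)
```

Sampling negatives from background proteins in this way assumes that a randomly chosen window is unlikely to be a true binder of the paired allele.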

class fennomix_mhc.mhc_binding_model.ModelHlaEncoder(d_model=480, layer_num=1, dropout=0.2)

Bases: Module

Transformer-based encoder for HLA embeddings.

Methods:

__init__([d_model, layer_num, dropout])

Initialize the HLA encoder.

forward(x)

Encodes variable-length HLA embeddings into fixed-size vectors.

__init__(d_model=480, layer_num=1, dropout=0.2)

Initialize the HLA encoder.

Parameters:
  • d_model (int) – Embedding dimension.

  • layer_num (int) – Number of Transformer layers.

  • dropout (float) – Dropout rate for Transformer layers.

forward(x)

Encodes variable-length HLA embeddings into fixed-size vectors.

Parameters:

x (Tensor) – Input tensor of shape (batch_size, seq_len, d_model), typically from ESM models.

Return type:

Tensor

Returns:

Normalized embedding tensor of shape (batch_size, d_model).

class fennomix_mhc.mhc_binding_model.ModelSeqEncoder(d_model=480, layer_num=4, dropout=0.2)

Bases: Module

Transformer-based encoder for peptide sequences.

Methods:

__init__([d_model, layer_num, dropout])

Initialize the sequence encoder.

forward(aa_idxes)

Encode peptide sequences to embeddings.

__init__(d_model=480, layer_num=4, dropout=0.2)

Initialize the sequence encoder.

Parameters:
  • d_model (int) – Embedding dimension.

  • layer_num (int) – Number of Transformer layers.

  • dropout (float) – Dropout rate for Transformer layers.

forward(aa_idxes)

Encode peptide sequences to embeddings.

Parameters:

aa_idxes (Tensor) – Tensor of shape (batch_size, seq_len) with ASCII indices.

Return type:

Tensor

Returns:

Normalized embedding tensor of shape (batch_size, d_model).

class fennomix_mhc.mhc_binding_model.SiameseCELoss

Bases: object

Contrastive Siamese loss for HLA-peptide similarity learning.

Pulls the embeddings of positive HLA-peptide pairs together and pushes those of negative pairs apart, using a margin-based contrastive loss (see the margin attribute).

Methods:

get_loss(hla_x, x[, y])

Computes contrastive loss for one pair.

Attributes:

get_loss(hla_x, x, y=1.0)

Computes contrastive loss for one pair.

Parameters:
  • hla_x (Tensor) – HLA embedding tensor.

  • x (Tensor) – Peptide embedding tensor.

  • y (float) – Label (1.0 for positive pair, 0.0 for negative).

Return type:

Tensor

Returns:

Scalar loss tensor.

margin: float = 1
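
This page does not give the exact formula behind get_loss(). As a rough illustration of what "margin-based contrastive loss" usually means, the sketch below implements the textbook formulation on plain Python lists: positive pairs are penalized by their squared distance, negative pairs only while they are closer than the margin. The function contrastive_loss and the use of Euclidean distance are assumptions made for illustration; the class name suggests the library may combine this with a cross-entropy term, which is not shown here.

```python
import math


def contrastive_loss(hla_vec, pept_vec, y=1.0, margin=1.0):
    """Textbook contrastive loss for one pair (illustrative, not the library code).

    y = 1.0 marks a positive pair, y = 0.0 a negative pair.
    """
    # Euclidean distance between the two embeddings (an assumed choice of metric).
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(hla_vec, pept_vec)))
    pos_term = y * dist ** 2                              # pull positives together
    neg_term = (1.0 - y) * max(0.0, margin - dist) ** 2   # push negatives past the margin
    return pos_term + neg_term


hla = [1.0, 0.0]
close_pept = [1.0, 0.0]   # identical to the HLA embedding
far_pept = [0.0, 1.0]     # distance sqrt(2), which exceeds the margin of 1

loss_pos_close = contrastive_loss(hla, close_pept, y=1.0)
loss_neg_close = contrastive_loss(hla, close_pept, y=0.0)
loss_neg_far = contrastive_loss(hla, far_pept, y=0.0)
```

In this formulation a negative pair contributes nothing once it is farther apart than the margin, so the gradient concentrates on negatives that are still confusable with binders.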

fennomix_mhc.mhc_binding_model.batchify_hla_esm_list(batch_esm_list)

Converts a list of variable-length HLA ESM embeddings into a padded tensor.

Parameters:

batch_esm_list (list[ndarray]) – List of arrays, each of shape (1, seq_len, d_model).

Return type:

Tensor

Returns:

Padded tensor of shape (batch_size, max_seq_len, d_model).
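
The shape change described above can be sketched with NumPy. The helper pad_embeddings below is hypothetical and assumes zero-padding on the right; the library function returns a torch Tensor and its padding value is not stated on this page.

```python
import numpy as np


def pad_embeddings(batch_esm_list):
    """Stack arrays of shape (1, seq_len, d_model) into (batch, max_seq_len, d_model).

    Illustrative helper; shorter sequences are right-padded with zeros (an assumption).
    """
    max_len = max(arr.shape[1] for arr in batch_esm_list)
    d_model = batch_esm_list[0].shape[2]
    padded = np.zeros((len(batch_esm_list), max_len, d_model), dtype=np.float32)
    for i, arr in enumerate(batch_esm_list):
        # arr[0] drops the leading singleton dimension.
        padded[i, : arr.shape[1], :] = arr[0]
    return padded


batch = [
    np.ones((1, 3, 4), dtype=np.float32),  # HLA with 3 residues
    np.ones((1, 5, 4), dtype=np.float32),  # HLA with 5 residues
]
padded = pad_embeddings(batch)  # shape (2, 5, 4)
```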

fennomix_mhc.mhc_binding_model.embed_hla_esm_list(hla_encoder, hla_esm_list, batch_size=200, device=None, verbose=False)

Generates fixed-size embeddings for a list of HLA ESM features.

Parameters:
  • hla_encoder (ModelHlaEncoder) – Trained HLA encoder model.

  • hla_esm_list (list[ndarray]) – List of raw ESM embeddings for HLA alleles.

  • batch_size (int) – Inference batch size.

  • device (str | device | None) – Device to use. Auto-detected if None.

  • verbose (bool) – Show progress bar.

Return type:

ndarray

Returns:

Array of shape (num_hla, d_model) containing encoded HLA embeddings.

fennomix_mhc.mhc_binding_model.embed_peptides(pept_encoder, seqs, d_model=480, batch_size=512, device=None, verbose=False)

Encodes a list of peptide sequences into embeddings.

Parameters:
  • pept_encoder (ModelSeqEncoder) – Trained peptide encoder model.

  • seqs (list[str]) – List of peptide strings.

  • d_model (int) – Expected embedding dimension.

  • batch_size (int) – Inference batch size.

  • device (str | device | None) – Device to use (auto-detected if None).

  • verbose (bool) – Show progress bar.

Return type:

ndarray

Returns:

Array of shape (num_peptides, d_model) with peptide embeddings.

fennomix_mhc.mhc_binding_model.get_ascii_indices(seq_array)

Converts a list of peptide sequences into ASCII-encoded index tensors.

Each character of a peptide string is represented by its ASCII code, and the codes are arranged into a 2D tensor with one row per peptide.

Parameters:

seq_array (list[str]) – List of peptide sequence strings (e.g., [‘GLCTLVAML’, …]).

Return type:

LongTensor

Returns:

A tensor of shape (batch_size, sequence_length), dtype=torch.long.
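
The encoding itself is easy to reproduce with the built-in ord(). The sketch below uses nested Python lists rather than a torch.LongTensor, and ascii_indices is a hypothetical stand-in for the library function. It also assumes that all peptides in one call have the same length, which the rectangular output shape implies but this page does not state outright.

```python
def ascii_indices(seqs):
    """Map each peptide to a row of ASCII codes (illustrative stand-in)."""
    # A rectangular (batch_size, sequence_length) result needs equal lengths.
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all peptides must have the same length")
    return [[ord(ch) for ch in s] for s in seqs]


idx = ascii_indices(["GLCTLVAML", "NLVPMVATV"])
# idx[0] starts with 71, 76, 67 for 'G', 'L', 'C'
```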

fennomix_mhc.mhc_binding_model.get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5, last_epoch=-1)

Creates a learning rate scheduler with linear warmup and cosine decay.

The learning rate rises linearly during warmup steps, then follows a cosine decay curve. Useful for stabilizing early training.

Parameters:
  • optimizer (Optimizer) – Optimizer to wrap with the scheduler.

  • num_warmup_steps (int) – Number of steps for linear warmup.

  • num_training_steps (int) – Total number of training steps.

  • num_cycles (float) – Number of cosine cycles (default 0.5 for half-cycle).

  • last_epoch (int) – Index of last epoch (-1 for new training).

Returns:

A PyTorch learning rate scheduler.

Return type:

LambdaLR
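
The shape of such a schedule can be written as a plain function that returns the factor by which the base learning rate is multiplied at a given step. The sketch below follows the widely used formulation of this schedule; whether fennomix_mhc uses exactly this expression is an assumption, and lr_multiplier is a hypothetical name. A function of this kind is what is typically passed to LambdaLR as lr_lambda.

```python
import math


def lr_multiplier(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Learning-rate factor at `step`: linear warmup, then cosine decay (a sketch)."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 over the warmup steps.
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, from 0 to 1.
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # With num_cycles=0.5 the cosine covers half a period, falling from 1 to 0.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


# Example: 20 warmup steps out of 100 total steps.
multipliers = [lr_multiplier(s, 20, 100) for s in range(101)]
```

With these settings the factor is 0 at step 0, reaches 1 at step 20, passes through 0.5 halfway through the decay phase (step 60), and returns to 0 at step 100.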

fennomix_mhc.mhc_binding_model.get_hla_dataloader(dataset, batch_size, shuffle)

Creates a DataLoader for HlaDataSet with custom collation.

Parameters:
  • dataset (HlaDataSet) – The dataset to load.

  • batch_size (int) – Number of samples per batch.

  • shuffle (bool) – Whether to shuffle data each epoch.

Return type:

DataLoader

Returns:

A DataLoader with pept_hla_collate as collate_fn.

fennomix_mhc.mhc_binding_model.pept_hla_collate(batch)

Collate function for creating batches from HlaDataSet.

Handles variable-length HLA embeddings and ASCII-encodes peptides.

Parameters:

batch (list[tuple[ndarray, str, str]]) – List of tuples (hla_embed, pos_peptide, neg_peptide).

Returns:

A tuple of:

  • hla_tensor: Padded HLA embeddings.

  • pos_pept_tensor: ASCII-encoded positive peptides.

  • neg_pept_tensor: ASCII-encoded negative peptides.

fennomix_mhc.mhc_binding_model.test(test_df, test_allele_list, hla_encoder, pept_encoder, hla_df, hla_esm_list, fasta_list)

Evaluates model performance on test alleles using rank-based recall.

Parameters:
  • test_df (DataFrame) – DataFrame with test peptide-allele pairs.

  • test_allele_list – List of HLA alleles to evaluate.

  • hla_encoder (ModelHlaEncoder) – Trained HLA encoder.

  • pept_encoder (ModelSeqEncoder) – Trained peptide encoder.

  • hla_df (DataFrame) – HLA metadata DataFrame.

  • hla_esm_list (list[ndarray]) – List of raw HLA ESM embeddings.

  • fasta_list (list[str]) – List of protein FASTA file paths.

Return type:

tuple[float, float, float]

Returns:

Tuple of mean recall rates at rank < 0.1, < 0.5, and < 2.0.
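
This page does not define how the rank is computed. One common reading, used here purely as an assumption, is a percent rank: the share of background peptides (for example, peptides drawn from the FASTA proteins) that score better than the test peptide for the same allele. Recall at a threshold is then the fraction of true test peptides whose rank falls below it. The helpers percent_rank and recall_at, and all numbers in the example, are hypothetical.

```python
from bisect import bisect_left


def percent_rank(distance, sorted_background):
    """Percentage of background distances strictly smaller than `distance`."""
    return 100.0 * bisect_left(sorted_background, distance) / len(sorted_background)


def recall_at(ranks, thresholds=(0.1, 0.5, 2.0)):
    """Fraction of test peptides whose rank is below each threshold."""
    return tuple(sum(r < t for r in ranks) / len(ranks) for t in thresholds)


# Made-up background distances 0.000, 0.001, ..., 0.999 (smaller means more similar).
background = sorted(i / 1000 for i in range(1000))
# Made-up HLA-peptide distances for four known binders of one allele.
test_distances = [0.0, 0.0035, 0.0155, 0.5005]

ranks = [percent_rank(d, background) for d in test_distances]
recalls = recall_at(ranks)  # (0.25, 0.5, 0.75)
```

Under this reading, the mean reported by test() would correspond to averaging such per-allele recalls over test_allele_list.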

fennomix_mhc.mhc_binding_model.train(hla_encoder, pept_encoder, dataset, batch_size=256, lr=0.0001, epoch=100, warmup_epoch=20, verbose=True, device='cuda', test_bundle=None, neptune_run=None)

Train the peptide/HLA encoders.

Parameters:
  • hla_encoder (ModelHlaEncoder) – Encoder for HLA embeddings.

  • pept_encoder (ModelSeqEncoder) – Encoder for peptide sequences.

  • dataset (HlaDataSet) – Training dataset.

  • batch_size (int) – Number of samples per batch.

  • lr (float) – Learning rate for the optimizer.

  • epoch (int) – Total number of epochs.

  • warmup_epoch (int) – Number of warmup epochs for the scheduler.

  • verbose (bool) – Whether to print training progress.

  • device (str) – Device identifier for torch.device.

  • test_bundle (tuple | None) – Optional tuple of test data passed to test().

  • neptune_run – Optional Neptune experiment for logging.

Return type:

None