fennomix_mhc.mhc_binding_model module

Classes:

`HlaDataSet`(hla_df, hla_esm_list, pept_df, ...)	Dataset providing paired HLA embeddings and peptides for training.
`ModelHlaEncoder`([d_model, layer_num, dropout])	Transformer-based encoder for HLA embeddings.
`ModelSeqEncoder`([d_model, layer_num, dropout])	Transformer-based encoder for peptide sequences.
`SiameseCELoss`()	Contrastive Siamese loss for HLA-peptide similarity learning.

Functions:

`batchify_hla_esm_list`(batch_esm_list)	Converts a list of variable-length HLA ESM embeddings into a padded tensor.
`embed_hla_esm_list`(hla_encoder, hla_esm_list)	Generates fixed-size embeddings for a list of HLA ESM features.
`embed_peptides`(pept_encoder, seqs[, ...])	Encodes a list of peptide sequences into embeddings.
`get_ascii_indices`(seq_array)	Converts a list of peptide sequences into ASCII-encoded index tensors.
`get_cosine_schedule_with_warmup`(optimizer, ...)	Creates a learning rate scheduler with linear warmup and cosine decay.
`get_hla_dataloader`(dataset, batch_size, shuffle)	Creates a DataLoader for HlaDataSet with custom collation.
`pept_hla_collate`(batch)	Collate function for creating batches from HlaDataSet.
`test`(test_df, test_allele_list, hla_encoder, ...)	Evaluates model performance on test alleles using rank-based recall.
`train`(hla_encoder, pept_encoder, dataset[, ...])	Train the peptide/HLA encoders.

class fennomix_mhc.mhc_binding_model.HlaDataSet(hla_df, hla_esm_list, pept_df, protein_data, min_peptide_len=8, max_peptide_len=14)

Bases: Dataset

Dataset providing paired HLA embeddings and peptides for training.

Methods:

`__init__`(hla_df, hla_esm_list, pept_df, ...)	Initialize the dataset.
`get_allele_embed`(index)	Get HLA embedding for a specific peptide.
`get_neg_pept`()	Sample a negative peptide sequence.

__init__(hla_df, hla_esm_list, pept_df, protein_data, min_peptide_len=8, max_peptide_len=14)

Initialize the dataset.

Parameters:

hla_df (DataFrame) – DataFrame with HLA information; must have ‘allele’ column.
hla_esm_list (list[ndarray]) – List of HLA ESM embeddings corresponding to hla_df rows.
pept_df (DataFrame | None) – Peptide DataFrame with columns ‘sequence’ and ‘allele’.
protein_data (DataFrame | list | str) – Protein FASTA path(s) or a protein DataFrame used to generate negative peptides.
min_peptide_len (int) – Minimum peptide length for random digestion.
max_peptide_len (int) – Maximum peptide length for random digestion.

get_allele_embed(index)

Get HLA embedding for a specific peptide.

Parameters: index (int) – Index of the peptide.
Return type: ndarray
Returns: Corresponding HLA embedding.

get_neg_pept()

Sample a negative peptide sequence.

Return type: str
Returns: Random peptide string from the dataset or digested proteins.
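The "random digestion" used for negatives can be pictured as cutting a random window out of a protein sequence. The following is a minimal sketch under that assumption; `sample_random_peptide` and the example protein string are hypothetical and are not part of the module, whose actual sampling logic may differ.

```python
import random


def sample_random_peptide(protein_seq, min_peptide_len=8, max_peptide_len=14, rng=random):
    """Hypothetical helper: cut one random peptide out of a protein sequence."""
    if len(protein_seq) < min_peptide_len:
        raise ValueError("protein is shorter than min_peptide_len")
    # Draw a length in [min_peptide_len, max_peptide_len], capped by the protein length.
    length = rng.randint(min_peptide_len, min(max_peptide_len, len(protein_seq)))
    # Draw a start position so that the window fits inside the protein.
    start = rng.randint(0, len(protein_seq) - length)
    return protein_seq[start:start + length]


rng = random.Random(0)  # seeded generator for reproducibility
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up sequence for illustration
neg = sample_random_peptide(protein, rng=rng)
```

The defaults mirror the `min_peptide_len=8` and `max_peptide_len=14` arguments of `HlaDataSet`.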

class fennomix_mhc.mhc_binding_model.ModelHlaEncoder(d_model=480, layer_num=1, dropout=0.2)

Bases: Module

Transformer-based encoder for HLA embeddings.

Methods:

`__init__`([d_model, layer_num, dropout])	Initialize the HLA encoder.
`forward`(x)	Encodes variable-length HLA embeddings into fixed-size vectors.

__init__(d_model=480, layer_num=1, dropout=0.2)

Initialize the HLA encoder.

Parameters:

d_model (int) – Embedding dimension.
layer_num (int) – Number of Transformer layers.
dropout (float) – Dropout rate for Transformer layers.

forward(x)

Encodes variable-length HLA embeddings into fixed-size vectors.

Parameters: x (Tensor) – Input tensor of shape (batch_size, seq_len, d_model), typically from ESM models.
Return type: Tensor
Returns: Normalized embedding tensor of shape (batch_size, d_model).

class fennomix_mhc.mhc_binding_model.ModelSeqEncoder(d_model=480, layer_num=4, dropout=0.2)

Bases: Module

Transformer-based encoder for peptide sequences.

Methods:

`__init__`([d_model, layer_num, dropout])	Initialize the sequence encoder.
`forward`(aa_idxes)	Encode peptide sequences to embeddings.

__init__(d_model=480, layer_num=4, dropout=0.2)

Initialize the sequence encoder.

Parameters:

d_model (int) – Embedding dimension.
layer_num (int) – Number of Transformer layers.
dropout (float) – Dropout rate for Transformer layers.

forward(aa_idxes)

Encode peptide sequences to embeddings.

Parameters: aa_idxes (Tensor) – Tensor of shape (batch_size, seq_len) with ASCII indices.
Return type: Tensor
Returns: Normalized embedding tensor of shape (batch_size, d_model).

class fennomix_mhc.mhc_binding_model.SiameseCELoss

Bases: object

Contrastive Siamese loss for HLA-peptide similarity learning.

Pulls the embeddings of positive HLA-peptide pairs closer together and pushes negative pairs apart, using a margin-based contrastive loss.

Methods:

`get_loss`(hla_x, x[, y])	Computes contrastive loss for one pair.

Attributes:

margin

get_loss(hla_x, x, y=1.0)

Computes contrastive loss for one pair.

Parameters:

hla_x (Tensor) – HLA embedding tensor.
x (Tensor) – Peptide embedding tensor.
y (float) – Label (1.0 for positive pair, 0.0 for negative).

Return type:

Tensor

Returns:

Scalar loss tensor.

margin: float = 1
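For orientation, the textbook margin-based contrastive loss for one pair can be sketched as below. This is an illustration only: `contrastive_loss` is a hypothetical function, the use of Euclidean distance is an assumption, and the exact formula inside `SiameseCELoss.get_loss` is not stated on this page and may differ.

```python
import math


def contrastive_loss(hla_vec, pept_vec, y=1.0, margin=1.0):
    """Hypothetical textbook contrastive loss for a single embedding pair."""
    # Euclidean distance between the two embeddings (assumed metric).
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(hla_vec, pept_vec)))
    # Positive pairs (y = 1) are penalised by their squared distance.
    pos_term = y * dist ** 2
    # Negative pairs (y = 0) are penalised only while closer than the margin.
    neg_term = (1.0 - y) * max(0.0, margin - dist) ** 2
    return pos_term + neg_term


same_pos = contrastive_loss([1.0, 0.0], [1.0, 0.0], y=1.0)   # identical, positive pair
same_neg = contrastive_loss([1.0, 0.0], [1.0, 0.0], y=0.0)   # identical, negative pair
far_neg = contrastive_loss([1.0, 0.0], [-1.0, 0.0], y=0.0)   # distance 2 exceeds margin
```

With `margin=1`, a negative pair stops contributing to the loss once its distance reaches 1, while a positive pair is always pulled toward distance 0.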

fennomix_mhc.mhc_binding_model.batchify_hla_esm_list(batch_esm_list)

Converts a list of variable-length HLA ESM embeddings into a padded tensor.

Parameters: batch_esm_list (list[ndarray]) – List of arrays, each of shape (1, seq_len, d_model).
Return type: Tensor
Returns: Padded tensor of shape (batch_size, max_seq_len, d_model).
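The shape transformation described above can be sketched with NumPy. `pad_esm_batch` is a hypothetical stand-in that returns an ndarray rather than a Tensor, and zero is assumed as the padding value; the module's own padding value is not documented here.

```python
import numpy as np


def pad_esm_batch(batch_esm_list):
    """Hypothetical sketch: stack (1, seq_len, d_model) arrays into one padded array."""
    d_model = batch_esm_list[0].shape[-1]
    max_len = max(arr.shape[1] for arr in batch_esm_list)
    # Allocate the output filled with zeros (assumed padding value).
    padded = np.zeros((len(batch_esm_list), max_len, d_model), dtype=np.float32)
    for i, arr in enumerate(batch_esm_list):
        # Drop the leading singleton axis and copy into the first seq_len rows.
        padded[i, : arr.shape[1], :] = arr[0]
    return padded


batch = [
    np.ones((1, 3, 4), dtype=np.float32),  # HLA with 3 residues
    np.ones((1, 5, 4), dtype=np.float32),  # HLA with 5 residues
]
padded = pad_esm_batch(batch)  # shape (2, 5, 4)
```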

fennomix_mhc.mhc_binding_model.embed_hla_esm_list(hla_encoder, hla_esm_list, batch_size=200, device=None, verbose=False)

Generates fixed-size embeddings for a list of HLA ESM features.

Parameters:

hla_encoder (ModelHlaEncoder) – Trained HLA encoder model.
hla_esm_list (list[ndarray]) – List of raw ESM embeddings for HLA alleles.
batch_size (int) – Inference batch size.
device (str | device | None) – Device to use. Auto-detected if None.
verbose (bool) – Show progress bar.

Return type:

ndarray

Returns:

Array of shape (num_hla, d_model) containing encoded HLA embeddings.

fennomix_mhc.mhc_binding_model.embed_peptides(pept_encoder, seqs, d_model=480, batch_size=512, device=None, verbose=False)

Encodes a list of peptide sequences into embeddings.

Parameters:

pept_encoder (ModelSeqEncoder) – Trained peptide encoder model.
seqs (list[str]) – List of peptide strings.
d_model (int) – Expected embedding dimension.
batch_size (int) – Inference batch size.
device (str | device | None) – Device to use (auto-detected if None).
verbose (bool) – Show progress bar.

Return type:

ndarray

Returns:

Array of shape (num_peptides, d_model) with peptide embeddings.

fennomix_mhc.mhc_binding_model.get_ascii_indices(seq_array)

Converts a list of peptide sequences into ASCII-encoded index tensors.

Each character in the peptide string is represented by its ASCII code, reshaped into a 2D tensor.

Parameters: seq_array (list[str]) – List of peptide sequence strings (e.g., [‘GLCTLVAML’, …]).
Return type: LongTensor
Returns: A tensor of shape (batch_size, sequence_length), dtype=torch.long.
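The encoding itself can be illustrated in plain Python. `ascii_indices` below is a hypothetical stand-in that returns nested lists instead of a LongTensor, and it assumes all sequences in a batch share one length, as the 2D output shape suggests.

```python
def ascii_indices(seq_array):
    """Hypothetical sketch: map each residue letter to its ASCII code."""
    lengths = {len(seq) for seq in seq_array}
    if len(lengths) > 1:
        # A rectangular (batch_size, sequence_length) result needs equal lengths.
        raise ValueError("all sequences must have the same length")
    return [[ord(ch) for ch in seq] for seq in seq_array]


idx = ascii_indices(["GLCTLVAML", "NLVPMVATV"])
# idx[0] == [71, 76, 67, 84, 76, 86, 65, 77, 76]
```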

fennomix_mhc.mhc_binding_model.get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5, last_epoch=-1)

Creates a learning rate scheduler with linear warmup and cosine decay.

The learning rate rises linearly during warmup steps, then follows a cosine decay curve. Useful for stabilizing early training.

Parameters:

optimizer (Optimizer) – Optimizer to wrap with the scheduler.
num_warmup_steps (int) – Number of steps for linear warmup.
num_training_steps (int) – Total number of training steps.
num_cycles (float) – Number of cosine cycles (default 0.5 for half-cycle).
last_epoch (int) – Index of last epoch (-1 for new training).

Return type:

LambdaLR

Returns:

A PyTorch learning rate scheduler.
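A `LambdaLR` scheduler multiplies the optimizer's base learning rate by a factor computed from the current step. The sketch below shows one plausible form of that factor, assuming the module follows the widely used warmup-plus-cosine formulation with the same parameter names; `lr_multiplier` is a hypothetical name and the module's exact lambda is not shown on this page.

```python
import math


def lr_multiplier(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Hypothetical sketch of the per-step learning rate factor."""
    if step < num_warmup_steps:
        # Linear warmup from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Cosine decay; num_cycles=0.5 gives a single half-cosine from 1 down to 0.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


# Factors at the start, mid-warmup, end of warmup, mid-decay, and final step.
multipliers = [lr_multiplier(s, 20, 100) for s in (0, 10, 20, 60, 100)]
```

With 20 warmup steps out of 100, the factor rises to 1 at step 20, passes 0.5 at step 60, and reaches 0 at step 100.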

fennomix_mhc.mhc_binding_model.get_hla_dataloader(dataset, batch_size, shuffle)

Creates a DataLoader for HlaDataSet with custom collation.

Parameters:

dataset (HlaDataSet) – The dataset to load.
batch_size (int) – Number of samples per batch.
shuffle (bool) – Whether to shuffle data each epoch.

Return type:

DataLoader

Returns:

A DataLoader with pept_hla_collate as collate_fn.

fennomix_mhc.mhc_binding_model.pept_hla_collate(batch)

Collate function for creating batches from HlaDataSet.

Handles variable-length HLA embeddings and ASCII-encodes peptides.

Parameters:

batch (list[tuple[ndarray, str, str]]) – List of tuples (hla_embed, pos_peptide, neg_peptide).

Return type:

tuple

Returns:

A tuple of (hla_tensor, pos_pept_tensor, neg_pept_tensor):

hla_tensor: Padded HLA embeddings.
pos_pept_tensor: ASCII-encoded positive peptides.
neg_pept_tensor: ASCII-encoded negative peptides.

fennomix_mhc.mhc_binding_model.test(test_df, test_allele_list, hla_encoder, pept_encoder, hla_df, hla_esm_list, fasta_list)

Evaluates model performance on test alleles using rank-based recall.

Parameters:

test_df (DataFrame) – DataFrame with test peptide-allele pairs.
test_allele_list – List of HLA alleles to evaluate.
hla_encoder (ModelHlaEncoder) – Trained HLA encoder.
pept_encoder (ModelSeqEncoder) – Trained peptide encoder.
hla_df (DataFrame) – HLA metadata DataFrame.
hla_esm_list (list[ndarray]) – List of raw HLA ESM embeddings.
fasta_list (list[str]) – List of protein FASTA file paths.

Return type:

tuple[float, float, float]

Returns:

Tuple of mean recall rates at rank < 0.1, < 0.5, and < 2.0.
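One way to read "recall at rank < t" is as the fraction of known binders whose percentile rank against a background of random peptides falls below t. The sketch below illustrates that reading for a single allele. It is an assumption about the metric, not the module's implementation: `percent_rank`, `recall_at_ranks`, the distance-based scoring, and the toy numbers are all hypothetical.

```python
from bisect import bisect_left


def percent_rank(dist, sorted_background):
    """Percentage of background peptides with a strictly smaller distance."""
    better = bisect_left(sorted_background, dist)
    return 100.0 * better / len(sorted_background)


def recall_at_ranks(binder_dists, background_dists, thresholds=(0.1, 0.5, 2.0)):
    """Fraction of binders whose percent rank is below each threshold."""
    sorted_bg = sorted(background_dists)
    ranks = [percent_rank(d, sorted_bg) for d in binder_dists]
    return tuple(sum(r < t for r in ranks) / len(ranks) for t in thresholds)


# Toy data: 1000 background distances and 4 binder distances.
background = [i / 1000 for i in range(1, 1001)]
binders = [0.0005, 0.0035, 0.0155, 0.5005]
recalls = recall_at_ranks(binders, background)
```

Averaging such per-allele tuples over the test alleles would give mean recall rates of the kind returned by `test`.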

fennomix_mhc.mhc_binding_model.train(hla_encoder, pept_encoder, dataset, batch_size=256, lr=0.0001, epoch=100, warmup_epoch=20, verbose=True, device='cuda', test_bundle=None, neptune_run=None)

Train the peptide/HLA encoders.

Parameters:

hla_encoder (ModelHlaEncoder) – Encoder for HLA embeddings.
pept_encoder (ModelSeqEncoder) – Encoder for peptide sequences.
dataset (HlaDataSet) – Training dataset.
batch_size (int) – Number of samples per batch.
lr (float) – Learning rate for the optimizer.
epoch (int) – Total number of epochs.
warmup_epoch (int) – Number of warmup epochs for the scheduler.
verbose (bool) – Whether to print training progress.
device (str) – Device identifier for torch.device.
test_bundle (tuple | None) – Optional tuple of test data passed to test().
neptune_run – Optional Neptune experiment for logging.

Return type:

None