| --- |
| tags: |
| - protein language model |
| pipeline_tag: text-classification |
| --- |
| |
| # PDeepPP model |
|
|
| `PDeepPP` is a hybrid protein language model designed to predict post-translational modification (PTM) sites and extract biologically relevant features from protein sequences. By leveraging pretrained embeddings from `ESM` and incorporating both transformer and convolutional neural network (CNN) architectures, `PDeepPP` provides a robust framework for analyzing protein sequences in various contexts. |
|
|
| ## Model description |
|
|
| `PDeepPP` combines transformer-based self-attention, which captures global (long-range) sequence context, with convolutional operations, which capture local sequence features. The model consists of: |
|
|
| 1. A **Self-Attention Global Features module** for capturing long-range dependencies. |
| 2. A **TransConv1d module**, combining transformers and convolutional layers. |
| 3. A **PosCNN module**, which applies position-aware convolutional operations for feature extraction. |
|
|
| The model is trained with a loss function that combines a classification loss with additional regularization terms intended to improve generalization and interpretability. It can be loaded through Hugging Face's `transformers` library via `AutoModel.from_pretrained(..., trust_remote_code=True)`. |
|
|
| ## Intended uses |
|
|
| `PDeepPP` was developed and validated on PTM and BPS datasets, but its architecture is not tied to these tasks and can be applied to other protein sequence analyses. The model has been validated on the following datasets: |
|
|
| 1. **PTM datasets**: Used for predicting post-translational modification (PTM) sites (e.g., phosphorylation), focusing on serine (S), threonine (T), and tyrosine (Y) residues. |
| 2. **BPS datasets**: Used for analyzing biologically active regions of protein sequences (Biologically Active Protein Sequences, BPS) to support downstream analyses. |
|
|
| Beyond these two dataset families, users can adapt `PDeepPP` to other protein sequence analysis tasks, such as embedding generation or sequence classification. |
|
|
| --- |
|
|
| ### Key features |
|
|
| - **Dataset support**: `PDeepPP` is trained on PTM and BPS datasets, demonstrating its effectiveness in identifying specific sequence features (e.g., post-translational modification sites) and extracting biologically relevant regions. |
| - **Task flexibility**: The model is not limited to PTM and BPS tasks. Users can adapt `PDeepPP` to other protein sequence-based tasks by customizing input data and task objectives. |
| - **PTM mode**: Focuses on sequences centered around specific residues (S, T, Y) to analyze post-translational modification activity. |
| - **BPS mode**: Analyzes overlapping or non-overlapping subsequences of a protein to extract biologically meaningful features. |
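
The two modes above can be illustrated with a minimal, pure-Python sketch. It is not the `PDeepPPProcessor` implementation: the function names `ptm_windows` and `bps_windows`, the `stride` parameter, and the right-padding of the last chunk are assumptions made for illustration. Only the padding character `X` and the window length 33 are taken from the usage example in this card, and the centering logic assumes an odd window length.

```python
from typing import List

PAD_CHAR = "X"       # padding character, as in the usage example
TARGET_LENGTH = 33   # window length, as in the usage example


def ptm_windows(sequence: str, residues: str = "STY",
                length: int = TARGET_LENGTH, pad: str = PAD_CHAR) -> List[str]:
    """Return one fixed-length window centered on each candidate residue.

    Hypothetical helper: assumes `length` is odd so the residue sits
    exactly in the middle of its window.
    """
    half = length // 2
    padded = pad * half + sequence + pad * half
    # Residue i of `sequence` sits at index i + half in `padded`,
    # so padded[i:i + length] is centered on residue i.
    return [padded[i:i + length]
            for i, aa in enumerate(sequence) if aa in residues]


def bps_windows(sequence: str, length: int = TARGET_LENGTH,
                stride: int = TARGET_LENGTH, pad: str = PAD_CHAR) -> List[str]:
    """Split a sequence into fixed-length windows.

    Hypothetical helper: stride == length gives non-overlapping windows,
    stride < length gives overlapping ones. The last window is right-padded.
    """
    windows = []
    for start in range(0, len(sequence), stride):
        chunk = sequence[start:start + length]
        windows.append(chunk.ljust(length, pad))
        if start + length >= len(sequence):
            break  # the end of the sequence has been covered
    return windows
```

For example, `ptm_windows("VELYP")` yields a single 33-character window with `Y` at its center, and `bps_windows("ESHINQKWVCK")` yields one window that is right-padded with `X` to 33 characters.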
|
|
| ## How to use |
|
|
| To use `PDeepPP`, install the required dependencies: `torch`, `transformers`, and `fair-esm` (which provides the `esm` module imported in the example below): |
|
|
| ```bash |
| pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 |
| pip install transformers |
| pip install fair-esm |
| ``` |
| Before proceeding, make sure that the `DataProcessor_pdeeppp` and `Pretraining_pdeeppp` module files are in the same directory as your script (`example.py`). |
| Here is an example of how to use `PDeepPP` to process protein sequences and obtain predictions: |
|
|
| ```python |
| import torch |
| import esm |
| from DataProcessor_pdeeppp import PDeepPPProcessor |
| from Pretraining_pdeeppp import PretrainingPDeepPP |
| from transformers import AutoModel |
| |
| # Global parameter settings |
| device = torch.device("cpu") |
| pad_char = "X" # Padding character |
| target_length = 33 # Target length for sequence padding |
| mode = "BPS" # Mode setting (only configured in example.py) |
| esm_ratio = 0.95 # Ratio for ESM embeddings |
| |
| # Load the PDeepPP model |
| model_name = "fondress/PDeepPP_umami" |
| model = AutoModel.from_pretrained(model_name, trust_remote_code=True) # Directly load the model |
| |
| # Initialize the PDeepPPProcessor |
| processor = PDeepPPProcessor(pad_char=pad_char, target_length=target_length) |
| |
| # Example protein sequences (test sequences) |
| protein_sequences = ["VELYP", "YPLDL", "ESHINQKWVCK"] |
| |
| # Preprocess the sequences |
| inputs = processor(sequences=protein_sequences, mode=mode, return_tensors="pt") # Dynamic mode parameter |
| processed_sequences = inputs["raw_sequences"] |
| |
| # Load the ESM model |
| esm_model, esm_alphabet = esm.pretrained.esm2_t33_650M_UR50D() |
| esm_model = esm_model.to(device) |
| esm_model.eval() |
| |
| # Initialize the PretrainingPDeepPP module |
| pretrainer = PretrainingPDeepPP( |
|     embedding_dim=1280, |
|     target_length=target_length, |
|     esm_ratio=esm_ratio, |
|     device=device |
| ) |
| |
| # Extract the vocabulary and ensure the padding character 'X' is included |
| vocab = set("".join(protein_sequences)) |
| vocab.add(pad_char) # Add the padding character |
| |
| # Generate pretrained features using the PretrainingPDeepPP module |
| pretrained_features = pretrainer.create_embeddings( |
| processed_sequences, vocab, esm_model, esm_alphabet |
| ) |
| |
| # Ensure pretrained features are on the same device |
| inputs["input_embeds"] = pretrained_features.to(device) |
| |
| # Perform prediction (no gradients are needed at inference time) |
| model.eval() |
| with torch.no_grad(): |
|     outputs = model(input_embeds=inputs["input_embeds"]) # Use pretrained features as model input |
| logits = outputs["logits"] |
| |
| # Convert logits to probabilities. The loop below calls .item() on each entry, |
| # so this assumes one logit per sequence (binary classification); sigmoid is |
| # used because softmax over a single logit would always return 1.0. |
| probabilities = torch.sigmoid(logits) |
| predicted_labels = (probabilities >= 0.5).long() |
| |
| # Print the prediction results for each sequence |
| print("\nPrediction Results:") |
| for i, seq in enumerate(processed_sequences): |
|     print(f"Sequence: {seq}") |
|     print(f"Probability: {probabilities[i].item():.4f}") |
|     print(f"Predicted Label: {predicted_labels[i].item()}") |
|     print("-" * 50) |
| ``` |
|
|
| ## Training and customization |
|
|
| `PDeepPP` supports fine-tuning on custom datasets. The model uses a configuration class (`PDeepPPConfig`) to specify hyperparameters such as: |
|
|
| - **Number of transformer layers** |
| - **Hidden layer size** |
| - **Dropout rate** |
| - **PTM type** and other task-specific parameters |
|
|
| Refer to `PDeepPPConfig` for details. |
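
Before fine-tuning, it can help to inspect which hyperparameters the published configuration actually exposes. The sketch below is one possible way to do that with `AutoConfig` from `transformers`. The helper names `load_pdeeppp_config` and `describe_config` are hypothetical, and the field names printed depend entirely on what `PDeepPPConfig` defines, which this card does not list.

```python
def load_pdeeppp_config(model_name: str = "fondress/PDeepPP_umami"):
    """Download the remote configuration for the given model repository.

    `transformers` is imported lazily so that this module can be imported
    without it; calling the function requires the package and network access.
    """
    from transformers import AutoConfig
    return AutoConfig.from_pretrained(model_name, trust_remote_code=True)


def describe_config(config) -> dict:
    """Return the configuration's fields as a plain dict, sorted by name.

    Works with any object exposing `to_dict()`, as Hugging Face
    configuration classes do.
    """
    return dict(sorted(config.to_dict().items()))
```

Calling `describe_config(load_pdeeppp_config())` and printing the result lists the available fields; any value you then override for fine-tuning should be one that appears in that listing.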
|
|
| ## Citation |
| If you use `PDeepPP` in your research, please cite the associated paper or repository: |
|
|
| ``` |
| @article{your_reference, |
| title={PDeepPP: A Hybrid Model for Protein Sequence Analysis}, |
| author={Author Name}, |
| journal={Journal Name}, |
| year={2025} |
| } |
| ``` |