metadata
license: mit
base_model: microsoft/LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned
tags:
  - text-embeddings
  - sentence-transformers
  - llm2vec
  - medical
  - chest-xray
  - radiology
  - clinical-nlp
language:
  - en
pipeline_tag: feature-extraction
library_name: transformers

LLM2Vec4CXR - Fine-tuned Model for Chest X-ray Report Analysis

LLM2Vec4CXR is a text encoder optimized for chest X-ray report analysis and medical text understanding.
It is introduced in our paper Exploring the Capabilities of LLM Encoders for Image–Text Retrieval in Chest X-rays.

Model Description

LLM2Vec4CXR is a bidirectional text encoder fine-tuned with a latent_attention pooling strategy.
This design enhances semantic representation of chest X-ray reports, making the model robust across different reporting styles and effective even with domain-specific abbreviations.
It improves performance on clinical text similarity, retrieval, and interpretation tasks.

Key Features

  • Base Architecture: LLM2CLIP-Llama-3.2-1B-Instruct-CC-Finetuned
  • Pooling Mode: Latent Attention (trained weights automatically loaded)
  • Bidirectional Processing: Enabled for better context understanding
  • Medical Domain: Specialized for chest X-ray report analysis
  • Max Length: 512 tokens
  • Precision: bfloat16
  • Automatic Loading: Latent attention weights are automatically loaded from safetensors
  • Simple API: Built-in methods for similarity computation and instruction-based encoding

Usage

Installation

# Only transformers and torch are needed
pip install transformers torch

Basic Usage

import torch
from transformers import AutoModel

# Load the model - that's it!
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()

# Simple text encoding
report = "Small left pleural effusion with basal atelectasis."
embedding = model.encode_text([report])
print(embedding.shape)  # torch.Size([1, 2048])

# Multiple texts at once
reports = [
    "No acute cardiopulmonary abnormality.",
    "Small bilateral pleural effusions.",
    "Large left pleural effusion with compressive atelectasis."
]
embeddings = model.encode_text(reports)
print(embeddings.shape)  # torch.Size([3, 2048])
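
The returned embeddings can be compared directly with cosine similarity. As a small follow-on to the example above (the cast to float32 is an assumption, added for numerically stable similarity math on bfloat16 outputs):

import torch.nn.functional as F

# Cosine similarity between the first report and the other two
emb = embeddings.float()
sims = F.cosine_similarity(emb[0:1], emb[1:], dim=-1)
print(sims)  # two similarity scores, one per remaining report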

Instruction-Based Encoding and Similarity

import torch
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()

# Instruction-based task with separator
instruction = "Determine the status of the pleural effusion."
report = "There is a small increase in the left-sided effusion."
query = instruction + "!@#$%^&*()" + report

# Compare against multiple candidates
candidates = [
    "No pleural effusion",
    "Pleural effusion present",
    "Worsening pleural effusion",
    "Improving pleural effusion"
]

# One-line similarity computation
scores = model.compute_similarities(query, candidates)
print(scores)
# tensor([0.7171, 0.8270, 0.9155, 0.8113], device='cuda:0')

best_match = candidates[torch.argmax(scores)]
print(f"Best match: {best_match}")
# Best match: Worsening pleural effusion
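
To inspect every candidate rather than just the best match, the scores can be paired with their labels (a small follow-on to the example above):

for candidate, score in zip(candidates, scores.tolist()):
    print(f"{score:.4f}  {candidate}")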

Medical Report Retrieval Example

import torch
from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "lukeingawesome/llm2vec4cxr",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda" if torch.cuda.is_available() else "cpu").eval()

# Instruction for retrieval
instruction = "Retrieve semantically similar reports"
query_report = "Small left pleural effusion with basal atelectasis."
query = instruction + "!@#$%^&*()" + query_report

# Candidate reports
candidates = [
    "No acute cardiopulmonary abnormality.",
    "Small left pleural effusion is present.",
    "Large right pleural effusion causing compressive atelectasis.",
    "Heart size is normal with no evidence of pleural effusion.",
]

# Compute similarities
scores = model.compute_similarities(query, candidates)

# Get most similar
best_idx = torch.argmax(scores)
print(f"Most similar: {candidates[best_idx]}")
print(f"Score: {scores[best_idx]:.4f}")

API Reference

The model provides three main methods:

encode_text(texts, max_length=512)

Simple text encoding for one or more texts.

Parameters:

  • texts: List of strings or single string
  • max_length: Maximum sequence length (default: 512)

Returns: Tensor of shape (batch_size, 2048)
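
Example (a single string is also accepted, per the parameter description above):

emb = model.encode_text("No pleural effusion.")
print(emb.shape)  # expected: torch.Size([1, 2048])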

encode_with_separator(texts, separator='!@#$%^&*()', max_length=512)

Encodes a list of texts in which an instruction and an input report are joined by the special separator.

Parameters:

  • texts: List of strings with optional separator
  • separator: String separator (default: '!@#$%^&*()')
  • max_length: Maximum sequence length (default: 512)

Returns: Tensor of shape (batch_size, 2048)
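
A sketch of how this method would be called, assuming the name and signature given above (check the model's remote code for the exact API):

pair = "Determine the status of the pleural effusion.!@#$%^&*()Stable small right effusion."
embeddings = model.encode_with_separator([pair])
print(embeddings.shape)  # expected: torch.Size([1, 2048])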

compute_similarities(query_text, candidate_texts, separator='!@#$%^&*()', max_length=512)

Encodes the query and the candidates, then returns the cosine similarity between the query embedding and each candidate embedding.

Parameters:

  • query_text: Single query string
  • candidate_texts: List of candidate strings
  • separator: String separator (default: '!@#$%^&*()')
  • max_length: Maximum sequence length (default: 512)

Returns: Tensor of shape (num_candidates,) with cosine similarity scores

Training Details

Training Data

  • Fully fine-tuned on chest X-ray reports and medical text data
  • Training focused on understanding pleural effusion status and other chest X-ray findings

Training Configuration

  • Pooling Mode: latent_attention (512 latents, 8 attention heads)
  • Enable Bidirectional: True
  • Max Length: 512 tokens
  • Torch Dtype: bfloat16
  • Full Fine-tuning: All model weights were updated during training
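
For intuition, here is a minimal sketch of latent-attention pooling at the dimensions listed above (hidden size 2048, 512 latents, 8 heads), in the common Perceiver-style formulation where trainable latent vectors cross-attend to the token states and are then averaged. It illustrates the general technique only; the class and all names are assumptions, not the model's actual implementation.

import torch
import torch.nn as nn

class LatentAttentionPooling(nn.Module):
    """Illustrative latent-attention pooler (not the model's own code)."""

    def __init__(self, hidden_size=2048, num_latents=512, num_heads=8):
        super().__init__()
        # Trainable latent "query" vectors shared across all inputs
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_size))
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, hidden_states, attention_mask=None):
        # hidden_states: (batch, seq_len, hidden_size) from the bidirectional encoder
        batch_size = hidden_states.size(0)
        queries = self.latents.unsqueeze(0).expand(batch_size, -1, -1)
        key_padding_mask = None
        if attention_mask is not None:
            key_padding_mask = attention_mask == 0  # True marks padding tokens
        attended, _ = self.cross_attn(queries, hidden_states, hidden_states,
                                      key_padding_mask=key_padding_mask)
        # Average the attended latents into one embedding per input
        return attended.mean(dim=1)  # (batch, hidden_size)

pooler = LatentAttentionPooling()
tokens = torch.randn(2, 16, 2048)
print(pooler(tokens).shape)  # torch.Size([2, 2048])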

Technical Specifications

  • Model Type: Bidirectional Language Model (LLM2Vec)
  • Architecture: LlamaBiModel (modified Llama 3.2) + Latent Attention Pooling
  • Parameters: ~1B parameters
  • Hidden Size: 2048
  • Input Length: Up to 512 tokens
  • Output Dimension: 2048
  • Precision: bfloat16
  • Dependencies: Only transformers and torch

Intended Use

Primary Use Cases

  • Medical Text Embeddings: Generate embeddings for chest X-ray reports
  • Clinical Text Similarity: Compare medical texts for semantic similarity
  • Medical Information Retrieval: Find relevant medical reports or findings
  • Clinical NLP Research: Foundation model for medical text analysis

Limitations

  • Specialized for chest X-ray reports; may not generalize to other medical domains
  • Requires careful preprocessing for optimal performance
  • Should be used as part of a larger clinical decision support system, not for standalone diagnosis

Evaluation

The model has been evaluated on chest X-ray report analysis tasks, particularly for:

  • Text retrieval and encoding
  • Medical text similarity comparison
  • Clinical finding extraction

Sample Performance

The model demonstrates consistent improvements over the base LLM2CLIP architecture on medical text understanding benchmarks.
LLM2Vec4CXR shows stronger performance in:

  • Handling medical abbreviations and radiological terminology
  • Capturing fine-grained semantic differences in chest X-ray reports
  • Understanding clinical context and temporal changes

Related Resources

📄 Paper: Exploring the Capabilities of LLM Encoders for Image–Text Retrieval in Chest X-rays

🔗 Related Projects:

  • LLM2CLIP4CXR: A CLIP-based model that leverages the LLM2Vec encoder to align visual and textual representations of chest X-rays

Citation

If you use this model in your research, please cite:

@article{ko2025exploring,
  title={Exploring the Capabilities of LLM Encoders for Image--Text Retrieval in Chest X-rays},
  author={Ko, Hanbin and Cho, Gihun and Baek, Inhyeok and Kim, Donguk and Koo, Joonbeom and Kim, Changi and Lee, Dongheon and Park, Chang Min},
  journal={arXiv preprint arXiv:2509.15234},
  year={2025}
}

Acknowledgments

This model is built upon:

  • LLM2Vec - Framework for converting decoder-only LLMs into text encoders
  • LLM2CLIP - Microsoft's implementation for connecting LLMs with CLIP models

License

This model is licensed under the MIT License.