
ELLSA: End-to-end Listen, Look, Speak and Act

The first end-to-end model that unifies vision, speech, text and action in a streaming full-duplex framework, enabling joint multimodal perception and concurrent generation.

🧪 Highlights

  • Full-Duplex Multimodal Interaction: unifies listening, looking, speaking, and acting in a single end-to-end architecture, enabling simultaneous multimodal perception and generation.
  • SA-MoE Architecture for Efficient Multimodal Fusion: utilizes modality-specific experts with shared attention to reduce interference and leverage the capabilities of pretrained models.
  • Unique Human-like Capabilities: supports speaking-while-acting, context-grounded VQA, instruction rejection, and action barge-in, enabling more natural interactive intelligence.

🔧 Repo TODO List

  • Support for evaluation on speech interaction.
  • Support for evaluation on LIBERO.
  • Support for evaluation on CALVIN.
  • Release the training data.
  • Support for training.

📚 Experiments

Basic Capabilities

On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines.

Speech Interaction

| Model | Llama Q. S2T | Llama Q. S2S | Web Q. S2T | Web Q. S2S | TriviaQA S2T | TriviaQA S2S | AlpacaEval S2T | AlpacaEval S2S |
|---|---|---|---|---|---|---|---|---|
| Moshi | 60.8 | 54.5 | 23.4 | 22.1 | 25.6 | 16.7 | 1.84 | 1.76 |
| Freeze-Omni | 74.2 | 56.2 | 40.8 | 27.9 | 45.1 | 28.5 | 3.90 | 2.46 |
| ELLSA | 74.7 | 70.0 | 39.5 | 36.5 | 45.2 | 41.7 | 3.09 | 2.80 |

Speech-conditioned Robot Manipulation

| Model | SPATIAL | OBJECT | GOAL | LONG | Average |
|---|---|---|---|---|---|
| DP* | 78.3% | 92.5% | 68.3% | 50.5% | 72.4% |
| Octo | 78.9% | 85.7% | 84.6% | 51.1% | 75.1% |
| OpenVLA | 84.9% | 88.4% | 79.2% | 53.7% | 76.5% |
| SpatialVLA | 88.2% | 89.9% | 78.6% | 55.5% | 78.1% |
| CoT-VLA | 87.5% | 91.6% | 87.6% | 69.0% | 81.1% |
| π₀-FAST | 96.4% | 96.8% | 88.6% | 60.2% | 85.5% |
| ELLSA | 90.8% | 95.8% | 86.4% | 84.4% | 89.4% |

Advanced Capabilities

ELLSA can accomplish tasks that were previously unattainable, such as dialogue and action turn-taking prediction, rejection of defective instructions, speaking while acting, and responding to action barge-ins. These results highlight the feasibility and significance of full-duplex multimodal interaction as a foundation for more natural and general multimodal interactive intelligence.

An example of ELLSA's advanced capabilities: starting from a spoken instruction, the model executes the action, engages in context-grounded VQA, and supports action barge-in. This instance demonstrates not only ELLSA's core skills but also its unique advanced capabilities: its MIMO capacity to process multimodal inputs and outputs simultaneously, and its duplex capability to manage complex conversational dynamics such as turn-taking and interruptions.

πŸ› οΈ Setup

Here we provide a conda environment setup for the project.

```shell
conda create -n ellsa python=3.10
conda activate ellsa
pip install -r requirements.txt
```

If you run into issues installing flash-attention or kaldifeat, you can instead use the prebuilt wheels available here: flash-attn prebuilt wheels and kaldifeat prebuilt wheels.
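Since those two packages are the usual failure points, a quick sanity check after installation can save a confusing error later. The snippet below is our own sketch (not part of the repo; `env_report.txt` is just a hypothetical scratch file): it probes whether `flash_attn` and `kaldifeat` resolve in the active environment without actually importing them.

```shell
# Sanity-check sketch: report whether the packages that commonly need
# prebuilt wheels are resolvable in the current Python environment.
python3 - <<'PY' | tee env_report.txt
import importlib.util

for pkg in ("flash_attn", "kaldifeat"):
    # find_spec returns None when the package is not installed,
    # without triggering the package's (possibly heavy) import.
    status = "ok" if importlib.util.find_spec(pkg) else "missing"
    print(f"{pkg}: {status}")
PY
```

If either line reports `missing`, fall back to the prebuilt wheels linked above.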

🔥 Training

Coming soon...

🚀 Inference

Required Checkpoints and Data

Before running inference, make sure to download all required checkpoints and data.

| Model | Download |
|---|---|
| Emu3-vision | 🤗 HuggingFace |
| UniVLA-LIBERO | 🤗 HuggingFace |
| Llama-3.1-8B-Instruct | 🤗 HuggingFace |
| CosyVoice2-0.5B | 🤗 HuggingFace |
| ELLSA | 🤗 HuggingFace |

| Data | Download |
|---|---|
| Test Data | 🤗 HuggingFace |
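If you prefer scripted downloads, the checkpoints above can be fetched with `huggingface-cli` (from `huggingface_hub`). The sketch below is our own, not part of the repo: only `meta-llama/Llama-3.1-8B-Instruct` and `tsinghua-ee/ELLSA` are repo ids we can confirm, so the other entries must be filled in from the links above, and the actual download line is left commented out.

```shell
# Sketch of a checkpoint-fetch loop (assumes `pip install -U huggingface_hub`
# so that `huggingface-cli` is on PATH). Add the remaining repo ids from the
# download table above; the two below are the only ones we can confirm.
CKPT_DIR=${CKPT_DIR:-./checkpoints}
mkdir -p "$CKPT_DIR"
for repo in meta-llama/Llama-3.1-8B-Instruct tsinghua-ee/ELLSA; do
    target="$CKPT_DIR/$(basename "$repo")"
    echo "fetching $repo -> $target"
    # Uncomment to actually download (large files; gated repos need a token):
    # huggingface-cli download "$repo" --local-dir "$target"
done
```

Keeping each checkpoint under its own subdirectory makes it easy to point `${CKPT_PATH}` in the evaluation scripts at a single model.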

Speech Interaction

```shell
cd reference/RoboVLMs
bash scripts/run_eval_speech_only.sh ${CKPT_PATH}
```

Robot manipulation on LIBERO Benchmark

Build the LIBERO environment and dataset following the official instructions.

```shell
cd reference/RoboVLMs
bash scripts/run_eval_libero_contemporary.sh ${CKPT_PATH}
```

πŸ“ Code Structure

```
ELLSA/
├── configs/           # Model configuration files
├── models/            # Tokenizer and diffusion test
├── train/             # Training dataset and pipeline
├── reference/         # Reference code
│   ├── cosyvoice/     # Speech synthesizer
│   ├── Emu3/          # Base code
│   ├── RoboVLMs/      # Evaluation code
│   └── spear_encoder/ # Speech encoder
├── scripts/           # Shell scripts for training
├── tools/             # Data preprocessing tools
└── README.md          # Project description and user guide
```

❤️ Acknowledgement

Our work is built upon the following projects. Thanks for their great open-source work!

🌟 Citation

If you find this project useful, please consider citing our work:

```bibtex
@inproceedings{wang2026end,
  title={End-to-end Listen, Look, Speak and Act},
  author={Wang, Siyin and Yu, Wenyi and Chen, Xianzhao and Tian, Xiaohai and Zhang, Jun and Lu, Lu and Zhang, Chao},
  booktitle={Proc. ICLR},
  year={2026},
  address={Rio de Janeiro}
}
```