MultiHashFormer: Hash-based Generative Language Models
Abstract
MultiHashFormer enables hash-based autoregression in language models by representing tokens as hash signatures processed through a Hash Encoder and Hash Decoder within a Transformer framework.
Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models. While this offers parameter efficiency, many-to-one collisions prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that allows hash-based autoregression. Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector for processing by a Transformer decoder. Then, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. We evaluate our approach at the 100M, 1B and 3B parameter scales, demonstrating that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint without any modifications.
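The signature idea in the abstract can be illustrated with a small sketch. This is not the authors' construction: the salted BLAKE2b hash, the retry-on-collision rule used to keep signatures unique, and the names `hash_id`, `build_signatures`, `num_hashes`, and `num_buckets` are all assumptions made for illustration. It shows only that a short tuple of small hash IDs can identify every token uniquely and be mapped back to the token, while the number of distinct hash IDs stays far below the vocabulary size.

```python
import hashlib


def hash_id(token_id: int, fn_index: int, num_buckets: int, attempt: int = 0) -> int:
    """One salted hash function: maps a token ID to a bucket in [0, num_buckets).

    The salt (fn_index, attempt) is a hypothetical way to obtain several
    independent hash functions from a single primitive.
    """
    payload = f"{fn_index}:{attempt}:{token_id}".encode()
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets


def build_signatures(vocab_size: int, num_hashes: int, num_buckets: int):
    """Assign every token a unique signature of `num_hashes` hash IDs.

    Returns (token_to_sig, sig_to_token). Uniqueness is enforced here by
    re-salting on collision, which is an assumed strategy, not the paper's.
    """
    if num_buckets ** num_hashes < vocab_size:
        raise ValueError("signature space is too small for this vocabulary")
    token_to_sig = []
    sig_to_token = {}
    for token_id in range(vocab_size):
        attempt = 0
        while True:
            sig = tuple(
                hash_id(token_id, k, num_buckets, attempt) for k in range(num_hashes)
            )
            if sig not in sig_to_token:
                break
            attempt += 1  # signature already taken: re-salt and try again
        sig_to_token[sig] = token_id
        token_to_sig.append(sig)
    return token_to_sig, sig_to_token


# Illustrative sizes only; they are not taken from the paper.
VOCAB_SIZE, NUM_HASHES, NUM_BUCKETS = 1000, 3, 16
token_to_sig, sig_to_token = build_signatures(VOCAB_SIZE, NUM_HASHES, NUM_BUCKETS)

# If each hash function had its own embedding table (an assumption about the
# Hash Encoder), the row count would depend on the hash configuration only.
standard_rows = VOCAB_SIZE
hashed_rows = NUM_HASHES * NUM_BUCKETS
signature_capacity = NUM_BUCKETS ** NUM_HASHES
```

Under these assumed sizes, the signature space holds `NUM_BUCKETS ** NUM_HASHES` distinct signatures, so the vocabulary can grow up to that capacity without adding rows to the per-hash tables. In the framework described above, a learned Hash Encoder and Hash Decoder would take the place of the lookup dictionaries used here.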
Community
Token hashing has so far been confined to encoder-only architectures because many-to-one collisions break generative decoding: the model cannot tell which of the colliding tokens to emit. This paper addresses that limitation by representing each token as a unique hash signature made of multiple hash IDs, which is then processed through a cascaded predictor decoder. The architecture consistently outperforms standard language models at the 100M, 1B and 3B parameter scales across core language benchmarks. More importantly, it supports a substantially larger vocabulary with a fixed memory footprint while achieving performance comparable to the standard vocabulary expansion approach. This makes it a viable option for vocabulary modularity.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs (2026)
- NGM: A Plug-and-Play Training-Free Memory Module for LLMs (2026)
- Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models (2026)
- TIDE: Every Layer Knows the Token Beneath the Context (2026)
- Augmenting Molecular Language Models with Local $n$-gram Memory (2026)
- Lngram: N-gram Conditional Memory in Latent Space (2026)
- Parallel Causal Associative Fields: Gated Sparse Memory for Long-Context Language Modeling (2026)
How to cite this paper:
@misc{xue2026multihashformerhashbasedgenerativelanguage,
      title={MultiHashFormer: Hash-based Generative Language Models},
      author={Huiyin Xue and Atsuki Yamaguchi and Nikolaos Aletras},
      year={2026},
      eprint={2606.28057},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.28057},
}
Models citing this paper 28
klein9692/mhf_1b_32768_4_64