arxiv:2606.28057

MultiHashFormer: Hash-based Generative Language Models

Published on Jun 26

Submitted by Atsuki Yamaguchi on Jun 29

Authors:

Huiyin Xue, Atsuki Yamaguchi, Nikolaos Aletras

Abstract

MultiHashFormer enables hash-based autoregression in language models by representing tokens as hash signatures processed through a Hash Encoder and Hash Decoder within a Transformer framework.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models. While this offers parameter efficiency, many-to-one collisions prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that allows hash-based autoregression. Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector for processing by a Transformer decoder. Then, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. We evaluate our approach at the 100M, 1B and 3B parameter scales, demonstrating that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint without any modifications.
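The hash-signature idea in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the hash family (simple universal hashes of the form `((a * x + b) mod p) mod B`), the helper names (`make_hash_functions`, `signature`, `build_signature_table`), and the parameter-count model (one table of `num_buckets` vectors per hash function). Randomly drawn hash functions do not guarantee unique signatures by themselves, so the sketch reports collisions rather than assuming there are none; how the authors ensure uniqueness is not shown here.

```python
import random

# A Mersenne prime (2^31 - 1), larger than any token ID used in this sketch.
PRIME = 2_147_483_647


def make_hash_functions(num_hashes, num_buckets, seed=0):
    """Draw `num_hashes` universal hash functions mapping token IDs to [0, num_buckets).

    Hypothetical helper: the hash family is an illustrative assumption.
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(num_hashes)]

    def make(a, b):
        return lambda token_id: ((a * token_id + b) % PRIME) % num_buckets

    return [make(a, b) for a, b in params]


def signature(token_id, hash_fns):
    """Return the token's hash signature: one bucket ID per hash function."""
    return tuple(h(token_id) for h in hash_fns)


def build_signature_table(vocab_size, hash_fns):
    """Map signatures back to token IDs and list tokens whose signature is already taken."""
    table = {}
    collisions = []
    for token_id in range(vocab_size):
        sig = signature(token_id, hash_fns)
        if sig in table:
            collisions.append(token_id)  # signature already used by an earlier token
        else:
            table[sig] = token_id
    return table, collisions


def embedding_params(vocab_size, d_model):
    """Parameters in a standard embedding matrix: one vector per token."""
    return vocab_size * d_model


def hashed_embedding_params(num_hashes, num_buckets, d_model):
    """Parameters if each hash function owns a table of bucket vectors (assumed layout)."""
    return num_hashes * num_buckets * d_model


if __name__ == "__main__":
    fns = make_hash_functions(num_hashes=4, num_buckets=256)
    table, collisions = build_signature_table(50_000, fns)
    print("signature of token 123:", signature(123, fns))
    print("tokens with a colliding signature:", len(collisions))
    print("standard embedding params:", embedding_params(50_000, 768))
    print("hashed embedding params:", hashed_embedding_params(4, 256, 768))
```

Under these assumed settings, a standard embedding matrix for 50,000 tokens at width 768 holds 38,400,000 parameters, while the hashed layout holds 4 × 256 × 768 = 786,432 regardless of vocabulary size. That is the sense in which expanding the vocabulary leaves the parameter footprint constant in this toy model.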

Community

atsuki-yamaguchi

Paper submitter 1 day ago

Token hashing has so far been confined to encoder-only architectures because many-to-one collisions break generative decoding. This paper addresses that limitation by representing each token as a unique hash signature made up of multiple hash IDs, which is then processed through a cascaded predictor decoder. Notably, this architecture consistently outperforms standard language models across a range of scales on core language benchmarks. More importantly, it supports a substantially larger vocabulary with a fixed memory footprint while achieving performance comparable to the standard vocabulary expansion approach. This work offers a viable solution for vocabulary modularity.

klein9692

Paper author about 7 hours ago

How to cite this paper:

@misc{xue2026multihashformerhashbasedgenerativelanguage,
title={MultiHashFormer: Hash-based Generative Language Models},
author={Huiyin Xue and Atsuki Yamaguchi and Nikolaos Aletras},
year={2026},
eprint={2606.28057},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.28057},
}