Collective Communication for 100k+ GPUs
The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. Traditional communication methods face significant throughput and latency limitations at this scale, hindering both the development and deployment of state-of-the-art models. This paper presents the NCCLX collective communication framework, developed at Meta, engineered to optimize performance across the full LLM lifecycle, from the synchronous demands of large-scale training to the low-latency requirements of inference. The framework is designed to support complex workloads on clusters exceeding 100,000 GPUs, ensuring reliable, high-throughput, and low-latency data exchange. Empirical evaluation on the Llama4 model demonstrates substantial improvements in communication efficiency. This research contributes a robust solution for enabling the next generation of LLMs to operate at unprecedented scales.
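The abstract does not describe NCCLX's internal algorithms, so the following is only a generic illustration of what a collective does: a minimal, single-process simulation of the textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase). The function name `ring_all_reduce` and the list-of-lists data layout are assumptions made for this sketch and are not taken from the paper.

```python
from typing import List


def ring_all_reduce(buffers: List[List[int]]) -> List[List[int]]:
    """Simulate a sum all-reduce over n ranks arranged in a ring.

    buffers[r] is the local vector of rank r. This sketch assumes all
    vectors have equal length and that the length is divisible by n.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert all(len(b) == length for b in buffers), "equal-sized buffers required"
    assert length % n == 0, "sketch assumes length divisible by rank count"
    chunk = length // n
    data = [list(b) for b in buffers]  # copy so the inputs stay untouched

    def span(c: int) -> range:
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): in step s, rank r sends chunk (r - s) mod n
    # to rank r + 1, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] += v

    # Rank r now holds the fully reduced chunk (r + 1) mod n.
    # Phase 2 (all-gather): in step s, rank r forwards chunk (r + 1 - s) mod n
    # to rank r + 1, which overwrites its copy with the reduced values.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, v in zip(span(c), payload):
                data[dst][i] = v
    return data
```

Each rank sends and receives 2(n - 1) chunks of size length/n, which is why the per-rank traffic of this scheme stays roughly constant as the ring grows. Real frameworks add topology awareness, pipelining, and fault handling that this sketch leaves out.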
COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning
We present a new zero-shot dense retrieval (ZeroDR) method, COCO-DR, to improve the generalization ability of dense retrieval by combating the distribution shifts between source training tasks and target scenarios. To mitigate the impact of document differences, COCO-DR continues pretraining the language model on the target corpora to adapt the model to target distributions via COntinuous COntrastive learning. To prepare for unseen target queries, COCO-DR leverages implicit Distributionally Robust Optimization (iDRO) to reweight samples from different source query clusters for improving model robustness over rare queries during fine-tuning. COCO-DR achieves superior average performance on BEIR, the zero-shot retrieval benchmark. At BERT Base scale, COCO-DR Base outperforms other ZeroDR models with 60x larger size. At BERT Large scale, COCO-DR Large outperforms the giant GPT-3 embedding model which has 500x more parameters. Our analysis shows the correlation between COCO-DR's effectiveness in combating distribution shifts and improving zero-shot accuracy. Our code and model can be found at https://github.com/OpenMatch/COCO-DR.
LEONARDO: A Pan-European Pre-Exascale Supercomputer for HPC and AI Applications
A new pre-exascale computer cluster, called LEONARDO, has been designed to foster scientific progress and competitive innovation across European research systems. This paper describes the general architecture of the system and focuses on the technologies adopted for its GPU-accelerated partition. High density processing elements, fast data movement capabilities and mature software stack collections allow the machine to run intensive workloads in a flexible and scalable way. Scientific applications from traditional High Performance Computing (HPC) as well as emerging Artificial Intelligence (AI) domains can benefit from this large apparatus in terms of time and energy to solution.
Probabilistic Partitive Partitioning (PPP)
Clustering is an NP-hard problem. Because no efficient optimal algorithm is known, heuristics are applied to cluster the data. Heuristics can be very resource-intensive if not applied properly. For substantially large data sets, computational efficiencies can be achieved by reducing the input space, provided that the loss of information is minimal. Clustering algorithms, in general, face two common problems: 1) they converge to different settings under different initial conditions; and 2) the number of clusters has to be arbitrarily decided beforehand. These problems have become critical in the realm of big data. Recently, clustering algorithms have emerged which can speed up computations using parallel processing over the grid but face the aforementioned problems. Goals: Our goals are to find methods to cluster data which: 1) guarantee convergence to the same settings irrespective of the initial conditions; 2) eliminate the need to establish the number of clusters beforehand; and 3) can be applied to cluster large datasets. Methods: We introduce a method that combines probabilistic and combinatorial clustering methods to produce repeatable and compact clusters that are not sensitive to initial conditions. This method harnesses the power of k-means (a combinatorial clustering method) to cluster/partition very large dimensional datasets and uses the Gaussian Mixture Model (a probabilistic clustering method) to validate the k-means partitions. Results: We show that this method produces very compact clusters that are not sensitive to initial conditions. This method can be used to identify the most 'separable' set in a dataset which increases the 'clusterability' of a dataset. This method also eliminates the need to specify the number of clusters in advance.
High-dimensional Clustering onto Hamiltonian Cycle
Clustering aims to group unlabelled samples based on their similarities. It has become a significant tool for the analysis of high-dimensional data. However, most of the clustering methods merely generate pseudo labels and thus are unable to simultaneously present the similarities between different clusters and outliers. This paper proposes a new framework called High-dimensional Clustering onto Hamiltonian Cycle (HCHC) to solve the above problems. First, HCHC combines global structure with local structure in one objective function for deep clustering, improving the labels as relative probabilities, to mine the similarities between different clusters while keeping the local structure in each cluster. Then, the anchors of different clusters are sorted on the optimal Hamiltonian cycle generated by the cluster similarities and mapped on the circumference of a circle. Finally, a sample with a higher probability of a cluster will be mapped closer to the corresponding anchor. In this way, our framework allows us to appreciate three aspects visually and simultaneously - clusters (formed by samples with high probabilities), cluster similarities (represented as circular distances), and outliers (recognized as dots far away from all clusters). The experiments illustrate the superiority of HCHC.
Cluster Explanation via Polyhedral Descriptions
Clustering is an unsupervised learning problem that aims to partition unlabelled data points into groups with similar features. Traditional clustering algorithms provide limited insight into the groups they find as their main focus is accuracy and not the interpretability of the group assignments. This has spurred a recent line of work on explainable machine learning for clustering. In this paper we focus on the cluster description problem where, given a dataset and its partition into clusters, the task is to explain the clusters. We introduce a new approach to explain clusters by constructing polyhedra around each cluster while minimizing either the complexity of the resulting polyhedra or the number of features used in the description. We formulate the cluster description problem as an integer program and present a column generation approach to search over an exponential number of candidate half-spaces that can be used to build the polyhedra. To deal with large datasets, we introduce a novel grouping scheme that first forms smaller groups of data points and then builds the polyhedra around the grouped data, a strategy which out-performs simply sub-sampling data. Compared to state of the art cluster description algorithms, our approach is able to achieve competitive interpretability with improved description accuracy.
Deep view of the intracluster light in the Coma cluster of galaxies
Detection and study of the intracluster light in rich clusters of galaxies has been a long-standing challenge and a subject of sustained interest. Using the lowest surface brightness images of the Coma cluster of galaxies in the g and r bands, from the Halos and Environment of Nearby Galaxies (HERON) Coma Cluster Project, we obtained the most extensive image of intracluster light (ICL) in a single cluster to date, spreading over 1.5 Mpc from the cluster core. The unprecedented wealth of spectroscopic data made publicly available by the Dark Energy Spectroscopic Instrument (DESI) Early Data Release, complemented with a compilation from the NASA/IPAC Extragalactic Database and the literature, enabled the identification of 2,157 galaxy members within Coma, from which 42 distinct groups were identified. The synergy between these high-quality data allowed us to: 1) calculate ICL fractions of $19.9 \pm 0.5\%$ and $19.6 \pm 0.6\%$ in the g and r bands, respectively, consistent with a dynamically active cluster, 2) unveil Coma's faintest tidal features, and 3) provide a comprehensive picture of the dynamics and interactions within this complex system. Our findings indicate that the ICL connects several of these groups in a filamentous network, from which we infer the ongoing dynamical processes. In particular, we identified a faint stellar bridge linking the core of Coma with the galaxy NGC 4839, providing compelling evidence that this galaxy has already traversed the central region of the cluster.
Clustering Cluster Algebras with Clusters
Classification of cluster variables in cluster algebras (in particular, Grassmannian cluster algebras) is an important problem, which has direct application to computations of scattering amplitudes in physics. In this paper, we apply the tableaux method to classify cluster variables in Grassmannian cluster algebras C[Gr(k,n)] up to (k,n)=(3,12), (4,10), or (4,12) up to a certain number of columns of tableaux, using HPC clusters. These datasets are made available on GitHub. Supervised and unsupervised machine learning methods are used to analyse this data and identify structures associated to tableaux corresponding to cluster variables. Conjectures are raised associated to the enumeration of tableaux at each rank and the tableaux structure which creates a cluster variable, with the aid of machine learning.
Modeling formation and transport of clusters at high temperature and pressure gradients by implying partial chemical equilibrium
A theoretical approach to describing transport of an entire ensemble of clusters with different sizes as a single species in gas has been developed. The major assumption is the existence of local partial chemical equilibrium between the clusters. It is shown that thermal diffusion emerges in the collective description as a significant factor even if it is negligible when transport of the original molecular species is considered. Analytical expressions for the effective diffusion and thermal diffusion coefficients at temperature, pressure, and chemical composition gradients have been derived. The theory has been applied to a technology of H2S conversion in a centrifugal plasma-chemical reactor and has made it possible to account for sulfur clusters in numerical process modeling.
Formation of supermassive stars and dense star clusters in metal-poor clouds exposed to strong FUV radiation
The direct collapse scenario, which predicts the formation of supermassive stars (SMSs) as precursors to supermassive black holes (SMBHs), has been explored primarily under the assumption of metal-free conditions. However, environments exposed to strong far-ultraviolet (FUV) radiation, which is another requirement for the direct collapse, are often chemically enriched to varying degrees. In this study, we perform radiation hydrodynamic simulations of star-cluster formation in clouds with finite metallicities, $Z = 10^{-6}$ to $10^{-2}\,Z_\odot$, incorporating detailed thermal and chemical processes and radiative feedback from forming stars. Extending the simulations to approximately two million years, we demonstrate that SMSs with masses exceeding $10^4\,M_\odot$ can form even in metal-enriched clouds with $Z \lesssim 10^{-3}\,Z_\odot$. The accretion process in these cases, driven by "super-competitive accretion," preferentially channels gas into central massive stars in spite of small (sub-pc) scale fragmentation. At $Z \simeq 10^{-2}\,Z_\odot$, however, enhanced cooling leads to intense fragmentation on larger scales, resulting in the formation of dense star clusters dominated by very massive stars with $10^3\,M_\odot$ rather than SMSs. These clusters resemble young massive or globular clusters observed in the distant and local universe, exhibiting compact morphologies and high stellar surface densities. Our findings suggest that SMS formation is viable below a metallicity threshold of approximately $10^{-3}\,Z_\odot$, significantly increasing the number density of massive seed black holes to levels sufficient to account for the ubiquitous SMBHs observed in the local universe. Moreover, above this metallicity, this scenario naturally explains the transition from SMS formation to dense stellar cluster formation.
The effect of dynamical states on galaxy clusters populations. I. Classification of dynamical states
While the influence of galaxy clusters on galaxy evolution is relatively well-understood, the impact of the dynamical states of these clusters is less clear. This paper series explores how the dynamical state of galaxy clusters affects their galaxy populations' physical and morphological properties. The primary aim of this first paper is to evaluate the dynamical state of 87 massive ($M_{500} \geq 1.5 \times 10^{14}\,M_\odot$) galaxy clusters at low redshifts ($0.10 \leq z \leq 0.35$). This will allow us to have a well-characterized sample for analyzing physical and morphological properties in our next work. We employ six dynamical state proxies utilizing optical and X-ray imaging data. Principal Component Analysis (PCA) is applied to integrate these proxies effectively, allowing for robust classification of galaxy clusters into relaxed, intermediate, and disturbed states based on their dynamical characteristics. The methodology successfully segregates the galaxy clusters into the three dynamical states. Examination of the galaxy distributions in optical wavelengths and gas distributions in X-ray further confirms the consistency of these classifications. The clusters' dynamical states are statistically distinguishable, providing a clear categorization for further analysis.
Organizing Unstructured Image Collections using Natural Language
Organizing unstructured image collections into semantic clusters is a long-standing challenge. Traditional deep clustering techniques address this by producing a single data partition, whereas multiple clustering methods uncover diverse alternative partitions, but only when users predefine the clustering criteria. Yet expecting users to specify such criteria a priori for large, unfamiliar datasets is unrealistic. In this work, we introduce the task of Open-ended Semantic Multiple Clustering (OpenSMC), which aims to automatically discover clustering criteria from large, unstructured image collections, revealing interpretable substructures without human input. Our framework, X-Cluster: eXploratory Clustering, treats text as a reasoning proxy: it concurrently scans the entire image collection, proposes candidate criteria in natural language, and groups images into meaningful clusters per criterion. To evaluate progress, we release COCO-4c and Food-4c benchmarks, each annotated with four grouping criteria. Experiments show that X-Cluster effectively reveals meaningful partitions and enables downstream applications such as bias discovery and social media image popularity analysis. We will open-source code and data to encourage reproducibility and further research.
Classifying Clustering Schemes
Many clustering schemes are defined by optimizing an objective function defined on the partitions of the underlying set of a finite metric space. In this paper, we construct a framework for studying what happens when we instead impose various structural conditions on the clustering schemes, under the general heading of functoriality. Functoriality refers to the idea that one should be able to compare the results of clustering algorithms as one varies the data set, for example by adding points or by applying functions to it. We show that within this framework, one can prove a theorem analogous to one of J. Kleinberg, in which for example one obtains an existence and uniqueness theorem instead of a non-existence result. We obtain a full classification of all clustering schemes satisfying a condition we refer to as excisiveness. The classification can be changed by varying the notion of maps of finite metric spaces. The conditions occur naturally when one considers clustering as the statistical version of the geometric notion of connected components. By varying the degree of functoriality that one requires from the schemes it is possible to construct richer families of clustering schemes that exhibit sensitivity to density.
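The link between clustering and connected components can be made concrete with a small sketch. Here, points of a finite metric space are joined whenever their distance is at most a scale parameter, and clusters are the connected components of the resulting graph (single-linkage clustering at a fixed scale). This is a standard construction chosen for illustration, not necessarily the one classified in the paper; the function name `threshold_clusters` and the parameter `delta` are assumptions.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Set, TypeVar

P = TypeVar("P")


def threshold_clusters(
    points: Sequence[P],
    dist: Callable[[P, P], float],
    delta: float,
) -> List[Set[int]]:
    """Return clusters (as sets of point indices) at scale delta.

    Two points are linked if dist <= delta; clusters are the connected
    components of that graph, found with a union-find structure.
    """
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= delta:
            parent[find(i)] = find(j)  # merge the two components

    groups: dict = {}
    for i in range(len(points)):
        groups.setdefault(find(i), set()).add(i)
    # Order clusters by their smallest member for a deterministic result.
    return sorted(groups.values(), key=min)
```

For example, on the real line with `points = [0.0, 1.0, 2.0, 10.0, 11.0]` and absolute difference as the metric, `delta = 1.5` yields two clusters, and adding a point can only merge clusters at a given scale, never split them, which is the flavor of comparison across data sets that functoriality formalizes.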
Citizen Science Identification of Isolated Blue Stellar Systems in the Virgo cluster
We present a catalog of 34 new candidate (13 high confidence) isolated, young stellar systems within the Virgo galaxy cluster identified through a citizen science search of public optical and ultraviolet imaging. "Blue blobs" are a class of blue, faint, isolated, extremely low stellar mass, and metal-rich star-forming clouds embedded in the hot intracluster medium of the Virgo cluster. Only six blue blobs were known previously and here we confirm an additional six of our candidates through velocity and metallicity measurements from follow-up optical spectroscopy on the Hobby-Eberly Telescope (HET). Our 13 high confidence candidates (including the six confirmed) have properties consistent with prior known blue blobs and are inconsistent with being low-mass galaxies. Most candidates are concentrated in relatively dense regions, roughly following filamentary structures within the cluster, but avoiding its center. Three of our candidates are likely the stellar counterparts of known 'optically dark' clouds of neutral hydrogen in the cluster, while a further four are widely separated extensions to previously known blue blobs. The properties of our new candidates are consistent with previous conclusions that blue blobs likely originated from ram pressure stripping events; however, their locations in velocity–projected cluster-centric radius phase-space imply that their parent galaxies are not on their first infall into the cluster. Through our ongoing follow-up program with HET we aim to confirm additional candidates; however, detailed understanding of the stellar populations and star formation histories of blue blobs will require JWST observations.
Untangling Gaussian Mixtures
Tangles were originally introduced as a concept to formalize regions of high connectivity in graphs. In recent years, they have also been discovered as a link between structural graph theory and data science: when interpreting similarity in data sets as connectivity between points, finding clusters in the data essentially amounts to finding tangles in the underlying graphs. This paper further explores the potential of tangles in data sets as a means for a formal study of clusters. Real-world data often follow a normal distribution. Accounting for this, we develop a quantitative theory of tangles in data sets drawn from Gaussian mixtures. To this end, we equip the data with a graph structure that models similarity between the points and allows us to apply tangle theory to the data. We provide explicit conditions under which tangles associated with the marginal Gaussian distributions exist asymptotically almost surely. This can be considered as a sufficient formal criterion for the separability of clusters in the data.
LSTM-based Selective Dense Text Retrieval Guided by Sparse Lexical Retrieval
This paper studies fast fusion of dense retrieval and sparse lexical retrieval, and proposes a cluster-based selective dense retrieval method called CluSD guided by sparse lexical retrieval. CluSD takes a lightweight cluster-based approach and exploits the overlap of sparse retrieval results and embedding clusters in a two-stage selection process with an LSTM model to quickly identify relevant clusters while incurring limited extra memory space overhead. CluSD triggers partial dense retrieval and performs cluster-based block disk I/O if needed. This paper evaluates CluSD and compares it with several baselines for searching in-memory and on-disk MS MARCO and BEIR datasets.
CLUSTSEG: Clustering for Universal Segmentation
We present CLUSTSEG, a general, transformer-based framework that tackles different image segmentation tasks (i.e., superpixel, semantic, instance, and panoptic) through a unified neural clustering scheme. Regarding queries as cluster centers, CLUSTSEG is innovative in two aspects: 1) cluster centers are initialized in heterogeneous ways so as to pointedly address task-specific demands (e.g., instance- or category-level distinctiveness), yet without modifying the architecture; and 2) pixel-cluster assignment, formalized in a cross-attention fashion, is alternated with cluster center update, yet without learning additional parameters. These innovations closely link CLUSTSEG to EM clustering and make it a transparent and powerful framework that yields superior results across the above segmentation tasks.
Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure?
To match the blooming demand of generative AI workloads, GPU designers have so far been trying to pack more and more compute and memory into single complex and expensive packages. However, there is growing uncertainty about the scalability of individual GPUs and thus AI clusters, as state-of-the-art GPUs are already displaying packaging, yield, and cooling limitations. We propose to rethink the design and scaling of AI clusters through efficiently-connected large clusters of Lite-GPUs, GPUs with single, small dies and a fraction of the capabilities of larger GPUs. We think recent advances in co-packaged optics can be key in overcoming the communication challenges of distributing AI workloads onto more Lite-GPUs. In this paper, we present the key benefits of Lite-GPUs on manufacturing cost, blast radius, yield, and power efficiency; and discuss systems opportunities and challenges around resource, workload, memory, and network management.
Dynamical evolution of massless particles in star clusters with NBODY6++GPU-MASSLESS: I. Free-floating MLPs
Context. Low-mass bodies, such as comets, asteroids, planetesimals, and free-floating planets, are continuously injected into the intra-cluster environment after expulsion from their host planetary systems. These can be modeled as massless particles (MLPs, hereafter). The dynamics of large populations of MLPs, however, has so far received little attention in the literature. Aims. We investigate the dynamical evolution of MLP populations in star clusters, and characterize their kinematics and ejection rates. Methods. We present NBODY6++GPU-MASSLESS, a modified version of the N-body simulation code NBODY6++GPU, that allows fast integration of star clusters that contain large numbers of massless particles (MLPs). NBODY6++GPU-MASSLESS contains routines specifically directed at the dynamical evolution of low-mass bodies, such as planets. Results. Unlike stars, MLPs do not participate in the mass segregation process. Instead, MLPs mostly follow the gravitational potential of the star cluster, which gradually decreases over time due to stellar ejections and stellar evolution. The dynamical evolution of MLPs is primarily affected by the evolution of the core of the star cluster. This is most apparent in the outer regions for clusters with higher initial densities. High escape rates of MLPs are observed before the core-collapse, after which escape rates remain stable. Denser star clusters undergo a more intense core collapse, but this does not impact the dynamical evolution of MLPs. The speeds of escaping stars are similar to those of escaping MLPs, when disregarding the high-velocity ejections of neutron stars during the first 50 Myr.
Quantum Monte Carlo simulations in the restricted Hilbert space of Rydberg atom arrays
Rydberg atom arrays have emerged as a powerful platform to simulate a number of exotic quantum ground states and phase transitions. To verify these capabilities numerically, we develop a versatile quantum Monte Carlo sampling technique which operates in the reduced Hilbert space generated by enforcing the constraint of a Rydberg blockade. We use the framework of stochastic series expansion and show that in the restricted space, the configuration space of operator strings can be understood as a hard rod gas in d+1 dimensions. We use this mapping to develop cluster algorithms which can be visualized as various non-local movements of rods. We study the efficiency of each of our updates individually and collectively. To elucidate the utility of the algorithm, we show that it can efficiently generate the phase diagram of a Rydberg atom array, to temperatures much smaller than all energy scales involved, on a Kagomé link lattice. This is of broad interest as the presence of a Z_2 spin liquid has been hypothesized recently.
Cluster-Specific Predictions with Multi-Task Gaussian Processes
A model involving Gaussian processes (GPs) is introduced to simultaneously handle multi-task learning, clustering, and prediction for multiple functional data. This procedure acts as a model-based clustering method for functional data as well as a learning step for subsequent predictions for new tasks. The model is instantiated as a mixture of multi-task GPs with common mean processes. A variational EM algorithm is derived for dealing with the optimisation of the hyper-parameters along with the hyper-posteriors' estimation of latent variables and processes. We establish explicit formulas for integrating the mean processes and the latent clustering variables within a predictive distribution, accounting for uncertainty on both aspects. This distribution is defined as a mixture of cluster-specific GP predictions, which enhances performance when dealing with group-structured data. The model handles irregular grids of observations and offers different hypotheses on the covariance structure for sharing additional information across tasks. Performance on both clustering and prediction tasks is assessed through various simulated scenarios and real datasets. The overall algorithm, called MagmaClust, is publicly available as an R package.
Polar nano-clusters in nominally paraelectric ceramics demonstrating high microwave tunability for wireless communication
Dielectric materials with high tunability at microwave frequencies are key components in the design of microwave communication systems. Dense Ba0.6Sr0.4TiO3 (BST) ceramics, with different grain sizes, were prepared in order to optimise the dielectric tunability via polar nano cluster effects. Dielectric permittivity and loss measurements were carried out at both high and low frequencies and were supported by results from X-ray powder diffraction, scanning and transmission electron microscopies, Raman spectroscopy and piezoresponse force microscopy. The concentration of polar nano clusters, whose sizes are found to be in the range 20 to 50 nm, and the dielectric tunability increase with increasing grain size. A novel method for measurement of the microwave tunability in bulk dielectrics is presented. The highest tunability of 32% is achieved in ceramics with an average grain size of 10 μm. The tunability of BST ceramics with applied DC field is demonstrated in a prototype small resonant antenna.
Cluster-lensed supernova yields from the Vera C. Rubin Observatory and Nancy Grace Roman Space Telescope
Through gravitational lensing, galaxy clusters can magnify supernovae (SNe) and create multiple images of the same SN. This enables measurements of cosmological parameters, which will be increasingly important in light of upcoming telescopic surveys. We study the prospects of detecting strongly lensed SNe in cluster fields with the Nancy Grace Roman Space Telescope (Roman)'s High Latitude Time Domain Survey (HLTDS) and the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST). We employed two approaches: one focusing on known multiply imaged galaxies behind clusters, along with the SN rates specific to those galaxies, and another based on the expected number of lensed SNe exploding in a given volume behind a galaxy cluster. We collected all the clusters in the literature that feature a well-constrained lens model and multiply imaged galaxies behind clusters with high-quality data for the lensed galaxies. This allowed us to determine the SN rate for each galaxy. We provide predictions for 46 clusters visible to the Vera C. Rubin Observatory, as well as for 9 observable by Roman's HLTDS, depending on whether the clusters fall within the survey's observing field. We predict that the number of multiply imaged SNe discovered by LSST in its first three years is $3.95 \pm 0.89$ from the first approach or $4.94 \pm 1.02$ from the second. For the HLTDS, the expected number of multiply imaged SNe ranges from $0.38 \pm 0.15$ to $5.2 \pm 2.2$, depending on the specific cluster observed; however, the fields to be targeted remain a matter of discussion. We conclude that LSST offers great prospects for detecting multiply imaged SNe. Our predictions are effectively lower limits, as we only considered the most massive and well-studied clusters. We provide a recommendation for HLTDS observing field selection, namely: either MACS J0553.4-3342 or Abell 1758a should be observed by the survey.
Accelerated Hierarchical Density Clustering
We present an accelerated algorithm for hierarchical density-based clustering. Our new algorithm improves upon HDBSCAN*, which itself provided a significant qualitative improvement over the popular DBSCAN algorithm. The accelerated HDBSCAN* algorithm provides comparable performance to DBSCAN, while supporting variable density clusters, and eliminating the need for the difficult-to-tune distance scale parameter. This makes accelerated HDBSCAN* the default choice for density-based clustering. Library available at: https://github.com/scikit-learn-contrib/hdbscan
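As background for how HDBSCAN* can dispense with a fixed distance scale, the sketch below computes the mutual reachability distance on which the HDBSCAN* hierarchy is built: each point gets a core distance (distance to its k-th nearest neighbor), and the distance between two points is raised to at least the larger of their core distances. This is a brute-force, quadratic-time illustration and not the accelerated algorithm from the paper; the function names are assumptions, and implementations differ on whether a point counts as its own neighbor (here it does not).

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def core_distances(points: Sequence[Point], k: int) -> List[float]:
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores


def mutual_reachability(points: Sequence[Point], k: int) -> List[List[float]]:
    """Pairwise matrix of max(core(a), core(b), d(a, b)), with a zero diagonal."""
    cores = core_distances(points, k)
    n = len(points)
    return [
        [
            0.0
            if i == j
            else max(cores[i], cores[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Points in sparse regions have large core distances, so they are pushed away from everything else, while distances inside dense regions are left nearly unchanged. A minimum spanning tree over this matrix then yields the cluster hierarchy across all density levels.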
CLAMS: A Cluster Ambiguity Measure for Estimating Perceptual Variability in Visual Clustering
Visual clustering is a common perceptual task in scatterplots that supports diverse analytics tasks (e.g., cluster identification). However, even with the same scatterplot, the ways of perceiving clusters (i.e., conducting visual clustering) can differ due to the differences among individuals and ambiguous cluster boundaries. Although such perceptual variability casts doubt on the reliability of data analysis based on visual clustering, we lack a systematic way to efficiently assess this variability. In this research, we study perceptual variability in conducting visual clustering, which we call Cluster Ambiguity. To this end, we introduce CLAMS, a data-driven visual quality measure for automatically predicting cluster ambiguity in monochrome scatterplots. We first conduct a qualitative study to identify key factors that affect the visual separation of clusters (e.g., proximity or size difference between clusters). Based on study findings, we deploy a regression module that estimates the human-judged separability of two clusters. Then, CLAMS predicts cluster ambiguity by analyzing the aggregated results of all pairwise separability between clusters that are generated by the module. CLAMS outperforms widely-used clustering techniques in predicting ground truth cluster ambiguity. Meanwhile, CLAMS exhibits performance on par with human annotators. We conclude our work by presenting two applications for optimizing and benchmarking data mining techniques using CLAMS. The interactive demo of CLAMS is available at clusterambiguity.dev.
Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design
We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing MI300X GPUs with Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first such characterization at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model, ZAYA1 (760M active, 8.3B total parameters MoE), which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.
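Collective microbenchmarks of this kind are commonly summarized as bandwidth per message size and GPU count. The abstract does not say which metric the authors report, so the sketch below simply follows the convention documented for the nccl-tests benchmark suite: algorithm bandwidth is bytes divided by time, and bus bandwidth rescales it by a per-collective factor so that results are comparable across rank counts. The function names are assumptions, and `message_bytes` is taken to be the full buffer size as that suite defines it.

```python
from typing import Callable, Dict

# Correction factors from algorithm bandwidth to bus bandwidth,
# following the nccl-tests convention (n is the number of ranks).
BUSBW_FACTOR: Dict[str, Callable[[int], float]] = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
}


def algorithm_bandwidth(message_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth in GB/s: message size divided by elapsed time."""
    return message_bytes / seconds / 1e9


def bus_bandwidth(
    collective: str, message_bytes: float, seconds: float, n_ranks: int
) -> float:
    """Bus bandwidth in GB/s for one measured collective call."""
    factor = BUSBW_FACTOR[collective](n_ranks)
    return algorithm_bandwidth(message_bytes, seconds) * factor
```

For instance, a 1 GB all-reduce across 8 ranks that completes in 10 ms has an algorithm bandwidth of about 100 GB/s and a bus bandwidth of about 175 GB/s under this convention.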
