"""composer_replication.diloco.serverless — run Decoupled DiLoCo across
serverless training systems (Modal, HuggingFace Jobs, SageMaker, k8s, …).
Per ADR-005, the design rests on two abstractions:
1. `ServerlessExecutor` Protocol — a uniform interface for spinning up
N replicas on different cloud backends. Each backend (Modal, HF Jobs,
SageMaker, etc.) gets a concrete adapter that implements the Protocol.
2. `ObjectStoreAllReduce` — fsspec-backed pseudo-gradient exchange that
replaces the in-process `torchft.Manager.allreduce` call. The
communication pattern is `S3 PutObject + N GetObjects` once per
~500-1000 inner steps, which matches DiLoCo's actual sync cadence
(paper arXiv:2311.08105 §3.2). Bandwidth: ~2 GB uploaded per replica every
~30 minutes for a 1B-param bf16 model (1e9 params x 2 bytes), well within S3 free-tier.
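A minimal sketch of the put-then-get-all exchange described in item 2, using a
temporary local directory as a stand-in for the bucket. The helper name
`exchange_and_average`, the `round_*/rank_*.json` layout, and the JSON payload
are illustrative assumptions, not the actual `ObjectStoreAllReduce` API:

```python
import json
import tempfile
from pathlib import Path


def exchange_and_average(root, round_id, rank, n_replicas, local_delta):
    # Hypothetical helper: publish this replica's pseudo-gradient, then
    # average whatever all n_replicas have published for this round.
    round_dir = Path(root) / f"round_{round_id:06d}"
    round_dir.mkdir(parents=True, exist_ok=True)

    # One "put": write to a temp name, then rename, so a reader on this
    # local stand-in never sees a partially written file.
    tmp = round_dir / f"rank_{rank}.json.tmp"
    tmp.write_text(json.dumps(local_delta))
    tmp.replace(round_dir / f"rank_{rank}.json")

    # N "gets": a real replica would poll until every peer has published;
    # this sketch simply returns None if any payload is still missing.
    paths = [round_dir / f"rank_{r}.json" for r in range(n_replicas)]
    if not all(p.exists() for p in paths):
        return None
    deltas = [json.loads(p.read_text()) for p in paths]
    # Element-wise mean across replicas.
    return [sum(vals) / n_replicas for vals in zip(*deltas)]


with tempfile.TemporaryDirectory() as bucket:
    first = exchange_and_average(bucket, 0, rank=0, n_replicas=2, local_delta=[1.0, 2.0])
    second = exchange_and_average(bucket, 0, rank=1, n_replicas=2, local_delta=[3.0, 6.0])
```

Here `first` is None because rank 1 has not published yet, while `second` sees
both payloads and returns their element-wise mean.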
The framework's existing `composer_replication.diloco.make_diloco_outer_loop`
wraps `torchft.local_sgd.DiLoCo`. To run that across N serverless replicas:
>>> from composer_replication.diloco.serverless import (
... LocalProcessExecutor,
... ObjectStoreAllReduce,
... )
>>> rendezvous = ObjectStoreAllReduce("s3://my-bucket/diloco-runs/run42/")
>>> executor = LocalProcessExecutor()
>>> handles = executor.launch_replicas(
... n_replicas=4,
... entrypoint="composer_replication.diloco.serverless.replica_entrypoint",
... entrypoint_args={"rendezvous": rendezvous.uri, "rank_env": "REPLICA_RANK"},
... )
>>> result = executor.collect(handles, timeout=3600)
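The calls above imply an executor interface roughly like the following. This is
a sketch only: the method names `launch_replicas` and `collect` come from the
example, but the signatures, `ReplicaHandleSketch`, `ExecutorSketch`, and
`InlineExecutor` are hypothetical and may differ from what `executor.py` defines:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Protocol, runtime_checkable


@dataclass
class ReplicaHandleSketch:
    # Hypothetical stand-in for ReplicaHandle: identifies one replica.
    rank: int
    result: Any = None


@runtime_checkable
class ExecutorSketch(Protocol):
    # Method names mirror the usage example; the signatures are guesses.
    def launch_replicas(self, n_replicas: int, entrypoint: str,
                        entrypoint_args: Dict[str, Any]) -> List[ReplicaHandleSketch]: ...

    def collect(self, handles: List[ReplicaHandleSketch],
                timeout: float) -> List[Any]: ...


class InlineExecutor:
    # Toy backend: nothing is actually spawned; each handle just records
    # the rank and the arguments a real replica would have received.
    def launch_replicas(self, n_replicas, entrypoint, entrypoint_args):
        return [
            ReplicaHandleSketch(
                rank=r,
                result={"entrypoint": entrypoint, "rank": r, **entrypoint_args},
            )
            for r in range(n_replicas)
        ]

    def collect(self, handles, timeout):
        # A real backend would block up to `timeout` seconds per replica.
        return [h.result for h in handles]


executor = InlineExecutor()
handles = executor.launch_replicas(2, "pkg.module", {"rendezvous": "memory://run/"})
results = executor.collect(handles, timeout=5)
```

Because the Protocol is structural, `InlineExecutor` satisfies it without
inheriting from it, which is what lets each cloud backend ship as an
independent adapter.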
Module layout:
- `executor.py` — `ServerlessExecutor` Protocol + base classes + `LocalProcessExecutor`
- `allreduce.py` — `ObjectStoreAllReduce` + `MockManager` (drops into torchft path)
- `modal.py` — `ModalExecutor` (skeleton; usable only when the `modal` client is installed)
- `hf_jobs.py` — `HFJobsExecutor` (skeleton; uses `huggingface_hub.run_job`)
- `replica_entrypoint.py` — script each replica runs (loaded from object store)
Optional dependencies: `pip install -e .[serverless]` pulls in fsspec, s3fs, and
gcsfs. The Modal and HF Jobs adapters require `modal` and `huggingface_hub`
respectively; both are checked when an adapter is instantiated, not at module import.
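One way to defer a dependency check to adapter construction is sketched below.
`require_module`, `OptionalDependencyError`, and `LazyAdapterSketch` are
hypothetical names for illustration; the real adapters may do this differently:

```python
import importlib.util


class OptionalDependencyError(ImportError):
    pass


def require_module(name, install_hint):
    # Look the module up without importing it; fail with a readable message.
    if importlib.util.find_spec(name) is None:
        raise OptionalDependencyError(
            f"{name!r} is required for this adapter; install it with: {install_hint}"
        )


class LazyAdapterSketch:
    # Importing this class costs nothing; only constructing an instance
    # checks that the backend package is available.
    def __init__(self, backend_module):
        require_module(backend_module, f"pip install {backend_module}")
        self.backend_module = backend_module


ok = LazyAdapterSketch("json")  # stdlib module, always present
try:
    LazyAdapterSketch("surely_not_installed_backend_xyz")
    missing_raised = False
except OptionalDependencyError:
    missing_raised = True
```

With this pattern the package-level re-exports stay importable on machines that
have neither backend installed.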
"""
from __future__ import annotations
from composer_replication.diloco.serverless.allreduce import (
MockManager,
ObjectStoreAllReduce,
)
from composer_replication.diloco.serverless.executor import (
LocalProcessExecutor,
ReplicaHandle,
ServerlessExecutor,
)
from composer_replication.diloco.serverless.hf_jobs import HFJobsExecutor
from composer_replication.diloco.serverless.modal import ModalExecutor
__all__ = [
"HFJobsExecutor",
"LocalProcessExecutor",
"MockManager",
"ModalExecutor",
"ObjectStoreAllReduce",
"ReplicaHandle",
"ServerlessExecutor",
]