	"""composer_replication.diloco.serverless — run Decoupled DiLoCo across
	serverless training systems (Modal, HuggingFace Jobs, SageMaker, k8s, …).

Per ADR-005, the design rests on two abstractions:

1. `ServerlessExecutor` Protocol: a uniform interface for spinning up
   N replicas on different cloud backends. Each backend (Modal, HF Jobs,
   SageMaker, etc.) gets a concrete adapter that implements the Protocol.

2. `ObjectStoreAllReduce`: fsspec-backed pseudo-gradient exchange that
   replaces the in-process `torchft.Manager.allreduce` call. The
   communication pattern is one S3 PutObject plus N GetObjects once per
   ~500-1000 inner steps, which matches DiLoCo's actual sync cadence
   (arXiv:2311.08105 §3.2). Bandwidth: ~2 GB per replica every 30 minutes
   for a 1B-param bf16 model (1e9 params x 2 bytes), well within S3
   free-tier.
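
As a rough illustration of the exchange described in item 2, the sketch
below uses a local directory as a stand-in for the object store and plain
Python lists as pseudo-gradients. The helper names `put_pseudo_gradient`
and `average_pseudo_gradients`, the key layout, and the JSON encoding are
assumptions made for this example, not the actual `ObjectStoreAllReduce`
implementation. The last line reproduces the ~2 GB estimate.

```python
import json
import tempfile
from pathlib import Path


def put_pseudo_gradient(store: Path, step: int, rank: int, grad: list[float]) -> None:
    # One "PutObject": write this replica's pseudo-gradient under a per-step key.
    key = store / f"step{step:06d}" / f"rank{rank}.json"
    key.parent.mkdir(parents=True, exist_ok=True)
    tmp = key.with_suffix(".tmp")
    tmp.write_text(json.dumps(grad))
    tmp.replace(key)  # move into place so readers never see a partial file


def average_pseudo_gradients(store: Path, step: int, n_replicas: int) -> list[float]:
    # N "GetObjects": read every replica's contribution, then average elementwise.
    grads = [
        json.loads((store / f"step{step:06d}" / f"rank{r}.json").read_text())
        for r in range(n_replicas)
    ]
    return [sum(values) / n_replicas for values in zip(*grads)]


with tempfile.TemporaryDirectory() as tmpdir:
    store = Path(tmpdir)  # hypothetical stand-in for an s3:// prefix
    for rank, grad in enumerate([[1.0, 2.0], [3.0, 6.0]]):
        put_pseudo_gradient(store, step=500, rank=rank, grad=grad)
    averaged = average_pseudo_gradients(store, step=500, n_replicas=2)

# Upload volume per sync for a 1B-parameter bf16 model (2 bytes per parameter).
upload_gb = 1_000_000_000 * 2 / 1e9
```

A real exchange would also have to wait until all N objects for a step
exist before averaging; that polling is omitted here.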

The framework's existing `composer_replication.diloco.make_diloco_outer_loop`
wraps `torchft.local_sgd.DiLoCo`. To run that across N serverless replicas:

>>> from composer_replication.diloco.serverless import (
...     LocalProcessExecutor,
...     ObjectStoreAllReduce,
... )
>>> rendezvous = ObjectStoreAllReduce("s3://my-bucket/diloco-runs/run42/")
>>> executor = LocalProcessExecutor()
>>> handles = executor.launch_replicas(
...     n_replicas=4,
...     entrypoint="composer_replication.diloco.serverless.replica_entrypoint",
...     entrypoint_args={"rendezvous": rendezvous.uri, "rank_env": "REPLICA_RANK"},
... )
>>> result = executor.collect(handles, timeout=3600)
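
The executor interface exercised in the example might look roughly like
the Protocol below. The method names and keyword arguments are taken from
the example. The `Sketch` class names, the handle fields, the return
types, and the toy `EchoExecutor` backend are hypothetical and only
illustrate how structural typing lets each backend plug in.

```python
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class ReplicaHandleSketch:
    # Hypothetical fields; the real ReplicaHandle may carry different data.
    rank: int
    backend_id: str


@runtime_checkable
class ServerlessExecutorSketch(Protocol):
    # Assumed signatures, inferred from the usage example only.
    def launch_replicas(
        self, n_replicas: int, entrypoint: str, entrypoint_args: dict[str, Any]
    ) -> list[ReplicaHandleSketch]: ...

    def collect(
        self, handles: list[ReplicaHandleSketch], timeout: float
    ) -> dict[int, Any]: ...


class EchoExecutor:
    # Toy backend: launches nothing and reports "ok" for every replica.
    def launch_replicas(self, n_replicas, entrypoint, entrypoint_args):
        return [
            ReplicaHandleSketch(rank=r, backend_id=f"echo-{r}")
            for r in range(n_replicas)
        ]

    def collect(self, handles, timeout):
        return {handle.rank: "ok" for handle in handles}


executor = EchoExecutor()
handles = executor.launch_replicas(
    n_replicas=2,
    entrypoint="pkg.module",  # placeholder entrypoint path
    entrypoint_args={"rank_env": "REPLICA_RANK"},
)
results = executor.collect(handles, timeout=10)
```

`EchoExecutor` never inherits from the Protocol; it satisfies it simply
by providing both methods, which is what would let a Modal or HF Jobs
adapter be swapped in without changing the calling code.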

Module layout:
- `executor.py`: `ServerlessExecutor` Protocol + base classes + `LocalProcessExecutor`
- `allreduce.py`: `ObjectStoreAllReduce` + `MockManager` (drops into torchft path)
- `modal.py`: `ModalExecutor` (skeleton; functional only when modal-client is available)
- `hf_jobs.py`: `HFJobsExecutor` (skeleton; uses huggingface_hub.run_job)
- `replica_entrypoint.py`: script each replica runs (loaded from object store)

Optional dependency: `pip install -e ".[serverless]"` pulls in fsspec, s3fs,
and gcsfs. The Modal and HF Jobs adapters require `modal` and
`huggingface_hub` respectively; both are checked at adapter init time, not
at module import.
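
One way to defer an optional-dependency check to adapter init time is
sketched below using `importlib.util.find_spec`. The names
`require_module`, `OptionalDependencyError`, and the two adapter classes
are hypothetical; the real adapters may perform the check differently.

```python
import importlib.util


class OptionalDependencyError(ImportError):
    # Raised when an adapter is constructed without its backing package.
    pass


def require_module(module_name: str, install_hint: str) -> None:
    # find_spec returns None for a missing top-level module without importing it.
    if importlib.util.find_spec(module_name) is None:
        raise OptionalDependencyError(
            f"{module_name!r} is not installed; try: {install_hint}"
        )


class LazyAdapterSketch:
    # Defining the class costs nothing; the check runs only in __init__.
    REQUIRED_MODULE = "json"  # stdlib module used so the example always passes

    def __init__(self) -> None:
        require_module(self.REQUIRED_MODULE, "pip install <package>")


class MissingAdapterSketch(LazyAdapterSketch):
    REQUIRED_MODULE = "surely_not_installed_module_xyz"  # deliberately absent


ok_adapter = LazyAdapterSketch()
try:
    MissingAdapterSketch()
    failed_at_init = False
except OptionalDependencyError:
    failed_at_init = True
```

With this pattern, importing the package succeeds even when a backend's
client library is missing, and the error surfaces only when that
backend's adapter is actually constructed.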
	"""
from __future__ import annotations

from composer_replication.diloco.serverless.allreduce import (
    MockManager,
    ObjectStoreAllReduce,
)
from composer_replication.diloco.serverless.executor import (
    LocalProcessExecutor,
    ReplicaHandle,
    ServerlessExecutor,
)
from composer_replication.diloco.serverless.hf_jobs import HFJobsExecutor
from composer_replication.diloco.serverless.modal import ModalExecutor

__all__ = [
    "HFJobsExecutor",
    "LocalProcessExecutor",
    "MockManager",
    "ModalExecutor",
    "ObjectStoreAllReduce",
    "ReplicaHandle",
    "ServerlessExecutor",
]