"""composer_replication.diloco.serverless — run Decoupled DiLoCo across
serverless training systems (Modal, HuggingFace Jobs, SageMaker, k8s, …).

Per ADR-005, the design rests on two abstractions:

1. `ServerlessExecutor` Protocol — a uniform interface for spinning up
   N replicas on different cloud backends. Each backend (Modal, HF Jobs,
   SageMaker, etc.) gets a concrete adapter that implements the Protocol.

2. `ObjectStoreAllReduce` — fsspec-backed pseudo-gradient exchange that
   replaces the in-process `torchft.Manager.allreduce` call. Each
   replica issues one S3 `PutObject` and N `GetObject` calls once per
   ~500-1000 inner steps, which matches DiLoCo's actual sync cadence
   (paper arXiv:2311.08105 §3.2). Upload volume: ~2 GB per replica
   roughly every 30 minutes for a 1B-parameter model in bf16 (2 bytes
   per parameter), well within S3 free-tier.
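
The put-then-get exchange can be sketched with a local directory standing
in for the bucket. This is a minimal illustration, not the actual
`ObjectStoreAllReduce` implementation: the helper names, the
`round_<k>/rank_<r>.bin` key layout, and the float64 encoding are all
assumptions made for the example, and a real exchange would also have to
wait for slow replicas before reading.

```python
import struct
import tempfile
from pathlib import Path


def put_pseudo_gradient(root: Path, round_idx: int, rank: int, values: list[float]) -> None:
    # One "PutObject": each replica writes only its own object for this round.
    round_dir = root / f"round_{round_idx}"
    round_dir.mkdir(parents=True, exist_ok=True)
    tmp = round_dir / f"rank_{rank}.bin.tmp"
    tmp.write_bytes(struct.pack(f"<{len(values)}d", *values))
    # Publish under the final name only once the payload is fully written.
    tmp.replace(round_dir / f"rank_{rank}.bin")


def average_pseudo_gradients(root: Path, round_idx: int, n_replicas: int, n_values: int) -> list[float]:
    # N "GetObjects": read every replica's object and average element-wise.
    total = [0.0] * n_values
    for rank in range(n_replicas):
        blob = (root / f"round_{round_idx}" / f"rank_{rank}.bin").read_bytes()
        for i, value in enumerate(struct.unpack(f"<{n_values}d", blob)):
            total[i] += value
    return [t / n_replicas for t in total]


def payload_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    # bf16 stores each parameter in 2 bytes, so 1B parameters is about 2 GB.
    return n_params * bytes_per_param
```

Every replica computes the same average from the same N objects, so no
coordinator is needed beyond the shared prefix.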

The framework's existing `composer_replication.diloco.make_diloco_outer_loop`
wraps `torchft.local_sgd.DiLoCo`. To run that across N serverless replicas:

    >>> from composer_replication.diloco.serverless import (
    ...     LocalProcessExecutor,
    ...     ObjectStoreAllReduce,
    ... )
    >>> rendezvous = ObjectStoreAllReduce("s3://my-bucket/diloco-runs/run42/")
    >>> executor = LocalProcessExecutor()
    >>> handles = executor.launch_replicas(
    ...     n_replicas=4,
    ...     entrypoint="composer_replication.diloco.serverless.replica_entrypoint",
    ...     entrypoint_args={"rendezvous": rendezvous.uri, "rank_env": "REPLICA_RANK"},
    ... )
    >>> result = executor.collect(handles, timeout=3600)
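
The two calls used above suggest an interface along the following lines.
This is a hedged sketch inferred from that usage only: `ExecutorSketch`,
`HandleSketch`, and `InlineExecutor` are hypothetical names, and the real
`ServerlessExecutor` Protocol and `ReplicaHandle` in `executor.py` may
carry different fields and signatures.

```python
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class HandleSketch:
    # Hypothetical stand-in for ReplicaHandle: identifies one launched replica.
    rank: int
    result: Any = None


@runtime_checkable
class ExecutorSketch(Protocol):
    # Structural interface: any class with these two methods qualifies.
    def launch_replicas(
        self, n_replicas: int, entrypoint: str, entrypoint_args: dict[str, Any]
    ) -> list[HandleSketch]: ...

    def collect(self, handles: list[HandleSketch], timeout: float) -> list[Any]: ...


class InlineExecutor:
    # Toy backend: records what each replica would run instead of starting it.
    def launch_replicas(self, n_replicas, entrypoint, entrypoint_args):
        return [
            HandleSketch(rank=r, result={"rank": r, "entrypoint": entrypoint})
            for r in range(n_replicas)
        ]

    def collect(self, handles, timeout):
        return [h.result for h in handles]
```

Because the Protocol is structural, a backend adapter does not need to
inherit from it; matching the method names is enough for
`isinstance(adapter, ExecutorSketch)` to pass.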

Module layout:
- `executor.py`     — `ServerlessExecutor` Protocol + base classes + `LocalProcessExecutor`
- `allreduce.py`    — `ObjectStoreAllReduce` + `MockManager` (drops into torchft path)
- `modal.py`        — `ModalExecutor` (skeleton; requires the `modal` client)
- `hf_jobs.py`      — `HFJobsExecutor` (skeleton; uses huggingface_hub.run_job)
- `replica_entrypoint.py` — script each replica runs (loaded from object store)

Optional dependencies: `pip install -e ".[serverless]"` pulls in fsspec,
s3fs, and gcsfs. The Modal and HF Jobs adapters require `modal` and
`huggingface_hub` respectively; both are checked when the adapter is
instantiated, not when this module is imported.
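
One way to defer a dependency check to adapter construction is sketched
below. It is an illustration under stated assumptions, not the adapters'
actual code: `require_package`, `OptionalDependencyError`, and
`AdapterSketch` are hypothetical names.

```python
import importlib.util


class OptionalDependencyError(ImportError):
    # Raised when an adapter is constructed without its backend package.
    pass


def require_package(name: str, install_hint: str) -> None:
    # find_spec locates a top-level package without importing it.
    if importlib.util.find_spec(name) is None:
        raise OptionalDependencyError(
            f"{name!r} is required for this adapter; install it with: {install_hint}"
        )


class AdapterSketch:
    def __init__(self, package: str = "modal"):
        # The check runs here, so importing this module never fails
        # just because the optional package is missing.
        require_package(package, f"pip install {package}")
        self.package = package
```

With this shape, importing the package always succeeds and the error
surfaces only when a user actually picks the backend that needs the
missing dependency.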
"""
from __future__ import annotations

from composer_replication.diloco.serverless.allreduce import (
    MockManager,
    ObjectStoreAllReduce,
)
from composer_replication.diloco.serverless.executor import (
    LocalProcessExecutor,
    ReplicaHandle,
    ServerlessExecutor,
)
from composer_replication.diloco.serverless.hf_jobs import HFJobsExecutor
from composer_replication.diloco.serverless.modal import ModalExecutor

__all__ = [
    "HFJobsExecutor",
    "LocalProcessExecutor",
    "MockManager",
    "ModalExecutor",
    "ObjectStoreAllReduce",
    "ReplicaHandle",
    "ServerlessExecutor",
]