A single agent feels great when you are just experimenting. The moment you start splitting real work across machines, though, the problem stops being about the model and starts being about coordination.
One machine collects material, one reviews it, one handles the final expression, and a fourth owns orchestration, nudging, and closeout. Once those responsibilities live on separate VPS nodes, the real questions become much more practical: how the machines talk to each other, how tasks move forward, how files are handed off, how blocking gets surfaced, and how the controller knows when a step is actually done.
That was the real point of this build. I was not trying to make a few agents talk to each other for fun. I was trying to turn that setup into a multi-machine workflow. The final shape is not fancy: Tailscale connects the machines into one private network, and a constrained task bus carries tasks, attachments, and status.
The point is not “many bots talking at once”
When people hear “multi-agent,” they often picture a room full of bots replying to each other.
That looks lively, but it breaks down quickly in real engineering work. Responsibilities get blurry, intermediate artifacts go missing, status becomes hard to trust, and in the end it looks like everything is moving while nobody is actually delivering anything.
So I collapsed the system into four fixed roles:
- main: orchestration, dispatch, nudging, validation, and closeout
- nano: first-pass research and raw materials
- sg: review and structural convergence
- jp: final expression and acceptance
Once the roles are clear, the system gets easier to reason about. This is not group chat. It is relay work.
Why I chose Tailscale as the base layer
The reason is simple: I did not want to expose a pile of coordination ports to the public internet just to let a few machines cooperate.
What Tailscale gives me is not “something cool.” It solves the messier and more important problem first: it pulls scattered machines into one controlled private network.
That means each node can bind its task-bus listener to a tailnet address instead of opening it publicly:
- main -> XXXX:18100
- nano -> XXXX:18101
- sg -> XXXX:18102
- jp -> XXXX:18103
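For concreteness, here is a minimal sketch of what "bind to the tailnet address" means in practice. The IP is a hypothetical stand-in for the redacted addresses above, and the handler is stripped down to /health only:

```python
# Minimal sketch: bind the bus listener to the node's tailnet address,
# never 0.0.0.0. The IP below is a hypothetical stand-in for the
# redacted tailnet addresses (XXXX) in the mapping above.
from http.server import BaseHTTPRequestHandler, HTTPServer

TAILNET_ADDR = "100.64.0.2"  # hypothetical tailnet IP for this node
PORT = 18101                 # nano's slot in the mapping

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_error(404)

# Because the socket is bound to the tailnet IP, the port is simply
# unreachable from the public internet; only tailnet peers can connect.
HTTPServer((TAILNET_ADDR, PORT), HealthHandler).serve_forever()
```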
That separation matters. Human-facing services can keep using public entry points or Tunnel. Machine-to-machine coordination can stay on the internal network.
Why the task bus is intentionally constrained
I never wanted this to become a remote control platform.
Right now the bus only does four things:
- POST /task
- POST /upload
- GET /status
- GET /health
In plain terms, it accepts tasks, accepts attachments, reports status, and does not perform remote execution.
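To make the shape concrete, here is a minimal Flask-style sketch of that four-endpoint surface. The storage layout, field names, and in-memory status handling are my illustration, not the actual bus code:

```python
# A minimal Flask-style sketch of the four-endpoint surface. Storage
# layout, field names, and in-memory status are illustrative, not the
# actual bus implementation.
import json, pathlib, uuid
from flask import Flask, request, jsonify

app = Flask(__name__)
INBOX = pathlib.Path("inbox")
INBOX.mkdir(exist_ok=True)
STATUS = {}  # task_id -> state; a real bus would persist this

@app.post("/task")
def accept_task():
    task = request.get_json()
    task_id = task.get("id") or str(uuid.uuid4())
    (INBOX / f"{task_id}.json").write_text(json.dumps(task))
    STATUS[task_id] = "queued"
    return jsonify({"id": task_id, "state": "queued"})  # the ack

@app.post("/upload")
def accept_upload():
    f = request.files["file"]
    f.save(INBOX / f.filename)  # a real bus would sanitize the name
    return jsonify({"saved": f.filename})

@app.get("/status")
def report_status():
    return jsonify(STATUS)  # report state; never execute anything

@app.get("/health")
def health():
    return jsonify({"ok": True})

# Note what is deliberately absent: no /exec, no /run, no shell access.
```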
That limit is deliberate. In early multi-machine systems, the biggest risk is usually not “not enough automation.” It is opening the execution surface too early. If the boundaries are still fuzzy and nodes can already call into each other freely, the system gets messy fast, and the security story gets worse even faster.
So the first-stage goal is intentionally narrow: make messages, attachments, and status reliable first; add more capability later.
This is already past the idea stage
At this point, the system has real directories, real consumers, and real outputs.
Each machine has the same basic structure:
- inbox/
- artifacts/
- archive/
- logs/
And the key consumers are already in place:
- de_consumer.py
- sg_consumer.py
- jp_consumer.py
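All three follow roughly the same loop: poll inbox/, do the stage's work, write an artifact, archive the input. A stripped-down sketch of that shape; function names and the output filename are illustrative, not lifted from the real consumer code:

```python
# Sketch of the loop the consumers share: poll inbox/, do the stage's
# work, write an artifact, archive the input. Names are illustrative.
import json, pathlib, time

INBOX, ARTIFACTS, ARCHIVE = map(pathlib.Path, ("inbox", "artifacts", "archive"))
for d in (INBOX, ARTIFACTS, ARCHIVE):
    d.mkdir(exist_ok=True)

def process(task: dict) -> str:
    # stage-specific work: research on nano, review on sg, final pass on jp
    return f"handled task {task['id']}"

while True:
    for task_file in sorted(INBOX.glob("*.json")):
        task = json.loads(task_file.read_text())
        result = process(task)
        # evidence first: the artifact hits disk before the task moves on
        (ARTIFACTS / f"{task['id']}-output.txt").write_text(result)
        task_file.rename(ARCHIVE / task_file.name)
    time.sleep(5)
```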
The relay path is also real and already working end to end:
main -> nano -> sg -> jp -> main
More importantly, this is not just something that “seems to work.” There are actual intermediate artifacts on disk:
- candidate-list.md
- de-raw-output.txt
- sg-review.md
- jp-final.txt
That is why I care more and more about one thing: evidence.
In a multi-machine workflow, no files, no logs, and no verifiable acknowledgment means no real progress. That rule sounds strict, but once more nodes get involved, it becomes one of the easiest ways to avoid fake progress.
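In code terms, the rule is a closeout check on main, roughly like the sketch below. The artifact path and the /status response shape are assumptions:

```python
# Sketch of "no evidence, no progress" as a closeout check on main.
# The artifact path and the /status response shape are assumptions.
import pathlib
import requests

def step_is_done(node_url: str, task_id: str, artifact: pathlib.Path) -> bool:
    # a file on disk is necessary but not sufficient
    if not artifact.exists() or artifact.stat().st_size == 0:
        return False
    # the node must also acknowledge the state itself
    state = requests.get(f"{node_url}/status", timeout=5).json().get(task_id)
    return state == "done"

# e.g. jp's stage only counts once jp-final.txt exists on disk AND
# jp's /status reports the task as done.
```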
The most valuable part was not success, but the mistakes
If all I say is “we built multi-machine collaboration,” this post is not worth much. The useful part is what broke and what that taught me.
The first mistake was treating a cross-machine team like local subagents.
Those two setups may both sit under the broad label of “multi-agent,” but they are not the same thing. Local subagents behave like branches inside one runtime. main / nano / sg / jp are independent nodes distributed across different machines. Dispatch, observability, and failure handling all work differently.
The second mistake was confusing “delivered” with “done.”
A file landing in inbox, or even being marked processed, does not mean the step actually produced a usable result. Real progress only starts when artifacts are written, status comes back, and the chain closes cleanly.
The third mistake was that old bridge-style shortcuts always disguise themselves as the shortest path.
When time is tight, it is very tempting to fall back to the old habit: SSH into a node, drop JSON into its inbox, and move on. That does get the task there, but it does not give you a proper loop. Acknowledgment, status, escalation, and auditability all get weaker.
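For contrast, here is the proper handoff as a sketch: attachment via /upload, task via /task, and a returned id you can poll later. The tailnet address, filename, and payload fields are hypothetical:

```python
# Sketch of the proper handoff: attachment via /upload, task via /task,
# and a returned id that can be polled later. Address, filename, and
# payload fields are hypothetical.
import requests

NANO = "http://100.64.0.2:18101"  # hypothetical tailnet address for nano

with open("task-input.md", "rb") as f:
    requests.post(f"{NANO}/upload", files={"file": f}, timeout=10)

r = requests.post(f"{NANO}/task", json={
    "id": "research-042",
    "attachment": "task-input.md",
}, timeout=10)
print(r.json())  # the acknowledgment an SSH inbox drop never gives you
```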
The fourth mistake was the one that stuck with me the most: using the right transport is not enough if you break the contract.
At one point I manually took over the nano step. I thought switching to Tailscale /upload + /task was enough. It was not. I also changed the attachment naming on my own and turned it into a new research pack format. But sg_consumer.py only knew how to consume the original fixed filenames, so even though the task arrived, the consumer still blocked.
That locked in a rule for me:
Manually taking over one stage does not give you permission to invent a new intermediate protocol. You still have to feed the existing consumer what it already expects.
Once I switched back to the standard format, the chain recovered and sg -> jp -> main completed normally.
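One cheap way to enforce that lesson is for each consumer to reject unexpected attachment names loudly instead of blocking silently. A sketch, with an assumed expected-input set for sg:

```python
# Sketch of enforcing the contract on the consumer side: reject
# unexpected attachment names loudly instead of blocking silently.
# The EXPECTED set is an assumption about what sg's stage consumes.
import pathlib
import sys

EXPECTED = {"de-raw-output.txt", "candidate-list.md"}

def validate_inbox(inbox: pathlib.Path) -> None:
    for f in inbox.iterdir():
        if f.suffix == ".json":
            continue  # task envelopes, not attachments
        if f.name not in EXPECTED:
            sys.exit(f"unknown attachment {f.name!r}: renaming inputs "
                     "breaks the stage contract; use the standard names")
```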
How I think about this system now
If I had to compress the whole experience into one sentence, it would be this:
The interesting part is not that a few agents on different machines can finally talk to each other. The interesting part is that multi-machine collaboration is starting to look like an engineering workflow: verifiable, handoff-friendly, and closeable.
That is the real difference between a demo and a working system.
Demo systems optimize for “this looks smart.” Engineering systems care more about:
- whether there is evidence
- whether responsibility is explicit
- whether task state is visible
- whether intermediate artifacts exist
- whether blocked work can be recovered
- whether the whole thing can actually be closed cleanly
None of that sounds romantic. In practice, it matters far more than simply making the agents sound impressive.
What still needs work
This chain is working, but it is not stable yet.
At minimum, the next layer still needs the following (a sketch of the first two items follows the list):
- clearer timeout and escalation behavior
- stricter ACK and delivery rules
- finer ACL boundaries
- more stable attachment contracts
- better audit trails
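To show what the first two items might look like, here is a hypothetical dispatch wrapper with an ACK deadline and an escalation path. None of this exists in the bus yet; it is a shape, not the code:

```python
# Hypothetical sketch of timeout + escalation: dispatch with an ACK
# deadline, and surface blockage instead of waiting forever.
import time
import requests

ACK_DEADLINE = 60   # seconds a node gets to acknowledge a task
POLL_INTERVAL = 10

def dispatch_with_escalation(node_url: str, task: dict) -> None:
    r = requests.post(f"{node_url}/task", json=task, timeout=10)
    task_id = r.json()["id"]
    deadline = time.monotonic() + ACK_DEADLINE
    while time.monotonic() < deadline:
        state = requests.get(f"{node_url}/status", timeout=5).json().get(task_id)
        if state in ("processing", "done"):
            return  # the node picked the task up; the chain can continue
        time.sleep(POLL_INTERVAL)
    # escalate to main instead of waiting forever or failing silently
    print(f"ESCALATE: task {task_id} on {node_url} never acknowledged")
```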
Tailscale is the base layer. The task bus is the first skeleton. A mature multi-machine collaboration system still needs to grow on top of both.
But even at this stage, one thing already feels clear: instead of trying to make more agents talk at once, it is far more useful to get tasks, status, attachments, and closeout right first.
That is what makes multi-machine collaboration move from “interesting to watch” to “actually usable.”
#OpenClaw #Tailscale #MultiAgent #VPS #Automation #Operations