# Architecture
The three components and the two guards. One server per host, one lab per host — everything else falls out of those two rules.
Multiple developers and CI jobs need a queue, a heartbeat, and someone to tear the lab down when the last test walks away. neops-remote-lab is that queue.
## Three components, one process per host
neops-remote-lab ships three pieces that cooperate across the network:
| Component | Runs on | Role |
|---|---|---|
| FastAPI server (`neops_remote_lab.server`) | The lab host | Accepts sessions, queues them FIFO, invokes Netlab for the active session. |
| `LabManager` singleton | In the server process | Tracks the one running lab; enforces the one-lab-per-host rule; reference-counts reuse. |
| `RemoteLabClient` + `remote_lab_fixture` | Your CI / dev machine | Creates a session, heartbeats it, uploads a topology, yields devices to your test, releases on teardown. |
```mermaid
sequenceDiagram
    participant Test as pytest (remote_lab_fixture)
    participant Client as RemoteLabClient
    participant Server as FastAPI server
    participant Manager as LabManager
    participant Netlab
    Test->>Client: acquire topology
    Client->>Server: POST /session
    Server-->>Client: session_id (WAITING)
    Client->>Server: GET /session/{id} (poll until ACTIVE)
    Client->>Server: POST /lab (multipart: topology + extra_files)
    Server->>Manager: try_acquire(topo, reuse=True)
    Manager->>Netlab: netlab up topology.yml
    Netlab-->>Manager: nodes
    Manager-->>Server: devices
    Server-->>Client: 200 devices
    loop every <300s
        Client->>Server: POST /session/heartbeat
    end
    Test->>Client: release()
    Client->>Server: POST /lab/release
    Server->>Manager: release() ref--
    Client->>Server: DELETE /session/{id}
```
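For readers who want to map the diagram onto raw HTTP calls, the sketch below walks the same sequence with the `requests` library. It is illustrative only: the endpoint paths mirror the diagram, but the JSON field names (`session_id`, `state`) and the multipart part name are assumptions; real clients should use `RemoteLabClient`, which wraps this flow and the heartbeat loop.

```python
# Illustrative only: paths mirror the sequence diagram, but JSON field
# names and the multipart part name are assumptions. Use RemoteLabClient
# in real code; it wraps this flow (and the heartbeat loop) for you.
import os
import time

import requests

BASE = os.environ["REMOTE_LAB_URL"]

# 1. Create a session and poll until it is promoted to ACTIVE.
session_id = requests.post(f"{BASE}/session").json()["session_id"]
headers = {"X-Session-ID": session_id}
while requests.get(f"{BASE}/session/{session_id}", headers=headers).json()["state"] != "ACTIVE":
    time.sleep(5)

# 2. Upload the topology as multipart and receive the device list.
with open("topology.yml", "rb") as topo:
    devices = requests.post(f"{BASE}/lab", headers=headers, files={"topology": topo}).json()

# ... run tests against `devices`, sending POST /session/heartbeat at least every 300 s ...
requests.post(f"{BASE}/session/heartbeat", headers=headers)

# 3. Release the lab, then delete the session.
requests.post(f"{BASE}/lab/release", headers=headers)
requests.delete(f"{BASE}/session/{session_id}")
```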
## Request flow: session, then lab
Every `/lab/*` call is gated by two things: a valid `X-Session-ID` header and
that session being the head of the FIFO queue.
Non-active sessions receive `423 Locked`. The server never shortcuts the queue;
the only way to skip ahead is to wait for the head to release or time out.
`/session/heartbeat` is gated more loosely: it only requires the session to
exist (404 if unknown) and accepts heartbeats from both WAITING and ACTIVE
sessions, which lets a queued client reset its WAITING timeout before it is
promoted. See Session Queue for the promotion and timeout rules.
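As a rough illustration of how that gate can be expressed, the dependency below rejects unknown sessions with 404 and queued-but-not-head sessions with 423 Locked. The in-memory list and the endpoint body are stand-ins, not the server's actual implementation.

```python
# Minimal sketch of the gating idea; the in-memory SESSION_QUEUE and the
# endpoint body are stand-ins for the server's real session store.
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
SESSION_QUEUE: list[str] = []  # index 0 is the ACTIVE (head) session

def require_active_session(x_session_id: str = Header(...)) -> str:
    if x_session_id not in SESSION_QUEUE:
        raise HTTPException(status_code=404, detail="unknown session")
    if SESSION_QUEUE[0] != x_session_id:
        # A valid session that is still WAITING in the FIFO queue.
        raise HTTPException(status_code=423, detail="another session holds the host")
    return x_session_id

@app.post("/lab")
async def create_lab(session_id: str = Depends(require_active_session)) -> dict:
    # Only the head of the queue gets here. A heartbeat endpoint would skip
    # the head check and only require that the session exists.
    return {"session_id": session_id}
```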
## Runtime walk-through — what happens during a test
The diagram above summarizes the wire flow. The narrative below animates the same sequence against a small pytest run, so a reader following the Quickstart can map their actual log output back to the components.
- pytest loads the plugin. `neops_remote_lab.testing.pytest_order_plugin` registers the `remote_lab_fixture` factory and installs the collection-time guard that rejects tests with more than one lab fixture.
- The `remote_lab_client` session-scoped fixture connects. It reads `REMOTE_LAB_URL`, creates a session on the server, and waits for the session to reach ACTIVE state — joining a FIFO queue if someone else holds the host.
- The test asks for its lab fixture. The generated fixture uploads the topology via multipart `POST /lab`, polling every five seconds if the server responded `423 Locked` (another test in the run is holding the host with a different topology).
- Netlab brings the topology up. The server returns a list of `DeviceInfoDto` objects once `netlab up` has completed. The test body runs against those.
- Teardown runs. The fixture calls `release()`, which on a non-reuse lab triggers teardown when the reference count hits zero. The session stays alive until the pytest process exits — `atexit` cleanup then closes it.
For the state machines behind those steps, read Session Queue, Lab Lifecycle, and Topology Format in that order.
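To make the walk-through concrete, a test that uses the generated fixture looks roughly like the sketch below. The import path, the factory call signature, and the attributes on the yielded devices are simplified assumptions; the Quickstart shows the exact API.

```python
# Simplified sketch only: import path, factory signature and device
# attributes are assumptions, not the documented API (see the Quickstart).
from neops_remote_lab.testing import remote_lab_fixture

# The factory turns a topology file into a session-scoped pytest fixture.
lab = remote_lab_fixture("topologies/two_routers.yml")

def test_devices_are_reported(lab):
    # `lab` is whatever the server returned once `netlab up` completed,
    # e.g. a list of DeviceInfoDto objects with connection details.
    for device in lab:
        assert device.name
```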
## The one-server-per-host guard
At startup the entrypoint acquires a non-blocking FileLock under the system
temp directory. If a second instance tries to start on the same host, the lock
fails immediately and the new process logs the owner’s PID, user, host, bind
address, and the command that started it — then exits with status 1.
### Stale locks after a crash
If a previous server crashed without running its cleanup, the lock file may remain. The entrypoint detects this by reading the companion metadata file and probing whether the recorded PID is still alive; when the PID is gone it clears the stale metadata and proceeds. If the recorded PID is still alive but the process is actually hung, the lock is not treated as stale and you have to kill that PID manually. See Operator runbook.
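The pattern looks roughly like the sketch below, built on the `filelock` package plus a PID liveness probe. File names, the metadata format, and the message are illustrative choices for the example, not the entrypoint's actual ones.

```python
# Illustrative single-instance guard: non-blocking FileLock plus a PID
# probe. Paths, metadata format and messages are made up for this sketch.
import json
import os
import sys
import tempfile

from filelock import FileLock, Timeout

LOCK_PATH = os.path.join(tempfile.gettempdir(), "neops-remote-lab.lock")
META_PATH = LOCK_PATH + ".meta"

def pid_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
        return True
    except OSError:
        return False

lock = FileLock(LOCK_PATH)
try:
    lock.acquire(timeout=0)  # non-blocking: fail immediately if already held
except Timeout:
    with open(META_PATH) as f:
        owner = json.load(f)
    print(f"another server owns this host: PID {owner['pid']} ({owner['cmd']})")
    sys.exit(1)

# We hold the lock: record who we are so a later contender can report the
# owner, and so a stale entry after a crash can be probed with pid_alive().
with open(META_PATH, "w") as f:
    json.dump({"pid": os.getpid(), "cmd": " ".join(sys.argv)}, f)
```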
## The one-lab-per-host guard
Netlab itself can only manage one topology at a time per host. LabManager is
a classmethod-only singleton; its state lives on the class, and a
system-wide FileLock under the temp directory serialises access across any
additional Python processes (e.g. local tests running alongside the server).
Why both? The singleton prevents two async tasks in the server from racing. The
`FileLock` prevents a developer from accidentally running a local `netlab` process in parallel with the server on the same host.
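A stripped-down version of that combination, with hypothetical names and fields, could look like this:

```python
# Stripped-down sketch: class-level state plus a cross-process FileLock.
# Method names, fields and the lock path are hypothetical, not LabManager's.
import os
import tempfile

from filelock import FileLock

class LabManagerSketch:
    _lock = FileLock(os.path.join(tempfile.gettempdir(), "netlab-host.lock"))
    _topology: str | None = None  # state lives on the class, never on instances
    _refcount: int = 0

    @classmethod
    def try_acquire(cls, topology: str, reuse: bool = True) -> bool:
        with cls._lock:  # serialises access across Python processes on the host
            if cls._topology is None:
                cls._topology = topology
                cls._refcount = 1
                return True          # caller now runs `netlab up`
            if reuse and cls._topology == topology:
                cls._refcount += 1   # same topology: share the running lab
                return True
            return False             # host busy with a different topology

    @classmethod
    def release(cls) -> None:
        with cls._lock:
            cls._refcount -= 1
            if cls._refcount <= 0:   # last reference: the lab can be torn down
                cls._topology = None
```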
## Long Netlab calls don’t block the queue
Netlab commands take minutes: they build containers, boot routers, and install
configurations. The server runs them off the event loop so the rest of the API
— heartbeats, status polls from other sessions, health checks — keeps
responding while a netlab up is in flight. The mechanics are documented for
contributors in
Internals: Async discipline.
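The usual technique is to hand the blocking call to a worker thread and await it, for example with `asyncio.to_thread`. The snippet below is a generic illustration of that idea, not the server's code:

```python
# Generic illustration of keeping the event loop free while a slow CLI call
# runs in a worker thread; not the server's actual implementation.
import asyncio
import subprocess

def netlab_up_blocking(topology: str) -> subprocess.CompletedProcess:
    # May take minutes: container builds, router boot, config push.
    return subprocess.run(["netlab", "up", topology], capture_output=True, text=True)

async def bring_lab_up(topology: str) -> subprocess.CompletedProcess:
    # Await the blocking call in a thread so heartbeats, status polls and
    # health checks keep being served while `netlab up` is in flight.
    return await asyncio.to_thread(netlab_up_blocking, topology)
```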
## How Netlab is invoked
There is exactly one path to the Netlab CLI: `run_netlab` in
`neops_remote_lab.netlab.connector`. It builds the argv as
`["netlab", *args]`, runs the subprocess, and streams or captures stdout
depending on the `NEOPS_NETLAB_STREAM_OUTPUT` env var. Never shell out to
`netlab` from anywhere else in the codebase — the concentrator gives us
uniform logging, error handling, and the `expected_failure` flag for silent
cleanup attempts.
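A rough sketch of such a concentrator is shown below. It models the behaviour described above (argv building, the streaming switch, the `expected_failure` escape hatch), but argument handling and logging in the real `run_netlab` will differ.

```python
# Rough sketch of a single choke point for the Netlab CLI; the real
# run_netlab in neops_remote_lab.netlab.connector differs in detail.
import logging
import os
import subprocess

log = logging.getLogger(__name__)

def run_netlab_sketch(*args: str, expected_failure: bool = False) -> subprocess.CompletedProcess:
    argv = ["netlab", *args]
    stream = os.environ.get("NEOPS_NETLAB_STREAM_OUTPUT", "").lower() in ("1", "true")
    log.info("running: %s", " ".join(argv))
    # Either stream output to the console or capture it for the caller/logs.
    result = subprocess.run(argv, capture_output=not stream, text=True)
    if result.returncode != 0 and not expected_failure:
        # expected_failure suppresses noise from best-effort cleanup calls.
        raise RuntimeError(f"netlab failed ({result.returncode}): {result.stderr or ''}")
    return result
```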
## Ecosystem position
`remote_lab_fixture` is the stable public API. The Worker SDK imports it directly to give function-block tests a real topology (integration guide). For the broader neops vocabulary and which concepts apply to Remote Lab, see How Remote Lab fits with neops.
## Where to go next
- Session queue — FIFO promotion, heartbeats, and the stale-session sweep that keeps a crashed client from blocking the queue.
- Lab lifecycle — SHA-based topology identity, reference counting, the `try_acquire` vs `acquire` distinction, and `atexit` teardown.
- Topology format — the YAML shape, vendor defaults, and the `extra_files` upload contract.