Operator runbook
Install, run under systemd, recover from a stale lock, unstick a wedged lab. The day-1 and day-N operator handbook for a shared host.
`uv tool install` drops the CLI in `~/.local/bin` (or the uv tool dir) inside an isolated environment that uv manages. `pipx` installs into a per-app virtualenv under `~/.local/pipx/venvs/`. After the first install, run `pipx ensurepath` and re-login so `~/.local/bin` is on PATH.
Verify the CLI is reachable and can print its help:
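For example, a quick smoke test (the `--version` and `--help` flags are part of the server CLI surface listed below):

```shell
# Resolve the binary, then confirm it answers
command -v neops-remote-lab
neops-remote-lab --version
neops-remote-lab --help
```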
The help output lists the complete server CLI surface: --debug,
--host, --port, --log-level, --log-config, and --version.
If neops-remote-lab --help errors with command not found, your
tool’s bin directory is not on PATH — run uv tool update-shell,
pipx ensurepath, or add the symlink target manually depending on which
installer you used.
Before you start
Netlab CLI must already be on PATH — the launcher refuses to start without it. If the host is fresh, run Netlab host setup first. You’ll also want shell access with permission to read /tmp, kill processes, and restart the service.
Once the CLI is reachable, continue with Starting the server
for a one-shot foreground run, or Running as a system service
to put the server under systemd.
Running as a system service
The server is a long-running process that needs to come back after a
reboot. The recommended supervisor on Linux hosts is systemd. A minimal
unit file looks like this — save it at
/etc/systemd/system/neops-remote-lab.service:
```ini
[Unit]
Description=neops-remote-lab Manager
After=network-online.target docker.service
Wants=network-online.target

[Service]
Type=simple
User=<SERVICE_USER>
Group=<SERVICE_USER>
# <INSTALL_PATH> is the pipx venv or virtualenv where neops-remote-lab was installed.
# With the default pipx layout, that is typically /home/<SERVICE_USER>/.local/pipx/venvs/neops-remote-lab.
ExecStart=<INSTALL_PATH>/bin/neops-remote-lab --host 0.0.0.0 --port 8000 --log-level INFO
Restart=on-failure
RestartSec=5
# Logs land in the journal by default (stdout/stderr). Override with --log-config to redirect.
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
```
Install, enable, and start it:
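With the unit file saved at `/etc/systemd/system/neops-remote-lab.service`, the standard systemd sequence is:

```shell
sudo systemctl daemon-reload
sudo systemctl enable --now neops-remote-lab.service
```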
Verify the service is up and watch its logs:
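For example:

```shell
systemctl status neops-remote-lab --no-pager
journalctl -u neops-remote-lab -f
```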
journalctl -u neops-remote-lab -f follows the server’s structured log
stream in the journal; send it to <LOG_PATH> via --log-config if you
need a file-based handler instead.
One systemd unit per host only
The server acquires a cross-process FileLock at startup and exits
with status 1 if another instance already holds it.
Do not define a second neops-remote-lab@.service template
instance on the same host — the second unit will crashloop on the
lock, fill the journal, and systemctl status will flap. The
one-server-per-host invariant is a hard constraint, not a tunable.
Starting the server
The entry point is:
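For a one-shot foreground run with an explicit bind address, port, and log level (all three flags are optional):

```shell
neops-remote-lab --host 0.0.0.0 --port 8000 --log-level INFO
```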
See Configuration → Server CLI flags for every supported flag.
On startup the server, in order:
- Sets up logging.
- Acquires a global single-instance `FileLock` (see below).
- Writes an instance-metadata JSON file.
- Verifies the `netlab` CLI is reachable.
- Starts Uvicorn.
- Dispatches a one-shot best-effort cleanup of any stale Netlab `default` instance left over from a crashed prior run.
Single-instance filelock
Only one Remote Lab Manager may run per host. The entry point acquires a
FileLock at a fixed path under the system temp directory, or exits if
another instance already holds it.
The paths, on a typical Linux host:
| File | Purpose |
|---|---|
| `/tmp/neops_remote_lab_server.lock` | The filelock itself |
| `/tmp/neops_remote_lab_server.meta.json` | Human-readable metadata about the running instance |
On successful startup the server writes the metadata JSON. Fields:
| Field | Value |
|---|---|
| `pid` | Process id of the running server |
| `user` | Unix user the process is running as |
| `host` | `platform.node()` of the lab host |
| `started_at` | Unix timestamp |
| `port` | Value of `--port` |
| `host_bind` | Value of `--host` |
| `log_level` | Effective log level |
| `log_config` | Path to the logging config in use |
| `version` | Package version |
| `cwd` | Working directory at launch |
| `cmd` | Full argv used to launch the process |
Inspect it directly when you need to know who is running the server:
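For example, with jq (field names from the table above):

```shell
meta=/tmp/neops_remote_lab_server.meta.json
jq -r '"pid=\(.pid) user=\(.user) started_at=\(.started_at)"' "$meta"
# Is the recorded pid still alive?
ps -p "$(jq -r .pid "$meta")" >/dev/null && echo running || echo stale
```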
On normal exit the server deletes the metadata file and releases the lock via
a finally-guarded _cleanup_lock() callback.
Stale-lock recovery
The most common failure after a crash is a stale lockfile. The startup path handles this automatically:
- Attempt `lock.acquire(timeout=0)`, which fails if the lock is held.
- Read `meta.json`.
- If the recorded `pid` is not alive, remove the stale `meta.json` and retry the lock.
- If the recorded `pid` is alive, log the full running-instance details (pid/user/host/version/bind/started) and exit with status 1.
Manual recovery is rarely needed, but the procedure is:
```shell
# Inspect the claimed owner
jq . /tmp/neops_remote_lab_server.meta.json

# Confirm it is really gone
ps -p "$(jq .pid /tmp/neops_remote_lab_server.meta.json)" || echo "not running"

# Remove both files and restart
rm -f /tmp/neops_remote_lab_server.lock /tmp/neops_remote_lab_server.meta.json
neops-remote-lab --host 0.0.0.0 --port 8000
```
Do not remove the lockfile while another instance is running
Two simultaneous instances cannot be guaranteed safe — the Netlab
default instance is a single cross-process resource. If two servers both
think they own it, lab state will be destroyed out from under active
sessions.
Routine operations
Checking server health
For richer stats during an incident:
Returns uptime, queue length, and session count. Intended for debugging only — see the note in the REST API reference.
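The exact stats path is documented in the REST API reference. Even without it, a plain reachability probe tells you the server is up and answering; `LAB_HOST` is a placeholder:

```shell
: "${LAB_HOST:=localhost:8000}"
# Any HTTP response, even a 404, proves the process is serving requests
curl -s -o /dev/null -w '%{http_code}\n' "http://$LAB_HOST/"
```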
Log monitoring
The server emits structured logs with the session id prefix:
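For example, to isolate one session's lines in the journal (the id shown is a placeholder):

```shell
journalctl -u neops-remote-lab --since "1 hour ago" | grep "Session 3f2a"
```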
Keys to watch:
| Log event | Meaning |
|---|---|
| `Session <id> promoted to ACTIVE` | A waiting session moved to the head of the queue |
| `Removing stale session <id> due to inactivity` | Heartbeat missed the 300 s window; lab will be cleaned up |
| `Lab currently busy` at level WARN | `POST /lab` returned 423 because a different topology is already running |
| `netlab ... failed with exit code ...` at level ERROR | Topology failed to come up; see the captured Netlab output |
Forced cleanup of a stuck lab
If a lab is stuck (netlab up failed partway through, or a client crashed
without releasing), take the lab down via the REST API using any ACTIVE
session:
```bash
#!/usr/bin/env bash
# Force-destroy a stuck lab via the REST API.
#
# Use when a lab is wedged: netlab up failed partway through, or a client
# crashed without releasing. Requires LAB_HOST to point at the Remote Lab
# Manager (e.g. lab.example.com:8000).
#
# Usage:
#   LAB_HOST=lab.example.com:8000 ./examples/scripts/force_cleanup.sh
set -euo pipefail

: "${LAB_HOST:?LAB_HOST must be set, e.g. LAB_HOST=lab.example.com:8000}"

SESSION_ID=$(curl -s -X POST "http://$LAB_HOST/session" | jq -r .session_id)

# Wait for ACTIVE
while [[ "$(curl -s "http://$LAB_HOST/session/$SESSION_ID" | jq -r .status)" != "active" ]]; do
  sleep 2
done

# Force destroy the lab
curl -s -X DELETE "http://$LAB_HOST/lab?force=true" \
  -H "X-Session-ID: $SESSION_ID"

# End the cleanup session
curl -s -X DELETE "http://$LAB_HOST/session/$SESSION_ID"

echo "Lab force-destroyed; cleanup session ended."
```
As a last resort (server unreachable or wedged), clean up Netlab directly on the host:
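A sketch of that host-side cleanup, run as the service user. `netlab down --cleanup` is standard Netlab tooling, but verify the flag against your installed version; the containerlab fallback is the same one the troubleshooting table references:

```shell
# Tear down the default Netlab instance and remove generated files
netlab down --cleanup
# If netlab down hangs on wedged containers, fall back to containerlab
sudo containerlab destroy --all
# Confirm nothing from the lab is left running
docker ps
```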
Remember: only one operator should be doing this at a time. The Netlab
default instance is the single cross-process resource.
Troubleshooting
Looking for client-side debugging or HTTP error codes?
This troubleshooting table is for operators of the lab host. For client-side debugging, log patterns, and the HTTP-error-code reference, see Debugging.
| Symptom | Likely cause | Recovery |
|---|---|---|
| `Another Remote Lab Manager instance is already running.` on startup | Filelock held by another (possibly dead) process | Inspect `/tmp/neops_remote_lab_server.meta.json`; if the PID is not alive, delete the lock and meta file and retry. See Stale-lock recovery. |
| `'netlab' CLI not found in PATH.` at startup, exit 1 | Netlab not installed or not on the launcher's PATH | Install Netlab; verify with `netlab version`; retry. |
| `Address already in use` on the configured port | Prior server did not exit cleanly, or another service occupies the port | `lsof -i :8000`; kill the process or start with `--port`. |
| All `POST /lab` calls return 423 Locked from one caller | The caller's session is not ACTIVE, or a different topology owns the host | Check `GET /session/{id}`; if WAITING, wait; if ACTIVE, another topology is running: release or force-destroy. |
| Clients time out in `_wait_for_active_session` | Session is still in the queue after 600 s | The server clears waiting sessions every 600 s; check server logs for queue state and `Lab currently busy` messages. |
| Session silently disappears mid-test | Heartbeat missed the 300 s window | Ensure the fixture is session-scoped; check client logs for heartbeat failures; consider raising `REMOTE_LAB_REQUEST_TIMEOUT`. |
| `netlab down` hangs when removing a topology manually | Containers wedged in an error state | `docker ps` / `containerlab destroy --all`; restart Docker as a last resort. |
| Zombie metadata after `kill -9` | The `finally` block did not run | Remove `/tmp/neops_remote_lab_server.meta.json` manually after confirming no server is running. |
See also
- REST API — endpoint reference for operator scripting.
- Server config — flags and environment variables.
- Security model — the threat model the operational guidance above is built on top of.
- Architecture — where the single-instance + one-lab invariants come from.
- Session queue — FIFO semantics and 423 Locked flow.
- Headscale: quick setup — the recommended VPN enclosure (alternatives in the page’s Other approaches section).