Job Daemon

The job daemon is a background task that runs asynchronously within the service. It periodically retrieves the state of HEAppE resources, such as job statuses, data transfer tunnels to job nodes, and cluster health. Using this information, it keeps the inference server on the compute node operational by establishing tunnels, refreshing metrics, and replacing compute jobs whose walltime is close to expiring.

The daemon runs several independent reconcilers, each responsible for a specific aspect of the lifecycle. Every reconciler has its own configurable interval.
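As an illustration of independent reconcilers with per-reconciler intervals, the following is a minimal asyncio sketch. It is not the service's actual code: the names `run_reconciler`, `run_daemon`, and the `stop` event are hypothetical, and each reconciler is assumed to be a coroutine function taking no arguments.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

# A reconciler step is assumed to be an async callable with no arguments.
Step = Callable[[], Awaitable[None]]


async def run_reconciler(name: str, interval_s: float, step: Step,
                         stop: asyncio.Event) -> None:
    """Run `step` repeatedly, waiting `interval_s` seconds between runs."""
    while not stop.is_set():
        try:
            await step()
        except Exception as exc:
            # A failing reconciler must not take the others down with it.
            print(f"reconciler {name!r} failed: {exc!r}")
        try:
            # Sleep for the interval, but wake up early on shutdown.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass


async def run_daemon(reconcilers: Dict[str, Tuple[float, Step]],
                     stop: asyncio.Event) -> None:
    """Run all reconcilers concurrently, each on its own interval."""
    await asyncio.gather(
        *(run_reconciler(name, interval, step, stop)
          for name, (interval, step) in reconcilers.items())
    )
```

Because each loop owns its own timer, a slow or failing reconciler in this sketch does not delay the others.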

Job Sync Reconciler

This is the primary reconciler. It keeps the service's view of HEAppE jobs up to date and starts new inference jobs for models that need them. It works in two phases.

Phase 1 — Synchronisation

All active HEAppE jobs (queued and running) are fetched and matched against known models by parsing each job's name. Running jobs get a tunnel record that will later be activated. The in-memory job store is updated to reflect the current HEAppE state — new jobs are added, stale ones removed.
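The matching and store update described above could look roughly like the sketch below. The job name convention (`inference:<model>`), the `Job` fields, and the function names are assumptions made for illustration; the source does not specify the actual naming scheme.

```python
from __future__ import annotations

from dataclasses import dataclass

JOB_PREFIX = "inference"  # hypothetical prefix, not taken from the service


@dataclass(frozen=True)
class Job:
    job_id: int
    name: str
    state: str  # "queued" or "running"


def model_for_job(name: str, known_models: set[str]) -> str | None:
    """Return the model a job belongs to, or None if the name does not match."""
    prefix, sep, model = name.partition(":")
    if prefix != JOB_PREFIX or not sep:
        return None
    return model if model in known_models else None


def sync_store(store: dict[int, Job], active: list[Job],
               known_models: set[str]) -> None:
    """Make `store` mirror the active jobs that belong to known models."""
    matched = {j.job_id: j for j in active
               if model_for_job(j.name, known_models) is not None}
    # Remove stale jobs that HEAppE no longer reports as active.
    for job_id in list(store):
        if job_id not in matched:
            del store[job_id]
    # Add new jobs and refresh the state of existing ones.
    store.update(matched)
```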

Jobs that have a running compute node but no active tunnel are processed next. For each such job the compute node IP is resolved, the tunnel record is saved, and a data transfer tunnel is opened through HEAppE. Once the tunnel responds, it is marked as active and the resolved port is stored.
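The activation sequence for a single job can be sketched as follows. The HEAppE calls are deliberately left as injected callables (`resolve_ip`, `open_tunnel`, `probe`, `save`), all of which are hypothetical stand-ins; the sketch only shows the order of steps described above, not the real client API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tunnel:
    job_id: int
    node_ip: Optional[str] = None
    local_port: Optional[int] = None
    active: bool = False


def activate_tunnel(
    tunnel: Tunnel,
    resolve_ip: Callable[[int], str],        # job id -> compute node IP
    open_tunnel: Callable[[int, str], int],  # (job id, node IP) -> local port
    probe: Callable[[int], bool],            # does the tunnel respond on this port?
    save: Callable[[Tunnel], None],          # persist the tunnel record
) -> bool:
    """Resolve the node IP, open the tunnel, and mark it active once it responds."""
    tunnel.node_ip = resolve_ip(tunnel.job_id)
    save(tunnel)  # the record is saved before the tunnel is opened
    port = open_tunnel(tunnel.job_id, tunnel.node_ip)
    if probe(port):
        tunnel.local_port = port
        tunnel.active = True
        save(tunnel)
    return tunnel.active
```

If the probe fails, the tunnel stays inactive in this sketch, so the job would be picked up again on the next cycle.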

Phase 2 — Starting Jobs

For each model that is not idle, the reconciler checks whether enough inference jobs are already running. A job is considered "fresh" if it started within a configurable window of its total walltime; a job past that window is close to expiring and needs a replacement. If fewer jobs are running than the model's desired scale, new ones are created and submitted to HEAppE using the appropriate command template for the model's inference engine (e.g. vLLM).

This two-phase design prevents race conditions — synchronisation always happens before any decision to start new jobs.
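The Phase 2 decision can be illustrated with a small calculation. This sketch assumes that the "configurable window" is expressed as a fraction of the walltime and that only fresh jobs count toward the desired scale; both are interpretations, and `is_fresh`, `jobs_to_start`, and `fresh_fraction` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Iterable


def is_fresh(started_at: datetime, walltime: timedelta,
             fresh_fraction: float, now: datetime) -> bool:
    """A job is fresh while its elapsed time is within the configured window."""
    return now - started_at <= walltime * fresh_fraction


def jobs_to_start(desired_scale: int, job_starts: Iterable[datetime],
                  walltime: timedelta, fresh_fraction: float,
                  now: datetime) -> int:
    """Number of new jobs needed so that fresh jobs reach the desired scale."""
    fresh = sum(1 for started in job_starts
                if is_fresh(started, walltime, fresh_fraction, now))
    return max(0, desired_scale - fresh)
```

For example, with a 4 hour walltime and `fresh_fraction=0.75`, a job that started 3.5 hours ago no longer counts, so a model with a desired scale of 2 and one other fresh job would get one new job.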

Idle Reconciler

Models that have not received any inference requests for longer than their configured idle timeout are automatically marked as idle. An idle model is skipped by the job sync reconciler, so no new HEAppE jobs are created for it.

When a new inference request arrives for an idle model, it is woken up — the idle flag is cleared and the next sync cycle will start a job for it.
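The idle and wake-up logic amounts to a timestamp comparison and a flag. A minimal sketch, assuming each model tracks the time of its last request in seconds; `ModelState`, `mark_idle`, and `on_inference_request` are hypothetical names:

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ModelState:
    name: str
    idle_timeout_s: float
    last_request_at: float  # timestamp of the last inference request, in seconds
    idle: bool = False


def mark_idle(models: Iterable[ModelState], now: float) -> List[str]:
    """Flag models whose last request is older than their idle timeout."""
    newly_idle = []
    for model in models:
        if not model.idle and now - model.last_request_at > model.idle_timeout_s:
            model.idle = True
            newly_idle.append(model.name)
    return newly_idle


def on_inference_request(model: ModelState, now: float) -> None:
    """Record the request and wake the model so the next sync can start a job."""
    model.last_request_at = now
    model.idle = False
```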

Metrics Reconciler

For every running job with an active tunnel, Prometheus metrics are periodically fetched from the inference engine's metrics endpoint (e.g. /metrics on vLLM). The following metrics are collected:

  • Number of actively running inference requests
  • Number of queued (waiting) inference requests
  • KV cache usage percentage

If the metrics endpoint becomes unreachable, the job state is set to loading and all metrics are reset to zero — indicating the engine may have restarted.
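The collection and reset behaviour can be sketched with a small parser for the Prometheus text format. The metric names below follow vLLM's naming but are assumptions here and may differ between engine versions. The `fetch` callable, the `JobMetrics` type, and the "running" state used when the endpoint answers are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict

# One sample line: name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+(\S+)(?:\s+-?\d+)?$"
)

# Assumed metric names, mapped to the fields they populate.
WANTED = {
    "vllm:num_requests_running": "running",
    "vllm:num_requests_waiting": "waiting",
    "vllm:gpu_cache_usage_perc": "kv_cache_usage",
}


@dataclass
class JobMetrics:
    state: str = "loading"
    running: float = 0.0
    waiting: float = 0.0
    kv_cache_usage: float = 0.0


def parse_metrics(text: str) -> Dict[str, float]:
    """Extract the wanted metrics; assumes one sample per metric name."""
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match and match.group(1) in WANTED:
            values[WANTED[match.group(1)]] = float(match.group(2))
    return values


def reconcile_metrics(fetch: Callable[[], str]) -> JobMetrics:
    """Fetch and parse metrics; fall back to 'loading' with zeros on failure."""
    try:
        text = fetch()
    except OSError:  # connection errors are OSError subclasses
        return JobMetrics()
    return JobMetrics(state="running", **parse_metrics(text))
```

In practice `fetch` would issue an HTTP GET against the metrics endpoint through the job's tunnel; it is injected here so the sketch stays independent of any particular HTTP client.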

The collected metrics are exposed through the API's model listing endpoint, providing live visibility into each model's current load.