The user facing API is implemented using Golang and the Gin HTTP framework. Its purpose is to provide interface for basic inference use-cases and model management. The full API docs are available from the live service's Swagger UI at the /swagger endpoint.
Authorization
The authorization to the API is handled by the Authorization HTTP header with a Bearer token.
Authorization: Bearer <token>
Access the API
You will need to authenticate using a Bearer token. There are 2 types of Bearer tokens that are currently accepted:
-
LEXIS Platform access token - a temporary access token obtained from the LEXIS Platform AAI system.
-
Static local key - a secret configured via the
API_LOCAL_KEYenvironment variable. Recommended for local deployment, development or usage with a group of trusted users.
Endpoints
The service groups endpoints by use-cases. All use-case groups are associated to a compute project with a prefix path /p/:project, where project is typically the accounting string associated with the HPC compute project.
OpenAI compatible API
The main user facing API for inference requests, prefixed by openai. Currently implemented endpoints are:
openai/v1/chat/completionsopenai/v1/completionsopenai/v1/modelsopenai/v1/embeddings
When using the OpenAI API with external tools, always set the path as /p/:project/openai/v1
Visit the official OpenAI platform documentation for more information.
Model Management
All model management endpoints are scoped under /p/:project/models where :project is the HPC project accounting string (e.g. eu-00-00).
GET /p/:project/models
Returns all models registered for the project. Each model includes its configuration, a computed state (derived from active inference jobs: not_loaded, loading, queued, ready, busy), and a jobs array with live metrics (GPU utilization, waiting and running requests) for any running inference job.
PUT /p/:project/models — Register a model
Registers a new model in the service's local database. The model is registered in the idle state — no inference job is created until explicitly woken up.
Request body:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
hf_model_id |
string | yes | — | Hugging Face model ID (e.g. Qwen/Qwen3-8B-Instruct). Also used as the model identifier in OpenAI requests. For models prepared by in the HPC project storage through other means (such as staging or finetuning workflow), this must match their ID as well. |
gpu_count |
integer | yes | — | Number of GPUs requested for the inference job. Must be ≥ 1. |
engine |
string | no | "vllm" |
Inference engine to use |
walltime |
integer | no | 3600 |
Maximum job walltime in seconds (1 hour). The job is killed by the scheduler after this time elapses. |
walltime_ratio |
float | no | 0.9 |
Fraction of walltime before a replacement job is started. At 0.9 with walltime=3600, a new job is submitted after 3240s (54 min). |
idle_timeout |
integer | no | 600 |
Seconds of inactivity after the last wakeup before the model is automatically set idle (10 min). |
autoscale |
bool | no | false |
Enable auto-scaling strategy for busy inference jobs |
is_refreshing |
bool | no | true |
Allow toggle of job replacement strategy. Reserved for future use. |
POST /p/:project/models — Load a model
Initiates model loading by waking the model from idle.
This sets the model's Idle flag to false and records LastWakeup as the current time. On the next daemon cycle, the JobSyncReconciler creates an inference job if none exists. If a job is already running, no new job is created (as long as the count of "fresh" jobs meets the desired scale).
Returns 200 OK on success, 404 Not Found if the model is not registered.
DELETE /p/:project/models — Unregister a model
Removes the model from the service's database and in-memory cache.
This does not stop any running HEAppE inference jobs — they continue until they finish naturally. Only the service's tracking of the model is removed.
Returns 204 No Content on success or 404 Not Found if the model is not registered.
Project Management
GET /plists all projects the authenticated user has access to.GET /p/:projectreturns details for a specific project.POST /pactivates a project for inference by creating the required command templates in HEAppE.
General
healthreturns OK when the service is online, and provides API version info.swagger/index.htmlserves the Swagger UI documentation.