Skip to content

The user facing API is implemented using Golang and the Gin HTTP framework. Its purpose is to provide interface for basic inference use-cases and model management. The full API docs are available from the live service's Swagger UI at the /swagger endpoint.

Authorization

The authorization to the API is handled by the Authorization HTTP header with a Bearer token.

Authorization: Bearer <token>

Access the API

You will need to authenticate using a Bearer token. There are 2 types of Bearer tokens that are currently accepted:

  • LEXIS Platform access token - a temporary access token obtained from the LEXIS Platform AAI system.

  • Static local key - a secret configured via the API_LOCAL_KEY environment variable. Recommended for local deployment, development or usage with a group of trusted users.

Endpoints

The service groups endpoints by use-cases. All use-case groups are associated to a compute project with a prefix path /p/:project, where project is typically the accounting string associated with the HPC compute project.

OpenAI compatible API

The main user facing API for inference requests, prefixed by openai. Currently implemented endpoints are:

  • openai/v1/chat/completions
  • openai/v1/completions
  • openai/v1/models
  • openai/v1/embeddings

When using the OpenAI API with external tools, always set the path as /p/:project/openai/v1

Visit the official OpenAI platform documentation for more information.

Model Management

All model management endpoints are scoped under /p/:project/models where :project is the HPC project accounting string (e.g. eu-00-00).

GET /p/:project/models

Returns all models registered for the project. Each model includes its configuration, a computed state (derived from active inference jobs: not_loaded, loading, queued, ready, busy), and a jobs array with live metrics (GPU utilization, waiting and running requests) for any running inference job.

PUT /p/:project/models — Register a model

Registers a new model in the service's local database. The model is registered in the idle state — no inference job is created until explicitly woken up.

Request body:

Field Type Required Default Description
hf_model_id string yes Hugging Face model ID (e.g. Qwen/Qwen3-8B-Instruct). Also used as the model identifier in OpenAI requests. For models prepared by in the HPC project storage through other means (such as staging or finetuning workflow), this must match their ID as well.
gpu_count integer yes Number of GPUs requested for the inference job. Must be ≥ 1.
engine string no "vllm" Inference engine to use
walltime integer no 3600 Maximum job walltime in seconds (1 hour). The job is killed by the scheduler after this time elapses.
walltime_ratio float no 0.9 Fraction of walltime before a replacement job is started. At 0.9 with walltime=3600, a new job is submitted after 3240s (54 min).
idle_timeout integer no 600 Seconds of inactivity after the last wakeup before the model is automatically set idle (10 min).
autoscale bool no false Enable auto-scaling strategy for busy inference jobs
is_refreshing bool no true Allow toggle of job replacement strategy. Reserved for future use.

POST /p/:project/models — Load a model

Initiates model loading by waking the model from idle.

This sets the model's Idle flag to false and records LastWakeup as the current time. On the next daemon cycle, the JobSyncReconciler creates an inference job if none exists. If a job is already running, no new job is created (as long as the count of "fresh" jobs meets the desired scale).

Returns 200 OK on success, 404 Not Found if the model is not registered.

DELETE /p/:project/models — Unregister a model

Removes the model from the service's database and in-memory cache.

This does not stop any running HEAppE inference jobs — they continue until they finish naturally. Only the service's tracking of the model is removed.

Returns 204 No Content on success or 404 Not Found if the model is not registered.

Project Management

  • GET /p lists all projects the authenticated user has access to.
  • GET /p/:project returns details for a specific project.
  • POST /p activates a project for inference by creating the required command templates in HEAppE.

General

  • health returns OK when the service is online, and provides API version info.
  • swagger/index.html serves the Swagger UI documentation.