Architecture

The system architecture is centered on the HEAppE Middleware, which manages HPC jobs and provides an HTTP data transfer tunnel to an inference engine running on a compute node. Job orchestration and periodic monitoring are implemented by the Job Daemon, which runs as a set of background reconcilers within the API service.
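As a rough illustration of the reconciler idea, the sketch below compares the models that should be served with the job states last observed, and decides per model whether a new job needs to be submitted. All names here (JobState, Action, Reconcile) and the set of states are assumptions made for the example, not the Job Daemon's actual types; a real daemon would presumably run such a step on a timer (for instance with time.Ticker) and obtain job states from HEAppE.

```go
package main

import "fmt"

// JobState is a hypothetical, simplified view of an HPC job's status.
type JobState string

const (
	StateQueued   JobState = "queued"
	StateRunning  JobState = "running"
	StateFinished JobState = "finished"
	StateFailed   JobState = "failed"
)

// Action is what the reconcile step decides to do for one model (hypothetical).
type Action string

const (
	ActionNone   Action = "none"
	ActionSubmit Action = "submit"
)

// Reconcile compares the desired set of models with the observed job states
// and returns one action per desired model. A model with a queued or running
// job needs nothing; a model with no job, or with a finished or failed job,
// gets a new submission.
func Reconcile(desired []string, observed map[string]JobState) map[string]Action {
	actions := make(map[string]Action, len(desired))
	for _, model := range desired {
		switch observed[model] {
		case StateQueued, StateRunning:
			actions[model] = ActionNone
		default:
			// Missing entry (zero value ""), finished, or failed.
			actions[model] = ActionSubmit
		}
	}
	return actions
}

func main() {
	desired := []string{"model-a", "model-b"}
	observed := map[string]JobState{
		"model-a": StateRunning,
		"model-b": StateFailed,
	}
	actions := Reconcile(desired, observed)
	for _, m := range desired {
		fmt.Printf("%s: %s\n", m, actions[m])
	}
}
```

Keeping the decision step a pure function of desired and observed state makes it easy to test separately from the code that talks to the middleware.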

The inference engine (e.g. vLLM) runs as an HPC job on the compute node and serves an OpenAI-compatible HTTP API. HEAppE's data transfer layer tunnels HTTP requests from the API service to the compute node, enabling inference requests, metrics scraping, and model management without direct network connectivity to the HPC cluster.

The API service is implemented in Go using the Gin framework. It accepts OpenAI-compatible inference requests from authenticated users, selects the appropriate inference job for the requested model, and proxies each request through HEAppE to the running engine on the compute node.
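The request path described above can be sketched roughly as follows. To stay dependency-free, this example uses the standard net/http package rather than Gin, and it omits authentication and streaming responses. InferenceJob, its TunnelURL field, SelectJob, and ProxyHandler are hypothetical names; in particular, TunnelURL merely stands in for whatever address HEAppE's data transfer layer exposes, which this sketch does not model. The only detail taken from the OpenAI-compatible API is that the request body is JSON with a "model" field.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// InferenceJob is a hypothetical record of an engine running as an HPC job.
type InferenceJob struct {
	ID      int
	Model   string
	Running bool
	// TunnelURL is an assumed placeholder for the tunnel endpoint of this job.
	TunnelURL string
}

// SelectJob returns the first running job that serves the requested model.
func SelectJob(model string, jobs []InferenceJob) (InferenceJob, bool) {
	for _, j := range jobs {
		if j.Running && j.Model == model {
			return j, true
		}
	}
	return InferenceJob{}, false
}

// ProxyHandler reads the "model" field from the JSON body, selects a job,
// and forwards the unchanged body to that job's tunnel URL.
func ProxyHandler(jobs []InferenceJob, client *http.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		var req struct {
			Model string `json:"model"`
		}
		if err := json.Unmarshal(body, &req); err != nil || req.Model == "" {
			http.Error(w, "missing model", http.StatusBadRequest)
			return
		}
		job, ok := SelectJob(req.Model, jobs)
		if !ok {
			http.Error(w, "no running job for model", http.StatusServiceUnavailable)
			return
		}
		out, err := http.NewRequestWithContext(r.Context(), r.Method,
			job.TunnelURL+r.URL.Path, bytes.NewReader(body))
		if err != nil {
			http.Error(w, "cannot build upstream request", http.StatusInternalServerError)
			return
		}
		out.Header.Set("Content-Type", r.Header.Get("Content-Type"))
		resp, err := client.Do(out)
		if err != nil {
			http.Error(w, "upstream error", http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		// Relay status and body; streamed responses would need flushing.
		w.Header().Set("Content-Type", resp.Header.Get("Content-Type"))
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	}
}

func main() {
	jobs := []InferenceJob{
		{ID: 1, Model: "model-a", Running: false},
		{ID: 2, Model: "model-a", Running: true},
	}
	job, ok := SelectJob("model-a", jobs)
	fmt.Println(job.ID, ok) // prints "2 true"
}
```

Reading the whole body before forwarding is a simplification that lets the handler inspect the model name and still send the original bytes upstream; a production proxy would likely bound the body size.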