GOOD 2026

Building LLM Inference Services for Researchers on HPC with Open OnDemand
2026-03-10 , Main Hall

We share how we built LLM inference services for researchers at Ohio Supercomputer Center (OSC) using Open OnDemand. Our users need LLMs in many different ways, so we support several service models. We started with an Ollama module on the cluster, added OpenWebUI, and later created dedicated OnDemand apps that start the LLM server and give users a simple interface. We also provide a Jupyter + Ollama app that exposes an OpenAI API interface in a Jupyter kernel. We maintain shared models so users can start quickly, while still allowing local models. We have also deployed the Ollama + OpenWebUI service on our Kubernetes cluster, but scaling it for a large number of users remains a challenge. We plan to add additional LLM engines and front-ends as part of our future improvements as well.


In this talk, we describe our efforts to provide LLM inference services for researchers at Ohio Supercomputer Center (OSC). Many of our users want to run LLMs, but they have a wide variety of workflows and tool preferences. Our goal has been to support these needs in a simple and reliable way, using Open OnDemand as the main entry point.

We started by installing an Ollama module on our HPC cluster that includes OpenWebUI and a few wrappers for easy use. Users can start an OnDemand virtual desktop session, download models, start an Ollama server, and run OpenWebUI to access the interface in a browser. At this point, it should support all the workflows users need, but it is still not accessible enough for most users.

To make this more accessible, we built a dedicated OnDemand app for Ollama. When launched, it starts an Ollama server for the user and opens a browser with the OpenWebUI interface. This made the system much more accessible, although it does not scale well because each user session runs on its own compute node.

We also created a Jupyter + Ollama app. This app does not use OpenWebUI. Instead, it launches a Jupyter session with an Ollama server running in the background. Through a Jupyter kernel, users can access the OpenAI API interface. This works well for users who want to test the code and run inference batch jobs.

To reduce friction further, we updated our module so that several popular models are stored in a central location. Users can start quickly without downloading anything, but they can still download additional models into their home or project directories.

We have also deployed generative AI services on our Kubernetes cluster. This includes both LLM inference and a stable diffusion service. These run well for small numbers of users, but the cost of running many user sessions is high, and we do not yet have a scalable deployment model. A major part of our future work is building a more scalable LLM service on Kubernetes. To date, we have deployed a core set of tools (Ollama, OpenWebUI, and OpenAI API) and plan to expand these options. Initial tests have shown very promising results.

Heechang Na, Ph.D., is the Scientific Applications Operations Manager at the Ohio Supercomputer Center (OSC). In his role, he is responsible for managing the comprehensive operations for providing optimal scientific software application environments to academic and industry clients. Leveraging his technical expertise and strategic insights, he drives the enhancement of OSC's scientific computing infrastructure.

As part of his responsibilities, he oversees the deployment of the scientific software stack on new systems and collaborates with other groups to make the software environment more accessible. This includes deploying more Open OnDemand apps, supporting classrooms and enabling new workflows in AI, bioinformatics and other domains.