vLLM

vLLM is an inference and serving engine for LLMs (Large Language Models) that allows you to run AI models on your own hardware. Compared to Ollama, vLLM provides better throughput in high-concurrency scenarios.

Check GPU Driver and Docker Setup

Ensure that GPU drivers are correctly installed on your GPU server and that they can be passed through to Docker.
These steps depend on your specific environment and are not covered in this guide.
Also verify that your setup meets all requirements.
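
To quickly verify that the drivers work and are visible to Docker, you can run nvidia-smi on the host and inside a test container (the CUDA image tag below is only an example; choose one that matches your driver version):

nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If both commands list your GPU, drivers and container runtime are working.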

Adjust NVIDIA Runtime if Necessary

In the sample vllm.yml provided for vLLM, the NVIDIA runtime is already configured.
Review and adjust this configuration if your system requires different runtime settings.
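
If the nvidia runtime is not registered with Docker yet, the NVIDIA Container Toolkit can usually configure it for you (assuming nvidia-ctk is already installed):

nvidia-ctk runtime configure --runtime=docker
systemctl restart docker
docker info | grep -i runtimes

The last command should list nvidia among the available runtimes.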

Additional Prerequisites

HuggingFace Access Token

By default, vLLM downloads models from HuggingFace. Some models are only available after you accept their license or usage terms; to download these gated models, you need to generate an access token. Store this access token in a safe location, since it cannot be displayed again.
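
As an optional sanity check, you can verify that the token is valid by querying the HuggingFace API directly:

curl -s -H "Authorization: Bearer YOUR_HUGGING_FACE_TOKEN" https://huggingface.co/api/whoami-v2

A valid token returns a JSON document containing your account name; an invalid token returns an error.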

Instructions

Download .yml files

Download the latest .yml files from GitHub by running the following command:

mkdir /opt/seatable-compose && \
cd /opt/seatable-compose && \
wget -c https://github.com/seatable/seatable-release/releases/latest/download/seatable-compose.tar.gz -O - | tar -xz -C /opt/seatable-compose && \
cp -n .env-release .env
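
Afterwards, the directory should contain the extracted .yml files as well as your .env file:

ls /opt/seatable-compose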

Create vllm.yml

Create /opt/seatable-compose/vllm.yml with the following contents:

services:
  vllm:
    image: vllm/vllm-openai:v0.11.0
    container_name: vllm
    restart: unless-stopped
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    command: ["--model", "${VLLM_MODEL:?Variable is not set or empty}", "--enable-log-requests", "--max-model-len", "${VLLM_MAX_MODEL_LEN:-65536}"]
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_TOKEN:?Variable is not set or empty}
      - VLLM_API_KEY=${VLLM_API_KEY:?Variable is not set or empty}
      # If set to true, enables logging of all API server responses from vLLM. Don't use this in production.
      - VLLM_DEBUG_LOG_API_SERVER_RESPONSE=${VLLM_DEBUG_LOG_API_SERVER_RESPONSE:-false}
    networks:
      - frontend-net
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      start_period: 300s
    logging:
      driver: json-file
      options:
        # Maximum size per file
        max-size: "10m"
        # Maximum number of files
        max-file: "3"
    labels:
      caddy: ${VLLM_HOSTNAME:?Variable is not set or empty}
      caddy.@denied.not_0: "remote_ip ${VLLM_ALLOWED_IPS:?Variable is not set or empty} private_ranges"
      caddy.abort: "@denied"
      caddy.reverse_proxy: "{{upstreams 8000}}"

Afterwards, you should add vllm.yml to the COMPOSE_FILE variable inside your .env file:

sed -i "s/COMPOSE_FILE='\(.*\)'/COMPOSE_FILE='\1,vllm.yml'/" /opt/seatable-compose/.env

Make sure that COMPOSE_FILE already includes caddy.yml since vLLM's port 8000 is not exposed by default.
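
You can check the result with a quick look at the variable (the exact list depends on which other components you have installed):

grep COMPOSE_FILE /opt/seatable-compose/.env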

vLLM Configuration

Add the following configuration settings to your .env file to configure vLLM:

VLLM_HOSTNAME='YOUR_VLLM_HOSTNAME'

# Allowed IP addresses (multiple addresses can be separated by spaces)
VLLM_ALLOWED_IPS='SEATABLE_AI_PUBLIC_IP'

# Set this to a long random string, e.g. generated by "pwgen -s 64 1"
# This API key must then be used for any API requests
VLLM_API_KEY=''

HUGGING_FACE_TOKEN='YOUR_HUGGING_FACE_TOKEN'

# Model identifier from HuggingFace
# (e.g. RedHatAI/gemma-3-12b-it-quantized.w4a16)
VLLM_MODEL=''
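
If pwgen is not available on your server, a suitable random API key can also be generated with openssl, for example:

openssl rand -hex 32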

Start vLLM

You can start vLLM by running the following command inside /opt/seatable-compose:

docker compose up -d

Starting vLLM may take several minutes depending on the model size and computing resources. On first startup, vLLM will automatically download the configured model from HuggingFace. Wait for the Docker container to report a healthy status before proceeding.
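
You can follow the model download and startup progress and check the health status with the following commands (run inside /opt/seatable-compose):

docker logs -f vllm
docker compose ps vllm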

Perfect! Your local vLLM deployment is ready to use.

SeaTable AI Configuration

To use vLLM for AI-based automation inside SeaTable, add the following settings to the .env file on the host where SeaTable AI is deployed:

SEATABLE_AI_LLM_TYPE='hosted_vllm'

SEATABLE_AI_LLM_URL='https://<YOUR_VLLM_HOSTNAME>/v1'

# API key for requests to vLLM (use the same key as above)
SEATABLE_AI_LLM_KEY=''

# Model identifier from HuggingFace
# (e.g. RedHatAI/gemma-3-12b-it-quantized.w4a16)
SEATABLE_AI_LLM_MODEL=''

Remember to restart SeaTable AI after making any changes.
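
Assuming SeaTable AI is also managed with docker compose and the service is called seatable-ai, applying the changed .env could look like this:

docker compose up -d seatable-ai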

Test vLLM

Send a request

You can send your first request to vLLM from the command line of the SeaTable AI container with the following example. This makes sure that all environment variables are correctly set and that the IP address of SeaTable AI is included in the allowed IP addresses configured with VLLM_ALLOWED_IPS.

docker exec -i seatable-ai bash -s <<EOF
curl -fsSL \$SEATABLE_AI_LLM_URL/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer \$SEATABLE_AI_LLM_KEY" \
  -d '{
    "model": "'\$SEATABLE_AI_LLM_MODEL'",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "How many inhabitants does Germany have?"}
    ]
  }'
EOF
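
If the request fails, it can help to query vLLM directly from an allowed IP address, for example by listing the models the server currently serves (replace the placeholders with your own values):

curl -H "Authorization: Bearer YOUR_VLLM_API_KEY" https://YOUR_VLLM_HOSTNAME/v1/models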

Example output

Here is an example response from vLLM:

{
  "id": "chatcmpl-02d3a0161ed647e988b5a93088435670",
  "object": "chat.completion",
  "created": 1761910895,
  "model": "RedHatAI/gemma-3-12b-it-quantized.w4a16",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "As of late 2023/early 2024, Germany has approximately
            **83.24 million** inhabitants.\n\nIt's always good to remember
            that population figures are constantly changing!\n\n\n\nDo you
            want to know anything else about Germany?",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 106,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 23,
    "total_tokens": 79,
    "completion_tokens": 56,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": null
}

🥳 Congratulations! Everything is set up and you can start to build AI-powered automations in SeaTable with your own self-hosted vLLM.