# Installation Guide
This guide explains how to install and set up PITA for different hardware configurations and inference engines.
## Prerequisites

- Conda (Miniconda or Anaconda) for environment management
- Git for cloning the repository
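Both tools print a version string when installed correctly, which is a quick way to confirm they are on your `PATH`:

```bash
# Sanity check: both commands should print a version
conda --version
git --version
```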
Then clone the repository:

```bash
git clone https://github.com/cobi-inc-MC/pita
cd pita
```
## CPU Installation
For development, testing, or systems without GPU acceleration.
### Option 1: llama.cpp (Recommended for CPU)

Save the following as `pita_llama_cpp.yml`:
```yaml
name: pita_llama_cpp
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.12
  - pip
  - llama-cpp-python
  - pytest
```
Then run:

```bash
conda env create -f pita_llama_cpp.yml
conda activate pita_llama_cpp

# Install pita in editable mode
pip install -e .
```
### Option 2: Windows CPU (llama.cpp)

For Windows users without a dedicated GPU.

Save the following as `llamacpp_windows_cpu.yml`:
```yaml
name: pita_llamacpp_windows_cpu
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - pip
  - llama-cpp-python
  - pytest
  - pytest-asyncio
```
Then run:

```bash
conda env create -f llamacpp_windows_cpu.yml
conda activate pita_llamacpp_windows_cpu

# Install pita (ensure you are in the root directory)
pip install -e ".[llama_cpp]"
```
### Option 3: Manual Setup

If you prefer not to use an environment file:

```bash
conda create -n pita_cpu python=3.12 -y
conda activate pita_cpu

pip install llama-cpp-python
pip install -e .
pip install pytest  # For testing
```
### Verify Installation

```bash
python -c "import pita; import llama_cpp; print('CPU installation successful!')"
```
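For a fuller smoke test, you can load a GGUF model directly through llama-cpp-python. This is a sketch: `models/tiny.gguf` is a placeholder path, so substitute any GGUF file you have downloaded.

```python
# Minimal llama-cpp-python smoke test (sketch).
# "models/tiny.gguf" is a placeholder -- point it at any GGUF model on disk.
from llama_cpp import Llama

llm = Llama(model_path="models/tiny.gguf", n_ctx=512, verbose=False)
out = llm("Q: What is 2 + 2? A:", max_tokens=8)
print(out["choices"][0]["text"])
```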
## NVIDIA CUDA Installation

For systems with NVIDIA GPUs. Choose your preferred inference engine:
### Option A.1: llama.cpp with CUDA using scripts (Recommended)

For a streamlined installation on Linux, we provide an automated setup script that handles environment creation and CUDA compilation for you.

See the Automated Installation Guide for detailed instructions on using the `setup_llamacpp_cuda.sh` script.
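If the script is present in your checkout, invocation is typically as simple as the sketch below; the path is an assumption, so defer to the Automated Installation Guide if it differs:

```bash
# Assumed location at the repository root -- adjust if the script
# lives elsewhere in your checkout
chmod +x setup_llamacpp_cuda.sh
./setup_llamacpp_cuda.sh
```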
### Option A.2: llama.cpp with CUDA (Manual)

Best for: Smaller models, lower memory usage, flexible quantization options.

Save the following as `llamacpp_cuda.yml`:
```yaml
name: pita_llamacpp_cuda
channels:
  - defaults
  - nvidia
  - conda-forge
dependencies:
  - python=3.12
  - pip
  - cuda-cudart=12.4.127
  - cuda-toolkit=12.4.1
  - cmake
```
Then run the setup:

```bash
# 1. Create environment
conda env create -f llamacpp_cuda.yml
conda activate pita_llamacpp_cuda

# 2. Set up CUDA build environment
export CUDACXX=$CONDA_PREFIX/bin/nvcc
export CPATH=$CONDA_PREFIX/targets/x86_64-linux/include:$CPATH
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

# 3. Build llama-cpp-python with CUDA support
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_FLAGS=-allow-unsupported-compiler" \
  pip install llama-cpp-python --no-cache-dir

# 4. Install pita
pip install -e .
```
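If step 3 fails because CMake cannot locate CUDA, confirm that `nvcc` resolves to the conda-installed toolkit before retrying:

```bash
# nvcc should live under $CONDA_PREFIX and report toolkit 12.4
which nvcc
nvcc --version
```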
#### Verify CUDA Backend

```bash
python -c "
import os
lib_dir = os.path.dirname(__import__('llama_cpp').__file__) + '/lib'
libs = os.listdir(lib_dir)
assert 'libggml-cuda.so' in libs, 'CUDA backend not found!'
print('CUDA backend installed:', [l for l in libs if 'cuda' in l])
"
```
### Option B: vLLM with CUDA

Best for: Large models, high throughput, production deployments.

Save the following as `vllm_cuda.yml`:
```yaml
name: pita_vllm_cuda
channels:
  - defaults
  - nvidia
  - conda-forge
dependencies:
  - python=3.12
  - pip
  - cuda-toolkit=12.8
  - cxx-compiler
  - valkey-server
  - pip:
      - vllm==0.11.0
      - pandas==2.3.3
      - datasets==4.3.0
      - regex==2025.9.18
```
Then run:

```bash
# Create environment with vLLM and CUDA 12.8
conda env create -f vllm_cuda.yml
conda activate pita_vllm_cuda

# Install pita
pip install -e .
```
#### vLLM Requirements

- NVIDIA GPU with compute capability 7.0+ (Volta, Turing, Ampere, Ada, Hopper)
- CUDA 12.x driver installed on the host system
- Sufficient GPU memory for your target model
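If you are unsure of your GPU's compute capability, recent NVIDIA drivers can report it directly (older drivers may not support the `compute_cap` query field):

```bash
# Prints e.g. "8.6" for an Ampere RTX 30-series card
nvidia-smi --query-gpu=name,compute_cap --format=csv
```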
#### Verify vLLM Installation

```bash
python -c "import vllm; print(f'vLLM {vllm.__version__} installed successfully')"
```
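Beyond the import check, a short offline generation confirms the engine can load a model end to end. `facebook/opt-125m` is chosen here only because it is small; any model you have access to works.

```python
# Minimal vLLM offline-generation smoke test (sketch).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # arbitrary small model
params = SamplingParams(max_tokens=16, temperature=0.8)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```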
### Option C: TensorRT-LLM with CUDA

Best for: Maximum performance on NVIDIA hardware, production deployments.

Save the following as `tensorrt_cuda.yml`:
```yaml
name: pita_tensorrt_cuda
channels:
  - defaults
  - nvidia
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - cxx-compiler
  - onnx<1.16.0
  - mpi4py
  - openmpi
  - pytest
  - pip:
      - --extra-index-url https://pypi.nvidia.com/
      - tensorrt_llm
      - torch
      - transformers
      - numpy
      - scipy
      - pandas
      - regex
      - pydantic
      - fastapi
      - valkey
      - valkey-server
      - uvicorn
```
Then run:

```bash
# 1. Create environment
conda env create -f tensorrt_cuda.yml
conda activate pita_tensorrt_cuda

# 2. Install pita in editable mode
pip install -e .
```
#### Verify TensorRT-LLM Installation

```bash
python -c "import tensorrt_llm; print(f'TensorRT-LLM {tensorrt_llm.__version__} installed successfully')"
```
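Recent TensorRT-LLM releases ship a high-level `LLM` API that mirrors vLLM's. If your installed version exposes it, a smoke test looks roughly like the sketch below; the model choice is arbitrary, and the engine is compiled on first load, which takes several minutes.

```python
# TensorRT-LLM high-level API smoke test (sketch; requires a release that
# exposes tensorrt_llm.LLM). The model choice is arbitrary.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```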
## Choosing an Inference Engine
| Feature | llama.cpp | vLLM | TensorRT-LLM |
|---|---|---|---|
| Best for | Experimentation, quantized models | Production, high throughput | Maximum performance, production |
| Memory usage | Lower (supports aggressive quantization) | Higher | High (optimized for performance) |
| Model formats | GGUF | HuggingFace, GPTQ, AWQ | TensorRT engines (built from HF/ONNX) |
| Batch processing | Limited | Excellent | Excellent |
| Setup complexity | Simple | Moderate | Moderate/High |
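Note that the engines also consume different artifacts: llama.cpp loads GGUF files rather than raw HuggingFace checkpoints. A pre-quantized GGUF can be fetched with the `huggingface_hub` CLI; the repository below is just one example:

```bash
# Fetch a quantized GGUF model for llama.cpp (example repository/file)
pip install -U huggingface_hub
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf --local-dir ./models
```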
## Running Tests

After installation, verify everything works:

```bash
# Run the test suite
pytest tests/ -v

# Run specific backend tests
pytest tests/inference/ -v -k "llama"     # llama.cpp tests
pytest tests/inference/ -v -k "vllm"      # vLLM tests
pytest tests/inference/ -v -k "tensorrt"  # TensorRT-LLM tests
```
## Troubleshooting

### llama.cpp CUDA Build Errors

**"unsupported GNU version"**: add the compiler compatibility flag:

```bash
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_FLAGS=-allow-unsupported-compiler" pip install llama-cpp-python
```

**"cuda_runtime.h: No such file"**: point `CPATH` at the CUDA headers:

```bash
export CPATH=$CONDA_PREFIX/targets/x86_64-linux/include:$CPATH
```
### vLLM Import Errors

**GPU not detected**: ensure the NVIDIA drivers are installed:

```bash
nvidia-smi  # Should show your GPU
```
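If `nvidia-smi` looks healthy but vLLM still cannot see the GPU, checking CUDA visibility from Python (via the torch that vLLM installed) helps narrow the problem down:

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```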
### TensorRT-LLM Issues

**"AttributeError: module 'onnx.helper' has no attribute 'float32_to_bfloat16'"**: ensure you are using `onnx<1.16.0`.

**"ImportError: libmpi.so.40"**: ensure `openmpi` is installed via conda (`conda list openmpi`).
### General Issues

**Environment conflicts**: create a fresh environment:

```bash
conda env remove -n <env_name> -y
conda env create -f <environment_file.yml>
```
## Platform Support Matrix
| Platform | llama.cpp | vLLM | TensorRT-LLM | Status |
|---|---|---|---|---|
| Linux + NVIDIA CUDA | ✅ | ✅ | ✅ | Fully supported |
| Linux + CPU | ✅ | ❌ | ❌ | llama.cpp only |
| macOS + Apple Silicon | 🔄 | 🔄 | ❌ | In development |
| Linux + AMD ROCm | 🔄 | 🔄 | ❌ | In development |
✅ = Supported | 🔄 = In development | ❌ = Not supported