Skip to content

Serving Mode Examples

pita supports two primary modes of operation: Programmatic and API.

Programmatic Mode

Use pita directly in your Python code for maximum control and offline processing.

from pita.inference.LLM_backend import AutoregressiveSampler

# Initialize the sampler
sampler = AutoregressiveSampler(
    engine="vllm",
    model="Qwen/Qwen2.5-0.5B-Instruct",
    logits_processor=True
)

# Basic sampling
prompt = "Write a short story about a robot."
output = sampler.sample(prompt)
generated_text = sampler.tokenizer.decode(output.output_ids)
print(f"Generated text: {generated_text}")

# Power Sampling
sampler.enable_power_sampling(
    block_size=250,
    MCMC_steps=3,
    token_metric="power_distribution"
)

output = sampler.token_sample(prompt)
generated_text = sampler.tokenizer.decode(output.output_ids)
print(f"Generated text (Power Sampling): {generated_text}")

API Mode

Run pita as a server with an OpenAI-compatible API endpoint.

Starting the Server

Start the server using the pita serve command:

# Start with defaults
pita serve

# Customize model, engine, and port
pita serve --model Qwen/Qwen2.5-0.5B-Instruct --engine vllm --port 8001

# Short options are also available
pita serve -m Qwen/Qwen2.5-0.5B-Instruct -e vllm -p 8001

You can also use environment variables:

export PITA_ENGINE=vllm
export PITA_MODEL=Qwen/Qwen2.5-0.5B-Instruct
export PITA_PORT=8001
pita serve

Querying the API

Once the server is running, you can use any OpenAI-compatible client.

import openai

client = openai.OpenAI(
    base_url="http://localhost:8001/v1",
    api_key="none"
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."}
    ]
)

print(response.choices[0].message.content)

Advanced Sampling via System Prompt

You can trigger advanced sampling strategies by prefixing your system prompt with ITS and specific parameters:

# Example: Trigger Power Sampling with 1000 tokens, block size 250
system_prompt = "ITS PS_1000_250_3 You are a helpful assistant."