Usage Guide
This guide walks through using the pita library step by step.
Step 1: Choose an Inference Backend
The core functionality of pita revolves around inference engines and sampling strategies. First choose an inference backend that you would like to use:
- vLLM: A GPU/accelerator-focused platform that leverages PyTorch for inference. Supports multi-GPU setups, but offers limited under-the-hood customization.
- llama.cpp: A general-purpose backend for both CPUs and GPUs. Written in C++.
- transformers: A general-purpose backend for both CPUs and GPUs. Not optimized for inference, but exposes many under-the-hood customization options.
- TensorRT: Nvidia specific platform that is optimized for inference. Can only be used with select GPUs.
- DeepSpeed: TODO
Step 2: Choose Programmatic or API Serving Modes
pita can be used in two different modes:
- Programmatic: Use pita as a library to run offline inference and sampling strategies.
- API: Use pita as a server with limited customization options, but an OpenAI-compatible API endpoint.
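In API mode, any OpenAI-compatible client can talk to the server. The sketch below builds a standard completion request, assuming the default `pita serve` port (8001) and the conventional `/v1/completions` route of OpenAI-compatible servers; neither is pita-specific documentation, so adjust for your deployment:

```python
import json

# Default port from `pita serve`; adjust host/port to your deployment.
BASE_URL = "http://localhost:8001/v1"

# Standard OpenAI-style completion payload (fields assumed from
# OpenAI API compatibility, not from pita's own docs).
payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 50,
}

# With a running server, send it with any HTTP client, e.g.:
#   import requests
#   resp = requests.post(f"{BASE_URL}/completions", json=payload)
#   print(resp.json()["choices"][0]["text"])
print(json.dumps(payload, indent=2))
```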
Step 3.A: Choose a Sampling Strategy
pita provides several sampling strategies to generate diverse and high-quality outputs. Choose the strategy that best suits your needs:
- Power Sampling: Leverage Metropolis-Hastings MCMC Sampling to generate diverse and high-quality outputs.
- Sequential Monte Carlo/Particle Filtering: Generate diverse, high-quality token sequences by extending and resampling a population of partial sequences.
- Best-of-N: Generate N candidate sequences and keep the best one according to a scoring metric.
- Beam Search: Keep the top-k highest-scoring partial sequences at each decoding step.
- Combination of Strategies: Combine multiple strategies together to increase the reasoning capabilities of a model.
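To make the first strategy concrete, here is a toy Metropolis-Hastings sketch that samples from a power-scaled distribution p^alpha over a small discrete vocabulary. This is illustrative only, with hypothetical function names; it does not use pita's API, where real targets are distributions over token sequences:

```python
import random
from collections import Counter

def mh_power_samples(probs, alpha=2.0, steps=2000, burn_in=200, seed=0):
    """Draw samples from probs**alpha (unnormalized) via Metropolis-Hastings."""
    rng = random.Random(seed)
    state = rng.randrange(len(probs))
    samples = []
    for t in range(steps):
        proposal = rng.randrange(len(probs))  # symmetric uniform proposal
        # Acceptance ratio for target p(x)**alpha with a symmetric proposal.
        ratio = (probs[proposal] / probs[state]) ** alpha
        if rng.random() < ratio:
            state = proposal
        if t >= burn_in:
            samples.append(state)
    return samples

# Raising the distribution to alpha > 1 sharpens it toward its mode,
# so the chain concentrates on the highest-probability outcome.
counts = Counter(mh_power_samples([0.7, 0.2, 0.1], alpha=2.0))
```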
Step 3.B: Choose a Token Metric
pita provides several token metrics to evaluate the quality of generated outputs. Choose the metric that best suits your needs:
- Log Probability: Score candidates by the log probability of the generated tokens under regular sampling.
- Power Sampling: Score candidates under a sharpened (power-scaled) token distribution, as used by the power sampling strategy.
- Entropy: Score candidates by the entropy of the model's next-token distributions; lower entropy indicates more confident predictions.
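The log-probability and entropy metrics can be sketched generically from per-token probabilities (this is standard math, not pita's API): sequence log probability sums log p(token | prefix) over the generation, while entropy measures how spread out each next-token distribution is.

```python
import math

def sequence_log_prob(token_probs):
    """Sum of log p(token_t | prefix) over the generated tokens."""
    return sum(math.log(p) for p in token_probs)

def token_entropy(dist):
    """Shannon entropy of one next-token distribution (in nats)."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

# A confident (peaked) step has low entropy; a uniform step has entropy ln(n).
peaked = token_entropy([0.97, 0.01, 0.01, 0.01])
uniform = token_entropy([0.25, 0.25, 0.25, 0.25])
```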
Running from Command Line
The pita command provides a command-line interface for serving the API:
# Start with default settings
pita serve
# Customize with options
pita serve --model Qwen/Qwen2.5-0.5B-Instruct --engine vllm --port 8001
# Use short flags
pita serve -m ./model.gguf -e llama_cpp -p 8080
# View all options
pita serve --help
Available options:
- --model, -m: Model name or path (default: Qwen/Qwen2.5-0.5B-Instruct)
- --engine, -e: Inference engine (vllm or llama_cpp, default: vllm)
- --tokenizer, -t: Tokenizer path (optional, defaults to model path)
- --port, -p: Server port (default: 8001)
- --host, -h: Host address (default: 0.0.0.0)
All options can also be set via environment variables (PITA_MODEL, PITA_ENGINE, etc.).
Using in Python Code
You can import and use pita components directly in your Python scripts.
from pita.inference.LLM_backend import AutoregressiveSampler
# Initialize the sampler
sampler = AutoregressiveSampler(
engine="vllm",
model="facebook/opt-125m",
logits_processor=True,
max_probs=100
)
# Configure sampling parameters
sampler.sampling_params.max_tokens = 50
sampler.sampling_params.temperature = 1.0
# Generate text
context = "What is the capital of France?"
output = sampler.sample(context)
generated_text = sampler.tokenizer.decode(output.output_ids)
print(generated_text)
# Further usage with advanced sampling strategies (see examples)
Refer to the API Reference for detailed documentation on each module.