Usage Guide
This guide walks through using the pita library step by step.
Step 1: Choose an Inference Backend
The core functionality of pita revolves around inference engines and sampling strategies. First choose an inference backend that you would like to use:
- vLLM: A GPU/accelerator-focused platform that leverages PyTorch for inference. Supports multi-GPU setups, but offers limited under-the-hood customization.
- llama.cpp: A general-purpose backend for both CPUs and GPUs. Written in C++.
- transformers: A general-purpose backend for both CPUs and GPUs. Not optimized for inference, but exposes many under-the-hood customization options.
- TensorRT: Nvidia specific platform that is optimized for inference. Can only be used with select GPUs.
- DeepSpeed: TODO
Step 2: Choose Programmatic or API Serving Modes
pita can be used in two different modes:
- Programmatic: Use pita as a library to run offline inference and sampling strategies.
- API: Use pita as a server with limited customization options, but an OpenAI-compatible API endpoint.
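In API mode, any OpenAI-compatible client can talk to the server. The sketch below builds a standard completion request, assuming the default `pita serve` port (8001) and the conventional `/v1/completions` route of OpenAI-compatible servers; neither is pita-specific documentation, so adjust for your deployment:

```python
import json

# Default port from `pita serve`; adjust host/port to your deployment.
BASE_URL = "http://localhost:8001/v1"

# Standard OpenAI-style completion payload (fields assumed from
# OpenAI API compatibility, not from pita's own docs).
payload = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 50,
}

# With a running server, send it with any HTTP client, e.g.:
#   import requests
#   resp = requests.post(f"{BASE_URL}/completions", json=payload)
#   print(resp.json()["choices"][0]["text"])
print(json.dumps(payload, indent=2))
```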
Step 3.A: Choose a Sampling Strategy
pita provides several sampling strategies to generate diverse and high-quality outputs. Choose the strategy that best suits your needs:
- Power Sampling: Leverage Metropolis-Hastings MCMC Sampling to generate diverse and high-quality outputs.
- Sequential Monte Carlo/Particle Filtering: Generate diverse, high-quality token sequences by extending and resampling a population of partial sequences.
- Best-of-N: Generate N candidate sequences and keep the best one according to a scoring metric.
- Beam Search: Keep the top-k highest-scoring partial sequences at each decoding step.
- Combination of Strategies: Combine multiple strategies together to increase the reasoning capabilities of a model.
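To make the first strategy concrete, here is a toy Metropolis-Hastings sketch that samples from a power-scaled distribution p^alpha over a small discrete vocabulary. This is illustrative only, with hypothetical function names; it does not use pita's API, where real targets are distributions over token sequences:

```python
import random
from collections import Counter

def mh_power_samples(probs, alpha=2.0, steps=2000, burn_in=200, seed=0):
    """Draw samples from probs**alpha (unnormalized) via Metropolis-Hastings."""
    rng = random.Random(seed)
    state = rng.randrange(len(probs))
    samples = []
    for t in range(steps):
        proposal = rng.randrange(len(probs))  # symmetric uniform proposal
        # Acceptance ratio for target p(x)**alpha with a symmetric proposal.
        ratio = (probs[proposal] / probs[state]) ** alpha
        if rng.random() < ratio:
            state = proposal
        if t >= burn_in:
            samples.append(state)
    return samples

# Raising the distribution to alpha > 1 sharpens it toward its mode,
# so the chain concentrates on the highest-probability outcome.
counts = Counter(mh_power_samples([0.7, 0.2, 0.1], alpha=2.0))
```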
Step 3.B: Choose a Token Metric
pita provides several token metrics to evaluate the quality of generated outputs. Choose the metric that best suits your needs:
- Log Probability: Score candidates by the log probability of the generated tokens under regular sampling.
- Power Sampling: Score candidates under a sharpened (power-scaled) token distribution, as used by the power sampling strategy.
- Entropy: Score candidates by the entropy of the model's next-token distributions; lower entropy indicates more confident predictions.
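The log-probability and entropy metrics can be sketched generically from per-token probabilities (this is standard math, not pita's API): sequence log probability sums log p(token | prefix) over the generation, while entropy measures how spread out each next-token distribution is.

```python
import math

def sequence_log_prob(token_probs):
    """Sum of log p(token_t | prefix) over the generated tokens."""
    return sum(math.log(p) for p in token_probs)

def token_entropy(dist):
    """Shannon entropy of one next-token distribution (in nats)."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

# A confident (peaked) step has low entropy; a uniform step has entropy ln(n).
peaked = token_entropy([0.97, 0.01, 0.01, 0.01])
uniform = token_entropy([0.25, 0.25, 0.25, 0.25])
```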
Running from Command Line
The pita command provides a command-line interface for serving the API:
# Start with default settings
pita serve
# Customize with options
pita serve --model Qwen/Qwen2.5-0.5B-Instruct --engine vllm --port 8001
# Use short flags
pita serve -m ./model.gguf -e llama_cpp -p 8080
# View all options
pita serve --help
Available options:
- --model, -m: Model name or path (default: Qwen/Qwen2.5-0.5B-Instruct)
- --engine, -e: Inference engine (vllm or llama_cpp, default: vllm)
- --tokenizer, -t: Tokenizer path (optional, defaults to model path)
- --port, -p: Server port (default: 8001)
- --host, -h: Host address (default: 0.0.0.0)
All options can also be set via environment variables (PITA_MODEL, PITA_ENGINE, etc.).
Using in Python Code
You can import and use pita components directly in your Python scripts.
from pita.inference.LLM_backend import AutoregressiveSampler
# Initialize the sampler
sampler = AutoregressiveSampler(
engine="vllm",
model="facebook/opt-125m",
logits_processor=True,
max_probs=100
)
# Configure sampling parameters
sampler.sampling_params.max_tokens = 50
sampler.sampling_params.temperature = 1.0
# Generate text
context = "What is the capital of France?"
output = sampler.sample(context)
generated_text = sampler.tokenizer.decode(output.output_ids)
print(generated_text)
# Further usage with advanced sampling strategies (see examples)
Refer to the API Reference for detailed documentation on each module.