PITA

PITA (Probabilistic Inference Time Algorithms) is a library designed to consolidate and simplify the usage of probabilistic inference time algorithms with LLMs. It is built on top of existing inference frameworks and provides a unified interface for different inference backends.

Introduction

pita splits probabilistic inference time scaling methods into two categories:

Chain Scaling: Methods that curate multiple responses to a prompt. For example, Best-of-N generates N sequences and returns the best one according to a decision metric.
Token Scaling: Methods that curate the tokens generated for a prompt. For example, Power Sampling iteratively improves the generated tokens through Metropolis-Hastings sampling combined with a decision metric.
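
As a rough illustration of chain scaling, the following is a minimal Best-of-N sketch. It does not use pita's API: `best_of_n`, `sequence_logprob`, and `make_toy_generator` are hypothetical names, and the "model" is a random toy generator that stands in for an LLM call returning text plus per-token log probabilities.

```python
import math
import random
from typing import Callable, List, Tuple

# A candidate is (text, per-token log probabilities).
Candidate = Tuple[str, List[float]]


def sequence_logprob(token_logprobs: List[float]) -> float:
    """Decision metric: total log probability of the sequence."""
    return sum(token_logprobs)


def best_of_n(
    generate: Callable[[str], Candidate],
    prompt: str,
    n: int,
    metric: Callable[[List[float]], float] = sequence_logprob,
) -> Candidate:
    """Draw n candidates and keep the one with the highest metric score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: metric(c[1]))


def make_toy_generator(seed: int) -> Callable[[str], Candidate]:
    """Hypothetical stand-in for an LLM: random 'tokens' with random log probs."""
    rng = random.Random(seed)
    vocab = ["yes", "no", "maybe"]

    def generate(prompt: str) -> Candidate:
        tokens = [rng.choice(vocab) for _ in range(3)]
        logprobs = [math.log(rng.uniform(0.1, 1.0)) for _ in tokens]
        return " ".join(tokens), logprobs

    return generate


if __name__ == "__main__":
    text, logprobs = best_of_n(make_toy_generator(0), "Is the sky blue?", n=4)
    print(text, round(sequence_logprob(logprobs), 3))
```

Swapping the `metric` argument is where a different decision metric (for example an entropy-based score) would plug in under this sketch.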

Chain scaling methods can be combined with token scaling methods to create a hybrid scaling method. For example, Power Best-of-N creates N chains, each generated with Power Sampling. (WIP) See this flow chart for specifics on how to create custom hybrid scaling methods.

Chain and token scaling methods share the same decision metrics. A decision metric can be based on token probabilities or on external graders and process reward models.

This library can also be used to generate non-probabilistic, non-test-time scaled outputs while taking advantage of the unified interface for different inference backends. Different models run better on different engines and hardware. Develop on your CPU before deploying on your GPU. Swap between ROCm, CUDA, and CPU. pita provides a unified interface for the most popular inference backends while your source code remains the same.
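
The idea of keeping application code unchanged across backends can be sketched with a small interface. Everything here is an assumption made for illustration: `Backend`, `ToyCpuBackend`, `ToyGpuBackend`, `BACKENDS`, and `run` are hypothetical names and are not pita's actual classes or functions. The toy backends only manipulate strings so the example runs anywhere.

```python
from typing import Dict, Protocol


class Backend(Protocol):
    """Hypothetical backend contract (an assumption, not pita's real API)."""

    def generate(self, prompt: str, max_tokens: int) -> str:
        ...


class ToyCpuBackend:
    """Toy stand-in for one engine: echoes the first max_tokens words."""

    def generate(self, prompt: str, max_tokens: int) -> str:
        return " ".join(prompt.split()[:max_tokens])


class ToyGpuBackend:
    """Toy stand-in for another engine: same words, uppercased."""

    def generate(self, prompt: str, max_tokens: int) -> str:
        return " ".join(prompt.split()[:max_tokens]).upper()


# Registry mapping a configuration string to a backend instance.
BACKENDS: Dict[str, Backend] = {
    "toy-cpu": ToyCpuBackend(),
    "toy-gpu": ToyGpuBackend(),
}


def run(prompt: str, backend_name: str, max_tokens: int = 8) -> str:
    """Application code: only the backend name changes between environments."""
    return BACKENDS[backend_name].generate(prompt, max_tokens)


if __name__ == "__main__":
    for name in BACKENDS:
        print(name, "->", run("hello big world", name, max_tokens=2))
```

In this pattern the caller depends only on the shared `generate` signature, so moving from one engine to another is a configuration change rather than a code change.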

Key Features

Sampling Methodologies

Power Sampling: Leverage Metropolis-Hastings MCMC Sampling to generate diverse and high-quality outputs.
Sequential Monte Carlo (SMC): Sequential Monte Carlo/Particle Filtering generates diverse and high-quality token sequences, parsing and extending sequences.
Best-of-N: Generate N sequences and select the best based on decision metrics
(WIP) Beam Search: Maintain multiple candidate sequences during generation
Hybrid Strategies: Combine chain and token-level scaling methods
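
To make the Metropolis-Hastings idea behind Power Sampling concrete, here is a toy sketch on a three-outcome distribution instead of LLM token sequences. It assumes the target is the power distribution proportional to p(x)^alpha and that proposals are drawn independently from p itself, so the acceptance ratio simplifies to (p(x') / p(x))^(alpha - 1). The function names are hypothetical and this is not pita's implementation.

```python
import random
from collections import Counter
from typing import Dict


def power_target(p: Dict[str, float], alpha: float) -> Dict[str, float]:
    """Exact normalized p(x)^alpha, used only to check the sampler."""
    weights = {x: px ** alpha for x, px in p.items()}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}


def mh_power_sample(
    p: Dict[str, float], alpha: float, steps: int, seed: int = 0
) -> Counter:
    """Metropolis-Hastings with independent proposals drawn from p.

    Target: pi(x) proportional to p(x)^alpha.
    Acceptance: min(1, (p(x') / p(x)) ** (alpha - 1)), because the
    proposal probabilities cancel one power of p on each side.
    """
    rng = random.Random(seed)
    items = list(p)
    probs = [p[x] for x in items]
    current = rng.choices(items, weights=probs)[0]
    counts: Counter = Counter()
    for _ in range(steps):
        proposal = rng.choices(items, weights=probs)[0]
        accept = min(1.0, (p[proposal] / p[current]) ** (alpha - 1.0))
        if rng.random() < accept:
            current = proposal
        counts[current] += 1
    return counts


if __name__ == "__main__":
    base = {"a": 0.5, "b": 0.3, "c": 0.2}
    steps = 20000
    counts = mh_power_sample(base, alpha=2.0, steps=steps)
    target = power_target(base, alpha=2.0)
    for x in base:
        print(x, round(counts[x] / steps, 3), round(target[x], 3))
```

With alpha greater than 1 the chain spends more time on high-probability outcomes than plain sampling from p would, which is the sharpening effect the toy is meant to show.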

Decision Metrics

Log Probabilities: Standard model confidence scoring based on token probabilities
Power Distribution: Temperature-scaled confidence metrics using logits and normalization constants
Entropy: Model uncertainty quantification at each token position
Likelihood Confidence: Combined metric multiplying probability by confidence (exp(-entropy))
(WIP) Entropy Minimization Inference: Advanced entropy-based sampling
(WIP) Process Reward Models (PRM): External graders for decision-making
(WIP) Verifiers: External verification models for quality assessment
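
The probability-based metrics above can be written down in a few lines. This sketch assumes each token position provides a full probability distribution over a small vocabulary plus the probability of the chosen token, and that Likelihood Confidence is applied per token as probability times exp(-entropy). The function names are hypothetical and how pita aggregates these values over a sequence is not specified here.

```python
import math
from typing import Sequence


def sequence_log_probability(chosen_probs: Sequence[float]) -> float:
    """Log Probabilities: sum of log p(token) over the chosen tokens."""
    return sum(math.log(p) for p in chosen_probs)


def token_entropy(dist: Sequence[float]) -> float:
    """Entropy (in nats) of the distribution at one token position."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def likelihood_confidence(chosen_prob: float, dist: Sequence[float]) -> float:
    """Probability of the chosen token multiplied by exp(-entropy)."""
    return chosen_prob * math.exp(-token_entropy(dist))


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain position
    peaked = [0.97, 0.01, 0.01, 0.01]    # confident position
    print(round(token_entropy(uniform), 4), round(token_entropy(peaked), 4))
    print(round(likelihood_confidence(0.25, uniform), 4))
    print(round(likelihood_confidence(0.97, peaked), 4))
    print(round(sequence_log_probability([0.25, 0.97]), 4))
```

A confident position has low entropy, so exp(-entropy) stays close to 1 and the metric is close to the token probability; an uncertain position is penalized twice, once through the lower probability and once through the higher entropy.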

Inference Backends

vLLM: High-throughput GPU inference (primary backend, fully supported)
llama.cpp: CPU/GPU inference with GGUF model support (fully supported)
TensorRT: NVIDIA-optimized inference (supported, requires Valkey)
(WIP) Transformers: HuggingFace integration for flexibility (basic support)
(WIP) DeepSpeed: Distributed inference support

Getting Started

Installation: Set up your environment and install the library.
Usage: Learn the basics of running inference and using the library.
API Reference: Dive into the technical details of modules and classes.

Contributing

We welcome contributions! Please see the repository for more details on how to contribute.

License

PITA is dual-licensed under:
- GNU Affero General Public License v3.0 or later (AGPLv3+): for open source use
- Commercial License: for proprietary use

For Open Source Users

Use PITA freely under AGPLv3+. Key requirements:
- Make source code available to all users (including network users)
- License derivatives under AGPLv3+
- See LICENSE for full terms

For Commercial Users

Use PITA in proprietary software without AGPLv3 obligations.
- Contact: sales@cobi-inc.com

Dependencies & Special Cases

All dependencies use permissive licenses (MIT, BSD, Apache 2.0). See NOTICE.

TensorRT backend (optional): Requires NVIDIA's proprietary TensorRT with separate licensing. See TENSORRT-LICENSE-NOTICE.md.

HuggingFace models: Individual models have their own licenses - users must verify before use.

Complete guide: See LICENSING-GUIDE.md

Citation

If you use PITA in your research, please cite it as follows:

@misc{pita2026,
  author = {COBI, Inc. Engineering Team},
  title = {PITA: Probabilistic Inference Time Algorithms},
  year = {2026},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cobi-inc-MC/pita}}
}