# ReinforceNow

> End-to-end platform for continual learning with AI agents. Deploy, train on production traffic, and continuously improve your models with reinforcement learning.

ReinforceNow is a platform for training AI models using reinforcement learning (RL) and supervised finetuning (SFT). It supports open-source LLMs including Qwen, Llama, DeepSeek, and OpenAI's GPT-OSS reasoning models.

## Docs

### Getting Started

- [Installation](https://reinforcenow.ai/docs/getting-started/installation): Install the ReinforceNow CLI using uv and Python 3.11. Supports automated setup via Claude Code.
- [Quickstart](https://reinforcenow.ai/docs/getting-started/quickstart): Your first ReinforceNow project. Finetune Qwen3-8B using SFT or RL in minutes.
- [Create Your First Reward](https://reinforcenow.ai/docs/getting-started/first-reward): Write reward functions using math-verify to train models on the OpenMathReasoning dataset.
- [Train Your First Agent](https://reinforcenow.ai/docs/getting-started/first-agent): Train an agent with tools using Wikipedia search for multi-turn RL.

### CLI Reference

- [CLI Commands](https://reinforcenow.ai/docs/cli-reference/cli): Full reference for rnow commands - login, init, run, stop, test, download, orgs, status, logout.
- [config.yml](https://reinforcenow.ai/docs/cli-reference/configuration): Project configuration reference for RL and SFT training. Includes data, model, algorithm, rollout, and trainer settings.
- [train.jsonl](https://reinforcenow.ai/docs/cli-reference/train-data): Training data format with messages, rewards, metadata, tools, variables, docker, and docker_env fields.
- [rewards.py](https://reinforcenow.ai/docs/cli-reference/rewards): Define reward functions with the @reward decorator. Supports precondition rewards and sandbox rewards.
- [tools.py](https://reinforcenow.ai/docs/cli-reference/tools): Define custom tools with the @tool decorator for multi-turn agent training. Supports sandbox tools.
- [Supported Models](https://reinforcenow.ai/docs/cli-reference/models): Complete list of supported models with parameters and capabilities.

### Tutorials

- [Supervised Finetuning](https://reinforcenow.ai/docs/tutorials/supervised-finetuning): Train models on labeled conversation data without reward functions.
- [Reasoning Mode](https://reinforcenow.ai/docs/tutorials/reasoning-mode): Enable chain-of-thought reasoning with `<think>` tags for supported models.
- [MCP Tools](https://reinforcenow.ai/docs/tutorials/mcp-agent): Connect agents to external tools via Model Context Protocol (Tavily, Exa, etc.).
- [Agent Harness](https://reinforcenow.ai/docs/tutorials/agent-harness): Understand ReinforceNow's append-only agent architecture and termination policies.
- [Distillation](https://reinforcenow.ai/docs/tutorials/distillation): On-policy and off-policy knowledge distillation from teacher models.
- [Chart Reasoning](https://reinforcenow.ai/docs/tutorials/chart-reasoning): Train vision-language models on chart QA tasks.
## Installation

```bash
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create Python 3.11 environment
uv venv --python 3.11
source .venv/bin/activate

# Install rnow CLI
uv pip install rnow

# Login to ReinforceNow
rnow login
```

## CLI Commands

```bash
rnow login                # Authenticate via OAuth device flow
rnow init --template      # Initialize project (rl-single, rl-tools, sft, tutorial-reward, tutorial-tool)
rnow run                  # Submit training job
rnow stop                 # Stop active training run
rnow test -n 3 --verbose  # Test rollouts locally before training
rnow download -o ./       # Download trained model checkpoint
rnow status               # Check authentication and running jobs
rnow orgs [ORG_ID]        # List or select organization
rnow logout               # Remove stored credentials
```

## config.yml (RL Example)

```yaml
dataset_type: rl
data:
  train_file: train.jsonl
  batch_size: 2
  group_size: 16                  # Parallel rollouts per prompt
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
  qlora_alpha: 64
algorithm:
  loss_fn: ppo                    # ppo or importance_sampling
  adv_estimator: grpo             # grpo, gae, or reinforce
  kl_penalty_coef: 0.01
rollout:
  max_turns: 1                    # Multi-turn: set > 1
  max_tokens: 2048
  max_context_window: 32768
  termination_policy: last_tool   # last_tool or max_turns
  reasoning_mode: medium          # disabled, low, medium, high (for reasoning models)
  mcp_url: "https://..."          # Optional: MCP server for external tools
trainer:
  num_epochs: 30
  learning_rate: 0.0001
  save_step: 20
# Optional: Run-dependent evals (create with rnow eval first)
evals:
  - eval_id: your_eval_id
    step: 100
```

## config.yml (SFT Example)

```yaml
dataset_type: sft
data:
  train_file: train.jsonl
  batch_size: 4
  val_split: 0.2
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
trainer:
  num_epochs: 3
  learning_rate: 0.0001
```

## train.jsonl Format

### RL Training Entry

```json
{"messages": [{"role": "user", "content": "What is 2+2?"}], "rewards": ["accuracy"], "metadata": {"answer": "4"}}
```

### SFT Training Entry

```json
{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi there!"}]}
```

### With Tools

```json
{"messages": [{"role": "user", "content": "Search for AI news"}], "rewards": ["quality"], "tools": ["search"]}
```

### With Sandbox (Isolated Docker Execution)

```json
{"messages": [{"role": "user", "content": "Write Python code"}], "rewards": ["file_created"], "tools": ["run_python"], "docker": "python:3.11-slim"}
```

### With Environment Variables

```json
{"messages": [...], "docker": "myorg/image:latest", "docker_env": {"API_KEY": "xxx"}}
```

## rewards.py

```python
from rnow.core import reward, RewardArgs


@reward
def accuracy(args: RewardArgs, messages: list) -> float:
    """Check if response matches expected answer."""
    response = messages[-1]["content"]
    expected = args.metadata["answer"]
    return 1.0 if expected in response else 0.0


@reward(precondition=True)
def has_format(args: RewardArgs, messages: list) -> float:
    """Gate reward: if 0, total reward is 0."""
    return 1.0 if "Answer:" in messages[-1]["content"] else 0.0


@reward(sandbox=True)
def file_created(args: RewardArgs, messages: list) -> float:
    """Check sandbox state (requires docker field in train.jsonl)."""
    import os
    return 1.0 if os.path.exists("output.txt") else 0.0
```

### Math Verification with math-verify

```python
from math_verify import LatexExtractionConfig, parse, verify
from rnow.core import reward, RewardArgs


@reward
def accuracy(args: RewardArgs, messages: list) -> float:
    """Math verification with LaTeX/boxed support."""
    gold = parse(args.metadata["expected_answer"])
    pred = parse(
        messages[-1]["content"],
        extraction_config=[LatexExtractionConfig(boxed_match_priority=0)]
    )
    return 1.0 if pred and verify(gold, pred) else 0.0
```
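A train.jsonl entry references these reward functions by name. The sketch below pairs the `has_format` precondition with the `accuracy` scorer; listing more than one reward name per entry is an assumption based on the `rewards` array and the "total reward" gating described above, and the prompt and answer values are illustrative.

```json
{"messages": [{"role": "user", "content": "What is 7*6? End with Answer: <result>"}], "rewards": ["has_format", "accuracy"], "metadata": {"answer": "42"}}
```

If `has_format` returns 0 for a rollout, the precondition zeroes out that rollout's total reward regardless of what `accuracy` returns.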
## tools.py

```python
from rnow.core.tool import tool


@tool
def calculator(expression: str) -> float:
    """Evaluate a math expression."""
    return eval(expression)


@tool
def web_search(query: str) -> dict:
    """Search the web and return results."""
    import requests
    resp = requests.get("https://api.example.com/search", params={"q": query})
    return resp.json()


@tool(sandbox=True)
def run_python(code: str) -> str:
    """Execute Python in isolated container (requires docker field)."""
    exec(code)
    return "Success"
```

## Supported Models

### Qwen (Text)

- `Qwen/Qwen3-235B-A22B-Instruct-2507` - 235B MoE (22B active), instruction-tuned
- `Qwen/Qwen3-30B-A3B-Instruct-2507` - 30B MoE (3B active), instruction-tuned
- `Qwen/Qwen3-30B-A3B` - 30B MoE, hybrid (thinking optional)
- `Qwen/Qwen3-32B` - 32B dense, hybrid
- `Qwen/Qwen3-8B` - 8B, hybrid (recommended for most tasks)
- `Qwen/Qwen3-4B-Instruct-2507` - 4B, instruction-tuned

### Qwen (Vision)

- `Qwen/Qwen3-VL-235B-A22B-Instruct` - 235B MoE vision model
- `Qwen/Qwen3-VL-30B-A3B-Instruct` - 30B MoE vision model

### Meta Llama

- `meta-llama/Llama-3.3-70B-Instruct` - 70B, instruction-tuned
- `meta-llama/Llama-3.1-70B` - 70B
- `meta-llama/Llama-3.1-8B-Instruct` - 8B, instruction-tuned
- `meta-llama/Llama-3.1-8B` - 8B
- `meta-llama/Llama-3.2-3B` - 3B
- `meta-llama/Llama-3.2-1B` - 1B (fastest iteration)

### DeepSeek

- `deepseek-ai/DeepSeek-V3.1` - MoE, hybrid
- `deepseek-ai/DeepSeek-V3.1-Base` - MoE base model

### OpenAI Open Source (Reasoning)

- `openai/gpt-oss-120b` - 120B MoE, always chain-of-thought
- `openai/gpt-oss-20b` - 20B MoE, reasoning model

### Moonshot (Reasoning)

- `moonshotai/Kimi-K2-Thinking` - 1T+ MoE, long reasoning chains

## Key Concepts

### Training Types

- **RL (Reinforcement Learning)**: Model generates responses, rewards are computed by user-defined functions, weights updated via PPO/GRPO
- **SFT (Supervised Finetuning)**: Standard next-token prediction on conversation data

### Rollouts

- Episodes of model inference during RL training
- Configure `max_turns` for multi-turn conversations (see the rollout sketch at the end of this file)
- `termination_policy`: `last_tool` (ends when no tool call) or `max_turns` (fixed turns)

### Reasoning Mode

- Supported models can reason in `<think>...</think>` tags before answering
- Modes: `disabled`, `low`, `medium`, `high`
- Increase `max_context_window` (8192-16384) for reasoning models

### Sandbox Execution

- Tools/rewards with `sandbox=True` run in isolated Docker containers
- Requires a `docker` field in the train.jsonl entry
- Build images for linux/amd64: `docker build --platform linux/amd64 -t image .`

### MCP (Model Context Protocol)

- Connect external tool servers without writing code
- Single: `mcp_url: "https://mcp.tavily.com/..."`
- Multiple: `mcp_url: ["https://server1/...", "https://server2/..."]`
- Works alongside custom tools.py

### Multi-Model Training

```yaml
model:
  path:
    - Qwen/Qwen3-4B-Instruct-2507
    - Qwen/Qwen3-8B
    - Qwen/Qwen3-30B-A3B
```

The CLI submits a separate run for each model.

## Links

- Website: https://reinforcenow.ai
- Dashboard: https://reinforcenow.ai/home
- Documentation: https://reinforcenow.ai/docs
- Pricing: https://reinforcenow.ai/pricing
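## Example: Multi-Turn Rollout (Sketch)

The rollout concepts above (multi-turn conversations, termination policies, reasoning mode) come together in the `rollout` block of config.yml. This is a minimal sketch that reuses only fields shown in the RL config example; the specific values are illustrative, not recommendations.

```yaml
rollout:
  max_turns: 8                    # > 1 turns a single-shot rollout into an agent loop
  termination_policy: last_tool   # episode ends when the model replies without a tool call
  max_tokens: 2048
  max_context_window: 32768       # extra headroom for accumulated turns and reasoning
  reasoning_mode: medium          # only relevant for reasoning-capable models
```

With `termination_policy: max_turns` instead, every rollout runs for the fixed number of turns.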