# ReinforceNow

> End-to-end platform for continual learning with AI agents. Deploy, train on production traffic, and continuously improve your models with reinforcement learning.

ReinforceNow is a platform for training AI models using reinforcement learning (RL) and supervised finetuning (SFT). It supports open-source LLMs including Qwen, Llama, DeepSeek, and OpenAI's GPT-OSS reasoning models.

## Docs

### Getting Started

- [Installation](https://reinforcenow.ai/docs/getting-started/installation): Install the ReinforceNow CLI using uv and Python 3.11. Supports automated setup via Claude Code.
- [Quickstart](https://reinforcenow.ai/docs/getting-started/quickstart): Your first ReinforceNow project. Finetune Qwen3-8B using SFT or RL in minutes.
- [Create Your First Reward](https://reinforcenow.ai/docs/getting-started/first-reward): Write reward functions using math-verify to train models on the OpenMathReasoning dataset.
- [Train Your First Agent](https://reinforcenow.ai/docs/getting-started/first-agent): Train an agent with tools using Wikipedia search for multi-turn RL.

### CLI Reference

- [CLI Commands](https://reinforcenow.ai/docs/cli-reference/cli): Full reference for rnow commands - login, init, run, stop, test, download, orgs, status, logout.
- [config.yml](https://reinforcenow.ai/docs/cli-reference/configuration): Project configuration reference for RL and SFT training. Includes data, model, algorithm, rollout, and trainer settings.
- [train.jsonl](https://reinforcenow.ai/docs/cli-reference/train-data): Training data format with messages, rewards, metadata, tools, variables, docker, and docker_env fields.
- [rewards.py](https://reinforcenow.ai/docs/cli-reference/rewards): Define reward functions with the @reward decorator. Supports precondition rewards and sandbox rewards.
- [tools.py](https://reinforcenow.ai/docs/cli-reference/tools): Define custom tools with the @tool decorator for multi-turn agent training. Supports sandbox tools.
- [Supported Models](https://reinforcenow.ai/docs/cli-reference/models): Complete list of supported models with parameters and capabilities.

### Tutorials

- [Supervised Finetuning](https://reinforcenow.ai/docs/tutorials/supervised-finetuning): Train models on labeled conversation data without reward functions.
- [Reasoning Mode](https://reinforcenow.ai/docs/tutorials/reasoning-mode): Enable chain-of-thought reasoning with `<think>` tags for supported models.
- [MCP Tools](https://reinforcenow.ai/docs/tutorials/mcp-agent): Connect agents to external tools via Model Context Protocol (Tavily, Exa, etc.).
- [Agent Harness](https://reinforcenow.ai/docs/tutorials/agent-harness): Understand ReinforceNow's append-only agent architecture and termination policies.
- [Distillation](https://reinforcenow.ai/docs/tutorials/distillation): On-policy and off-policy knowledge distillation from teacher models.
- [Chart Reasoning](https://reinforcenow.ai/docs/tutorials/chart-reasoning): Train vision-language models on chart QA tasks.
## Installation

```bash
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create Python 3.11 environment
uv venv --python 3.11
source .venv/bin/activate

# Install rnow CLI
uv pip install rnow

# Login to ReinforceNow
rnow login
```

## CLI Commands

```bash
rnow login                # Authenticate via OAuth device flow
rnow init --template      # Initialize project (rl-single, rl-tools, sft, tutorial-reward, tutorial-tool)
rnow run                  # Submit training job
rnow stop                 # Stop active training run
rnow test -n 3 --verbose  # Test rollouts locally before training
rnow download -o ./       # Download trained model checkpoint
rnow status               # Check authentication and running jobs
rnow orgs [ORG_ID]        # List or select organization
rnow logout               # Remove stored credentials
```

## config.yml (RL Example)

```yaml
dataset_type: rl
data:
  train_file: train.jsonl
  batch_size: 2
  group_size: 16                  # Parallel rollouts per prompt
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
  qlora_alpha: 64
algorithm:
  loss_fn: ppo                    # ppo or importance_sampling
  adv_estimator: grpo             # grpo, gae, or reinforce
  kl_penalty_coef: 0.01
rollout:
  max_turns: 1                    # Multi-turn: set > 1
  max_tokens: 2048
  max_context_window: 32768
  termination_policy: last_tool   # last_tool or max_turns
  reasoning_mode: medium          # disabled, low, medium, high (for reasoning models)
  mcp_url: "https://..."          # Optional: MCP server for external tools
trainer:
  num_epochs: 30
  learning_rate: 0.0001
  save_step: 20
# Optional: Run-dependent evals (create with rnow eval first)
evals:
  - eval_id: your_eval_id
    step: 100
```

## config.yml (SFT Example)

```yaml
dataset_type: sft
data:
  train_file: train.jsonl
  batch_size: 4
  val_split: 0.2
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
trainer:
  num_epochs: 3
  learning_rate: 0.0001
```

## train.jsonl Format

### RL Training Entry

```json
{"messages": [{"role": "user", "content": "What is 2+2?"}], "rewards": ["accuracy"], "metadata": {"answer": "4"}}
```

### SFT Training Entry

```json
{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi there!"}]}
```

### With Tools

```json
{"messages": [{"role": "user", "content": "Search for AI news"}], "rewards": ["quality"], "tools": ["search"]}
```

### With Sandbox (Isolated Docker Execution)

```json
{"messages": [{"role": "user", "content": "Write Python code"}], "rewards": ["file_created"], "tools": ["run_python"], "docker": "python:3.11-slim"}
```

### With Environment Variables

```json
{"messages": [...], "docker": "myorg/image:latest", "docker_env": {"API_KEY": "xxx"}}
```

## rewards.py

```python
from rnow.core import reward, RewardArgs


@reward
def accuracy(args: RewardArgs, messages: list) -> float:
    """Check if response matches expected answer."""
    response = messages[-1]["content"]
    expected = args.metadata["answer"]
    return 1.0 if expected in response else 0.0


@reward(precondition=True)
def has_format(args: RewardArgs, messages: list) -> float:
    """Gate reward: if 0, total reward is 0."""
    return 1.0 if "Answer:" in messages[-1]["content"] else 0.0


@reward(sandbox=True)
def file_created(args: RewardArgs, messages: list) -> float:
    """Check sandbox state (requires docker field in train.jsonl)."""
    import os
    return 1.0 if os.path.exists("output.txt") else 0.0
```

### Math Verification with math-verify

```python
from math_verify import LatexExtractionConfig, parse, verify
from rnow.core import reward, RewardArgs


@reward
def accuracy(args: RewardArgs, messages: list) -> float:
    """Math verification with LaTeX/boxed support."""
    gold = parse(args.metadata["expected_answer"])
    pred = parse(
        messages[-1]["content"],
        extraction_config=[LatexExtractionConfig(boxed_match_priority=0)]
    )
    return 1.0 if pred and verify(gold, pred) else 0.0
```
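A train.jsonl entry references these reward functions by name. The sketch below pairs the `has_format` precondition with the `accuracy` scorer; listing more than one reward name per entry is an assumption based on the `rewards` array and the "total reward" gating described above, and the prompt and answer values are illustrative.

```json
{"messages": [{"role": "user", "content": "What is 7*6? End with Answer: <result>"}], "rewards": ["has_format", "accuracy"], "metadata": {"answer": "42"}}
```

If `has_format` returns 0 for a rollout, the precondition zeroes out that rollout's total reward regardless of what `accuracy` returns.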
## tools.py

```python
from rnow.core.tool import tool


@tool
def calculator(expression: str) -> float:
    """Evaluate a math expression."""
    return eval(expression)


@tool
def web_search(query: str) -> dict:
    """Search the web and return results."""
    import requests
    resp = requests.get("https://api.example.com/search", params={"q": query})
    return resp.json()


@tool(sandbox=True)
def run_python(code: str) -> str:
    """Execute Python in isolated container (requires docker field)."""
    exec(code)
    return "Success"
```

## Supported Models

### Qwen (Text)

- `Qwen/Qwen3-235B-A22B-Instruct-2507` - 235B MoE (22B active), instruction-tuned
- `Qwen/Qwen3-30B-A3B-Instruct-2507` - 30B MoE (3B active), instruction-tuned
- `Qwen/Qwen3-30B-A3B` - 30B MoE, hybrid (thinking optional)
- `Qwen/Qwen3-32B` - 32B dense, hybrid
- `Qwen/Qwen3-8B` - 8B, hybrid (recommended for most tasks)
- `Qwen/Qwen3-4B-Instruct-2507` - 4B, instruction-tuned

### Qwen (Vision)

- `Qwen/Qwen3-VL-235B-A22B-Instruct` - 235B MoE vision model
- `Qwen/Qwen3-VL-30B-A3B-Instruct` - 30B MoE vision model

### Meta Llama

- `meta-llama/Llama-3.3-70B-Instruct` - 70B, instruction-tuned
- `meta-llama/Llama-3.1-70B` - 70B
- `meta-llama/Llama-3.1-8B-Instruct` - 8B, instruction-tuned
- `meta-llama/Llama-3.1-8B` - 8B
- `meta-llama/Llama-3.2-3B` - 3B
- `meta-llama/Llama-3.2-1B` - 1B (fastest iteration)

### DeepSeek

- `deepseek-ai/DeepSeek-V3.1` - MoE, hybrid
- `deepseek-ai/DeepSeek-V3.1-Base` - MoE base model

### OpenAI Open Source (Reasoning)

- `openai/gpt-oss-120b` - 120B MoE, always chain-of-thought
- `openai/gpt-oss-20b` - 20B MoE, reasoning model

### Moonshot (Reasoning)

- `moonshotai/Kimi-K2-Thinking` - 1T+ MoE, long reasoning chains

## Key Concepts

### Training Types

- **RL (Reinforcement Learning)**: Model generates responses, rewards are computed by user-defined functions, weights updated via PPO/GRPO
- **SFT (Supervised Finetuning)**: Standard next-token prediction on conversation data

### Rollouts

- Episodes of model inference during RL training
- Configure `max_turns` for multi-turn conversations (see the rollout sketch at the end of this file)
- `termination_policy`: `last_tool` (ends when no tool call) or `max_turns` (fixed turns)

### Reasoning Mode

- Supported models can reason in `<think>...</think>` tags before answering
- Modes: `disabled`, `low`, `medium`, `high`
- Increase `max_context_window` (8192-16384) for reasoning models

### Sandbox Execution

- Tools/rewards with `sandbox=True` run in isolated Docker containers
- Requires a `docker` field in the train.jsonl entry
- Build images for linux/amd64: `docker build --platform linux/amd64 -t image .`

### MCP (Model Context Protocol)

- Connect external tool servers without writing code
- Single: `mcp_url: "https://mcp.tavily.com/..."`
- Multiple: `mcp_url: ["https://server1/...", "https://server2/..."]`
- Works alongside custom tools.py

### Multi-Model Training

```yaml
model:
  path:
    - Qwen/Qwen3-4B-Instruct-2507
    - Qwen/Qwen3-8B
    - Qwen/Qwen3-30B-A3B
```

The CLI submits a separate run for each model.

## Links

- Website: https://reinforcenow.ai
- Dashboard: https://reinforcenow.ai/home
- Documentation: https://reinforcenow.ai/docs
- Pricing: https://reinforcenow.ai/pricing
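## Example: Multi-Turn Rollout (Sketch)

The rollout concepts above (multi-turn conversations, termination policies, reasoning mode) come together in the `rollout` block of config.yml. This is a minimal sketch that reuses only fields shown in the RL config example; the specific values are illustrative, not recommendations.

```yaml
rollout:
  max_turns: 8                    # > 1 turns a single-shot rollout into an agent loop
  termination_policy: last_tool   # episode ends when the model replies without a tool call
  max_tokens: 2048
  max_context_window: 32768       # extra headroom for accumulated turns and reasoning
  reasoning_mode: medium          # only relevant for reasoning-capable models
```

With `termination_policy: max_turns` instead, every rollout runs for the fixed number of turns.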