Train CodeFu-7B with veRL and Ray on Amazon SageMaker Training jobs

The rapid advancement of artificial intelligence (AI) has created unprecedented demand for specialized models capable of complex reasoning tasks, particularly in competitive programming where models must generate functional code through algorithmic reasoning rather than pattern memorization. Reinforcement learning (RL) enables models to learn through trial and error by receiving rewards based on actual code execution, making it particularly well-suited for developing genuine problem-solving capabilities in algorithmic domains.

However, implementing distributed RL training for code generation presents significant infrastructure challenges, such as orchestrating multiple heterogeneous components, coordinating parallel code compilation across nodes, and maintaining fault tolerance for long-running processes. Ray is a distributed computing framework that addresses these challenges through a unified runtime that handles the entire AI pipeline, a GPU-first architecture, and seamless integration with tools such as Hugging Face Transformers and PyTorch.

You can run Ray workloads on SageMaker training jobs by using the Ray on Amazon SageMaker Training jobs solution, which combines Ray’s distributed computing framework with SageMaker’s fully managed infrastructure. This solution automatically handles Ray cluster initialization, multi-node coordination, and distributed resource management, enabling developers to focus on model development while benefiting from SageMaker’s enterprise-grade features.

In this post, we demonstrate how to train CodeFu-7B, a specialized 7-billion parameter model for competitive programming, using Group Relative Policy Optimization (GRPO) with veRL, a flexible and efficient training library for large language models (LLMs) that enables straightforward extension of diverse RL algorithms and seamless integration with existing LLM infrastructure, within a distributed Ray cluster managed by SageMaker training jobs.

We walk through the complete implementation, covering data preparation, distributed training setup, and comprehensive observability, showcasing how this unified approach delivers both computational scale and a strong developer experience for sophisticated RL training workloads.

CodeFu-7B-v0.1 is a 7B-parameter language model specifically trained for solving Competitive Programming (CP) problems. Built upon the DeepSeek-R1-Distill-Qwen-7B base model, CodeFu demonstrates how reinforcement learning can develop capabilities in algorithmic reasoning and efficient C++ code generation beyond traditional supervised fine-tuning approaches.

The model is trained using problem statements from the DeepMind CodeContest dataset without access to ground-truth solutions during training, forcing it to learn through trial and error based on code execution feedback. This approach enables the development of genuine problem-solving capabilities rather than pattern memorization. CodeFu is publicly available on Hugging Face and released under the MIT license, making it accessible for researchers and practitioners interested in code generation and algorithmic reasoning.

The model’s training methodology demonstrates the potential for applying reinforcement learning techniques to complex reasoning tasks beyond competitive programming.

Ray on Amazon SageMaker Training jobs is a solution that enables distributed data processing and model training using Ray within SageMaker’s managed training environment. The solution provides key capabilities, including a universal launcher architecture for automatic Ray cluster setup, multi-node cluster management with intelligent coordination, heterogeneous cluster support for mixed instance types, and integrated observability through the Ray Dashboard, Prometheus, Grafana, and Amazon CloudWatch.

The solution seamlessly integrates with the SageMaker Python SDK using the modern ModelTrainer API. This publicly available solution on GitHub enables developers to use Ray’s distributed computing capabilities while benefiting from SageMaker’s managed infrastructure, making it ideal for complex workloads like reinforcement learning training that require sophisticated distributed coordination and resource management.

This streamlined architecture delivers a fully managed reinforcement learning training experience, enabling developers to focus on model development while SageMaker and Ray handle the complex distributed infrastructure orchestration, within a pay-as-you-go pricing model that bills only for actual compute time.

Note: The IAM permissions used in this example grant broad access and are not recommended for use in production environments. See the SageMaker Developer Guide for guidance on defining more fine-grained permissions.

The data preparation pipeline transforms the raw DeepMind CodeContest dataset into a format suitable for reinforcement learning training. We apply systematic filters to identify suitable problems, removing those with Codeforces ratings below 800 and implementing quality validation checks for missing test cases, malformed descriptions, and invalid constraints.

We categorize problems into three difficulty tiers: Easy (800-1000 points), Hard (1100-2200 points), and Expert (2300-3500 points). This post uses only the Easy dataset for training. Each problem is formatted with two components: a user prompt containing the problem statement, and a reward_model specification with test cases, time limits, and memory constraints. Crucially, the ground_truth field contains no solution code — only test cases, forcing the model to learn through reward signals rather than memorizing solutions.
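As a concrete illustration, a single prepared training record might look like the following. The top-level prompt, reward_model, and ground_truth structure follows the description above; the nested field names and sample values here are assumptions for illustration, not the exact schema from the post's code.

```python
# Illustrative sketch of one prepared training record. The prompt /
# reward_model / ground_truth layout follows the description in the post;
# the nested field names and values are assumptions.
example = {
    "prompt": [
        {
            "role": "user",
            "content": "Given an integer n, print the sum 1 + 2 + ... + n.",
        }
    ],
    "reward_model": {
        "ground_truth": {
            # Only test cases are stored -- no reference solution,
            # so the model must learn from execution rewards alone.
            "inputs": ["3\n", "10\n"],
            "outputs": ["6\n", "55\n"],
            "time_limit_s": 1.0,
            "memory_limit_mb": 256,
        },
    },
    "difficulty": "Easy",  # 800-1000 rating tier used in this post
}

# The crucial property: no solution code anywhere in the reward spec.
assert "solution" not in example["reward_model"]["ground_truth"]
```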

For this post, we provide a pre-processed subset of the Easy difficulty dataset in the code sample to streamline the training example, accessible from the GitHub repository. The training process uses Ray to orchestrate the distributed execution and synchronization of vLLM rollout, reward evaluation (code compilation and execution), FSDP model parallelism, and Ulysses sequence parallelism. We set the degree of sequence parallelism to 4 for long-form reasoning and code generations.
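A training launch with this setting might look like the following Hydra-style veRL invocation; the exact configuration keys and paths can vary between veRL versions, so treat these overrides as assumptions rather than the post's exact command.

```shell
# Illustrative veRL launch fragment (Hydra-style overrides); keys and
# paths are assumptions and may differ between veRL versions.
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=4 \
    data.train_files=/opt/ml/input/data/train/easy.parquet \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2
```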

The veRL framework implements a sophisticated multi-component architecture through its main_ppo.py orchestrator, which coordinates three primary distributed worker types: ActorRolloutRefWorker for policy inference and rollouts, CriticWorker for value function estimation, and RewardModelWorker for scoring generated solutions. The GRPO algorithm enhances traditional proximal policy optimization (PPO) by computing advantages using group-relative baselines, which helps stabilize training by reducing variance in policy gradient estimates.
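The group-relative baseline at the heart of GRPO can be sketched in a few lines. This is a simplified per-group computation; veRL's actual implementation operates on batched token-level tensors and includes KL regularization against the reference policy.

```python
import statistics

def grpo_advantages(group_rewards):
    """Compute group-relative advantages for one prompt's rollout group.

    GRPO replaces a learned value baseline with the group mean: each
    sampled solution's advantage is its reward minus the mean reward of
    the group, normalized by the group's standard deviation. This
    reduces variance in the policy gradient without a critic network.
    """
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    eps = 1e-6  # avoid division by zero when all rewards are equal
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four sampled solutions for the same problem: two failures, a partial
# pass, and a full pass.
advs = grpo_advantages([-1.0, -0.5, 0.4, 1.0])
assert abs(sum(advs)) < 1e-9  # advantages are centered within the group
```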

We extended the TinyZero code repository by using Ray to manage and distribute the reward function calculation. This enables parallel C++ code compilation and evaluation across the same cluster, addressing the compute-intensive and latency-bound nature of code execution. The entire pipeline is executed as a SageMaker training job running on ml.p4de.24xlarge instances.

The training process orchestration involves several key components implemented across multiple modules. The core veRL training loop is implemented in main_ppo.py, which initializes Ray workers and manages the distributed training process. This design enables efficient distributed training by separating concerns: the main_ppo.py orchestrator manages Ray worker coordination, while the reward system provides scalable code evaluation through parallel compilation and execution across the SageMaker cluster.

The reward calculation used in this post to train the competitive programming model works as follows. The reward function is the most important part of reinforcement learning, as it defines what the model is encouraged to achieve and what it should avoid. This implementation uses a hierarchical penalty system that first checks for fundamental code execution issues, assigning severe penalties for non-executable code (-1) and moderate penalties for compilation failures (-0.5).

Extracted code solutions are executed with strict time-limit enforcement: code exceeding the problem’s specified time limit receives zero reward, mirroring realistic competitive programming conditions. For a successfully executed C++ solution, the reward is calculated as a linear function of the fraction of private test cases passed, encouraging the model to solve as many private test cases as possible while avoiding overfitting to publicly visible tests.

This design prioritizes code correctness and execution validity, with the private test performance serving as the sole signal for learning optimal coding solutions. Refer to scripts/verl/utils/reward_score/code_contests.py for the complete Python code. Executing generated code in production environments requires appropriate sandboxing. In this controlled demonstration setting, we execute the code as a quick example to evaluate its correctness to assign rewards.
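The hierarchical scheme described above can be sketched as follows. The function signature and parameter names are illustrative; the complete implementation lives in scripts/verl/utils/reward_score/code_contests.py.

```python
def compute_reward(code, compiled, timed_out, tests_passed, tests_total):
    """Hierarchical reward for a generated C++ solution (simplified sketch).

    Penalties are checked in order of severity, then a linear reward is
    assigned based on the fraction of private test cases passed.
    """
    if code is None:
        return -1.0   # severe penalty: no executable code in the response
    if not compiled:
        return -0.5   # moderate penalty: compilation failure
    if timed_out:
        return 0.0    # exceeded the problem's time limit
    # Linear reward in the fraction of private test cases passed.
    return tests_passed / tests_total

# A solution passing 7 of 10 private tests earns a reward of 0.7.
assert compute_reward("int main(){}", True, False, 7, 10) == 0.7
```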

To train CodeFu-7B using veRL and Ray on SageMaker training jobs, we use the ModelTrainer class from the SageMaker Python SDK to set up the distributed training workload. The training uses a custom Docker container that includes veRL, Ray, and the necessary dependencies for distributed RL training. Refer to the GitHub repository for the complete container definition and build instructions.

The ModelTrainer class provides flexible execution options through its SourceCode configuration, allowing users to customize their training workflows with different frameworks and launchers. Specify either an entry_script for direct Python script execution or use the command parameter for custom execution commands, enabling integration with specialized frameworks such as Ray, Hugging Face Accelerate, or custom distributed training solutions.
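A minimal configuration sketch follows, assuming a prebuilt container image in Amazon ECR and a launcher.py entry point invoked through the command parameter; the image URI, script names, and instance count are placeholders rather than the exact values from the post's repository.

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode

# Placeholder values throughout: image URI, source paths, and script
# names are assumptions, not the post's exact configuration.
source_code = SourceCode(
    source_dir="./src",
    # launcher.py bootstraps the Ray cluster, then runs the training script.
    command="python launcher.py train_codefu.py",
)

model_trainer = ModelTrainer(
    training_image="<account>.dkr.ecr.<region>.amazonaws.com/verl-ray:latest",
    source_code=source_code,
    compute=Compute(instance_type="ml.p4de.24xlarge", instance_count=2),
    base_job_name="codefu-7b-grpo",
)

model_trainer.train()
```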

The launcher.py script serves as the universal entry point that detects the SageMaker environment (single-node or multi-node, homogeneous or heterogeneous cluster), initializes the Ray cluster with proper head/worker node coordination, and executes your custom training script. For the complete implementation of the Ray cluster setup with SageMaker training jobs, refer to launcher.py.
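The head/worker decision made by the launcher can be illustrated with a simplified sketch, assuming SageMaker's algo-N host naming; the real launcher.py additionally waits for the head node to become reachable and handles heterogeneous clusters.

```python
def build_ray_start_command(current_host, hosts, port=6379):
    """Pick the `ray start` invocation for this node.

    Simplified sketch of the head/worker split: the first host in the
    sorted SageMaker host list becomes the Ray head, and every other
    node joins it as a worker.
    """
    head = sorted(hosts)[0]
    if current_host == head:
        return ["ray", "start", "--head", f"--port={port}"]
    return ["ray", "start", f"--address={head}:{port}"]

# On a two-node SageMaker cluster:
hosts = ["algo-1", "algo-2"]
assert build_ray_start_command("algo-1", hosts)[2] == "--head"
assert build_ray_start_command("algo-2", hosts)[-1] == "--address=algo-1:6379"
```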

The job can be monitored directly from the notebook output or through the SageMaker console, which shows the job status and the corresponding CloudWatch logs. The launcher.py script orchestrates the Ray cluster initialization automatically, and each step can be monitored in real time through the CloudWatch logs. After the job completes, the trained model weights and checkpoints will be available in the specified S3 output path, ready for deployment or further evaluation.

The CodeFu training pipeline integrates seamlessly with Managed MLflow on Amazon SageMaker AI, as well as third-party solutions, for comprehensive experiment tracking and visualization of reinforcement learning metrics. Several metrics are particularly useful to monitor during CodeFu training; together they show a promising GRPO/PPO learning progression for the competitive programming model.

The reward signals demonstrate clear improvement, with critic/reward/mean rising from -0.8 to 0.6 and critic/reward/min recovering from initial failures (-1.0) to moderate performance (-0.5), while critic/reward/max maintains perfect scores (1.0) throughout training, indicating the model can achieve optimal solutions. The actor metrics reveal healthy training dynamics: actor/ppo_kl remains low (~0.0002) after an initial spike, confirming stable policy updates, while actor/pg_clipfrac stays in a reasonable range (~0.002-0.004), suggesting appropriately sized learning steps.

The increasing actor/kl_loss trend indicates growing divergence from the reference model as expected during RL fine-tuning. Most importantly, val/test_score/code_contests shows consistent improvement from -0.6 to ~0.5, and the train-validation comparison reveals good generalization with both curves tracking closely, indicating the model is learning to solve coding problems effectively without overfitting.

To access the Ray Dashboard and enable Grafana visualization during training, establish port forwarding using AWS Systems Manager (SSM). To learn more about setting up AWS Systems Manager, refer to AWS Systems Manager Quick Setup. This observability setup is crucial for debugging distributed RL training issues, optimizing resource allocation, and making sure the training process progresses efficiently across the multi-node SageMaker cluster.
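A port-forwarding session for the Ray Dashboard (which listens on port 8265 by default) might look like the following; the target identifier is a placeholder that depends on how the SSM agent is registered for the training container.

```shell
# Illustrative SSM port-forwarding command; <ssm-target-id> is a
# placeholder for the registered managed-instance ID of the training node.
aws ssm start-session \
    --target "<ssm-target-id>" \
    --document-name AWS-StartPortForwardingSession \
    --parameters '{"portNumber":["8265"],"localPortNumber":["8265"]}'
```

After the session is established, the Ray Dashboard is reachable at http://localhost:8265 on your local machine.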

This post demonstrates how to train specialized reasoning models for competitive programming using the Ray on Amazon SageMaker Training jobs solution combined with veRL’s reinforcement learning framework. The Ray on SageMaker training jobs solution simplifies the complexity of orchestrating distributed RL workloads by automatically handling Ray cluster initialization, multi-node coordination, and resource management across heterogeneous compute environments.

This integration enables organizations to use Ray’s advanced distributed computing capabilities—including support for complex multi-component architectures, dynamic resource allocation, and fault-tolerant execution—while benefiting from SageMaker’s fully managed infrastructure, enterprise-grade security, and pay-as-you-go pricing model. The detailed metrics analysis demonstrated how to monitor training health through reward progression, policy stability indicators, and generalization performance, enabling practitioners to identify optimal training configurations and troubleshoot distributed training issues effectively.

To begin implementing distributed RL training with Ray on SageMaker, visit the Ray on Amazon SageMaker Training jobs GitHub repository for the foundational solution framework. The complete CodeFu-7B training implementation, including veRL integration and configuration examples, is available at this GitHub repository. Bruno Pistone is a Senior Worldwide Generative AI/ML Specialist Solutions Architect at AWS based in Milan, Italy.

He works with AWS product teams and large customers to help them fully understand their technical needs and design AI and machine learning solutions that take full advantage of the AWS cloud and Amazon ML stack. His expertise includes distributed training and inference workloads, model customization, generative AI, and end-to-end ML. He enjoys spending time with friends, exploring new places, and traveling to new destinations.

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services.

In his free time, Giuseppe enjoys playing football. Yin Song is a Senior Applied Scientist at the AWS Prototyping team in Sydney, Australia, with over five years of experience helping customers build tailored prototypes that demonstrate complex AWS service use cases. His work focuses on research in AI model fine-tuning and serving, enabling impactful end-to-end AI solutions. A passionate advocate for open source, Yin leads generative AI initiatives that have produced widely-adopted models.

Chen Wu is a Principal Applied Scientist at the AWS Prototyping team, where he drives both applied research and high-impact customer engagements. He specializes in long-context language models, reasoning LLMs, agentic systems, and high-performance AI systems. Chen leads development of the Agent Training Kit, an open source framework for continual learning agents. He has delivered strategic engagements across genomic foundation models, LLM optimization, multi-scale image generation, and 3D/4D volumetric AI pipelines.

His open LLMs on Hugging Face have achieved over 1 million downloads, and his long-context research has appeared in NeurIPS 2024 and ACL 2025. He is an ACM Gordon Bell Prize Finalist.



Original Source: Amazon.com | Author: Bruno Pistone | Published: February 24, 2026, 3:46 pm
