IBM AI Racing League: Autonomous Driving Agent
A reinforcement learning project in the TORCS simulator, focused on training an autonomous driving agent from telemetry. The work centred on observation design, reward shaping, training stability, evaluation discipline and iterative debugging of learned driving behaviour.
- Competition: IBM AI Racing League
- Simulator: TORCS (The Open Racing Car Simulator)
- Approach: Reinforcement learning · reward shaping
- Stack: Python · IBM Granite · Git
- Status: Active
Overview
The IBM AI Racing League is a competition in which agents drive simulated cars in TORCS (The Open Racing Car Simulator). The agent consumes telemetry such as speed, track position, orientation and distances to track edges, then produces continuous steering, throttle and brake commands in real time.
The main engineering challenge was not selecting a model family in isolation. It was defining a reward function that encouraged stable driving, designing evaluation runs that measured generalisation and keeping experiments reproducible while the policy was still learning useful behaviour.
Environment
TORCS exposes a structured Python interface to the simulator: the agent receives a fixed-shape telemetry vector each timestep and returns a continuous control vector. Episodes are full laps or pre-defined timeouts; track topology, friction and weather can be configured per experiment.
The simulator makes rapid iteration possible because failed episodes can be reset cheaply. It also exposes reward design problems quickly, because poorly shaped objectives lead to unstable or unintended driving behaviour.
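As a rough sketch, the per-timestep contract looks like a gym-style loop. The `TorcsEnv` wrapper name, its constructor arguments and the `step()` return shape below are illustrative assumptions, not the project's actual binding:

```python
import numpy as np

# Hypothetical gym-style wrapper around the TORCS client; class name,
# constructor arguments and step() signature are assumptions.
env = TorcsEnv(track="g-track-1", render=False)

obs = env.reset()                        # fixed-shape telemetry vector
done = False
while not done:
    # Placeholder policy: drive straight at half throttle.
    action = np.array([0.0, 0.5, 0.0])   # [steer, throttle, brake]
    obs, reward, done, info = env.step(action)
```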
Observation & action spaces
- Observation. Speed (longitudinal and lateral), track position offset from centreline, car orientation relative to track, multiple range-finder beams to track edges, RPM and gear.
- Action. Continuous steering ∈ [-1, 1], continuous throttle ∈ [0, 1], continuous brake ∈ [0, 1]. Discrete gear selection was tried early but was moved into a deterministic helper (sketched below) so the policy could concentrate on the continuous controls.
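The gear helper itself reduces to a few lines of deterministic logic keyed on RPM. The shift thresholds below are illustrative, not the project's tuned values:

```python
def select_gear(rpm: float, gear: int) -> int:
    """Deterministic gear helper kept outside the learned policy.
    Thresholds are illustrative, not tuned values."""
    if rpm > 7000 and gear < 6:
        return gear + 1   # upshift near the rev limit
    if rpm < 3000 and gear > 1:
        return gear - 1   # downshift when the engine bogs down
    return gear
```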
Approach
A reinforcement-learning policy maps telemetry to control commands. The deliberate choice was to start with a small, well-understood algorithm and let reward design and evaluation rigour carry the work, rather than reaching for the most exotic available method first.
- Continuous-action RL on the policy itself, with a separate, deterministic helper for gearing and emergency-recovery heuristics.
- Normalised observations so that any one telemetry channel does not dominate the learning signal by scale alone.
- Action smoothing on top of the raw policy output to avoid the classic RL-on-real-control failure mode of high-frequency steering oscillation. A sketch of both the normaliser and the smoother follows this list.
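A minimal sketch of both pieces, assuming running-moment normalisation and exponential smoothing; the Welford-style normaliser and the smoothing coefficient are illustrative choices, since the write-up specifies only "per-channel scaling" and "action smoothing":

```python
import numpy as np

class ObsNormalizer:
    """Running per-channel standardisation (Welford-style moments),
    so no telemetry channel dominates the learning signal by scale."""
    def __init__(self, dim: int, eps: float = 1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps
        self.eps = eps

    def __call__(self, obs: np.ndarray) -> np.ndarray:
        self.count += 1.0
        delta = obs - self.mean
        self.mean += delta / self.count
        self.var += (delta * (obs - self.mean) - self.var) / self.count
        return (obs - self.mean) / np.sqrt(self.var + self.eps)

def smooth_action(raw: np.ndarray, prev: np.ndarray,
                  alpha: float = 0.3) -> np.ndarray:
    """Exponential low-pass on the raw policy output. Smaller alpha
    means heavier smoothing at the cost of control lag."""
    return alpha * raw + (1.0 - alpha) * prev
```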
Reward shaping
The reward function is where most of the care went. A terminal "reward = lap time" signal is theoretically clean but practically weak, because an early-training agent never finishes a lap to receive it. The shaping is therefore multi-component (a code sketch follows the list):
- Forward progress. Per-step reward proportional to longitudinal velocity along the track centreline. This is the dominant shaping signal and the one most prone to abuse.
- Track-position penalty. Quadratic penalty on lateral offset from centreline, which keeps the car on the track without forbidding cornering lines.
- Off-track termination. Hard episode termination with a large negative reward on leaving the track. This provides a cleaner signal than a continuous penalty when the dynamics become unstable.
- Smoothness regularisers. Small penalties on action-delta magnitudes to discourage twitchy steering.
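Put together, the shaped reward might read like the sketch below. The coefficients and the field names on `state` are illustrative; in TORCS-style telemetry, `track_pos` is the lateral offset normalised by track half-width, so |track_pos| > 1 means the car has left the track.

```python
import numpy as np

def shaped_reward(state, action, prev_action,
                  w_offset=0.05, w_delta=0.01, off_track_penalty=200.0):
    """Multi-term shaping; weights here are illustrative, not tuned."""
    # Forward progress: velocity projected onto the centreline tangent.
    # Projecting (rather than rewarding raw speed) closed the
    # sideways-"crabbing" exploit described under "What broke and why".
    progress = state.speed * np.cos(state.angle_to_track)

    # Quadratic track-position penalty: keeps the car near the
    # centreline without forbidding cornering lines.
    offset_pen = w_offset * state.track_pos ** 2

    # Smoothness regulariser on action deltas to damp twitchy steering.
    delta_pen = w_delta * float(np.sum((action - prev_action) ** 2))

    # Hard termination with a large negative reward on leaving the track.
    if abs(state.track_pos) > 1.0:
        return -off_track_penalty, True

    return progress - offset_pen - delta_pen, False
```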
The iteration cycle was driven by observing how the policy exploited the reward proxy, then refining the reward to better match the intended driving behaviour. Several training runs exposed proxy failures such as sideways progress or corner-cutting that improved the shaped reward without improving lap completion.
Training loop
1. TORCS env: reset · step · render off.
2. Telemetry: observation vector.
3. Normalisation: per-channel scaling.
4. Policy: continuous control output.
5. Action smoothing: stop oscillation.
6. Reward: shaped multi-term.
7. Replay / update: algorithm-specific.
8. Eval episodes: held-out tracks · metrics.
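Stitched together, the loop might look like this sketch. It reuses the helpers sketched earlier (`ObsNormalizer`, `smooth_action`, `shaped_reward`); the `env` wrapper, the agent's `act`/`update` API, the `info` payload and the `evaluate` helper are assumptions:

```python
import numpy as np

normalizer = ObsNormalizer(dim=29)            # step 03; dim is illustrative
for episode in range(1000):
    obs = env.reset()                         # step 01
    prev_action = np.zeros(3)
    done = False
    while not done:
        x = normalizer(obs)                   # steps 02-03
        raw = agent.act(x)                    # step 04
        action = smooth_action(raw, prev_action)  # step 05
        next_obs, _, env_done, info = env.step(action)
        reward, off_track = shaped_reward(    # step 06
            info["state"], action, prev_action)
        done = env_done or off_track
        agent.update(x, action, reward, done)  # step 07: algorithm-specific
        obs, prev_action = next_obs, action
    if episode % 25 == 0:                     # step 08: no gradients flow
        evaluate(agent, held_out_tracks)
```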
Training and evaluation use disjoint track configurations so that the reported numbers reflect generalisation across track shapes, not memorisation of a single circuit.
IBM Granite assistance
IBM Granite was used as an assistant during algorithm design and debugging. It helped explain failure modes, sketch reward variants and cross-check specific implementation details. The agent itself was not a Granite-driven policy; this was straightforward RL with a competent code assistant in the loop.
The useful framing is that Granite accelerated iteration on implementation details and reward-design ideas, while the control policy, experiments and evaluation remained conventional reinforcement-learning work.
Evaluation
The headline metrics are lap completion rate, mean lap time across completed laps, off-track-event count and average centreline offset. Lap time on its own is not a meaningful headline metric until the completion rate is high enough for the average to be comparable across runs.
Held-out track configurations distinct from the training set are used to check generalisation. A policy that wins one track and crashes on another is only useful in a single-track competition.
Each evaluation runs multiple seeds. RL training variance is large enough that a single point estimate is misleading. Reported numbers include mean and standard deviation across seeds, not a cherry-picked best run.
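Concretely, the aggregation across seeds might look like the following; the field names on each episode record are illustrative, not the project's actual logging schema:

```python
import numpy as np

def aggregate_eval(episodes):
    """Fold evaluation episodes from multiple seeds into the reported
    numbers: mean and standard deviation, never a cherry-picked best."""
    completed = [e for e in episodes if e["completed"]]
    report = {
        "completion_rate": len(completed) / len(episodes),
        "off_track_events_mean": float(
            np.mean([e["off_track_events"] for e in episodes])),
        "centreline_offset_mean": float(
            np.mean([e["mean_offset"] for e in episodes])),
    }
    # Lap time is only averaged over completed laps, and only worth
    # headlining once the completion rate makes runs comparable.
    if completed:
        times = [e["lap_time"] for e in completed]
        report["lap_time_mean"] = float(np.mean(times))
        report["lap_time_std"] = float(np.std(times))
    return report
```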
What broke and why
- Reward exploitation. Early shaping rewarded raw forward velocity, and the agent discovered that crabbing sideways at high speed was rewarding even while progressing slowly along the track. Fixed by projecting velocity onto the centreline tangent.
- Twitchy steering. The first stable policy oscillated the steering at very high frequency, technically converging but visibly undrivable. Fixed with action-delta penalties and a small low-pass filter on the steering channel.
- Off-track snowballs. Without hard termination, going off track caused training to spend most of its time learning recovery manoeuvres rather than driving. Switching to hard termination on off-track events accelerated learning meaningfully.
- Hyperparameter brittleness. Small changes in learning rate or batch size produced very different training trajectories. The fix was to log every run with a unique ID, config snapshot and metric history, so "which run did that?" became answerable rather than mysterious.
Engineering practice
- Every run starts from a Git commit and a config file; both are written into the run's artefact directory (see the sketch after this list). No silently mismatched environments.
- Evaluation is decoupled from training. Evaluation episodes never feed gradients back into the policy.
- Failure modes are catalogued. The interesting deliverable of an RL project is not just the policy; it is the list of behaviours the policy developed and how the reward had to evolve to suppress them.
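The run-stamping discipline in the first bullet reduces to a few lines; the directory layout and `meta.json` format below are illustrative:

```python
import json
import subprocess
import time
import uuid
from pathlib import Path

def start_run(config: dict, root: str = "runs") -> Path:
    """Create a run directory stamped with a unique ID, the exact Git
    commit and a snapshot of the config, before any training starts."""
    run_id = time.strftime("%Y%m%d-%H%M%S-") + uuid.uuid4().hex[:8]
    run_dir = Path(root) / run_id
    run_dir.mkdir(parents=True)
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip()
    (run_dir / "meta.json").write_text(json.dumps(
        {"run_id": run_id, "git_commit": commit, "config": config},
        indent=2))
    return run_dir
```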
Future work
- Curriculum across track topologies, starting with easier circuits and moving to harder ones through a measured handoff rather than a single mixed-difficulty pool.
- Imitation pre-training from a hand-coded controller to bootstrap the policy past the first 90% of training time spent learning "don't go off the track".
- Multi-agent racing scenarios, because the dynamics with traffic are a different problem and a more interesting one.
- Sim-to-real considerations: what would have to be true for this approach to transfer to a real RC-car testbed?