PyCon DE & PyData 2026

Mastering the Hex: A Case Study in Reinforcement Learning for Strategy Games
Ferrum [2nd Floor]

What does it take to build an AI that learns to play strategy games from scratch? Over the past year I explored this question, nominally as a college seminar project but really as a hobby, driven by a personal fascination with game AI. The result was a complete reinforcement learning environment for Antiyoy, a turn-based strategy game played on hexagonal grids.

The journey raised intriguing challenges: How do you represent hexagonal game boards for neural networks? What do you do when your AI has over 4,000 possible actions to choose from? How do you design rewards that teach strategy rather than just reward flailing in the right direction? This talk shares how these problems were approached using Python's modern ML ecosystem—Gymnasium, PyTorch, and PPO training—ultimately producing an agent that wins nine out of ten games against a random opponent. Whether that qualifies as "strategic play" is a question the agent and I still disagree on.

Whether you're curious about building custom RL environments, interested in game AI, or just wondering what reinforcement learning actually looks like when it half-works, you'll leave with practical insights and a healthy dose of realistic expectations.


Context and Motivation

This talk emerged from a year-long journey that began with a simple curiosity: could I teach a computer to learn a strategy game on its own? It started as a college seminar project, but the topic was chosen purely out of personal interest in reinforcement learning and game AI — this was a hobby from the start. Rather than working with pre-built environments like CartPole or Atari games, the goal was to understand the entire pipeline—from implementing game mechanics to training a neural network that actually learns to win.

The game chosen was Antiyoy, a minimalist turn-based strategy game where players control territories on hexagonal grids, build units and structures, manage resources, and compete for dominance. While the game is simple enough to understand, it presents genuine strategic depth—exactly the kind of challenge that makes reinforcement learning both difficult and rewarding.

The talk walks through the complete development process, focusing not on implementation minutiae but on the fundamental questions and design decisions that anyone building similar systems would encounter. You won't see walls of code or detailed mathematical derivations. Instead, you'll hear about the thinking process, the challenges faced, and the solutions that emerged—all with the goal of demystifying what it actually takes to build a learning agent for complex games.


What Will You Learn?

The talk is structured around three core challenges that define this kind of project, presented as questions that the work had to answer:

How do you turn a game into something a neural network can understand?

Strategy games aren't naturally suited for machine learning. Antiyoy is played on hexagonal grids, uses discrete turn-based actions, and involves complex state information—territory ownership, unit positions, economic resources, and more. The talk explores how to bridge this gap: representing hexagonal coordinates in ways that computers can efficiently process, encoding complete game state into multi-channel observations similar to those used in AlphaZero, and designing observation spaces that preserve spatial relationships for convolutional networks. You'll hear about the choice between different coordinate systems, the challenge of maintaining game history for temporal reasoning, and how to normalize diverse information types (positions, money, turn counts) into a coherent input for neural networks.
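To make the idea concrete, here is a minimal sketch of what such an encoding might look like. The channel layout, board size, and normalization constants are illustrative assumptions, not the talk's exact design; axial coordinates are stored directly as array indices (an offset-style simplification) so a convolutional network can process the grid.

```python
import numpy as np

# Illustrative constants (assumptions, not the talk's actual configuration)
BOARD_SIZE = 8      # padded grid width; off-board cells stay zero
N_CHANNELS = 4      # 0: my territory, 1: enemy territory, 2: unit level, 3: money plane

def encode_observation(cells, my_money, max_money=100.0):
    """Encode a hex board into an AlphaZero-style multi-channel tensor.

    cells: dict mapping (q, r) axial coordinates to a dict with
           'owner' ('me'/'enemy'/None) and 'unit_level' (0-4).
    """
    obs = np.zeros((N_CHANNELS, BOARD_SIZE, BOARD_SIZE), dtype=np.float32)
    for (q, r), cell in cells.items():
        if cell["owner"] == "me":
            obs[0, q, r] = 1.0
        elif cell["owner"] == "enemy":
            obs[1, q, r] = 1.0
        obs[2, q, r] = cell["unit_level"] / 4.0        # normalize unit strength
    # Scalar information (money) broadcast to a full plane so every
    # spatial position "sees" it, keeping the input a uniform tensor.
    obs[3] = min(my_money / max_money, 1.0)
    return obs
```

The key design point this illustrates: heterogeneous information (binary ownership, ordinal unit levels, a global scalar like money) is squeezed into one consistently normalized tensor, which is exactly what a convolutional policy network expects.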

How do you handle massive action spaces without overwhelming your AI?

When your agent has more than 4,000 possible actions at any given moment—moving units to different positions, building various types of units and structures, or ending the turn—training becomes a serious challenge. Most of these actions are illegal at any given time, yet a naive approach would force the agent to learn this the hard way. The talk discusses how action masking solves this problem by dynamically filtering the action space to only legal moves, dramatically improving learning efficiency. You'll understand why this technique is crucial for games with complex rules and how it fundamentally changes the training dynamics compared to environments where every action is always available.
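The core of action masking fits in a few lines. This is a generic sketch of the standard technique (not code from the project): illegal actions get their logits set to negative infinity before the softmax, so they receive exactly zero probability and no gradient signal ever encourages them.

```python
import numpy as np

def masked_policy(logits, legal_mask):
    """Turn raw policy logits into a distribution over legal actions only.

    logits:     array of shape (n_actions,), the network's raw outputs
    legal_mask: boolean array of shape (n_actions,), True where legal
    """
    # -inf logits become exp(-inf) = 0, i.e. zero probability after softmax
    masked = np.where(legal_mask, logits, -np.inf)
    exp = np.exp(masked - masked.max())   # subtract max for numerical stability
    return exp / exp.sum()
```

With 4,000+ actions of which only a handful are legal on any turn, this filtering means the agent samples exclusively from valid moves, rather than spending most of its training budget rediscovering the rulebook.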

How do you design rewards that actually teach strategy?

Perhaps the most subtle challenge in reinforcement learning is reward design. Give an agent only a +1 for winning and -1 for losing, and it may take forever to figure out what behaviors lead to victory. But add too many intermediate rewards, and you risk the agent exploiting shortcuts rather than learning genuine strategy. The talk shares the experimentation process: starting with sparse rewards as a baseline, carefully introducing intermediate signals for meaningful actions like territory expansion and economic development, and ultimately landing on a reward structure that accelerated learning while still encouraging strategic play. You'll see how reward shaping influenced training speed and final performance, and learn to think about reward design as a crucial part of the development process rather than an afterthought.
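The shape of such a reward function can be sketched as follows. The specific signals and coefficients here are hypothetical stand-ins for the ones discussed in the talk; the structural point is that shaped signals stay small relative to the terminal win/loss reward, so they guide exploration without overwhelming the actual objective.

```python
def compute_reward(prev_state, state, done, won):
    """Sparse terminal reward plus small shaped intermediate signals.

    prev_state/state: dicts with illustrative keys 'tiles' (territory size)
    and 'income' (per-turn economy). Coefficients are assumptions.
    """
    if done:
        return 1.0 if won else -1.0       # the sparse baseline signal
    reward = 0.0
    # Shaped signals, deliberately an order of magnitude smaller than +/-1,
    # so the agent cannot profitably farm them instead of winning.
    reward += 0.01 * (state["tiles"] - prev_state["tiles"])     # territory expansion
    reward += 0.005 * (state["income"] - prev_state["income"])  # economic development
    return reward
```

Note that difference-based shaping (rewarding the change in territory, not its absolute size) is what keeps the incentive pointed at progress; rewarding absolute territory each step would pay the agent for sitting still.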


What Are the Results and Takeaways?

After training over several thousand episodes—which took about eight hours on a consumer-grade GPU—the agent learned to win approximately nine out of ten games against a baseline random opponent. To be precise about what that means: the baseline picks uniformly from legal moves, so the bar is not high. The trained agent makes progress through a game, expands territory, and occasionally does things that look like they could be intentional. It also makes plenty of moves that defy easy explanation. "Strategy" might be a generous word; "learned to flail more purposefully" is closer to the truth.

The talk concludes by reflecting on what worked well and what proved unexpectedly difficult—and what is still unresolved. Action masking emerged as perhaps the single most impactful technique for managing complexity. The choice of observation space design—borrowing ideas from AlphaZero's approach to representing board games—turned out to be well-suited for the problem. Training infrastructure using MLflow provided invaluable insight into the learning process and made experimentation much more manageable. On the challenging side: reward design required multiple iterations and still produced an agent that plays competently but not strategically. The gap between "beats random" and "actually plays well" is humbling and, it turns out, enormous.


Expected audience expertise in your talk's domain: Intermediate
Expected audience expertise in Python: Novice

Simon Hedrich is a computer scientist and AI enthusiast currently completing his Master’s degree in Computer Science. His academic and professional journey is marked by a deep interest in bridging the gap between theoretical research and practical AI engineering.

Through his work at inovex GmbH, Simon has demonstrated expertise in specialized areas of Artificial Intelligence, including computer vision and the use of synthetic data to enhance small object detection. His technical writing highlights his ability to leverage generative AI models, such as Stable Diffusion, to solve complex real-world challenges like training data scarcity.