
Agentic Vehicle Routing - Part A

  • Writer: Eashwar Sathyamurthy
  • Jan 11
  • 16 min read

Updated: Jan 11

Motivation

My PhD research involves developing algorithms for Vehicle Routing Problems. Since these problems are NP-hard, I dedicate most of my time to developing complex heuristics and metaheuristics, as exact mathematical solvers are often impractical for large-scale, real-world operations. Typically, a heuristic is tailored to a single variant of the routing problem. Consequently, changing the constraints even marginally can render the algorithm useless, requiring a completely new approach.


My ultimate ambition is to develop an autonomous agent with the ability to adapt to and solve any type of vehicle routing configuration. However, I am not an expert in Reinforcement Learning or agentic frameworks yet. This blog captures my genuine, real-time journey into these fields. It follows a strict bottom-up, learn-as-we-go structure because I am documenting these concepts in parallel with my own learning process.


High-Level Goal: The final goal is to develop a truly Agentic Vehicle Routing system, an autonomous entity capable of handling diverse routing challenges (from TSP to Arc Routing), generalizing across varying graph sizes, and strategically wielding external tools like industrial solvers or mapping APIs.


Goal for Part A: However, before an agent can use tools, it needs a brain. In this first installment, our specific objective is to build that neural foundation: implementing a Neural Combinatorial Optimization model that learns to solve the Traveling Salesman Problem (TSP) from scratch using Deep Reinforcement Learning. The focus is to demonstrate that a neural network can derive complex routing heuristics purely through self-play and trial and error, without ever seeing a labeled "correct" solution.


Table of contents:

  1. Data Preparation: Building the Ground Truth

    1. Step 1: Generating Random Instances

    2. Step 2: Solving for Optimality with OR-Tools

    3. Step 3: Execution and Visualization

  2. The Training Ground: Building a Custom Environment

    1. Defining the World (The Observation Space)

    2. The Rules of the Game (The Step Function)

    3. The Reward Signal: Why Negative Distance?

  3. The Brain of the Agent: The Attention Model

    1. Part 1: The Encoder (Understanding the Map)

    2. Part 2: The Decoder (Making the Moves)

      1. Step A: Defining the Context (The Query)

      2. Step B: The Pointer Mechanism

      3. Step C: Masking (Enforcing Validity)

  4. The Training Strategy (REINFORCE with Baseline)

    1. The Mathematical Motivation

    2. The Crucial Role of the Baseline

    3. The Training Loop (Execution)

    4. Validation and The "King of the Hill" Update

  5. Results and Analysis

  6. Visualization for fun and sanity check

  7. Limitations

  8. Next Steps: Towards True Agentic Routing


Data Preparation: Building the Ground Truth

To evaluate a reinforcement learning agent on the Traveling Salesman Problem (TSP), we first need a robust dataset. Since we cannot easily find thousands of real-world city maps with perfectly known optimal tours, we will generate a synthetic dataset. Critically, this dataset is not for training, as our agent learns purely through self-play. Instead, it serves as a validation benchmark, allowing us to quantify the 'optimality gap' between the agent's self-taught solutions and established heuristics.


Here is a visual overview of our data generation pipeline:



This process involves three main steps:


Step 1: Generating Random Instances

First, we define the scope of our benchmark data. In the configuration section of our script, we set NUM_INSTANCES = 500 and fix the problem size to 50 nodes (cities). The generation function creates random X and Y coordinates on a 1000x1000 grid for each city. It then saves this data into a text file using a standard format known as TSPLIB.


Here is the core loop responsible for generating the random coordinates:

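A minimal sketch of such a generation loop, assuming NumPy and the plain TSPLIB NODE_COORD_SECTION layout; the file names and output directory are illustrative, not the exact script:

```python
import os
import numpy as np

NUM_INSTANCES = 500
NUM_NODES = 50
GRID_SIZE = 1000
OUTPUT_DIR = "instances"  # illustrative path

def generate_instance(instance_id: int) -> str:
    """Generate one random TSP instance and write it in TSPLIB format."""
    coords = np.random.randint(0, GRID_SIZE + 1, size=(NUM_NODES, 2))
    path = os.path.join(OUTPUT_DIR, f"tsp_{instance_id:03d}.tsp")
    with open(path, "w") as f:
        f.write(f"NAME: tsp_{instance_id:03d}\n")
        f.write("TYPE: TSP\n")
        f.write(f"DIMENSION: {NUM_NODES}\n")
        f.write("EDGE_WEIGHT_TYPE: EUC_2D\n")
        f.write("NODE_COORD_SECTION\n")
        for i, (x, y) in enumerate(coords, start=1):
            f.write(f"{i} {x} {y}\n")
        f.write("EOF\n")
    return path

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for k in range(NUM_INSTANCES):
        generate_instance(k)
```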


Step 2: Solving for Optimality with OR-Tools

Once the instances are generated, we need to find a near-optimal tour for each one. This "ground truth" is vital for benchmarking our RL agent and measuring its performance later. We use Google OR-Tools, an industrial-grade optimization library. To ensure we get high-quality solutions, we configure the solver to use Guided Local Search. This is a powerful metaheuristic that prevents the solver from getting stuck in suboptimal "valleys" by penalizing features of the current solution, forcing it to explore new areas of the search space.

We set a 2-second time limit per instance, which is sufficient for OR-Tools to find near-optimal paths for 50-city graphs.

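A typical OR-Tools setup matching this description (single vehicle, Guided Local Search, 2-second time limit) looks roughly like the sketch below; the function name and the rounding of distances to integers are assumptions:

```python
from ortools.constraint_solver import pywrapcp, routing_enums_pb2

def solve_tsp(coords, time_limit_s=2):
    """Solve one TSP instance with OR-Tools using Guided Local Search (sketch)."""
    n = len(coords)
    manager = pywrapcp.RoutingIndexManager(n, 1, 0)  # n nodes, 1 vehicle, depot at node 0
    routing = pywrapcp.RoutingModel(manager)

    def distance_callback(from_index, to_index):
        i = manager.IndexToNode(from_index)
        j = manager.IndexToNode(to_index)
        dx = coords[i][0] - coords[j][0]
        dy = coords[i][1] - coords[j][1]
        return int(round((dx * dx + dy * dy) ** 0.5))  # integer Euclidean distance

    transit_idx = routing.RegisterTransitCallback(distance_callback)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_idx)

    params = pywrapcp.DefaultRoutingSearchParameters()
    params.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)
    params.local_search_metaheuristic = (
        routing_enums_pb2.LocalSearchMetaheuristic.GUIDED_LOCAL_SEARCH)
    params.time_limit.FromSeconds(time_limit_s)

    solution = routing.SolveWithParameters(params)
    tour, index = [], routing.Start(0)
    while not routing.IsEnd(index):
        tour.append(manager.IndexToNode(index))
        index = solution.Value(routing.NextVar(index))
    return tour, solution.ObjectiveValue()
```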

Step 3: Execution and Visualization

The final step in our script orchestrates the entire process. It first generates all 500 instance files. Then, it loops through them, parses the data, feeds it into the OR-Tools solver, and saves the resulting tour into a corresponding solution text file. As a sanity check, we also include a visualization step for the first instance to verify that our data and solutions look correct. The figure below shows a sample generated instance on the left and the TSP tour obtained from OR-Tools on the right.

TSP instance (left) and optimal tour (right) generated from OR-Tools

The Training Ground: Building a Custom Environment

We have successfully generated our "ground truth" data using OR-Tools, giving us a benchmark for performance. Now, we must build the arena where our AI agent will learn to play the game. In Reinforcement Learning (RL), this arena is called the Environment. While standard libraries like OpenAI Gymnasium provide environments for games like Pac-Man or CartPole, there is no pre-built environment for the Traveling Salesman Problem that fits our exact needs. We must build one from scratch.


Think of this environment as a simulation. It is responsible for three critical tasks:

  • Observation: showing the agent the current state of the map (where the cities are and which ones are already visited).

  • Rules: enforcing the constraints of the TSP (you cannot visit the same city twice).

  • Feedback: telling the agent how well it is doing via a reward signal.


Defining the World (The Observation Space)

The first challenge is deciding what the agent "sees." In a video game, the agent sees pixels. In our routing problem, the agent sees coordinates. Our environment, defined in the TSPEnv class, provides the agent with a dictionary containing three distinct pieces of information:

  • Nodes: The fixed X, Y coordinates of all cities.

  • Visited Mask: A binary list (0 or 1) tracking which cities have already been visited.

  • Current Node: The index of the city the agent is currently standing at.


This distinction is important. The coordinates of the cities never change during an episode, but the "Visited Mask" and "Current Node" change every time the agent takes a step.

Here is the code snippet defining this structure:

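A minimal sketch of how this observation space could be declared with Gymnasium; the class skeleton and default sizes are assumptions, not the verbatim implementation:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class TSPEnv(gym.Env):
    """Custom TSP environment (sketch); num_nodes and grid_size are assumptions."""

    def __init__(self, num_nodes: int = 50, grid_size: float = 1000.0):
        super().__init__()
        self.num_nodes = num_nodes
        self.grid_size = grid_size

        # Action: the index of the next city to visit.
        self.action_space = spaces.Discrete(num_nodes)

        # Observation: static coordinates + dynamic visited mask + current node.
        self.observation_space = spaces.Dict({
            "nodes": spaces.Box(low=0.0, high=grid_size,
                                shape=(num_nodes, 2), dtype=np.float32),
            "visited_mask": spaces.MultiBinary(num_nodes),
            "current_node": spaces.Discrete(num_nodes),
        })
```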

The Rules of the Game (The Step Function)

The core logic of any RL environment lives in the step() function. This function takes an action (the agent choosing the next city index) and returns the new state and reward.

For the Traveling Salesman Problem, we have a strict rule: a city can only be visited once.

If our agent tries to visit a city it has already been to, we cannot just let it happen. We apply a severe penalty and immediately end the episode. This forces the agent to learn the rules of valid routing very quickly. If the move is valid, we simply calculate the Euclidean distance to the new city.

We use the standard Euclidean distance formula: d(i, j) = sqrt((x_i - x_j)^2 + (y_i - y_j)^2).

Here is how we implement the logic for handling actions and calculating penalties:

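The following methods would sit inside the TSPEnv class sketched above; the penalty value, the _get_obs helper, and the bookkeeping attributes are assumptions made for illustration:

```python
    # Continuing the TSPEnv sketch above (attribute names are assumptions).
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.nodes = self.np_random.uniform(
            0, self.grid_size, size=(self.num_nodes, 2)).astype(np.float32)
        self.visited_mask = np.zeros(self.num_nodes, dtype=np.int8)
        self.current_node = 0
        self.start_node = 0
        self.visited_mask[0] = 1
        self.num_visited = 1
        return self._get_obs(), {}

    def _get_obs(self):
        return {"nodes": self.nodes,
                "visited_mask": self.visited_mask,
                "current_node": self.current_node}

    def step(self, action: int):
        # Rule: revisiting a city is invalid -> heavy penalty and the episode ends.
        if self.visited_mask[action] == 1:
            return self._get_obs(), -10000.0, True, False, {}

        # Valid move: reward is the negative Euclidean distance travelled.
        dist = float(np.linalg.norm(self.nodes[action] - self.nodes[self.current_node]))
        reward = -dist

        self.visited_mask[action] = 1
        self.current_node = action
        self.num_visited += 1

        terminated = self.num_visited == self.num_nodes
        if terminated:
            # Close the loop: add the return leg back to the starting city.
            reward -= float(np.linalg.norm(self.nodes[action] - self.nodes[self.start_node]))

        return self._get_obs(), reward, terminated, False, {}
```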

The Reward Signal: Why Negative Distance?

You might notice in the code that the reward is the negative distance: reward = -dist.

This is a counter-intuitive but standard trick in Reinforcement Learning.

  • TSP Goal: Minimize the total distance traveled.

  • RL Agent Goal: Maximize the total reward collected.


To align these two goals, we flip the sign. By trying to maximize a negative number (e.g., trying to get a score of -10 instead of -100), the agent implicitly learns to find the shortest path. Finally, when the agent visits the last city, the environment automatically adds the distance required to return to the starting city. This ensures the tour is a complete "closed loop," which is a requirement of the Traveling Salesman Problem.


With this environment ready, we now have a simulation where our agent can practice millions of times per hour. In the next section, we will build the "brain" of the agent using an Attention Model.


The Brain of the Agent: The Attention Model


We have the data and the environment. Now we need the brain. To solve the Traveling Salesman Problem (TSP), our agent needs a neural network architecture capable of understanding a map of cities and making a sequence of decisions to construct a valid tour.


Standard feed forward networks are insufficient here. Why? Because the input (a list of cities) represents a spatial graph, not a fixed list of features. The order in which we list the cities in the input file should not change the optimal path. Furthermore, the output is a sequence of decisions that depends heavily on the context of where we are and where we have already been.


To handle these unique challenges, we use an Encoder Decoder architecture. This approach, made famous in machine translation for translating English to French, is surprisingly perfect for routing. In translation, you process a sentence to understand its meaning (Encode) and then generate the translation word by word (Decode). In routing, we process the graph to understand the layout (Encode) and then generate the tour city by city (Decode).


Before looking at the code, let's look at the high level flow. The model acts as a policy network. It takes the current state of the world and outputs a probability distribution telling us which city to visit next.


As shown above, the model takes three specific inputs:

  • City Coordinates: The static map.

  • Current City Index: Where the agent is standing right now.

  • Visited Mask: A checklist of where the agent has already been.


Attention Model: Encoder - Decoder: This diagram summarizes the entire flow we are going to dive into. You can see the Encoder on the left processing the map (graph) once, and the Decoder on the right using the Query and Mask to produce the final probabilities.


Part 1: The Encoder (Understanding the Map)

The role of the Encoder is to process the static graph once at the beginning of the episode. It takes the raw 2D coordinates of the cities and transforms them into rich, high dimensional vectors called embeddings. In our code, this is handled by the TSPEncoder class.



The process happens in two stages:

  1. Linear Projection: The raw X and Y coordinates carry too little information on their own. We first project them into a 128-dimensional vector.

  2. Transformer Layers: A Transformer uses a mechanism called Self-Attention. This allows every city node to look at every other city node simultaneously. By stacking these layers, the model learns a representation for each city that contains information about its neighbors. A city is no longer just an X,Y point. It is now a vector that understands its spatial relationship to the rest of the graph. A sketch of this encoder is shown below.

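A minimal PyTorch sketch of such an encoder; the 128-dimensional embedding matches the description above, while the number of attention heads and layers are assumptions:

```python
import torch
import torch.nn as nn

class TSPEncoder(nn.Module):
    """Encoder sketch: project 2-D coordinates, then refine them with self-attention."""

    def __init__(self, embed_dim: int = 128, num_heads: int = 8, num_layers: int = 3):
        super().__init__()
        # Stage 1: lift raw (x, y) coordinates into a 128-dimensional space.
        self.input_proj = nn.Linear(2, embed_dim)
        # Stage 2: stacked Transformer layers so every city attends to every other city.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: [batch, num_nodes, 2] -> embeddings: [batch, num_nodes, embed_dim]
        h = self.input_proj(coords)
        return self.transformer(h)
```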

Part 2: The Decoder (Making the Moves)



The AttentionModel class acts as the decoder in our architecture. Unlike the encoder, which processes the entire graph in parallel, the decoder operates sequentially. Its specific role is to execute a single step of the tour construction process. At each time step t, it must answer a specific question: Given the global layout of the map and my current position at city i, which unvisited city j should I visit next?


This decision making process, implemented in the forward function, consists of three distinct mathematical stages: defining the context, calculating attention scores (the "pointer"), and enforcing validity through masking.


Step A: Defining the Context (The Query)

To make an informed decision, the agent requires a comprehensive summary of its current state. We formulate this summary as a Query Vector, denoted as q. This query is a fusion of two distinct types of information: global context and local context.


1. Global Context: First, we need a representation of the map as a whole. We compute this by taking the average of all node embeddings produced by the encoder.

2. Local Context: Second, the agent needs to know its immediate surroundings. We extract the embedding corresponding to the specific node the agent is currently occupying.


3. Forming the Query (q): We fuse these two contexts using learnable linear projections. We project both the global context and the local context into the same vector space and sum them together to form the final query vector q.


Step B: The Pointer Mechanism

With the query vector q prepared, the network must now evaluate every city in the graph to determine its suitability as the next destination. This is handled by the Pointer Mechanism.

Unlike the standard Scaled Dot-Product Attention often found in Transformers, we use Additive Attention (also known as Bahdanau Attention). This mechanism is generally preferred for pointer networks because it allows the model to learn a non-linear relationship between the query and the potential targets.


For every city i in the graph, we calculate a raw compatibility score u_i. This score represents how strongly the pointer "wants" to move to city i. The equation governing this is: u_i = v^T tanh(W_1 e_i + W_2 q), where e_i is the encoder embedding of city i and v, W_1, W_2 are learnable parameters.

In the code, this operation is vectorized for efficiency. We calculate the projections for all nodes and the query, sum them, apply the hyperbolic tangent (tanh), and finally compute the dot product with v.


Step C: Masking (Enforcing Validity)

The output of the pointer mechanism is a vector of raw scores u, one for every city. However, the Traveling Salesman Problem imposes a strict constraint: a tour must visit every city exactly once. The raw attention scores do not inherently respect this rule. The model might assign a high score to a city it has already visited. To prevent these invalid moves, we must forcibly intervene using Masking.



We utilize a binary visited mask where a value of 1 indicates a city has already been visited. We use this mask to overwrite the raw scores of all visited cities with negative infinity.


This simple mathematical trick ensures that the agent can never select an invalid action, allowing it to focus entirely on learning the optimal sequence among the valid remaining cities.


The entire code is given below:

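To tie Steps A, B, and C together, here is a compact PyTorch sketch of the decoder; the layer names and wiring are illustrative rather than the verbatim AttentionModel code:

```python
import torch
import torch.nn as nn

class AttentionModel(nn.Module):
    """Decoder sketch: one step of tour construction via an additive pointer."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.project_graph = nn.Linear(embed_dim, embed_dim)    # global context
        self.project_current = nn.Linear(embed_dim, embed_dim)  # local context
        self.W1 = nn.Linear(embed_dim, embed_dim)               # projects node embeddings
        self.W2 = nn.Linear(embed_dim, embed_dim)               # projects the query
        self.v = nn.Linear(embed_dim, 1, bias=False)            # scoring vector v

    def forward(self, embeddings, current_node, visited_mask):
        # embeddings:   [batch, num_nodes, embed_dim] from the encoder
        # current_node: [batch] index of the agent's current city
        # visited_mask: [batch, num_nodes] with 1 for already-visited cities

        # Step A: query = projected global context + projected local context.
        graph_ctx = self.project_graph(embeddings.mean(dim=1))
        current_emb = embeddings[torch.arange(embeddings.size(0)), current_node]
        q = graph_ctx + self.project_current(current_emb)

        # Step B: additive (Bahdanau) scores u_i = v^T tanh(W1 e_i + W2 q).
        u = self.v(torch.tanh(self.W1(embeddings) + self.W2(q).unsqueeze(1))).squeeze(-1)

        # Step C: mask visited cities so they can never be selected.
        u = u.masked_fill(visited_mask.bool(), float("-inf"))

        # Probability distribution over the next city.
        return torch.softmax(u, dim=-1)
```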


The Training Strategy (REINFORCE with Baseline)

We now possess all the necessary components for our autonomous routing system. We have the Data Generator to create infinite maps, the Environment to simulate the rules of travel, and the Attention Model to act as the brain. Now we must teach that brain how to optimize tours.

This is where Reinforcement Learning (RL) comes in. Specifically, we use an algorithm called REINFORCE (also known as Monte Carlo Policy Gradient). While effective, standard REINFORCE suffers from high variance which makes training unstable. To stabilize this, we augment it with a Greedy Rollout Baseline.


The Mathematical Motivation

In a standard supervised learning task, we would have "correct" answers (labels) for every map. However, generating optimal labels for millions of TSP instances is computationally prohibitive. Instead, we let the agent discover solutions on its own.


Our goal is to minimize the expected tour length. The gradient of this objective function is defined as: ∇_θ J(θ) = E[ (L(π) − b(s)) ∇_θ log p_θ(π | s) ], where L(π) is the length of the sampled tour π, b(s) is the baseline tour length for instance s, and p_θ(π | s) is the probability the policy assigns to that tour.


The Crucial Role of the Baseline

Notice the term (L(π) − b(s)), called the Advantage. If our agent generates a tour that is shorter than the baseline, the advantage is negative (which is good in a minimization problem). We then push the gradients to encourage this behavior. If the tour is longer than the baseline, we push the gradients to discourage it.


Without a baseline, the agent only knows that "distance is 20." It does not know if 20 is a good score or a bad score for that specific map. The baseline provides that context, significantly reducing the variance of the training and preventing the agent from wandering aimlessly in the optimization landscape.



The Training Loop (Execution)

With the theory established, we implement the training loop. Unlike standard deep learning where we iterate over a fixed dataset, here we generate data on the fly. This prevents overfitting because the agent never sees the exact same graph twice. It must learn the general logic of TSP rather than memorizing maps.


Step 1: The Rollout (Exploration)

For every batch, we feed the random graphs into our Policy Network. Crucially, we run this network in sampling mode. When the Attention mechanism outputs probabilities for the next city (e.g., 60% City A, 40% City B), we roll the dice based on these probabilities.

This introduces randomness, allowing the agent to explore different paths. We record the log probabilities of every action taken, as these are required for the backpropagation step.


Step 2: The Baseline Calculation (Reference)

Simultaneously, we feed the same batch of graphs into the Baseline Network. This network is a frozen copy of the best model we have found so far. We run this network in greedy mode, meaning it always picks the city with the highest probability. This gives us a stable reference point: "What would our best model do on this map?"


Step 3: Optimization

We calculate the loss by multiplying the Advantage by the log probabilities. We then perform standard backpropagation. If the agent explored a path that was better than the baseline, the loss function updates the weights to make that path more likely in the future.

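Putting the three steps together, one REINFORCE-with-baseline update could look like the sketch below; env_rollout is an assumed helper that runs a model over a batch of graphs in the environment and returns the tour lengths and per-step log-probabilities:

```python
import torch

def train_batch(policy_net, baseline_net, optimizer, coords, env_rollout):
    """One REINFORCE-with-baseline update (sketch); env_rollout is hypothetical."""
    # Step 1: rollout - sample tours from the policy and keep their log-probabilities.
    cost, log_probs = env_rollout(policy_net, coords, decode_mode="sampling")

    # Step 2: baseline - greedy rollout of the frozen best-so-far model, no gradients.
    with torch.no_grad():
        baseline_cost, _ = env_rollout(baseline_net, coords, decode_mode="greedy")

    # Step 3: optimization - advantage-weighted policy gradient.
    advantage = (cost - baseline_cost).detach()  # negative when the policy beats the baseline
    loss = (advantage * log_probs.sum(dim=1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), cost.mean().item()
```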

Validation and The "King of the Hill" Update

The training loop improves the Policy Network, but the Baseline Network remains frozen. If the Baseline is too weak, the agent learns easily but plateaus. If the Baseline is too strong, the agent never receives a positive signal. We need a way to evolve the Baseline.

We use a "King of the Hill" update strategy. At the end of every epoch, the Policy Network (the Challenger) faces off against the Baseline Network (the Champion) on a fixed validation set of 500 maps.



The Rules of the Duel:

  1. We run the Policy Network greedily on the validation set and calculate the average tour length.

  2. We run the Baseline Network greedily on the same set.

  3. If the Policy Network's average tour is strictly shorter than the Baseline's, it wins.

If the Policy wins, we say it has fundamentally improved. We then copy the weights of the Policy Network into the Baseline Network. The Challenger becomes the new Champion, and the bar for future training is raised.

This ensures that the training curriculum automatically gets harder as the agent gets smarter, creating a continuous cycle of self-improvement.

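A sketch of this duel, reusing the hypothetical env_rollout helper from the previous snippet:

```python
import torch

def maybe_update_baseline(policy_net, baseline_net, val_coords, env_rollout):
    """King-of-the-Hill check (sketch): the Challenger replaces the Champion only if
    its greedy tours are strictly shorter on the fixed validation set."""
    with torch.no_grad():
        policy_cost, _ = env_rollout(policy_net, val_coords, decode_mode="greedy")
        baseline_cost, _ = env_rollout(baseline_net, val_coords, decode_mode="greedy")
    if policy_cost.mean() < baseline_cost.mean():
        # The Challenger becomes the new Champion.
        baseline_net.load_state_dict(policy_net.state_dict())
```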

Results and Analysis

The performance metrics illustrated above were generated on a laptop with an NVIDIA GeForce RTX 3080 GPU. It is important to note that reinforcement learning is notoriously sensitive to hyperparameters, and the smooth curves shown here are the result of significant behind-the-scenes trial and error. Initial runs often failed to converge or got stuck in poor local optima, requiring us to carefully tune the architecture and training schedule to achieve stable learning. The graphs demonstrate a clear progression where the agent evolves from random guessing to a structured routing policy, evidenced by the consistent decrease in both the baseline cost and the optimality gap over the course of 3,000 training steps.


A critical component of this success was the implementation of a learning rate scheduler, which you can see in the decaying curve of the learning rate graph. In optimization, the learning rate determines the size of the step the model takes during gradient descent, following the update rule θ_{t+1} = θ_t − α ∇_θ L(θ_t), where α is the learning rate.


Early in training, we use a larger learning rate to allow the model to make significant changes and explore the loss landscape broadly. As training progresses, we decay this rate to prevent the model from overshooting the optimal solution. This decay forces the agent to take smaller, more precise steps, allowing it to settle into a stable minimum rather than oscillating around it. The impact of this is visible in the validation cost, which drops rapidly at first and then gradually plateaus as the learning rate approaches zero.

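The post does not name the exact scheduler used, but an exponential decay in PyTorch, for example, produces the kind of curve described; the initial learning rate and decay factor below are assumptions:

```python
import torch

# Illustrative decay schedule (scheduler type and hyperparameters are assumptions).
params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for the policy network parameters
optimizer = torch.optim.Adam(params, lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.995)

for step in range(3000):
    optimizer.step()   # in real training this follows loss.backward()
    scheduler.step()   # the learning rate decays geometrically toward zero

print(scheduler.get_last_lr())  # a tiny fraction of the initial 1e-4 after 3,000 steps
```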

We also closely monitored the entropy of the agent's policy to gauge its confidence. Mathematically, the entropy H of a probability distribution π over actions a is defined as H(π) = −Σ_a π(a) log π(a).



In the context of our routing agent, high entropy indicates that the probability distribution is nearly uniform, meaning the agent is uncertain and is exploring many different cities with roughly equal probability. Low entropy, conversely, means the distribution is peaked, indicating the agent is very confident in its choice of the next city. The graph shows a steady decline in entropy from approximately 140 down to 60. This confirms that our agent successfully transitioned from an exploration phase, where it tried random paths to learn the environment, to an exploitation phase, where it became confident in its learned strategy.

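For reference, the policy entropy could be computed from the decoder's output probabilities as in the sketch below; the epsilon guard and the averaging over the batch are assumptions:

```python
import torch

def policy_entropy(probs: torch.Tensor) -> torch.Tensor:
    """Entropy of the next-city distribution; probs has shape [batch, num_nodes]."""
    # A small epsilon guards against log(0) for masked (visited) cities.
    return -(probs * torch.log(probs + 1e-10)).sum(dim=-1).mean()
```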



Ultimately, the effectiveness of the training is best summarized by the validation gap percentage. This metric measures how much longer our agent's tours are compared to the highly optimized solutions generated by Google OR-Tools.



We started with a significant gap of over 60 percent, implying our agent was initially very inefficient. Over the course of training, this gap narrowed to the 35-40% range, with a best gap of 32%. While a gap remains, the downward trend proves that the Attention Model is learning the underlying logic of the Traveling Salesman Problem and is progressively approximating the behavior of industrial-grade solvers purely through self-play and reinforcement.




Visualization for fun and sanity check

To verify our model's logic, we generated a side-by-side animation comparing the RL agent's real-time tour construction against the industrial-grade OR-Tools benchmark on a random 50-node map. This visualization serves as a crucial sanity check, confirming that the neural network has successfully learned to form valid, coherent loops, even if it occasionally deviates from the mathematically perfect path found by the solver.



Limitations

While this architecture is powerful, this specific implementation has a limitation regarding input size. Because neural networks rely on efficient batch processing on the GPU, they require input tensors to be rectangular (every instance in a batch must have the same dimensions). This means if we train the model on graphs with 50 nodes, the input tensor shape is fixed to [BatchSize, 50, 2].


Consequently, a model trained specifically on 50 nodes expects exactly 50 nodes during testing. It cannot immediately solve a 20 node or 100 node graph without modifying the code to handle padding (adding dummy nodes to fill the space) or masking differently.


Next Steps: Towards True Agentic Routing

In Part A, we successfully built a neural combinatorial optimization model. We demonstrated that a Deep Learning architecture, specifically an Attention Model trained via Reinforcement Learning, can learn the complex underlying logic of the Traveling Salesman Problem purely through trial and error. However, what we have built so far is arguably a highly specialized solver, not yet an autonomous agent.

An "agent" in AI implies a level of reasoning, adaptability, and the ability to interact with its environment dynamically beyond just processing a static input tensor. To evolve our current model into a truly agentic system, we must address two major frontiers in Part B of this series.


1. Dynamic Graph Generalization

Our current model is limited to the exact graph size it was trained on (e.g., 50 nodes). To build a universal router, we will implement dynamic padding and masking.

  • The Fix: We will define a maximum capacity (e.g., N=100) and pad smaller inputs with "ghost nodes."

  • The Mechanism: Sophisticated attention masks in the Transformer will ensure these ghost nodes contribute zero value to the attention scores. This allows a single model weights file to instantly adapt to a graph of 10 cities or 95 cities without retraining (see the sketch after this list).

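A minimal sketch of what such padding could look like; the maximum size and return format are assumptions, and the full masking logic inside the attention layers is left for Part B:

```python
import torch

def pad_instance(coords: torch.Tensor, max_nodes: int = 100):
    """Pad an [n, 2] coordinate tensor with ghost nodes up to max_nodes (sketch).

    Returns the padded coordinates and a mask that is True for ghost nodes,
    which the attention layers can use to zero out their contribution.
    """
    n = coords.size(0)
    padded = torch.zeros(max_nodes, 2, dtype=coords.dtype)
    padded[:n] = coords
    ghost_mask = torch.arange(max_nodes) >= n  # True for padded positions
    return padded, ghost_mask
```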

2. Problem Versatility: Beyond TSP (Arc Routing)

Real-world logistics is rarely just about visiting points (Node Routing/TSP); it often requires traversing specific streets, such as for snow plowing or garbage collection. This is known as Arc Routing (or the Chinese Postman Problem).

  • The Pivot: We will generalize our agent to handle diverse graph topologies. By altering the input representation from "Coordinate Lists" to "Adjacency Matrices" or "Edge Lists," we can retrain the same attention mechanism to solve Arc Routing problems, where the goal is to cover edges rather than visit nodes.


3. Neuro-Symbolic Tool Use

Finally, a true agent must know when to ask for help. We will transition from a pure end-to-end solver to a hierarchical controller that uses tools.

  • Hybrid Solving: The agent will learn to decompose large, complex maps into clusters. It will solve simple clusters itself but "call out" to industrial solvers (like OR-Tools) for highly dense, critical sub-sections.

  • Real-World APIs: Instead of minimizing Euclidean distance, the agent will query mapping APIs (like Google Maps) to optimize for real-time traffic data, bridging the gap between theoretical graphs and muddy reality.



References:

  1. Kool, W., Van Hoof, H., & Welling, M. (2018). Attention, learn to solve routing problems!. arXiv preprint arXiv:1803.08475.

  2. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

  3. Vinyals, O., Fortunato, M., & Jaitly, N. (2015). Pointer networks. Advances in neural information processing systems, 28.

  4. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3), 229-256.



Check out the GitHub here: Routing_agent

