The Traveling Salesman Problem – Genetic Algorithm in C++ and CUDA
Last Updated: 01/07/15
This project implements a solver for the Traveling Salesman Problem (TSP) in C++ and CUDA through the use of a Genetic Algorithm (GA). This documentation is not intended to be a standalone introduction to GAs, nor a detailed survey of methods for solving the TSP. Its purpose is to provide adequate background on the specific solution implemented. The functionality of this implementation is demonstrated at the end of this document. Additionally, the source code is open-source (MIT license) and available here.
The TSP is an NP-hard problem that involves finding the shortest route through a given set of cities. In this work, cities are arranged in a two-dimensional (2D) world. The starting and ending positions are left unfixed, resulting in an open-looped path between the cities. Each city must be visited exactly once, and it is assumed that direct paths may be taken between any pair of cities. In this fashion, there exist num_cities! configurations for a given set of cities, where num_cities is the number of cities in the world. The goal is to minimize the total distance, i.e. the sum of the Euclidean distances between each pair of connected cities, which requires a total of num_cities – 1 distance calculations. Finding the ideal solution by brute force is impractical: even a small world of only 25 cities would require about 1.6 x 10^25 unique trials.
To find a solution to the TSP, a Genetic Algorithm (GA) was used. A GA is a search heuristic that mimics the process of natural selection to arrive at a desirable solution. GAs are designed to maximize a fitness function. In the TSP, it is desired to minimize the distance; thus, the fitness function was set to 1 / distance. Calculating the distance is costly, as it involves computing a square root. Since distances can be compared accurately without taking the square root (using the squared Euclidean distance), this operator was removed, resulting in a fitness of 1 divided by the sum of the squared Euclidean distances between consecutive cities. Here, each city is an object containing its x and y coordinates in the world.
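A minimal C++ sketch of this fitness function (the struct and function names are assumed for illustration, not taken from the project's source):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch of the fitness function. The square root of the
// Euclidean distance is omitted: it is expensive to compute, and route
// lengths compare the same without it.
struct City {
    float x, y;
};

float fitness(const std::vector<City>& cities) {
    float sum = 0.0f;
    for (std::size_t i = 0; i + 1 < cities.size(); ++i) {
        float dx = cities[i].x - cities[i + 1].x;
        float dy = cities[i].y - cities[i + 1].y;
        sum += dx * dx + dy * dy;  // squared Euclidean distance
    }
    return 1.0f / sum;
}
```

Because 1/x is monotonically decreasing for positive x, maximizing this fitness is equivalent to minimizing the (squared) route length.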
The TSP consists of finding the minimum distance between fixed cities; thus, the distance is a function of the order in which the cities are visited. For this reason, a permutation encoding scheme was selected, in which the only adjustable feature is the order of the objects. A GA works on a provided population of individuals. In this context, each individual in the population is an instance of the desired world, where a world consists of a list of randomly arranged cities. Note that the same cities must be used for each world instance, ensuring that the salesman travels through the same set of cities each time.
A GA operates on generations, where each generation produces a new population from the previous population, and a dominant individual emerges. The stopping condition for this project was set to be a user-defined, fixed number of generations. The final result is obtained by finding the best route (largest fitness function) across all of the generations. Within a generation, a GA must make a new population, utilizing the previous population. Creating a new population involves performing a selection, crossover, and mutation step.
The process of selection, crossover, and mutation is performed iteratively until a new population is created that is the same size as the original population. Note that crossover and mutation each occur only with some random probability. If crossover does not occur, the first parent is used as the new child, and the algorithm continues to the mutation step with that child. If mutation does not occur either, the child is inserted into the new population as-is, resulting in a direct copy of one of the parents.
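The population-building loop described above can be sketched as follows. This is a sequential illustration; the helper names, lambda-based plumbing, and probability parameters are assumptions, not the project's actual API:

```cpp
#include <cassert>
#include <cstdlib>
#include <utility>
#include <vector>

// Uniform draw in [0, 1); a stand-in for the pre-generated random numbers
// the project actually uses.
static double rand01() { return std::rand() / (RAND_MAX + 1.0); }

// One GA generation: selection, crossover, and mutation repeat until the new
// population matches the old population's size. When crossover is skipped,
// the first parent is copied; when mutation is skipped, the child is
// inserted as-is.
template <typename Individual, typename Select, typename Cross, typename Mutate>
std::vector<Individual> next_generation(const std::vector<Individual>& pop,
                                        Select select, Cross cross,
                                        Mutate mutate,
                                        double p_cross, double p_mutate) {
    std::vector<Individual> next;
    next.reserve(pop.size());
    while (next.size() < pop.size()) {
        std::pair<Individual, Individual> parents = select(pop);
        Individual child = (rand01() < p_cross)
                               ? cross(parents.first, parents.second)
                               : parents.first;  // no crossover: copy parent 1
        if (rand01() < p_mutate) mutate(child);
        next.push_back(child);
    }
    return next;
}
```

The selection, crossover, and mutation operators are passed in as callables so the loop structure stays visible; the sections below describe what each operator does.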
In selection, two parents are chosen from the population. A number of different selection schemes exist, all sharing the basic goal of favoring the individuals with the highest fitness. This process simulates the idea of “survival of the fittest”. The selection scheme utilized in this project is known as roulette wheel selection, which consists of three steps. In the first step, the sum of the fitness function across all individuals is calculated. In the second step, each individual's cumulative probability is calculated: the running sum of the fitness of all individuals up through the current individual, divided by the total fitness. The third step selects the two parents. A random probability is generated and the population is iterated; the first individual whose cumulative probability is greater than the random value is selected as a parent. This step is repeated until both parents are selected.
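The second and third steps can be sketched as follows (function names are illustrative, not the project's source):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Step two of roulette wheel selection: build each individual's cumulative
// probability, i.e. the running fitness sum up to and including that
// individual, divided by the total fitness. The last entry is always 1.0.
std::vector<double> cumulative_probs(const std::vector<double>& fitness) {
    double total = 0.0;
    for (double f : fitness) total += f;
    std::vector<double> probs(fitness.size());
    double running = 0.0;
    for (std::size_t i = 0; i < fitness.size(); ++i) {
        running += fitness[i];
        probs[i] = running / total;
    }
    return probs;
}

// Step three: draw r in [0, 1) and walk the wheel; the first individual
// whose cumulative probability exceeds r is selected. Fitter individuals
// own wider slices of the wheel, so they are chosen more often.
std::size_t spin(const std::vector<double>& probs, double r) {
    for (std::size_t i = 0; i < probs.size(); ++i)
        if (probs[i] > r) return i;
    return probs.size() - 1;  // guard against floating-point rounding
}
```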
After the parents are selected, they are mated to create children. This step is known as crossover, as traits from each parent are used to create one or more children. In this work, the single-point crossover method was utilized, where only one child is created per set of parents. This involves randomly selecting a crossover point, copying all elements from the first parent up through that point into a new child, and then adding the remaining elements from the second parent. This process is illustrated below. It is important to note that the set of cities cannot be changed; therefore, when cities are added from the second parent, only cities that are not already in the child are added. Cities are added sequentially from the second parent, beginning with its first city, to ensure that the order is maintained as well as possible.
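A sketch of this permutation-safe single-point crossover, with cities shown as integer IDs for brevity (the real code operates on City structs):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-point crossover for a permutation encoding. All elements of parent
// a up to and including the crossover point are copied into the child; the
// remaining slots are filled from parent b in b's order, skipping any city
// already present so the child remains a valid permutation.
std::vector<int> crossover(const std::vector<int>& a,
                           const std::vector<int>& b,
                           std::size_t point) {
    std::vector<int> child(a.begin(), a.begin() + point + 1);
    for (int city : b) {
        if (child.size() == a.size()) break;  // child is complete
        bool present = false;
        for (int c : child)
            if (c == city) { present = true; break; }
        if (!present) child.push_back(city);
    }
    return child;
}
```

For example, crossing {1, 2, 3, 4, 5} with {5, 4, 3, 2, 1} at point 1 copies {1, 2} from the first parent, then appends 5, 4, 3 from the second.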
The final step of a GA is to mutate the newly created child. This simulates the birth of a child with features that are unique to itself. The mutation scheme utilized in this work was the order changing permutation. In this method, two unique elements are randomly chosen and swapped. This process is illustrated below.
To eliminate data dependencies, as well as to improve the performance of the code, some fundamental changes were made. It is important to note that all changes preserved functional equivalence with the originally described algorithm. The primary change was pulling the first two steps of the selection process out and folding them into the evaluation process; in the C++ code, this completely eliminates an entire loop. Additionally, a requirement was imposed that only one child be produced per set of parents. This ensured that exactly pop_size iterations would be required to create each population, where pop_size is the number of individuals in the population. All random numbers were generated ahead of time using a user-provided random seed. This allowed a deterministic path to be taken, which is necessary for timing and repeatability purposes.
Structurally, a struct was created to define a World. Each world consists of a pointer to an array of City structs, along with various other properties that define the world. Each City struct contains the x and y coordinates of the city. Memory management was tricky and crucial, as increasing pop_size and/or num_cities results in a large increase in memory utilization.
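A simplified sketch of this layout (the field names and helper functions are assumptions; the actual World struct carries additional properties describing the world):

```cpp
#include <cassert>
#include <cstdlib>

struct City {
    float x, y;  // 2D coordinates in the world
};

struct World {
    City* cities;      // heap-allocated array of num_cities entries
    int   num_cities;
    float fitness;     // cached fitness of this individual's route
};

// Each individual owns its own city array, so a population costs roughly
// pop_size * num_cities * sizeof(City) bytes, which is why memory grows
// quickly with either parameter.
World make_world(int n) {
    World w;
    w.cities = static_cast<City*>(std::malloc(n * sizeof(City)));
    w.num_cities = n;
    w.fitness = 0.0f;
    return w;
}

void free_world(World& w) {
    std::free(w.cities);
    w.cities = nullptr;  // avoid a dangling pointer on later free paths
}
```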
A number of parallelization opportunities exist in this solution; however, many of them were found to be unsuitable for CUDA. Initially, parallelism was exploited at the lowest level (within each step of the GA). This led to poor results, as the task size was much too small. Short of using an islanding scheme, where multiple instances of the GA run in parallel, it was found that the largest speedup could be obtained by exploiting parallelism within a generation. Instead of sequentially creating a new population, calculating the fitness function, and selecting the dominant leader, all of those operations may be done in parallel.
The biggest challenge in mapping the sequential code to CUDA was memory management. All memory operations had to be manually rewritten to accommodate the use of the World struct. Additionally, the struct was created to be portable (it could be used outside of the function in which it was created); because of this, properly freeing it proved to be quite difficult. Through testing of various memory schemes, it was found that global memory was best suited for the task. The population was initially created on the CPU and then copied into global memory on the device. From that point onward, all other operations took place in device memory. A large amount of memory was required, especially considering that all random numbers had to be pre-computed (computing them on the GPU would have produced different numbers than the CPU, preventing a direct comparison between the two implementations). It was desired to trace the route the algorithm created through the generations; therefore, the dominant individual from each generation had to be copied back to the CPU. Given that this was only one element out of the entire population, the data transfer time was negligible. If only the final dominant individual were desired, only a single transfer to the CPU would be required once all generations had been computed, in addition to updating the random numbers once per generation.
The code was written as six kernels and two device functions. Due to data dependencies, evaluating the fitness function required the use of three kernels. The first kernel computed the actual fitness function, where each thread calculated the sum of the squared Euclidean distances for its cities; the time complexity of this kernel is O(num_cities). The second kernel calculated the partial fitness sums, which were used in determining the cumulative probabilities during the selection step. Each thread summed up to its thread ID; thus, the time complexity of that kernel is O(pop_size). The last kernel calculated the individual fitness probabilities; each thread calculated one probability, resulting in a time complexity of O(1). With this in mind, the time complexity for the first portion of the algorithm is O(max(num_cities, pop_size)).
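The second and third kernels can be pictured as the loops below, where each iteration of an outer loop corresponds to one GPU thread (a sequential sketch with assumed names, not the project's kernel code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// "Kernel" two: thread i sums fitness[0..i], giving the O(pop_size)
// per-thread cost. Each outer-loop iteration maps to one GPU thread.
std::vector<double> partial_sums(const std::vector<double>& fitness) {
    std::vector<double> sums(fitness.size());
    for (std::size_t i = 0; i < fitness.size(); ++i) {  // one thread per i
        double s = 0.0;
        for (std::size_t j = 0; j <= i; ++j) s += fitness[j];
        sums[i] = s;
    }
    return sums;
}

// "Kernel" three: each thread divides its partial sum by the total, an O(1)
// step that yields the cumulative selection probabilities.
std::vector<double> probabilities(const std::vector<double>& sums) {
    std::vector<double> probs(sums.size());
    for (std::size_t i = 0; i < sums.size(); ++i)  // one thread per i
        probs[i] = sums[i] / sums.back();
    return probs;
}
```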
After the population has been evaluated, the maximum fitness must be computed so that the dominant leader can be selected. A kernel was used to find the max value. This was done in a rather lazy fashion, as only one thread was used to find the max; thus, this step takes O(pop_size) time. A better approach would have been to use a reduction algorithm. This was the initial plan; however, due to the relatively small population sizes, it was deemed less critical, especially given that non-power-of-two population sizes would need to be handled, requiring two kernels to properly execute the reduction.
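For reference, the tree reduction mentioned above can be sketched sequentially as follows, where each inner-loop iteration would map to one GPU thread per pass. Rounding the half-size up handles non-power-of-two populations without padding; this is an illustration of the general technique, not the project's code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tree-reduction max: each pass compares element i against element i + half
// and keeps the larger in the lower half, halving the active size until one
// element remains. A GPU version would run one pass per kernel launch,
// giving O(log pop_size) passes instead of one O(pop_size) scan.
double reduce_max(std::vector<double> v) {
    for (std::size_t n = v.size(); n > 1; ) {
        std::size_t half = (n + 1) / 2;              // round up for odd n
        for (std::size_t i = 0; i + half < n; ++i)   // one thread per pair
            if (v[i + half] > v[i]) v[i] = v[i + half];
        n = half;
    }
    return v[0];
}
```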
In addition to the previously described kernels, selection utilized one more kernel, which was used to select the parents. Each thread selected one parent; thus, 2 * pop_size threads were required. Early stopping was used, such that as soon as a parent was found, that thread's work would finish. Even so, the worst-case execution time is O(pop_size).
The final kernel was used to create the children and performed both crossover and mutation via device functions. The newly created child was placed directly into the new population array (the old and new populations were swapped at the end of each generation). Crossover was a relatively complex procedure, as both the old and new populations had to be iterated, giving a worst-case time complexity of O(num_cities^2) per child. Mutation was a simple pointer swap operation; thus, it has a time complexity of O(1).
With all of that in mind, the time complexity for the GPU becomes O(num_generations * max(num_cities^2, pop_size)). For the CPU, the time complexity is O(num_generations * pop_size * max(num_cities^2, pop_size)). This shows a substantial theoretical improvement in performance of the GPU over the CPU.
To verify the functionality of the GA, the routes were plotted as the GA selected them. Fig. 1 – Fig. 4 show an example for a population consisting of 100 individuals and 25 cities per world. In Fig. 1, the initial world is created. This figure represents the problem that the GA is tasked with solving. Looking at the points, a decent solution can be manually found; however, as this problem space grows, finding an optimal solution would become increasingly difficult. After randomly arranging the cities in each individual, the best results are shown in Fig. 2. Those results are before the GA has been executed. It is very apparent how poor the selected path is, due to the number of backtracking steps. After only 250 generations, the traveled distance is almost halved, as shown in Fig. 3. There is still some backtracking, but the route has drastically improved. After 500 generations, the GA reaches the solution shown in Fig. 4. At this point almost no backtracking occurs. The total distance is now 2.5x smaller than the original path. The route could still be improved; however, at this point the solution could be considered good enough.
The following video shows a demonstration of the GA building the routes:
To measure the performance of the algorithm, three measurements were taken: the time to compute each generation, the total execution time, and the initial overhead (the time before starting the first generation). The overhead is important to consider, especially for the GPU, as it includes a device reset (to clear out memory, which is necessary due to the large memory requirements), building the population, and copying it to device memory. This overhead for the GPU should therefore always be much larger than the overhead for the CPU (or, at best, equivalent). The overhead was computed by taking the total execution time and subtracting the accumulated generation time, averaged across five runs for every experimental case. All times stated to be “adjusted” refer to the total execution time less the overhead. To obtain the total execution times and generation times, multiple generations were executed. The total execution time is simply the time it took to compute all generations, including any overheads; this value is not averaged, as it already reflects the accumulated generation times plus any overhead. The generation time was measured for each generation; thus, the reported generation time is an average across all generations. These results are tabulated in Table 1.
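The measurement scheme can be sketched with std::chrono as follows (an illustration of the bookkeeping only; the actual harness also averages the overhead across five runs, and the function names are assumed):

```cpp
#include <cassert>
#include <chrono>

// Timing results: overhead is recovered by subtracting the accumulated
// generation time from the total execution time.
struct Timings {
    double total_ms;           // everything, including setup overhead
    double avg_generation_ms;  // mean time of one generation
    double overhead_ms;        // total minus time spent inside generations
};

template <typename Setup, typename Generation>
Timings measure(Setup setup, Generation generation, int num_generations) {
    using clock = std::chrono::steady_clock;
    auto ms = [](clock::duration d) {
        return std::chrono::duration<double, std::milli>(d).count();
    };
    auto t0 = clock::now();
    setup();  // e.g. build the population and copy it to device memory
    double gen_ms = 0.0;
    for (int g = 0; g < num_generations; ++g) {
        auto g0 = clock::now();
        generation();  // one full GA generation
        gen_ms += ms(clock::now() - g0);
    }
    double total = ms(clock::now() - t0);
    return {total, gen_ms / num_generations, total - gen_ms};
}
```

A monotonic clock (steady_clock) is used so the subtraction is not affected by system clock adjustments during a run.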
Fig. 5 – Fig. 7 show the GPU speedup for the various test cases. As the population size increases, the speedup increases. This is expected, as parallelism was exploited at the population level. Additionally, as the number of cities increases, the speedup decreases. This is also expected, because more cities require more computations per individual, decreasing the performance gains. The general shape of the total, adjusted, and generation curves is similar, validating the measurement scheme. The generation curve showed the largest improvement, with a best speedup of over 18. Generation measurements showed the largest improvement because no overheads were included; only the optimized code is being measured. That being said, the bulk of the work happens within a generation; therefore, even with the large memory overhead, the performance gains were still noteworthy (over 3.5 for the total time).
Observing Fig. 8, it is shown that as the population size increases, the speedup decreases. As the overhead is heavily dependent on the amount of data, this result makes sense: as the population size increases, more data must be computed. Interestingly, as the number of cities increased, the speedup also increased. This result is unexpected, as more cities require more data to be transferred to the device. A closer look shows that the CPU overhead increases with the number of cities, while the GPU overhead remains relatively constant. This implies that, for the GPU, some other factor dominates the overhead.
It was shown how the TSP can be solved with a GA. It was further shown how parallelizations within the GA can be exploited. Parallelizations were exploited at the population level. The GPU speedup positively scaled with population size and negatively scaled with the number of cities. A speedup of over 18 was obtained for a population size of 100,000 with 25 cities. This shows that a reasonable approach was taken for exploiting parallelizations. The largest bottleneck was memory, especially given that as the population size and number of cities increases, the total memory consumption drastically increases. Further performance gains could be obtained through a unique representation of the cities, exploiting fine-grain parallelizations, and eliminating divergence.