One problem with the naive implementation (as found here) is that it doesn't necessarily return the best result it found. In simulated annealing, every time a change is contemplated there's a chance it will be accepted, even if the change makes the overall score worse. This is how simulated annealing finds its way out of local minima. But when the algorithm is near the best answer, almost every move it makes will make things worse if accepted. So it's almost certain that if you run simulated annealing for a fixed number of steps, you're going to end up with a solution that is worse than the best one the algorithm found along the way.
The fix is to save the best solution found so far and return that when the algorithm ends. A problem arises if saving a solution is difficult or expensive: since the algorithm is likely to encounter many new "best" solutions during the search, it will have to save them many times, which can add significant time to the search. Fortunately, for my problem saving the best solution is fairly trivial, so this is a worthwhile optimization to implement.
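In outline, the annealing loop ends up looking something like the sketch below. (This is only a minimal illustration of the idea; score, proposeChange, applyChange, undoChange, and copySolution are hypothetical stand-ins for whatever the real implementation does.)

// Minimal simulated annealing sketch that remembers the best solution.
// All helper functions here are placeholders, not the real implementation.
function anneal(solution, steps, startTemp) {
  let currentScore = score(solution);
  let best = copySolution(solution);
  let bestScore = currentScore;
  for (let i = 0; i < steps; i++) {
    let temp = startTemp * (1 - i / steps);
    let change = proposeChange(solution);
    applyChange(solution, change);
    let newScore = score(solution);
    // Always accept improvements; accept worse moves with a probability
    // that shrinks as the temperature drops.
    if (newScore <= currentScore ||
        Math.random() < Math.exp((currentScore - newScore) / temp)) {
      currentScore = newScore;
      if (currentScore < bestScore) {
        // Remember the best solution seen so far, so it can be returned
        // even if the search later wanders away from it.
        bestScore = currentScore;
        best = copySolution(solution);
      }
    } else {
      undoChange(solution, change);
    }
  }
  // Return the best solution found, not wherever the search ended.
  return best;
}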
It's fairly powerful, too. On my test maps, the best solution is about 20% better than the ending solution (!).
An extension of this idea is to send the algorithm back to the best solution every time the "temperature" of the simulated annealing changes. Changing the temperature essentially restarts the algorithm with different search parameters, so it makes some sense to restart the search at the best solution found so far, rather than wherever the algorithm happened to be at that moment.
However, in testing this doesn't improve the results. In most cases, it makes them worse! I suspect the reason is that repeatedly sending the search back to the best solution found so far increases the chance that the algorithm gets trapped in a local minimum.
Another optimization has to do with how the algorithm selects which label to change at each step. The initial implementation makes a full pass through all the labels at every iteration. For each label, the algorithm contemplates a possible change. The drawback of this approach is that it may spend a lot of time trying to improve labels that are already very good. For example, consider this map:
Most of these labels are in places where they won't overlap with other labels and only have the coast and perhaps a river to avoid. Simulated annealing finds pretty good solutions for these labels very quickly. At that point most proposed moves for these labels only make the map worse. It's more productive if the algorithm focuses instead on trying to find better spots for labels that have the worst scores. However, I don't want the algorithm to completely ignore the labels with good scores, because there's always a chance that it can find an even better score, or that a label with a good score needs to move to make room for a label with a bad score.
My solution was to weight the chance of selecting a label for a change by its score. A label with a high (bad) score has a high chance of being selected for a change, while a label with a low (good) score has a low chance of being selected. As labels are improved, their chances of further changes go down, so the algorithm keeps shifting its attention to the current worst labels.
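A weighted pick like that can be done in a few lines. Here's a sketch; it assumes each label's current penalty is available as label.score, with higher meaning worse, which may not match the real data structures.

// Pick a label at random, weighted by its score, so that badly placed
// labels (high scores) are chosen more often than well placed ones.
function pickLabel(labels) {
  let total = labels.reduce((sum, label) => sum + label.score, 0);
  let r = Math.random() * total;
  for (let label of labels) {
    r -= label.score;
    if (r <= 0) return label;
  }
  // Fallback in case of rounding (or all-zero scores).
  return labels[labels.length - 1];
}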
For my test maps, this is also a very effective improvement with (again) about a 20% improvement over the previous best solutions. At this point the algorithm is doing a very good job placing labels on most maps:
Notice how close the city labels are to their cities, without overlapping any map features. (Right now simulated annealing is only placing the city labels. The region labels are a different problem that I'll address in a future posting.)
A different sort of optimization is to make the algorithm more efficient. One of the slowest parts of the algorithm is checking to see if a label intersects with a map feature. Finding the intersections between labels is not a big problem because there are only a few labels. But there are many map features (1712 in the map above) so checking a label against all the map features is a big effort. It would be more efficient if I could check against just the nearby features.
As it turns out, there's a good way to do that: store the map features in a quadtree. A quadtree is a kind of data structure that stores two-dimensional data in a way that makes it fast to retrieve points near a location. And even better, d3 provides a quadtree implementation that I can use. But there are a couple of problems to overcome.
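Building the tree itself is the easy part. A sketch, assuming each feature carries a precomputed center point in feature.cx and feature.cy (the real code may compute or store this differently):

// Index map features in a d3 quadtree by their center points.
let featureTree = d3.quadtree()
    .x(f => f.cx)
    .y(f => f.cy)
    .addAll(features);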
First, the d3 implementation doesn't provide a function for finding all the points near a point. That would have been handy, but with a little thought it's possible to implement one using quadtree.visit. Since labels are rectangular, I can search for all the objects within a label's bounding rectangle.
// Find all points in quadtree that are (possibly) within a rectangle
function qtWithinRect(quadtree, rect) {
  var results = [];
  quadtree.visit(function(node, x0, y0, x1, y1) {
    // Does this node overlap with our rectangle?
    let x_overlap = Math.max(0, Math.min(rect[1][0], x1) - Math.max(rect[0][0], x0));
    let y_overlap = Math.max(0, Math.min(rect[1][1], y1) - Math.max(rect[0][1], y0));
    let overlaps = (x_overlap > 0) && (y_overlap > 0);
    // If this is a leaf that overlaps, add its data to the results
    if (overlaps && !node.length) {
      do {
        results.push(node.data);
      } while (node = node.next);
    }
    // Returning true stops the search from descending into
    // nodes that don't overlap the rectangle
    return !overlaps;
  });
  return results;
}
Note that this doesn't guarantee that all the points returned are within the rectangle, because it returns all the points in nodes that overlap with the rectangle. That overlap isn't necessarily complete, so there may be some points in the node that lie outside the overlap. I could check to make sure the points are within the rectangle before adding them to the results, but it turns out in my case (as I'll explain below) including some nearby points is actually fine.
The second problem is that this doesn't quite work for finding labels that overlap features. I'm indexing features in the quadtree by the center of the feature, so this function will find features whose centers lie within the label, but it can miss features whose centers lie outside the label even though the features themselves overlap it. (This is why getting nearby points isn't bad; I need to check them anyway.) Completely fixing this is complicated, but on a practical level it works fine to keep track of the size of the features as they are added to the tree and then pad the search by the dimensions of the biggest feature. This padding can pick up features that don't overlap the label, which makes the algorithm a little less efficient, but the impact is small.
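In code, the padding might look something like the sketch below, building on qtWithinRect and featureTree above. The names maxFeatureWidth, maxFeatureHeight, feature.width, and feature.height are assumptions for illustration; the real code may track these differently.

// Track the largest feature dimensions (here taken from assumed
// bounding-box sizes on each feature).
let maxFeatureWidth = d3.max(features, f => f.width);
let maxFeatureHeight = d3.max(features, f => f.height);

// Pad the label's bounding box by those dimensions before querying, so
// features whose centers lie just outside the label are still returned.
function featuresNearLabel(labelRect) {
  let padded = [
    [labelRect[0][0] - maxFeatureWidth, labelRect[0][1] - maxFeatureHeight],
    [labelRect[1][0] + maxFeatureWidth, labelRect[1][1] + maxFeatureHeight]
  ];
  return qtWithinRect(featureTree, padded);
}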
In fact, this turns out to be quite an improvement: it saves about 90% of the time spent calculating label scores (!). That makes it feasible to do a lot of iterations, although it turns out there isn't much improvement beyond about 2500 iterations per label -- at least with the algorithm as it works right now.