Friday, June 8, 2018

The Naming of Places (Part 2): Some Resources

I've been contemplating the problem of naming places for some time and have gathered some Internet resources that I will document here since they may prove useful for other developers.

Previously I wrote about exploring the names of U.S. mountains using data provided by the US Board on Geographic Names.  The Board is a federal agency founded in 1890 to keep track of the official names of geographic features within the United States, and it makes that data publicly available.  So that's a good source for finding out how things are currently named (at least in the US).  I used that data to develop, for example, a list of the most common synonyms for mountains:
1. Hills: 1617
2. Mountains: 1239
3. Range: 479
4. Buttes: 432
5. Peaks: 319
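A tally like the one above can be sketched in a few lines.  This is only an illustration, not the code from the earlier post: it assumes the generic term is simply the last word of each name, and `sampleNames` is a small hand-picked list standing in for the real data download.

```javascript
// Hand-picked sample names (an assumption for illustration); a real run
// would read the Board on Geographic Names data instead.
const sampleNames = [
  "Black Hills",
  "Rocky Mountains",
  "Sand Hills",
  "Wind River Range",
  "Sutter Buttes",
  "Spanish Peaks",
  "Flint Hills",
  "Blue Mountains",
];

// Count how often each final word (treated here as the generic term) appears.
function countGenericTerms(names) {
  const counts = new Map();
  for (const name of names) {
    const words = name.trim().split(/\s+/);
    const term = words[words.length - 1];
    counts.set(term, (counts.get(term) || 0) + 1);
  }
  // Return [term, count] pairs, most common first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const genericTermCounts = countGenericTerms(sampleNames);
for (const [term, count] of genericTermCounts) {
  console.log(`${term}: ${count}`);
}
```

Taking the last word is a crude rule (it would miss names like "Mount Hood"), so a real pass over the data would need a few more patterns.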
Another resource I talked about in that posting was the Natural Language Toolkit.  This is a Python library that provides a number of handy natural-language processing functions, including classification by part of speech.  This is useful when you're gathering names so that you know whether something is a noun or an adjective or something else.

Procedural language generation is something a lot of people dabble in, so you can find lots of examples and code with a little Googling.  It also shows up on /r/proceduralgeneration occasionally.  There are also lots of name generators intended for use with fantasy role-playing games, although the quality is hit-or-miss and you may have to dig into the web page source code to figure out how the names are being generated.

A common approach for name generation is to use a Markov chain.  The idea is to look at a large set of existing names (say, English town names) and determine how often syllables (or letters) follow other syllables (or letters).  For example, in “Birmingham,” the syllable “Bir” starts the name, it is followed by “ming,” and then “ham” ends the word.  You'd find in English town names that “ham” is quite a frequent ending.  Once you've gathered up all these statistics you can then use them to create new town names that follow the same statistical patterns:
1. Barkingham
2. Basingham
3. Birkenham
4. Bebingham
5. Bollingham
and so on.  Code is available to do this in many (computer) languages.
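To make the idea concrete, here is a minimal sketch of a letter-based Markov chain (the simpler of the two variants mentioned above; it does not split names into syllables).  The training list `townNames`, the two-letter context length and the helper names are all my own assumptions for illustration, and a useful generator would want a far larger corpus.

```javascript
// Tiny hand-picked training list (an assumption for illustration only).
const townNames = [
  "Birmingham", "Nottingham", "Buckingham", "Wokingham", "Gillingham",
  "Cheltenham", "Rotherham", "Oldham", "Durham", "Grantham",
  "Chippenham", "Dagenham", "Twickenham", "Altrincham",
];

const ORDER = 2;   // how many previous letters form the context
const START = "^"; // padding so the first letters also have a context
const END = "$";   // marks the end of a name

// Map each ORDER-letter context to the letters seen after it.
// Repeated entries in a list stand in for frequencies.
function buildChain(names) {
  const chain = new Map();
  for (const name of names) {
    const padded = START.repeat(ORDER) + name.toLowerCase() + END;
    for (let i = ORDER; i < padded.length; i++) {
      const context = padded.slice(i - ORDER, i);
      if (!chain.has(context)) chain.set(context, []);
      chain.get(context).push(padded[i]);
    }
  }
  return chain;
}

// Small seeded random number generator so that runs are repeatable.
function makeRandom(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Walk the chain from the start context, picking each next letter in
// proportion to how often it followed that context in the training names.
function generateName(chain, random, maxLength = 20) {
  let context = START.repeat(ORDER);
  let name = "";
  while (name.length < maxLength) {
    const options = chain.get(context);
    if (!options) break; // only happens with an empty training list
    const next = options[Math.floor(random() * options.length)];
    if (next === END) break;
    name += next;
    context = (context + next).slice(-ORDER);
  }
  return name.charAt(0).toUpperCase() + name.slice(1);
}

const townChain = buildChain(townNames);
const seededRandom = makeRandom(42);
const generatedNames = [];
for (let i = 0; i < 5; i++) {
  generatedNames.push(generateName(townChain, seededRandom));
}
console.log(generatedNames.join("\n"));
```

With so few training names the output will often just reproduce an existing town, which is a good demonstration of why the size of the corpus matters.  A longer context gives more plausible names but copies the training set more closely.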

Markov chains are somewhat old school; the new approach is to train a neural network.  And of course someone has done that.  (Actually, several people have done that.)  The results sound a lot like the names generated by the Markov chains, at least to me.

For these approaches, you need a corpus of existing names.  For English town names, you can find some lists online.  For a more comprehensive list, you can consult the Dictionary of British Place Names, although that might require you to have an account, and it is not conveniently organized for name generation purposes.  Darius Kazemi has a GitHub project that collects corpora (lists of words), and it has a list of English town names.  If you want to do names that sound “Chinese” or “fantasy,” you're going to have to find a bunch of examples to feed the Markov chain -- the more the better.

For me, these approaches aren't that useful, because I already have Martin O'Leary's code to invent names.  However, it's something to keep in mind if I ever want to generate names that specifically sound like other names, e.g., if I want to generate fantasy maps with names that sound like English towns.

One thing I will need to do is to find synonyms and similar words.  When I'm naming a mountain range, I'd like to know all my options for saying “mountain.”  One good source for this is a thesaurus, of which there are several online.  Another good source is WordNet, which is a lexical database of words that has a great deal of knowledge about the semantics of the words.  WordNet can give you direct synonyms, related synonyms and even sister terms.  ConceptNet is similar to WordNet but perhaps a bit higher-level, with a graphical interface to explore concepts like mountain.

Those resources are good for finding modern synonyms, but for fantasy maps it's nice to be able to sprinkle in some archaic and medieval terms for flavor.  These types of synonyms are much harder to find.  One resource is the Oxford Historical Thesaurus.  This is a companion volume to the massive Oxford English Dictionary (the OED), and it provides a historical timeline of synonyms.  Using this, you can determine, for example, that around 885, the word “barrow” meant mountain, even though today's dictionaries will tell you it means a grave mound.  The Oxford Historical Thesaurus online requires a login, but your library card may get you access (mine did).

It might be that I can get by with synonyms and some simple patterns, but if not, there are some procedural text generation packages available for JavaScript that offer additional capability.  Some that look promising are rant.js (which is a JavaScript port of the Rant library), RiTa.js and Tracery, which started as a 2014 Procedural Jam entry.  These libraries let you specify text as patterns (grammars) and then process the patterns to create text, as in this example (from the rant.js page):
var rant = require("rantjs");
var sentence=rant('<firstname male> likes to <verb-transitive> <noun.plural> with <pron poss male> pet <noun-animal> on <timenoun dayofweek plural>.');

console.log(sentence); // 'Sean likes to chop parrots with his pet cat on Saturdays.'
Note that with these sorts of tools you can specify various features of the placeholders and the engine will pick a word that fits those features; in the example above, the author uses “<firstname male>” to pick from only male names.  These tools typically also include some useful features like pluralizing words.  Each package has different capabilities, so the choice might depend upon which capabilities you need most.
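To show what "processing a pattern" amounts to, here is a bare-bones grammar expander.  It is a sketch of the general technique only and is not the API of rant.js, RiTa.js or Tracery; the `#symbol#` placeholder style is borrowed from Tracery, while `placeGrammar`, `expand` and the word lists are hypothetical names and data I made up for the example.

```javascript
// Hypothetical grammar: each key is a symbol and each value lists the
// texts that symbol may expand to.
const placeGrammar = {
  name: ["#adjective# #feature#", "#feature# of #noun#"],
  adjective: ["Misty", "Broken", "Silent"],
  feature: ["Mountains", "Hills", "Peaks"],
  noun: ["Sorrow", "Kings", "Ash"],
};

// Replace every #symbol# in the text with a randomly chosen expansion,
// recursing so that expansions may themselves contain placeholders.
function expand(text, grammar, random = Math.random) {
  return text.replace(/#(\w+)#/g, (match, symbol) => {
    if (!Object.prototype.hasOwnProperty.call(grammar, symbol)) {
      return match; // leave unknown symbols untouched
    }
    const options = grammar[symbol];
    const choice = options[Math.floor(random() * options.length)];
    return expand(choice, grammar, random);
  });
}

console.log(expand("#name#", placeGrammar));
```

Passing a fixed function in place of `Math.random` makes the result predictable, which is handy for testing: `expand("#name#", placeGrammar, () => 0)` always picks the first option and returns "Misty Mountains".  The real packages add features this sketch lacks, such as the tagged placeholders and pluralization mentioned above.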

There are also a number of packages and tools for writing interactive fiction that combine generative text with user inputs or choices, such as Inkle's Ink.  I haven't focused much on these because I don't (at least at the moment) need to make my place name generation interactive.

Another good source for text generation resources is the set of resources threads for National Novel Generation Month (NaNoGenMo).  You can start with the thread for 2017, which has pointers to the threads from previous years.  Most of the resources discussed there are probably overkill or not relevant for place name generation, but might be useful for folks doing more in-depth language generation.