Saturday, February 18, 2017

A Mountain By Any Other Name Is Still a Hogsback

In its current incarnation, Dragons Abound doesn't name its mountains.  Originally, all of the terrain was generated with noise functions, and the mountains just showed up wherever the noise function produced a large value, so the program didn't understand that one area was mountains and another was not.  But as I've discussed in previous posts, the program now intentionally creates mountain areas, so it's more feasible to name the mountains.  So how do mountains get named?

Many mountains (and terrain features in general) are given proper names -- these are unique names that don't have any other meaning, like the "Appalachian Mountains."  These names are often derived from or shared with other features.  For example, the name of the Appalachian Mountains derives from the Apalachee tribe of Native Americans; in turn that name seems to derive from the name of the village where Spanish explorers first encountered the tribe.  Similarly, the Allegheny Mountains are named after the Allegheny River, whose name derives from the Lenape tribe's words for "fine river".  But these days "Allegheny" is just a name that doesn't have any other meaning.

Other times mountains are given descriptive names, like the "Rocky Mountains" or the "White Mountains" of New Hampshire (and other places).  One imagines an explorer saying to the mapmaker, "And across this area was a long range of white mountains..." and that description becoming the accepted name.

Finally, there's the "Mountains" part of the name.  This takes on many forms, such as "Hills", "Cliffs", "Buttes" and so on.  These are all synonyms for mountains, or for related mountain-ish features.

Putting that together, I can generate the name for a mountain range by generating a proper name + a mountain word (e.g., "The Numping Mountains"), or a descriptive phrase + a mountain word (e.g., "The Gravel Mountains").

It turns out that generating proper names is fairly easy for me, because Martin O'Leary already wrote some code to invent names in recognizably distinct languages that I'm using to name other map features like cities and rivers.  It even has the capability to generate a series of names that share some linguistic features as if they all come from the same class of names, e.g.,
Tangmim
Samkangmim
Sungsansin
Sumingtung
so I'm not going to spend a lot of time on that aspect of naming mountains.  At some point I may want to generate names that are more recognizably English, but that's a problem for another day.

Generating descriptive phrases is a bit more challenging.  A simple approach would be to pick a random English adjective or noun, and this works better than you might expect, because the human brain is so adept at rationalizing names.  To pick a random example, "The Wagonwheel Mountains" seems perfectly fine.  "The Loud Mountains" is a bit more jarring, but you can understand how that name might have been invented.  However, a lot of choices strain rationalization: "The Chlorophyll Mountains", "The Adaptable Mountains" and so on.  To work well, the descriptive phrase needs to have a semantic connection to the physical reality of the mountains.  "The White Mountains" is natural-sounding because mountains can be white; "The Adaptable Mountains" sounds strange because it's hard to imagine how mountains can be adaptable.  Unfortunately, understanding the semantic validity of a phrase with respect to a mountain is not an easy task -- this is the sort of thing that natural language processing and artificial intelligence struggle to solve.  If only someone had already compiled a list of words that could be used to describe mountains!

Of course, such a list does exist, at least in theory: a list of all the real-world mountain names.  If I had that, I could pull out the adjectives and nouns that people have used to describe mountains.

As it happens, such a list does exist, at least for the United States.  (And, oddly, the Antarctic.)  The US Board on Geographic Names is a federal agency founded in 1890 to keep track of the official names of geographic features within the United States.  And thanks to the federal laws that require agencies to make this sort of information publicly available, you can download all the geographic names in the United States here.  What you get are huge data files that list the names of geographic features, the type of feature (e.g., Stream, Forest, etc.) and other information such as location.  For example, the first part of the file for Wyoming looks like this:
169560|Gibson Blair Ditch|Canal|CO|...
169563|Roosevelt National Forest|Forest|CO|...
169581|Crow Creek|Stream|CO|...
There's a lot of information about each feature, but for my purposes I can pull out the name and the feature type and filter down to just the mountains to make a list of all the mountain names in the US.  Here are the matches for Wyoming:
169920|Sierra Madre|Range|WY|...
170032|Medicine Bow Mountains|Range|WY|...
170426|Front Range|Range|CO|...
378928|Caribou Range|Range|ID|...
382164|Gannett Hills|Range|WY|...
768433|Badger Hills|Range|MT|...
The USGS feature type for a mountain range is "Range" (and for an individual mountain, "Summit", although "Cliff," "Ridge," and "Slope" are all used for other mountain-like features).  You can see that some of these ranges start in other states and cross into Wyoming.
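
The filtering step might look something like the sketch below.  It assumes only what the excerpts above show: pipe-delimited rows whose second field is the name and whose third field is the feature type.  The function name `names_of_class` and the embedded sample are placeholders for illustration; the real files have many more columns, which are elided here just as they are in the excerpts.

```python
# Sample rows copied from the excerpts above. The trailing columns are
# elided in the post, so only the first four fields are modeled here.
SAMPLE = """\
169560|Gibson Blair Ditch|Canal|CO
169563|Roosevelt National Forest|Forest|CO
169581|Crow Creek|Stream|CO
169920|Sierra Madre|Range|WY
170032|Medicine Bow Mountains|Range|WY
382164|Gannett Hills|Range|WY
"""


def names_of_class(lines, feature_class):
    """Return the names of all rows whose feature type equals feature_class."""
    names = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        name, kind = fields[1], fields[2]
        if kind == feature_class:
            names.append(name)
    return names


ranges = names_of_class(SAMPLE.splitlines(), "Range")
print(ranges)
```

Passing "Summit" instead of "Range" would pull out individual mountains the same way.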

Once I have a list of mountain names, I want to pull out all the descriptive nouns and adjectives.  If I assume that the last word in the name is the equivalent of "Mountains" then I'm left with a list like this to parse:
Sierra
Medicine Bow
Front
Caribou
Gannett
Badger
You can see already that this isn't going to be perfect, but it will hopefully get down to a list that I can curate by hand.
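
As a rough sketch, the "drop the last word" assumption comes down to a simple split.  The helper name `split_name` is made up for this example, and a one-word name is treated as having no descriptive part.

```python
def split_name(name):
    """Split a range name into (descriptive part, mountain word).

    Assumes the last word plays the role of "Mountains"; a one-word
    name is returned with an empty descriptive part.
    """
    words = name.split()
    if len(words) < 2:
        return "", name
    return " ".join(words[:-1]), words[-1]


names = [
    "Sierra Madre",
    "Medicine Bow Mountains",
    "Front Range",
    "Caribou Range",
    "Gannett Hills",
    "Badger Hills",
]
descriptors = [split_name(n)[0] for n in names]
print(descriptors)
```

Note that "Sierra Madre" already shows the weakness of the assumption: "Madre" is treated as the mountain word.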

The next step is to label the words in this list as either adjectives or nouns.  I'll pass this job off to the Natural Language Toolkit.  This is a Python library that provides a number of handy natural-language processing functions, including classification by part of speech.  Classification works best when you have a full sentence (consider "He was a major" versus "It was a major problem") but for my purposes I can use the most common part of speech for a word (e.g., "major" is more often an adjective than a noun).  Conveniently, this can also tell when a word is a proper noun (or at least isn't recognized as any other word type) so that lets me filter words like "Sierra" and "Gannett" out.
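
The "most common part of speech" idea can be sketched without any library at all.  The version below is not the NLTK code itself: it assumes a list of (word, tag) pairs such as a tagged corpus would supply, and the tiny `TAGGED` sample, the tag names and the "PROPER" fallback are all invented for illustration.  Any word missing from the table is treated as a proper noun, which mirrors the "isn't recognized as any other word type" behavior described above.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for a tagged corpus: (word, part-of-speech) pairs.
# In practice these would come from a real tagged corpus, not a hand-made list.
TAGGED = [
    ("major", "ADJ"), ("major", "ADJ"), ("major", "NOUN"),
    ("pine", "NOUN"), ("pine", "NOUN"), ("pine", "VERB"),
    ("bald", "ADJ"),
]


def build_tag_table(tagged_words):
    """Map each word to the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def classify(word, table):
    """Return the word's most common tag, or "PROPER" if it is unknown."""
    return table.get(word.lower(), "PROPER")


table = build_tag_table(TAGGED)
for word in ["Major", "Pine", "Bald", "Gannett"]:
    print(word, classify(word, table))
```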

I'll use this to go through the list and count the occurrences of each word.  So for example, the five most common nouns used to describe mountain ranges in the US are:
  1. pine: 767
  2. creek: 724
  3. rock: 523
  4. horse: 437
  5. buck: 426
There are 2600 nouns in the full list, including almost a thousand that only appear once, such as "bagpipes", "magicians", "library" and "clamshell."
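
Counting is the easy part.  Here is a minimal sketch using Python's `collections.Counter`; the descriptor list is invented for the example (it is not the real data), and the part-of-speech filtering is left out to keep it short.

```python
from collections import Counter


def count_words(descriptors):
    """Count how often each lowercased word appears across all descriptors."""
    counts = Counter()
    for phrase in descriptors:
        counts.update(word.lower() for word in phrase.split())
    return counts


# Made-up descriptors standing in for the parsed mountain names.
descriptors = ["Pine", "Pine Creek", "Rock", "Lone Pine", "Pine Creek"]
counts = count_words(descriptors)

top = counts.most_common(2)                               # most frequent words
singletons = [w for w, c in counts.items() if c == 1]     # words seen only once
print(top)
print(singletons)
```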

I can do a similar analysis to get a list of adjectives:
  1. bald: 1037
  2. black: 971
  3. red: 955
  4. big: 871
  5. round: 717
This list is much shorter:  430 in total, including 161 that only appear once (alien, suburban, biological...).

The final step is to do the same analysis for the last word in the name of the mountain range.  This gives me a list of synonyms for "mountains":
  1. Hills: 1617
  2. Mountains: 1239
  3. Range: 479
  4. Buttes: 432
  5. Peaks: 319
It's interesting that "hills" are more common than "mountains" in the US.   This list is much shorter, with only 58 entries and no odd words except perhaps "Hogsback" and "Nubbles" (which appears 22 times and apparently means "a small nub"). 

With these lists generated, I can now name mountain ranges by the noun + a mountain synonym (e.g., "The Pine Hills"), the adjective + a mountain synonym (e.g., "The Bald Mountains") or a combination (e.g., "The Bald Pine Range").  The count of occurrences for the words can be used to determine the frequency of generation, so that I get "The Red Horse Peaks" much more frequently than "The Alien Bagpipe Nubbles".
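
A sketch of the frequency-weighted generation, using only the top-five counts listed above (the full lists are much longer).  The helper names `pick` and `range_name` are made up, and choosing among the three name forms with equal probability is an assumption for this example, not something the counts dictate.

```python
import random

# Counts copied from the top-five lists in this post.
NOUNS = {"pine": 767, "creek": 724, "rock": 523, "horse": 437, "buck": 426}
ADJECTIVES = {"bald": 1037, "black": 971, "red": 955, "big": 871, "round": 717}
SYNONYMS = {"Hills": 1617, "Mountains": 1239, "Range": 479, "Buttes": 432, "Peaks": 319}


def pick(weighted, rng):
    """Pick one key at random, with probability proportional to its count."""
    words = list(weighted)
    weights = [weighted[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def range_name(rng):
    """Build a name from a noun, an adjective, or both, plus a mountain word."""
    form = rng.choice(["noun", "adjective", "both"])  # assumed equally likely
    parts = []
    if form in ("adjective", "both"):
        parts.append(pick(ADJECTIVES, rng).capitalize())
    if form in ("noun", "both"):
        parts.append(pick(NOUNS, rng).capitalize())
    parts.append(pick(SYNONYMS, rng))
    return "The " + " ".join(parts)


rng = random.Random(2017)  # fixed seed so a run can be repeated
for _ in range(5):
    print(range_name(rng))
```

With the full word lists in place of these five-entry tables, rare words like "alien" would still turn up occasionally, just far less often than "red" or "pine".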

There are actually many more named individual mountains in the US than named mountain ranges.  I can do a similar analysis on those names.  The main difference is more variety in the "mountain" synonym.  The five most popular are:
  1. Mountain: 22221
  2. Hill: 17126
  3. Ridge: 12842
  4. Peak: 7089
  5. Butte: 3937
And further down the list you get some interesting variants like Whaleback, Airy, Comb and so on.  I hadn't considered naming individual mountains, but it might make sense for (say) the tallest mountain in an area.

3 comments:

  1. Thank you SO much for the USGS link, I had no idea such a thing existed! Keep up the awesome work, this has been a fantastic read!

  2. Ok, time to generate a mountain name: "Horse Buttes"
    well...
