## Wednesday, June 20, 2018

### The Naming of Places (Part 4): Using a Tool

One thing I learned in working on “Lost Coast" names in the last posting was that treating words as lists of strings is very limiting.  For example, I added some common monster names to the words I was using, and I could use those in various ways:  “The Vampire Coast," “The Coast of Vampires," “The Vampire's Graveyard," and so on.  But I ended up just duplicating those names in different templates.  It would be nice if I had a way to say “Include all the monster words here."  Likewise, I sometimes need the plural form of a word, and I had to add that in manually.  So it seems like it's time to consider using a language generation tool.

One of the main features I need is to pick a word randomly from a class of equivalent words, as for example when I select a word that means “coast."  It would be nice to have a tool that provides built-in word categories that would solve all my needs.  Some tools (like WordNet) provide word classification into various categories, and also provide lists of synonyms.  Unfortunately, these tools usually categorize words by parts of speech (POS), e.g., noun, adjective, verb, etc., which isn't useful for me.

Some tools also provide synonyms, but they're not usable without manual editing.  For example, WordNet synonyms for coast include “seashore" and “lakeshore", neither of which works well in labeling a coastline.  And there's no way for WordNet to know that in this case, “cliff" is an acceptable synonym for coast, even though those words do not mean the same thing at all.  There are shades of meaning in words that are difficult to capture and use in a language library.

So I don't think I'm going to find a tool with built-in categories and synonyms that I can use without modification.  Instead, I'll need a tool that supports creating my own categories, e.g., to be able to say “words to use in labeling a coastline are coast, shore, strand, bank, ... etc."

I also want to be able to weight the choices in these synonym lists to indicate that some choices are more common than others.  Quite a few of the tools I've looked at only support choosing randomly (with equal probability) from a list.  I can work around that limitation by repeating the same word multiple times in the list, but if I want a word to be 100 times as common as another word, that requires a lot of repeats.
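To make the idea concrete, here is a minimal sketch of weighted choice that avoids repeating entries.  The `weightedPick` name and the `[word, weight]` pair format are assumptions made for illustration; they don't come from any of the tools discussed here.

```javascript
// Pick one word from a list of [word, weight] pairs.
// `rand` is injectable so the choice can be made deterministic for testing.
function weightedPick(choices, rand = Math.random) {
  const total = choices.reduce((sum, [, weight]) => sum + weight, 0);
  let target = rand() * total;
  for (const [word, weight] of choices) {
    if (target < weight) return word;
    target -= weight;
  }
  // Guard against floating-point drift: fall back to the last entry.
  return choices[choices.length - 1][0];
}

// "coast" is 100 times as likely as "bracks", with no repeated entries.
const coastWords = [['coast', 100], ['shore', 10], ['bracks', 1]];
console.log(weightedPick(coastWords));
```

A weight of 100 costs the same as a weight of 1 here, which is the whole advantage over repeating words in the list.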

Another feature I'd like is for the tool to be able to pluralize (and conversely, singularize) words for me, so that I don't have to enter each word as both a singular and a plural.  I can use this to switch between “The Ogre Coast" and “The Coast of Ogres" easily.  I think there will still be some cases where I have to explicitly indicate a plural or a singular, but in many cases I could rely on the tool to create the appropriate plural or singular.
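As a rough illustration of what such a feature does, here is a naive pluralizer.  The `pluralize` helper and its small irregular table are hypothetical and cover only a few English patterns; a real language tool handles far more cases.

```javascript
// A few irregular plurals; a real tool would carry a much longer table.
const IRREGULAR = new Map([
  ['wolf', 'wolves'],
  ['elf', 'elves'],
  ['man', 'men'],
]);

// Naive English pluralizer (hypothetical helper, for illustration only).
function pluralize(word) {
  const irregular = IRREGULAR.get(word.toLowerCase());
  if (irregular) {
    // Keep the capitalization of the first letter.
    const first = word[0] === word[0].toUpperCase()
      ? irregular[0].toUpperCase()
      : irregular[0];
    return first + irregular.slice(1);
  }
  if (/(s|x|z|ch|sh)$/i.test(word)) return word + 'es'; // Witch -> Witches
  if (/[^aeiou]y$/i.test(word)) return word.slice(0, -1) + 'ies'; // Harpy -> Harpies
  return word + 's'; // Ogre -> Ogres
}

const monsterName = 'Ogre';
console.log(`The ${monsterName} Coast`);
console.log(`The Coast of ${pluralize(monsterName)}`);
```

The irregular table is exactly where I'd expect to still have to supply some plurals by hand.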

There are also a number of common features I don't need.  I'm not building interactive fiction (IF) so I don't need any facilities for interacting with the user.  Some tools have ways to remember a choice and use the same choice later, so that you can (for example) pick the name “Pete" in one place and then use it throughout a long text.  I can think of some cases where I might want to do this, but in general it's a feature I can live without.  I also don't need to parse text, assign parts of speech to words, or similar input-oriented tasks.

With all that in mind, I think my best choices come down to Tracery and RiTa.

Tracery is focused on text generation, and has a simple, clean format for expressing a grammar and some nice built-in features like capitalization.  On the negative side, it lacks any way to do choice weighting.  (There exists a fork of Tracery that adds choice weighting (and much more!) but it is written in Swift rather than JavaScript.)  In fact, Tracery does a “shuffled deck" selection when making choices, meaning it runs through all the choices (randomly) before repeating itself.  Since I want some place name choices to repeat (e.g., I want to use “coast" much more often than “bracks"), this won't work for me.
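For anyone unfamiliar with the term, here is a small sketch of what “shuffled deck" selection means in general.  This is not Tracery's code; `makeDeck` is a hypothetical helper written only to show the behavior.

```javascript
// Returns a draw() function that deals items in random order and
// reshuffles only once every item has been dealt.
function makeDeck(items, rand = Math.random) {
  let deck = [];
  return function draw() {
    if (deck.length === 0) {
      deck = items.slice();
      // Fisher-Yates shuffle
      for (let i = deck.length - 1; i > 0; i--) {
        const j = Math.floor(rand() * (i + 1));
        [deck[i], deck[j]] = [deck[j], deck[i]];
      }
    }
    return deck.pop();
  };
}

const drawCoast = makeDeck(['coast', 'shore', 'bracks']);
// Any three consecutive draws from a fresh deck contain each word exactly once.
console.log(drawCoast(), drawCoast(), drawCoast());
```

With this scheme “bracks" shows up exactly as often as “coast", which is the opposite of what I want.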

RiTa is a more general toolkit that provides features like analyzing text, conjugation, stemming, Markov chains, and more.  The grammar format is similar to Tracery, but lacks a number of features (like capitalization) that Tracery provides.  However, it does provide choice weighting.  On the negative side, it's much bigger (about 10x) than Tracery, but it has some reduced versions and I suspect I can make do with the smallest.

For the moment, at least, I'm going to work with RiTa.

Both Tracery and RiTa do generation with context-free grammars.  If you aren't familiar with context-free grammars, they consist of rules that look something like this:

    <coast> => coast | shore | banks

This rule means “Wherever you see the symbol <coast>, replace it with the word coast, or shore or banks."  You can chain these rules:

    <lost> => lost | forgotten | accursed
    <name> => The <lost> <coast>

Together, these rules say that you generate a <name> by generating the word “The" followed by whatever the <lost> symbol generates and then whatever the <coast> symbol generates.

(The “context-free" part just means that you can only have one symbol on the left side of a rule.)
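To show how little machinery this takes, here is a minimal sketch of a grammar expander in plain JavaScript.  It is not how Tracery or RiTa are implemented; the `expand` function and the object format for the rules are assumptions made for illustration.

```javascript
// Each rule maps a symbol to alternatives separated by '|'.
const rules = {
  '<coast>': 'coast | shore | banks',
  '<lost>': 'lost | forgotten | accursed',
  '<name>': 'The <lost> <coast>',
};

// Replace every <symbol> in `text` with a randomly chosen alternative,
// expanding recursively until only plain words remain.
function expand(text, rules, rand = Math.random) {
  return text.replace(/<\w+>/g, (symbol) => {
    const rule = rules[symbol];
    if (rule === undefined) return symbol; // leave unknown symbols untouched
    const options = rule.split('|').map((s) => s.trim());
    const choice = options[Math.floor(rand() * options.length)];
    return expand(choice, rules, rand);
  });
}

console.log(expand('<name>', rules)); // e.g. "The forgotten shore"
```

Every alternative is equally likely in this sketch, which is the default behavior for grammar engines of this kind.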

As you may remember, I can also name a lost coast after a monster, e.g., “The Zombie Coast".  With a grammar, I can now separate out the monster names:

    <coast> => coast | shore | banks
    <lost> => lost | forgotten | accursed
    <monster> => Zombie | Kobold | Orc
    <adj> => <lost> | <monster>

The <adj> symbol can now expand into one of the synonyms for lost, or a monster name.

Whenever the grammar engine has to make a choice, it chooses randomly among the options.  So in this case, “coast", “shore" and “banks" are all equally likely names.  If we want “coast" to occur more frequently than the other choices we can (in RiTa) add a frequency to the choice:

    <coast> => coast[5] | shore | banks

which says (in this case) to pick “coast" five times out of seven.
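The arithmetic behind “five times out of seven" can be sketched as follows.  This is not RiTa's parser; `parseWeighted` and `probabilities` are hypothetical helpers that read the `word[n]` notation and treat a missing `[n]` as a weight of 1.

```javascript
// Parse a rule body such as 'coast[5] | shore | banks' into {word, weight} entries.
// An alternative with no [n] suffix gets weight 1.
function parseWeighted(ruleBody) {
  return ruleBody.split('|').map((part) => {
    const match = part.trim().match(/^(.*?)\s*(?:\[(\d+)\])?$/);
    return { word: match[1], weight: match[2] ? Number(match[2]) : 1 };
  });
}

// Convert the weights into probabilities by dividing by their total.
function probabilities(ruleBody) {
  const entries = parseWeighted(ruleBody);
  const total = entries.reduce((sum, e) => sum + e.weight, 0);
  return entries.map((e) => ({ word: e.word, probability: e.weight / total }));
}

// Weights 5 + 1 + 1 = 7, so "coast" gets 5/7 and the others 1/7 each.
console.log(probabilities('coast[5] | shore | banks'));
```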

That works well to intentionally adjust word frequencies, but random choice can cause a more subtle problem.  Consider, for example, this rule:

    <adj> => <lost> | <monster>

50% of the time this rule will use a “lost" adjective, and 50% of the time it will use a monster name.  But in my case, I have 325 synonyms for lost, and only 33 monster names!  I really want to pick an adjective equally from that whole pool.  I can fix this by adding a frequency to this rule:

    <adj> => <lost>[325] | <monster>[33]

but this requires me to count the choices in each category, calculate the ratios, and keep all this up-to-date as I add new terms and options.  That's error-prone, and gets complicated when there are multiple levels of rules.
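One way to take the counting out of my hands would be to generate the weights from the lists themselves.  The sketch below assumes the word lists are available as JavaScript arrays; `weightedBySize` is a hypothetical helper, not something RiTa provides.

```javascript
// Build a rule body in which each category is weighted by how many words it holds,
// so that every individual word ends up equally likely overall.
function weightedBySize(categories) {
  return Object.entries(categories)
    .map(([symbol, words]) => `${symbol}[${words.length}]`)
    .join(' | ');
}

// Tiny stand-in lists; the real ones have 325 and 33 entries.
const lostWords = ['lost', 'forgotten', 'accursed'];
const monsterWords = ['Zombie', 'Kobold'];

const adjRule = weightedBySize({ '<lost>': lostWords, '<monster>': monsterWords });
console.log(adjRule); // '<lost>[3] | <monster>[2]'
```

With these weights each of the five words has a 1/5 chance: 3/5 × 1/3 for a “lost" word and 2/5 × 1/2 for a monster.  It still has to be re-run whenever a list changes, and it doesn't help with nested rules.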

These sorts of rules represent composition rather than choice -- we'd really like some syntax to indicate that the grammar engine should compose the two lists together before choosing.  Neither RiTa nor Tracery seems to have this capability.  Maybe I'll add that, but in the meantime I'll use a workaround of defining <lost> and <monster> in JavaScript and combining them when I create the rule:

    let lost = 'lost | forgotten | accursed';
    let monster = 'Zombie | Kobold | Orc';
    [...]
    <adj> => lost + ' | ' + monster

Apologies for the pseudocode mish-mash, but I hope you understand what I mean.

Another thing I need to do in name generation is to insert the result of a JavaScript function call.  For example, I can name a lost coast after some (imaginary) person:

    Mesh's Boneyard

In these cases, the name of the imaginary person is generated by Martin O'Leary's place name generator, with a call that looks like this:

    Language.makeName(world.lang, 'person')

So if I want to be able to generate a name like this, I need a way to tell the grammar “Hey, at this point go off and execute this JavaScript and use the result."  This is called a “callback", and in RiTa this works by enclosing it in backticks:

    `Language.makeName(world.lang, 'person')`'s <coast>

There's a lot of quoting going on there, but the important part is that the call to Language.makeName() is inside backticks.  When the grammar engine evaluates this rule, it knows to pull out that bit of code, run it, and put the result back into the rule.

It turns out that this isn't as straightforward as it looks.  Without getting too technical, every bit of code executes in a context that represents all the other code and definitions around it.  In this case, code was written in one context but gets executed in a different context.  This creates no end of problems.  For example, in the rule above, “Language.makeName" isn't defined in the context where the code actually gets executed, and so the callback fails.

RiTa has a solution for this problem, but it isn't very good.  I patched my copy of RiTa with a better solution (and provided that back to the RiTa authors) so my callbacks work as I expect.
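To illustrate the general idea of handing the callback the context it needs, here is a small sketch.  It is neither RiTa's mechanism nor the patch mentioned above; `runCallbacks` is a hypothetical helper, and the `Language` and `world` objects are stand-ins for the real ones.

```javascript
// Replace each `...` span in `text` with the result of evaluating it as JavaScript.
// The names in `context` are passed in explicitly, so the code can see them
// no matter where this function itself was defined.
function runCallbacks(text, context) {
  const names = Object.keys(context);
  const values = Object.values(context);
  return text.replace(/`([^`]*)`/g, (_, code) => {
    const fn = new Function(...names, `return (${code});`);
    return String(fn(...values));
  });
}

// Hypothetical stand-ins for the real name generator and world object.
const Language = { makeName: (lang, kind) => (kind === 'person' ? 'Mesh' : 'Place') };
const world = { lang: {} };

const rule = "`Language.makeName(world.lang, 'person')`'s Boneyard";
console.log(runCallbacks(rule, { Language, world })); // "Mesh's Boneyard"
```

Because this evaluates arbitrary code, it is only reasonable for grammars I've written myself.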

RiTa has a number of ways to actually write a grammar, but I'm using a JSON format.  Here's what the core of the Lost Coast naming rules looks like:

    // The Lost Coast
    '<lc1>': 'The <lost> <coast>',
    '<lc2>': 'The <coast> of <lost2>',
    '<lc3>': "The <sailor>'s <negcoast>",
    '<lc4>': "`Language.makeName(world.lang, 'person')`'s <negcoast>",
    '<lc5>': "<noble> `Language.makeName(world.lang, 'person')`'s <negcoast>",
    '<lc6>': "<admiral> `Language.makeName(world.lang, 'person')`'s <negcoast>",
    '<lc>': '<lc1>[10] | <lc2>[5] | <lc3>[3] | <lc4>[3] | <lc5> | <lc6>',

<lc1> through <lc6> are the basic patterns for different Lost Coast names (as described in the previous posting).  In <lc4>, <lc5> and <lc6> you can see callbacks to Martin O'Leary's name generator.  The last rule sets the proportions to use the various forms, “The Lost Coast" being the most common and forms like “Admiral Dyg's Boneyard" being fairly uncommon.

There are a couple of functions I think I might eventually want to use that aren't in RiTa or Tracery.

One is the capability to remember and reuse a name across different runs of the grammar.  For example, if I create “Admiral Dyg's Boneyard" I might want to name a nearby rocky point “Admiral Dyg's Folly."  Or if I name a mountain “Black Rock Peak" I might want to name a nearby city “Black Rock Town."  I'm not exactly sure of the best way to do this yet.

Another is the capability to adjust the distribution of some terms on a per-map basis.  For example, I might have a distribution for the names of bays:

    '<bays>': 'Bay [10] | Cove [5] | Basin [3] | Bight | Estuary'

Mostly I want to use “Bay" and “Cove" and rarely “Bight" or “Estuary".  This reflects the relative proportions of those names on today's maps.  But maybe on this map, bays are mostly called “Bights" and only occasionally other names.  Right now I don't think there's a way to write a “meta-rule" to figure out the distribution within another rule.
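Short of a true meta-rule, one possible workaround is to build the rule string per map before handing it to the grammar.  This is only a sketch of that idea; `buildRule` and the weight tables are hypothetical.

```javascript
// Default weights, mirroring the <bays> rule above.
const defaultBayWeights = { Bay: 10, Cove: 5, Basin: 3, Bight: 1, Estuary: 1 };

// Merge per-map overrides into the defaults and emit a rule body
// in the word[n] notation.  A weight of 0 drops the word entirely.
function buildRule(defaults, overrides = {}) {
  const weights = { ...defaults, ...overrides };
  return Object.entries(weights)
    .filter(([, weight]) => weight > 0)
    .map(([word, weight]) => (weight === 1 ? word : `${word}[${weight}]`))
    .join(' | ');
}

// A map where bays are mostly called "Bights".
const bightMapRule = buildRule(defaultBayWeights, { Bight: 20, Bay: 2 });
console.log(bightMapRule); // 'Bay[2] | Cove[5] | Basin[3] | Bight[20] | Estuary'
```

The overrides themselves could be chosen randomly per map, which would get close to what a meta-rule would do.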

Next time I'll start to work on bay names.