Here Dragons Abound: Map Compasses (Part 12): An Excursion into Tool Building

Welcome back to my blog series on implementing procedurally-generated map compasses! After implementing the Compass Design Language (CDL) that will be used to draw the compasses, and ruminating last time on procedural generation, I'm now ready to move to implementation of the procedural generation part of map compasses. The procedural generation's output will be descriptions of compasses in CDL, which I'll then pass to my CDL interpreter to draw:

For the generation of the CDL I will be using a generative grammar. Generative grammars look a lot like the grammar I wrote to parse CDL, but run “backwards" so that they generate output rather than parse input. You can think of generative grammars as replacement or expansion engines. Typically you start with a single token that represents what you want to generate, and the grammar expands that into the final product. For example, support you start with an input of <name> and have this set of rules:

<name> => <lost> <coast>
<coast> => coast | shore | banks
<lost> => lost | forgotten | accursed

The grammar tries to find an expansion for “<name>" and finding the first rule, replaces <name> with “<lost> <coast>." The process then repeats. The vertical pipe symbol | in rules indicates random choices. So when the grammar tries to expand <lost> using the third rule, it chooses randomly from the three choices and (let's say) replaces <lost> with “forgotten" and similarly replaces <coast> with “banks" and at that point no further expansions are possible so the results is “forgotten banks."

Dragons Abound uses generative grammars in several places, such as generating place names as suggested above. I've written quite a bit about generative grammars previously, initially when I wrote about generating place names and then in more detail later when I wrote about generating map borders. If you're not familiar with generative grammars those posts will give you a more thorough introduction.

In Dragons Abound I've been using a tool called RiTa for generative grammars. I selected RiTa because it had several features I needed and I found it easy to add some that were missing. Since I've started using RiTa, version 2.0 has been released with some major new features. (Including some that I added to my version.) It's probably much better, but I have a considerable investment in my modified version, so I'm going to use that for map compasses.

UPDATE: Or at least that was my intention. I've actually already written the next four blog posts, in which I develop 50 rules to generate two layer compasses like these:

But as I progressed along I found that I needed to add some additional features to my version of RiTa. First I wrote a parser that would take rules in a convenient text form (like the examples above) and turn them into RiTa rules. Then I added some capabilities that required rewriting a good chunk of the RiTa rules engine. But these fixes feel like they're “bolted on" and are just trying to compensate for a tool that isn't really fit for my purpose. I began to wonder if I wouldn't be better off re-writing the RiTa engine from scratch. I could incorporate my text-form parser and the modifications I've made as well as some additional features to make life a bit easier. In the end it seemed like a fun and intriguing side-project so in keeping with my philosophy, I'm going to make a bit of a right-turn here and write a procedural generation tool. Hopefully you'll find it fun and interesting as well.

Oh, and I need a name for the tool. Since I'm using it for procedural generation of map compasses, I'll call it Lodestone.

(Incidentally, this whole “I'm going to build my own tool because the available tools aren't exactly right for me" is a nasty trap that has killed more than one project. If this were anything but a Forever Project I would not be doing this! But it is and so I am. And you'll end up with the code to a tool you can repurpose, so maybe it's a good thing from your viewpoint! But as they say on TV, don't try this at home.)

The first step of this sub-project will be to define the format of the procedural generation rules and build a parser for them. A basic rule looks like this:

<start> => <first> <second> <third>;

It has a single non-terminal enclosed in angle brackets on the left-hand side of the rule, an arrow operator =>, and then a sequence of non-terminals (or other elements) on the right-hand side of the rule, terminated by a semi-colon.

There are a couple of difference here from RiTa and other tools. First, I'll require terminals to be enclosed in angle brackets. RiTa allows any sort of non-terminal, but delineating them this way makes parsing easier and (in my opinion) the rules easier to read. At any rate, I always write non-terminals this way so I'll require that in my tool. Second, I'll terminate a rule explicitly with a semi-colon. Rules tend to get long and complicated, and this will make it easy for me to split rules across lines and format them for easier reading.

A special kind of non-terminal has a name that starts with a $. This indicates a one-time terminal:

<$otr> => <fourth> <fifth>;

Normally a non-terminal is evaluated every time it is encountered, and if it can take on multiple values, they'll be randomly chosen each time. But a one-time terminal is only evaluated once, and then has that same value whenever it is used.

The right-hand side of the rule is a sequence of elements. In addition to non-terminals, the elements can include strings:

<first> => <second> and a string;

Note that the string is not quoted. Lodestone will generally do what you expect in picking out the string, but you can also use a quoted string if necessary (e.g., you want the string to include something that Lodestone might interpret as a non-terminal):

<first> => <second> 'and a quoted <string>';

Note the use of single quotes. Double quotes can be used without any special meaning.

An element can also be a Javascript expression enclosed in backticks:

<second> => `Math.PI/2`;

The Javascript expression is evaluated and the result goes into the output. Importantly, the Javascript expression can contain other Lodestone elements like non-terminals:

<second> => `Math.PI/<numSides>`;

And this does what you'd expect -- divides PI by the value of the non-terminal <$divisor>. You can also set the values of one-time non-terminals in Javascript:

<second> => `<$angle> = Math.PI/<numSides>`;

You cannot do this with a regular non-terminal; since they're evaluated every time they are referenced, giving them a value won't do anything.

If you want to evaluate multiple statements in the embedded Javascript you can do that by separating the statements with a semi-colon or the comma operator:

<second> => `<$divisor> = <numSides>/2; <$angle> = Math.PI/<$divisor>`;

The value of the embedded Javascript will be the value of the last statement.

The right-hand side of a rule can also have multiple choices separated by the vertical bar character:

<fourth> => 1 | 2 | 3 | 4;

In this case, the value of the rule is one of the choices selected at random. Every time <fourth> is used, this rule could pick a different number. But if the right-hand side was a one-time terminal like this:

<$otr> => 1 | 2 | 3 | 4;

Then a random choice would be selected only the first time, and <$otr> would have that value every time it was referenced. (Although as noted above, you can change this value in Javascript.)

Normally when a rule has multiple choices, they're all equally likely. To change that, a weight can provided in square brackets, like this:

<fifth> => cat [10] | dog;

Any choice that doesn't have a weight is given the default weight of 1. So in this case, the “cat" choice will be selected ten times as often as the “dog" choice. As with embedded Javascript, the value of the weight can be any Lodestone expression, so you could have these sorts of weights:

<fifth> => cat [<catweight>] | dog [`<catweight>/2`];

Lastly, a comment is any line that starts with a #. The entire line will be ignored.

# Ignored completely

A comment does not have to end with a semi-colon.

One last note: On the right-hand side of a Lodestone rule, whitespace is significant. For example:

<start> => <first> <second>;
<first> => cat;
<second> => dog;

These rules produce the string “cat dog" because the space between <first> and <second> is significant. On the other hand, these rules:

<start> => <first><second>;
<first> => cat;
<second> => dog;

Produce the string “catdog".

Whitespace is significant, but all whitespace outside of a quoted string gets reduced down to a single space character. So you could write the first rules as:

<start> => <first>
 <second>;
<first> => cat;
<second> => dog;

And the output would still be “cat dog". To get more whitespace, use a quoted string. But be careful! This rule with three spaces in the quoted string:

<start> => <first> '   ' <second>;

will actually produce five spaces between cat and dog, because there is whitespace on either side of the quoted string. This rule:

<start> => <first> ' ' <second>;

Would result in the expected three spaces between cat and dog.

And that's it for the definition of the rule language! So now I'll work on writing a parser.

Just as I did for CDL, I'll use Nearley for this parser. I'll also use a lexer. A lexer breaks a stream of characters into a stream of tokens, so that inside Nearley we won't have to do so much character-by-character processing. With careful definition of tokens, using a lexer can make a parser much easier to write and less likely to be ambiguous. (And in fact, I've defined the language with this in mind.) The Nearley guys recommend Moo, which looks straightforward, so I'll use that.

Much of the work of parsing will actually be done in the lexer. Moo uses regular expressions to define tokens, so the main work here is to craft regular expressions for all of the Lodestone tokens. I'll start off with a simple example, the token for the arrow operator:

 //
 // The arrow that separates lhs and rhs
 //
 arrow: /=>/,

Moo tokens are defined in a Javascript object. The key name is the type of the token, and the value is a regular expression. I'm not going to attempt to explain Javascript regular expressions here, but they're defined inside slashes, and in this case the regular expression just matches the “=>" characters.

A slightly more complex token is the end-of-rule token:

 //
 // End of rule
 //
 eor: /;\s*\r\n/,

There are two things to note here. First, the “\s" is a regular expression shorthand for “any whitespace character." Second, the asterisk after anything means “match zero or more" of the preceding thing. \r and \n match a carriage return and a newline, which is how the end of a line is defined on Windows. So in this case, the regular expression matches a semi-colon followed by any amount of whitespace followed by the end of the line.

Here's the regular expression for a one-time non-terminal:

 //
 // One-time non-terminal in the form <$test>
 //
 otnterm: /<\$[^>]+>/,

There are a couple of new things here. First of all, I have to put a backslash before the $ in this pattern because the $ has a special meaning in regular expressions (it means the end of a line) and so if I want to match an actual $ character, I have to escape it with a backslash to let Javascript know that I don't mean the end of line. After that is a pattern in square brackets. The square brackets are called a character class, and match any character inside the brackets. But the caret (^) at the beginning of the list of characters negates the list. So [^>] means “any character except a right angle bracket." The plus immediately following this is like the asterisk but means “match one or more times." And finally, there is a right angle bracket. So in total, this pattern says “match a string that starts with a left angle bracket, followed by a dollar sign, followed by any number of non-right angle brackets, and ends with a right angle bracket."

One more example that introduces a useful Moo feature:

 //
 // Embedded Javascript enclosed in backticks
 //
 jscript: {match:/`[^`]+`/, value: s => s.slice(1, -1)},

Instead of just a regular expression, the value of a token type can also be an object with various parts. The “match" key is the regular expression. (This one matches everything between two backticks.) The “value" key is a function that will process the matched text to produce a value. In this case, the value function slices off the first and last characters of the matched text. (This removes the backticks.)

Here's the complete code for the lexer:

//
// Order matters in the lexer.  Earlier
// rules take precedence.
// 
const lexer = moo.compile({
  //
  // Comment is an entire line starting with #
  // 
  comment: {match: /^\s*\#.*\r\n/, lineBreaks: true},
  //
  // Or
  //
  or: /\s+\|\s+/,
  // 
  // White space, including newlines
  // 
  ws: { match: /\s+/, lineBreaks: true, value: s => ' '},
  // 
  // One-time non-terminal reference in the form <@test>
  //
  otntermref: /<\@[^>]+>/,
  // 
  // One-time non-terminal in the form <$test>
  //
  otnterm: /<\$[^>]+>/,
  // 
  // Non-terminal in the form <test>
  //
  nterm: /<[^\$][^>]*>/,
  // 
  // Embedded Javascript enclosed in backticks
  //
  jscript: {match:/`[^`]+`/, value: s => s.slice(1, -1)},
  //
  // A weight in the form [50]
  // 
  weight: {match:/\[[^\]]+\]/, value: s => s.slice(1, -1)},
  //
  // The arrow that separates lhs and rhs
  //
  arrow: /=>/,
  //
  // End of rule
  //
  eor: {match:/;\s*\r\n/, lineBreaks: true},
  //
  // A quoted string
  //
  qstring: {match: /'[^']*?'/, value: s => s.slice(1, -1), type: s => "string"},
  //
  // A string of characters
  //
  string: /[^`\'\<\[\]; \t\n\r]+/,
});

The easiest way to use the lexer in parsing is to match against the token type. This is done by using a % and the token name. For example, in a Nearley parser rule, “%comment" would match a token with the comment type. Here's a simplified version of the Lodestone parser:

rule -> ws:? (%nterm | %otnterm) ws:? %arrow ws:? rhs ws:? %eor
rhs -> element | element %ws:? rhs
element -> %nterm | %otnterm | %string | %qstring | %jscript | %weight | %or

In Nearly, the “:?" suffix means to match zero or one. So this grammar says that a rule is possibly some initial whitespace, followed by a non-terminal or one-time non-terminal, the arrow, and then right-hand side of the rule followed by the end-of-rule. (With optional whitespace in various spots.) The right-hand side is then either a single element or an element followed by optional whitespace and a right-hand-side. This is a typical recursive grammar rule to describe one or more elements possibly separated by whitespace. Finally, an element is one of the tokens defined in the lexer -- the non-terminals, strings, quoted strings, embedded Javascript, weights, or the vertical bar.

The actual grammar is slightly more complex because it includes some post-processing so that the output of the parser is similar to the output of the CDL parser -- an array of structures with an opcode and a value. Here's an example of a simple rule and the result of parsing the rule:

<$test> => <rule>;

{
 op: 'rule',
 lhs: {type: 'otnterm',value: '<$test>'},
 rhs: [{type: 'nterm', value: '<rule>'}]
}

The rule is turned into a Javascript structure with an opcode of 'rule' to indicate that it is a rule, and then structures or lists representing the left and right-hand sides. The objects with type and value are coming straight from the lexer. The parse of a whole grammar is simply an array of these rule structures.

When the parse is complete there is still some additional processing required to get the rules ready for use in a rules engine.

First, any Javascript and weight elements need to be further parsed. I'll do Javascript first since that's the simpler case. After the initial parsing above, the value of a Javascript or weight element is a string:

{type: 'jscript', value: '5'}

That's okay if the value is just a number as above. But the value of a Javascript can be anything that could be a right-hand side, as in this example:

{type: 'jscript', value: '<$totalweight>/3'}

Here the value of the Javascript element contains a non-terminal. Since the value of a Javascript element can be anything that would be on the right-hand side of a rule, I'll use the same rules to parse the value, by calling the parser again. This shows an example rule and the result of parsing the rule and the parsing the value of the Javascript element:

<$test> => <rule> `50`;

{
 type: 'rule'
 lhs: {type: 'otnterm', value: '<$test>'...}
 rhs:
 {type: 'nterm', value: '<rule>', text: '<rule>', offset: 32, toString: …}
 {type: 'string', value: ' '}
 {type: "jscript" value: {type: 'string', value: '50' ...}
}

I've highlighted the relevant portion. The value of the jscript element has been parsed to a string element with the value 50. Here's what an initial version of the Javaascript processing looks like:

function processJscript(rhs) {
    // For each element in the rhs of the rule
    for (let i=0;i<rhs.length;i++) {
	if (rhs[i].type == "jscript") {
	    // Each time we parse we need to create a new parser, otherwise
	    // it just continues on from where it ended.
	    // const parser = new nearley.Parser(Lodestone,'rhs');
	    const parser = new nearley.Parser(nearley.Grammar.fromCompiled(Lodestone),'rhs');
	    // Run the value through the initial parser
	    parser.feed(rhs[i].value);
            rhs[i].value = parser.results[0];
        };
    };
 return rhs;
};

As you can see, I'm feeding the value of the Javascript element into a parser with the 'rhs' starting point.

However, there's some legitimate Javascript that the parser will mistake for Lodestone. For example, in Javascript [50] would be an array subscript, but the parser will mistake it for a rule weight. There's a similar problem with the pipe symbol. So after I've run the parser, I'll add a step turn those back into strings:

	    // Fix or and weight tokens
	    for (let j=0;j<rhs[i].value.length;j++) {
		let token = rhs[i].value[j];
		if (token.type == 'or' || token.type == 'weight') {
		    token.type = 'string';
		    token.value = token.text;
		};
	    };

And now the value of a Javascript element will include things like “<foo>" and “<$foo>" which I want to be interpreted as Lodestone, but won't treat something like “[50]" as Lodestone.

The processing of weight elements looks much the same, with one additional wrinkle. Weight elements can contain embedded Javascript. When that happens, I need to call processJscript to take care of the embedded Javascript. Since processJscript is expecting to get a list of elements to process, I can simply give it the result of the parse:

	    // Run the value through the initial parser
	    parser.feed(rhs[i].value);
	    // Now process any Jscript in the RHS
	    rhs[i].value = processJscript(parser.results[0]);

(At this point, you may be wondering why I didn't take care of properly parsing embedded Javascript in the original lodestone.ne parser. The answer is that (1) it would have made the parser much more complex and difficult to understand, and (2) creating a parser that allows Javascript within weights would have been tricky. I thought it was easier to understand and more straightforward to handle it as I have shown above. But see the Suggestions below if you want to tackle this!)

The next processing is a bit of bookkeeping. When two string elements are next to each other, they can be merged into a single element to simplify the rule.

function processStrings(rhs) {
    let newRHS = [];
    let currentString = null;
    for(let i=0;i<rhs.length;i++) {
	// Current element is not a string
	if (rhs[i].type != "string") {
	    // If we've been building up a string
	    // element, release it.
	    if (currentString != null) {
		newRHS.push(currentString);
		currentString = null;
	    };
	    // If this is a Jscript or weight it needs
	    // to be recursively processed.
	    if (rhs[i].type == "jscript" || rhs[i].type == "weight") {
		rhs[i].value = processStrings(rhs[i].value);
	    };
	    // And save this element
	    newRHS.push(rhs[i]);
	} else {
	    if (currentString == null) {
		// If not building a string, start one.
		currentString = rhs[i];
	    } else {
		// Add this string to the currentString.
		currentString.value =
		    currentString.value.concat(rhs[i].value);
	    };
	};
    };
    if (currentString != null) {
	newRHS.push(currentString);
    }
    return newRHS;
};

The basic idea here is to walk through the right-hand side elements, pushing them into newRHS. But when a string element is encountered, it is set aside (in currentString) without pushing into newRHS. If it is followed by other string elements, they're concatenated one by one onto currentString. When a non-string element is encountered (or when the end of the right-hand side is encountered) whatever has been gathered up in currentString is added to the newRHS. One thing to watch out for in implementing this pattern is the final if that releases currentString at the end of the right-hand side. It's very easy to forget that. The only other tricky bit is the recursion in the middle of the function. As with the previous processing, since Javascript and weight elements contain right-hand sides, we must recurse down into those values and process them as well.

There's one more bit of processing to do before building the rule dictionary. That has to do with the possible random choices on the right-hand side of a rule. A rule like this:

<start> => <a> | <b> | <c> [50] | <d>;

is executed by selecting one of the right-hand side choices at random, based upon the weights. One thing that may not be immediately apparent is that the above rule is the same as these rules:

<start> => <a> | <b>;
<start> => <c> [50];
<start> => <d>;

When the rule engine is executing, it will be useful to have all the options gathered up in one place with their associated weights. The first step of that is to break any rule that has multiple choices on the right-hand side into multiple right-hand sides.

Processing the choices is similar in some ways to processing the strings. We walk through the elements in a right-hand side, collecting them up until we hit a vertical bar (an “or" token) or the end of the rule. The elements we've gathered so far constitute one choice for this rule, so we save them in a simple choice structure that has the elements in the choice and the weight for the choice. The result when we're done is a list of choices [{weight: 1, elements: [...]}, {weight: 1, elements: [...]}]:

function processChoices(rhs) {
    let rhsList = [];
    let currentRHS = [];
    let currentWeight = '1';
    for(let i=0;i<rhs.length;i++) {
	// Starting a new choice
	if (rhs[i].type == "or") {
	    rhsList.push({weight: currentWeight, elements: currentRHS});
	    // Reset the currentRHS
	    currentRHS = [];
	    currentWeight = '1';
	    continue;
	};
	// Recursive processing of Javascript and weights
	if (rhs[i].type == "jscript" || rhs[i].type == "weight") {
	    rhs[i].value = processChoices(rhs[i].value);
	};
	// Saving the weight of the current choice
	if (rhs[i].type == "weight") {
	    // Simplification if weight is just a string
	    if (rhs[i].value.length == 1 &&
		rhs[i].value[0].elements.length == 1 &&
		rhs[i].value[0].elements[0].type == "string") {
		currentWeight = rhs[i].value[0].elements[0].value;
	    } else {
		currentWeight = rhs[i].value;
	    };
	    continue;
	};
	// Otherwise put the element in currentRHS
	currentRHS.push(rhs[i]);
    };
    // Pick up the last choice
    if (currentRHS != []) {
	// Put the previous choice in the rhsList
	rhsList.push({weight: currentWeight, elements: currentRHS});
    };
    return rhsList;
};

There's a simplification that saves the weight as a simple string if that's possible. Otherwise, the weight is itself a right-hand side that has been processed into choices. (As is any Javascript element.)

The last step is to store all the rules into a dictionary and to coalesce any rules that have the same non-terminal on the left-hand side.

function processRules(rules) {
    const dict = {};
    // For each rule in the ruleset
    for (let rule of rules) {
	if (rule.type == "rule") {
	    rule.rhs = processWeights(rule.rhs);
	    rule.rhs = processJscript(rule.rhs);
	    rule.rhs = processStrings(rule.rhs);
	    rule.rhs = processChoices(rule.rhs);
	    // Now add it to the dictionary
	    const key = rule.lhs.value;
	    if (key in dict) {
		// Already here, so add the rhs to the existing rhs
		dict[key].rhs = dict[key].rhs.concat(rule.rhs);
	    } else {
		// New key; add to dict
		dict[key] = {rhs: rule.rhs};
	    };
	};
    };
    return dict;
};

We call the four processing steps on each right-hand side, and then index the result into the dictionary based on the left-hand side. (Note that comments never get processed here, so they just silently disappear.)

And that's it! A bit of a whirlwind, but that's the complete definition of a generative grammar tool and and an implementation of its parser in a single blog post. If some parts remain mysterious, please grab the code and play with it until you understand what is going on.

Unfortunately there's not much you can do with Lodestone yet, but you can download the Part 12 branch from the Procedural Map Compasses repository on Github to access all of this code. Part 1 has instructions, but the short version (on Windows) is to double-click mongoose-free-6.9.exe to start up a web server in the Part 12 directory, and then select the test.html link. Obviously you can browse the code and other files, and make changes of your own. You can also try out this code on the web at https://dragonsabound-part12.netlify.app/test.html although you'll only be able to run the code, not modify it.

Next time I'll write the engine that will execute the rules in the dictionary.

Suggestions to Explore

As suggested above, try writing a more complex grammar that parses the embedded Javascript and weight elements correctly so that no post-processing is required.
The simplest part of the rule engine will take non-terminals and expand them into other non-terminals based upon the rules. Create a simple rule set with all non-terminals:
```
<start> => <a> <b>;
<a> => <e>;
<b> => <c>;
<c> => <d>;
```
Now create an interpreter that starts with “<start>" and repeatedly expands the terminal based on the rule set. When you can't expand the first non-terminal any more, “output" it and move on to expanding the next non-terminal. Expand the engine to include strings; these just get output as soon as they are encountered.

Here Dragons Abound

Thursday, February 10, 2022

Map Compasses (Part 12): An Excursion into Tool Building

No comments:

Post a Comment