about syllabus All example source code
From Wikipedia: “Grammar is the study of the rules governing the use of a given natural language, and, as such, is a field of linguistics. Traditionally, grammar included morphology and syntax; in modern linguistics these subfields are complemented by phonetics, phonology, semantics, and pragmatics.”
A Context-Free Grammar is a set of rules that define the syntax of a language — what is and is not a valid sequence of “tokens”. Programming languages, for example, are context-free grammars — a compiler reads your code to make sure it conforms to specific rules and informs you of any errors. (These rules aren’t limited to the production of text, and can be used for music, visual designs, etc.) A context-free grammar, however, isn’t robust enough to describe the complexities of natural language. After all, they have no context! Natural languages can be described using Context-sensitive grammars, a concept introduced by Chomsky in the 50s.
Here, I will demontrate how to use libraries to generate text with a Context Free Grammar as well as program your own from scratch. want to learn how to define our own context free grammars and generate text from them. I’m going to give a short explanations, followed by code examples. For more detailed discussions, I would recommend Daniel Howe’s (creator of RiTa) explanation and Context-Free Grammars by Allison Parrish.
All “production rules” in a context-free grammar are of the form:
V is a single “non-terminal” and
w is a sequence of “terminals” and “non-terminals”
A non-terminal symbol is simply another symbol which needs to be defined in another production rule. A terminal symbol does not need to be defined because that’s it, you’ve reached the end, what is here is what should be in the sentence itself. Here’s an incredibly simple CFG.
Anything in single quotes is a “terminal” element, meaning this is the end and no more substitutions are necessary. In the above “cat,” “happy,” and “sad” are all terminals. Anything that is not in quotes is non-terminal, or a “symbol.” The abve example has 4 symbols — s, a, b, c. The symbol “s” is commonly used to indicate “start” or “sentence.”
Notice how in the above set of rules the symbol c can be “happy” or “sad.” This is often express with an OR (pipe character) to combine these two rules:
Again, this grammar is incredibly basic, and is just for demonstrating how the elements work. The only two valid “sentences” that conform to this grammar are:
The process of generating the above two sentences goes something like:
a b which becomes
the c cat which then becomes
the happy cat or
the sad cat. Here, either “happy” or “sad” is picked randomly (with a 50% chance for each one.)
A great way to get started experimenting with CFG text generation is to use a library. One of my favorites is Tracery by Kate Compton. With Tracery, all you need to do is setup your grammar as a JSON object. Here’s a very simple example grammar:
#adj# is surrounded by the
# symbol. This indicates that it is a “non-terminal” element and should be replaced with a random element from the
adj property of
story. These types of replacement rules can be nested and result in complex outcomes. To generate the story, a grammar object is created from the above
story data. Any given outcome can be “expanded” from a starting element (like
#start#) and then “flattened” (meaning we just want the final result, not the full expansion tree.)
The RiTa.js library also includes Context Free Grammars as a feature with the
RiGrammar object. You can create a grammar in code by creating a grammar object and adding rules..
RiTa looks for a rule that begins with
<start> to generate the text with
RiTa also allows you pass a third argument (weight) to each rule to increase the probability of a particular option being selected. And finally, RiTa allows you to load grammars from a file (formatted with JSON or YAML) with
The above may look a little strange to you. Notice how the value for each key is an array of arrays. Each element of the larger array is one possible outcome which is an array itself: the list of elements in that outcome. I should also note that the is no distinction between a “symbol” and a “terminal.” Everything is just a String, if there is a rule associated with that String then it is a symbol, if not, it’s terminal.
A generated sentence from the CFG is commonly referred to as an “expansion” and is a bit tricky to implement. We could try to just use a loop, iterating over a sentence and executing the production rules. The nested nature of the rules, however, introduces a great deal of complexity. An elegant strategy, however, is to call a function recursively expanding the same sentence over and over again until there are no non-terminal elements left. Let’s assume we have an empty array and some seed (often called an “axiom”).
We can write a function that receives both the array and the seed and calls itself recursively if a rule that matches that seed String is present.
The full implementation of the CFG object can be found here. In addition, here are some sample grammars: creating a grammar in code, a grammar file, a grammar files as JSON, haiku grammar file as JSON. These grammars come from Allison Parrish and Daniel Howe.
Finally, a grammar file is something you can generate yourself based on some algorithm. For example, using this haiku grammar file from Daniel Howe, you could read in a source text, analyze the syllable counts of all the words, and rewrite the grammar file to reflect these new words. Here is a version of this grammar file using words from an Obama speech: generated_grammar.json.
If the concept of recursion in the
expand() function above is confusing to you, this video about recursion in algorithmic drawing from The Nature of Code may help.
Finally, another example of a Context-Free Grammar is an L-System, a grammar used to generated growth patterns found in nature.