
Generative Text

Examples:

  • All the code as an Eclipse project zip: a2z_week8.zip
  • Updates on CVS: /home/dts204/a2z/examples/
  • Related

  • A lot of the code and discussion below is based on examples from Daniel Howe. Check out: the RiTa site and Programming for Digital Art & Literature.
  • Google N-Grams!!
  • Markov Models of Natural Language
  • Three Models for the Description of Language (Chomsky)
    Remember the ol’ monkey typing at a keyboard problem?

     String monkey = "";
     for (int i = 0; i < 20; i++) {
         // Pick a random ASCII code (0-126) and treat it as a character
         int n = (int) (Math.random()*127);
         char c = (char) n;
         monkey += c;
     }
     System.out.println(monkey);
    

    There, a monkey-ish program. Although we sure do all love monkeys, this week is dedicated to creating text via generative methods ever so slightly more sophisticated than the randomly typing monkey.

    Probability and Chance Operations

    A true monkey types totally at random, meaning there is an equal probability that any key on the keyboard will be hit at any given time. What if we could generate our own custom probability map for a keyboard (or perhaps a sequence of word tokens)? What kind of results could we achieve? Some interesting examples of text generated via this type of methodology are Gnoetry and Travesty.

    Before we dive into code, it might be useful to review some probability basics. You might look at this handout for the nature of code course, as well as The Layman’s Guide to Probability and this page from Math for Morons like Us.

    Let's examine the following example, where letters are more likely to appear in alphabetical sequence than not. This won't necessarily produce interesting results, but it will demonstrate the basics. First we create a class to represent each letter.

    public class Letter {
    
        char c;
        float[] probs;
    
        public Letter(char c_) {
            c = c_;
            probs = new float[127]; // We are allowing for 127 other possibilities
        }
    }
    

    Note each letter object has a character as well as an array that holds values for the probability that any given character (ASCII code 0-126) will follow that letter. An array is very convenient to use here because characters can be considered as integers and therefore function in a dual role as a letter as well as an index in the array.
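
    For instance, a throwaway snippet (arbitrary value, hypothetical variable name) illustrating that dual role:

    Letter l = new Letter('a');
    // The char 'b' is ASCII 98, so it works directly as an index:
    // this sets the probability that 'b' follows 'a'
    l.probs['b'] = 0.5f;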

    Now, for each letter, we simply want to fill that probability array. We’re doing it in a highly arbitrary way as a demonstration in this example:

    for (int i = 0; i < probs.length; i++) {
        // only considering lowercase letters (ASCII 97-122)
        if (i > 96 && i < 123) {
            // if it's at or before this letter in the alphabet
            if (i <= c) probs[i] = 0.01f;
            // if it's within 4 letters after
            else if (i-c < 4) probs[i] = (float) (4-(c-i))*10f;
            // anything else
            else probs[i] = 0.1f;
        }
        else {
            probs[i] = 0.0f;
        }
    }
    

    This just fills the array with numbers; it's best to then normalize these values (i.e. divide each one by the sum of all values) so that the array of probabilities adds up to 100%.

    // Divide each value by the sum of all values
    public void normalize(float[] arr) {
      float total = 0;
      for (int i = 0; i < arr.length; i++) {
          total += arr[i];
      }
      for (int i = 0; i < arr.length; i++) {
          arr[i] /= total;
      }
    }
    

    Finally, we can build a sentence one letter at a time according to these probabilities. One way to accomplish this is via a process of picking two random numbers:

  • 1 -- pick a random character (0 - 126)
  • 2 -- pick a random floating point value (0 - 1.0)
  • 3 -- look up the probability of the character picked in step #1 in the probability table (array)
  • 4 -- if the random number picked in step #2 is less than the probability found in step #3, then we have found our next letter and we are all set
  • 5 -- if it is not, go back to step 1 and start over again.
    (If this seems confusing, consider a situation with only two possible letters: 'a' with a probability of 0.9 and 'b' with a probability of 0.1. We can see that the above method would yield 'a' nine times as often.)

    // An algorithm for picking the next character based on probability map
    public char pickNext() {
        // Have we found one yet
        boolean foundone = false;
        int hack = 0;  // let's count just so we don't get stuck in an infinite loop by accident
        while (!foundone && hack < 10000) {
            // Pick two random numbers
            float r1 = (float) Math.random();
            float r2 = (float) Math.random();
            int index = (int) (r1*probs.length); // random spot in the array
            // If the second random number is less than the probability in the array
            // Then we've found an OK answer
            if (r2 < probs[index]) {
                foundone = true;
                return (char) index;
            }
            hack++;
        }
        // Hack in case we run into a problem (need to improve this)
        return ' ';
    }
    

    Code: Letter.java, ProbDriver.java, full source: prob.zip.

    Markov Chains and N-Grams

    A Markov Chain can be described as a sequence of random "states" where each new state is conditional only on the previous state. An example of a Markov Chain is Monopoly. The "next" state of the Monopoly board depends on the current state and the roll of the dice. It doesn't matter how we got to that current state, only what it is at the moment. A game like blackjack, for example, is different in that the deal of the cards depends on the history of many previous deals (assuming a single deck that is not continuously shuffled). Monopoly as Markov Chain.

    We can use a Markov chain to generate text where each new word or character is dependent on the previous word (or character) or sequence of words (or characters). This is known as an N-gram model. An N-gram model for language predicts a word (or character) W[i] based on the previous N-1 words: W[i-1], W[i-2], etc. Given the phrase "I have to" we might say the next word is 50% likely to be "go", 30% likely to be "run" and 20% likely to be "pee." We can construct these word-sequence probabilities from a large corpus of source texts. The following function, for example, returns all n-grams for a given String (of order N):

    public static ArrayList<String> nGrams(String sentence, int n) {
      // Split sentence up into words (note the escaped backslash in the regex)
      String[] words = sentence.split("\\W+");
      // Index of the last n-gram's starting word
      int total = words.length - n;
      ArrayList<String> grams = new ArrayList<String>();
      // Loop through and create all sequences
      for (int i = 0; i <= total; i++) {
        String seq = "";
        for (int j = i; j < i+n; j++) {
          seq += words[j] + " ";
        }
        // Add to ArrayList
        grams.add(seq);
      }
      return grams;
    }
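
    For example, a quick (hypothetical) test of the function:

    // Prints: [the quick , quick brown , brown fox ]
    // (each n-gram keeps its trailing space)
    System.out.println(nGrams("the quick brown fox", 2));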
    

    Developing the full statistical model involves a node-based data structure (see: Daniel Howe's Markov page). For us, since this functionality is built into the RiTa Library, we can get it for free.
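
    To get a feel for what RiTa builds for us, here is a bare-bones word-level sketch (order 1) using a plain HashMap. The class and method names are my own invention, not the RiTa API or Howe's node-based structure:

    import java.util.ArrayList;
    import java.util.HashMap;

    // A toy order-1 Markov generator: each word maps to a list of the
    // words observed to follow it; duplicates in the list encode frequency.
    public class SimpleMarkov {

        HashMap<String, ArrayList<String>> model = new HashMap<String, ArrayList<String>>();

        public void load(String text) {
            String[] words = text.split("\\s+");
            for (int i = 0; i < words.length - 1; i++) {
                if (!model.containsKey(words[i])) {
                    model.put(words[i], new ArrayList<String>());
                }
                model.get(words[i]).add(words[i + 1]);
            }
        }

        // Walk the chain: repeatedly pick a random observed follower
        public String generate(String seed, int maxWords) {
            StringBuffer sb = new StringBuffer(seed);
            String current = seed;
            for (int i = 0; i < maxWords; i++) {
                ArrayList<String> followers = model.get(current);
                if (followers == null) break; // dead end, no known follower
                current = followers.get((int) (Math.random() * followers.size()));
                sb.append(" ").append(current);
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            SimpleMarkov m = new SimpleMarkov();
            m.load("the cat sat on the mat and the cat slept");
            System.out.println(m.generate("the", 10));
        }
    }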

    A quick note about using RiTa in the context of our class. RiTa is designed as a Processing library and thus assumes the existence of a PApplet when generating objects. This is because RiTa also includes functionality to draw text on the screen. Although you may ultimately use RiTa in Processing for projects that require visuals, these examples will remain simple Java programs with console output only. This is why you will see the use of "null" in this code instead of "this". Since we are not a PApplet, we cannot pass ourselves into the RiTa objects and so we just say null. "core.jar" will still have to be part of the build path and you must put certain files (such as "rita_addenda.txt") in a folder called "data." Other bugs may arise with RiTa outside of Processing and we'll discover these as we go. . .

    Ok, now onto building a Markov model and generating sentences based on that model in RiTa. First we must declare a RiMarkov object.

    // Create a new markov model with n=3
    RiMarkov markov = new RiMarkov(null, 3);
    

    Once we have the object, we can pass it text from which to build its model:

    A2ZFileReader fr = new A2ZFileReader("plays/hamlet.txt");
    markov.loadString(fr.getContent());
    

    We can even pass it multiple texts with weights for those texts.

    fr = new A2ZFileReader("plays/seagull.txt");
    markov.loadString(fr.getContent(),3);
    

    The above code is the equivalent of adding Chekhov’s The Seagull three times into the model. Once the model is built we can generate sentences:

    // Generate 8 lines
    String[] lines = markov.generate(8);
    System.out.println("Here are 8 generated lines: + n");
    for (int i = 0; i < lines.length; i++) {
      System.out.println(lines[i]);
    }
    

    And here's a sample result:

    Nonsense, Matriona will feed it.
    It is always either vodka or brandy.
    Yet I am sorry to leave.
    You should not handle youthful egoism so roughly, sister.
    What did I hurt my poor boy?
    No, indeed, are ambition; for the first day.
    Yes, they are singing across the water.
    It is like a beggar beneath your window.
    

    Here’s another great example, direct from our class: Thesis Title Generator by Adam Parrish.

    Examples: Markov.java, Ngram.java

    Context Free Grammars (CFGs)


    From Wikipedia: “Grammar is the study of the rules governing the use of a given natural language, and, as such, is a field of linguistics. Traditionally, grammar included morphology and syntax; in modern linguistics these subfields are complemented by phonetics, phonology, semantics, and pragmatics.”

    A Context-Free Grammar is a set of rules that define the syntax of a language — what is and is not a valid sequence of “tokens”. The syntax of a programming language, for example, is defined by a context-free grammar — a compiler reads your code to make sure it conforms to specific rules and informs you of any errors. (These rules aren’t limited to the production of text, and can be used for music, images, etc.) A context-free grammar, however, isn’t robust enough to describe the complexities of natural language. After all, it has no context! Natural languages can be described using context-sensitive grammars, a concept introduced by Chomsky in the 50s.

    See Daniel Howe’s page on grammars for a fuller explanation (from which this section is adapted).

    For our purposes here, we want to learn how to define our own context free grammars and generate text from them. I’m going to give a short explanation here, followed by code examples. However, again, Daniel Howe’s page contains more detailed explanation.

    All “production rules” in a context-free grammar are of the form:

    V -> w

    where

    V is a single “non-terminal”
    and
    w is a sequence of “terminals” and “non-terminals”

    A non-terminal symbol is simply another symbol which needs to be defined in another production rule. A terminal symbol does not need to be defined because that’s it, you’ve reached the end, what is here is what should be in the sentence itself. Here’s an incredibly simple CFG.

    s -> a b
    a -> 'the'
    b -> c 'cat'
    c -> 'happy'
    c -> 'sad'
    

    Anything in single quotes is a “terminal” element, meaning this is the end, this is the word/string we can use. In the above, “the,” “cat,” “happy,” and “sad” are all terminals. Anything that is not in quotes is non-terminal, or a “symbol.” The above example has 4 symbols — s, a, b, c. The symbol “s” is commonly used to indicate “start” or “sentence.” Notice how in the above set of rules the symbol c can be “happy” or “sad.” We can use an OR (pipe character) to combine these two rules:

    c -> 'happy' | 'sad'
    

    Again, this grammar is incredibly basic, and is just for demonstrating how the elements work. The only two valid “sentences” that conform to this grammar are:

    the happy cat
    the sad cat
    

    The process of generating the above two sentences goes something like:

    s becomes a b which becomes the c cat which then becomes the happy cat or the sad cat. Here, either “happy” or “sad” is picked randomly (with a 50% chance for each one.)
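
    To make that process concrete, here is a minimal sketch of random expansion in plain Java (a toy rule format of my own, not the RiTa grammar-file syntax described next):

    import java.util.HashMap;

    // A minimal random CFG expander. Rules map a non-terminal to its
    // possible right-hand sides; any token without an entry in the map
    // is treated as a terminal.
    public class TinyCFG {
        static HashMap<String, String[]> rules = new HashMap<String, String[]>();

        public static void main(String[] args) {
            rules.put("s", new String[] { "a b" });
            rules.put("a", new String[] { "the" });
            rules.put("b", new String[] { "c cat" });
            rules.put("c", new String[] { "happy", "sad" });
            // Prints "the happy cat" or "the sad cat"
            System.out.println(expand("s"));
        }

        static String expand(String symbol) {
            String[] options = rules.get(symbol);
            if (options == null) return symbol; // terminal: use as-is
            // Pick one production at random (equal probability)
            String picked = options[(int) (Math.random() * options.length)];
            StringBuffer sb = new StringBuffer();
            for (String token : picked.split(" ")) {
                if (sb.length() > 0) sb.append(" ");
                sb.append(expand(token)); // recursively expand each token
            }
            return sb.toString();
        }
    }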

    The RiTa library supports CFGs through the use of a grammar file. The format is different: instead of writing:

    V -> w

    you write:

    {
      <V>
      <w>
    }

    or

    {
      <V>
       w
    }

    depending on whether w is another non-terminal symbol — written <w> — or a terminal word — written plainly as w.

    Our grammar is therefore:

    {
     <start>
     <a> <b>
    }

    {
      <a>
      the
    }

    {
      <b>
      <c> cat
    }

    {
      <c>
      happy | sad
    }
    

    Here are links to two examples from Daniel Howe’s site: Another simple grammar, Haiku grammar.

    Once you have a grammar file, you can use RiTa to generate text. Here’s how it works.

    First you create the RiGrammar object, referencing a PApplet (or in our case: null) and the grammar file (which should be placed in a folder named “data”):

    RiGrammar grammar = new RiGrammar(null, "simple.g");
    

    The expand() method is used to generate the text:

    String text = grammar.expand();
    System.out.println(text);
    

    RiTa uses “start” as the default first non-terminal symbol. However, the methods expandFrom() and expandWith(), described here, allow for more advanced functionality.

    In addition, RiTa will assign equal probabilities to all choices in a given rule; however, you can assign different weights to different choices using the following syntax:

       {
         <a>
         [2] b | c
       }
    

    In the above grammar, “b” is twice as likely to be selected as “c” (i.e. “b” is picked with probability 2/3 and “c” with probability 1/3).

    Code: CFG.java. Grammar file: simple.g.

    Finally, a grammar file is something we can generate ourselves as well based on some algorithm. For example, using this haiku grammar file from Daniel Howe, we could read in a source text, analyze the syllable counts of all the words, and rewrite the grammar file to reflect these new words. Here is a version of this grammar file using words from an Obama speech: generated_grammar.g.

    Writing the grammar file is easy using our A2ZFileWriter class:

    A2ZFileWriter fw = new A2ZFileWriter("data/generated_grammar.g");
    fw.appendLine("{");
    fw.appendLine("<start>");
    fw.appendLine("<5-line> % <7-line> % <5-line>");
    fw.appendLine("}\n");
    

    Here is the full source: GrammarMaker.java. Note the use of a Concordance to get all the words from a source file as well as RiTa’s Analyzer object to count syllables.

    L-Systems

    L-Systems were developed by Aristid Lindenmayer in 1968 as a means for modeling plant cell structure. L-Systems, although used primarily to generate graphics (often resulting in fractals), are based on a formal grammar, consisting of an alphabet, a set of rules, and an axiom (the starting sentence). An L-System can be generated with a CFG, as demonstrated in an example by Daniel Howe. Here, we’ll demonstrate how to generate an L-System without CFGs and the RiTa library.

  • Alphabet: A,B,C
  • Rules: A → AB, B → AA, C → BC
  • Axiom: ABC
    We can then generate the L-System by applying the rules iteratively, beginning with the axiom, i.e.:

    Generation #0: ABC
    Generation #1: ABAABC
    Generation #2: ABAAABABAABC
    Generation #3: ABAAABABABAAABAAABABAABC
    Generation #4: ABAAABABABAAABAAABAAABABABAAABABABAAABAAABABAABC
    Generation #5: ABAAABABABAAABAAABAAABABABAAABABABAAABABABAAABAAABAAABABABAAABAAABAAABABABAAABABABAAABAAABABAABC
    

    To generate imagery from an L-System, the sentences are deciphered as directions for a turtle graphics engine. For example:

  • Alphabet: F (go forward), + (turn right), – (turn left), [ (save current location), ] (return to last saved location)
  • Rules: F → F[F]-F+F[--F]+F-F
  • Axiom: F-F-F-F
    The applet to the right visualizes the result.

    Let’s look a bit at how the code is implemented. Our program will involve two classes, LSystem.java and Rule.java.

    The LSystem class includes a String for the current sentence, an int to keep track of the total # of generations, and an array of Rule objects to describe the rules of the system.

    public class LSystem {
    
        private String sentence;     // The sentence (a String)
        private Rule[] ruleset;      // The ruleset (an array of Rule objects)
        private int generation;      // Keeping track of the generation #
    
        // Construct an LSystem with a starting sentence and a ruleset
        public LSystem(String axiom, Rule[] r) {
            sentence = axiom;
            ruleset = r;
            generation = 0;
        }
    

    The rather simple Rule class includes a character that is matched and a replacement String, i.e.

    public class Rule {
        private char match;
        private String replace;
    
        public Rule(char a_, String b_) {
            match = a_;
            replace = b_;
        }
    
        public char getMatch() {
            return match;
        }
    
        public String getReplace() {
            return replace;
        }
    }
    

    The main work is found in the function that generates a new String based on the existing String by matching characters in the ruleset and replacing them accordingly. The pseudo-code goes something like:

    1. Create an empty StringBuffer (a StringBuffer is faster than an old-fashioned String when it comes to appending.)
    2. For every character in the current String:

  • Loop through all rules and see if it matches one
  • If it does, append the replacement String to the StringBuffer
  • If it does not, append itself to the StringBuffer
  • (Some weird bug is preventing the full code from being pasted here; visit LSystem.java at http://shiffman.net/itp/classes/a2z/week09/LSystem.java and look for the generate() method. A sketch following this pseudo-code appears below.)
    
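    A sketch consistent with that pseudo-code (the linked file is the authoritative version):

    public void generate() {
        StringBuffer next = new StringBuffer();
        // For every character in the current sentence...
        for (int i = 0; i < sentence.length(); i++) {
            char c = sentence.charAt(i);
            String replace = "" + c;  // default: the character itself
            // ...check it against every rule in the ruleset
            for (int j = 0; j < ruleset.length; j++) {
                if (ruleset[j].getMatch() == c) {
                    replace = ruleset[j].getReplace();
                    break;
                }
            }
            next.append(replace);
        }
        sentence = next.toString();
        generation++;
    }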

    Code: LSys.java, LSystem.java, Rule.java, full source: lsystem.zip.

    Genetic Algorithms

    We can also achieve some interesting results by generating text via genetic algorithms. For example, what if a String of characters is considered DNA? For examples and explanation, visit the nature of code GA handout.
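
    As a taste of the idea, here is a minimal sketch where the genotype is a string of random characters and fitness simply counts matches against a target phrase (a toy setup under assumed names; the handout covers selection, crossover, and the rest):

    // A candidate "DNA" strand: an array of random printable characters
    public class DNA {
        char[] genes;

        public DNA(int length) {
            genes = new char[length];
            for (int i = 0; i < length; i++) {
                // Random printable ASCII character (32-126)
                genes[i] = (char) (32 + (int) (Math.random() * 95));
            }
        }

        // Fitness: fraction of characters matching the target
        // (assumes genes.length == target.length())
        public float fitness(String target) {
            int score = 0;
            for (int i = 0; i < genes.length; i++) {
                if (genes[i] == target.charAt(i)) score++;
            }
            return (float) score / target.length();
        }

        // Mutation: occasionally replace a gene with a random character
        public void mutate(float rate) {
            for (int i = 0; i < genes.length; i++) {
                if (Math.random() < rate) {
                    genes[i] = (char) (32 + (int) (Math.random() * 95));
                }
            }
        }
    }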

    Code: DNA.java, GAShakespeare.java.
