daniel shiffman

Week 1 A to Z

Examples:

  • HelloWorld.java
  • BankAccountTest.java BankAccount.java
  • SimpleFileIO.java
  • SimpleFileIO2.java
  • EveryOtherWord.java
  • ReverseWords.java
  • ReverseCharacters.java
  • God.java
  • FleschIndex.java
  • Some sample input files: spam.txt, nytimes.txt, bible.txt
  • Related:

  • Characters and Strings tutorial
  • Exercises (optional and purposefully mundane):

  • Write a program that opens multiple source text files and combines them together writing them out as one file.
  • Write a program that counts the number of punctuation occurrences in a source text.
  • Revise the file input / output program to have full error handling.

    Beyond Processing and into Java

    This course assumes one semester of programming experience in Processing. If you’re already familiar with compiling and running your own Java programs without Processing feel free to skip this section.

    Pulling back the curtain of Processing, what we’ll discover is it really is Java. For example, in Java, you:

  • Declare, initialize and use variables the same way
  • Declare, initialize and use arrays the same way
  • Employ conditional and loop statements the same way
  • Define and call functions the same way
  • Create classes the same way.
  • Instantiate objects the same way.
    Processing, of course, gives some extra stuff for free, and this is why it’s a great tool for learning rich media programming. Nevertheless, for the start of this semester, our programs will involve text / file processing and it will be simpler to compile and run programs from the command line. However, the examples will also be provided via CVS as an Eclipse project and you should feel free to use Eclipse if you prefer (CVS Instructions). From time to time, I will also include Perl and PHP versions of the examples. Using Java for the assignments will not be required if you prefer one of these languages.

    Let’s look at a basic first example.

    public class HelloWorld
    {
      public static void main(String[] args)
      {
        int x = 0;
        while (x < 10) {
          System.out.println("Look, I can count to " + x + "!");
          x++;
        }
      }
    }
    
    Compile: javac ClassName.java

    Run: java ClassName

    Take the above code and make a text file called HelloWorld.java (using Notepad, TextPad, BBEdit, TextWrangler, etc.). Congratulations, you've written your first Java program. However, unlike with Processing, we don't have a "Run" or "Play" button. You have to compile and run the program yourself.

    On a Mac, you can accomplish this via the Terminal.

    Holy more than one file, batman

    Let's take a look at an object oriented example where our class (Bank Account) is kept in its own file and a "driver" program accesses it. Here are the two files you need:
    BankAccount.java
    BankAccountTest.java

    As long as these files are both in the same directory, we can compile BankAccountTest.java, which, since it uses the BankAccount.java class, will instigate the compilation of that class.

    (examples from Big Java by Cay Horstmann.)

    What's new?

  • The "main" method – Every Java program must have a main method. In Processing, we controlled the flow of the program via setup() and draw(). Under the covers, however, every PApplet has a main method that creates and initializes the applet, creates the window, etc. Since we are writing Java programs from scratch, we'll need to write our own main method. A full explanation of the main method is available here. One thing that is important for us to note is that the main method takes an array of Strings as an argument, i.e. "main(String[] args)". When the program is run via the command line, you can pass Strings into the program using the array. This will prove incredibly useful for doing file processing (see below).
  • Import Statements – Although this particular HelloWorld program does not include any import statements, Java programs require that classes and libraries are explicitly imported. We’ve experienced this before when using the video, serial, or opengl library in Processing, i.e. import processing.video.*; .
  • public – In Java, variables, functions, and classes can be “public” or “private.” This designation indicates what level of access should be granted to a particular piece of code. It’s not something we have to worry much about right now, but it becomes an important consideration when moving on to larger Java programs.
  • class HelloWorld – Sound somewhat familiar? Java, it turns out, is a true object-oriented language. There is nothing written in Java that is not part of a class! Every program, even the main program is a class too!
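    To see the args array in action, here is a minimal sketch that simply reports whatever was typed after the class name on the command line. The class name ArgsDemo and the describe() helper are invented for this illustration.

```java
// Hypothetical example: reports the Strings passed in from the command line
class ArgsDemo {

  // Builds a one-line summary of the argument array (helper made up for this sketch)
  static String describe(String[] args) {
    StringBuffer summary = new StringBuffer();
    summary.append(args.length + " argument(s):");
    for (int i = 0; i < args.length; i++) {
      summary.append(" [" + i + "]=" + args[i]);
    }
    return summary.toString();
  }

  public static void main(String[] args) {
    System.out.println(describe(args));
  }
}
```

    Running "java ArgsDemo spam.txt out.txt" prints "2 argument(s): [0]=spam.txt [1]=out.txt". This is the same mechanism the file examples below rely on when they read args[0] and args[1] as file names.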
    Exploring the Java API

    The Processing reference quickly became our BFF while learning to program. The Java API is going to be more of our rascally nemesis. We can explore the full Java documentation by visiting http://java.sun.com. There, we can click over to “API” specifications: http://java.sun.com/reference/api/index.html and find a selection of versions of Java. We'll be using the Java 2 Platform, Standard Edition, v 1.4.2: http://java.sun.com/j2se/1.4.2/docs/api/.

    And so, very quickly, you’ll find yourself completely lost. And that’s ok. The Java API is huge. Humongous. It’s not meant to be read or even perused. It’s really more of a reference for looking up specific classes. For example, you might be working on a program that requires sophisticated random number generation, and perhaps you overheard a conversation about the class “Random” and thought “Hey, maybe I should check that out!” You can find the appropriate reference page by scrolling down the “All Classes” list or else by knowing that it lives in the java.util package (which you can select from the package list on the top left). Even better, if you type Java Random into Google, the Random documentation page will likely be among the first results. Much like Processing, you’ll find a reference page with an explanation of what the class does, the Constructors for creating an object instance, and available methods (functions). Since Random is part of the java.util package, we do need to write an import statement (import java.util.Random;) to use it. Only classes in java.lang, such as String, are available without an import.
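    As a small illustration of going from a reference page to working code, the sketch below uses the Random constructors and the nextInt(int) method listed in the java.util.Random documentation. Because Random lives in java.util rather than java.lang, the file starts with an import. The class name RandomDemo and the rollDie() helper are assumptions made for this example.

```java
import java.util.Random;

// Hypothetical example: simulating a six-sided die with java.util.Random
class RandomDemo {

  // Returns a value from 1 to 6 using the supplied generator
  static int rollDie(Random generator) {
    return generator.nextInt(6) + 1; // nextInt(6) returns 0 through 5
  }

  public static void main(String[] args) {
    Random generator = new Random(); // the no-argument constructor picks its own seed
    for (int i = 0; i < 5; i++) {
      System.out.println("Roll " + i + ": " + rollDie(generator));
    }
  }
}
```

    Passing a fixed seed, as in new Random(42L), makes the sequence repeatable, which can be handy when debugging.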

    The “String” class

    The String class is what we will use to store textual information in our Java programs (from time to time, we may also use the StringBuffer class, but String will do for now.)

    You may be familiar with the Processing reference page for Strings. The complete reference for String is http://java.sun.com/j2se/1.4.2/docs/api/java/lang/String.html . And again, just in case, link to full JavaDocs: http://java.sun.com/j2se/1.4.2/docs/api.

    A String, at its core, is really just a fancy way of storing an array of characters – if we didn’t have the String class, we’d probably have to write some code like this:

    char[] sometext = {'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd'};
    

    Clearly, this would be a royal pain in the programming behind. It’s much simpler to do the following and make a String object:

    String sometext = "How to make a String? Characters between quotation marks!";
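    To make the link between a char array and a String concrete, here is a minimal sketch that converts in both directions using the String(char[]) constructor, toCharArray(), and charAt(). The class name CharArrayDemo and the copyByChars() helper are made up for this example.

```java
// Hypothetical example: moving between char arrays and Strings
class CharArrayDemo {

  // Rebuilds a String from its characters, one charAt() call at a time
  static String copyByChars(String s) {
    char[] letters = new char[s.length()];
    for (int i = 0; i < s.length(); i++) {
      letters[i] = s.charAt(i);
    }
    return new String(letters);
  }

  public static void main(String[] args) {
    char[] sometext = {'H', 'e', 'l', 'l', 'o'};
    String fromArray = new String(sometext);     // char[] -> String
    char[] backAgain = fromArray.toCharArray();  // String -> char[]
    System.out.println(fromArray + " has " + backAgain.length + " characters");
    System.out.println(copyByChars("World"));
  }
}
```

    This prints "Hello has 5 characters" followed by "World".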
    

    Simple String Analysis

    Java provides us with a basic set of String functions that allow for simple manipulation and analysis. Next week, we’ll also look at how regular expressions allow us to perform advanced String processing, but it’s good to pick up some of the basics first and gather some skills doing all of our text processing manually, character by character. All of the available String methods (functions) are laid out on the JavaDoc page, and we’ll explore a few useful ones here. Let’s take a closer look at three String class functions: indexOf(), substring(), and length().

    indexOf() locates a sequence of characters within a string. For example, run this code and examine the result:

    String sentence = "The quick brown fox jumps over the lazy dog.";
    System.out.println(sentence.indexOf("quick"));
    System.out.println(sentence.indexOf("fo"));
    System.out.println(sentence.indexOf("The"));
    System.out.println(sentence.indexOf("blah blah"));
    

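    Note that indexOf() returns the position where the match begins, counting from 0 for the first character, and returns -1 if the search phrase is not part of the String. The four calls above print 4, 16, 0, and -1.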

    After we find a certain search phrase within a String, we might want to pull out part of the String and save it in a different variable. This is what we call a “substring” and we can use Java’s substring() function to take care of this task. Examine and run the following code:

    String sentence = "The quick brown fox jumps over the lazy dog.";
    String phrase = sentence.substring(4,9);
    System.out.println(phrase);
    

    Note that the substring begins at the specified beginIndex (the first argument) and extends to the character at endIndex (the second argument) minus one. Thus the length of the substring is endIndex minus beginIndex.
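    The two functions are often used together: indexOf() finds where to start, and substring() cuts the piece out. The sketch below is one way to do that. The class name SubstringDemo and the wordStartingWith() helper are invented for this illustration, and it uses the two-argument form indexOf(String, int), which begins searching at a given position.

```java
// Hypothetical example: combining indexOf() and substring()
class SubstringDemo {

  // Returns the word that begins at the first occurrence of `start` and runs
  // up to (not including) the next space; returns "" if `start` is not found
  static String wordStartingWith(String sentence, String start) {
    int begin = sentence.indexOf(start);
    if (begin == -1) {
      return "";
    }
    int end = sentence.indexOf(" ", begin); // look for a space from begin onward
    if (end == -1) {
      end = sentence.length();              // no space left: run to the end
    }
    return sentence.substring(begin, end);
  }

  public static void main(String[] args) {
    String sentence = "The quick brown fox jumps over the lazy dog.";
    System.out.println(wordStartingWith(sentence, "qu"));  // quick
    System.out.println(wordStartingWith(sentence, "do"));  // dog.
  }
}
```

    Note that the second result keeps its period, since only spaces are treated as word boundaries here.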

    At any given point, we might also want to access the length of the String. We can accomplish this by calling the length() function.

    String sentence = "The quick brown fox jumps over the lazy dog.";
    System.out.println(sentence.length());
    

    Note this is different than accessing the length of an array. Here we are calling the length function available to us within the String class, and therefore must also have the open and close parentheses — length() — associated with calling a function.
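    A short sketch that puts the two side by side may help the distinction stick. The class name LengthDemo and the totalCharacters() helper are made up for this example.

```java
// Hypothetical example: array length (a field) versus String length() (a method)
class LengthDemo {

  // Adds up the length() of every String in the array
  static int totalCharacters(String[] words) {
    int total = 0;
    for (int i = 0; i < words.length; i++) { // array: length, no parentheses
      total += words[i].length();            // String: length(), with parentheses
    }
    return total;
  }

  public static void main(String[] args) {
    String[] words = {"The", "quick", "brown", "fox"};
    System.out.println(words.length + " words, " + totalCharacters(words) + " characters");
  }
}
```

    This prints "4 words, 16 characters". Writing words.length() or words[i].length is a compile error, so the compiler will flag a mix-up right away.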

    It’s also important to note that we can concatenate (i.e. join) Strings together using the “+” operator. With numbers, plus means add; with Strings (or a String and a character), it means concatenate, e.g.

    int num = 5 + 6; // ADDING TWO NUMBERS!
    String phrase = "To be" + " or not to be"; // JOINING TWO STRINGS!
    String anotherphrase = "Hell" + 'o'; // JOINING A STRING WITH A CHAR!
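    One wrinkle worth seeing in running code: "+" only concatenates when at least one side is a String. Two chars on their own are added as numbers, and a chain of "+" is evaluated left to right. The class and helper names below are invented for this sketch.

```java
// Hypothetical example: when "+" adds and when it concatenates
class PlusDemo {

  // Two chars with no String involved: "+" adds their numeric character codes
  static int addChars(char a, char b) {
    return a + b;
  }

  // Starting with an empty String forces concatenation
  static String joinChars(char a, char b) {
    return "" + a + b;
  }

  public static void main(String[] args) {
    System.out.println(1 + 2 + "3");         // 33  (1 + 2 is added first, then joined)
    System.out.println("1" + 2 + 3);         // 123 (a String on the left, so all joined)
    System.out.println(addChars('a', 'b'));  // 195 (97 + 98)
    System.out.println(joinChars('a', 'b')); // ab
  }
}
```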
    

    Splitting

    One String-related function that will prove very useful in our text analysis programs is split(). split() breaks a longer String into an array of Strings, cutting wherever a given delimiter appears.

    Examine the following code:

    String spaceswords = "The quick brown fox jumps over the lazy dog.";
    String list1[] = spaceswords.split(" ");
    System.out.println(list1[0]);
    System.out.println(list1[1]);
    
    String commaswords = "The,quick,brown,fox,jumps,over,the,lazy,dog.";
    String list2[] = commaswords.split(",");
    for (int i = 0; i < list2.length; i++) {
      System.out.println(list2[i] + " " + i);
    }
    
    //calculate sum of a list of numbers in a string
    String numbers = "8,67,5,309";
    String numlist[] = numbers.split(",");  // split() takes a String delimiter, not a char
    int sum = 0;
    for (int i = 0; i < numlist.length; i++) {
      sum = sum + Integer.parseInt(numlist[i]);  // Converting each String into an int
    }
    System.out.println(sum);
    

    To perform the reverse of split, we can write a quick function that joins together an array of Strings.

    String[] lines = {"It", "was", "a", "dark", "and", "stormy", "night."};
    

    Knowing about loops and arrays we could join the above array of strings together as follows:

    // Concatenating an array of Strings using the String class
    public String join(String str[], String separator) {
        String stuff = "";
        for (int i = 0; i < str.length; i++) {
          if (i != 0) stuff += separator;
          stuff += str[i];
        }
        return stuff;
    }
    
    // Concatenating an array of Strings using the StringBuffer class (thank you Ben Fry and the Processing source!)
    // Using a StringBuffer is better with really long Strings / arrays
    public String join(String str[], String separator) {
        StringBuffer buffer = new StringBuffer();
        for (int i = 0; i < str.length; i++) {
          if (i != 0) buffer.append(separator);
          buffer.append(str[i]);
        }
        return buffer.toString();
    }
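    To try join out, it needs to live inside a class with a main method. The sketch below reuses the StringBuffer version (made static so main can call it) and round-trips a sentence through join and split. The class name JoinDemo is an assumption for this example.

```java
// Hypothetical example: joining an array and splitting it back apart
class JoinDemo {

  // Same logic as the StringBuffer version of join, made static for use from main
  static String join(String[] str, String separator) {
    StringBuffer buffer = new StringBuffer();
    for (int i = 0; i < str.length; i++) {
      if (i != 0) buffer.append(separator);
      buffer.append(str[i]);
    }
    return buffer.toString();
  }

  public static void main(String[] args) {
    String[] lines = {"It", "was", "a", "dark", "and", "stormy", "night."};
    String sentence = join(lines, " ");
    System.out.println(sentence);           // It was a dark and stormy night.
    String[] again = sentence.split(" ");   // back to an array of 7 words
    System.out.println(join(again, "-"));   // It-was-a-dark-and-stormy-night.
  }
}
```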
    

    File Input and Output

    To start, we are going to be working in the simple world of text in and text out. We'll load some text from a file, analyze it, mess with it, etc. and then write some text back out to a file. Most of the examples you might find online read text from a file line by line. This is very useful for certain types of operations and you may want to investigate how to do this on your own (see: http://www.cafeaulait.org/slides/intljava/2001ny/javaio/57.html for an example). For our purposes, however, we're just going to read the text in character by character (or byte by byte). This can be accomplished using java.io or java.nio. java.nio is Java's "new and improved" input/output package that supposedly improves performance in buffer management, network and file I/O, regular-expression support, etc.
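    For comparison, here is a minimal sketch of the line-by-line approach using BufferedReader.readLine(), which returns null once the file is exhausted. It is not one of the course examples; the class name LineReaderDemo and the countLines() helper are invented for this illustration.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical example: reading a text file one line at a time
class LineReaderDemo {

  // Counts the lines in a file; readLine() returns null at the end of the file
  static int countLines(String filename) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    int count = 0;
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        count++;
      }
    } finally {
      reader.close(); // close the file even if reading fails part way
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    if (args.length < 1) {
      System.out.println("Usage: java LineReaderDemo input.txt");
      return;
    }
    System.out.println(args[0] + " has " + countLines(args[0]) + " lines");
  }
}
```

    Keep in mind that readLine() strips the line terminator, so a program that rebuilds the text needs to add line breaks back in.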

    Simple File I/O with the old java.io package

    // Simple File Input and Output using "old" I/O
    // Daniel Shiffman
    // Programming from A to Z, Spring 2006
    
    // Input file is the first argument passed from the command line
    // Output file is the second
    // This could be improved with some basic error handling (what if an invalid filename is entered, etc.?)
    
    import java.io.*;
    
    public class SimpleFileIO2 {
      public static void main (String[] args) throws IOException {
    
        // Read the file into a String, character by character
        // (We could read it line by line with BufferedReader)
        // This could also be greatly improved using StringBuffer
        FileReader in = new FileReader(new File(args[0]));
        String content = "";
        int c;
        while ((c = in.read()) != -1)  {
          content += (char) c;
        }
        in.close();
    
    
        // Do our fancy string editing stuff here
    
        // Write out a file with the content, character by character
        FileWriter out = new FileWriter(new File(args[1]));
        for (int i = 0; i < content.length(); i++) {
          out.write(content.charAt(i));
        }
        out.close();
      }
    }
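    The comments in the listing above mention two possible improvements: using a StringBuffer and handling errors. The sketch below shows one way those might look. It is not the course's solution; the class name SafeFileIO and the readFile() / writeFile() helpers are assumptions made for this example.

```java
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical example: file copy with a StringBuffer and basic error handling
class SafeFileIO {

  // Reads a whole file into a String, appending each character to a StringBuffer
  static String readFile(String filename) throws IOException {
    FileReader in = new FileReader(filename);
    StringBuffer content = new StringBuffer();
    try {
      int c;
      while ((c = in.read()) != -1) {
        content.append((char) c);
      }
    } finally {
      in.close(); // always release the file
    }
    return content.toString();
  }

  // Writes the given text to a file, replacing any existing contents
  static void writeFile(String filename, String content) throws IOException {
    FileWriter out = new FileWriter(filename);
    try {
      out.write(content);
    } finally {
      out.close();
    }
  }

  public static void main(String[] args) {
    if (args.length < 2) {
      System.err.println("Usage: java SafeFileIO input.txt output.txt");
      return;
    }
    try {
      String content = readFile(args[0]);
      // Do our fancy string editing stuff here
      writeFile(args[1], content);
    } catch (FileNotFoundException e) {
      System.err.println("Could not open file: " + e.getMessage());
    } catch (IOException e) {
      System.err.println("Problem reading or writing: " + e.getMessage());
    }
  }
}
```

    FileNotFoundException is caught first because it is a subclass of IOException; listing the catch blocks the other way around would not compile.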
    

    Simple File I/O with the new java.nio package

    // Simple File Input and Output using "new" I/O
    // Daniel Shiffman
    // Programming from A to Z, Spring 2006
    // Based off of code from Java Regular Expressions by Mehran Habibi
    
    // Input file is the first argument passed from the command line
    // Output file is the second
    // This could be improved with some basic error handling (what if an invalid filename is entered, etc.?)
    
    import java.io.*;
    import java.nio.*;
    import java.nio.channels.*;
    
    public class SimpleFileIO {
      public static void main (String[] args) throws IOException {
    
        // Create an input stream and file channel
        // Using first argument as file name to read in
        FileInputStream fis = new FileInputStream(args[0]);
        FileChannel fc = fis.getChannel();
    
        // Read the contents of a file into a ByteBuffer
        ByteBuffer bb = ByteBuffer.allocate((int)fc.size());
        fc.read(bb);
        fc.close();
    
        // Convert ByteBuffer to one long String
        String content = new String(bb.array());
    
        // Conceivably we would now mess with the string here
        // Doing all sorts of fun stuff
    
        // Create an output stream and file channel
        // Using second argument as file name to write out
        FileOutputStream fos = new FileOutputStream(args[1]);
        FileChannel outfc = fos.getChannel();
    
        // Convert content String into ByteBuffer and write out to file
        bb = ByteBuffer.wrap(content.getBytes());
        outfc.write(bb);
        outfc.close();
      }
    }
    

    Once we’ve gotten the hang of reading and writing files, we can start to think about ways of creating output text based on an input text. For example, we could do something as simple as make a new text with every other word from a source text. To do this, we can split the text up into an array of Strings (with space as a delimiter) and create a new String by appending every other word to it. StringBuffer is good to use in case we are dealing with really long texts.

     //Split text by wherever there is a space
    String[] words = content.split(" ");
    StringBuffer everyotherword = new StringBuffer();
    for (int i = 0; i < words.length; i+=2) {
       String word = words[i];
       everyotherword.append(word + " ");
    }
    

    Using the Nigerian Spam as a source text, the result is something like:

    On 12th, a contractor the co-orporation, Kingdom Olaf made time
    Deposit  twelve months, at US$ (Seventeen Three Hundred fifty
    Thousand only) my maturity,I a notification his address but no
    After month, sent reminder finally from contract the Pertroleum
    co-orporation Mr.Olaf died an accident further found that died
    making WILL,and attempts his of was therefore further and
    that Olaf
    

    Another thing we might try is to search for every time a certain word appears. The following code examines a text for every time the word “God” appears and keeps the word “God” along with what follows it:

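    StringBuffer gods = new StringBuffer();
    // Check every word; stop one word early so that words[i+1] always exists
    for (int i = 0; i < words.length - 1; i++) {
       if (words[i].equals("God")) {
          gods.append(words[i] + " " + words[i+1] + "\n");
       }
     }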
    

    The result applied to Genesis from the Bible looks something like:

    God Almighty
    God forbid
    God hath
    God did
    God hath
    God of
    God Almighty
    God make
    God of
    God of
    God meant
    God will
    

    We could also reverse all the characters in a text, by walking through the String backwards. Note how the for loop starts at the end of the String (content.length() -1).

    StringBuffer reverse = new StringBuffer();
    for (int i = content.length()-1; i >= 0; i--) {
       char c = content.charAt(i);
       reverse.append(c);
    }
    

    The result applied to the Nigerian Spam looks something like:

    rof %5 dna uoy rof %53 dna em rof %06 fo oitar eht ni erahs ot su
    rof tnuocca ruoy otni diap eb lliw yenom ehT .refsnart eht rof rovaf
    ruoy ni noitartsinimda/etaborp fo rettel dna stnemucod yrassecen eht
    niatbo ot dna LLIW eht fo noitaziraton dna gnitfard rof yenrotta na
    fo secivres eht yolpme llahs eW .nik fo txen eht sa ecalp ni uoy tup
    lliw taht stivadiffa dna stnemucod yrassecen eht eraperp lliw yenrotta
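    As an aside, StringBuffer also has a built-in reverse() method that does the same job in one call. The sketch below places it next to the manual loop so the two can be compared; the class name ReverseDemo and both helper names are invented for this example.

```java
// Hypothetical example: reversing a String by hand and with StringBuffer.reverse()
class ReverseDemo {

  // The manual approach: walk the String backwards, appending each character
  static String reverseByLoop(String content) {
    StringBuffer reverse = new StringBuffer();
    for (int i = content.length() - 1; i >= 0; i--) {
      reverse.append(content.charAt(i));
    }
    return reverse.toString();
  }

  // The shortcut: let StringBuffer do the work
  static String reverseBuiltIn(String content) {
    return new StringBuffer(content).reverse().toString();
  }

  public static void main(String[] args) {
    System.out.println(reverseByLoop("Hello World"));  // dlroW olleH
    System.out.println(reverseBuiltIn("Hello World")); // dlroW olleH
  }
}
```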
    

    Analysis

    We’ll end this week by looking at a basic example of text analysis. We will read in a file, examine some of its statistical properties, and write out a new file that will contain our report. Our example will compute the Flesch Index (aka Flesch-Kincaid Reading Ease test), a numeric score that indicates the readability of a text. The lower the score, the more difficult the text. The higher, the easier. For example, texts with a score of 90-100 are, say, around the 5th grade level, whereas 0-30 would be for “college graduates”. The result of the test on a few sample texts (the Bible, spam, a New York Times article, and Processing tutorials I’m writing) are displayed to the right.

    The Flesch Index is computed as a function of total words, total sentences, and total syllables. It was developed by Dr. Rudolf Flesch and modified by J. P. Kincaid (thus the joint name). Most word processing programs (MS Word, Corel Wordperfect, etc.) will compute the Flesch Index for you, which provides us with a nice method to check our results.

    Flesch Index = 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words)
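    To get a feel for the numbers, here is a worked example as a small sketch. It uses the standard Flesch Reading Ease form, in which both the words-per-sentence term and the syllables-per-word term are subtracted (this matches the calculation in the code later in this section). The counts (100 words, 5 sentences, 150 syllables) are made up, as are the class and method names.

```java
// Hypothetical example: applying the Flesch formula to made-up counts
class FleschFormulaDemo {

  // Longer sentences and longer words both pull the score down
  static double fleschIndex(int words, int sentences, int syllables) {
    double wordsPerSentence = (double) words / sentences;
    double syllablesPerWord = (double) syllables / words;
    return 206.835 - 1.015 * wordsPerSentence - 84.6 * syllablesPerWord;
  }

  public static void main(String[] args) {
    // 100 words in 5 sentences -> 20 words per sentence
    // 150 syllables in 100 words -> 1.5 syllables per word
    // 206.835 - (1.015 * 20) - (84.6 * 1.5) = 206.835 - 20.3 - 126.9
    System.out.println(fleschIndex(100, 5, 150));
  }
}
```

    The result is roughly 59.6. The casts to double matter: without them, Java would perform integer division and 150 / 100 would come out as 1.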

    Our pseudo-code will look something like this:

    1) Read input file into String object
    2) Count words
    3) Count syllables
    4) Count sentences
    5) Apply formula
    6) Write out report file
    

    We know we can read in text from a file and store it in a Java String object as demonstrated in the example above. Now, all we have to do is examine that String object, counting the total words, sentences, and syllables, applying the formula as a final step. To count words, we’ll use the StringTokenizer. (It should be noted that the StringTokenizer is a legacy class; split() should be used instead.) However, before we get to next week (and for nostalgia), we’re going to solve the Flesch Index problem in a highly manual way, using the Tokenizer. Next week, you’ll be exposed to more advanced String parsing techniques using regular expressions.

    The first thing we’ll do is count the number of words in the text. We’ve seen in some of the examples above that we could accomplish this by using the “split” function; the StringTokenizer works in a similar way. To create a StringTokenizer, the constructor receives the String you want to tokenize as well as a set of delimiters (the characters that indicate where a token ends, and a new token begins). You may be asking, what the heck is a token?? In our case, we want to split the String up into words, so each word is one “token.” Ok, so step one (creating the Tokenizer) looks like this:

    String delimiters = ".,':;?{}[]=-+_!@#$%^&*() ";
    StringTokenizer tokenizer = new StringTokenizer(content,delimiters);
    

    We could have simplified our lives by just using space (" ") as the delimiter, but here we’re saying that any of the punctuation characters listed indicates the end of one word and the start of another. Now we just need to march through all the words (tokens) and count their syllables.

    while (tokenizer.hasMoreTokens())
    {
      String word = tokenizer.nextToken();
      syllables += countSyllables(word);
      words++;
    }
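    Setting syllables aside for a moment, the tokenizer loop on its own can be tried out as a plain word counter. This is a minimal sketch using the same delimiter list; the class name WordCountDemo and the countWords() helper are assumptions for this example. Note the import: StringTokenizer lives in java.util.

```java
import java.util.StringTokenizer;

// Hypothetical example: counting words with a StringTokenizer
class WordCountDemo {

  // Counts the tokens found between any of the delimiter characters
  static int countWords(String content, String delimiters) {
    StringTokenizer tokenizer = new StringTokenizer(content, delimiters);
    int words = 0;
    while (tokenizer.hasMoreTokens()) {
      tokenizer.nextToken(); // advance to the next word
      words++;
    }
    return words;
  }

  public static void main(String[] args) {
    String delimiters = ".,':;?{}[]=-+_!@#$%^&*() ";
    String content = "Hello, world! It's 9 o'clock.";
    System.out.println(countWords(content, delimiters)); // 7
    System.out.println(content.split(" ").length);       // 5
  }
}
```

    The two counts differ because the apostrophe is in the delimiter list, so "It's" becomes the two tokens "It" and "s", and "o'clock" becomes "o" and "clock". Whether that is desirable depends on the analysis; it is a side effect of this particular delimiter set.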
    

    Ok, so “countSyllables” isn’t a function that exists anywhere in Java. We’re going to have to write it ourselves. The following method is not the most accurate way to count syllables, but it will do for now.

    Syllables = total # of vowels in a word (not counting vowels that appear after another vowel and when ‘e’ is found at the end of the word), i.e.:

  • “beach” –> one syllable
  • “banana” –> three syllables
  • “home” –> one syllable
    Our code looks like this:

    // A method to count the number of syllables in a word
    // Pretty basic, just based off of the number of vowels
    // This could be improved
    public static int countSyllables(String word) {
        int      syl    = 0;
        boolean  vowel  = false;
        int      length = word.length();
    
        //check each word for vowels (don't count more than one vowel in a row)
        for(int i=0; i < length ; i++) {
          if        (isVowel(word.charAt(i)) && (vowel==false)) {
            vowel = true;
            syl++;
          } else if (isVowel(word.charAt(i)) && (vowel==true)) {
            vowel = true;
          } else {
            vowel = false;
          }
        }
    
        char tempChar = word.charAt(word.length()-1);
        //check for 'e' at the end, as long as not a word w/ one syllable
        if (((tempChar == 'e') || (tempChar == 'E')) && (syl != 1)) {
          syl--;
        }
        return syl;
    }
    
    //check if a char is a vowel (count y)
    public static boolean isVowel(char c) {
        if      ((c == 'a') || (c == 'A')) { return true;  }
        else if ((c == 'e') || (c == 'E')) { return true;  }
        else if ((c == 'i') || (c == 'I')) { return true;  }
        else if ((c == 'o') || (c == 'O')) { return true;  }
        else if ((c == 'u') || (c == 'U')) { return true;  }
        else if ((c == 'y') || (c == 'Y')) { return true;  }
        else                               { return false; }
    }
    

    Again, this could be vastly improved using Regular Expressions, but it’s nice as an exercise to learn how to do all the String manipulation manually before we move on to more advanced techniques.
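    Staying with manual String handling, the same vowel-counting rule can be written a little more compactly and checked against the three example words. This is a sketch, not the code from FleschIndex.java: the class name SyllableDemo is invented, and isVowel() here uses indexOf() on a String of vowels instead of a chain of comparisons.

```java
// Hypothetical example: a compact version of the vowel-counting rule
class SyllableDemo {

  // A char is a vowel if it appears in this list (y is counted as a vowel)
  static boolean isVowel(char c) {
    return "aeiouyAEIOUY".indexOf(c) != -1;
  }

  // Counts groups of consecutive vowels, then discounts a trailing 'e'
  static int countSyllables(String word) {
    int syl = 0;
    boolean previousWasVowel = false;
    for (int i = 0; i < word.length(); i++) {
      boolean vowel = isVowel(word.charAt(i));
      if (vowel && !previousWasVowel) {
        syl++; // first vowel of a new group
      }
      previousWasVowel = vowel;
    }
    if (word.length() > 0) {
      char last = word.charAt(word.length() - 1);
      if ((last == 'e' || last == 'E') && syl != 1) {
        syl--; // silent 'e', unless it is the word's only syllable
      }
    }
    return syl;
  }

  public static void main(String[] args) {
    System.out.println(countSyllables("beach"));  // 1
    System.out.println(countSyllables("banana")); // 3
    System.out.println(countSyllables("home"));   // 1
  }
}
```

    The length check before charAt() is an addition that guards against an empty String; the tokenizer never produces one, but the method may be called from elsewhere.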

    Counting sentences is simple. We’ll just tokenize the content using periods, question marks, exclamation points, etc. (“.:;?!”) as delimiters and count the total number of tokens. This isn’t terribly accurate; for example, “My e-mail address is daniel.shiffman@nyu.edu.” will be counted as three sentences. Nevertheless, as a first pass, this will do. . .

     //look for sentence delimiters
    String sentenceDelim = ".:;?!";
    StringTokenizer sentenceTokenizer = new StringTokenizer(content,sentenceDelim);
    sentences = sentenceTokenizer.countTokens();
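    The e-mail example is easy to confirm in a small test program. The sketch below wraps the same tokenizer call in a helper; the class name SentenceCountDemo and the countSentences() method are made up for this illustration.

```java
import java.util.StringTokenizer;

// Hypothetical example: counting sentences by tokenizing on end punctuation
class SentenceCountDemo {

  // Every stretch of text between delimiters counts as one sentence
  static int countSentences(String content) {
    StringTokenizer sentenceTokenizer = new StringTokenizer(content, ".:;?!");
    return sentenceTokenizer.countTokens();
  }

  public static void main(String[] args) {
    System.out.println(countSentences("It was dark. Was it stormy? Yes!")); // 3
    // The periods inside the address are treated as sentence endings too
    System.out.println(countSentences("My e-mail address is daniel.shiffman@nyu.edu.")); // 3
  }
}
```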
    

    Now, all we need to do is apply the formula, generate a report. . .

    //calculate flesch index
    final float f1 = (float) 206.835;
    final float f2 = (float) 84.6;
    final float f3 = (float) 1.015;
    float r1 = (float) syllables / (float) words;
    float r2 = (float) words / (float) sentences;
    float flesch = f1 - (f2*r1) - (f3*r2);
    
    //Write Report
    String report = "";
    
    report += "Total Syllables: " + syllables + "\n";
    report += "Total Words    : " + words + "\n";
    report += "Total Sentences: " + sentences + "\n";
    report += "Flesch Index   : " + flesch + "\n";
    System.out.println(report);
    

    . . . and we’re done!

    The full example code is here: FleschIndex.java

    Related Perl and PHP examples

    Perl:

    #!/usr/bin/perl
    
    # Reads a text and prints out wherever "I" appears along with the word following it
    # call it from the command line, like so:
    # perl I.pl input.txt >> output.txt
    # uses regular expressions, which we will cover next week
    
    while (<>) {
      if ($_ =~ m/ (I\s\S+)/) {
        print("$1 \n");
      }
    }
    

    PHP:
    Run it: http://shiffman.net/itp/classes/a2z/week01/inputoutput.php
    Source: http://shiffman.net/itp/classes/a2z/week01/inputoutput.phps
