Compiling regular expressions to Java bytecodes using gnu.bytecode

(Alternative title: The gnu.bytecode toolkit for Java class files

Author: Per Bothner

Java class files are the basis for Java technology. They are a binary file format created by javac and other programs. Sometimes it is useful to generate a class file on the fly. The example we will use in this class in compiling simple regular expressions into custom classes. We will use the the gnu.bytecode package for generating Java classes.

There are a number of other toolkits for generating bytecode. Perhaps the best known is BCEL, which is part of Apache's Jakarta project. BCEL represents each instruction using a separate object, which is useful when analyzing and tweaking existing code, but is not efficient when you primarily want to generate good code quickly, which is the emphasis of gnu.bytecode. It can do so quickly and using little memory because appends instructions to byte arrays rather than to linked lists. In spite of that it is very convenient to use, and has lots of little conveniences to help pick good bytecode.

Why generate classes?

There are various reasons you might want to generate bytecode instead of using javac to do the job for you:

Compiling programs in some programming language: The gnu.bytecode package was written as the code generator for the Kawa Scheme-to-bytecode compiler. Since then Kawa has been generalized to support multiple languages, including the Qexo implementation of XQuery (a new XML Query language from W3C), and the ELisp extension language used by Emacs (see JEmacs). Kawa uses a high-level package (gnu.expr) that works on the level of abstract syntax trees and calls down to gnu.bytecode. But that's a topic for another article. For a slightly out-of-date introduction to Kawa internals see here.
Customization/specialization, with known value: Suppose some method depends on two parameters, and you're going to call it many times with the first parameter fixed. In come cases you can get a big payoff by creating a custom method that takes advantage of the fact that one parameter is constant.
Customization/specialization, with known type: Suppose you have a generic protocol or algorithm. It might involve serialization, data base access, or reflection. If you know the type of the data, you might generate custom code that takes advantage of that knowledge. For example, accessing a field directly (using a getfield JVM opcode) is much faster than using reflection (java.lang.reflect.Field).

All of these are special cases of compiling rather than interpretation, or as some academics call it partial evaluation.

Class files

We won't say much about the actual structure of a .class file. You read about it in The Java Virtual Machine Specification (which is available online). The main point to note is that a .class file is the compiled representation of a single Java class. Whether a class is public or private, top-level, anonymous, or inner, in all cases the javac compiler creates a single .class file for each Java class. (If the class is anonymous, then javac will generate a name for it.) Also, each .class file is self contained. When a class needs to reference some other class (for example to specify the type of a field) the .class file just contains the name of the other class. Tools that read a .class file are responsible for finding the name class by searching the class path.

A .class file has two main parts:

The constant pool contains names (including those of any referenced classes), and constants.
Specifications for each field and method of the class.

Normally only developers see .class files. Instead, Java applications are distributed as .jar files. However, a .jar files is basically just a compressed collection of .class files and resource (data) files.

Compiling regular expressions

Matching a regular expression against a target string is a common operation. Usually you will match the same regular expression against many strings - for example all the lines of a set of files. So many implementations do the matching in two passes:

Translate the regular expression (as given by a human-readable pattern like [a-zA-Z][a-zA_Z0-9]*) into some internal, more efficient representation.
Match the internal representation against the target string(s).

The JDK 1.4 java.util.regex package follows this model, where the translated internal form is a Pattern object.

The fastest matcher you can write, while still sticking to portable Java, is a custom match method created specifically from a given regular expression. You can do this be creating a custom Java class for a given regular expression. One can translate it into a Java program, which then gets compiled by javac. This is how JSP (JavaServer Pages) translate a .jsp page to a .java program, and thence to a compiled servlet executing inside a server. It works OK, but it has the overhead that you have to generate the .java program, write it to a file, invoke javac, and then load the resulting .class files. This typically takes a few seconds. This article explains how you can can generate the .class files directly without using javac, or even not create any files at all.

For simplicity, we will stick to a simpler form of regular expressions called glob patterns, which is mostly used in shells to specific filenames:

*.java
aa*bb
holiday??.jpg

Only two characters are special: '?' matches any single character, and '*' matches any number of (zero or more) arbitrary characters. Traditional glob patterns also support character sets: '[_a-z]' matches any lower-case letter or '_'. It is easy to extend the sample code to handle such character sets, but for simplicity we'll leave that as an exercise.

Getting started

We define an API for compiled regular expression patterns in terms of the following Pattern class:

public abstract class Pattern
{
  public abstract boolean match(String string, int pos, int limit);

  /** True if this Pattern matches the argument string. */
  public final boolean match(String string)
  {
    return match(string, 0, string.length());
  }
}

For a glob pattern like "a*b" we want to generate a class with a custom 'match' method that returns true only for those strings that match the pattern. This class does the job:

/** Custom class that matches the pattern "a*b". */
class a_star_b
{
  public boolean match(String str, int pos, int limit)
  {
    return limit >= pos + 2
      && str.charAt(pos) == 'a'
      && str.charAt(limit-1) == 'b';
  }
}

[This should probably be moved, or be a side-bar.] One very useful tool is a bytecode dis-assembler, JDK comes with the 'javap' dis-assembler. The 'gnu.bytecode' toolkit has its own disassembler, which may be more useful, because it shows *all* the information in a .class file. For example, this command:

java gnu.bytecode.dump a_star_b.class > a_star_b.dump

writes to a_star_b.dump a detailed dis-assembly of the a_star_b class generated for the a_star_b class. (See [Listing or Resources] for the actual listing.) Studying such listings is useful to learn what bytecode is generated from given Java source code. It is also very useful when debugging a code generator!

Generating a `Pattern`-derived class

Writing the a_star_b class by hand is all very well, but of course our real goal is to have such classes be auto-generated from a glob pattern. The code for doing so is in the 'compile' method of the CompileRegexp class. It takes a glob pattern, and returns a generated class. Let's look at what 'compile' does.

The gnu.bytecode.ClassType class is used to represent classes. A ClassType can be any of:

A class that the JVM knows about (and there is a corresponding Class object). In that case gnu.bytecode gets information about the class by using reflection.
A .class file in the file system that we've read in and parsed (using the .class-parsing features of gnu.bytecode).
A class that we're in the process of constructing.

The variable 'genClass' (see Listing) is an instance of the latter, We call the new class gen_matcher. Because the class will be local to its own ClassLoader, we don't need to worry about generating a unique name. After allocating the ClassType, we make it public, and set the super-class to the existing class Pattern.

We then must declare the fields and methods of gen_matcher. There are no fields, so we start with gen_matcher's constructor. If you don't define any constructors in your Java program, then javac will define a "default constructor" for you. However, it still has to be the .class file, and gnu.bytecode doesn't (currently) do it for you. The JVM internally uses the name <init> for constructors, so the first step is to add a method with that name and appropriate type to gen_matcher. Since this is not an abstract method, we need to generate a code body for it. In a .class file the code associated with a method is stored in a "Code" attribute belonging to the method. The 'startCode' invocation creates and initializes the Code attribute, and makes it ready for us to generate code.

A default constructors just calls the no-argument constructor of its super-class. With the 'emitThis' method we're finally generating executable bytecodes: It generates an 'aload_0' bytecode, which (if you check the JVM specification) takes the this pointer in local variable 0 and pushes into Java's argument stack. By convention the 'this' pointer is passed in local variable 0. We then call emitInvoke, which generates a call to the 'invokespecial' operation, calling the <init> method of the Pattern class. Finally, we have to emit an instruction to return from the constructor.

Compiling the '`match`' method

The meat of the project is generating the 'match' method for a given pattern. We're going to use a simple-minded algorithm that just handles '?' (any single character), '*' (any number of arbitrary characters), and other characters (match themselves). We'll keep it simple, and avoid cleverness, but we still have to support back-tracking. Consider matching "a*bc" against "abcbc". The first 'bc' will match the 'bc' in the pattern, but then we fail to match the end of the string, so we have to back up, trying to match the '*' against first '', then 'b', then 'bc' before the 'bc' pattern matches.

The 'match' method takes 3 arguments, not counting the implied 'this' argument. We create a variable> for each argument, so that we can refer to the appropriate argument when generating code. Thus the variable 'string' refers to the JVM local variable 1, which is where the JVM passes the first argument (after the implicit 'this' parameter).

The basic structure of how we compile 'match' is straight forward. We look through each character of the pattern, and generate code that matches some part of the input string. Let's skip handling of '*' for now. For '?' and for normal characters the logic is simple: We generate code to compare the current value of 'curPos' and the 'limit'. If they are equal, we go to the 'onFail' label. (Note that while the Java language doesn't have goto, the Java bytecode instructions set has both conditional and unconditional goto.) Unless we see a '*' in the pattern, the 'onFail' label is the same as the 'failureReturn' label, which is defined near the end of the method. If control transfers to 'failureReturn', the JVM will return 'false'. Otherwise, if (the runtime value of) curPos is less than (the runtime value of) limit, and if the pattern char isn't '?' then we have to generate code that compares the value of string.charAt(curPos) against the pattern character ''ch'. If they don't match, we again go to onFail. Finally, we emit code to increment the 'curPos' local variable.

Handling `'*'` and backtracking

The tricky part of the 'compile' method is how we handle '*'. The basic idea is that we compile "A*BC*D" as a a pair of nested loops. Each loop tries to match '*' against successively longer sub-strings of the input. Each time we get to a failure point, we do a 'continue', which backs up to the previous '*', and we try a longer match.

The following shows the control flow we want to generate from A*BC*D:

class match_A_star_BC_star_D extends Pattern
{
  public boolean match (String string, int curPos, int limit)
  {
  L0:
    {
      if (curPos == limit) break L0;
      if (string.charAt(curPos++) != 'A') break L0;
    L1:
      for (int savePos1 = curPos;  savePos1 < limit;  savePos1++)
	{
	  curPos = savePos1;
	  if (curPos == limit) continue L1;
	  if (string.charAt(curPos++) != 'B') continue L1;
	  if (curPos == limit) continue L1;
	  if (string.charAt(curPos++) != 'C') continue L1;
	L2:
	  for (int savePos2 = curPos;  savePos2 < limit;  savePos2++)
	    {
	      curPos = savePos2;
	      if (curPos == limit) continue L2;
	      if (string.charAt(curPos++) != 'D') continue L2;
	      if (curPos == limit)
		return true;
	    }
	}
    }
    return false;
  }
}

Note this code is not optimal. It is easy to replace the inner loop by a single test, as it can only succeed when curPos==limit-1. But I'm showing it this way so you can better see the the general idea. Each time 'compile' emits a jump to 'onFail' it is either a jump to the outermost failureReturn label (if we haven't seen a '*' yet), or it is a previously seen 'onFailHere'. The former correspond to the initial break statements in match_A_star_BC_star_D's match method, while the latter correspond to the continue statements.

The actual 'compile' method contains an optimization for the last '*': In that case, any remaining pattern characters can only match against a tail end of the substring that has the same length as the remainder of the pattern. So, the run-time matcher just needs to jump directly to that many characters from the end of the input string (after checking that there are enough characters).

The insight here is that we can put in as many smarts, handling of special cases, and optimization into the compiler as is worthwhile. You can also put in short-cuts and tricks if your matcher is an interpreter, but you have to consider that the test to see whether a short-cut is possible can take more time than you save by using the short-cut! If you can test for the short-cut at compile-time (and obviously some short-cuts you have to test for at run-time), then you can afford to consider more short-cuts.

Emitting class files

Once you have a ClassType object, and have added all its fields and methods, what do you do with it?

You can write out a ClassType to a file, by calling the writeToFile method:

CompileRegexp.compile(pattern).writeToFile(filename);

This takes all the information in the ClassType object, "serializes" it to a bunch of bytes, and writes those bytes out to a .class file with the given filename. (If you leave out the filename, the default is as you'd expect based on the class file followed by ".class".) Such a .class file can be loaded by a JVM just like any .class file generated by javac. You can even write Java classes that extend a generated .class file, in which case javac will read the information it needs from the .class file you've generated, just as it reads any other .class file.

The Kawa framework (URL) uses writeToFile when compiling a Scheme program to a .class file. If you un-comment the call to writeToFile in CompileRegexp's main method, you will get a .class file. You can then look at the class using gnu.bytecode.dump as mentioned earlier. This is useful for debugging, or just plain to verify that you're getting what you expect.

Classloaders

The ultimate user of a .class file is normally a Java Virtual Machine. How does the Java VM use .class or .jar files? A *class loader* handles the process of turning a .class file (which is just a collection of bytes) into an instance of java.lang.Class (which is a prerequisite before you can call the class's methods or create instances). A class loader is an instance of (a sub-class of) ClassLoader. The default "system" class loader extracts a .class file from a .jar file that it finds using a "class path". However, you have the option of by-passing the file system and .class files completely, using a custom ClassLoader. In that case, you can have gnu.bytecode construct a byte array that has the same encoding as a .class file, but it is just a Java array object and you never write it to a file. You can pass such an array to the standard defineClass method of the CloassLoader class, which will take the byte array and create a Class object. The defineClass method is protected, since it should only be used by a custom ClassLoader that manages the mapping from class name to class objects. The gnu.bytecode includes the ArrayClassLoader class which does for you. The code in the main method of CompileRegexp show how you can take the ClassType you've created, pass it to an ArrayClassLoader, load the generated class, and create an instance 'rexp' of this new class. We cast the new object to Pattern, so we can actually use it.

Conclusion

We skipped a number of features of gnu.bytecode, including creating fields, emitting debugging information (line numbers and local variable information), and support for more conplex constructs like switches and exceptions. I'm hoping that this article has shown you how you can compile new classes as needed. [[Future articles will show how you can implement an entire compiler using gnu.bytecode as the code generator.]]

API documentation for gnu.bytecode: http://www.gnu.org/software/kawa/api/gnu/bytecode/package-summary.html

Per Bothner See http://per.bothner.com/papers for list of publications (mostly conference papers). Original technical lead of GCJ, the GNU COmpiler for the Java platform, and the implementor of the Kawa toolkit, which includes gnu.bytecode.

Contact information: Per Bothner 18651 Cynthia Avenue Cupertino CA 95014 per@bothner.com http://per.bothner.com (408) 777-9211