(Alternative title: The gnu.bytecode toolkit for Java class files
Java class files are the basis for Java technology
.
They are a binary file format created by javac
and other programs.
Sometimes it is useful to generate a class file on the fly
.
The example we will use in this class in compiling simple
regular expressions into custom classes. We will use the
the
package for generating Java classes.
gnu.bytecode
There are a number of other toolkits for generating bytecode.
Perhaps the best known is BCEL,
which is part of Apache's Jakarta project.
BCEL represents each instruction using a separate object, which is useful
when analyzing and tweaking existing code, but is not efficient
when you primarily want to generate good code quickly,
which is the emphasis of gnu.bytecode
.
It can do so quickly and using little memory
because appends instructions to byte arrays rather than to linked lists.
In spite of that it is very convenient to use, and has
lots of little conveniences to help pick good bytecode.
There are various reasons you might want to generate bytecode
instead of using javac
to do the job for you:
Compiling programs in some programming language:
The gnu.bytecode
package was written as the code generator
for the Kawa
Scheme-to-bytecode compiler. Since then Kawa
has been generalized to support multiple languages,
including the Qexo
implementation of XQuery
(a new XML Query language from W3C), and the ELisp extension language
used by Emacs (see JEmacs).
Kawa uses a high-level package (gnu.expr
) that works on the
level of abstract syntax trees and calls down to gnu.bytecode.
But that's a topic for another article. For a slightly out-of-date
introduction to Kawa internals see
here.
Customization/specialization, with known value: Suppose some method depends on two parameters, and you're going to call it many times with the first parameter fixed. In come cases you can get a big payoff by creating a custom method that takes advantage of the fact that one parameter is constant.
Customization/specialization, with known type: Suppose you have
a generic protocol or algorithm. It might involve serialization,
data base access, or reflection. If you know the type of the data,
you might generate custom code that takes advantage of that knowledge.
For example, accessing a field directly (using a
JVM
opcode) is much faster than using reflection (getfield
java.lang.reflect.Field
).
All of these are special cases of compiling rather than interpretation,
or as some academics call it partial evaluation
.
We won't say much about the actual structure of a .class
file.
You read about it in The Java Virtual Machine Specification
(which is available online).
The main point to note is that a
.class
file is the compiled representation of a single Java
class. Whether a class is public or private, top-level, anonymous,
or inner, in all cases the javac
compiler creates a single
.class
file for each Java class.
(If the class is anonymous, then
javac
will generate a name for it.)
Also, each .class
file is self contained.
When a class needs to reference some other class (for example to
specify the type of a field) the .class
file just contains the name
of the other class. Tools that read a .class
file are responsible
for finding the name class by searching the class path.
A .class file has two main parts:
Normally only developers see .class files. Instead, Java applications are distributed as .jar files. However, a .jar files is basically just a compressed collection of .class files and resource (data) files.
Matching a regular expression against a target string is a common operation. Usually you will match the same regular expression against many strings - for example all the lines of a set of files. So many implementations do the matching in two passes:
Translate the regular expression (as given by a human-readable pattern
like
) into some internal,
more efficient representation.[a-zA-Z][a-zA_Z0-9]*
Match the internal representation against the target string(s).
The JDK 1.4 java.util.regex
package follows this model, where
the translated internal form is a Pattern
object.
The fastest matcher you can write, while still sticking to portable
Java, is a custom match method created specifically from a given
regular expression. You can do this be creating a custom Java
class for a given regular expression. One can translate it into
a Java program, which then gets compiled by javac
. This is how
JSP (JavaServer Pages) translate a .jsp page to a .java program,
and thence to a compiled servlet executing inside a server. It
works OK, but it has the overhead that you have to generate the
.java program, write it to a file, invoke javac, and then load
the resulting .class files. This typically takes a few seconds.
This article explains how you can can generate the .class files
directly without using javac
, or even not create any files at
all.
For simplicity, we will stick to a simpler form of regular expressions
called glob patterns
, which is mostly used in shells to specific
filenames:
*.java aa*bb holiday??.jpgOnly two characters are special:
'?'
matches any single character, and '*' matches any
number of (zero or more) arbitrary characters. Traditional
glob patterns also support character sets: '
[_a-z]
' matches
any lower-case letter or '_'
. It is easy to extend the
sample code to handle such character sets, but for simplicity
we'll leave that as an exercise.
We define an API for compiled regular expression patterns in
terms of the following Pattern
class:
public abstract class Pattern { public abstract boolean match(String string, int pos, int limit); /** True if this Pattern matches the argument string. */ public final boolean match(String string) { return match(string, 0, string.length()); } }
For a glob pattern like "a*b" we want to generate a class with a custom 'match' method that returns true only for those strings that match the pattern. This class does the job:
/** Custom class that matches the pattern "a*b". */ class a_star_b { public boolean match(String str, int pos, int limit) { return limit >= pos + 2 && str.charAt(pos) == 'a' && str.charAt(limit-1) == 'b'; } }
[This should probably be moved, or be a side-bar.] One very useful tool is a bytecode dis-assembler, JDK comes with the 'javap' dis-assembler. The 'gnu.bytecode' toolkit has its own disassembler, which may be more useful, because it shows *all* the information in a .class file. For example, this command:
java gnu.bytecode.dump a_star_b.class > a_star_b.dumpwrites to a_star_b.dump a detailed dis-assembly of the a_star_b class generated for the a_star_b class. (See [Listing or Resources] for the actual listing.) Studying such listings is useful to learn what bytecode is generated from given Java source code. It is also very useful when debugging a code generator!
Pattern
-derived class
Writing the a_star_b
class by hand is all very well, but of course
our real goal is to have such classes be auto-generated from a
glob pattern. The code for doing so is in the 'compile
' method
of the CompileRegexp class.
It takes a glob pattern, and returns
a generated class. Let's look at what 'compile' does.
The gnu.bytecode.ClassType
class is used to represent classes.
A ClassType
can be any of:
A class that the JVM knows about (and there is a corresponding
Class
object).
In that case gnu.bytecode
gets information about the class by
using reflection.
A .class
file in the file system that we've read in
and parsed (using the .class
-parsing features
of gnu.bytecode
).
A class that we're in the process of constructing.
The variable 'genClass
' (see Listing) is an instance of the latter,
We call the new class
.
Because the class will be local to its own gen_matcher
ClassLoader
,
we don't need to worry about generating a
unique name. After allocating the ClassType
, we make it public,
and set the super-class to the existing class Pattern
.
We then must declare the fields and methods of gen_matcher.
There are no fields, so we start with gen_matcher's constructor.
If you don't define any constructors in your Java program, then
javac will define a "default constructor" for you. However, it
still has to be the .class file, and gnu.bytecode doesn't (currently)
do it for you. The JVM internally uses the name <init>
for
constructors, so the first step is to add a method with that name
and appropriate type to gen_matcher. Since this is not an abstract
method, we need to generate a code body for it. In a .class file
the code associated with a method is stored in a "Code" attribute
belonging to the method. The 'startCode' invocation creates and
initializes the Code attribute, and makes it ready for us to generate
code.
A default constructors just calls the no-argument constructor of its
super-class. With the 'emitThis' method we're finally generating
executable bytecodes: It generates an 'aload_0' bytecode, which
(if you check the JVM specification) takes the this
pointer in
local variable 0 and pushes into Java's argument stack. By convention
the 'this' pointer is passed in local variable 0. We then call
emitInvoke, which generates a call to the 'invokespecial' operation,
calling the
method of the <init>
Pattern
class. Finally, we have to
emit an instruction to return from the constructor.
match
' method
The meat of the project is generating the 'match
' method for a
given pattern. We're going to use a simple-minded algorithm
that just handles '?' (any single character), '*' (any number
of arbitrary characters), and other characters (match themselves).
We'll keep it simple, and avoid cleverness, but we still have to
support back-tracking. Consider matching "a*bc" against "abcbc".
The first 'bc' will match the 'bc' in the pattern, but then we fail
to match the end of the string, so we have to back up, trying to
match the '*' against first '', then 'b', then 'bc' before the
'bc' pattern matches.
The 'match' method takes 3 arguments, not counting the implied 'this' argument. We create a variable> for each argument, so that we can refer to the appropriate argument when generating code. Thus the variable 'string' refers to the JVM local variable 1, which is where the JVM passes the first argument (after the implicit 'this' parameter).
The basic structure of how we compile 'match' is straight forward. We look through each character of the pattern, and generate code that matches some part of the input string. Let's skip handling of '*' for now. For '?' and for normal characters the logic is simple: We generate code to compare the current value of 'curPos' and the 'limit'. If they are equal, we go to the 'onFail' label. (Note that while the Java language doesn't have goto, the Java bytecode instructions set has both conditional and unconditional goto.) Unless we see a '*' in the pattern, the 'onFail' label is the same as the 'failureReturn' label, which is defined near the end of the method. If control transfers to 'failureReturn', the JVM will return 'false'. Otherwise, if (the runtime value of) curPos is less than (the runtime value of) limit, and if the pattern char isn't '?' then we have to generate code that compares the value of string.charAt(curPos) against the pattern character ''ch'. If they don't match, we again go to onFail. Finally, we emit code to increment the 'curPos' local variable.
'*'
and backtrackingThe tricky part of the 'compile' method is how we handle '*'. The basic idea is that we compile "A*BC*D" as a a pair of nested loops. Each loop tries to match '*' against successively longer sub-strings of the input. Each time we get to a failure point, we do a 'continue', which backs up to the previous '*', and we try a longer match.
The following shows the control
flow we want to generate from
:
A*BC*D
class match_A_star_BC_star_D extends Pattern { public boolean match (String string, int curPos, int limit) { L0: { if (curPos == limit) break L0; if (string.charAt(curPos++) != 'A') break L0; L1: for (int savePos1 = curPos; savePos1 < limit; savePos1++) { curPos = savePos1; if (curPos == limit) continue L1; if (string.charAt(curPos++) != 'B') continue L1; if (curPos == limit) continue L1; if (string.charAt(curPos++) != 'C') continue L1; L2: for (int savePos2 = curPos; savePos2 < limit; savePos2++) { curPos = savePos2; if (curPos == limit) continue L2; if (string.charAt(curPos++) != 'D') continue L2; if (curPos == limit) return true; } } } return false; } }Note this code is not optimal. It is easy to replace the inner loop by a single test, as it can only succeed when
curPos==limit-1
. But I'm showing it this way so you can
better see the the general idea.
Each time 'compile
' emits a jump to 'onFail
' it is either
a jump to the outermost failureReturn label (if we haven't
seen a '*
' yet), or it is a previously seen 'onFailHere
'.
The former correspond to the initial break
statements in
match_A_star_BC_star_D
's match method, while the latter
correspond to the continue statements.
The actual 'compile' method contains an optimization for the last '*': In that case, any remaining pattern characters can only match against a tail end of the substring that has the same length as the remainder of the pattern. So, the run-time matcher just needs to jump directly to that many characters from the end of the input string (after checking that there are enough characters).
The insight here is that we can put in as many smarts, handling of special cases, and optimization into the compiler as is worthwhile. You can also put in short-cuts and tricks if your matcher is an interpreter, but you have to consider that the test to see whether a short-cut is possible can take more time than you save by using the short-cut! If you can test for the short-cut at compile-time (and obviously some short-cuts you have to test for at run-time), then you can afford to consider more short-cuts.
Once you have a ClassType object, and have added all its fields and methods, what do you do with it?
You can write out a ClassType to a file, by calling the writeToFile method:
CompileRegexp.compile(pattern).writeToFile(filename);
This takes all the information in the ClassType object, "serializes" it to a bunch of bytes, and writes those bytes out to a .class file with the given filename. (If you leave out the filename, the default is as you'd expect based on the class file followed by ".class".) Such a .class file can be loaded by a JVM just like any .class file generated by javac. You can even write Java classes that extend a generated .class file, in which case javac will read the information it needs from the .class file you've generated, just as it reads any other .class file.
The Kawa framework (URL) uses writeToFile when compiling a Scheme program to a .class file. If you un-comment the call to writeToFile in CompileRegexp's main method, you will get a .class file. You can then look at the class using gnu.bytecode.dump as mentioned earlier. This is useful for debugging, or just plain to verify that you're getting what you expect.
The ultimate user of a .class file is normally a Java Virtual Machine. How does the Java VM use .class or .jar files? A *class loader* handles the process of turning a .class file (which is just a collection of bytes) into an instance of java.lang.Class (which is a prerequisite before you can call the class's methods or create instances). A class loader is an instance of (a sub-class of) ClassLoader. The default "system" class loader extracts a .class file from a .jar file that it finds using a "class path". However, you have the option of by-passing the file system and .class files completely, using a custom ClassLoader. In that case, you can have gnu.bytecode construct a byte array that has the same encoding as a .class file, but it is just a Java array object and you never write it to a file. You can pass such an array to the standard defineClass method of the CloassLoader class, which will take the byte array and create a Class object. The defineClass method is protected, since it should only be used by a custom ClassLoader that manages the mapping from class name to class objects. The gnu.bytecode includes the ArrayClassLoader class which does for you. The code in the main method of CompileRegexp show how you can take the ClassType you've created, pass it to an ArrayClassLoader, load the generated class, and create an instance 'rexp' of this new class. We cast the new object to Pattern, so we can actually use it.
We skipped a number of features of gnu.bytecode, including creating fields, emitting debugging information (line numbers and local variable information), and support for more conplex constructs like switches and exceptions. I'm hoping that this article has shown you how you can compile new classes as needed. [[Future articles will show how you can implement an entire compiler using gnu.bytecode as the code generator.]]
API documentation for gnu.bytecode: http://www.gnu.org/software/kawa/api/gnu/bytecode/package-summary.html
Per Bothner See http://per.bothner.com/papers for list of publications (mostly conference papers). Original technical lead of GCJ, the GNU COmpiler for the Java platform, and the implementor of the Kawa toolkit, which includes gnu.bytecode.
Contact information: Per Bothner 18651 Cynthia Avenue Cupertino CA 95014 per@bothner.com http://per.bothner.com (408) 777-9211