
Out of need, I recently contributed a data-flow analysis framework to javassist. The framework allows an application to determine, by inference, the type-state of the local variable table and stack frame at the start of every bytecode instruction. For those unfamiliar with the Java bytecode format, a lot of information is lost once a Java program is compiled, since it is not really needed when the program is executed, and leaving it out helps keep class files small.

To illustrate this loss, take a look at the following simple Java method:

public static class Base {}
public static class A extends Base{}
public static class B extends Base{}
public static class C extends B{}

private void foo(int x) {
   Base b;
   if (x > 4) {
       b = new A();
   } else {
       b = new C();
   }

   b.toString();
}

While it is quite clear in the Java code that b is of type Base, this information is missing from the output of a compiler:

private void foo(int);
  Code:
   Stack=2, Locals=3, Args_size=2
   0:   iload_1
   1:   iconst_4
   2:   if_icmple       16
   5:   new     #68; //class example/Example$A
   8:   dup
   9:   invokespecial   #70; //Method example/Example$A."<init>":()V
   12:  astore_2
   13:  goto    24
   16:  new     #71; //class example/Example$C
   19:  dup
   20:  invokespecial   #73; //Method example/Example$C."<init>":()V
   23:  astore_2
   24:  aload_2
   25:  invokevirtual   #74; //Method java/lang/Object.toString:()Ljava/lang/String;
   28:  pop
   29:  return

Since toString() is declared by Object, all that the invokevirtual at offset 25 tells us is that the receiver is an Object, which is obviously not very specific. If the class was compiled with debugging information, you would be able to learn that local variable 2 was declared as type Base, but even then you would not necessarily know that the object invokevirtual is invoked on is the value stored in local variable 2. The only way to determine that is to know the state of the stack frame immediately before the instruction executes.

The analysis framework provides this by modeling the effect of every instruction, until it can eventually infer the type information. This process does not use any debugging information, since there is no guarantee it is available. Instead, it extrapolates it by tracking all possible type states, as every branch is evaluated, until the type information is reduced to the most specific type state available.
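To make the "reduce to the most specific type" step concrete, here is a minimal sketch of what happens where the two branches of foo() meet. It is not the javassist implementation: MergeSketch and commonSuperclass are hypothetical names, plain reflection stands in for the analyzer's own type model, and interfaces are ignored entirely.

```java
import java.util.HashSet;
import java.util.Set;

final class MergeSketch {
    static class Base {}
    static class A extends Base {}
    static class B extends Base {}
    static class C extends B {}

    // Returns the most specific class that both arguments extend.
    // Only the superclass chain is considered; interfaces are ignored.
    static Class<?> commonSuperclass(Class<?> one, Class<?> two) {
        Set<Class<?>> ancestors = new HashSet<>();
        for (Class<?> c = one; c != null; c = c.getSuperclass()) {
            ancestors.add(c);
        }
        for (Class<?> c = two; c != null; c = c.getSuperclass()) {
            if (ancestors.contains(c)) {
                return c;
            }
        }
        return Object.class;
    }

    public static void main(String[] args) {
        Class<?> thenBranch = A.class; // type stored in local 2 on the "if" path
        Class<?> elseBranch = C.class; // type stored in local 2 on the "else" path
        Class<?> merged = commonSuperclass(thenBranch, elseBranch);
        System.out.println(merged.getSimpleName()); // prints "Base"
    }
}
```

The real analyzer has to do considerably more than this (interfaces, arrays, primitives, and the operand stack), but the idea at a join point is the same: two incoming types are replaced by the closest type that covers both.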

The following code, which uses the framework, is able to tell us that the type invoked on line 25 is in fact Base:

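import javassist.ClassPool;
import javassist.CtClass;
import javassist.bytecode.analysis.Analyzer;
import javassist.bytecode.analysis.Frame;
Analyzer a = new Analyzer();
CtClass clazz = ClassPool.getDefault().get("example.Example");
Frame[] frames = a.analyze(clazz.getDeclaredMethod("foo")); // one Frame per bytecode offset
System.out.println(frames[25].peek()); // Prints "example.Example$Base"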

There is also a nice little tool I added, called framedump, that dumps the entire state at every instruction in human-readable format, and yes I know that's debatable :)

$ framedump example.Example
private void foo(int);
0: iload_1
     stack []
     locals [example.Example, int, empty]

... snipped for brevity ...

24: aload_2
     stack []
     locals [example.Example, int, example.Example$Base]
25: invokevirtual #85 = Method java.lang.Object.toString(()Ljava/lang/String;)
     stack [example.Example$Base]
     locals [example.Example, int, example.Example$Base]
28: pop
     stack [java.lang.String]
     locals [example.Example, int, example.Example$Base]
29: return
     stack []
     locals [example.Example, int, example.Example$Base]

Some of you are probably thinking:

That sounds nice and all, but why in the world would I ever need to use this?

It is definitely not useful to everyone; however, it is very useful for certain applications:

  • Bytecode Enhancers
  • Verifiers
  • Optimizers
  • Debugging/Profiling Tools
  • Decompilers

To expand on the enhancer example, for security reasons, the JVM actually does its own data-flow analysis to verify that a class does not violate type rules before it can be run. This poses an interesting challenge to any application that manipulates bytecode, since any change that affects the possible type-state can lead to a verify error and the JVM throwing out the class. Frameworks such as this can be used to prevent this problem, since they reveal the same (in the javassist analyzer case, actually more detailed) information available to the JVM's verifier.

If you want to play with this new feature, download the recently released 3.8.0 here.

The javadoc is here.

Note, I should also mention that the ASM project has had a similar framework for quite some time, however, it wasn't usable in my case since I needed the ability to handle reduction of multi-interface and array types. Also, I was already using javassist and switching just wasn't possible, mainly due to other features I rely on.

Enjoy!

14 comments:
 
18. Jun 2008, 17:38 CET | Link
Manik Surtani

Nice article. What is the memory cost of such analysis, given that you track all possible type states?

 
18. Jun 2008, 22:42 CET | Link
Flavia Rainone

Great stuff, Jason!

 
18. Jun 2008, 22:56 CET | Link

ASM has had such an API, and more, for ages. http://asm.objectweb.org/

 
18. Jun 2008, 23:04 CET | Link

Yeah, I mentioned that, see the last paragraph.

Note, I should also mention that the ASM project has had a similar framework for quite some time, however, it wasn't usable in my case since I needed the ability to handle reduction of multi-interface and array types. Also, I was already using javassist and switching just wasn't possible, mainly due to other features I rely on.
 
18. Jun 2008, 23:31 CET | Link

The memory usage is directly proportional to the size of the method. The process is a variant of the one described in the vmspec. When two branches merge, the reduced/merged type set replaces the previous one. So you have fairly linear growth that eventually caps and is slightly decreased when the type set hopefully becomes a single type. Also, in a number of cases instances are reused.
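A rough sketch of why this stays linear, assuming a toy model rather than the real analyzer: one slot is kept per instruction, and a merge overwrites that slot instead of accumulating every type seen. WorklistSketch, join, analyze, and the Unset marker are all hypothetical names, only a single local variable is tracked, and interfaces are ignored. A real verifier would also treat a merge with an unassigned local as unusable; that case does not arise in this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class WorklistSketch {
    static class Base {}
    static class A extends Base {}
    static class C extends Base {}
    /** Stands in for a local that has not been assigned yet. */
    static final class Unset {}

    // Most specific class that both arguments extend (superclass chain only).
    static Class<?> join(Class<?> one, Class<?> two) {
        for (Class<?> c = one; c != null; c = c.getSuperclass()) {
            if (c.isAssignableFrom(two)) {
                return c;
            }
        }
        return Object.class;
    }

    // stores[i]: type instruction i writes to the tracked local, or null if it leaves it alone.
    // successors[i]: instructions that may execute right after instruction i.
    // Returns the type of the local on entry to each instruction.
    static Class<?>[] analyze(Class<?>[] stores, int[][] successors) {
        Class<?>[] entry = new Class<?>[stores.length]; // one slot per instruction
        Deque<Integer> worklist = new ArrayDeque<>();
        entry[0] = Unset.class;
        worklist.add(0);
        while (!worklist.isEmpty()) {
            int i = worklist.remove();
            Class<?> out = stores[i] != null ? stores[i] : entry[i];
            for (int next : successors[i]) {
                Class<?> merged = entry[next] == null ? out : join(entry[next], out);
                if (merged != entry[next]) {
                    entry[next] = merged; // the merged type replaces the previous one
                    worklist.add(next);   // revisit, since its input changed
                }
            }
        }
        return entry;
    }

    public static void main(String[] args) {
        // 0: branch, 1: store new A, 2: store new C, 3: use the local
        Class<?>[] stores = { null, A.class, C.class, null };
        int[][] successors = { {1, 2}, {3}, {3}, {} };
        Class<?>[] entry = analyze(stores, successors);
        System.out.println(entry[3].getSimpleName()); // prints "Base"
    }
}
```

The entry array never grows beyond the number of instructions, which is the point being made about memory use above.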

19. Jun 2008, 01:54 CET | Link

Jason:

Assuming I have:

class MyClass implements MyInterface1, MyInterface2{}

...and...

class Client{

  Object obj = someGenericLookup("MyClass");
  MyInterface1 intf = (MyInterface1)obj;
  Class<?> clazz = intf.getClass();
}

Is it possible either with your current implementation or with some enhancements to know that clazz is of type MyInterface1 but not MyInterface2?

S, ALR

19. Jun 2008, 02:04 CET | Link

Of course, if this information is ditched by the compiler, any load-time additions we make would be irrelevant. :)

19. Jun 2008, 03:02 CET | Link

Yes, casts result in runtime checking via the checkcast instruction. So you can determine that clazz is at least a type that implements/extends MyInterface1.

Here is the relevant portion of the framedump output of your example:

8: checkcast #90 = Class example.Example$MyInterface1
     stack [java.lang.Object]
     locals [example.Example, java.lang.Object, empty]
11: astore_2
     stack [example.Example$MyInterface1]
     locals [example.Example, java.lang.Object, empty]
12: aload_2
     stack []
     locals [example.Example, java.lang.Object, example.Example$MyInterface1]
13: invokevirtual #92 = Method java.lang.Object.getClass(()Ljava/lang/Class;)
     stack [example.Example$MyInterface1]
     locals [example.Example, java.lang.Object, example.Example$MyInterface1]
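To separate what the analysis can know from what exists at runtime, here is a small runnable sketch of the same scenario. The names follow the example in the question; CheckcastSketch is a hypothetical wrapper class, and someGenericLookup is a stand-in that simply returns a MyClass instance.

```java
final class CheckcastSketch {
    interface MyInterface1 {}
    interface MyInterface2 {}
    static class MyClass implements MyInterface1, MyInterface2 {}

    // Hypothetical stand-in for the lookup in the question.
    static Object someGenericLookup() {
        return new MyClass();
    }

    public static void main(String[] args) {
        Object obj = someGenericLookup();
        MyInterface1 intf = (MyInterface1) obj; // compiled to a checkcast instruction
        Class<?> clazz = intf.getClass();       // the runtime class, not the cast target

        System.out.println(clazz.getSimpleName());                      // prints "MyClass"
        System.out.println(MyInterface2.class.isAssignableFrom(clazz)); // prints "true"
    }
}
```

The static type-state after the checkcast is MyInterface1, which is all the analyzer can infer from the bytecode. That the object also implements MyInterface2 is only visible at runtime.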
 
19. Jun 2008, 10:02 CET | Link
Raghu

How can I find the reference to a method and the value passed as the parameter? For example: a method setName() exists in class Person. In class Evaluate, the method is called twice: setName(Steve) and setName(Jan).

How do I get these values (Steve and Jan) using javassist?

 
19. Jun 2008, 11:35 CET | Link

The data-flow analyzer in javassist is intra-procedural, not inter-procedural: it is only run against a single method, and it only solves for type info, not values.

 
21. Jun 2008, 02:21 CET | Link

Not sure what you meant by reduction of multi-interface. ASM provides several interpreters that can handle a chosen type system, and one could also implement a custom interpreter that handles a custom type system or does some other advanced stuff.

 
21. Jun 2008, 05:33 CET | Link

By multiple-interface types, I am referring to types that cannot be immediately resolved to a single common type when merged (not an uncommon case). A type may have one or more common interfaces in addition to a common super class. The only way to solve for a correct answer is to infer based on the information available in other instructions, and eventually arrive at the best fitting solution.

A simple example is a branch using a Long and another using an Integer; the merged type could be either a Number or a Comparable.
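The ambiguity is easy to see with plain reflection. The following is only an illustrative sketch, not how the javassist analyzer represents types: InterfaceMergeSketch, allInterfaces, and sharedInterfaces are hypothetical names, and the code simply lists the candidate types that a merge of Long and Integer would have to choose between.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class InterfaceMergeSketch {
    // Collects every interface a class implements, directly or through its supertypes.
    static Set<Class<?>> allInterfaces(Class<?> type) {
        Set<Class<?>> found = new LinkedHashSet<>();
        for (Class<?> c = type; c != null; c = c.getSuperclass()) {
            collect(c, found);
        }
        return found;
    }

    // Adds the interfaces declared by type, then recurses into their super-interfaces.
    private static void collect(Class<?> type, Set<Class<?>> found) {
        for (Class<?> iface : type.getInterfaces()) {
            if (found.add(iface)) {
                collect(iface, found);
            }
        }
    }

    // Interfaces implemented by both classes.
    static Set<Class<?>> sharedInterfaces(Class<?> one, Class<?> two) {
        Set<Class<?>> shared = allInterfaces(one);
        shared.retainAll(allInterfaces(two));
        return shared;
    }

    public static void main(String[] args) {
        // Both classes extend java.lang.Number ...
        System.out.println(Long.class.getSuperclass() == Integer.class.getSuperclass());
        // ... and they also share interfaces such as Comparable (the exact set depends on the JDK version).
        System.out.println(sharedInterfaces(Long.class, Integer.class));
    }
}
```

With more than one valid candidate, no single merged type can be picked at the join point itself, which is why the choice has to be deferred until later instructions show how the value is actually used.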

It could be possible to do this with a new ASM interpreter, although other enhancements to the analyzer would have likely been needed.

Not that I should have to justify this, but from my perspective, the cost of enhancing ASM to resolve this issue, in addition to building some kind of ASM to javassist bridge, and the negative aspects of having yet another dependency and the less-than-ideal code integration path, was more than the cost of just adding a new framework to javassist that met my needs. Not to mention that other javassist users have asked for this in the past.

 
27. Jun 2008, 13:58 CET | Link
Dietrich Schulten

Version 3.8 has not yet appeared on http://repo1.maven.org/maven2/jboss/javassist/. You probably know that this is the Maven repository from which the Maven build system reads dependencies. I would like to use the new version, but I have to add your jar manually. The older versions are there. Could you add the new version as well, or do you know who would be the right person to ask?

 
28. Jun 2008, 03:20 CET | Link

The maven2 repo for javassist is at: http://repository.jboss.org/maven2/javassist/javassist/

Not sure who is maintaining the one on repo1.maven.org.
