Red Hat

Hand-holding sucks, Java needs unsigned types!

Tagged as Java EE

While there are many things I like about the Java language, the lack of unsigned types has always bothered me.

According to Gosling, they were too complicated:

Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex.

Ironically this kind of hand-holding tends to introduce other complexities that are often more difficult to deal with than the original solution. In this particular case, leaving out unsigned types doesn't stop the need to work with unsigned data. Instead, it forces the developer to work around the language limitation in various unusual ways.

The first major problem in this system is that byte is signed in Java. Out of all of the code I have ever written, I can only think of a select few situations where I needed a signed byte value. In almost all cases I wanted the unsigned version.

Let's look at a very simple example: initializing a byte to 0xFF (255). The following fails to compile (note: this works in C#, since they made byte unsigned):

byte b = 0xFF;

Java will not narrow this for us because the value is outside the range of the signed byte type (>127). We can, however, work around it with a cast:

byte b = (byte) 0xFF;

If we are clever and know two's complement, we can use the negative equivalent of our simple unsigned value:

byte b = -1;
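As a quick sanity check, here is a minimal sketch showing that both spellings produce the same byte. The class and method names (ByteLiteralDemo, fromCast, fromNegative) are made up for illustration.

```java
class ByteLiteralDemo {
    // The cast keeps only the low 8 bits of the int literal 0xFF.
    static byte fromCast() {
        return (byte) 0xFF;
    }

    // -1 fits in the signed byte range, so no cast is needed.
    static byte fromNegative() {
        return -1;
    }

    public static void main(String[] args) {
        // The legal range of a Java byte.
        System.out.println(Byte.MIN_VALUE + " to " + Byte.MAX_VALUE); // -128 to 127
        // Both spellings store the same bit pattern (all eight bits set).
        System.out.println(fromCast() == fromNegative()); // true
        System.out.println(fromCast());                   // -1
    }
}
```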

This is just the tip of the iceberg though. A common technique used in search and compression algorithms is to precompute a table based on the occurrence of a particular byte value. Since a byte can represent 256 values, this is typically done using an array with the byte value as an index, which is very efficient. OK, so you might think you can do the following:

byte b = (byte) 0xFF;
int table[] = new int[256];
table[b] = 1; // OOPS!

While this code compiles, it throws an ArrayIndexOutOfBoundsException at runtime. The array index operator requires an int. Since a byte is supplied instead, Java widens the byte to an int, and that widening sign-extends. Again, 0xFF means -1 for a signed byte, so it becomes an int with a value of -1. This, of course, is an invalid array index.

To solve the problem, we must use the bitwise-and operator to force the conversion to occur in the correct (yet unintuitive) way like so:

table[b & 0xFF] = 1;
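To make the table idea concrete, here is a minimal sketch of a byte frequency count that applies the mask on every lookup. ByteHistogram and countBytes are hypothetical names, and the sample data is invented for the example.

```java
class ByteHistogram {
    // Counts how often each byte value occurs. The mask turns the signed
    // byte into an index in 0..255 instead of -128..127.
    static int[] countBytes(byte[] data) {
        int[] table = new int[256];
        for (byte b : data) {
            table[b & 0xFF]++;
        }
        return table;
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xFF, 0x41, (byte) 0xFF, (byte) 0x80 };
        int[] table = countBytes(data);
        System.out.println(table[0xFF]); // 2
        System.out.println(table[0x41]); // 1
        System.out.println(table[0x80]); // 1
    }
}
```

Forget the mask in the loop and the first 0xFF in the input blows up with a negative index.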

This technique gets ugly quickly. Take a look at composing an int from four bytes (ugh!):

byte b1 = 1;
byte b2 = (byte)0x82;
byte b3 = (byte)0x83;
byte b4 = (byte)0x84;

int i = (b1 << 24) | ((b2 & 0xFF) << 16) | ((b3 & 0xFF) << 8) | (b4 & 0xFF);
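To see why each mask matters, here is a small sketch that wraps the expression in a helper and compares it with a deliberately broken version that skips the masks. BigEndianInt, toInt, and toIntUnmasked are names invented for this example.

```java
class BigEndianInt {
    // Same expression as above. The top byte needs no mask because the
    // shift by 24 pushes its sign-extension bits out of the int.
    static int toInt(byte b1, byte b2, byte b3, byte b4) {
        return (b1 << 24) | ((b2 & 0xFF) << 16) | ((b3 & 0xFF) << 8) | (b4 & 0xFF);
    }

    // Deliberately wrong: without the masks, sign extension from the
    // lower bytes smears 1 bits over the higher ones.
    static int toIntUnmasked(byte b1, byte b2, byte b3, byte b4) {
        return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
    }

    public static void main(String[] args) {
        byte b1 = 1;
        byte b2 = (byte) 0x82;
        byte b3 = (byte) 0x83;
        byte b4 = (byte) 0x84;
        System.out.printf("%08x%n", toInt(b1, b2, b3, b4));         // 01828384
        System.out.printf("%08x%n", toIntUnmasked(b1, b2, b3, b4)); // ffffff84
    }
}
```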

These issues have in turn led to odd API workarounds. For example, look at InputStream.read(), which according to its docs reads a byte, but instead returns an int. Why? So it can do the & 0xFF for you (the wider return type also leaves room for -1 to signal end of stream).
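Here is a minimal sketch of that behavior, using a ByteArrayInputStream as a stand-in for any stream. ReadDemo and readFirst are hypothetical names; the IOException is wrapped only to keep the helper easy to call.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ReadDemo {
    // Returns whatever a single read() call reports: the next byte as an
    // int in 0..255, or -1 if the stream has no more data.
    static int readFirst(InputStream in) {
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xFF };
        System.out.println(data[0]);                                        // -1 (signed byte)
        System.out.println(readFirst(new ByteArrayInputStream(data)));      // 255
        System.out.println(readFirst(new ByteArrayInputStream(new byte[0]))); // -1 (end of stream)
    }
}
```

Note the trap: a stored byte of 0xFF prints as -1, and so does end of stream, but read() keeps the two apart by returning 255 for the former.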

We also have DataOutput.writeShort() and DataOutput.writeByte(), which take integers instead of their respective types. Why? So that you can output unsigned values on the wire. On the reading side we end up with four methods: DataInput.readShort(), DataInput.readUnsignedShort(), DataInput.readByte(), and DataInput.readUnsignedByte(). The unsigned versions return converted integers instead of the types their names describe.
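A rough sketch of the round trip, using the standard DataOutputStream and DataInputStream implementations over an in-memory buffer. UnsignedWireDemo and roundTrip are names made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class UnsignedWireDemo {
    // Writes one byte and one short, then reads the same bytes back twice:
    // once with the signed readers and once with the unsigned ones.
    // Returns { signedByte, signedShort, unsignedByte, unsignedShort }.
    static int[] roundTrip(int byteValue, int shortValue) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeByte(byteValue);   // only the low 8 bits are written
            out.writeShort(shortValue); // only the low 16 bits are written
            out.flush();
            byte[] wire = buffer.toByteArray();

            DataInputStream signedIn = new DataInputStream(new ByteArrayInputStream(wire));
            int signedByte = signedIn.readByte();
            int signedShort = signedIn.readShort();

            DataInputStream unsignedIn = new DataInputStream(new ByteArrayInputStream(wire));
            int unsignedByte = unsignedIn.readUnsignedByte();
            int unsignedShort = unsignedIn.readUnsignedShort();

            return new int[] { signedByte, signedShort, unsignedByte, unsignedShort };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints [-1, -2, 255, 65534]
        System.out.println(Arrays.toString(roundTrip(0xFF, 0xFFFE)));
    }
}
```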

To add to the confusion, we also have two right shift operators in this signed-only mess: the unsigned right shift operator (>>>), which treats the value as if it were unsigned, and the normal right shift (>>), which preserves the sign (essentially acting like a divide by 2). If we want the most significant nibble of an integer, we need the unsigned version.

int i = 0xF0000000;

System.out.printf("%x\n", i >> 28); // Prints ffffffff!
System.out.printf("%x\n", i >>> 28); // Prints f, as desired
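For a closer look at the two operators, here is a small sketch on negative values. It also shows why ">> acts like divide by 2" is only roughly true: the shift rounds toward negative infinity, while Java's integer division rounds toward zero. ShiftDemo and topNibble are names invented for the example.

```java
class ShiftDemo {
    // Extracts the most significant nibble as a value in 0..15.
    static int topNibble(int i) {
        return i >>> 28;
    }

    public static void main(String[] args) {
        System.out.println(-8 >> 1);   // -4, same as -8 / 2
        System.out.println(-7 >> 1);   // -4, rounds down
        System.out.println(-7 / 2);    // -3, rounds toward zero
        System.out.println(-8 >>> 1);  // 2147483644, zero shifted in at the top
        System.out.println(topNibble(0xF0000000)); // 15
    }
}
```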

So I ask all of you, was all of this hassle worth leaving out the simple and well understood unsigned keyword? I think not, and I hope anyone who considers doing this in another language they are designing learns from it. At least C# has.
