While there are many things I like about the Java language, the lack of unsigned types has always bothered me.
According to Gosling, they were too complicated:
Quiz any C developer about unsigned, and pretty soon you discover that almost no C developers actually understand what goes on with unsigned, what unsigned arithmetic is. Things like that made C complex.
Ironically this kind of hand-holding tends to introduce other complexities that are often more difficult to deal with than the original solution. In this particular case, leaving out unsigned types doesn't stop the need to work with unsigned data. Instead, it forces the developer to work around the language limitation in various unusual ways.
The first major problem in this system is that byte is signed in Java. Out of all of the code I have ever written, I can only think of a select few situations where I needed a signed byte value. In almost all cases I wanted the unsigned version.
Let's look at a very simple example: initializing a byte to 0xFF (or 255). The following will fail (note: this works in C#, since they made byte unsigned):
byte b = 0xFF;
Java will not narrow this for us because the value is outside the range of the signed byte type (>127). We can, however, work around it with a cast:
byte b = (byte) 0xFF;
If we are clever and know two's complement, we can use the negative equivalent of our simple unsigned value:
byte b = -1;
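To make the equivalence concrete, here is a minimal sketch showing that both spellings produce the same bit pattern. The class and method names are made up for illustration; Byte.toUnsignedInt is a real helper, but it only exists on Java 8 and later.

```java
final class ByteLiteralDemo {
    // Returns the 0..255 value that the byte's bit pattern represents.
    // On Java 8+, Byte.toUnsignedInt(b) does the same thing.
    static int unsignedValue(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte viaCast = (byte) 0xFF; // explicit narrowing keeps the low 8 bits
        byte viaNegative = -1;      // same bits, written as a signed literal

        System.out.println(viaCast == viaNegative);  // prints true
        System.out.println(unsignedValue(viaCast));  // prints 255
    }
}
```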
This is just the tip of the iceberg though. A common technique used in search and compression algorithms is to precompute a table based on the occurrence of a particular byte value. Since a byte can represent 256 values, this is typically done using an array with the byte value as an index, which is very efficient. Ok so you might think you can do the following:
byte b = (byte) 0xFF;
int[] table = new int[256];
table[b] = 1; // OOPS!
While this code compiles, it throws an ArrayIndexOutOfBoundsException at runtime. The array index operator requires an int. Since a byte is supplied instead, Java widens the byte to an int, which sign-extends it. Again, 0xFF means -1 for a signed byte, so it is converted to an int with a value of -1. This, of course, is an invalid array index.
To solve the problem, we must use the bitwise-and operator to force the conversion to occur in the correct (yet unintuitive) way like so:
table[b & 0xFF] = 1;
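Applied to the table technique described above, the mask might be used like this. It is only a sketch: ByteHistogram and countBytes are hypothetical names, not anything from a library.

```java
final class ByteHistogram {
    // Counts how often each byte value (0..255) occurs in data.
    static int[] countBytes(byte[] data) {
        int[] table = new int[256];
        for (byte b : data) {
            // Without the mask, any byte >= 0x80 would index with a
            // negative number and throw ArrayIndexOutOfBoundsException.
            table[b & 0xFF]++;
        }
        return table;
    }

    public static void main(String[] args) {
        byte[] sample = { (byte) 0xFF, 0x41, (byte) 0xFF };
        int[] table = countBytes(sample);
        System.out.println(table[0xFF]); // prints 2
        System.out.println(table[0x41]); // prints 1
    }
}
```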
This technique gets ugly quickly. Take a look at composing an int from 4 bytes (Ugh!):
byte b1 = 1;
byte b2 = (byte) 0x82;
byte b3 = (byte) 0x83;
byte b4 = (byte) 0x84;
int i = (b1 << 24) | ((b2 & 0xFF) << 16) | ((b3 & 0xFF) << 8) | (b4 & 0xFF);
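To see why the masks matter, the following sketch packs the same four bytes with and without them. The helper names are invented for this example. Without the masks, the sign-extended low bytes smear ones across the upper bits of the result.

```java
final class PackIntDemo {
    // Big-endian packing with every byte masked to its unsigned value.
    static int packMasked(byte b1, byte b2, byte b3, byte b4) {
        return ((b1 & 0xFF) << 24) | ((b2 & 0xFF) << 16)
             | ((b3 & 0xFF) << 8) | (b4 & 0xFF);
    }

    // The tempting but wrong version: each byte is sign-extended to an
    // int before shifting, so negative bytes overwrite the higher bits.
    static int packUnmasked(byte b1, byte b2, byte b3, byte b4) {
        return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
    }

    public static void main(String[] args) {
        byte b1 = 1, b2 = (byte) 0x82, b3 = (byte) 0x83, b4 = (byte) 0x84;
        System.out.printf("%08x%n", packMasked(b1, b2, b3, b4));   // prints 01828384
        System.out.printf("%08x%n", packUnmasked(b1, b2, b3, b4)); // prints ffffff84
    }
}
```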
These issues have in turn led to odd API workarounds. For example, look at InputStream.read(), which according to its docs is supposed to return a byte, but instead returns an integer. Why? So it can do the & 0xFF for you.
We also have DataOutput.writeShort() and DataOutput.writeByte(), which take integers instead of their respective types. Why? So that you can output unsigned values on the wire. On the reading side we end up with four methods: DataInput.readShort(), DataInput.readUnsignedShort(), DataInput.readByte(), and DataInput.readUnsignedByte(). The unsigned versions return converted integers instead of the described type names.
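A small round trip illustrates the asymmetry. This is a sketch using in-memory streams; the class and its helper methods are hypothetical, while the DataOutputStream and DataInputStream calls are the standard java.io ones.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class UnsignedShortRoundTrip {
    // writeShort takes an int and writes only its low 16 bits.
    static byte[] encode(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeShort(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the two bytes back as a signed short (-32768..32767).
    static int decodeSigned(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            return in.readShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the same two bytes back as an int in 0..65535.
    static int decodeUnsigned(byte[] wire) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire))) {
            return in.readUnsignedShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = encode(65535);
        System.out.println(decodeSigned(wire));   // prints -1
        System.out.println(decodeUnsigned(wire)); // prints 65535
    }
}
```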
To add to the confusion, we also have 2 right shift operators in this signed only mess. The unsigned right shift operator (>>>) treats the value as if it were unsigned, while the normal right shift (>>) preserves the sign (essentially acting like a divide by 2). If we want to get the most significant nibble value of an integer, we need the unsigned version:
int i = 0xF0000000;
System.out.printf("%x\n", i >> 28); // Prints ffffffff!
System.out.printf("%x\n", i >>> 28); // Prints f, as desired
So I ask all of you, was all of this hassle worth leaving out the simple and well understood unsigned keyword? I think not, and I hope anyone who considers doing this in another language they are designing learns from it. At least C# has.
Hum, actually, /is/ there any common usecase for signed bytes?
Grumpy old man's comment:
Actually, InputStream.read() returns an int so that it can return an EOF in-band. This idiom dates back to the K&R stdio library (e.g. fgetc).
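For readers who have not seen the idiom, a minimal sketch of the in-band EOF loop follows. The class and method names are made up for illustration. The key point is that read() yields 0..255 for data and -1 only at end of stream, so a data byte of 0xFF can never be mistaken for EOF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class ReadUntilEof {
    // Sums the unsigned values of all bytes until read() signals EOF.
    static int sumUnsignedBytes(InputStream in) throws IOException {
        int sum = 0;
        int c; // must be an int: a byte could not hold both 255 and -1
        while ((c = in.read()) != -1) {
            sum += c;
        }
        return sum;
    }

    // Convenience overload for in-memory data.
    static int sumUnsignedBytes(byte[] data) {
        try {
            return sumUnsignedBytes(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = { (byte) 0xFF, 1 };
        System.out.println(sumUnsignedBytes(data)); // prints 256
    }
}
```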
According to the (§3.3) there is one unsigned primitive type:
So it's possible to avoid all that trickery, at the cost of a few extra bits. :)
I feel your pain. I have been developing some code for a tool that requires a well-defined binary format, and it's written in Java. I need to represent offsets of a record from the start of the file and unsigned types make this a real pain. Sure, I could just use signed types and just bail if I find a negative value within the file but that doesn't sit well with me. So basically, what I have ended up doing is storing 32-bit unsigned ints within the binary, that are represented as Java longs in the program. PITFA.
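The long-backed approach described here could look roughly like the sketch below. The class and method names are hypothetical; on Java 8 and later, Integer.toUnsignedLong performs the same widening as the mask.

```java
final class UnsignedOffset {
    private static final long MAX_UINT32 = 0xFFFFFFFFL;

    // Widens the 32 bits stored in the file to a non-negative long.
    static long toUnsigned(int raw) {
        return raw & MAX_UINT32; // the L suffix keeps the mask a long
    }

    // Narrows an in-memory offset back to the 32 bits written to disk.
    static int toRaw(long offset) {
        if (offset < 0 || offset > MAX_UINT32) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        return (int) offset;
    }

    public static void main(String[] args) {
        int stored = 0xF0000000;                // negative as a Java int
        long offset = toUnsigned(stored);
        System.out.println(offset);             // prints 4026531840
        System.out.println(toRaw(offset) == stored); // prints true
    }
}
```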
it is true that some low level operations especially protocols and things like crypto algorithms would benefit from this. but i more agree with gosling here. concept is really not necessary for today's high level development needs. modern languages do not include it either. i guess Csharp is an exception. they wanted to lure C
Jesper,
Yes, EOF handling is also a reason, although that could have been done using an exception (as in C#). In any case there is a great disparity between the single-byte read, which returns a promoted unsigned byte (0-255), and the buffer-style read calls, which return a signed byte array. Also, the DataInput methods I listed are only peculiar because of the unsigned/signed issues.
Luis,
Yes we could just use chars, the problem is that since you tend to read data in bytes, you will still have to do a bitwise-and promotion to the char type. Also you have to remember to promote chars to ints before you print them, else you get a unicode char ;)
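Both pitfalls mentioned here can be shown in a few lines. This is only a sketch with invented names. Note that converting a negative byte straight to char sign-extends first, so the mask is still required.

```java
final class CharAsUnsigned {
    // Stores a byte's unsigned value (0..255) in a char.
    static char widen(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xFF;

        char masked = widen(b);
        char unmasked = (char) b; // byte -> int (-1) -> char (0xFFFF)

        // Cast to int before printing, otherwise println prints the
        // Unicode character rather than its numeric value.
        System.out.println((int) masked);   // prints 255
        System.out.println((int) unmasked); // prints 65535
    }
}
```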
Afsina,
Essentially any Java program that reads binary data (files, network, etc.) will run into these issues. That's a very significant percentage of Java programs. Further, anybody who doesn't want it just doesn't have to use it. The largest problem with adding them to the language is updating the libraries to use them without breaking backward compatibility. Not so easy.
Also, can you expand on what you meant by modern languages? Perhaps you are referring to languages that don't have strong primitive types? In which case they automatically promote and type convert for you. This introduces an entirely different set of problems.
I have to wonder, why would you use a byte when you know it doesn't do what you want? What is wrong with an int or a long?
Is it a waste of memory or a waste of your time worrying about it? 4 bytes cost 0.000028 cents. That's worth less than 1 second of your time.
You could argue that int is more efficient because the time you save writing it is worth more than the memory you save. The memory is reusable, your time isn't.
@Jason: sorry my message was eaten because of the plus and sharp symbols. well, introducing unsigned is not a small change in the language. once you have 4 different integer types (byte, short, int, long), you may need four new sets of concerns in your low level APIs when unsigned comes onto the scene. i think for simplicity's sake they did not include that. i was wrong about new high level languages and unsigned support though. For example Fortress has type Z (Z, Z8, Z16, Z32, Z64, Z128) for integers and N for unsigned ones (N, N8, N16, N32, N64, N128). for java, maybe byte should have been unsigned by default, or there should be some helper syntax for byte operations to ease the pain Gavin is having.
I forgot to mention that I agree (up to a point) that byte should have been unsigned. Unsigned ints and longs are not quite as useful, and type modifiers just end up as a combinatorial explosion in case of reflection, etc.
@afsina: While I agree with you that high-level languages don't need unsigned, most high level languages that omit it don't require the programmer to differentiate between byte, short, int, and long either. Java attempts to be high-level in some places and makes low-level performance concessions with things like primitive types in others, owing more to the whims of the language designers and performance concerns of the day than any idea of internal consistency.
Why does Java bother to give the programmer control over how many bytes are used to represent a numeric value but not over what those bytes mean?
Isn't this an unnecessary optimization issue? Who forces you to use bytes, what do you gain working with other developers? Do they understand your problem? Do they remember it after a month returning to the code to fix a bug?
InputStream.read() returns ints. And it works :-)
Lol, Gosling is a pretentious fucking asshole, so his pathetic language didn't get unsigned types. Perhaps that is why it is dying. Good riddance, it's a piece of shit.
People who decide that modern languages don't need unsigned types are very shortsighted. They obviously are only experienced in certain kinds of programming. I do a lot of low-level bit-twiddling work, and I've almost given up the idea of using Java because it doesn't have unsigned. There are whole classes of problems where you need unsigned, and you have to rule Java out because of them.
Here is an example. The packet returned from a socket is a byte array. To turn four of the bytes into an int, it looks like this:
data = (int)((msg[4] << 24) & 0xff000000)
     | (int)((msg[5] << 16) & 0xff0000)
     | (int)((msg[6] << 8) & 0xff00)
     | (int)(msg[7] & 0xff);
You have to mask off the other bits, because otherwise the signed portion becomes all F's, and it's a mess.
If someone can show me a better way to do that, I would be VERY appreciative. Because when I have 1k of data that comes like this, it is VERY computing expensive.
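One possible answer, offered as a sketch rather than a measured recommendation: java.nio.ByteBuffer can read a big-endian int straight out of a byte array and handles the masking internally. The class and method names below are hypothetical, and the offset of 4 simply mirrors the example above. Whether this is actually faster than the hand-written version would need to be measured.

```java
import java.nio.ByteBuffer;

final class PacketField {
    // Reads four bytes starting at offset as one big-endian int.
    // ByteBuffer defaults to big-endian (network) byte order.
    static int readIntAt(byte[] msg, int offset) {
        return ByteBuffer.wrap(msg).getInt(offset);
    }

    // The hand-written equivalent, masking each byte before shifting.
    static int readIntManually(byte[] msg, int offset) {
        return ((msg[offset] & 0xFF) << 24)
             | ((msg[offset + 1] & 0xFF) << 16)
             | ((msg[offset + 2] & 0xFF) << 8)
             | (msg[offset + 3] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] msg = { 0, 0, 0, 0, 0x01, (byte) 0x82, (byte) 0x83, (byte) 0x84 };
        System.out.printf("%08x%n", readIntAt(msg, 4));       // prints 01828384
        System.out.printf("%08x%n", readIntManually(msg, 4)); // prints 01828384
    }
}
```

For a packet with many fields, wrapping the array once and calling getInt, getShort, and so on against the same buffer avoids repeating the shift-and-mask expression for every field.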
Thanks, a good insight on unsigned types.
@Stefan: We care because it is PITA working with some set of problems in Java. I was shocked to learn that they left out unsigned bytes... I think they escalated the very problems they were trying to avoid. For those who need (and understand) unsigned data types their decision is... less than perfect. :)
I guess there is no way they can fix that now, is there? I haven't found the time to study the Java VM specs yet. But I doubt they will anyway.
Nice write-up, thanks!
@Stefan: me again... maybe I wasn't clear enough and there is no on this blog... It is not (only) about optimization, it is about writing a clean code too. See the biggunzclub's example above.
Bollocks! What if you're writing a network protocol that needs to be used on limited bandwidth devices? Yeah, just send 4 times as much data as you need to, what a great idea!
If you never had to use Java to deal with network protocols and/or binary file access, you shouldn't comment about this issue. I like Java but I really hate this. In my current project I must choose between wasting a lot of bandwidth or wasting CPU cycles and making creepy code. This is a great fault that didn't have to exist.
No primitive structured type leaves us using arrays in a class to back any data structure we need more than 10 of...it's like coding in Fortran, before Fortran got structured types. Java sucks. Gosling ignored the rule about not designing things that an idiot can use, because you end up with something that only an idiot would use.
Re: comment concerning creating an int from an array of bytes read over the network.
If your required format is little-endian and the network format is big endian, you have to convert it anyway. However, you appear to be unnecessarily applying the mask operator, from java.io.DataInputStream#readInt():
public final int readInt() throws IOException {
    int ch1 = in.read();
    int ch2 = in.read();
    int ch3 = in.read();
    int ch4 = in.read();
    if ((ch1 | ch2 | ch3 | ch4) < 0)
        throw new EOFException();
    return ((ch1 << 24) + (ch2 << 16) + (ch3 << 8) + (ch4 << 0));
}
we see no masking.
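One detail may matter when carrying this over to the packet example: read() hands back an int that is already in the range 0..255 (or -1 at EOF), whereas the elements of a byte[] are signed and get sign-extended when shifted. The sketch below, with made-up names, runs the same four bytes through both paths.

```java
import java.io.ByteArrayInputStream;

final class ReadVersusArray {
    // Same arithmetic as readInt(): the values come from read(), so
    // they are already unsigned and need no mask.
    static int fromStream(byte[] data) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int ch1 = in.read();
        int ch2 = in.read();
        int ch3 = in.read();
        int ch4 = in.read();
        return (ch1 << 24) + (ch2 << 16) + (ch3 << 8) + (ch4 << 0);
    }

    // The same expression applied directly to array elements: each
    // byte is sign-extended first, so the sum comes out wrong whenever
    // a byte is 0x80 or above.
    static int fromArrayUnmasked(byte[] d) {
        return (d[0] << 24) + (d[1] << 16) + (d[2] << 8) + (d[3] << 0);
    }

    public static void main(String[] args) {
        byte[] data = { 0x01, (byte) 0x82, (byte) 0x83, (byte) 0x84 };
        System.out.printf("%08x%n", fromStream(data));
        System.out.printf("%08x%n", fromArrayUnmasked(data));
    }
}
```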