I need to parse some data containing encoded primitive types (ints, floats, doubles, and so on) output by Java. I'm adding this functionality to an existing set of Python scripts, so rewriting it in Java isn't really an option. I'd like to re-implement and/or use a Python library to decode the data (e.g. TH3IFMw for a float).
I don't recognize this encoding. I'm working with the requests sent to Google Web Toolkit, and based on the source here and here I thought it was String.valueOf, but this is incorrect. Does anyone recognize it?
I think this is encoding a long int, not a float. In particular, it's probably 0x0000004c7dc814cc, but might be 0x00000131f7205330.
My reasoning...
Looking through the code you linked to, it doesn't look like anything remotely out of the ordinary is being done to floats, and the standard valueOf implementation definitely does nothing like this.
On the other hand, the string TH3IFMw looks for all the world like a base64-encoded string. I can't think of many other common encodings that use upper alpha, lower alpha, and digits. Looking through the same code, I can only find one reference to base64: line 575 of StreamWriter, where it handles the encoding of long instances. This is the only part of the linked code which seems even remotely capable of generating the output you observed.
Looking at the size of the string... assuming it is base64, it's missing a trailing = padding/alignment character, but some implementations of base64 do omit these for brevity. Adding that back (TH3IFMw=), and decoding as base64, this results in the hex value 0x4c7dc814cc. This is only 5 bytes in size, which is a little odd. But this does mean it's probably not a float (4 bytes) or double (8 bytes).
But this could fit with line 575's encoding of a long... looking at the documentation for Base64Utils.toBase64, it makes reference to the fact that "Leading groups of all zero bits are omitted." This would explain the 5 byte value, if the original long was 0x0000004c7dc814cc.
However, the documentation's wording is frustratingly ambiguous (and I don't have Java+GWT available to me right now to test). "Leading groups of all zero bits" could mean they are omitting source bytes which are all zeros, but it could also mean they're omitting leading A characters from the encoded base64 string (A represents six zero bits in base64). If that's the case, then the actual base64 string is ATH3IFMw, which decodes to the long value 0x00000131f7205330.
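If you want to check both readings against your input, here's a minimal sketch using the standard java.util.Base64 decoder (I'm assuming GWT's alphabet matches the standard one for the alphanumeric characters, which is worth double-checking):

import java.util.Base64;

public class DecodeGuess {
    // Interpret a big-endian byte array as a (possibly shortened) long value.
    static long toLong(byte[] bytes) {
        long value = 0;
        for (byte b : bytes) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        // Reading 1: the trailing '=' padding was dropped.
        byte[] padded = Base64.getDecoder().decode("TH3IFMw=");
        // Reading 2: leading 'A' characters (six zero bits each) were dropped.
        byte[] leadingA = Base64.getDecoder().decode("ATH3IFMw");

        System.out.printf("padding restored:   0x%x%n", toLong(padded));   // 0x4c7dc814cc
        System.out.printf("leading A restored: 0x%x%n", toLong(leadingA)); // 0x131f7205330
    }
}

The same two decodes can be reproduced in your Python scripts with base64.b64decode, padding or prefixing the string first in the same way.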
If you can find either of those numbers in what you're providing as input, then that's probably what's happening. If not... I'm afraid I'm stumped.
It has occurred to me that the char type in Java could be entirely replaced with integer types (leaving character literals as a programmers' convenience). This would allow flexibility of storage size, since ASCII only takes one byte and Unicode beyond the Basic Multilingual Plane requires more than two bytes. If a character is just a two-byte number like the short type, why is there a separate type for it?
Nothing is 100% necessary in a programming language; we could all use BCPL if we really wanted to. (BTW, I learned that well enough to write fizzbuzz a few years ago and recommend doing that. It's a language with an interesting viewpoint and/or historical perspective.)
The question is, does char simplify or improve programming? Put in your terms, is it worth one byte per character to avoid the if/then complexity inherent in using byte for some characters and short for others? I think that cost is cheap: even for a novel-length string the overhead is only half a megabyte, or about one cent's worth of RAM. Or compared to using short: does having a separate unsigned 16-bit type improve or simplify anything over using a signed 16-bit type for holding Unicode code points, so that the character "ꙁ" would be a negative number in Java and positive in reality? A matter of judgment.
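To make that sign point concrete (the character is just an arbitrary example with a code point above 0x7FFF):

char zemlya = 'ꙁ';                 // U+A641, code point 42561
short asSigned = (short) zemlya;   // the same 16 bits reinterpreted as a signed value
System.out.println((int) zemlya);  // 42561
System.out.println(asSigned);      // -22975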
As I understand it, Java keeps strings in UTF-16, which uses either 16 bits (for the BMP) or 32 bits per code point. But I am not sure if the class Character can be used for holding a code point that needs 32 bits. Reading http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html didn't help. So can it?
No, char and Character can't represent a code point outside the BMP. There's no specific type for this, but all the Java APIs just use int to refer to code points specifically as opposed to UTF-16 code units.
If you look at all the codePoint* methods in java.lang.Character, such as codePointAt(char[], int, int), you'll see that they use int.
In my experience, very little code (including my own) correctly takes account of this, instead assuming that it's reasonable to talk about the length of a string as being the number of UTF-16 code units in it. Having said that, "length" is a pretty hard-to-pin-down concept for strings, in that it doesn't mean the number of displayed glyphs, and different normalization forms of logically-equivalent text can consist of different numbers of code points...
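A short sketch of what this looks like in practice, using a character outside the BMP (U+1D11E, the musical G clef, chosen arbitrarily):

String s = "𝄞";                                       // U+1D11E, outside the BMP
System.out.println(s.length());                        // 2  -- UTF-16 code units (a surrogate pair)
System.out.println(s.codePointCount(0, s.length()));   // 1  -- one code point
int cp = s.codePointAt(0);                             // the code point as an int
System.out.println(Integer.toHexString(cp));           // 1d11e
char[] units = Character.toChars(cp);                  // back to the surrogate pair
System.out.println(units.length);                      // 2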
I'm trying to create a Java program that writes files for my Arduino to read. The Arduino is a simple 8-bit microcontroller board, and with some extra hardware, it can read text files from SD cards, byte by byte.
Turns out this was a whole lot harder than I thought. Firstly, there are no unsigned values in Java. Not even bytes, for some reason! Even trying to set a byte to 0xFF gives a possible loss of precision error! That isn't very useful for low-level code like this.
I would use ints and only use the positive values, but I like using byte overflow to my advantage in a lot of my code (though I could probably do this with a modulus right after the math operation or something). The biggest problem of all is that I have no idea how to add an int as an 8-bit character to a String that gets written to a file later. Output is currently my biggest problem.
So, what would be the best way to do unsigned bit math based on some user input and then write those bits to a file as if each one was an ASCII character?
So, here's how it works.
You can treat Java bytes as unsigned. The only places where signs make a difference are
constants: just cast them to bytes
toString and parseInt
division
<, >, >=, <=
Operations where signedness does not matter:
addition
subtraction
multiplication
bit arithmetic (except for >>, just use >>> instead)
To convert a byte to its unsigned value as an int, just mask with & 0xFF; to convert back, cast with (byte).
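For example (values chosen arbitrarily):

byte b = (byte) 0xFF;              // cast the constant; the stored bit pattern is 11111111
int unsigned = b & 0xFF;           // 255 -- the unsigned value as an int
byte sum = (byte) (unsigned + 10); // cast back; overflow wraps exactly as unsigned math would
int shifted = (b & 0xFF) >>> 1;    // 127 -- mask to the unsigned value before shifting

System.out.println(unsigned);      // 255
System.out.println(sum & 0xFF);    // 9  (255 + 10 = 265, which wraps to 9 mod 256)
System.out.println(shifted);       // 127

// And since OutputStream.write(int) keeps only the low 8 bits, the masked int can be
// handed straight to a FileOutputStream when you write the file for the Arduino.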
Alternatively, if third-party libraries are acceptable, you might be interested in Guava's UnsignedBytes utility class. (Disclosure: I contribute to Guava.)
I have often heard complaints against Java for not having unsigned data types. See for example this comment. I would like to know: how is this a problem? I have been programming in Java for 10 years, more or less, and never had issues with it. Occasionally when converting bytes to ints an & 0xFF is needed, but I don't consider that a problem.
Since unsigned and signed numbers are represented with the same bit values, the only places I can think of where signedness matters are:
When converting the numbers to other bit representation. Between 8, 16 and 32 bit integer types you can use bitmasks if needed.
When converting numbers to decimal format, usually to Strings.
Interoperating with non-Java systems through API's or protocols. Again the data is just bits, so I don't see the problem here.
Using the numbers as memory or other offsets. With 32-bit ints this might be a problem for very large offsets.
Instead I find it easier that I don't need to consider operations between unsigned and signed numbers and the conversions between those. What am I missing? What are the actual benefits of having unsigned types in a programming language and how would having those make Java better?
Occasionally when converting bytes to ints a & 0xFF is needed, but I don't consider that as a problem.
Why not? Is "applying a bitwise AND with 0xFF" actually part of what your code is trying to represent? If not, why should it have to be part of have you write it? I actually find that almost anything I want to do with bytes beyond just copying them from one place to another ends up requiring a mask. I want my code to be cruft-free; the lack of unsigned bytes hampers this :(
Additionally, consider an API which will always return a non-negative value, or only accepts non-negative values. Using an unsigned type allows you to express that clearly, without any need for validation. Personally I think it's a shame that unsigned types aren't used more in .NET, e.g. for things like String.Length, ICollection.Count etc. It's very common for a value to naturally only be non-negative.
Is the lack of unsigned types in Java a fatal flaw? Clearly not. Is it an annoyance? Absolutely.
The comment that you quote hits the nail on the head:
Java's lack of unsigned data types also stands against it. Yes, you can work around it, but it's not ideal and you'll be using code that doesn't really reflect the underlying data correctly.
Suppose you are interoperating with another system, which wants an unsigned 16 bit integer, and you want to represent the number 65535. You claim "the data is just bits, so I don't see the problem here" - but having to pass -1 to mean 65535 is a problem. Any impedance mismatch between the representation of your data and its underlying meaning introduces an extra speedbump when writing, reading and testing the code.
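As a minimal sketch of that mismatch:

short maxUnsigned = (short) 65535;         // the only way to hold "65535" in 16 bits
System.out.println(maxUnsigned);           // -1
System.out.println(maxUnsigned & 0xFFFF);  // 65535 -- the value you actually meant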
Instead I find it easier that I don't need to consider operations between unsigned and signed numbers and the conversions between those.
The only times you would need to consider those operations is when you were naturally working with values of two different types - one signed and one unsigned. At that point, you absolutely want to have that difference pointed out. With signed types being used to represent naturally unsigned values, you should still be considering the differences, but the fact that you should is hidden from you. Consider:
// This should be considered unsigned - so a value of -1 is "really" 65535
short length = /* some value */;
// This is really signed
short foo = /* some value */;
boolean result = foo < length;
Suppose foo is 100 and length is -1. What's the logical result? The value of length represents 65535, so logically foo is smaller than it. But you'd probably go along with the code above and get the wrong result.
Of course they don't even need to represent different types here. They could both be naturally unsigned values, represented as signed values with negative numbers being logically greater than positive ones. The same error applies, and wouldn't be a problem if you had unsigned types in the language.
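Without unsigned types in the language, getting the logically correct answer means masking by hand, which is exactly the cruft complained about above:

boolean unsignedResult = (foo & 0xFFFF) < (length & 0xFFFF);  // true: 100 < 65535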
You might also want to read this interview with Joshua Bloch (Google cache, as I believe it's gone from java.sun.com now), including:
Ooh, good question... I'm going to say that the strangest thing about the Java platform is that the byte type is signed. I've never heard an explanation for this. It's quite counterintuitive and causes all sorts of errors.
If you like, yes, everything is ones and zeroes. However, your hardware arithmetic and logic unit doesn't work that way. If you want to store your bits in a signed integer value but perform operations that are not natural to signed integers, you will usually waste both storage space and processing time.
An unsigned integer type stores twice as many non-negative values in the same space as the corresponding signed integer type. So if you want to take into Java any data commonly used in a language with unsigned values, such as a POSIX date value (unsigned number of seconds) that is normally used with C, then in general you will need to use a wider integer type than C would use. If you are processing many such values, again you will waste both storage space and fetch-execute time.
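As a concrete sketch of that widening (the value here is arbitrary), a 32-bit unsigned quantity coming from a C-style source has to be carried around as a Java long:

int raw = 0xF0000000;                 // the unsigned value 4,026,531,840; as a Java int it reads as -268,435,456
long asUnsigned = raw & 0xFFFFFFFFL;  // 4,026,531,840 -- correct, but now occupying 64 bits

System.out.println(raw);              // -268435456
System.out.println(asUnsigned);       // 4026531840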
The times I have used unsigned data types have been when I read in large blocks of data that correspond to images, or worked with OpenGL. I personally prefer unsigned if I know something will never be negative, as a "safety feature" of sorts.
Unsigned types are useful for bit-by-bit comparisons, and I'm pretty sure they are used extensively in graphics.
Why do some methods that write bytes/chars to streams take an int instead of a byte/char?
Someone told me in case of int instead of char:
because char in Java is just 2 bytes long, which is OK for most character symbols already in use, but certain character symbols (Chinese or whatever) are represented in more than 2 bytes, and hence we use int instead.
How close is this explanation to the truth?
EDIT:
I use the word stream to mean both binary and character streams (not just binary streams).
Thanks.
Someone told me in case of int instead of char: because char in Java is just 2 bytes long, which is OK for most character symbols already in use, but certain character symbols (Chinese or whatever) are represented in more than 2 bytes, and hence we use int instead.
Assuming that at this point you are talking specifically about the Reader.read() method, the statement from "someone" that you have recounted is in fact incorrect.
It is true that some Unicode code points have values greater than 65535 and therefore cannot be represented as a single Java char. However, the Reader API actually produces a sequence of Java char values (or -1), not a sequence of Unicode code points. This is clearly stated in the javadoc.
If your input includes a (suitably encoded) Unicode code point that is greater than 65535, then you will actually need to call the read() method twice to see it. What you will get will be a UTF-16 surrogate pair; i.e. two Java char values that together represent the codepoint. In fact, this fits in with the way that the Java String, StringBuilder and StringBuffer classes all work; they all use a UTF-16 based representation ... with embedded surrogate pairs.
The real reason that Reader.read() returns an int not a char is to allow it to return -1 to signal that there are no more characters to be read. The same logic explains why InputStream.read() returns an int not a byte.
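A small sketch of both points, using a code point above 65535 (U+1D11E, picked arbitrarily):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderDemo {
    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("𝄞");  // one code point, two Java chars

        int first = reader.read();   // 0xd834 -- high surrogate
        int second = reader.read();  // 0xdd1e -- low surrogate
        int end = reader.read();     // -1 -- end of stream, which is why the return type is int

        System.out.printf("%x %x %d%n", first, second, end);
    }
}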
Hypothetically, I suppose that the Java designers could have specified that the read() methods throw an exception to signal the "end of stream" condition. However, that would have just replaced one potential source of bugs (failure to test the result) with another (failure to deal with the exception). Besides, exceptions are relatively expensive, and an end of stream is not really an unexpected / exceptional event. In short, the current approach is better, IMO.
(Another clue to the 16 bit nature of the Reader API is the signature of the read(char[], ...) method. How would that deal with codepoints greater than 65535 if surrogate pairs weren't used?)
EDIT
The case of DataOutputStream.writeChar(int) does seem a bit strange. However, the javadoc clearly states that the argument is written as a 2-byte value. And in fact, the implementation clearly writes only the bottom two bytes to the underlying stream.
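You can see that truncation directly (the argument value below is arbitrary):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteCharDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);

        out.writeChar(0x12ABCD);     // takes an int, but only the low 16 bits are written

        byte[] written = buffer.toByteArray();
        System.out.printf("%02x %02x%n", written[0] & 0xFF, written[1] & 0xFF);  // ab cd
    }
}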
I don't think that there is a good reason for this. Anyway, there is a bug database entry for this (4957024), which is marked as "11-Closed, Not a Defect" with the following comment:
"This isn't a great design or excuse, but it's too baked in for us to change."
... which is kind of an acknowledgement that it is a defect, at least from the design perspective.
But this is not something worth making a fuss about, IMO.
I'm not sure exactly what you're referring to, but perhaps you are thinking of InputStream.read()? It returns an int instead of a byte because the return value is overloaded to also represent end of stream, which is represented as -1. Since there are 257 different possible return values, a byte is insufficient.
Otherwise, perhaps you could come up with some more specific examples.
There are a few possible explanations.
First, as a couple of people have noted, it might be because read() necessarily returns an int, and so it can be seen as elegant to have write() accept an int to avoid casting:
int read = in.read();
if (read != -1)
    out.write(read);
// vs
out.write((byte) read);
Second, it might just be nice to avoid other cases of casting:
// write a char (big-endian)
char c = 'A';  // any char value
out.write(c >> 8);
out.write(c);
// vs
out.write((byte) (c >> 8));
out.write((byte) c);
It's correct that the maximum possible code point is 0x10FFFF, which doesn't fit in a char. However, the stream methods are byte-oriented, while the writer methods are 16-bit. OutputStream.write(int) writes a single byte, and Writer.write(int) only looks at the low-order 16 bits.
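A minimal sketch of that difference (the stream and writer here are just illustrative):

import java.io.ByteArrayOutputStream;
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.Writer;

public class TruncationDemo {
    public static void main(String[] args) throws IOException {
        OutputStream out = new ByteArrayOutputStream();
        out.write(0x141);        // only the low-order byte, 0x41, reaches the stream

        Writer writer = new CharArrayWriter();
        writer.write(0x10041);   // only the low-order 16 bits are kept, so this writes the char 0x0041 ('A')
    }
}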
In Java, Streams are for raw bytes. To write characters, you wrap a Stream in a Writer.
While Writers do have write(int) (which writes the 16 low bits; it's an int because byte is too small, and short is too small due to it being signed), you should be using write(char[]) or write(String) instead.
Probably to be symmetric with the read() method, which returns an int. Nothing serious.