Java unreadable strings

I have made a Java socket listener which listens on port 80. What it basically does is gather the data it receives on port 80 and store it in a temporary String, which is then used for further operations (type conversions and so on). Now the basic problem is that the data that comes in on port 80 has parts that are unreadable (like # [ Qô — z ‡ ). Since I'm storing it in a String, when I print the String it prints only the readable parts, which is understandable; but what puzzles me is that when I print the length of the String, it only reports the length of the readable part. So I want to know whether my approach of storing unreadable parts in a String is acceptable for further operations on them. If not, I would also like some pointers as to how I could store such incoming data.
Regards
p1nG

Something does not make sense here. If you are storing the "unreadable" part of the data in the String, it will be reflected in the length of the String.
i want to know if my approach of storing unreadable string parts in a string is acceptable to enable further operations on them. If not, I would also like some pointers as to how I could store such incoming data.
It depends on why the data is unreadable.
One possibility is that the remote system is sending data in some unexpected character set or encoding. For example, if it is sending Latin-1 and you are expecting UTF-8 (or vice versa), some sections of the text may be unreadable. The solution is to figure out what character set and encoding the remote system is sending, and use the correct Java charset name when converting to Java characters.
Another possibility is that some of the data is binary data. If so, you should separate the text from the binary data, based on the application protocol used by the remote system.
Finally, the unreadable stuff might be caused by line noise or such like. If that's the case, you should probably leave it intact.
An alternative approach is to use a byte array (or something similar) rather than a String to hold the data. The problem with trying to convert bytes to characters when you are not sure of the character set and encoding is that the conversion may be lossy. By storing the raw bytes, your application at least has the possibility of getting it right later ... when you figure out what the correct conversion is.
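A minimal byte-oriented sketch of that idea (the class name, port handling and buffer size are illustrative, not from the original post):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class RawListener {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(80);
             Socket client = server.accept();
             InputStream in = client.getInputStream()) {
            // Accumulate the raw bytes; no charset is applied here, so nothing is lost or replaced.
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                raw.write(buffer, 0, n);
            }
            byte[] data = raw.toByteArray();
            System.out.println("received " + data.length + " bytes");
            // Convert to a String only once the real charset is known, e.g.
            // new String(data, java.nio.charset.StandardCharsets.UTF_8)
        }
    }
}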

You can store the data in a java.nio.ByteBuffer to avoid all the string wackiness.
If it's truly text being sent in some wide character encoding, you'll want to convert the byte buffer into a string using the appropriate character set with the handy java.nio.charset.Charset.decode.
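For example (a hedged sketch; the sample bytes and the Latin-1 guess are made up, you would substitute whatever the remote system actually sends):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class DecodeDemo {
    public static void main(String[] args) {
        // Pretend these bytes came off the socket; 0xE9 is 'é' in Latin-1.
        byte[] fromSocket = { 0x48, 0x65, 0x6C, 0x6C, 0x6F, (byte) 0xE9 };
        ByteBuffer buf = ByteBuffer.wrap(fromSocket);
        // Decode only once the real charset is known; Latin-1 is just a guess here.
        CharBuffer text = Charset.forName("ISO-8859-1").decode(buf);
        System.out.println(text.length() + " chars: " + text);
    }
}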

Related

Java library to fix incorrectly encoded text using heuristics

I'm dealing with an external web service that is giving me incorrectly encoded (and/or corrupted) UTF-8 Strings that were most likely either ISO Latin or Windows-1252 but are now UTF-8 (and/or a mixture of ISO/Windows/UTF-8). Lovely A-hats (Â) abound.
I obviously cannot fix how the external web service stores its strings, so the information is lost. Thus I know a 100% translation is not possible.
But I was hoping that someone had written a heuristic character-mapping library in Java (it's unlikely anyone would deliberately type A-hats).
If not, I guess I can port this guy's PHP code: https://stackoverflow.com/a/3521340/318174
UPDATE and explanation: A simple conversion like the one #VGR answered with will not work. I do not have the original bytes. The data was converted incorrectly at the endpoint (on the SOAP server maybe getBytes(/* without the correct encoding */) was done, or maybe the data is stored in the wrong format). When you convert bytes to Strings and back in Java, the data is not retained unless the encoding is the same everywhere. This is easy to understand if you think of something like ASCII <-> UTF-8. With Windows-1252 or ISO Latin it is much more complicated, because the data is not so much lost as confused: non-ASCII characters in those encodings turn into multi-byte sequences in UTF-8, so they are not a simple subset of it.
If you don't believe me, you can try doing getBytes() back and forth with various encodings and you will see data corruption and data loss.
I may be misunderstanding the nature of the incorrectly encoded data, but that PHP code seems like overkill to me. If you have UTF-8 bytes that were passed as individual characters, you should be able to just do:
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String fix(String s) {
    byte[] bytes = s.getBytes(Charset.forName("windows-1252"));
    return new String(bytes, StandardCharsets.UTF_8);
}
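A hedged usage example (the input is hypothetical mojibake, not data from the question): "é" encoded as UTF-8 is the two bytes 0xC3 0xA9, which mis-decoded as windows-1252 show up as "Ã©".

String broken = "\u00C3\u00A9";   // "Ã©"
String repaired = fix(broken);    // "é", assuming the corruption really was UTF-8 read as windows-1252
System.out.println(repaired);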

Can a file be encoded in multiple charsets in Java?

I'm working on a Java plugin which would allow people to write to and read from a file by specifying a charset encoding they wish to use. However, I was confused as to how I would encode multiple encodings in a single file. For example, suppose that the A characters come from one charset and the B characters come from another; would it be possible to write "AAAAABBBBBAAAAA" to a file?
If it is not possible, is this generally true for any programming language, or specifically for Java? And if it is possible, how would I then proceed to read (decode) the file?
I do not want to use the encode() and decode() methods of Charset, since my tests with them have failed (some charsets were not decoded properly). I also don't want to use third-party programs, for various reasons, so the scope of this question is purely the standard Java packages/code.
Thanks a lot!
N.S.
You'd need to read it as a byte stream and know beforehand at which byte positions the character groups start and end, or use some special separator character/byte range which indicates the start and end of each character group. That way you can get the bytes of the specific character group and finally decode them using the desired character encoding.
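A hedged sketch of that separator idea (the zero-byte separator and the two charsets are arbitrary choices for illustration):

import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MixedEncodingDemo {
    private static final byte SEPARATOR = 0x00; // must never occur inside the encoded text

    public static void main(String[] args) throws Exception {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset utf8 = StandardCharsets.UTF_8;

        // Write: "AAAAA" in Latin-1, a separator byte, then "BBBBB" in UTF-8.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write("AAAAA".getBytes(latin1));
        file.write(SEPARATOR);
        file.write("BBBBB".getBytes(utf8));
        byte[] stored = file.toByteArray();

        // Read: find the separator, then decode each group with its own charset.
        int cut = 0;
        while (stored[cut] != SEPARATOR) {
            cut++;
        }
        String partA = new String(stored, 0, cut, latin1);
        String partB = new String(stored, cut + 1, stored.length - cut - 1, utf8);
        System.out.println(partA + partB);
    }
}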
This problem is not specific to Java. The requirement is just strange; I wonder how it makes sense to mix character encodings like that. Just use one uniform encoding all the time, for example UTF-8, which supports practically every character mankind is aware of.
Of course it is in principle possible to write text that is encoded in different character sets into one file, but why would you ever want to do this?
A character encoding is simply a mapping from text characters to bytes and vice versa. A file consists of bytes. When writing a file, the character encoding determines how the characters are converted to bytes, and when reading, it determines how the bytes are converted back to characters.
You could have one part of the file encoded with one character encoding, and another part with another character encoding. You'd have to have some mechanism to keep track of what parts are encoded with what encoding, because the file doesn't automatically keep track of that for you.
I was wondering about this as well, because my client just asked a similar question. Like BalusC mentioned this is not a java specific problem.
After a bit of back and forth, I found the real question might be 'multiple encodings of the information', rather than a file with multiple encodings.
For example, we have an XML string whose content needs to be encoded with 8859-1. If we save it as a file, we need to encode it. The default encoding for XML is UTF-8, so we don't necessarily need to encode the whole XML document as 8859-1: the XML node is just a vehicle for passing information over to the other system, and it is the content (the value of the XML node) that needs to be persisted as 8859-1. So do we need multiple encodings in this case? Probably not. We can still encode the XML as UTF-8 and pass it over; once the client receives the XML, they read the information out of the UTF-8 encoded file and persist the value of the XML node as 8859-1.

How can I find the byte encoding of a TIBCO Rendezvous message?

In my Java application, I am archiving TIBCO RV messages to a file as bytes.
I am writing a small utility app that will play the messages back. This way I can just create a TibrvMsg object from the bytes without having to parse the file and construct the object manually.
The problem I am having is that I am reading a file that was created on a Linux box, and attempting to run my app on a Windows machine. I get an error due to the different charset the file was written in.
So now, what I want to do is log each message in a specific charset (UTF-8), so that I don't care what platform I run my playback app on. The app should just read in the file, knowing beforehand the charset the file is written in. I am planning on using the java.nio packages for this, to transform the bytes from one charset to another.
Do I need to know what charset the TIBRV message bytes are encoded in to do the transformation? If so, how can I find this out?
You are taking opaque data and, it would appear, attempting to write it to a file as textual data without escaping the non-textual portions of it (alternatively, you are writing it as raw bytes and then trying to read it as if it were character-based, which is much the same problem).
This is flawed from the very start.
Opaque data should be treated as meaningless and simply stored without modification to give back to an API that does know how to deal with it. If the data must be stored in a textual form then you must losslessly convert the bytes into text. Appropriate encodings are things like base64. Encoding in the sense of character set encoding is NOT lossless if you apply it to raw binary data.
Simply storing the bytes in a file as bytes (not characters) along with a fixed length prefix indicating the length of the message and the subject it was sent on is sufficient to replay RV messages through the system.
As for any text-based fields inside the message: if the encoding matters (I strongly suggest designing the app so that it doesn't), then you have the same problem on replay as you would have had at the original receipt time, namely converting from the source encoding to the desired encoding (hopefully using exactly the same code), so this should be a non-issue for the replaying.
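A rough sketch of that length-prefixed layout (the record format and class name are assumptions, not the poster's actual file format):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class MessageArchive {
    // Write one record: the subject as length-prefixed UTF-8, then the payload as raw bytes.
    static void writeRecord(DataOutputStream out, String subject, byte[] messageBytes) throws IOException {
        byte[] subjectBytes = subject.getBytes(StandardCharsets.UTF_8);
        out.writeInt(subjectBytes.length);
        out.write(subjectBytes);
        out.writeInt(messageBytes.length);
        out.write(messageBytes); // opaque bytes, no charset applied
    }

    // Read one record back; the returned bytes can be handed straight to the replay API.
    static byte[] readRecord(DataInputStream in, StringBuilder subjectOut) throws IOException {
        byte[] subjectBytes = new byte[in.readInt()];
        in.readFully(subjectBytes);
        subjectOut.append(new String(subjectBytes, StandardCharsets.UTF_8));
        byte[] messageBytes = new byte[in.readInt()];
        in.readFully(messageBytes);
        return messageBytes;
    }
}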
This is probably related to Java string encoding, not TIBRV. Though there's this in the documentation:
Strings and Character Encodings
--------------------------------------------------------------------------------
Rendezvous software uses strings in several roles:
* String data inside message fields
* Field names
* Subject names (and other associated strings that are not strictly inside the message)
* Certified delivery correspondent names
* Group names (fault tolerance)
All these strings (both in C and in wire format) use the character
encoding appropriate to the ISO locale of the sender. For example,
the United States is locale en_US, and uses the Latin-1 character
encoding (also called ISO 8859-1); Japan is locale ja_JP, and uses
the Shift-JIS character encoding.
When two programs exchange messages within the same locale, strings
are always correct. However, when a message sender and receiver use
different character encodings, the receiving program must convert
between encodings as needed. Rendezvous software does not convert
automatically.
EBCDIC
For information about string encoding in EBCDIC environments,
see tibrv_SetCodePages() .
So you might want to look at the locale of the machines.
As this (admittedly rather old) mailing list message indicates, little is known about the internal structure of that network protocol. This might make it quite a challenge to do what you're after.
That said, if the messages are just binary blocks of data (as captured from the network), they shouldn't even have a charset. Charsets are for textual data, where they matter because a single character can be encoded in many different ways. Binary data is not composed of characters, so there cannot be an encoding in that sense.
Do I need to know what charset the TIBRV message bytes are encoded in to do the transformation?
Yes. A charset is a method of transforming text into a byte stream and vice versa. Your network data is a byte stream, so when you interpret parts of it as text, you ARE (implicitly or explicitly) using a charset - the question is whether it is the correct one.
Transforming bytes from one charset to another basically means converting them to text using one charset and then back to bytes using another. Note that this can change the length of the data, since many charsets use more than one byte for some characters. In the context of network messages, this can be problematic when it invalidates length fields or causes text fields to overflow. It's probably better not to do any transformation and instead teach the reading app how to deal with varying charsets.
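For illustration, transcoding a small chunk of Latin-1 text to UTF-8 looks roughly like this (sample bytes made up), and note that the length does change:

import java.nio.charset.StandardCharsets;

byte[] latin1Bytes = { 0x63, 0x61, 0x66, (byte) 0xE9 };            // "café" in ISO-8859-1: 4 bytes
String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);           // 5 bytes: 'é' becomes 0xC3 0xA9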
If so, how can I find this out?
Look at the protocol specification.
Read everything into a byte[] from an InputStream, and write the byte[] to a FileOutputStream.
No Reader or Writer should be involved; they do character conversion, and that is wrong here.
Stay away from java.nio until you understand java.io.
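A minimal java.io sketch of that (the method and stream names are placeholders):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Copy the raw bytes untouched; no Reader/Writer means no charset conversion happens.
void copyRaw(InputStream in, String fileName) throws IOException {
    try (OutputStream out = new FileOutputStream(fileName)) {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}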

Problem transmitting null character over sockets

I am writing a small Java server, and a matching client in C++, which implement a simple IM service over the STOMP protocol.
The protocol specifies that every frame (message that passes between server and client, if you will) must end with a null character, which in code I refer to as '\0', both in Java and in C++.
However, when I transmit a frame over TCP via sockets, the null character simply does not show up on either side. I am working with UTF-8 encoding, and I tried switching to ASCII; it didn't help.
What am I doing wrong?
Whether you are encoding text as ASCII or UTF-8, you convert your "letters" to a stream of bytes (byte encodings). You need to add a zero byte to the end of the message strings.
[Guessing] You may be using a high-level library with a method like "WriteLine(String line)" to send the data over the network. The documentation for that method will describe what bytes are actually sent, which is typically the message text in the current encoding (ASCII, UTF-8, etc.) followed by a line-termination sequence, typically byte 13, byte 10, or a combination of them ('\n', '\r\n').
Use the low-level Write() method or WriteBytes() method (depending on your library). Convert the text to ASCII or UTF-8, add the zero byte to the end, and send exactly what you want to send.
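A minimal sketch of the Java side (the method name is made up; the point is the explicit zero byte written after the encoded text):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

void sendFrame(OutputStream socketOut, String frame) throws IOException {
    socketOut.write(frame.getBytes(StandardCharsets.UTF_8)); // the text of the frame
    socketOut.write(0);                                       // the terminating NUL byte
    socketOut.flush();
}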
I'd recommend downloading Wireshark and monitoring the transmission to see if the problem is on the sending or the receiving end.
Are you transmitting a buffer or a string? If you transmit a string, the null character will terminate the string and won't be transmitted. Using a buffer, you can specify how many bytes you want to transmit and include the null character.
Of course, the problem can be both on the transmission and the receiving side.
The first thing you need to do is use Wireshark (or something similar) as suggested by Spencer.
If it's a transmit-side issue, double check that you are encoding the message properly by adding appropriate diagnostic traces to your code.
If it's a receive-side issue, is there a way to set up the delimiting character on the receive socket? There might be a socket option that says whether to include or exclude the delimiting character. Maybe it's being transmitted properly, but the receive socket is stripping it off.
You are in C++?
A newbie mistake is putting the NUL at the end of the string: "Foo!\x00".
Of course, the thing that writes the string treats that first NUL as the end of the string and does not transmit it. You need to write the NUL character '\x00' explicitly as a character (with putChar or however C++ does it), not as part of a string.

Decoding split 16-bit character in Java

In my application, I receive a URL-encoded UTF-8 string, which is split up by the sending client. After splitting, each message part includes some header information which is meant to be used to reconstruct the message.
With English characters, it's pretty straightforward:
String content = new String(request.getParameter("content").getBytes("UTF-8"));
I store this, along with the header information, in a buffer for each received part. When all parts have been received, I simply recompose the message by concatenating the individual parts according to the header information.
With languages whose characters take more than one byte in UTF-8, this sometimes does not work as expected. Everything works fine if the split does NOT happen in the middle of a single character.
For instance here's a string of three Hebrew characters being sent by the client:
%D7%93%D7%99%D7%91
If this winds up split as follows: {%D7%93%D7%99} {%D7%91}, reconstruction isn't a problem.
However, sometimes the client splits it up in the middle (for example: {%D7%93%D7} {%99%D7%91}).
When this happens, after reconstruction I get two � characters at the boundary point instead of the single correct Hebrew character.
I thought the inability to correctly retain the single-byte information was related to passing Strings around, so I tried passing the byte array from request.getParameter("content").getBytes("UTF-8") to the buffer without wrapping it in a String, and joining the byte arrays together. In the buffer I joined all these arrays BEFORE converting the final array to a string.
Even after doing this, it appears I still "lost" the information held by those single bytes. I'm guessing this is because getBytes("UTF-8") can't correctly resolve the lone bytes, since they are not valid characters. Is that right?
Is there any way I can get around this and preserve these tail/head bytes?
Your client is the problem here. Apparently it treats the text data as a byte array for the purpose of splitting it up, and then sends the invalid fragments as text (HTTP request parameters are inherently textual). At that point, you have already lost.
You either have to change the client to split the data as text (i.e. along character boundaries), or change your protocol to send the fragments as binary data, i.e. not as a parameter but as the request body, to be retrieved via ServletRequest.getInputStream() - then, concatenating the data before decoding it should work.
(Caveat: the above assumes that you are indeed writing Servlet code, which I inferred from the request.getParameter() method; but even if that's a coincidence the same principles apply: either split the data as a String before any conversion to byte[] happens on the client side, or make sure you concatenate the byte arrays on the server before any conversion to String happens.)
You must first collect all bytes and then convert them all at once into a string.
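A hedged sketch of that approach (names are illustrative; the parts are assumed to arrive in order):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

String reassemble(List<byte[]> partsInOrder) throws IOException {
    ByteArrayOutputStream all = new ByteArrayOutputStream();
    for (byte[] part : partsInOrder) {
        all.write(part); // just bytes; a UTF-8 sequence split across parts is still intact here
    }
    // Decode exactly once, after every fragment has arrived.
    return new String(all.toByteArray(), StandardCharsets.UTF_8);
}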
The following scheme is a hack, but it should work in your case:
1. Set your server/page to Latin-1 mode. If this is a GET, the client has no way to set the encoding; you have to do this on the server's end. For example, with Tomcat you need to add URIEncoding="iso-8859-1" to the connector.
2. Get the content as Latin-1. It will be the wrong value at this point, but don't worry:
String content = request.getParameter("content");
3. Concatenate the string as Latin-1:
data = data + content;
4. When you get the whole thing, re-encode the string as UTF-8 like this:
String value = new String(data.getBytes("iso-8859-1"), "utf-8");
The value should contain the correct characters.
You never need to convert a String to bytes and then back to a String in Java; it is completely pointless. Once a series of bytes has been decoded to a String, it is in Java's internal String encoding (UTF-16).
The problem you have is that the application server is making an assumption about the encoding of the incoming HTTP request, usually the platform encoding. You can give the application server a hint as to the expected encoding by calling ServletRequest.setCharacterEncoding(String) before anything else calls getParameter().
Browsers assume that form fields should be submitted back to the server using the same encoding that the page was served with. This is a general rule, as the HTTP spec doesn't have a way to specify the encoding of the incoming request, only the response.
Spring has a nice Filter that does this for you, CharacterEncodingFilter; if you define it as the very first filter in web.xml, most of your encoding issues will go away.
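If you don't use Spring, a minimal servlet Filter doing the same thing might look like this (a sketch against the javax.servlet API; the class name and the UTF-8 choice are assumptions):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class ForceUtf8Filter implements Filter {
    @Override public void init(FilterConfig filterConfig) { }
    @Override public void destroy() { }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        // Must run before anything calls getParameter(), or the hint is ignored.
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding("UTF-8");
        }
        chain.doFilter(request, response);
    }
}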
