Comparing strings passed through socket UTF8

I have an interesting problem here.
First I have a UI in Java. The UI at one point connects to a rpi4 on the network via a socket. From there data is sent over the socket using .writeUTF(string).
On the rpi4 side, I'm running a simple Python 3 script. Its sole purpose is to spit out anything that comes over the socket and it does. But before it does I use recv.decode('utf-8') to decode the string.
From Java I send "fillOpen"
In python after decoding it prints "fillOpen"
The issue:
Performing a string compare in the python script on the decoded string always results in false. I have set it up as such:
Command = recv.decode('utf-8')
if Command == "fillOpen":
    # Do work
I have also tried not decoding the string and comparing it to an encoded string instead, like this:
Command = recv
fillOpenCommand = "fillOpen".encode('utf-8')
if fillOpenCommand == Command:
    # Do work
None of these comparisons result in true.
I have read that the Java writeUTF is a UTF8 encoding but slightly "different"?
Can I adjust the .writeUTF to work with the Python 3 decoder? Is there an alternative for sending data that can be parsed then have a string comp applied via Python that would work?
Thank you guys.

Assuming you are using the writeUTF method as defined in the Java DataOutput interface:
The output from writeUTF starts with two bytes of length information. You can skip it or you can use it to make sure you have received a complete message.
The easiest thing to do is to skip it:
Command = recv[2:].decode('utf-8')
If your commands are plain ASCII and don't contain things like user input, emojis, or musical notation, this is good enough. Otherwise, you still have a problem. writeUTF encodes each character outside the Basic Multilingual Plane as two separately encoded surrogates, which is not valid UTF-8, so decode('utf-8') will raise a UnicodeDecodeError. If I were you, in this case I would stop using writeUTF and start using methods that produce standard UTF-8 encoded data.
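As a rough sketch of using the length prefix instead of just skipping it: the helper below (parse_write_utf is a made-up name, not part of any library) reads the two-byte big-endian length, checks that the whole payload has arrived, and only then decodes. It assumes ASCII-only commands such as "fillOpen" and that the bytes in `frame` are what recv returned.

```python
import struct

def parse_write_utf(data: bytes) -> str:
    """Decode one writeUTF-style frame: 2-byte big-endian length, then the payload."""
    if len(data) < 2:
        raise ValueError("need at least 2 bytes for the length prefix")
    (length,) = struct.unpack(">H", data[:2])
    payload = data[2:2 + length]
    if len(payload) < length:
        raise ValueError("incomplete message")
    return payload.decode("utf-8")

# Bytes that Java's writeUTF("fillOpen") puts on the wire (assumed to arrive in one recv).
frame = b"\x00\x08fillOpen"

# Decoding the whole frame keeps the two prefix bytes as invisible control
# characters, which is why the printed text looks right but the comparison fails.
naive_matches = frame.decode("utf-8") == "fillOpen"

command = parse_write_utf(frame)
print(naive_matches, command == "fillOpen")
```

On a real socket, recv may return only part of a message, so the "incomplete message" case is a signal to keep reading rather than a fatal error.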

Related

'Communicate' in Python does not work

I'm trying to write a python program to test a java program that takes input from stdin using Scanner.
All other posts point to using communicate with Popen, but for me it absolutely does not work. When I run my Python program, it just calls Popen and then stops while the Java program waits for input. I wrote a print statement after Popen to check. It never prints.
It's very simple. I just want to give some input to this program that is waiting for it.
Here is the code:
import os.path, subprocess
from subprocess import PIPE
p = subprocess.Popen(['java', 'Main'], stdin=PIPE, stdout=PIPE)
print 'after subprocess' # this never gets printed
output = p.communicate(input='5 5 4 3 2 1'.encode())[0]
print output

Without more information (like some sample Java code) it's hard to be sure, but I'll bet the problem is that the Java code is waiting for a complete line, and you haven't sent one.
If so, the fix is simple:
output = p.communicate(input='5 5 4 3 2 1\n'.encode())[0]
As a side note, why exactly are you calling encode on that string? It's already encoded in whatever character set your source code uses. So, when you call encode, it has to first decode that to Unicode. And then, because you didn't pass an argument to encode, it's going to encode it to your default character set (sys.getdefaultencoding()), which doesn't seem any more likely to match what the Java code is expecting than what you already have. It's rarely worth calling encode without an argument, and you should almost* never call it on a str, only a unicode.
* In case you're wondering, the exception is when you're using a handful of special codecs like hex or gzip. In Python 3, they decided that the occasional usefulness of those special cases was nowhere near as much as the frequent bug-magnet of calling encode on already-encoded strings, so they took it out of the language.
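For comparison, here is a small Python 3 sketch of the same pattern. Since the Java program isn't shown, a one-line Python child stands in for it (an assumption purely for illustration); it reads one line from stdin and prints the sum of the numbers on it. In Python 3 the input is passed as bytes, so no encode guesswork is involved.

```python
import subprocess
import sys

# Stand-in for the Java program: reads one line and prints the sum of its numbers.
child_code = "print(sum(int(x) for x in input().split()))"

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

# The trailing newline terminates the line the child reads; communicate()
# then closes stdin and waits for the child to exit.
output, _ = proc.communicate(input=b"5 5 4 3 2 1\n")
result = output.decode().strip()
print(result)
```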

How do I decode data from a TCP socket

I am trying to make a very simplistic chat program with a server made in python and the client in java. However I have no idea how to decode the data which the server receives from the client. The client sends and encodes to UTF-8.
Just printing it looks like this: http://i.imgur.com/0usK6j7.jpg
And decoding from UTF-8 first it looks like this: http://i.imgur.com/Ctwivl4.jpg
I assume that the NUL character (\x00) can be removed, and the same goes for the b'' which wraps the entire message. The second character seems to specify the length of the message. But how do I decode this? Should I just remove characters manually? I know this is quite a basic question and has probably been asked before, but I don't even know what to search for.

In the Java client I have a DataOutputStream object which I use with this method: out.writeUTF(input);
According to the documentation of that method, it doesn't write plain UTF-8 to the output stream. It says "First, two bytes are written to the output stream", which explains the 16-bit lengths that precede your strings. And even after that it doesn't write UTF-8; it writes Java's own idiosyncratic encoding, which it calls Modified UTF-8 and which is actually a variant of CESU-8, not UTF-8.
So first of all, you need to clarify what format exactly you wish to use to communicate between the client and server: the protocol. Is it plain UTF-8? Is it the bizarre structured encoding that writeUTF emits? Is it something else? Then write both your client and server to follow that specification.
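If the protocol you settle on is the writeUTF format, the Python side has to undo Modified UTF-8 itself. The sketch below is one possible way to do that with standard codecs only; decode_modified_utf8 is a hypothetical helper name, and it expects the payload with the two length bytes already removed.

```python
def decode_modified_utf8(payload: bytes) -> str:
    """Decode the payload part of a writeUTF message (length prefix already stripped)."""
    # Java writes U+0000 as the two-byte sequence C0 80; map it back to a real NUL byte.
    payload = payload.replace(b"\xc0\x80", b"\x00")
    # 'surrogatepass' lets the UTF-8 codec accept the individually encoded surrogates.
    text = payload.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 so each surrogate pair becomes a single code point.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")

# U+1F600 as writeUTF would emit it: two 3-byte surrogate sequences.
emoji_bytes = b"\xed\xa0\xbd\xed\xb8\x80"
print(decode_modified_utf8(emoji_bytes) == "\U0001F600")
```

Switching the Java client to plain UTF-8 bytes avoids the need for this workaround altogether.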

Java unreadable strings

I have made a Java socket listener which listens on port 80. What it basically does is gather the data that arrives on port 80 and store it in a temporary string, which is then used for further operations (type conversions et al.). The basic problem is that the data that comes in on port 80 has parts that are unreadable (like # [ Qô — z ‡ ). When I store it in a string and print the string, it prints only the readable parts, which is understandable, but what puzzles me is that when I print the length of the string, it only gives the length of the readable part. So I want to know if my approach of storing unreadable string parts in a string is acceptable to enable further operations on them. If not, I would also like some pointers as to how I could store such incoming data.
Regards
p1nG

Something does not make sense here. If you are storing the "unreadable" part of the data in the String, it will be reflected in the length of the String.
"I want to know if my approach of storing unreadable string parts in a string is acceptable to enable further operations on them. If not, I would also like some pointers as to how I could store such incoming data."
It depends on why the data is unreadable.
One possibility is that the remote system is sending data in some unexpected character set or encoding. For example, if it is sending Latin-1 and you are expecting UTF-8 (or vice versa), some sections of the text may be unreadable. The solution is to figure out what character set and encoding the remote system is sending, and use the correct Java charset name when converting to Java characters.
Another possibility is that some of the data is binary data. If so, you should separate the text from the binary data, based on the application protocol used by the remote system.
Finally, the unreadable stuff might be caused by line noise or such like. If that's the case, you should probably leave it intact.
An alternative approach is to use a byte array (or something similar) rather than a String to hold the data. The problem with trying to convert bytes to characters when you are not sure of the character set and encoding is that the conversion may be lossy. By storing the raw bytes, your application at least has the possibility of getting it right later ... when you figure out what the correct conversion is.
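The lossy-conversion point can be seen in a few lines. This is only an illustration in Python rather than Java, and the sample bytes are made up: they stand for the word "café" sent as Latin-1 by a hypothetical remote system.

```python
# Raw bytes as received: 'café' encoded as Latin-1 (0xE9 is the accented e).
raw = b"caf\xe9"

# Guessing the wrong encoding and substituting for bad bytes loses information:
# the 0xE9 byte is replaced by U+FFFD and cannot be recovered from the string.
lossy = raw.decode("utf-8", errors="replace")

# Keeping the raw bytes around means the right decoding can still be applied later.
later = raw.decode("latin-1")

print(repr(lossy), repr(later))
```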

You can store the data in a java.nio.ByteBuffer to avoid all the string wackiness...
If it's truly text being sent in some wide character encoding, you'll want to convert the byte buffer into a string using the appropriate character set with the handy java.nio.charset.Charset.decode

Problem transmitting null character over sockets

I am writing a small Java server, and a matching client in C++, which implement a simple IM service over the STOMP protocol.
The protocol specifies that every frame (message that passes between server and client, if you will) must end with a null character, which in code I refer to as '\0', both in Java and in C++.
However, when I transmit a frame over TCP via sockets, the null character simply does not show up, on either side. I am working with UTF-8 encoding, and tried switching to ASCII, didn't help.
What am I doing wrong?

Whether you are encoding text in ASCII or UTF-8, you convert your "letters" to a stream of bytes (byte encodings). You need to add a ZERO byte to the end of the message strings.
[Guessing] You may be using a high-level library with a method like "WriteLine(String line)" to send the data over the network. The documentation for that method will describe what bytes are actually sent, which typically includes the message text encoded in the current encoding (ASCII, UTF-8, etc.) followed by a line termination sequence, which is typically the byte 13, the byte 10, or a combination of them ('\r', '\n', or '\r\n').
Use the low-level Write() method or WriteBytes() method (depending on your libraries). Convert the text to ASCII or UTF-8, add the zero byte to the end, and send exactly what you want to send.
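The idea is language-independent, so here is a minimal sketch of it in Python rather than Java or C++. The helper names encode_frame and split_frames are invented for this example, and it ignores STOMP details such as bodies that themselves contain zero bytes.

```python
def encode_frame(text: str) -> bytes:
    """Encode the frame text and append the terminating zero byte explicitly."""
    return text.encode("utf-8") + b"\x00"

def split_frames(buffer: bytes):
    """Split received bytes on the zero byte.

    Returns the complete frames as text plus any trailing partial frame as bytes.
    """
    *complete, rest = buffer.split(b"\x00")
    return [frame.decode("utf-8") for frame in complete], rest

wire = encode_frame("CONNECT\n\n") + encode_frame("DISCONNECT\n\n")
frames, leftover = split_frames(wire)
print(frames, leftover)
```

Working at the byte level on both sides makes the terminator just another byte to send and to look for, instead of something a string API may silently drop.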

I'd recommend downloading Wireshark and monitoring the transmission to see if the problem is on the sending or the receiving end.

Are you transmitting a buffer or a string? If you transmit a string, the null character will be terminating the string and won't be transmitted. Using a buffer, you can specify how many bytes you want to transmit and include the null character.
Of course, the problem can be both on the transmission and the receiving side.

The first thing you need to do is use Wireshark (or something similar) as suggested by Spencer.
If it's a transmit-side issue, double check that you are encoding the message properly by adding appropriate diagnostic traces to your code.
If it's a receive-side issue, is there a way to set up the delimiting character on the receive socket? There might be a socket option that says whether to include or exclude the delimiting character. Maybe it's being transmitted properly, but the receive socket is stripping it off.

You are in C++?
A newbie mistake is putting the NUL at the end of the string: "Foo!\x00".
Of course, the thing that writes the string treats that first NUL as the end of the string and does not transmit it. You need to write the NUL character '\x00' explicitly as a character (with putChar or however C++ does it), not as part of a string.

Decoding split 16-bit character in Java

In my application, I receive a URL-UTF8 encoded string of characters, which is split up by the sending client. After splitting, each message part includes some header information which is meant to be used to reconstruct the message.
With English characters, it's pretty straightforward
String content = new String(request.getParameter("content").getBytes("UTF-8"));
I store this along with the header information in a buffer for each received part. When all parts have been received, I simply recompose the message by concatenating the individual parts according to the header information.
With languages that use 16-bit encodings this is sometimes not working as expected. Everything works fine if the split does NOT happen in the middle of a single character.
For instance here's a string of three Hebrew characters being sent by the client:
%D7%93%D7%99%D7%91
If this winds up split as follows: {%D7%93%D7%99} {%D7%91}, reconstruction isn't a problem.
However sometimes the client splits it up in the middle (example: {%D7%93%D7} {%99%D7%91})
When this happens, after reconstruction I get two � characters at the boundary point instead of the single correct Hebrew character.
I thought the inability to correctly retain the single byte information was related to passing around strings, so I tried passing around byte array from request.getParameter("content").getBytes("UTF-8") to the buffer without wrapping in the string joining together the byte arrays. In the buffer I joined all these arrays BEFORE converting the final array to a string.
Even after doing this, it appears I still "lost" that information held by the single bytes. I'm guessing this is because the getBytes("UTF-8") method can't correctly resolve the single bytes since they are not valid characters. Is that right?
Is there any way I can get around this and preserve these tail/head bytes?

Your client is the problem here. Apparently it treats the text data as a byte array for the purpose of splitting it up, and then sending the invalid fragments as text (HTTP request parameters are inherently textual). At that point, you have already lost.
You either have to change the client to split the data as text (i.e. along character boundaries), or change your protocol to send the fragments as binary data, i.e. not as a parameter but as the request body, to be retrieved via ServletRequest.getInputStream() - then, concatenating the data before decoding it should work.
(Caveat: the above assumes that you are indeed writing Servlet code, which I inferred from the request.getParameter() method; but even if that's a coincidence the same principles apply: either split the data as a String before any conversion to byte[] happens on the client side, or make sure you concatenate the byte arrays on the server before any conversion to String happens.)

You must first collect all bytes and then convert them all at once into a string.
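A short sketch of that rule, written in Python for brevity rather than as servlet code: the two fragments are the badly split example from the question, and urllib.parse is used only to turn the percent-escapes into raw bytes.

```python
from urllib.parse import unquote, unquote_to_bytes

# The split from the question, cut in the middle of the second character.
parts = ["%D7%93%D7", "%99%D7%91"]

# Decoding each part to text on its own damages the character at the boundary:
# each half of the split character turns into U+FFFD.
broken = "".join(unquote(part) for part in parts)

# Collecting the raw bytes first and decoding once keeps the character intact.
raw = b"".join(unquote_to_bytes(part) for part in parts)
text = raw.decode("utf-8")

print(len(broken), len(text))
```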

The following scheme is a hack, but it should work in your case.
Set your server/page to Latin-1 mode. If this is a GET, the client has no way to set the encoding, so you have to do this on the server's end. For example, you need to add URIEncoding="iso-8859-1" to the connector for Tomcat.
Get the content as Latin-1. It will be the wrong value at this point, but don't worry:
String content = request.getParameter("content");
Concatenate the strings as Latin-1:
data = data + content;
When you have the whole thing, convert the string back to bytes as Latin-1 and decode those bytes as UTF-8, like this:
String value = new String(data.getBytes("iso-8859-1"), "utf-8");
The value should contain the correct characters.
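The reason this hack works is that Latin-1 maps every byte value to exactly one character, so a Latin-1 string is a lossless stand-in for the raw bytes. A minimal Python illustration of the same round trip, using the split bytes from the question as assumed input:

```python
# The two fragments from the question as raw bytes, split mid-character.
fragments = [bytes.fromhex("d793d7"), bytes.fromhex("99d791")]

# Decode each fragment as Latin-1: the text looks wrong, but no byte is lost.
data = ""
for fragment in fragments:
    data += fragment.decode("latin-1")

# Turn the concatenated string back into the original bytes, then decode as UTF-8.
value = data.encode("latin-1").decode("utf-8")
print(len(data), len(value))
```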

You never need to convert a String to bytes and then back to a String in Java; it is completely pointless. Once a series of bytes has been decoded to a String, it is in Java's String encoding (UTF-16, I think).
The problem you have is that the application server is making an assumption about the encoding of the incoming HTTP request, usually the platform encoding. You can give the application server a hint as to the expected encoding by calling ServletRequest.setCharacterEncoding(String) before anything else calls getParameter().
Browsers assume that form fields should be submitted back to the server using the same encoding that the page was served with. This is a general rule, as the HTTP spec doesn't have a way to specify the encoding of the incoming request, only the response.
Spring has a nice filter to do this for you, CharacterEncodingFilter. If you define it as the very first filter in web.xml, most of your encoding issues will go away.
