My team and I have a nasty problem parsing a string received from our server. The server is pretty simple Qt socket code; here is the sendData function:
void sendData(QTcpSocket *client, QString response){
    QString text = response.toUtf8();
    QByteArray block;
    QDataStream out(&block, QIODevice::WriteOnly);
    out << (quint32)0;
    out << text;
    out.device()->seek(0);
    out << (quint32)(block.size() - sizeof(quint32));
    try{
        client->write(block);
    }
    catch(...){...
The client is in Java and is also pretty standard socket code. Here is where we are now, after trying many different ways of decoding the response from the server:
Socket s;
try {
    s = new Socket(URL, 1987);
    PrintWriter output = new PrintWriter(s.getOutputStream(), true);
    InputStreamReader inp = new InputStreamReader(s.getInputStream(), Charset.forName("UTF-8"));
    BufferedReader rd = new BufferedReader(inp);
    String st;
    while ((st = rd.readLine()) != null){
        System.out.println(st);
    }...
When a connection is made, the server sends the string "Send Handshake", preceded by the size of the string in bytes, as seen in the first block of code. This notifies the client that it should send authentication to the server. As of now, the string we get from the server looks like this:
������ ��������S��e��n��d�� ��H��a��n��d��s��h��a��k��e
We have used tools such as a string encode/decode tool to try to work out how the string is encoded, but every configuration fails.
We are out of ideas as to what encoding this is, if any, or how to fix it.
Any help would be much appreciated.
At a glance, the line where you convert the QString parameter to a UTF-8 QByteArray and then back to a QString seems odd:
QString text = response.toUtf8();
When the QByteArray returned by toUtf8() is assigned to text, I think it is assumed that the QByteArray contains an ASCII (char*) buffer.
I'm pretty sure that QDataStream is intended to be used only within Qt. It provides a platform-independent way of serializing data that is then intended to be deserialized with another QDataStream somewhere else. As you noticed, it's including a lot of extra stuff besides your raw data, and that extra stuff is subject to change at the next Qt version. (This is why the documentation suggests including in your stream the version of QDataStream being used ... so it can use the correct deserialization logic.)
In other words, the extra stuff you are seeing is probably meta-data included by Qt and it is not guaranteed to be the same with the next Qt version. From the docs:
QDataStream's binary format has evolved since Qt 1.0, and is likely to continue evolving to reflect changes done in Qt. When inputting or outputting complex types, it's very important to make sure that the same version of the stream (version()) is used for reading and writing.
If you are talking to another language, this isn't practical to use. If it is just text you are passing, use a well-known transport format (JSON, XML, plain UTF-8 text, etc.) and bypass QDataStream altogether.
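For what it's worth, QDataStream stores a QString as a quint32 byte count followed by UTF-16 big-endian data, which is why your client sees extra bytes between the letters of "Send Handshake". If the Qt side is changed to write the quint32 length prefix followed by the raw UTF-8 bytes (for example via writeRawData()), the Java side reduces to plain length-prefixed reading. A minimal sketch of that client, assuming the changed server and a placeholder host name:

import java.io.DataInputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class HandshakeClient {
    public static void main(String[] args) throws Exception {
        // assumes the Qt server now sends: quint32 byte count, then raw UTF-8 bytes
        try (Socket s = new Socket("server.example.com", 1987);
             DataInputStream in = new DataInputStream(s.getInputStream())) {
            // QDataStream writes big-endian by default, matching readInt()
            int length = in.readInt();           // the quint32 size prefix
            byte[] payload = new byte[length];
            in.readFully(payload);               // read exactly 'length' bytes
            System.out.println(new String(payload, StandardCharsets.UTF_8)); // "Send Handshake"
        }
    }
}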
I am trying to download a web page with all its resources. First I download the HTML, and to be sure the file stays formatted I use the function below.
The issue: I find the text 10 in the final file, and I worked out that 10 is the numeric character code of LF (the line feed). This breaks my JavaScript functions.
Example of the final result:
<!DOCTYPE html>10<html lang="fr">10 <head>10 <meta http-equiv="content-type" content="text/html; charset=UTF-8" />10
Can someone help me find the real issue?
public static String scanfile(File file) {
    StringBuilder sb = new StringBuilder();
    try {
        BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
        while (true) {
            String readLine = bufferedReader.readLine();
            if (readLine != null) {
                sb.append(readLine);
                sb.append(System.lineSeparator());
                Log.i(TAG, sb.toString());
            } else {
                bufferedReader.close();
                return sb.toString();
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
There are multiple problems with your code.
Charset error
BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
This is going to fail in subtle ways.
Files (and, for that matter, data given to you by webservers) come in bytes: a stream of numbers, each between 0 and 255.
So, if you are a webserver and you want to send the character ö, what byte(s) do you send?
The answer is complicated. The mapping that explains how some character is rendered in byte(s)-form is called a character set encoding (shortened to 'charset').
Anytime bytes are turned into characters or vice versa, there is always a charset involved. Always.
So, you're reading a file (that'd be bytes), and turning it into a Reader (which is chars). Thus, charset is involved.
Which charset? The API of new FileReader(path) explains which one: "The system default". You do not want that.
Thus, this code is broken. You want one of two things:
Option 1 - write the data as is
When doing the job of querying the webserver for the data and relaying this information onto disk, you'd want to just store the bytes (after all, webserver gives bytes, and disks store bytes, that's easy), but the webserver also sends the encoding, in a header, and you need to save this separately. Because to read that 'sack of bytes', you need to know the charset to turn it into characters.
How would you do this? Well, that's up to you. You could, for example, decree that the data file starts with the name of the charset (as sent via that header), then a 0 byte, and then the data, unmodified. I think you should go with option 2, however.
Option 2 - transcode to UTF-8
Another, better option for text-based documents (which HTML is), is this: When reading the data, convert it to characters, using the encoding as that header tells you. Then, to save it to disk, turn the chars back to bytes, using UTF-8, which is a great encoding and an industry standard. That way, when reading, you just know it's UTF-8, period.
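To make that concrete, here is a minimal sketch of option 2 (the class and method names are mine, and the Content-Type parsing is deliberately naive; real header values can be messier). It decodes with whatever charset the server declares and re-encodes to UTF-8 on disk:

import java.io.InputStreamReader;
import java.io.Reader;
import java.io.Writer;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PageSaver {
    // download one text resource and store it as UTF-8,
    // whatever charset the server used on the wire
    static void saveAsUtf8(String pageUrl, Path target) throws Exception {
        URLConnection conn = new URL(pageUrl).openConnection();
        String contentType = conn.getContentType(); // e.g. "text/html; charset=ISO-8859-1"
        Charset wireCharset = StandardCharsets.UTF_8; // fallback when no charset is declared
        int i = contentType == null ? -1 : contentType.toLowerCase().indexOf("charset=");
        if (i >= 0) {
            wireCharset = Charset.forName(contentType.substring(i + "charset=".length()).trim());
        }
        try (Reader in = new InputStreamReader(conn.getInputStream(), wireCharset);
             Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            in.transferTo(out); // decode with the server's charset, re-encode as UTF-8 (Java 10+)
        }
    }
}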
To read a UTF-8 text file, you do:
Files.newBufferedReader(file.toPath());
The reason this works, is that the Files API, unlike most other APIs (and unlike FileReader, which you should never ever use), defaults to UTF_8 and not to platform-default. If you want, you can make it more readable:
Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8);
same thing - but now in the code it is clear what's happening.
Broken exception handling
} catch (IOException e) {
e.printStackTrace();
return null;
}
This is not okay: if you catch an exception, either [A] throw something else, or [B] handle the problem. 'Log it and keep going' is definitely not handling it. Your strategy of exception handling means one error results in a thousand things going wrong, with a thousand stack traces, and all of them except the first are undesired and irrelevant. That is why this is horrible code and you should never write it this way.
The easy solution is to just put throws IOException on your scanFile method. The method inherently interacts with files; it SHOULD be throwing that. Note that your public static void main(String[] args) method can, and usually should, be declared to throws Exception.
It also makes your code simpler and shorter, yay!
Resource Management failure
A FileReader is a resource. You MUST close it, no matter what happens. You are not doing that: if .readLine() throws an exception, your code jumps to the catch handler and bufferedReader.close() is never executed.
The solution is to use the ARM (Automatic Resource Management) construct, also known as try-with-resources:
try (var br = Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8)) {
// code goes here
}
This construct ensures that close() is invoked, regardless of how the 'code goes here' block exits. Even if it 'exits' via an exception or a return statement.
The problem
Aside from the above three items, your 'read a file and print it' code is mostly fine. The problem is that the HTML file on disk is corrupted; the error lies in your code that reads the data from the web server and saves it to disk. You did not paste that code.
Specifically, System.lineSeparator() returns the actual separator string ("\n" or "\r\n"), not the text "10". Thus, assuming the code you pasted really is the code you are running, if you are seeing a literal '10' show up, then the HTML file on disk already has it in there. It's not the read code.
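As an illustration of how a literal '10' can sneak in (this is a hypothetical, not necessarily your bug): StringBuilder.append is overloaded, and the int overload appends the decimal digits rather than the character with that code:

StringBuilder sb = new StringBuilder();
sb.append("<!DOCTYPE html>");
sb.append('\n'); // char overload: appends a real line feed
sb.append(10);   // int overload: appends the two characters "1" and "0"
System.out.println(sb); // prints "<!DOCTYPE html>", then "10" on the next line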
Closing thoughts
More generally, the job of 'read a file on disk with a known encoding' can be done in far fewer lines of code:
public static String scanFile(String path) throws IOException {
return Files.readString(Paths.get(path));
}
You should just use the above code instead. It's simple, short, doesn't have any bugs, cannot leak resources, has proper exception handling, and will use UTF-8.
Actually, there was no problem in this function; I was mistakenly adding the 10 in another function in my code.
I need to read text from a file and, for instance, print it to the console. The file is in UTF-8. It seems that I'm doing something wrong, because some Russian characters are printed incorrectly. What's wrong with my code?
StringBuilder content = new StringBuilder();
try (FileChannel fChan = (FileChannel) Files.newByteChannel(Paths.get("D:/test.txt"))) {
    ByteBuffer byteBuf = ByteBuffer.allocate(16);
    Charset charset = Charset.forName("UTF-8");
    while (fChan.read(byteBuf) != -1) {
        byteBuf.flip();
        content.append(new String(byteBuf.array(), charset));
        byteBuf.clear();
    }
    System.out.println(content);
}
The result:
Здравствуйте, как поживае��е?
Это п��имер текста на русском яз��ке.ом яз�
The actual text:
Здравствуйте, как поживаете?
Это пример текста на русском языке.
UTF-8 uses a variable number of bytes per character. This gives you a boundary error: you have mixed buffer-based code with byte-array-based code, and you can't do that here. It is possible to read enough bytes to end up halfway into a character; you then turn your input into a byte array and convert it, which fails, because you can't convert half a character. (The duplicated 'ом яз' tail is a second symptom: byteBuf.array() always returns the entire 16-byte backing array, regardless of how many bytes the last read actually produced, so a short read re-decodes leftover bytes from the previous pass.)
What you really want is either to read ALL the data first and then convert the entire input, or to keep any half-characters in the byte buffer when you flip back, or, better yet, to ditch all this and use code that is written to read actual characters. In general, the channel API complicates matters a ton; it's flexible, but complicated. That's how it goes.
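If you genuinely do need the channel API, the usual fix is a CharsetDecoder that carries incomplete characters over to the next read. A sketch, reusing the question's file path and 16-byte buffer:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class DecodeChunks {
    public static void main(String[] args) throws IOException {
        StringBuilder content = new StringBuilder();
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer byteBuf = ByteBuffer.allocate(16);
        // UTF-8 never decodes to more chars than it has bytes, so 16 chars is enough
        CharBuffer charBuf = CharBuffer.allocate(16);
        try (FileChannel fChan = FileChannel.open(Paths.get("D:/test.txt"), StandardOpenOption.READ)) {
            while (fChan.read(byteBuf) != -1) {
                byteBuf.flip();
                // false = more input follows; the decoder stops before a
                // trailing incomplete character instead of mangling it
                decoder.decode(byteBuf, charBuf, false);
                charBuf.flip();
                content.append(charBuf);
                charBuf.clear();
                byteBuf.compact(); // keep the undecoded tail for the next read
            }
            byteBuf.flip();
            decoder.decode(byteBuf, charBuf, true); // end of input
            decoder.flush(charBuf);
            charBuf.flip();
            content.append(charBuf);
        }
        System.out.println(content);
    }
}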
Unless you can explain why you need it, don't use it. Do this instead:
Path target = Paths.get("D:/test.txt");
try (var reader = Files.newBufferedReader(target)) {
// read a line at a time here. Yes, it will be UTF-8 decoded.
}
or better yet, as you apparently want to read the whole thing in one go:
Path target = Paths.get("D:/test.txt");
var content = Files.readString(target);
NB: Unlike most java methods that convert bytes to chars or vice versa, the Files API defaults to UTF-8 (instead of the useless and dangerous, untestable-bug-causing 'platform default encoding' that most java API does). That's why this last incredibly simple code is nevertheless correct.
I have made a primitive multi-client chat with a Swing GUI. Everything works fine as long as both people write from the same OS. If one of them writes from Windows and the other from OS X, the encoding of some special characters goes nuts. (I am from the Czech Republic; we use characters such as š, ě, č, ř, ž...) I have searched for a long time but didn't find anything that would help.
I have input and output defined as:
in = new BufferedReader(new InputStreamReader(soc.getInputStream()));
out = new PrintWriter(new OutputStreamWriter(soc.getOutputStream()));
where soc is the socket used for connecting to the server side.
The sending process is as simple as:
out.println(message);
where message is a String that I got from a JTextArea by calling its .getText() method.
I know why this problem occurs, but I was unable to find any reasonable solution.
Any help will be appreciated.
Thanks
When reading or writing character data over Input/OutputStreams, it's good practice to always specify the character encoding; otherwise the platform default encoding is used, which might not be the same on all systems. Make the same change on both ends of the socket, since the fix only works when every endpoint agrees on the encoding.
in = new BufferedReader(new InputStreamReader(soc.getInputStream(), StandardCharsets.UTF_8));
out = new PrintWriter(new OutputStreamWriter(soc.getOutputStream(), StandardCharsets.UTF_8));
I am writing some web services for a social networking website. These web services will be consumed by an Android app. As the person who designed the website is no longer reachable, I read through the whole website code, which is written in Java with the Spring framework. I am writing the web services in PHP.
Now, I am sending a POST request to a PHP page that should confirm whether a given username and password combination is correct and then return a session id. But I cannot reproduce the hashing method, so I never get the hash value that is saved in the database.
Because of this, I am rejected by the PHP code every time.
The hashing code that I found on the website is as follows:
public static final synchronized String encrypt(String plaintext, String algorithm, String encoding) throws Exception
{
    MessageDigest msgDigest = null;
    String hashValue = null;
    try
    {
        msgDigest = MessageDigest.getInstance(algorithm);
        msgDigest.update(plaintext.getBytes(encoding));
        byte rawByte[] = msgDigest.digest();
        hashValue = (new BASE64Encoder()).encode(rawByte);
    }
    catch (NoSuchAlgorithmException e)
    {
        System.out.println("No Such Algorithm Exists");
    }
    catch (UnsupportedEncodingException e)
    {
        System.out.println("The Encoding Is Not Supported");
    }
    return hashValue;
}
For example, for the password monkey123 it gives this hash value, encoded in base 64: hge2WiM7vlaTTS1qU404+Q==
Now, after struggling to do the same in PHP for hours, I realised I could do the above procedure in Android itself. So I wrote the following code:
MessageDigest pwdDigest = MessageDigest.getInstance("MD5");
pwdDigest.update(password.getBytes("UTF-16"));
byte rawbyte[] = pwdDigest.digest();
String passwordHash = Base64.encodeToString(rawbyte, Base64.DEFAULT);

URL url = new URL(loginURL);
HttpURLConnection Connection = (HttpURLConnection) url.openConnection();
Connection.setReadTimeout(10000);
Connection.setAllowUserInteraction(false);
Connection.setDoOutput(true);
// set the request to POST and send
Connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
DataOutputStream out = new DataOutputStream(Connection.getOutputStream());
out.writeBytes("username=" + URLEncoder.encode(username, "UTF-8"));
out.writeBytes("&password=" + URLEncoder.encode(passwordHash, "UTF-8"));
out.flush();
out.close();
if (Connection.getResponseCode() == 200) {
    String data = "Connected";
    return data;
} else
    return Connection.getResponseCode() + ": " + Connection.getResponseMessage();
I expected this to succeed, because in both cases I hash the password the same way; but amazingly it is not giving the hash value hge2WiM7vlaTTS1qU404+Q==, it gives nZlvVe7GSS2Zso1dOwJrIA== instead.
I am really struggling to find out a reason why these two are not the same. Any help would be hugely appreciated.
I don't expect MD5 to differ between platforms. It's stable and well documented and part of the core libraries. If this were broken in some Android version, nothing would work on that phone.
The re-encoding into UTF-8 is harmless, because all Base64 characters fit into the lower ASCII range. Three characters of the base64 alphabet require URL encoding, but you would have seen the %-escapes if something went wrong there.
Base64 is less stable ground (lots and lots of different implementations, no single canonical one), but it's not exactly rocket science either. Again, I don't expect a faulty implementation to really make it out into the wild, but the Base64 step may be where the difference arises.
Personally, I suspect the error is introduced by the password.getBytes("UTF-16") call. A quick way to verify this hunch is to inspect the resulting byte array in a debugger on both platforms. According to the java.nio.charset.Charset documentation, Java's "UTF-16" encoding produces big-endian output with a leading byte order mark, whereas your PHP code may be defaulting to little endian (it is running on x86) with no byte order mark at all (I don't know PHP well enough to tell whether that behaviour is well defined). Try modifying the Java code to use password.getBytes("UTF-16LE") and see if that makes a difference.
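To make that check concrete without a debugger, here is a quick probe (a sketch; monkey123 is the example password from the question) that hex-dumps the exact bytes being hashed under each encoding. Run it on both platforms and compare:

import java.io.UnsupportedEncodingException;

public class EncodingProbe {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String password = "monkey123";
        for (String cs : new String[] { "UTF-16", "UTF-16BE", "UTF-16LE", "UTF-8" }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : password.getBytes(cs)) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(cs + ": " + hex); // "UTF-16" should start with the BOM, feff
        }
    }
}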
Side note: MD5 is no longer considered secure for hashing passwords; you'll want to use something like scrypt or PBKDF2 with plenty of rounds and a random salt, but that's a topic unto itself.
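For completeness, a minimal PBKDF2 sketch of the kind of password hashing you would use instead (the iteration count, salt handling, and storage format here are illustrative, not a drop-in replacement for the existing database):

import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class PasswordHashing {
    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt); // random per-user salt
        PBEKeySpec spec = new PBEKeySpec("monkey123".toCharArray(), salt, 100_000, 256);
        byte[] hash = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
        // store the salt, iteration count, and hash together
        System.out.println(Base64.getEncoder().encodeToString(salt) + ":100000:"
                + Base64.getEncoder().encodeToString(hash));
    }
}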
I am having some problems getting some French text to convert to UTF8 so that it can be displayed properly, either in a console, text file or in a GUI element.
The original string is
HANDICAP╔ES
which is supposed to be
HANDICAPÉES
Here is a code snippet that shows how I am using the jackcess database driver to read in the Access MDB file in an Eclipse/Linux environment.
Database database = Database.open(new File(filepath));
Table table = database.getTable(tableName, true);
Iterator<Map<String, Object>> rowIter = table.iterator();
while (rowIter.hasNext()) {
    Map<String, Object> row = rowIter.next();
    // convert fields to UTF
    Map<String, Object> rowUTF = new HashMap<String, Object>();
    try {
        for (String key : row.keySet()) {
            Object o = row.get(key);
            if (o != null) {
                String valueCP850 = o.toString();
                // String nameUTF8 = new String(valueCP850.getBytes("CP850"), "UTF8"); // does not work!
                String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
                String valueUTF8 = new String(valueISO.getBytes(), "UTF-8"); // works!
                rowUTF.put(key, valueUTF8);
            }
        }
    } catch (UnsupportedEncodingException e) {
        System.err.println("Encoding exception: " + e);
    }
}
In the code you'll see where I tried to convert directly to UTF-8, which doesn't seem to work, so I have to do a double conversion. Also note that there doesn't seem to be a way to specify the encoding when using the jackcess driver.
Thanks,
Cam
New analysis, based on new information.
It looks like your problem is with the encoding of the text before it was stored in the Access DB. It seems it had been encoded as ISO-8859-1 or windows-1252, but decoded as cp850, resulting in the string HANDICAP╔ES being stored in the DB.
Having correctly retrieved that string from the DB, you're now trying to reverse the original encoding error and recover the string as it should have been stored: HANDICAPÉES. And you're accomplishing that with this line:
String valueISO = new String(valueCP850.getBytes("CP850"), "ISO-8859-1");
getBytes("CP850") converts the character ╔ to the byte value 0xC9, and the String constructor decodes that according to ISO-8859-1, resulting in the character É. The next line:
String valueUTF8 = new String(valueISO.getBytes(), "UTF-8");
...does nothing. getBytes() encodes the string in the platform default encoding, which is UTF-8 on your Linux system. Then the String constructor decodes it with the same encoding. Delete that line and you should still get the same result.
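A quick way to convince yourself of that, as a standalone sketch:

import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        String s = "HANDICAPÉES";
        // encoding and then decoding with the same charset is an identity operation
        String t = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(s.equals(t)); // true
    }
}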
More to the point, your attempt to create a "UTF-8 string" was misguided. You don't need to concern yourself with the encoding of Java's strings--they're always UTF-16. When bringing text into a Java app, you just need to make sure you decode it with the correct encoding.
And if my analysis is correct, your Access driver is decoding it correctly; the problem is at the other end, possibly before the DB even comes into the picture. That's what you need to fix, because that new String(getBytes()) hack can't be counted on to work in all cases.
Original analysis, based on no information. :-/
If you're seeing HANDICAP╔ES on the console, there's probably no problem. Given this code:
System.out.println("HANDICAPÉES");
The JVM converts the (Unicode) string to the platform default encoding, windows-1252, before sending it to the console. Then the console decodes that using its own default encoding, which happens to be cp850. So the console displays it wrong, but that's normal. If you want it to display correctly, you can change the console's encoding with this command:
CHCP 1252
To display the string in a GUI element, such as a JLabel, you don't have to do anything special. Just make sure you use a font that can display all the characters, but that shouldn't be problem for French.
As for writing to a file, just specify the desired encoding when you create the Writer:
OutputStreamWriter osw = new OutputStreamWriter(
new FileOutputStream("myFile.txt"), "UTF-8");
String s = "HANDICAP╔ES";
System.out.println(new String(s.getBytes("CP850"), "ISO-8859-1")); // HANDICAPÉES
This prints the correct string value, which means the text was originally encoded with ISO-8859-1 and then incorrectly decoded as CP850. (CP1252, a.k.a. Windows ANSI, as pointed out in a comment, is also possible as the original encoding, since É has the same code point there as in ISO-8859-1.)
Align your environment and binary pipelines so that they all use one and the same character encoding. You can't and shouldn't convert between them; you would risk losing information in the non-ASCII range that way.
Note: do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
Update: you are apparently still struggling with the problem. I'll repeat the important parts of the answer:
Align your environment and binary pipelines so that they all use one and the same character encoding.
You cannot and should not convert between them; you would risk losing information in the non-ASCII range that way.
Do NOT use the above code snippet to "fix" the problem! That would not be the right solution.
To fix the problem you need to choose character encoding X which you'd like to use throughout the entire application. I suggest UTF-8. Update MS Access to use encoding X. Update your development environment to use encoding X. Update the java.io readers and writers in your code to use encoding X. Update your editor to read/write files with encoding X. Update the application's user interface to use encoding X. Do not use Y or Z or whatever at some step. If the characters are already corrupted in some datastore (MS Access, files, etc), then you need to fix it by manually replacing the characters right there in the datastore. Do not use Java for this.
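For the java.io step in that list, a sketch of what "use encoding X everywhere" looks like in code; every place bytes become characters (or vice versa) names the charset explicitly:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsets {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                     new FileInputStream("in.txt"), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(new OutputStreamWriter(
                     new FileOutputStream("out.txt"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line); // no implicit platform default anywhere
            }
        }
    }
}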
If you're actually using the "command prompt" as the user interface, then you're out of luck: it doesn't support UTF-8. As suggested in the comments and in the article linked there, you need to create a Swing application instead of relying on the restricted command prompt environment.
You can specify the encoding when establishing the connection. This worked perfectly for me and solved my encoding problem:
DatabaseImpl open = DatabaseImpl.open(new File("main.mdb"), true, null,
        Database.DEFAULT_AUTO_SYNC,
        java.nio.charset.Charset.availableCharsets().get("windows-1251"),
        null, null);
Table table = open.getTable("FolderInfo");
Using "ISO-8859-1" helped me deal with the French charactes.