Form encoding in Tapestry - java

I have a problem with a Tapestry form.
My XML database is very sensitive to encoding and needs UTF-8.
When I put the character 'à' in my form, Tapestry receives 'Ó' and my core gets an error: Invalid byte 2 of 3-byte UTF-8 sequence.
I don't have the problem in Eclipse with the local default configuration for Tomcat.
But whatever the Tomcat configuration, I think my application should handle the conversion itself.
So I tried:
charset="utf-8" on the form => FAIL
buildUtf8Filter in AppModule => FAIL
The charset of every page is always UTF-8.
So, what can I do short of resorting to a Java Charset encoder by hand?
Thank you for helping me. :)

I wouldn't think there's anything wrong with your application. Tapestry does everything in UTF-8 by default; that wiki page is fairly out of date (referring to the 5.0.5 beta, where apparently forms with file uploads still didn't use UTF-8 properly).
You're saying you don't have the problem locally. Have you tried running on a different server? If you do not have the problem there, there's probably something wrong with the codepage settings of the operating system on the server.
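If you want to rule out the application side completely, you can also pin the charset explicitly in your AppModule. In Tapestry 5 this just restates the built-in default, so treat it as a sanity check rather than a fix:

import org.apache.tapestry5.SymbolConstants;
import org.apache.tapestry5.ioc.MappedConfiguration;

public class AppModule {
    public static void contributeApplicationDefaults(MappedConfiguration<String, String> configuration) {
        // UTF-8 is already the Tapestry 5 default; setting it explicitly makes the intent visible.
        configuration.add(SymbolConstants.CHARSET, "UTF-8");
    }
}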
Purely anecdotal evidence below
I have once had a similar character set problem in a Tapestry 5 app on the production server (running SUSE Linux) that I could not reproduce on any other server. All seemed fine with the application, the Tomcat server, and the codepage settings of the system, but POST data would end up decoded as ISO 8859-1 instead of UTF-8 in the application. The app had run on that server for a year before the problem manifested - maybe through an update in the operating system.
After a day of not getting anywhere, we ended up just re-installing the whole server OS, and everything was fine again.

The problem was the default charset of the JVM when launched from the Windows shell.
It caused trouble with FileWriter and then showed bad characters in the console :)
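For anyone who hits the same thing: FileWriter (before the charset overloads added in Java 11) always encodes with the JVM default charset, so it silently inherits whatever encoding the Windows shell launched the JVM with. A minimal sketch of the explicit alternative (the file name is made up):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class EncodingSafeWrite {
    public static void main(String[] args) throws IOException {
        // FileWriter uses Charset.defaultCharset(), e.g. windows-1252 on many Windows JVMs:
        // Writer risky = new FileWriter("out.xml");

        // OutputStreamWriter with an explicit charset is independent of the JVM default:
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("out.xml"), StandardCharsets.UTF_8)) {
            w.write("à");
        }
    }
}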

Related

Encoding issue between freshly installed Ubuntu 16.04 LTS server and upgraded server

As part of a project, we needed to move from Ubuntu 14.04 to Ubuntu 16.04. However, since the move was completed, full functionality has not been working correctly: the encoding of characters is jumbled when they are stored in the database. The same Debian package of the software produces different results on the two systems, implying an OS-level issue with some library, or a difference in Java behaviour.
The upgraded server experiences no problems; the issue persists only on fresh installs, which suggests something at the install-image (ISO) level, but there is no obvious sign of which library or similar may have failed to install.
Logging was added to print the bytes received, and Java reads them as expected. However, when it stores them in the database, they are completely different. This is done via a JPA connection set up earlier, which already uses 'useUnicode=true&characterEncoding=UTF-8' in the connection string. When Java reads this data back, it still believes it has the correct bytes when it does not. Likewise, if you add something directly to the DB, Java's debugging logs do not show the correct bytes, yet the information is displayed correctly via the interface, which could only have passed through the same code. This implies the issue is with storing the data rather than with handling it, but the same Debian package is installed on both systems. The working version reads the bytes correctly when it gets them out of the database.
شلاؤ in Arabic, for example, is supposed to be encoded (checking with the HEX() function in MySQL/MariaDB) as "D8B4D984D8A7D8A4", and that is what the correct version shows, BUT the incorrect version shows "C398C2B4C399C284C398C2A7C398C2A4". This may provide more information as to why the encoding is failing to work correctly. With Java reading the incorrect bytes as if they were correct, this looks more likely to be an issue with Java, but the confusion remains because of the inconsistency between systems.
D8B4D984D8A7D8A4 is the correct utf8 (or utf8mb4) encoding for شلاؤ. C398C2B4C399C284C398C2A7C398C2A4 is the "double-encoded" version. This implies that something is still specifying "latin1" as the character set. Perhaps you dumped and reloaded the data, and that is where it happened?
For more on this, see Trouble with UTF-8 characters; what I see is not what I stored and perhaps http://mysql.rjweb.org/doc.php/charcoll
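The double-encoding is easy to reproduce in plain Java; this sketch just illustrates the failure mode described above:

import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    public static void main(String[] args) {
        String s = "شلاؤ";
        // Correct UTF-8 bytes: D8 B4 D9 84 D8 A7 D8 A4
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // Somewhere in the chain the UTF-8 bytes are misread as latin1...
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        // ...and then re-encoded as UTF-8, producing the double-encoded form:
        byte[] doubled = misread.getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(utf8));    // D8B4D984D8A7D8A4
        System.out.println(toHex(doubled)); // C398C2B4C399C284C398C2A7C398C2A4
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X", b));
        return sb.toString();
    }
}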
For anyone who may be experiencing something similar, the cause turned out to be that Java was running without defaulting to UTF-8. OpenEJB/JPA was configured correctly, as was the database, but one part of the server was defaulting to a different charset, so adding the encoding to the startup arguments for the affected area resolved the problem!

JavaFx application in Windows is not displaying text correctly

So I have an application written in JavaFX 2.2 that has been packaged for Linux, Mac, and Windows. I am getting a strange issue with some of the text fields, though. The application reads a file and populates some labels based on what's found in the file. When run on Ubuntu or Mac, the special accent character over the c displays just fine. However, in Windows it shows up as garbage. Any idea why this is happening? I was a bit confused, as it is the exact same application on all three. Thanks.
Make sure to specify character encoding when reading the file, in order to avoid using the platform's default encoding, which varies between operating systems. Just by coincidence, the default on Linux and Mac happens to match the file encoding and produces correct output, but you should not rely on it.
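For example, a minimal sketch assuming the data file is UTF-8 (the file name here is hypothetical):

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class ReadWithCharset {
    public static void main(String[] args) throws IOException {
        // Explicit charset: the same bytes decode the same way on Linux, Mac, and Windows.
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("labels.txt"), StandardCharsets.UTF_8)) {
            System.out.println(reader.readLine());
        }
    }
}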

opposite world "XML Parsing Error: not well-formed" error

I know what "XML Parsing Error: not well-formed" means broadly. Somehow the text does not comply with the xml specification. This normally would mean that there are unmatched tags or perhaps an incorrectly written header.
However, there is also the character encoding type of not well formed documents. I'm getting results that seem opposite of what I would expect.
When I make a rest call to a java rest service from a browser on my windows 7 machine to the tomcat instance on my windows 7 machine, I get back an xml document that contains the following word as text like so:
<foo>RÃœCK</foo>
I know that's what I get because I used curl to save the results and that's exactly what's in the document. However, when viewed in firefox, ie8, or chrome, the "Ü" part of the text actually displays as a U with 2 dots above it. And, none of the browsers complains about the document being not well-formed.
Then I make a call to the same rest service except I make it from my windows 7 machine to a linux machine running tomcat. What I get is:
<foo>RÜCK</foo>
That's what I see when I use curl to download the results. However, both firefox and ie complain that the xml document is not well-formed!
I know that somehow when I copy paste "Ü" it changes from being a single character to being two characters due to document encoding or something. But, here is the next confusing thing.
When I update things in the db to store "RÃœCK" as the copy pasted value, it displays as "RÃœCK" when sent from tomcat on windows, but when sent from tomcat on linux it's giving a not well formed error! Why?
Can anyone explain what exactly is causing the windows and linux systems to display the same data differently and why it's not well formed from the linux tomcat server but it is well formed from the windows 7 tomcat server?
The XML 1.0 specification defines, in 4.3.3 Character Encoding in Entities, that it is a fatal error “if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding”. It also says that violations of well-formedness constraints are fatal errors, and this is apparently meant to work in the other direction, too.
Thus, apparently your XML document is in fact UTF-8 encoded but declared (or implied) to be ISO-8859-1 (or maybe windows-1252), or vice versa. Either way, there will be bytes or byte combinations that must be recognized as illegal.
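Both directions of the mismatch can be reproduced in a few lines of Java (illustrative only):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "RÜCK";
        // UTF-8 bytes: 52 C3 9C 43 4B
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Decoding those bytes as windows-1252 yields the mojibake seen from the Windows server:
        System.out.println(new String(utf8, Charset.forName("windows-1252"))); // RÃœCK

        // Conversely, the ISO-8859-1 byte for Ü (a lone 0xDC) is not a legal UTF-8 sequence,
        // which is exactly what makes a strict XML parser report "not well-formed":
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1); // 52 DC 43 4B
        System.out.println(new String(latin1, StandardCharsets.UTF_8)); // R?CK (replacement char)
    }
}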

Defining accent characters in different platforms under java

I have a problem with accented characters on different platforms.
When I log this on my machine under Fedora (where the default charset is UTF-8), it prints correctly as Sacré Coeur.
But when I deploy to another server running RedHat (where the default charset is ISO-8859-1), it prints as Sacré Coeur. I want it to log on the RedHat server the same as on my Fedora machine. How can I do this?
What I've tried:
I tried changing System.setProperty("file.encoding", "ISO-8859-1") locally, with the purpose of doing the reverse, System.setProperty("file.encoding", "UTF-8"), on the RedHat server if it changed the way logging behaved locally. But nothing changed.
I noticed there are a couple of threads about accented characters, but nothing answered my case; that's why I asked a new question.
I tried this one as well, but it is not working:
import java.lang.reflect.Field;
import java.nio.charset.Charset;

System.setProperty("file.encoding", "ISO-8859-1");
// Reflection hack: clear the cached value so Charset.defaultCharset() re-reads the system property.
Field charset = Charset.class.getDeclaredField("defaultCharset");
charset.setAccessible(true);
charset.set(null, null);
But I didn't try setting the charset at JVM start. If that will work, please explain how I can do it.
To get the same output in all environments, without depending on the server OS's default character encoding, pass -Dfile.encoding to the startup script when you start your program or the server environment (JBoss, Tomcat, or Jetty).
(For example, with run.sh in JBoss, add -Dfile.encoding=UTF-8 to JAVA_OPTS.)
-Dfile.encoding=UTF-8
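To confirm the flag took effect, a quick sanity-check sketch:

import java.nio.charset.Charset;

public class CharsetCheck {
    public static void main(String[] args) {
        // Should print UTF-8 when the JVM was started with -Dfile.encoding=UTF-8
        System.out.println(Charset.defaultCharset());
    }
}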

Character encoding between Java (Linux) and Windows system

I have a simple program that makes a request to a remote server running a service which I believe is written in Delphi, but definitely running on Windows.
I'm told the service will be using whatever the default encoding is for Windows.
When I get a response and use println to output it, I see some strange symbols in the output, which makes me think it is a character encoding issue.
How can I tell Java that the input from the remote system is in the Windows encoding?
I have tried the following:
_receive = new BufferedReader(new InputStreamReader(_socket.getInputStream(), "ISO-8859-1"));
System.out.println(_receive.readLine());
The extra characters appear as squares in the output, with 4 numbers in each square.
Unless you KNOW what the "default encoding" is, you can't tell what it is. The "default encoding" is generally the system-global codepage, which can be different on different systems.
You should really try to make people use an encoding that both sides agree on; nowadays, this should almost always be UTF-16 or UTF-8.
Btw, if you are sending one character on the Windows box, and you receive multiple "strange symbols" on the Java box, there's a good chance that the Windows box is already sending UTF-8.
Use cp1252 instead of ISO-8859-1, as it is the default on Windows (for Western European locales).
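Applied to the snippet above, that would be ("windows-1252" is the canonical name Java accepts for cp1252; this assumes the Windows service really uses a Western European ANSI code page):

_receive = new BufferedReader(new InputStreamReader(_socket.getInputStream(), "windows-1252"));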
