I am developing a web application with Java and Tomcat 8. This application has a page for uploading a file with the content that will be shown in a different page. Plain simple.
However, these files might contain not-so-common characters as part of their text. Right now, I am working with a file that contains Vietnamese text, for example.
The file is encoded in UTF-8 and can be opened in any text editor. However, I couldn't find any way to upload it and keep the content in the correct encoding, despite searching a lot and trying many different things.
My page which uploads the file contains the following form:
<form method="POST" action="upload" enctype="multipart/form-data" accept-charset="UTF-8" >
File: <input type="file" name="file" id="file" multiple/><br/>
Param1: <input type="text" name="param1"/> <br/>
Param2: <input type="text" name="param2"/> <br/>
<input type="submit" value="Upload" name="upload" id="upload" />
</form>
It also contains:
<%@page contentType="text/html" pageEncoding="UTF-8"%>
...
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
My servlet looks like this:
protected void processRequest(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
try {
response.setContentType("text/html;charset=UTF-8");
request.setCharacterEncoding("UTF-8");
String param1 = request.getParameter("param1");
String param2 = request.getParameter("param2");
Collection<Part> parts = request.getParts();
Iterator<Part> iterator = parts.iterator();
while (iterator.hasNext()) {
Part filePart = iterator.next();
InputStream filecontent = null;
filecontent = filePart.getInputStream();
String content = convertStreamToString(filecontent, "UTF-8");
//Save the content and the parameters in the database
if (filecontent != null) {
filecontent.close();
}
}
} catch (ParseException ex) {
}
}
static String convertStreamToString(java.io.InputStream is, String encoding) {
java.util.Scanner s = new java.util.Scanner(is, encoding).useDelimiter("\\A");
return s.hasNext() ? s.next() : "";
}
Despite all my efforts, I have never been able to get that "content" string with the correct characters preserved. I get either something like "K?n" or "Káº¡n" (which seems to be the ISO-8859-1 interpretation of it), when the correct text is "Kạn".
To add to the problem, if I write Vietnamese characters in the other form parameters (param1 or param2), which also needs to be possible, I can only read them correctly if I set both the form's accept-charset and the servlet scanner encoding to ISO-8859-1, which I definitely don't understand. In that case, if I print the received parameter I get something like "K & # 7 8 4 1 ; n" (without the spaces), which contains a representation for the correct character. So it seems to be possible to read the Vietnamese characters from the form using ISO-8859-1, as long as the form itself uses that charset. However, it never works on the content of the uploaded files. I even tried to encode the file in ISO-8859-1, to use the charset for everything, but it does not work at all.
I am sure this type of situation is not that rare, so I would like to ask some help from the people who might have been there before. I am probably missing something, so any help is appreciated.
Thank you in advance.
Edit 1: Although this question is yet to receive a reply, I will keep posting my findings, in case someone is interested or following it.
After trying many different things, I seem to have narrowed down the causes of the problem. I created a class which reads a file from a specific folder on the disk and prints its content. The code goes:
public static void openFile() {
System.out.println(String.format("file.encoding: %s", System.getProperty("file.encoding")));
System.out.println(String.format("defaultCharset: %s", Charset.defaultCharset().name()));
File file = new File(myFilePath);
byte[] buffer = new byte[(int) file.length()];
BufferedInputStream f = null;
String content = null;
try {
f = new BufferedInputStream(new FileInputStream(file));
} catch (FileNotFoundException ex) {
}
try {
f.read(buffer);
content = new String(buffer, "UTF-8");
System.out.println("UTF-8 File: " + content);
f.close();
} catch (IOException ex) {
}
}
Then I added a main function to this class, making it executable. When I run it standalone, I get the following output:
file.encoding: UTF-8
defaultCharset: UTF-8
UTF-8 File: {"...Kạn..."}
However, if I run the project as a webapp, as it is supposed to be, and call the same function from that class, I get:
file.encoding: Cp1252
defaultCharset: windows-1252
UTF-8 File: {"...K?n..."}
Of course, this was clearly showing that the default encoding used by the webapp to read the file was not UTF-8. So I did some research on the subject and found the classical answer of creating a setenv.bat for Tomcat and having it execute:
set "JAVA_OPTS=%JAVA_OPTS% -Dfile.encoding=UTF-8"
The result, however, is still not right:
file.encoding: UTF-8
defaultCharset: UTF-8
UTF-8 File: {"...Káº¡n..."}
I can see now that the default encoding became UTF-8. The content read from the file, however, is still wrong. The content shown above is the same I would get if I opened the file in Microsoft Word, but chose to read it using ISO-Latin-1 instead of UTF-8. For some odd reason, reading the file is still working with ISO-Latin-1 somewhere, although everything points to the use of UTF-8.
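The garbling described here can be reproduced in isolation. The following is a minimal sketch (the class and method names are illustrative and not part of the original project): it encodes the text as UTF-8 and then decodes those same bytes as ISO-8859-1, which is one plausible way for the single character "ạ" to turn into three unrelated characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the original project.
final class MojibakeDemo {

    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u1EA1 is the Vietnamese letter "a with dot below".
        String garbled = misread("K\u1EA1n");
        // The 3-character input becomes 5 characters, because the letter
        // occupies three bytes in UTF-8 and each byte is read as one character.
        System.out.println(garbled.length());
    }
}
```

If the same pattern shows up in real output, some step between the file and the screen is decoding UTF-8 bytes with a single-byte charset.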
Again, if anyone might have suggestions or directions for this, it will be highly appreciated.
I don't seem to be able to close the question, so let me contribute with the answer I found.
The problem is that investigating this type of issue is very tricky, since there are many points in the code where the encoding might be changed (the page, the form encoding, the request encoding, file reading, file writing, console output, database writing, database reading...).
In my case, after doing everything that I posted in the question, I lost a lot of time trying to solve an issue that didn't exist any longer, just because the console output in my IDE (NetBeans, for that project) didn't use the desired character encoding. So I was doing everything right to a certain point, but when I tried to print anything I would get it wrong. After I started writing my logs to files, instead of the console, and thus controlling the writing encoding, I started to understand the issue clearly.
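A minimal sketch of that logging approach, assuming a plain append-to-file helper is enough (the class name Utf8Log and the method append are made up for illustration):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: writes diagnostics to a file with an explicit charset,
// so the result does not depend on the console or IDE encoding.
final class Utf8Log {

    static void append(Path logFile, String message) throws IOException {
        byte[] line = (message + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        Files.write(logFile, line, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

Opening the resulting file in an editor that is set to UTF-8 then shows what the application really holds in memory, independently of how the console renders it.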
What was missing in my solution, after everything I had already described in my question (before the edit), was to configure the encoding for the database connection. To my surprise, even though my database and all of my tables were using UTF-8, the communication between the application and MySQL was still in ISO-Latin. The last thing that was missing was adding "useUnicode=true&characterEncoding=utf-8" to the connection, just like this:
con = DriverManager.getConnection("jdbc:mysql:///dbname?useUnicode=true&characterEncoding=utf-8", "user", "pass");
Thanks to this answer, amongst many others: https://stackoverflow.com/a/3275661/843668
I am trying to download a web page with all its resources. First I download the HTML, and to make sure the file keeps its formatting I use the function below.
There is an issue: I found 10 in the final file, and I then realized that this is the character code (decimal 10) of LF, the line feed. This causes trouble for my JavaScript functions.
Example of the final result:
<!DOCTYPE html>10<html lang="fr">10 <head>10 <meta http-equiv="content-type" content="text/html; charset=UTF-8" />10
Can someone help me find the real issue?
public static String scanfile(File file) {
StringBuilder sb = new StringBuilder();
try {
BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
while (true) {
String readLine = bufferedReader.readLine();
if (readLine != null) {
sb.append(readLine);
sb.append(System.lineSeparator());
Log.i(TAG,sb.toString());
} else {
bufferedReader.close();
return sb.toString();
}
}
} catch (IOException e) {
e.printStackTrace();
return null;
}
}
There are multiple problems with your code.
Charset error
BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
This is going to fail in tricky ways.
Files (and, for that matter, data given to you by webservers) come as bytes: a stream of numbers, each number being between 0 and 255.
So, if you are a webserver and you want to send the character ö, what byte(s) do you send?
The answer is complicated. The mapping that explains how some character is rendered in byte(s)-form is called a character set encoding (shortened to 'charset').
Anytime bytes are turned into characters or vice versa, there is always a charset involved. Always.
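To make the "which byte(s) do you send?" question concrete, here is a small sketch that prints the bytes of the same character under different charsets. The class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting how a charset turns characters into bytes.
final class CharsetBytes {

    // Returns the encoded bytes as space-separated uppercase hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String o = "\u00f6"; // the character ö
        System.out.println(hex(o, StandardCharsets.UTF_8));      // C3 B6
        System.out.println(hex(o, StandardCharsets.ISO_8859_1)); // F6
        System.out.println(hex(o, StandardCharsets.UTF_16BE));   // 00 F6
    }
}
```

The same character yields three different byte sequences, which is why a reader that guesses the charset wrong produces garbage.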
So, you're reading a file (that'd be bytes), and turning it into a Reader (which is chars). Thus, charset is involved.
Which charset? The API of new FileReader(path) explains which one: "The system default". You do not want that.
Thus, this code is broken. You want one of two things:
Option 1 - write the data as is
When doing the job of querying the webserver for the data and relaying this information onto disk, you'd want to just store the bytes (after all, webserver gives bytes, and disks store bytes, that's easy), but the webserver also sends the encoding, in a header, and you need to save this separately. Because to read that 'sack of bytes', you need to know the charset to turn it into characters.
How would you do this? Well, up to you. You could for example decree that the data file starts with the name of a charset encoding (as sent via that header), then a 0 byte, and then the data, unmodified. I think you should go with option 2, however.
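For illustration only, the "charset name, 0 byte, raw data" layout could be sketched as follows. The class name and both method names are invented for this example, and the sketch assumes the charset name is plain ASCII and that the whole file fits in memory.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical container format: <charset name> 0x00 <unmodified body bytes>.
final class SackOfBytes {

    static void save(Path target, String charsetName, byte[] body) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            out.write(charsetName.getBytes(StandardCharsets.US_ASCII));
            out.write(0);    // separator between header and data
            out.write(body); // body is stored exactly as received
        }
    }

    static String load(Path source) throws IOException {
        byte[] all = Files.readAllBytes(source);
        int sep = 0;
        while (sep < all.length && all[sep] != 0) {
            sep++;
        }
        if (sep == all.length) {
            throw new IOException("missing charset header");
        }
        Charset charset = Charset.forName(new String(all, 0, sep, StandardCharsets.US_ASCII));
        return new String(all, sep + 1, all.length - sep - 1, charset);
    }
}
```

The extra bookkeeping is the main drawback of this option: every reader of the file has to know about the custom header.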
Option 2
Another, better option for text-based documents (which HTML is), is this: When reading the data, convert it to characters, using the encoding as that header tells you. Then, to save it to disk, turn the chars back to bytes, using UTF-8, which is a great encoding and an industry standard. That way, when reading, you just know it's UTF-8, period.
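A minimal sketch of that conversion step, assuming the response body is already available as a byte array and the charset from the header has been resolved to a Charset (the class and method names are illustrative):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: decode with the charset the server declared,
// then always store the text on disk as UTF-8.
final class SaveAsUtf8 {

    static void save(byte[] body, Charset declared, Path target) throws IOException {
        String text = new String(body, declared);                   // bytes -> chars
        Files.write(target, text.getBytes(StandardCharsets.UTF_8)); // chars -> UTF-8 bytes
    }
}
```

After this step nothing downstream needs to remember the original charset; every file on disk is UTF-8.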
To read a UTF-8 text file, you do:
Files.newBufferedReader(file.toPath());
The reason this works, is that the Files API, unlike most other APIs (and unlike FileReader, which you should never ever use), defaults to UTF_8 and not to platform-default. If you want, you can make it more readable:
Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8);
Same thing, but now the code makes it clear what is happening.
Broken exception handling
} catch (IOException e) {
e.printStackTrace();
return null;
}
This is not okay - if you catch an exception, either [A] throw something else, or [B] handle the problem. And 'log it and keep going' is definitely not 'handling' it. Your strategy of exception handling results in 1 error resulting in a thousand things going wrong with a thousand stack traces, and all of them except the first are undesired and irrelevant, hence why this is horrible code and you should never write it this way.
The easy solution is to just put throws IOException on your scanfile method. The method inherently interacts with files, so it SHOULD be throwing that. Note that your public static void main(String[] args) method can, and usually should, be declared with throws Exception.
It also makes your code simpler and shorter, yay!
Resource Management failure
A FileReader is a resource. You MUST close it, no matter what happens. You are not doing that: if .readLine() throws an exception, then your code will jump to the catch handler and bufferedReader.close() is never executed.
The solution is to use the ARM (Automatic Resource Management) construct:
try (var br = Files.newBufferedReader(file.toPath(), StandardCharsets.UTF_8)) {
// code goes here
}
This construct ensures that close() is invoked, regardless of how the 'code goes here' block exits. Even if it 'exits' via an exception or a return statement.
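Putting the three fixes together (explicit charset, a throws clause, and automatic resource management), a line-by-line reader might look like the sketch below. The class name FileText and the method name readAll are placeholders, and the method takes a Path rather than the File used in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical rewrite of the question's method with the three fixes applied.
final class FileText {

    static String readAll(Path path) throws IOException {
        StringBuilder sb = new StringBuilder();
        // The reader is closed automatically, even if readLine() throws.
        try (BufferedReader br = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append(System.lineSeparator());
            }
        }
        return sb.toString();
    }
}
```

Note that this normalizes every line ending to the platform separator and always ends the result with one, just like the loop in the question does.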
The problem
Your 'read a file and print it' code is, apart from the three items above, mostly fine. The problem is that the HTML file on disk is corrupted; the error lies in your code that reads the data from the web server and saves it to disk. You did not paste that code.
Specifically, System.lineSeparator() returns the actual string. Thus, assuming the code you pasted really is the code you are running, if you are seeing an actual '10' show up, then that means the HTML file on disk has that in there. It's not the read code.
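The code that actually wrote the "10" was never posted, so the following is only one hypothetical way such output can arise: a character that has been widened to an int is appended as its decimal value, not as the character itself. All names here are invented for the demonstration.

```java
// Hypothetical demonstration; the asker's real code is unknown.
final class AppendPitfall {

    // Appends the numeric code of c, because StringBuilder.append(int) is chosen.
    static String appendAsInt(String prefix, char c) {
        int code = c;
        return new StringBuilder(prefix).append(code).toString();
    }

    // Appends the character itself via StringBuilder.append(char).
    static String appendAsChar(String prefix, char c) {
        return new StringBuilder(prefix).append(c).toString();
    }

    public static void main(String[] args) {
        System.out.println(appendAsInt("<html>", '\n'));  // <html>10
    }
}
```

A loop that reads with InputStream.read() or Reader.read() returns int values, so forgetting the cast back to char is an easy way to end up in the first variant.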
Closing thoughts
More generally the job of 'just print a file on disk with a known encoding' can be done in far fewer lines of code:
public static String scanFile(String path) throws IOException {
return Files.readString(Paths.get(path));
}
You should just use the above code instead. It's simple, short, doesn't have any bugs, cannot leak resources, has proper exception handling, and will use UTF-8.
Actually, there is no problem in this function. I was mistakenly adding 10 using another function in my code.
We have a requirement to pick data from an Oracle DB table and dump it into a CSV file and a plain pipe-separated text file, then give the user a link in the application to view the generated CSV/text files.
As a lot of parsing was involved, we wrote a Unix shell script and call it from our Struts/J2EE application.
Earlier we were losing the Chinese and Roman characters in the generated files, and the generated files had the us-ascii charset (checked using file -i). Later we used NLS_LANG=AMERICAN_AMERICA.AL32UTF8 and this gave us UTF-8 files.
But the characters were still gibberish, so we tried the iconv command and converted the UTF-8 files to the UTF-16LE charset.
iconv -f utf-8 -t utf-16le $recordFile > $tempFile
This works fine for the generated text file. But with CSV the Chinese and Roman characters are still not correct. If we open this CSV file in Notepad, add a newline by pressing the Enter key, save it, and then open it with MS Excel, all characters display fine, including the Chinese and Roman ones, but now the text of each row is in a single cell instead of being split into columns.
Not sure what's going on.
Java code
PrintWriter out = servletResponse.getWriter();
servletResponse.setContentType("application/vnd.ms-excel; charset=UTF-8");
servletResponse.setCharacterEncoding("UTF-8");
servletResponse.setHeader("Content-Disposition","attachment; filename="+ fileName.toString());
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);
int i;
while ((i=fileInputStream.read()) != -1) {
out.write(i);
}
fileInputStream.close();
out.close();
Please let me know if i missed out any details.
Thanks to all for taking out time to go through this.
I was able to solve it. First, as mentioned by Aaron, I removed the UTF-16LE conversion to avoid future issues and encoded the files as UTF-8. I then changed the PrintWriter in the Java code to an OutputStream and was able to see the correct characters in my text file.
The CSV was still showing garbage. I learned that we need to prepend EF BB BF (the UTF-8 BOM) at the beginning of the file, as BOM-aware software like MS Excel needs it. Changing the Java code as below did the trick for CSV.
OutputStream out = servletResponse.getOutputStream();
out.write(239); //0xEF
out.write(187); //0xBB
out.write(191); //0xBF
FileInputStream fileInputStream = new FileInputStream(fileLoc + fileName);
int i;
while ((i=fileInputStream.read()) != -1) {
out.write(i);
}
fileInputStream.close();
out.flush();
out.close();
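The same idea can be sketched as a small stream-to-stream helper, which also replaces the byte-at-a-time loop with a buffered copy. The class and method names are illustrative, and the sketch assumes the input really is UTF-8 and does not already start with a BOM.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: writes the UTF-8 BOM, then copies the input unchanged.
final class BomCopy {

    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    static void copyWithBom(InputStream in, OutputStream out) throws IOException {
        out.write(UTF8_BOM);
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        out.flush();
    }
}
```

In the servlet, the input would be the FileInputStream for the generated CSV and the output would be servletResponse.getOutputStream(); opening the file in a try-with-resources block would also close it if the copy fails.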
As always with Unicode problems, every single step of the transformation chain must work perfectly. If you make a mistake in one place, data will be silently corrupted. There is no easy way to figure out where it happens; you have to debug the code or write unit tests.
The Java code above only works if the file actually contains UTF-8 encoded data; it doesn't "magically" figure out what's in the file and convert it to UTF-8. So if the file already contains garbage, you just slap a "this is UTF-8" label on it but it's still garbage.
That means you need to create test cases which take known test data and move it through every step of the chain: inserting into the database, reading from the database, writing to CSV, writing to the text file, reading those files, and downloading them to the user.
For each step, you need to write unit tests which take a known Unicode string like abc öäü, process it, and then check the result. To make it easier to input in Java code, use "abc \u00f6\u00e4\u00fc". You may also want to add spaces at the beginning and end of the string to see whether they are properly preserved or not.
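As a rough sketch of such a check, the snippet below pushes the sample string through one stand-in "step" (encoding to bytes and decoding again) and reports whether it survived. The class and method names are invented; in a real project each step of the chain (database, CSV writer, download) would get its own test of this shape.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical round-trip check for a single step of the chain.
final class RoundTripCheck {

    // Known sample with non-ASCII characters and leading/trailing spaces.
    static final String SAMPLE = " abc \u00f6\u00e4\u00fc ";

    // Stand-in for a real step: text -> bytes -> text using one charset.
    static String throughBytes(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    static boolean survives(Charset charset) {
        return SAMPLE.equals(throughBytes(SAMPLE, charset));
    }

    public static void main(String[] args) {
        System.out.println(survives(StandardCharsets.UTF_8));    // true
        System.out.println(survives(StandardCharsets.US_ASCII)); // false
    }
}
```

A step that fails this check is where the corruption starts, which narrows the search considerably.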
file -i doesn't help you much here since it just makes a guess what the file contains. There is no indicator (data or metadata) in a text file which says "this is UTF-8". UTF-16 supports a BOM header for this but almost no one uses UTF-16, so many tools don't support it (properly).
This is my first time using FreeMarker in a Java project, and I am stuck on configuring it for Chinese characters.
I tried a lot of examples to fix the code below, but I still cannot make it work.
// Free-marker configuration object
Configuration conf = new Configuration();
conf.setTemplateLoader(new ClassTemplateLoader(getClass(), "/"));
conf.setLocale(Locale.CHINA);
conf.setDefaultEncoding("UTF-8");
// Load template from source folder
Template template = conf.getTemplate(templatePath);
template.setEncoding("UTF-8");
// Get Free-Marker output value
Writer output = new StringWriter();
template.process(input, output);
// Map Email Full Content
EmailNotification email = new EmailNotification();
email.setSubject(subject);
.......
I saw some examples that ask for changes to freemarker.properties, but I do not have this file. I just imported the .jar file and use it.
Kindly advise what I should do to make it display Chinese characters.
What exactly is the problem?
Anyway, conf.setDefaultEncoding("UTF-8"); should be enough, assuming your template files are indeed in UTF-8. But another place where you have to ensure proper encoding is where you convert the template output back to "binary" from Unicode text. FreeMarker sends its output into a Writer, so everything is Unicode so far, but then you will have an OutputStreamWriter or something like that, and that has to use a charset (probably UTF-8) that can encode Chinese characters.
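The Writer-to-bytes step can be sketched without FreeMarker itself. In the snippet below the rendered string stands in for whatever template.process(...) wrote into the StringWriter; the class and method names are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper showing where text becomes bytes again.
final class WriterToBytes {

    static byte[] encode(String rendered, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // The charset passed here decides whether Chinese characters survive.
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(rendered);
        }
        return bytes.toByteArray();
    }
}
```

Passing StandardCharsets.UTF_8 keeps Chinese text intact, while a charset such as ISO-8859-1 would replace each Chinese character with a question mark. For an email, the equivalent place is wherever the message body is given its charset.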
You need to change your file encoding of your .ftl template files by saving over them in your IDE or notepad, and changing the encoding in the save dialog.
There should be an Encoding dropdown at the bottom of the save dialog.
I am sending an AJAX request with jQuery post() and serialize. That uses UTF-8.
For example, when 'ś' is the value of the name input, JavaScript shows name=%C5%9B.
I have tried setting form encoding without success.
<form id="dyn_form" action="dyn_ajax.xml" style="display:none;" accept-charset="UTF-8">
The same happens with encodeURI(document.getElementById("name_id").value). I'm using Servlets on Tomcat 5.5.
I had this kind of problem many times.
Verify your pages are saved in UTF-8 encoding.
If it's really UTF-8, try decodeURIComponent.
I always had a hard time convincing the request object to decode the URIEncoded strings correctly.
I finally made the following hack.
try {
String pvalue = req.getParameter(name);
if (null != pvalue) {
byte[] pbytes = pvalue.getBytes("ISO-8859-1");
res = new String(pbytes, "UTF-8");
}
} catch (java.io.UnsupportedEncodingException e) {
// This should never happen as ISO latin 1 and UTF-8 are always included in jvms.
}
I don't really like this, and it's been a while since I stopped developing servlets, but it was already on tomcat 5.5, so it's worth trying.
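The same workaround can be written without the checked exception by using StandardCharsets. This is only a sketch of the hack above (the class and method names are made up), and it is only correct when the container really did decode UTF-8 bytes as ISO-8859-1; applying it to a correctly decoded value would corrupt it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper wrapping the ISO-8859-1 -> UTF-8 re-decoding hack.
final class ParamRepair {

    static String redecode(String misdecoded) {
        if (misdecoded == null) {
            return null;
        }
        // Recover the original bytes, then decode them with the right charset.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

For the example from the question, the bytes C5 9B arrive as two Latin-1 characters, and redecode turns them back into the single character 'ś'.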
I have a page where I search for a term, and the results display perfectly, whatever the character type.
Now I have a few checkboxes in a JSP, which I check and submit. One of these checkboxes has a name like ABC Farmacéutica Corporation.
When I click the submit button, I call a function that sets all parameters on a form and submits that form. (I tested with an alert showing the special character before submitting, and it displays correctly.)
Now, on the Java end, I use the Spring Framework. When I print the term in the controller, it is displayed like ABC FarmacÃ©utica Corporation.
Please help...
Thanks in advance.
EDIT :
Please try this sample Example
import java.net.*;
class sample{
public static void main(String[] args){
try{
String aaa = "ABC Farmacéutica Corporation";
String bbb = "ABC FarmacÃ©utica Corporation";
aaa = URLEncoder.encode(aaa, "UTF-8");
bbb = URLDecoder.decode(bbb, "UTF-8");
System.out.println("aaa "+aaa);
System.out.println("bbb "+bbb);
}catch(Exception e){
System.out.println(e);
}
}
}
I am getting output as,
aaa ABC+Farmac%C3%A9utica+Corporation
bbb ABC FarmacÃ©utica Corporation
Try to print the string aaa as it is.
You get "ABC FarmacÃ©utica Corporation" because the string you receive from the client has been decoded as ISO-8859-1; you need to re-decode its bytes as UTF-8 before you URL-decode it. Like this:
bbb = URLDecoder.decode(new String(bbb.getBytes("ISO-8859-1"), "UTF-8"), "UTF-8");
NOTE: some encodings cannot be converted to and from other encodings without risking data loss. For example, Thai characters (TIS-620) cannot be converted to ISO-8859-1, because that charset has no Thai letters. For this reason, avoid converting from one encoding to another unless it is really necessary (i.e. the data comes from an external, third-party, or proprietary source, etc.). This is only a solution for converting from one encoding to another when the source encoding is known.
This is an encoding problem, and the Ã clearly identifies this as UTF-8 text interpreted as ISO-Latin-1 (or one of its cousins).
Ensure that your JSP-page at the top show that it uses UTF-8 encoding.
I suspect the problem is with character encoding on the page. Make sure the page you submit from and the one you display to use the same character set, and make sure that you set it explicitly.
For instance, if your server runs on Linux the default encoding will be UTF-8, but if you view the page on Windows it will assume (if no encoding is specified) it to be ISO-8859-1.
Also, when you are receiving the submitted text on your server side, the server will assume the default character set when building the string, whereas your user might have used a different encoding if you didn't specify one.
As I understand it, the text is hardcoded in controller code like this:
ModelAndView mav = new ModelAndView("hello");
mav.addObject("message", "ABC Farmacéutica Corporation");
return mav;
I expect this would work:
ModelAndView mav = new ModelAndView("hello");
mav.addObject("message", "ABC Farmac\u00e9utica Corporation");
return mav;
If so, the problem is due to a mismatch between the character encoding your Java editor is using and the encoding your compiler uses to read the source code.
For example, if your editor saves the Java file as UTF-8 and you compile on a system where UTF-8 is not the default encoding, then you would need to tell your compiler to use that encoding:
javac -cp foo.jar -encoding UTF-8 Bar.java
Your build scripts and IDE settings need to be consistent when handling character data.
If your text editor saved your file as UTF-8 then, in a hex editor, é would be the byte sequence C3 A9; in many other encodings, it would have the value E9. ISO-8859-1 and windows-1252 would encode Ã© as C3 A9. You can read about character encoding in Java source files here.
Change the encoding of the JSP page to UTF-8 under File > Properties, then add this line at the top of your JSP page: <%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"%>