Hebrew encoding from CSV to Eclipse to MySQL comes out as garbage - java

I have a CSV file with Hebrew characters. When I open it in TextEdit on my Mac, I can see the Hebrew just fine.
I read it into my Java code with a Scanner, specifying UTF-8 as the encoding:
File file = new File(System.getProperty("user.dir") + System.getProperty("file.separator") + fileName);
Scanner scanner = new Scanner(new FileInputStream(file), "UTF-8");
Then I parse it and send it to the MySQL database using Hibernate:
for (int i = 0; i < elements.length; i++) {
    String elem = elements[i];
    String[] client = elem.split(",");
    for (int j = 0; j < client.length; j++) {
        Client c = new Client();
        c.setFirstName(client[j]);
        System.out.println(client[j]);
        DatastoreManager.persist(c);
    }
}
Both the printout in the Eclipse console and the entry in MySQL come out as ?????.
Searching for solutions, I tried converting the string to bytes and back:
byte[] ptext = client[j].getBytes("UTF8");
String value = new String(ptext, "UTF-8");
and I converted the MySQL table to the UTF-8 Unicode character set with collation utf8mb4_general_ci.
But nothing seems to work. Any ideas?

Use file -I {filename} on the Mac to check the file's actual encoding.
Use the encoding it reports in:
Scanner scanner = new Scanner(new FileInputStream(file), "UTF-8");
After that you should see properly encoded characters in Eclipse.
Since you are using Hibernate and MySQL, you should add the following to your Hibernate configuration:
app_persistance.connection.url=jdbc:mysql://localhost:3306/yourDatabase?useUnicode=true&characterEncoding=utf-8
app_persistance.hibernate.connection.CharSet=utf8
app_persistance.hibernate.connection.characterEncoding=utf8
app_persistance.hibernate.connection.useUnicode=true
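Independent of Hibernate and MySQL, it helps to confirm that the Scanner itself decodes the Hebrew correctly and that the Eclipse console encoding is set to UTF-8 (it can be changed in the run configuration). The following is a minimal sketch, not from the original post, that dumps the decoded characters as Unicode code points; the file name clients.csv is a placeholder. Hebrew letters should appear in the U+05D0 to U+05EA range:
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;

public class EncodingCheck {
    public static void main(String[] args) throws IOException {
        // Decode the CSV with an explicit UTF-8 charset, as in the question.
        try (Scanner scanner = new Scanner(new FileInputStream("clients.csv"), "UTF-8")) {
            String line = scanner.nextLine();
            // Print each character as a Unicode code point to verify the decoding.
            line.chars().forEach(cp -> System.out.printf("U+%04X ", cp));
            System.out.println();
        }
    }
}
If the code points are correct but MySQL still stores ?????, the problem is on the JDBC connection or column side rather than in the file reading.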

Related

Encoding issue when reading from Google translator API and writing to properties file

I am using the Google translator API to generate an Arabic property file from an English property file.
Making a URL connection and issuing a GET request to the URL, passing the original language, the target language, and the value to be translated:
URLConnection urlCon = null;
String urlStr = "https://www.googleapis.com/language/translate/v2";
URL url = new URL(urlStr + "?key=" + apikey + "&source=" + origlang + "&target=" + translateToLang + "&q=" + value);
urlCon = url.openConnection();
urlCon.setConnectTimeout(1000 * 60 * 5);
urlCon.setReadTimeout(1000 * 60 * 5);
urlCon.setDoInput(true);
urlCon.setDoOutput(true);
urlCon.setUseCaches(false);
((HttpURLConnection) urlCon).setRequestMethod("GET");
urlCon.setRequestProperty("Accept-Charset", "UTF-8");
Reading the response from the URL connection through an InputStreamReader, passing UTF-8 as the encoding parameter:
BufferedReader br = new BufferedReader(new InputStreamReader(((URLConnection) urlCon).getInputStream(), "UTF-8"));
/* Reading the response line by line */
StringBuffer responseString = new StringBuffer();
String nextLine = null;
while ((nextLine = br.readLine()) != null) {
    responseString.append(nextLine);
}
// if response is null or empty, throw exception
String response = responseString.toString();
Parsing the JSON received through the Gson parser:
JsonElement jelement = new JsonParser().parse(response);
JsonObject jobject = jelement.getAsJsonObject();
jobject = jobject.getAsJsonObject("data");
JsonArray jarray = jobject.getAsJsonArray("translations");
jobject = jarray.get(0).getAsJsonObject();
String result = jobject.get("translatedText").toString();
Writing the translated value to a new property file through a FileOutputStream:
FileOutputStream foutStream = new FileOutputStream(outFile);
foutStream.write(key.getBytes());
foutStream.write("=".getBytes());
foutStream.write(transByte.getBytes());
foutStream.write("\n".getBytes());
The issue is that I am getting garbled text (?????) written in the new property file for the Arabic language.
When you call transByte.getBytes(), the Arabic translation is encoded with your platform default encoding, which will only handle Arabic if your machine is configured for UTF-8 or Arabic. Otherwise, characters will be replaced by '�' or '?'.
Create a new Properties instance, and populate it using setProperty() calls. Then when you store it, the proper escaping will be applied to your Arabic text, which is necessary because property files are encoded with ISO-8859-1 (an encoding for Western Latin characters).
Alternatively, you can store the Properties using a Writer instance that is configured with whatever encoding you choose, but the encoding isn't stored in the file itself, so you will need meta-data or a convention to set the correct encoding when reading the file again.
Finally, you can store the Properties in an XML format, which will use UTF-8 by default, or you can specify another encoding. The file itself will specify the encoding, so it's easier to use an optimal encoding for each language.
Trying to emit a file format using custom string concatenation, as you are doing, is an oft-repeated recipe for disaster. Whether it's XML, JSON, or a simple properties file, it's far too easy to overlook special cases that require escape sequences, etc. Use a library designed to emit the format instead.
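As a rough illustration of that advice, here is a minimal sketch (file names, keys, and the sample value are made up, not from the original code) that writes translations through java.util.Properties instead of raw byte concatenation:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Properties;

public class TranslationPropertiesWriter {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // Placeholder key and Arabic value; in the real code this would be the
        // translated text returned by the API.
        props.setProperty("greeting", "مرحبا");

        // store() writes characters outside ISO-8859-1 as Unicode escapes, so the
        // file remains a valid properties file regardless of the platform encoding.
        try (OutputStream out = new FileOutputStream("messages_ar.properties")) {
            props.store(out, "Generated translations");
        }

        // Alternatively, storeToXML() writes UTF-8 (or an encoding you choose)
        // and records it in the XML declaration.
        try (OutputStream out = new FileOutputStream("messages_ar.xml")) {
            props.storeToXML(out, "Generated translations", "UTF-8");
        }
    }
}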

Convert .txt file (ANSI encoding) to .arff without losing accents

I'm having serious trouble finding how to convert a .txt file in ANSI encoding to an .arff file in Weka without losing some accents and the meaning of a word in the process. I'm reading articles in Spanish, and the problem is that words with accents are converted badly: the letter with the accent comes out like this.
My original .txt | .arff file result of the conversion
Minería | Miner�a
The letter "í" is lost in the process.
My current code is this (code provided by weka university):
public Instances createDataset(String directoryPath) throws Exception {
    FastVector atts = new FastVector(2);
    atts.addElement(new Attribute("filename", (FastVector) null));
    atts.addElement(new Attribute("contents", (FastVector) null));
    Instances data = new Instances("text_files_in_" + directoryPath, atts, 0);
    File dir = new File(directoryPath);
    String[] files = dir.list();
    for (int i = 0; i < files.length; i++) {
        if (files[i].endsWith(".txt")) {
            try {
                double[] newInst = new double[2];
                newInst[0] = (double) data.attribute(0).addStringValue(files[i]);
                File txt = new File(directoryPath + File.separator + files[i]);
                // I put my new code in here
                // up to here
                InputStreamReader is;
                is = new InputStreamReader(new FileInputStream(txt));
                StringBuffer txtStr = new StringBuffer();
                int c;
                while ((c = is.read()) != -1) {
                    txtStr.append((char) c);
                    // from here on it is my own added code
                    // System.out.println("Sale " + is.toString());
                }
                newInst[1] = (double) data.attribute(1).addStringValue(txtStr.toString());
                data.add(new Instance(1.0, newInst));
            } catch (Exception e) {
                // System.err.println("failed to convert file: " + directoryPath + File.separator + files[i]);
            }
        }
    }
    return data;
}
I'm using NetBeans to read the files from a folder on my computer.
You may think that I'm asking the same thing as other posts on this page, but really I'm not, because what I need is a converter that correctly converts the accents in Spanish.
I've tried changing the encoding in NetBeans to UTF-8 and to ANSI, but neither solution worked for me (I went to the configuration file in Netbeans8.1 --> etc --> netbeans.conf and added -J-Dfile.encoding=UTF-8 to the netbeans_default_options=.......... line, but it still doesn't work). I'm getting a bit frustrated with this problem.
Well, I found a partial solution after losing my mind. In fact this solution isn't a real solution, so I hope that one day someone posts an answer that may save the world of data mining. The solution consists in saving the text as UTF-8 without BOM (UTF-8 sin BOM). You also have to configure NetBeans to read UTF-8, as I explained above.
I had this problem; my solution was to encode the file as ANSI.
I used Notepad++.
Steps:
Open your file
Go to the top panel
Encoding -> Encode in ANSI
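Another option, instead of re-encoding the files by hand, is to declare the encoding when reading them. The reader in the question uses the platform default charset; a sketch along these lines (the charset name is an assumption: windows-1252 for "ANSI" files, or UTF-8 if you re-saved them) keeps the accented characters intact:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class AccentSafeReader {
    // Read a whole text file using an explicitly named encoding.
    public static String readFile(String path, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), Charset.forName(charsetName)))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "windows-1252" is a guess for the "ANSI" encoding mentioned in the question.
        System.out.println(readFile("Mineria.txt", "windows-1252"));
    }
}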

Reading from a UTF-8 formatted PDF file and writing it to a writer in cp1252 format

I am trying to read from a PDF file using file streams, and I want to write it to a writer in cp1252 encoded format. The following is the code:
byte buf[] = new byte[8192];
InputStream is = new FileInputStream(f);
ByteArrayOutputStream oos = new ByteArrayOutputStream();
int c=0;
while ((c = is.read(buf)) != -1) {
    oos.write(buf, 0, c);
}
byte out[] = oos.toByteArray();
String str = new String(out, "UTF-8");
char[] ch = str.toCharArray();
writer.write(ch);
is.close();
oos.close();
But the output is erroneous, as the text is not readable (not properly converted). How do I fix this?
You are probably encountering an error while trying to read from the PDF file. Try using PDFBox for extracting text from the PDF file. It's probably one of the best ways to do so. Once you have the required text, you can then save it using cp1252 encoding.
You can check out examples of text extraction using PDFBox in its documentation.
Regarding conversion to cp1252, if you are using a Windows machine, then the default encoding is cp1252. So simply trying to save the text should hopefully save it in cp1252 encoding.
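As a rough sketch of that approach (assuming PDFBox 2.x is on the classpath; the file names are placeholders, not from the original code), extracting the text and saving it with an explicit cp1252 encoding could look like this:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfToCp1252 {
    public static void main(String[] args) throws IOException {
        // Let PDFBox parse the PDF and extract its text content.
        try (PDDocument document = PDDocument.load(new File("input.pdf"))) {
            String text = new PDFTextStripper().getText(document);

            // Write the extracted text with an explicit cp1252 (windows-1252) encoding.
            // Characters with no cp1252 mapping will be replaced with '?'.
            try (Writer writer = new OutputStreamWriter(
                    new FileOutputStream("output.txt"), "windows-1252")) {
                writer.write(text);
            }
        }
    }
}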

Byte Order Mark to read UTF-8 CSV and Excel files

Friends,
I have to use a BOM (Byte Order Mark) to make sure that downloaded CSV and Excel files in UTF-8 format are displayed properly.
My question is: can the BOM be applied to a FileOutputStream instead of an OutputStream, as below?
String targetfileName = reportXMLFileName.substring(0, reportXMLFileName.lastIndexOf(".") + 1);
targetfileName += "xlsx";
File tmpFile = new File((filePath != null ? filePath.trim() : "") + "/" + templateFile);
FileOutputStream out = new FileOutputStream((filePath != null ? filePath.trim() : "") + "/" + targetfileName);
/* Here is the example*/
out.write(new byte[] {(byte)0xEF, (byte)0xBB, (byte)0xBF });
substitute(tmpFile, tmp, sheetRef.substring(1), out);
out.close();
(As asked in a comment.)
In the following, file may be a File, a (File)OutputStream, or a String (file name).
final String ENCODING = "UTF-8";
PrintWriter out = new PrintWriter(file, ENCODING);
try {
    out.print("\uFEFF"); // Write the BOM (U+FEFF)
    out.println("<?xml version=\"1.0\" encoding=\"" + ENCODING + "\"?>");
    ...
} finally {
    out.close();
}
Or since Java 7:
try (PrintWriter out = new PrintWriter(file, ENCODING)) {
    out.print("\uFEFF"); // Write the BOM (U+FEFF)
    out.println("<?xml version=\"1.0\" encoding=\"" + ENCODING + "\"?>");
    ...
}
When working with text it is more natural to use a Writer, as one does not then need to convert strings manually with String.getBytes("UTF-8").
Yes. A BOM sits at the beginning of any data stream, whether over the network or in a file. Just write the BOM at the start of the file in the same manner, only to a FileOutputStream. In any case, remember that a FileOutputStream is a type of OutputStream.
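To make that concrete, here is a minimal sketch (the file name and content are made up) that writes the UTF-8 BOM straight to a FileOutputStream and then continues with a UTF-8 writer on the same stream:
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class BomExample {
    public static void main(String[] args) throws IOException {
        try (FileOutputStream out = new FileOutputStream("report.csv")) {
            // The UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
            out.write(new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF });

            // Any further content goes through a UTF-8 writer wrapping the same stream.
            Writer writer = new OutputStreamWriter(out, "UTF-8");
            writer.write("name,city\n");
            writer.write("José,Malmö\n");
            writer.flush();
        }
    }
}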

How to save an HTML page with special chars (UTF-8) to a txt file

I need to write some Java code that saves an HTML page to a txt file.
The problem is that the special chars in UTF-8 are broken.
Words like "Hamamélis" are saved in this way: "Hamam�lis".
The code that I wrote is listed below:
URLConnection conn;
conn = site.openConnection();
conn.setReadTimeout(10000);
Charset charset = Charset.forName("UTF8");
BufferedReader in = new BufferedReader( new InputStreamReader( conn.getInputStream(), "UTF-8" ) );
buff = in.readLine();
And after:
out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(Nome), "UTF-8"));
out.write(buff);
out.close();
Can anyone suggest a solution?
One possible error is omitting the hyphen from "UTF-8" in the 4th line of your first piece of code. See the Charset documentation.
Otherwise, the code seems correct. But of course we cannot test it directly, as we do not have your data.
For comparison, here is a little class I wrote. In a manner similar to your code, this class correctly writes your "Hamamélis" example's accented 'e' as the two octets expected in UTF-8 for a single (non-normalized) character: in hex 'C3' & 'A9'.
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.BufferedWriter;
import java.io.IOException;
public class ReaderWriter {
    public static void main(String[] args) {
        try {
            String content = "Hamamélis. Written: " + new java.util.Date();
            File file = new File("some_text.txt");
            // Create file if not already existent.
            if (!file.exists()) {
                file.createNewFile();
            }
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            OutputStreamWriter outputStreamWriter = new OutputStreamWriter(fileOutputStream, "UTF-8");
            BufferedWriter bufferedWriter = new BufferedWriter(outputStreamWriter);
            bufferedWriter.write(content);
            bufferedWriter.close();
            System.out.println("ReaderWriter 'main' method is done. " + new java.util.Date());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
As icktoofay commented, you should dig deeper to discover exactly what octets are involved. Use a hex editor like this "File Viewer" app I found today on the Mac App Store to see the exact octets in your saved file.
If the octets are C3 & A9, then the problem is simply that the text editor you used to look at the file as text used the wrong character encoding. For example, you can open that text file in a web browser, and use its menu commands to re-interpret the file as UTF-8.
If the octets are not C3 & A9, I would go further back to examine the input's octets.
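If you prefer not to install a hex editor, a quick sketch like the following (reusing the some_text.txt file name from the example above) prints every byte of the saved file in hex, so you can see whether the 'é' was stored as C3 A9:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HexDump {
    public static void main(String[] args) throws IOException {
        // Read the whole file and print each byte as two hexadecimal digits.
        byte[] bytes = Files.readAllBytes(Paths.get("some_text.txt"));
        for (byte b : bytes) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}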
If you do not understand that text files in computers actually contain numbers (not text in the human sense), then take a break from coding to read this entertaining article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
