I have a Java program which reads docx files line by line with Apache POI. I have a word list, and if a word from the list matches a line, I print that line from the docx file. So far I had no problems. Today I got an output like this:
Attempt to output character of integral value 0 that is not represented in specified output encoding of UTF-8.
What does this mean, and how can I solve it?
Thank you.
Here is my code where I read the docx file and print the lines:
URL url = new URL(URL.get(y)); // URL, ObjectsLine and Change are lists defined elsewhere
File file = new File("E:\\demo\\myfile.docx");
org.apache.commons.io.FileUtils.copyURLToFile(url, file);

POITextExtractor extractor1 = ExtractorFactory.createExtractor(file);
String text = extractor1.getText();

BufferedReader readme = new BufferedReader(new StringReader(text));
String sCurrentLine3;
while ((sCurrentLine3 = readme.readLine()) != null) {
    // replaceAll("\\s+", "") already removes spaces, tabs, \n and \r
    sCurrentLine3 = sCurrentLine3.trim().replaceAll("\\s+", "");
    sCurrentLine3 = "Z:" + sCurrentLine3;
    sCurrentLine3 = sCurrentLine3.replace("/", "\\");
    System.out.println(ObjectsLine.get(i) + " " + Change.get(y) + " " + sCurrentLine3);
}
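That message is typically produced by an output serializer when it is asked to write a character with integral value 0 (a NUL, U+0000); text extracted from docx files can contain such control characters. A minimal workaround sketch, assuming the NULs really come from the extracted text (the helper name stripNulls is hypothetical):

// Hedged sketch: drop NUL (U+0000) characters before printing, since the
// error message says they cannot be represented in the UTF-8 output
static String stripNulls(String line) {
    return line.replace("\u0000", "");
}

// inside the loop, before printing:
// sCurrentLine3 = stripNulls(sCurrentLine3);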
I'm exporting a SQL table into a txt file using Java. All works fine except for a minor problem, which is the accent mark on some letters.
In particular I have a problem with 'è', because in my txt file it comes out as a mangled character.
I don't know why, but if I export only the single record that contains that field, there is no issue; if I export the whole table (thousands of records), I get this problem.
I've already tried the replace method that the String class offers, but with the same result. I've also tried to use the Unicode escape character, but still the same problem.
Any hints?
EDIT
Here's my code:
if (array_query.equals(versamenti_tasi)) {
    File myOutputDir = new File("C:\\comuni\\prova\\009_001_Versamenti");
    if (!myOutputDir.exists())
        myOutputDir.mkdir(); // returns a boolean if you want to check it succeeded

    PrintStream o = new PrintStream(myOutputDir + "\\" + array_query[x].substring(26) + ".txt");
    System.setOut(o);

    while (rs.next()) {
        out = "";
        final_index = 0;
        for (int i = 1; i <= rsmd.getColumnCount(); i++) {
            if (rs.getString(i) == null) {
                // null field: pad with blanks up to the column's display size
                final_index += rsmd.getColumnDisplaySize(i);
                out += "" + str_valore;
                out = out.substring(0, final_index);
            } else {
                final_index += rsmd.getColumnDisplaySize(i);
                out += rs.getString(i).trim() + str_valore;
                out = out.substring(0, final_index);
            }
        }
        System.out.println(out);
    }
}
Briefly: the writing part is in the loop; str_valore is a blank string of 250 characters, because I need to pad each field to its original size in the DB; rs is my ResultSet and rsmd is its ResultSetMetaData.
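As a side note, a minimal sketch of the same padding with String.format (the helper name padField is made up here; unlike the substring trick it pads but does not truncate over-long values):

// Sketch: left-justify one field in a column of its display width
static String padField(String value, int width) {
    return String.format("%-" + width + "s", value == null ? "" : value.trim());
}

// inside the column loop:
// out += padField(rs.getString(i), rsmd.getColumnDisplaySize(i));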
I do not know which Eclipse setting you've changed, but I doubt it affects your PrintStream (Eclipse probably just reads/presents files as UTF-8).
To set the encoding for your PrintStream, use the appropriate constructor.
Change this line:
PrintStream o = new PrintStream(myOutputDir + "\\" + array_query[x].substring(26) + ".txt");
to:
PrintStream o = new PrintStream(myOutputDir + "\\" + array_query[x].substring(26) + ".txt",
        "UTF-8");
(or whatever encoding is configured for your DB schema).
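For what it's worth, on Java 10 and newer the same constructor also exists with a java.nio.charset.Charset parameter, which avoids the checked UnsupportedEncodingException; a sketch with the same file-name logic:

import java.io.File;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

// Java 10+: pass a Charset constant instead of an encoding name
PrintStream o = new PrintStream(
        new File(myOutputDir, array_query[x].substring(26) + ".txt"),
        StandardCharsets.UTF_8);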
A question on reading text files in Java. I have a text file saved with UTF-8 encoding, containing only:
Hello. World.
Now I am using a RandomAccessFile to read this file. But for some reason, there seems to be an "invisible" character at the beginning of the file...?
I use this code:
File file = new File("resources/texts/books/testfile2.txt");
try(RandomAccessFile reader = new RandomAccessFile(file, "r")) {
String readLine = reader.readLine();
String utf8Line = new String(readLine.getBytes("ISO-8859-1"), "UTF-8" );
System.out.println("Read Line: " + readLine);
System.out.println("Real length: " + readLine.length());
System.out.println("UTF-8 Line: " + utf8Line);
System.out.println("UTF-8 length: " + utf8Line.length());
System.out.println("Current position: " + reader.getFilePointer());
} catch (Exception e) {
e.printStackTrace();
}
The output is this:
Read Line: ï»¿Hello. World.
Real length: 16
UTF-8 Line: ?Hello. World.
UTF-8 length: 14
Current position: 16
These (1 or 2) characters seem to appear only at the very beginning. If I add more lines to the file and read them, all further lines are read normally.
Can someone explain this behavior? What is this character at the beginning?
Thanks!
The first 3 bytes in your file (0xEF, 0xBB, 0xBF) are the so-called UTF-8 BOM (Byte Order Mark). The BOM matters for UTF-16 and UTF-32 only; for UTF-8 it has no meaning. Microsoft introduced it to make guessing a file's encoding easier.
That is, not all UTF-8 encoded text files have that mark, but some do.
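If you control the reading side, one way to deal with it is to detect and skip those three bytes yourself. A minimal sketch (Commons IO's BOMInputStream does the same more robustly):

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Skip a leading UTF-8 BOM if present; otherwise push the bytes back
static InputStream skipUtf8Bom(InputStream in) throws IOException {
    PushbackInputStream pushback = new PushbackInputStream(in, 3);
    byte[] head = new byte[3];
    int n = pushback.read(head, 0, 3);
    boolean hasBom = n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
    if (!hasBom && n > 0) {
        pushback.unread(head, 0, n); // not a BOM: undo the read
    }
    return pushback;
}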
I am trying to get data out of a CSV file. However, when I try to read the data so I can use the individual fields, it prints extra stuff like:
x����sr��java.util.ArrayListx����a���I��sizexp������w������t��17 mei 2017t��Home - Gastt��4 - 1t��(4 - 0)t��
With this code:
FileInputStream in = openFileInput("savetest13.dat");
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
List<String[]> resultList = new ArrayList<>();
String csvLine;
while ((csvLine = reader.readLine()) != null) {
    String[] row = csvLine.split(",");
    out.println("while gepakt");
    out.println(row);
    date = row[0];
    out.println("date: " + date);
    resultList.add(row);
    txtTest.setText(date);
}
But when I read the file like this instead, to check what data it contains, I get exactly the data I put in, although then I can't manage to split it into separate fields:
FileInputStream in = openFileInput("savetest13.dat");
ObjectInputStream ois = new ObjectInputStream(in);
List stuff = (List) ois.readObject();
txtTest.setText(String.valueOf(stuff));
[17 mei 2017, Home - Guest, 2 - 1, (2 - 0), ]
I am trying to get them separated into date, names, score1, and score2.
Which of the two approaches would be better to use, and how can I get the correct output I am failing to obtain?
You are not writing CSV to your output file; rather, you are using standard Java serialization (ObjectOutputStream#writeObject(...)) to create that file. Try using a CSV library to write/read data in CSV format (see here), and before that, start here to learn about CSV, because
[17 mei 2017, Home - Guest, 2 - 1, (2 - 0), ]
is not CSV, but only the output of toString of the list you are using.
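If you keep the serialized format instead, a minimal sketch (assuming the file really holds a single List of Strings, as the readObject output above suggests) is to index into the list rather than split lines:

import java.io.ObjectInputStream;
import java.util.List;

// The file was written with writeObject, so read it back the same way
try (ObjectInputStream ois = new ObjectInputStream(openFileInput("savetest13.dat"))) {
    @SuppressWarnings("unchecked")
    List<String> stuff = (List<String>) ois.readObject();
    String date   = stuff.get(0); // "17 mei 2017"
    String names  = stuff.get(1); // "Home - Guest"
    String score1 = stuff.get(2); // "2 - 1"
    String score2 = stuff.get(3); // "(2 - 0)"
} catch (Exception e) {
    e.printStackTrace();
}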
Here is an easy way to write the CSV file formatted correctly. This is a simple example which does not take into account the need to escape any commas found in your data. You can open the file created in Excel, Google Sheets, OpenOffice etc. and see that it is formatted correctly.
final String COMMA = ",";
final String NEW_LINE = System.getProperty("line.separator");
ArrayList<String> myRows = new ArrayList<String>();
// add comma delimited rows to ArrayList
myRows.add("date" + COMMA +
"names" + COMMA +
"score1" + COMMA +
"score2"); // etc.
// optional - insert field names into the first row
myRows.add(0, "[Date]" + COMMA +
"[Names]" + COMMA +
"[Score1]" + COMMA +
"[Score2]");
// get a writer
final String fileName = "myFileName.csv";
final String dirPath = Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOCUMENTS).getAbsolutePath();
File myDirectory = new File(dirPath);
FileWriter writer = new FileWriter(new File(myDirectory, fileName));
// write the rows to the file
for (String myRow : myRows) {
writer.append(myRow + NEW_LINE);
}
writer.close();
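Reading the rows back then works with the split from the question, since every line really is comma-delimited now; a short sketch reusing the names above:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Read the file back and split each line on the same delimiter
try (BufferedReader reader = new BufferedReader(
        new FileReader(new File(myDirectory, fileName)))) {
    String csvLine;
    while ((csvLine = reader.readLine()) != null) {
        String[] row = csvLine.split(COMMA);
        // row[0] = date, row[1] = names, row[2] = score1, row[3] = score2
    }
} catch (IOException e) {
    e.printStackTrace();
}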
I'm having serious trouble finding how to convert a .txt file in ANSI encoding to an .arff file in Weka without losing some accents, and with them the meaning of a word, in the process. I'm reading articles in Spanish, and the problem is that words with accents are converted badly, because the accented letter gets mangled like this.
My original .txt | .arff file result of the conversion
Minería | Miner�a
The letter "í" is lost in the process.
My code is currently this (code provided by weka university):
public Instances createDataset(String directoryPath) throws Exception {
    FastVector atts = new FastVector(2);
    atts.addElement(new Attribute("filename", (FastVector) null));
    atts.addElement(new Attribute("contents", (FastVector) null));
    Instances data = new Instances("text_files_in_" + directoryPath, atts, 0);

    File dir = new File(directoryPath);
    String[] files = dir.list();
    for (int i = 0; i < files.length; i++) {
        if (files[i].endsWith(".txt")) {
            try {
                double[] newInst = new double[2];
                newInst[0] = (double) data.attribute(0).addStringValue(files[i]);
                File txt = new File(directoryPath + File.separator + files[i]);
                // I put my new code here
                // ... up to here
                InputStreamReader is;
                is = new InputStreamReader(new FileInputStream(txt));
                StringBuffer txtStr = new StringBuffer();
                int c;
                while ((c = is.read()) != -1) {
                    txtStr.append((char) c);
                    // from here on it's my own additions
                    // System.out.println("Sale " + is.toString());
                }
                newInst[1] = (double) data.attribute(1).addStringValue(txtStr.toString());
                data.add(new Instance(1.0, newInst));
            } catch (Exception e) {
                // System.err.println("failed to convert file: " + directoryPath + File.separator + files[i]);
            }
        }
    }
    return data;
}
I'm using NetBeans, reading the files from a folder on my computer.
You may think that I'm asking the same thing as other posts on this page, but really I'm not, because what I need is a converter that handles the Spanish accents correctly.
I've tried changing the encoding in NetBeans to UTF-8 and to ANSI, but neither worked for me (I went to the NetBeans 8.1 configuration file, etc/netbeans.conf, and added -J-Dfile.encoding=UTF-8 to the netbeans_default_options=... line, but it still doesn't work). I'm getting a bit frustrated with this problem.
Well, I found a partial solution after losing my mind. In fact this solution isn't a real solution, so I hope that one day someone answers with something that may save the world of data mining. The solution consists in saving the text as UTF-8 without BOM. You also have to configure NetBeans to read UTF-8, as I explained above.
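For completeness, the likely root cause in the code above is that new InputStreamReader(new FileInputStream(txt)) uses the platform default charset. A minimal sketch that names the encoding explicitly (UTF-8 here; "windows-1252" would be the usual choice if the files stay ANSI):

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Name the file's actual encoding instead of relying on the platform default
InputStreamReader is = new InputStreamReader(
        new FileInputStream(txt), StandardCharsets.UTF_8);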
I had this problem; my solution was to re-encode the file to ANSI.
I used Notepad++.
Steps:
Open your file
Go to the top panel
Encoding -> Encode in ANSI
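The same conversion can also be done in code; a hedged sketch, assuming "ANSI" means windows-1252 (the usual Western-European Windows code page) and with placeholder file names:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Read the bytes as UTF-8 and rewrite them as windows-1252
Path source = Paths.get("input.txt");  // placeholder
Path target = Paths.get("output.txt"); // placeholder
String content = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
Files.write(target, content.getBytes(Charset.forName("windows-1252")));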
I am taking a string from Java and placing it into a text file. When the string is written it does not contain Â; however, when the file is opened in WordPad, the character appears.
String without:
Notice of Appeal:
Hamilton City Board of Education
String with:
Notice of Appeal:
Â
Â
Hamilton City Board of Education
Below is the code that writes the string:
out = new BufferedWriter (new FileWriter(filePrefix + "-body" + ".txt"));
out.write("From: " + em.from);
out.newLine();
out.write("Sent Date: " + em.sentDate);
out.newLine();
out.write("Subject: " + em.subject);
out.newLine();
out.newLine();
out.newLine();
String temp = new String(emi.stringContent.getBytes("UTF-8"), "UTF-8");
out.write(temp);
What should I do so that they don't appear in WordPad?
This looks like a UTF-8 encoding problem to me. I believe you are getting the  character because you are writing the content in UTF-8, and the content contains a high-ASCII value, but WordPad is expecting the data to be in the code-page your local system is running in. Either write the content in the code-page expected by WordPad, or make WordPad expect UTF-8.
As an aside:
String temp = new String(emi.stringContent.getBytes("UTF-8"), "UTF-8");
out.write(temp);
is a complete waste of time; use:
out.write(emi.stringContent);
instead.
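Concretely, the first option could look like this, replacing the FileWriter line (windows-1252 is an assumption; substitute the code page your system actually uses):

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;

// Write with an explicit code page so WordPad's default matches the bytes
out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream(filePrefix + "-body" + ".txt"), "windows-1252"));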
I'm assuming this is a line separator issue.
Use:
String line = System.getProperty("line.separator");
and just add it to your string wherever you want a new line.
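For example, with the strings from the question above:

// Build the text with explicit platform line separators
String line = System.getProperty("line.separator");
String text = "Notice of Appeal:" + line + line
        + "Hamilton City Board of Education";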