I am trying to display txt or docx file content in a JTextArea, but the text area does not correctly display Armenian or Russian text. Specifying UTF-8 encoding in the InputStreamReader does not help:
public class TextReader {
    public static String getText(File textFile) throws IOException {
        FileInputStream fis = new FileInputStream(textFile);
        InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
        BufferedReader br = new BufferedReader(isr);
        StringBuilder text = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) {
            text.append(line).append("\n");
        }
        br.close(); // closing the outermost reader also closes isr and fis
        return text.toString();
    }
}
I am using this static method in another class to fill a JTextArea:
String text = TextReader.getText(currentFile);
textArea.setText(text);
After running the application and choosing the file, I get random characters. What could be the solution in this case?
Your code seems to be fine. My guess is you are trying to read a docx file.
You can't directly read docx files this way. Use some library like Apache POI.
If you are indeed reading a plain text file, it may be that the application you used to save the file chose a different encoding. You could try saving some hard-coded sample Russian text to a text file from Java itself and reading it back into your JTextArea.
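For example, a minimal sketch of that round-trip test (sample.txt is a hypothetical file name):

File sample = new File("sample.txt"); // hypothetical test file
Writer out = new OutputStreamWriter(new FileOutputStream(sample), "UTF-8");
out.write("Привет, мир!\n"); // hard-coded Russian sample text
out.close();
textArea.setText(TextReader.getText(sample));

If this round trip displays correctly, your reading code is fine and the original file was most likely saved in a different encoding.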
I have an
InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("templates/createUser/new-user.txt");
and the content of the new-user.txt is :
Hello™ how r u ®
but when it is displayed in the output, it appears as
Hello��� how r u��
Can you tell me what changes I should make to my txt file so that it displays the data correctly?
UPDATE
So here is the code:
Handlebars handlebars = new Handlebars();
InputStream txtInputStream = this.getClass().getClassLoader()
.getResourceAsStream("templates/createUser/new-user.txt");
Template textTemplate = handlebars.compileInline(IOUtils.toString(txtInputStream));
String emailText = textTemplate.apply(vars);
The problem does not lie in the InputStream object. InputStreams are just streams of bytes; they do not differentiate between encodings. The problem is that you should use this as your reader:
Reader reader = new InputStreamReader(inputStream, "UTF-8");
as opposed to using this:
Reader reader = new InputStreamReader(inputStream); // does not specify encoding
You can then get the string with:
String theString = IOUtils.toString(inputStream, "UTF-8");
Edit:
I did not realize you had posted the full code in your update. Just change the second-to-last line to:
Template textTemplate = handlebars.compileInline(IOUtils.toString(txtInputStream, "UTF-8"));
I am trying to read an xls file in Java and convert it to csv. The problem is that it contains Greek characters. I have used various methods with no success.
br = new BufferedReader(new InputStreamReader(
new FileInputStream(saveDir+"/"+fileName+".xls"), "UTF-8"));
FileWriter writer1 = new FileWriter(saveDir+"/A"+fileName+".csv");
byte[] bytes = thisLine.getBytes("UTF-8");
writer1.append(new String(bytes, "UTF-8"));
I used that with different encodings, like UTF-16 and windows-1253, and of course without using the byte array. None worked. Any ideas?
Use "ISO-8859-7" instead of "UTF-8". It is for latin and greek. See documentation
InputStream in = new BufferedInputStream(new FileInputStream(new File(myfile)));
result = new Scanner(in,"ISO-8859-7").useDelimiter("\\A").next();
A Byte Order Mark (BOM) should be written at the start of the CSV file.
Can you try this code?
PrintWriter writer1 = new PrintWriter(saveDir + "/A" + fileName + ".csv", "UTF-8");
writer1.print('\ufeff');
....
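Putting the two suggestions together, here is a minimal sketch that reads the source as ISO-8859-7 and writes a UTF-8 CSV with a BOM, assuming the .xls file actually contains plain text (a real binary .xls needs a library such as Apache POI instead):

BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(saveDir + "/" + fileName + ".xls"), "ISO-8859-7"));
PrintWriter writer1 = new PrintWriter(saveDir + "/A" + fileName + ".csv", "UTF-8");
writer1.print('\ufeff'); // BOM so that spreadsheet tools detect UTF-8
String thisLine;
while ((thisLine = br.readLine()) != null) {
    writer1.println(thisLine);
}
br.close();
writer1.close();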
I have a set of pdf files that contain Central European characters such as č, Ď, Š, and so on. I want to convert them to text, and I have tried pdftotext and PDFBox through Apache Tika, but some characters are always converted incorrectly.
The strange thing is that the same character in the same text is converted correctly in some places and incorrectly in others! An example is this pdf.
In the case of pdftotext I am using these options:
pdftotext -nopgbrk -eol dos -enc UTF-8 070612.pdf
My Tika code looks like this:
String newname = f.getCanonicalPath().replace(".pdf", ".txt");
OutputStreamWriter print = new OutputStreamWriter(
        new FileOutputStream(newname), Charset.forName("UTF-16"));
String fileString = "path\\to\\myfiles\\";
InputStream is = null;
try {
    is = new FileInputStream(f);
    ContentHandler contenthandler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    PDFParser pdfparser = new PDFParser();
    pdfparser.parse(is, contenthandler, metadata, new ParseContext());
    String outputString = contenthandler.toString();
    outputString = outputString.replace("\n", "\r\n");
    System.err.println("Writing now file " + newname);
    print.write(outputString);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (is != null) is.close();
    print.close();
}
Edit: Forgot to mention that I am facing the same issue when converting to text with Acrobat Reader XI as well.
Well, aside from anything else, this code will use the platform default encoding:
PrintWriter print = new PrintWriter(newname);
print.print(outputString);
print.close();
I suggest you use an OutputStreamWriter wrapping a FileOutputStream instead, and specify UTF-8 as the encoding (it can encode all of Unicode and is generally well supported).
You should also close the writer in a finally block, and I'd probably separate the "reading" part from the "writing" part. (I'd avoid catching Exception too, but the details of exception handling are beyond the point of this answer.)
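A minimal sketch of that suggestion, reusing newname and outputString from the question's code:

Writer writer = null;
try {
    // UTF-8 can represent all of Unicode, unlike the platform default encoding
    writer = new OutputStreamWriter(new FileOutputStream(newname), "UTF-8");
    writer.write(outputString);
} finally {
    if (writer != null) {
        writer.close();
    }
}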
I am using some Arabic text in my app. On the simulator the Arabic text is displayed fine, but on the device it is not displayed properly.
On the simulator it looks like مَرْحَبًا.
But on the device it looks like مرحبا.
What I need is مَرْحَبًا.
Here is how to create text resources for a MIDP application and load them at run-time. This technique is Unicode safe, so it is suitable for all languages. The run-time code is small, fast, and uses relatively little memory.
Creating the Text Source

The process starts with creating a text file. When the file is loaded, each line becomes a separate String object, so you can create a file like:

اَللّٰهُمَّ اِنِّىْ اَسْئَلُكَ رِزْقًاوَّاسِعًاطَيِّبًامِنْ رِزْقِكَ
مَرْحَبًا

This needs to be in UTF-8 format. On Windows, you can create UTF-8 files in Notepad. Make sure you use Save As... and select UTF-8 encoding.
Name the file arb.utf8.
This needs to be converted to a format that can be read easily by the MIDP application. MIDP does not provide convenient ways to read text files, like J2SE's BufferedReader. Unicode support can also be a problem when converting between bytes and characters. The easiest way to read text is to use DataInput.readUTF(). But to use this, we need to have written the text using DataOutput.writeUTF().
Below is a simple J2SE command-line program that will read the .utf8 file you saved from Notepad and create a .res file to go in the JAR.
import java.io.*;
import java.util.*;

public class TextConverter {

    public static void main(String[] args) {
        if (args.length == 1) {
            String language = args[0];
            List<String> text = new Vector<String>();
            try {
                // read text from Notepad UTF-8 file
                InputStream in = new FileInputStream(language + ".utf8");
                try {
                    BufferedReader bufin = new BufferedReader(new InputStreamReader(in, "UTF-8"));
                    String s;
                    while ((s = bufin.readLine()) != null) {
                        // remove the byte-order mark that Notepad puts at the start of the file
                        s = s.replaceAll("\ufeff", "");
                        text.add(s);
                    }
                } finally {
                    in.close();
                }
                // write it for easy reading in J2ME
                OutputStream out = new FileOutputStream(language + ".res");
                DataOutputStream dout = new DataOutputStream(out);
                try {
                    // first item is the number of strings
                    dout.writeShort(text.size());
                    // then the strings themselves
                    for (String s : text) {
                        dout.writeUTF(s);
                    }
                } finally {
                    dout.close();
                }
            } catch (Exception e) {
                System.err.println("TextConverter: " + e);
            }
        } else {
            System.err.println("syntax: TextConverter <language-code>");
        }
    }
}
To convert arb.utf8 to arb.res, run the converter as:
java TextConverter arb
Using the Text at Runtime
Place the .res file in the JAR.
In the MIDP application, the text can be read with this method:
public String[] loadText(String resName) throws IOException {
    String[] text;
    InputStream in = getClass().getResourceAsStream(resName);
    try {
        DataInputStream din = new DataInputStream(in);
        int size = din.readShort();
        text = new String[size];
        for (int i = 0; i < size; i++) {
            text[i] = din.readUTF();
        }
    } finally {
        in.close();
    }
    return text;
}
Load and use text like this:
String[] text = loadText("arb.res");
System.out.println("my arabic word from arb.res file ::"+text[0]+" second from arb.res file ::"+text[1]);
Hope this will help you. Thanks
I am trying to load the content of a text file that contains several lines of text, using a Java servlet.
When I test the servlet in a browser it works fine: the text is loaded with the newline characters intact.
But when I load it into a string in my Swing application and then call textpane.setText(text); the new lines are gone. I have tried many solutions I found on the net, but still can't get it right.
Servlet Code:
Reading text from file (simplified):
File file = new File(path);
StringBuilder data = new StringBuilder();
BufferedReader in = new BufferedReader(new FileReader(file));
String line;
while ((line = in.readLine()) != null) {
    data.append(line);
    data.append("\n");
}
in.close();
Sending text:
PrintWriter out = response.getWriter();
out.write(text);
Is it some platform issue? The servlet was written and compiled on Linux, but I run it on Windows (on JBoss). The text files are also stored on my machine.
Instead of data.append("\n"), use:
data.append(System.getProperty("line.separator"));
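Applied to the reading loop from the question, that looks like this:

String newline = System.getProperty("line.separator"); // "\r\n" on Windows
while ((line = in.readLine()) != null) {
    data.append(line);
    data.append(newline);
}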