Problem: Arabic words in my text files, when read by Java, show up as a series of question marks: ??????
Here is the code:
File[] fileList = mainFolder.listFiles();
BufferedReader bufferReader = null;
Reader reader = null;
try {
    for (File f : fileList) {
        reader = new InputStreamReader(new FileInputStream(f.getPath()), "UTF8");
        bufferReader = new BufferedReader(reader);
        String line = null;
        while ((line = bufferReader.readLine()) != null) {
            System.out.println(new String(line.getBytes(), "UTF-8"));
        }
    }
}
catch (Exception exc) {
    exc.printStackTrace();
}
finally {
    // Close the BufferedReader
    try {
        if (bufferReader != null)
            bufferReader.close();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
As you can see, I have specified the UTF-8 encoding in several places and I still get question marks. Do you have any idea how I can fix this?
Thanks
Instead of trying to print out the line directly, print out the Unicode values of each character. For example:
char[] chars = line.toCharArray();
for (int i = 0; i < chars.length; i++)
{
    System.out.println(i + ": " + chars[i] + " - " + (int) chars[i]);
}
Then look up the relevant characters in the Unicode code charts.
If you find it's printing 63, then those really are question marks... which would suggest that your text file isn't truly UTF-8 to start with.
If, on the other hand for some characters it's printing out "?" but then a value other than 63, then that would suggest it's a console display issue and you're reading the data correctly.
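Before deciding either way, it can also help to look at the raw bytes on disk. Here is a small diagnostic sketch (the file name is just a placeholder) that dumps the first few bytes in hex; genuine UTF-8 Arabic text uses lead bytes in the 0xD8-0xDB range, whereas a file of literal question marks shows 0x3F everywhere:
// Dump the first bytes of the file in hexadecimal to see what is really stored.
try (FileInputStream fis = new FileInputStream("arabic.txt")) {
    byte[] head = new byte[32];
    int n = fis.read(head);
    for (int i = 0; i < n; i++) {
        System.out.printf("%02X ", head[i] & 0xFF);
    }
    System.out.println();
}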
Replace
System.out.println(new String(line.getBytes(), "UTF-8"));
by
System.out.println(line);
String#getBytes() without a charset argument uses the platform default encoding to get the bytes from the string, which is not necessarily UTF-8. You're already reading the bytes as UTF-8 through the InputStreamReader, so there's no need to massage the string back and forth afterwards.
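To see why that round trip is lossy, here is a small sketch (the literal string is only an example) of what happens on a machine whose default charset is, say, windows-1252:
String line = "\u0633\u0644\u0627\u0645";           // the Arabic word "سلام"
byte[] bytes = line.getBytes();                      // platform default charset, not necessarily UTF-8
String mangled = new String(bytes, StandardCharsets.UTF_8);
System.out.println(mangled);                         // on a windows-1252 default this prints "????"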
Further, ensure that your display console (where you're reading those lines) supports UTF-8. In Eclipse, for example, you can do that via Window > Preferences > General > Workspace > Text File Encoding > Other > UTF-8.
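If you are running from a plain console instead of Eclipse, a minimal sketch (assuming the terminal's font can actually display Arabic) that forces System.out to encode in UTF-8:
// Re-wrap standard output with an explicit UTF-8 encoder; whether the characters render
// still depends on the terminal and its font (on Windows, "chcp 65001" may also be needed).
System.setOut(new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8"));
System.out.println("\u0645\u0631\u062D\u0628\u0627"); // "مرحبا"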
See also:
Unicode - How to get the characters right?
Related
I am currently trying to read multiple files (UTF-8) within a directory and store each element in that text file into an array.
I am able to get the text to print to the console, however it shows some funny characters I can't seem to get rid of (see image; what it should look like is displayed on the right).
Currently, I have a method that builds an array with all file names in that directory then using a for loop I send each of these file names to a read method which puts it into a string.
The below method writes these file names to an array.
public static ArrayList<String> readModelFilesInModelDir() {
    File folder = new File("Models/");
    File[] listOfFiles = folder.listFiles();
    String random = "";
    assert listOfFiles != null;
    ArrayList<String> listOfModelFiles = new ArrayList<>();
    for (int i = 0; i < listOfFiles.length; i++) {
        if (listOfFiles[i].isFile()) {
            //System.out.println("File " + listOfFiles[i].getName());
            listOfModelFiles.add(listOfFiles[i].getName());
        } else if (listOfFiles[i].isDirectory()) {
            System.out.println("Directory " + listOfFiles[i].getName());
        }
    }
    System.out.println(listOfModelFiles);
    return listOfModelFiles;
}
The below for loop then sends these file names to the read method.
ArrayList<String> modelFiles = readModelFilesInModelDir();
for (int i = 0; i < modelFiles.size(); i++) {
    String thisString = readModelFileIntoArray(modelFiles.get(i));
    System.out.println(thisString);
}
The below method then reads the string into an array, which is outputting what the images show.
public static String readModelFileIntoArray(String modelFilePath) {
    StringBuilder fileHasBeenRead = new StringBuilder();
    try {
        Reader reader = new InputStreamReader(new FileInputStream("Models/" + modelFilePath), StandardCharsets.UTF_8);
        String s;
        BufferedReader bufferedReader = new BufferedReader(reader);
        while ((s = bufferedReader.readLine()) != null) {
            fileHasBeenRead.append(s + "\n");
        }
        reader.close();
    } catch (Exception e) {
        System.out.print(e);
    }
    return fileHasBeenRead.toString().trim();
}
Finally, how would I fix this output issue, and also store each of these files that have been read into a separate array that I can use elsewhere? Thanks!
I agree with Johnny Mopp: your file is encoded in UTF-16, not in UTF-8. The two �� at the beginning of your output look like a byte order mark (BOM). In UTF-16, each character is encoded as two bytes. Since your text only contains characters in the ASCII range, the first byte of each pair is always 0x00. This is why you're seeing all these ▯: they correspond to the non-printable character 0x00. I would even say that, since the two characters following �� are ▯ and a in that order, your file is using big-endian UTF-16.
Instead of UTF-8, use StandardCharsets.UTF_16. It will also take the BOM into account and use the appropriate endianness.
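A minimal sketch of that change in the reading method, assuming the file really is UTF-16 with a BOM as described above:
// UTF_16 honours the byte order mark and picks big- or little-endian accordingly.
Reader reader = new InputStreamReader(new FileInputStream("Models/" + modelFilePath), StandardCharsets.UTF_16);
BufferedReader bufferedReader = new BufferedReader(reader);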
It's much easier (and usually better) to use existing libraries for common stuff. There is FileUtils from Apache commons-io, which provides this functionality out of the box, reducing your file-reading code to a one-liner:
String thisString = FileUtils.readFileToString(new File("Models/" + modelFilePath), StandardCharsets.UTF_8);
... or whatever charset your file is using...
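If you also want to keep each file's contents around for use elsewhere, a short sketch (assuming commons-io is on the classpath and the same Models/ layout) that collects them into a list:
// Read every model file once and store its full text, one list entry per file.
ArrayList<String> modelContents = new ArrayList<>();
for (String name : readModelFilesInModelDir()) {
    try {
        modelContents.add(FileUtils.readFileToString(new File("Models/" + name), StandardCharsets.UTF_8));
    } catch (IOException e) {
        e.printStackTrace();
    }
}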
I have a questions file that I'd like to read. While reading it, I want to identify the questions (as opposed to the answers) and print them; before each question there is a line of "#" characters. The code keeps skipping question one for some reason. What am I missing here?
Here is the code:
try {
    // Open the file that is the first
    // command line parameter
    FileInputStream fstream = new FileInputStream(path);
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
    String strLine;
    strLine = br.readLine();
    System.out.println(strLine);
    // Read File Line By Line
    while (strLine != null) {
        strLine = strLine.trim();
        if ((strLine.length() != 0) && (strLine.charAt(0) == '#' && strLine.charAt(1) == '#')) {
            strLine = br.readLine();
            System.out.println(strLine);
            //questions[q] = strLine;
        }
        strLine = br.readLine();
    }
    // Close the input stream
    fstream.close();
    // System.out.println(questions[0]);
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
}
I suspect that the file you are reading is UTF-8 with a BOM.
The BOM is a marker placed before the first character that helps identify the proper encoding of text files.
The issue with a BOM is that it is invisible and disturbs the reading; a text file with a BOM is arguably no longer a plain text file. In particular, when you read the first line, its first character is no longer a #, but something different, because it is BOM+#.
Try to load the file with the encoding specified explicitly. Java can handle the BOM in newer releases; I don't remember exactly which.
BufferedReader br = new BufferedReader(new InputStreamReader(fstream, "UTF-8"));
Otherwise, open the file in a decent text editor like Notepad++ and change the encoding to UTF-8 without BOM, or to ANSI encoding (yuck).
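If the BOM does turn out to be the culprit and you cannot change the file, a minimal sketch (assuming UTF-8 with a BOM) that skips it before the normal line loop:
BufferedReader br = new BufferedReader(new InputStreamReader(fstream, "UTF-8"));
br.mark(1);
if (br.read() != '\uFEFF') {
    br.reset(); // no BOM present, rewind so the first real character is not lost
}
// ...then continue with br.readLine() as before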
Notice that whether or not you enter the if statement inside the while loop, you then do strLine = br.readLine();, which overwrites the line you read when you initialized strLine.
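A minimal sketch of a loop (assuming the "##" separator format described in the question) that reads each line exactly once per iteration, so the first question is no longer skipped:
try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path), "UTF-8"))) {
    String strLine;
    boolean nextIsQuestion = false;
    while ((strLine = br.readLine()) != null) {
        strLine = strLine.trim();
        if (strLine.startsWith("##")) {
            nextIsQuestion = true;           // the next non-empty line is a question
        } else if (nextIsQuestion && !strLine.isEmpty()) {
            System.out.println(strLine);     // print (or store) the question
            nextIsQuestion = false;
        }
    }
} catch (IOException e) {
    System.err.println("Error: " + e.getMessage());
}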
I am trying to read a binary file in Java using a BufferedReader. I wrote that binary file using "UTF-8" encoding. Here is the code for writing into the binary file:
byte[] inMsgBin = null;
try {
    inMsgBin = String.valueOf(cypherText).getBytes("UTF-8");
    //System.out.println("CIPHER TEXT:FULL:BINARY WRITE: "+inMsgBin);
} catch (UnsupportedEncodingException ex) {
    Logger.getLogger(EncDecApp.class.getName()).log(Level.SEVERE, null, ex);
}
try (FileOutputStream out = new FileOutputStream(fileName + String.valueOf(new SimpleDateFormat("yyyyMMddhhmm").format(new Date())) + ".encmsg")) {
    out.write(inMsgBin);
    out.close();
} catch (IOException ex) {
    Logger.getLogger(EncDecApp.class.getName()).log(Level.SEVERE, null, ex);
}
System.out.println("cypherText charCount="+cypherText.length());
Here 'cypherText' is a String with some content. The total number of characters written to the file is reported as 19. Also, after writing, when I open the binary file in Notepad++ it shows some characters, and selecting all the content of the file counts 19 characters in total.
Now when I read the same file using BufferedReader, using the following lines of code:
try
{
    DecMessage obj2 = new DecMessage();
    StringBuilder cipherMsg = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new FileReader(filePath))) {
        String tempLine = "";
        fileSelect = true;
        while ((tempLine = in.readLine()) != null) {
            cipherMsg.append(tempLine);
        }
    }
    System.out.println("FROM FILE: charCount= " + cipherMsg.length());
Here the total number of characters read (stored in 'charCount') is 17 instead of 19.
How can I read all the characters of the file correctly?
Specify the same charset while reading the file:
try (final BufferedReader br = Files.newBufferedReader(new File(filePath).toPath(),
StandardCharsets.UTF_8))
UPDATE
Now I see your problem. Thanks for the file.
Again: your file is still readable by any text editor like Notepad++. (Since your content includes extended and control characters you are seeing those non-readable characters, but it is still readable as text.)
Now back to your problem: you have two problems in your code.
First, while reading the file you should specify the correct charset. Readers are character readers: bytes are converted into characters while reading. If you specify the charset it will be used; otherwise the system's default charset is used. So you should create the BufferedReader as follows:
try (final BufferedReader br = Files.newBufferedReader(new File(filePath).toPath(),
StandardCharsets.UTF_8))
The second issue: your content includes control characters. When reading the file line by line, BufferedReader's readLine() treats CR and LF as line terminators and drops them; that's why you are getting 17 instead of 19 (two of your characters are CRs). To avoid this, read character by character:
int ch;
while ((ch = br.read()) > -1) {
    buffer.append((char) ch);
}
Overall, the method below returns the proper text:
static String readCyberText() {
    StringBuilder buffer = new StringBuilder();
    try (final BufferedReader br = Files.newBufferedReader(new File("C:\\projects\\test2201404221017.txt").toPath(),
            StandardCharsets.UTF_8)) {
        int ch;
        while ((ch = br.read()) > -1) {
            buffer.append((char) ch);
        }
        return buffer.toString();
    }
    catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
And you can test it with:
String s = readCyberText();
System.out.println(s.length());
System.out.println(s);
and the output is:
19
ia#
m©Ù6ë<«9K()il
Note: the length of the String is 19, but only 17 characters appear on that line, because the console treats the control characters as line breaks and displays the rest on a different line. The String itself contains all 19 characters properly.
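As an aside, if the goal is simply to get back exactly what was written, here is a sketch (reusing the same file path, and java.nio.file.Files/Paths) that avoids line-based reading altogether by reading the raw bytes and decoding them explicitly:
// Read the raw bytes and decode them with the same charset used when writing.
byte[] raw = Files.readAllBytes(Paths.get("C:\\projects\\test2201404221017.txt"));
String cipherMsg = new String(raw, StandardCharsets.UTF_8);
System.out.println("FROM FILE: charCount= " + cipherMsg.length()); // 19 for the file described above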
I am creating a registry snapshot with the command:
Runtime.getRuntime().exec("REG EXPORT HKLM " + pathVariable + "\\HKLM.txt /y");
I am then parsing through this file trying to group the registry entries into a single String as they are often broken up over multiple lines. When I use this bit of code I am always getting the "NUL" character for every even character.
String line, concatLine;
Scanner scanner;
try {
    scanner = new Scanner(myFile);
    line = null;
    concatLine = "";
    while (scanner.hasNextLine()) {
        line = scanner.nextLine();
        if (line != null && !(line.isEmpty())) {
            concatLine += line;
        }
        else if (!(concatLine.equals(""))) {
            System.out.println(concatLine);
            concatLine = "";
        }
    }
} catch (IOException e) { // Catch I/O Exceptions
    System.err.println(e);
}
I am looking at the file before scanning it in NP++ and there are no "NUL" characters, but if I write these concatenated lines to a file the entire file has them between each expected character.
In my search to understand the problem I came across Java reading and writing practices, which is definitely worth looking over. Apart from that, it seems the early comments were correct: if the file is opened as a UTF-16 stream, and written as such, then the output comes out without the null characters. By the way, you will also need to deal with escaped newlines in the registry dump, because if you don't you will end up with things like "00,00,\ 00," where you should have "00,00,00,".
Here is an example:
import java.io.*;
import java.util.*;
import static java.lang.System.out;

public class ReadReg {
    public static void main(String[] argv) {
        String line = null;
        StringBuilder sb = new StringBuilder();
        Scanner scanner;
        FileOutputStream fos;
        BufferedOutputStream bos;
        OutputStreamWriter fosw;
        try {
            scanner = new Scanner(new File("hklm-hw.txt"), "UTF-16");
            fos = new FileOutputStream("hklm-hw.cat.txt");
            bos = new BufferedOutputStream(fos);
            fosw = new OutputStreamWriter(bos, "UTF-16");
            while (scanner.hasNextLine()) {
                sb.append(line = scanner.nextLine());
                if (line.isEmpty()) {
                    sb.append("\n");
                }
            }
            if (null != scanner.ioException()) {
                out.format("scanner ioe:\n\t%s\n", scanner.ioException().getMessage());
                //scanner.ioException().printStackTrace();
            }
            fosw.write(sb.toString(), 0, sb.length());
            fosw.flush();
            fosw.close();
            scanner.close();
        } catch (IOException io) {
            io.printStackTrace();
        }
    }
}
Output:
$ javac ReadReg.java && java ReadReg ; file *
hklm-hw.cat.txt: Big-endian UTF-16 Unicode text, with very long lines
hklm-hw.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
ReadReg.class: compiled Java class data, version 50.0 (Java 1.6)
ReadReg.java: C source, ASCII text
In my Java application, I have to read a file. The problem I am facing is that after reading the file, the result is in a non-readable format: some garbled ASCII characters are displayed and none of the letters are readable. How can I get it to display properly?
try {
    // Open the file that is the first
    // command line parameter
    FileInputStream fstream = new FileInputStream("c:\\hello.txt");
    // Get the object of DataInputStream
    DataInputStream in = new DataInputStream(fstream);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String strLine;
    // Read File Line By Line
    while ((strLine = br.readLine()) != null) {
        // Print the content on the console
        System.out.println(strLine);
    }
    // Close the input stream
    in.close();
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
}
Perhaps you have an encoding error. The constructor you are using for an InputStreamReader uses the default character encoding; if your file contains UTF-8 text outside the ASCII range, you will get garbage. Also, you don't need a DataInputStream, since you aren't reading any data objects from the stream. Try this code:
FileInputStream fstream = null;
try {
    fstream = new FileInputStream("c:\\hello.txt");
    // Decode data using UTF-8
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream, "UTF-8"));
    String strLine;
    // Read File Line By Line
    while ((strLine = br.readLine()) != null) {
        // Print the content on the console
        System.out.println(strLine);
    }
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
} finally {
    if (fstream != null) {
        try {
            fstream.close();
        } catch (IOException e) {
            // log failure to close file
        }
    }
}
The output you are getting is an ASCII value, so you need to cast it to a char or String before printing it. Hope this helps.
You have to handle it this way:
BufferedReader br = new BufferedReader(new InputStreamReader(in, encodingformat));
encodingformat - change it according to the type of encoding issue you encounter.
Examples: UTF-8, UTF-16, and so on.
Refer to the Supported Encodings list for Java SE 6 for more info.
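As a quick sanity check, a small sketch (the encoding name here is only an example) that verifies the name is supported before wiring it into the reader:
String encodingformat = "UTF-8"; // or "UTF-16", "windows-1256", ...
if (Charset.isSupported(encodingformat)) {
    BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream("c:\\hello.txt"), Charset.forName(encodingformat)));
} else {
    System.err.println("Unsupported encoding: " + encodingformat);
}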
My problem got solved, though I don't know how. I copied the hello.txt contents to another file and ran the Java program, and I could read all the letters. I don't know what the problem was with the original file.
Since you don't know which encoding the file is in, use jchardet to detect the encoding used by the file and then use that encoding to read the file, as others have already suggested. This is not 100% foolproof, but it works for your scenario.
Also, use of DataInputStream is unnecessary.