Binary file not being read properly in Java

I am trying to read a binary file in Java using a BufferedReader. I wrote the binary file using UTF-8 encoding. The code for writing the file:
byte[] inMsgBin = null;
try {
    inMsgBin = String.valueOf(cypherText).getBytes("UTF-8");
    //System.out.println("CIPHER TEXT:FULL:BINARY WRITE: "+inMsgBin);
} catch (UnsupportedEncodingException ex) {
    Logger.getLogger(EncDecApp.class.getName()).log(Level.SEVERE, null, ex);
}
try (FileOutputStream out = new FileOutputStream(fileName
        + String.valueOf(new SimpleDateFormat("yyyyMMddhhmm").format(new Date()))
        + ".encmsg")) {
    out.write(inMsgBin);
} catch (IOException ex) {
    Logger.getLogger(EncDecApp.class.getName()).log(Level.SEVERE, null, ex);
}
System.out.println("cypherText charCount="+cypherText.length());
Here 'cypherText' is a String with some content. The total number of characters written to the file is reported as 19. Also, after writing, when I open the binary file in Notepad++, it shows some characters; selecting all the content of the file counts 19 characters in total.
Now I read the same file using a BufferedReader, with the following lines of code:
try {
    DecMessage obj2 = new DecMessage();
    StringBuilder cipherMsg = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new FileReader(filePath))) {
        String tempLine = "";
        fileSelect = true;
        while ((tempLine = in.readLine()) != null) {
            cipherMsg.append(tempLine);
        }
    }
    System.out.println("FROM FILE: charCount= " + cipherMsg.length());
Here the total number of characters read (stored in 'charCount') is 17 instead of 19.
How can I read all the characters of the file correctly?

Specify the same charset when reading the file:
try (final BufferedReader br = Files.newBufferedReader(new File(filePath).toPath(),
        StandardCharsets.UTF_8))
UPDATE
Now I understand your problem. Thanks for the file.
Again: your file is still readable by any text editor like Notepad++. Since your content includes extended and control characters, you are seeing those unreadable characters, but it is still readable as text.
Now, back to your problem. There are two problems with your code.
First, while reading the file you should specify the correct charset. Readers are character readers: bytes are converted into characters as they are read. If you specify a charset it is used; otherwise the system default charset is used. So you should create the BufferedReader as follows:
try (final BufferedReader br = Files.newBufferedReader(new File(filePath).toPath(),
        StandardCharsets.UTF_8))
Second, your data contains control characters. When reading a file line by line, BufferedReader treats the system's end-of-line characters as line terminators and skips them. That is why you are getting 17 instead of 19 (two of your characters are CRs). To avoid this issue, read character by character:
int ch;
while ((ch = br.read()) > -1) {
    buffer.append((char) ch);
}
Overall, the method below returns the proper text:
static String readCyberText() {
    StringBuilder buffer = new StringBuilder();
    try (final BufferedReader br = Files.newBufferedReader(
            new File("C:\\projects\\test2201404221017.txt").toPath(),
            StandardCharsets.UTF_8)) {
        int ch;
        while ((ch = br.read()) > -1) {
            buffer.append((char) ch);
        }
        return buffer.toString();
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}
And you can test it with:
String s = readCyberText();
System.out.println(s.length());
System.out.println(s);
and the output is:
19
ia#
m©Ù6ë<«9K()il
Note: the length of the String is 19, yet only 17 characters are visible when it is displayed, because the console treats the control characters as line terminators and renders them as line breaks. The String itself contains all 19 characters properly.
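If you want to verify this (a quick sketch, not part of the original answer), print each character's code point instead of the raw character, so the control characters become visible:
String s = readCyberText();
for (int i = 0; i < s.length(); i++) {
    // Control characters like CR show up as U+000D instead of breaking the line
    System.out.printf("%2d: U+%04X%n", i, (int) s.charAt(i));
}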

Related

Java reading a file through Scanner and it's appending the characters "" to the first line [duplicate]

This question already has answers here:
Java read file got a leading BOM []
(7 answers)
Closed 9 years ago.
If I run this code, the first thing printed is "" and then the other lines:
try {
    BufferedReader br = new BufferedReader(new FileReader(
            "myFile.txt"));
    String line;
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
    br.close();
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
How can I avoid it?
You are getting the characters "" on the first line because this sequence is the UTF-8 byte order mark (BOM). If a text file begins with a BOM, it was likely generated by a Windows program like Notepad.
To solve your problem, read the file explicitly as UTF-8, instead of whatever the default system character encoding happens to be (US-ASCII, etc.):
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("myFile.txt"),
                "UTF-8"));
Then, in UTF-8, the byte sequence "" decodes to one character, U+FEFF. This character is optional; a legal UTF-8 file may or may not begin with it. So we skip the first character only if it is U+FEFF:
in.mark(1);
if (in.read() != 0xFEFF)
    in.reset();
And now you can continue with the rest of your code.
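Putting the pieces together, here is a minimal sketch of a complete BOM-skipping reader (the class and method names are placeholders, not from the original answer):
import java.io.*;

class BomAwareReader {
    // Reads a UTF-8 text file, skipping a leading U+FEFF BOM if present.
    static String read(String path) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), "UTF-8"))) {
            in.mark(1);
            if (in.read() != 0xFEFF) {
                in.reset(); // no BOM: put the first character back
            }
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }
}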
The problem could be the encoding used. Try this:
BufferedReader in = new BufferedReader(new InputStreamReader(
        new FileInputStream("yourfile"), "UTF-8"));

When parsing a text file with a Scanner why am I getting null characters between each expected character?

I am creating a registry snapshot with the command:
Runtime.getRuntime().exec("REG EXPORT HKLM " + pathVariable + "\\HKLM.txt /y");
I am then parsing through this file, trying to group the registry entries into single Strings, as they are often broken up over multiple lines. When I use this bit of code, I always get the NUL character as every even character.
String line, concatLine;
Scanner scanner;
try {
    scanner = new Scanner(myFile);
    line = null;
    concatLine = "";
    while (scanner.hasNextLine()) {
        line = scanner.nextLine();
        if (line != null && !(line.isEmpty())) {
            concatLine += line;
        } else if (!(concatLine.equals(""))) {
            System.out.println(concatLine);
            concatLine = "";
        }
    }
} catch (IOException e) { // Catch I/O exceptions
    System.err.println(e);
}
I am looking at the file before scanning it in NP++ and there are no "NUL" characters, but if I write these concatenated lines to a file the entire file has them between each expected character.
In my search to understand the problem I came across Java reading and writing practices, which is definitely worth looking over. Apart from that, it seems the early comments were correct: if the file is opened as a UTF-16 stream, and written as such, the output comes out without the null characters. By the way, you will also need to deal with escaped newlines in the registry dump; if you don't, you will end up with things like "00,00,\ 00," where you should have "00,00,00," (a sketch of that step appears after the example output below).
Here is an example:
import java.io.*;
import java.util.*;
import static java.lang.System.out;

public class ReadReg {
    public static void main(String[] argv) {
        String line = null;
        StringBuilder sb = new StringBuilder();
        Scanner scanner;
        FileOutputStream fos;
        BufferedOutputStream bos;
        OutputStreamWriter fosw;
        try {
            scanner = new Scanner(new File("hklm-hw.txt"), "UTF-16");
            fos = new FileOutputStream("hklm-hw.cat.txt");
            bos = new BufferedOutputStream(fos);
            fosw = new OutputStreamWriter(bos, "UTF-16");
            while (scanner.hasNextLine()) {
                sb.append(line = scanner.nextLine());
                if (line.isEmpty()) {
                    sb.append("\n");
                }
            }
            if (null != scanner.ioException()) {
                out.format("scanner ioe:\n\t%s\n", scanner.ioException().getMessage());
                //scanner.ioException().printStackTrace();
            }
            fosw.write(sb.toString(), 0, sb.length());
            fosw.flush();
            fosw.close();
            scanner.close();
        } catch (IOException io) {
            io.printStackTrace();
        }
    }
}
Output:
$ javac ReadReg.java && java ReadReg ; file *
hklm-hw.cat.txt: Big-endian UTF-16 Unicode text, with very long lines
hklm-hw.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators
ReadReg.class: compiled Java class data, version 50.0 (Java 1.6)
ReadReg.java: C source, ASCII text
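The example above still leaves the escaped newlines mentioned earlier in place. A minimal sketch of that extra step (the helper name is a placeholder; it assumes REG EXPORT marks continuations with a trailing backslash followed by an indented line, as in the "00,00,\" example):
// Joins lines that end with a backslash continuation, so
// "00,00,\" followed by "  00," becomes "00,00,00,".
static String joinContinuedLines(String dump) {
    StringBuilder out = new StringBuilder();
    boolean continuing = false;
    for (String line : dump.split("\r?\n")) {
        String piece = continuing ? line.trim() : line;
        if (piece.endsWith("\\")) {
            out.append(piece, 0, piece.length() - 1); // drop the marker, keep appending
            continuing = true;
        } else {
            out.append(piece).append('\n');
            continuing = false;
        }
    }
    return out.toString();
}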

java file reading issue

In my Java application, I have to read a file. The problem I am facing is that after reading the file, the result is in a non-readable format: some ASCII characters are displayed, and none of the letters are readable. How can I make it display properly?
// Open the file that is the first
// command line parameter
try {
    FileInputStream fstream = new FileInputStream("c:\\hello.txt");
    // Get the object of DataInputStream
    DataInputStream in = new DataInputStream(fstream);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String strLine;
    // Read file line by line
    while ((strLine = br.readLine()) != null) {
        // Print the content on the console
        System.out.println(strLine);
    }
    // Close the input stream
    in.close();
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
}
Perhaps you have an encoding error. The constructor you are using for an InputStreamReader uses the default character encoding; if your file contains UTF-8 text outside the ASCII range, you will get garbage. Also, you don't need a DataInputStream, since you aren't reading any data objects from the stream. Try this code:
FileInputStream fstream = null;
try {
    fstream = new FileInputStream("c:\\hello.txt");
    // Decode data using UTF-8
    BufferedReader br = new BufferedReader(new InputStreamReader(fstream, "UTF-8"));
    String strLine;
    // Read file line by line
    while ((strLine = br.readLine()) != null) {
        // Print the content on the console
        System.out.println(strLine);
    }
} catch (Exception e) { // Catch exception if any
    System.err.println("Error: " + e.getMessage());
} finally {
    if (fstream != null) {
        try {
            fstream.close();
        } catch (IOException e) {
            // log failure to close file
        }
    }
}
The output you are getting is an ASCII value, so you need to cast it to a char or String before printing it. Hope this helps.
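If this answer means the result of BufferedReader.read() (a guess; the question's code uses readLine), the cast would look like this, assuming br is the reader from the question:
int value = br.read();     // the character as an int, or -1 at end of stream
if (value != -1) {
    char c = (char) value; // cast before printing to see the letter
    System.out.println(c); // prints e.g. 'H', not 72
}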
You have to handle it this way:
BufferedReader br = new BufferedReader(new InputStreamReader(in, encodingformat));
Here encodingformat is a charset name; change it according to the type of encoding issue you encounter. Examples: UTF-8, UTF-16, and so on.
Refer to Supported Encodings for Java SE 6 for more info.
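If you are unsure which names your JVM accepts (a side note, not from the original answer), you can list them:
import java.nio.charset.Charset;

public class ListCharsets {
    public static void main(String[] args) {
        // Print every charset name this JVM supports; any of these is a
        // valid value for the encoding argument above.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
    }
}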
My problem got solved, though I don't know how. I copied the hello.txt contents to another file, ran the Java program, and could read all the letters. I don't know what the problem was.
Since you don't know what encoding the file is in, use jchardet to detect the encoding used by the file and then use that encoding to read it, as others have already suggested. This is not 100% foolproof, but it works for your scenario.
Also, the use of DataInputStream is unnecessary.

Java BufferedReader arabic text file problem

Problem: Arabic words in my text files, when read by Java, show as a series of question marks: ??????
Here is the code:
File[] fileList = mainFolder.listFiles();
BufferedReader bufferReader = null;
Reader reader = null;
try {
    for (File f : fileList) {
        reader = new InputStreamReader(new FileInputStream(f.getPath()), "UTF8");
        bufferReader = new BufferedReader(reader);
        String line = null;
        while ((line = bufferReader.readLine()) != null) {
            System.out.println(new String(line.getBytes(), "UTF-8"));
        }
    }
} catch (Exception exc) {
    exc.printStackTrace();
} finally {
    // Close the BufferedReader
    try {
        if (bufferReader != null)
            bufferReader.close();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
As you can see, I have specified the UTF-8 encoding in different places and I still get question marks. Do you have any idea how I can fix this?
Thanks
Instead of trying to print out the line directly, print out the Unicode values of each character. For example:
char[] chars = line.toCharArray();
for (int i = 0; i < chars.length; i++) {
    System.out.println(i + ": " + chars[i] + " - " + (int) chars[i]);
}
Then look up the relevant characters in the Unicode code charts.
If you find it's printing 63, then those really are question marks... which would suggest that your text file isn't truly UTF-8 to start with.
If, on the other hand for some characters it's printing out "?" but then a value other than 63, then that would suggest it's a console display issue and you're reading the data correctly.
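One way to settle which case you are in (a sketch, not from the original answer; the file name is a placeholder) is to dump the raw bytes before any decoding happens:
import java.io.*;

public class DumpBytes {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("arabic.txt")) {
            int b, i = 0;
            while ((b = in.read()) != -1) {
                // Arabic letters in UTF-8 start with lead bytes 0xD8/0xD9;
                // a run of 0x3F bytes means the '?' marks are in the file itself.
                System.out.printf("%4d: 0x%02X%n", i++, b);
            }
        }
    }
}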
Replace
System.out.println(new String(line.getBytes(), "UTF-8"));
by
System.out.println(line);
String#getBytes() without a charset argument uses the platform default encoding to get the bytes from the string, and that may not be UTF-8. You're already reading the bytes as UTF-8 through the InputStreamReader, so you don't need to massage them back and forth afterwards.
Further, ensure that your display console (where you're printing those lines) supports UTF-8. In Eclipse, for example, you can set that via Window > Preferences > General > Workspace > Text File Encoding > Other > UTF-8.
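To see why the extra round-trip corrupts the text, here is a minimal sketch (not from the original answer) that simulates a platform whose default charset is windows-1252:
public class GetBytesGarbling {
    public static void main(String[] args) throws Exception {
        String line = "سلام"; // Arabic text that was read correctly
        byte[] bytes = line.getBytes("windows-1252"); // unmappable chars become '?' (0x3F)
        String garbled = new String(bytes, "UTF-8");  // the '?' bytes decode back as '?'
        System.out.println(garbled); // prints ????
    }
}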
See also:
Unicode - How to get the characters right?
