IO BufferedReader vs NIO Files.newBufferedReader: CharsetDecoder leniency on MalformedInput - java

I have a text file that contains an invalid "UTF-8" character, and this causes my app to throw MalformedInputException. I use Files.newBufferedReader to create the BufferedReader.
Path path = FileSystems.getDefault().getPath(inputDirectory, fileName);
BufferedReader br = Files.newBufferedReader(path, Charset.defaultCharset());
And this seems to be strict about the character encoding. I did some digging and found online that we can improve the leniency by overriding the CharsetDecoder's default action with .onMalformedInput(CodingErrorAction.REPLACE). This seems to fix the issue.
Then, out of curiosity, I used the java.io BufferedReader to read the same file.
fr = new FileReader(file);
br = new BufferedReader(fr);
This seems to have no issue with the invalid character and reads the file without any problem.
So I looked at the code of both Files.newBufferedReader and new BufferedReader(fr). This is how they are both implemented:
Files.newBufferedReader:
public static BufferedReader newBufferedReader(Path path, Charset cs)
    throws IOException
{
    // onMalformedInput is not overridden, thus strict decoding
    CharsetDecoder decoder = cs.newDecoder();
    // Look at how the InputStreamReader is created: the decoder is passed in
    Reader reader = new InputStreamReader(newInputStream(path), decoder);
    return new BufferedReader(reader);
}
java.io BufferedReader:
// Creating the FileReader
FileReader fr = new FileReader(file);
--------------------------------------------------------------------
// FileReader constructor
public FileReader(File file) throws FileNotFoundException {
    // Calls its super constructor, InputStreamReader
    super(new FileInputStream(file));
}
-----------------------------------------------------------
// InputStreamReader constructor
public InputStreamReader(InputStream in) {
    super(in);
    try {
        // This is the part I don't understand
        sd = StreamDecoder.forInputStreamReader(in, this, (String)null); // ## check lock object
    } catch (UnsupportedEncodingException e) {
        throw new Error(e);
    }
}
As you can see, they both use StreamDecoder.forInputStreamReader. I know why Files.newBufferedReader has a strict decoder. But I am trying to understand where, in the java.io BufferedReader, it's defined to do lenient decoding.
Would really appreciate it if someone could help me understand this.

The lenient decoding is actually done by FileReader. I can't find any part of the documentation that specifies this, but digging into its code, it uses onMalformedInput(CodingErrorAction.REPLACE) too. I'm not sure it can be trusted to behave the same way in all JDK implementations, though.
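For comparison, here is how the strict nio reader can be made lenient explicitly. This is a self-contained sketch that decodes an in-memory byte array instead of a file (0xFF is never a valid byte in UTF-8, so a strict decoder would throw MalformedInputException on it):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientReaderDemo {
    public static void main(String[] args) throws Exception {
        byte[] bytes = {'a', (byte) 0xFF, 'b'}; // 0xFF is malformed in UTF-8

        // The decoder configuration Files.newBufferedReader skips:
        // replace bad input instead of reporting it
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), decoder));
        System.out.println(br.readLine()); // the bad byte decodes to U+FFFD: a\uFFFDb
    }
}
```

With a real file you would pass the same decoder to an InputStreamReader wrapped around Files.newInputStream(path).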


Manipulating BufferedReader before it is read by OpenCSV's CSVReaderBuilder results in CSVReaderBuilder = null

I am reading a CSV file with OpenCSV's CSVReaderBuilder, which doesn't work because the CSV file (which for some weird reason I cannot change) has some lines with a missing column.
So I thought it would be a good idea to manipulate the BufferedReader I use as input for the CSVReaderBuilder and add an extra column before it is read, but unfortunately the CSVReaderBuilder will always return null.
This code results in an com.opencsv.exceptions.CsvRequiredFieldEmptyException as the lines have different number of columns, but works with a proper CSV file:
FileInputStream is;
try {
    is = new FileInputStream(fileName);
    InputStreamReader isr = new InputStreamReader(is, charSet);
    BufferedReader buffReader = new BufferedReader(isr);
    // use own CSVParser to set separator
    final CSVParser parser = new CSVParserBuilder()
            .withSeparator(separator)
            .build();
    // use own CSVReader to make use of own CSVParser
    reader = new CSVReaderBuilder(buffReader)
            .withCSVParser(parser)
            .build();
} catch (FileNotFoundException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
So I added code to manipulate the BufferedReader and append an extra semicolon if the column count is 13 instead of 14, but this results in reader being null.
FileInputStream is;
try {
    is = new FileInputStream(fileName);
    InputStreamReader isr = new InputStreamReader(is, charSet);
    BufferedReader buffReader = new BufferedReader(isr);
    buffReader.lines().forEach(t -> {
        String[] a = t.split(";");
        int occurrence = a.length;
        if (occurrence == 13) {
            t = t.concat(";");
        }
    });
    // use own CSVParser to set separator
    final CSVParser parser = new CSVParserBuilder()
            .withSeparator(separator)
            .build();
    // use own CSVReader to make use of own CSVParser
    reader = new CSVReaderBuilder(buffReader)
            .withCSVParser(parser)
            .build();
} catch (FileNotFoundException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Does anyone have an idea what I'm doing wrong here?
There are a couple of problems here:
First, by the time buffReader is used in new CSVReaderBuilder(buffReader), it has already been fully consumed by buffReader.lines().forEach. A BufferedReader can only be read once, in general. A solution could ordinarily be to create a new InputStreamReader and BufferedReader on the same file, except in this case, you'll run into the second problem.
The line t = t.concat(";"); does not work the way you expect. All this does is reassign the local variable t, which isn't used again. It does not change the contents of the file or the contents of the reader.
How to fix this is less straightforward. As far as I know, this exception will only be thrown when binding the CSV data to a bean, and only if fields are marked as required = true. Given that the source data does not always contain data for the last field, it seems like it should not be marked as required.
If manipulating the source data really is your only option, I can think of a few possible approaches:
Write the modified data back to a temporary file and then read that file with the CSV parser.
If the CSV file is small enough to fit into memory, you could write the modified data to a StringWriter, and then construct a StringReader with the result, and parse that.
Do the file content rewriting and CSV parsing in separate threads, using PipedOutputStream and PipedInputStream to connect them.
Write a custom implementation of FilterReader that transforms the file contents as they are read (not the most straightforward to implement).
Details of implementing these approaches would be too long and broad for this answer, so I would suggest creating follow-up questions if needed.
There might be additional options specific to the OpenCSV library that I'm not aware of.
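To make the second approach (StringWriter/StringReader) concrete, here is a minimal sketch, assuming the file fits in memory. PadColumnsDemo and padLines are made-up names, and the sample Stream stands in for Files.lines(path) on the real file:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PadColumnsDemo {

    // Pad every 13-column line to 14 columns. Splitting with limit -1 counts
    // trailing empty columns too, which plain split(";") silently drops.
    public static String padLines(Stream<String> lines) {
        return lines
                .map(t -> t.split(";", -1).length == 13 ? t + ";" : t)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String fixed = padLines(Stream.of("a;b;c", "d;e"));
        // Hand the repaired text to OpenCSV instead of the original file:
        // new CSVReaderBuilder(new BufferedReader(new StringReader(fixed)))...
        BufferedReader buffReader = new BufferedReader(new StringReader(fixed));
        buffReader.lines().forEach(System.out::println);
    }
}
```

Unlike the forEach in the question, the map here produces a new, modified stream of lines, so the change is actually visible to the reader built from the result.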

Java printing stray character instead of actual

I have a simple java program that reads a line from file and writes it to another file.
My source file has words like it's, but the destination file ends up with words like it�s.
I am using BufferedReader br = new BufferedReader(new FileReader(inputFile)); to read the source file and PrintWriter writer = new PrintWriter(resultFile, "UTF-8"); to write the destination file.
How to get the actual character in my destination file too?
You need to specify a character set when creating the reader, otherwise the platform default encoding is used. BufferedReader itself has no charset parameter, so wrap a FileInputStream in an InputStreamReader:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile), "UTF-8"));
I know this question is a bit old, but I thought I'd put down my answer anyway.
You can use java.nio.file.Files to read the file and java.io.RandomAccessFile to write to your destination file. For example:
public void copyContentsOfFile(File source, File destination) {
    Path p = Paths.get(source.toURI());
    try {
        byte[] bytes = Files.readAllBytes(p);
        RandomAccessFile raf = new RandomAccessFile(destination, "rw");
        // write the raw bytes; writeBytes(new String(bytes)) would decode with the
        // platform default charset and then drop the high byte of every character
        raf.write(bytes);
        raf.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

How to modify text of a text-file which is read using FileInputStream

I have to use a method whose signature is like this
aMethod(FileInputStream);
I call that method like this
FileInputStream inputStream = new FileInputStream(someTextFile);
aMethod(inputStream);
I want to remove/edit some characters read from someTextFile before it is passed into aMethod(inputStream);
I cannot change aMethod's signature or overload it, and it only takes an InputStream.
If the method took a String as a parameter, then I wouldn't be asking this question.
I am an InputStream noob. Please advise.
You can convert a string into an input stream:
String str = "Converted stuff from reading the other input file and modifying it";
InputStream is = new ByteArrayInputStream(str.getBytes(StandardCharsets.UTF_8));
Here is something that might help. It grabs your .txt file, loads it, and goes through it line by line. You have to fill in the commented areas to do what you want.
public void parseFile() {
    String inputLine;
    String filename = "YOURFILE.txt";
    Thread thisThread = Thread.currentThread();
    ClassLoader loader = thisThread.getContextClassLoader();
    InputStream is = loader.getResourceAsStream(filename);
    try {
        FileWriter fstream = new FileWriter("path/to/NEWFILE.txt");
        BufferedWriter out = new BufferedWriter(fstream);
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(is));
        while ((inputLine = reader.readLine()) != null) {
            String[] str = inputLine.split("\t");
            if (/* IF WHAT YOU WANT IS IN THE FILE ADD IT */) {
                // DO SOMETHING OR ADD WHAT YOU WANT
                out.append(inputLine); // append the line itself; append(str) would not compile, since str is a String[]
                out.newLine();
            }
        }
        reader.close();
        out.close();
    } catch (Exception e) {
        e.getMessage();
    }
}
Have you looked at the FilterInputStream class, which also extends InputStream and may fit your requirement?
From the documentation for the class
A FilterInputStream contains some other input stream, which it uses as its basic source of data, possibly transforming the data along the way or providing additional functionality.
Also have a look at this question, which seems similar to yours.
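Putting the two suggestions together: read the file, apply the edits in memory, then wrap the result in a ByteArrayInputStream before calling the method. A minimal sketch where aMethod is a stand-in for the real method you cannot change (here it just copies the stream back to a String so the example is self-contained), and removing \r is just an example edit:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class EditThenStreamDemo {

    // Stand-in for the method you cannot change; it only sees an InputStream
    public static String aMethod(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            out.write(b);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // In real code: new String(Files.readAllBytes(Paths.get("someTextFile")), UTF_8)
        String text = "line one\r\nline two";
        String edited = text.replace("\r", ""); // the edit you want to apply
        InputStream is = new ByteArrayInputStream(edited.getBytes(StandardCharsets.UTF_8));
        System.out.println(aMethod(is)); // the method receives only the edited content
    }
}
```

This works when the file fits in memory; for very large files, a custom FilterInputStream that transforms bytes as they are read would be the streaming alternative.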

Character corruption going from BufferedReader to BufferedWriter in java

In Java, I am trying to parse an HTML file that contains complex text such as Greek symbols.
I encounter a known problem when text contains a left facing quotation mark. Text such as
mutations to particular “hotspot” regions
becomes
mutations to particular “hotspot�? regions
I have isolated the problem by writing a simple text-copy method:
public static int CopyFile()
{
    try
    {
        StringBuffer sb = null;
        String NullSpace = System.getProperty("line.separator");
        Writer output = new BufferedWriter(new FileWriter(outputFile));
        String line;
        BufferedReader input = new BufferedReader(new FileReader(myFile));
        while ((line = input.readLine()) != null)
        {
            sb = new StringBuffer();
            // Parsing would happen
            sb.append(line);
            output.write(sb.toString() + NullSpace);
        }
        return 0;
    }
    catch (Exception e)
    {
        return 1;
    }
}
Can anybody offer some advice as how to correct this problem?
★My solution
InputStream in = new FileInputStream(myFile);
Reader reader = new InputStreamReader(in, "utf-8");
Reader buffer = new BufferedReader(reader);
Writer output = new BufferedWriter(new FileWriter(outputFile));
int r;
// read through the BufferedReader (not the unbuffered reader underneath)
while ((r = buffer.read()) != -1)
{
    if (r < 126)
    {
        output.write(r);
    }
    else
    {
        // escape everything outside printable ASCII as a numeric character reference
        output.write("&#" + Integer.toString(r) + ";");
    }
}
output.flush();
The file read is not in the same encoding (probably UTF-8) as the file written (probably ISO-8859-1).
Try the following to generate a file with UTF-8 encoding:
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile),"UTF8"));
Unfortunately, determining the encoding of a file is very difficult. See Java : How to determine the correct charset encoding of a stream
In addition to what Thierry-Dimitri Roy wrote, if you know the encoding you have to create your FileReader with a bit of extra work. From the docs:
Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
The Javadoc for FileReader says:
The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
In your case the default character encoding is probably not appropriate. Find what encoding the input file uses, and specify it. For example:
FileInputStream fis = new FileInputStream(myFile);
InputStreamReader isr = new InputStreamReader(fis, "charset name goes here");
BufferedReader input = new BufferedReader(isr);

GZIPInputStream reading line by line

I have a file in .gz format. The Java class for reading this file is GZIPInputStream.
However, this class doesn't extend Java's BufferedReader class. As a result, I am not able to read the file line by line. I need something like this:
reader = new MyGZInputStream( some constructor of GZInputStream)
reader.readLine()...
I thought of creating my own class that extends Java's Reader or BufferedReader class and uses GZIPInputStream as one of its fields.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.Reader;
import java.util.zip.GZIPInputStream;

public class MyGZFilReader extends Reader {

    private GZIPInputStream gzipInputStream = null;
    char[] buf = new char[1024];

    @Override
    public void close() throws IOException {
        gzipInputStream.close();
    }

    public MyGZFilReader(String filename)
            throws FileNotFoundException, IOException {
        gzipInputStream = new GZIPInputStream(new FileInputStream(filename));
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        // TODO Auto-generated method stub
        return gzipInputStream.read((byte[])buf, off, len);
    }
}
But, this doesn't work when I use
BufferedReader in = new BufferedReader(
new MyGZFilReader("F:/gawiki-20090614-stub-meta-history.xml.gz"));
System.out.println(in.readLine());
Can someone advise how to proceed?
The basic setup of decorators is like this:
InputStream fileStream = new FileInputStream(filename);
InputStream gzipStream = new GZIPInputStream(fileStream);
Reader decoder = new InputStreamReader(gzipStream, encoding);
BufferedReader buffered = new BufferedReader(decoder);
The key issue in this snippet is the value of encoding. This is the character encoding of the text in the file. Is it "US-ASCII", "UTF-8", "SHIFT-JIS", "ISO-8859-9", …? There are hundreds of possibilities, and the correct choice usually cannot be determined from the file itself. It must be specified through some out-of-band channel.
For example, maybe it's the platform default. In a networked environment, however, this is extremely fragile. The machine that wrote the file might sit in the neighboring cubicle, but have a different default file encoding.
Most network protocols use a header or other metadata to explicitly note the character encoding.
In this case, it appears from the file extension that the content is XML. XML includes the "encoding" attribute in the XML declaration for this purpose. Furthermore, XML should really be processed with an XML parser, not as text. Reading XML line-by-line seems like a fragile, special case.
Failing to explicitly specify the encoding is against the second commandment. Use the default encoding at your peril!
GZIPInputStream gzip = new GZIPInputStream(new FileInputStream("F:/gawiki-20090614-stub-meta-history.xml.gz"));
BufferedReader br = new BufferedReader(new InputStreamReader(gzip));
br.readLine();
BufferedReader in = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new FileInputStream("F:/gawiki-20090614-stub-meta-history.xml.gz"))));
String content;
while ((content = in.readLine()) != null)
System.out.println(content);
You can use the following method in a util class, and use it whenever necessary...
public static List<String> readLinesFromGZ(String filePath) {
    List<String> lines = new ArrayList<>();
    File file = new File(filePath);
    try (GZIPInputStream gzip = new GZIPInputStream(new FileInputStream(file));
         BufferedReader br = new BufferedReader(new InputStreamReader(gzip))) {
        String line = null;
        while ((line = br.readLine()) != null) {
            lines.add(line);
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace(System.err);
    } catch (IOException e) {
        e.printStackTrace(System.err);
    }
    return lines;
}
Here it is in one line:
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(
                new GZIPInputStream(
                        new FileInputStream(
                                "F:/gawiki-20090614-stub-meta-history.xml.gz"))))) {
    br.readLine();
}
