How to read a string stream in Java discarding illegal characters?

I have to parse a stream of bytes coming from a TCP connection that's supposed to only give me printable characters, but in reality that's not always the case. I've seen some binary zeros in there, at the start and end of some fields. I have no control over the source of the data and I need to process the "dirty" lines. If I could just filter out the invalid characters, that'd be OK. The relevant code is as such:
srvr = new ServerSocket(myport);
skt = srvr.accept();
// Tried with no encoding argument too
in = new Scanner(skt.getInputStream(), "ISO-8859-1");
in.useDelimiter("[\r\n]");
for (;;) {
    String myline = in.next();
    if (!myline.equals(""))
        ProcessRecord(myline);
}
I get an exception at every line that has "dirt." What's a good way to filter out invalid characters while still being able to obtain the rest of the string?

You have to run your InputStream through a CharsetDecoder that is configured to ignore invalid input:
// let's create a decoder for ISO-8859-1 which will just ignore invalid data
CharsetDecoder decoder = Charset.forName("ISO-8859-1").newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
// let's wrap the input stream in a reader that uses the decoder
InputStream is = skt.getInputStream();
in = new Scanner(new InputStreamReader(is, decoder));
You can also use CodingErrorAction.REPLACE instead of IGNORE; the decoder then substitutes its replacement string for bad input rather than dropping it, and replaceWith(String) lets you choose that replacement.
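For instance, a minimal sketch of the REPLACE variant, using UTF-8 so that malformed input can actually occur (with ISO-8859-1 every byte decodes, so the handlers never fire); skt is the same socket as above:
CharsetDecoder lenient = StandardCharsets.UTF_8.newDecoder();
lenient.onMalformedInput(CodingErrorAction.REPLACE);
lenient.onUnmappableCharacter(CodingErrorAction.REPLACE);
lenient.replaceWith("?"); // bad bytes become '?' instead of disappearing
in = new Scanner(new InputStreamReader(skt.getInputStream(), lenient));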

The purest solution is to filter the InputStream (binary bytes-level I/O).
in = new Scanner(new DirtFilterInputStream(skt.getInputStream()), "Windows-1252");
public class DirtFilterInputStream extends InputStream {
    private InputStream in;

    public DirtFilterInputStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        int ch = in.read();
        if (ch == 0) {
            // skip NUL bytes by reading on; EOF (-1) passes through unchanged
            ch = read();
        }
        return ch;
    }
}
(You need to override all methods, and delegate to the original stream.)
Windows-1252 (Windows Latin-1) is a superset of ISO-8859-1 that assigns printable characters to the 0x80 - 0x9F range.

I was completely off base. I get the "dirty" strings no problem (and no, I have no option to clean up the data source; it's a closed system and I just have to grin and deal with it), but trying to store them in PostgreSQL is what gets me the exception. That means I have total freedom to clean them up before processing.
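For what it's worth, PostgreSQL's text types reject NUL (0x00) bytes outright, so if NULs are the only dirt, a one-line cleanup before ProcessRecord (or before the INSERT) may be enough; the second line is a broader sketch that strips every control character except tab:
String clean = myline.replace("\u0000", ""); // drop NULs only
String printable = myline.replaceAll("[\\p{Cntrl}&&[^\\t]]", ""); // drop all control chars except tab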

Related

The java.io.InputStream class is the superclass of all classes representing an input stream of bytes. How can it read a file that contains characters?

I have a file named mark_method.json containing ABCDE in it, and I am reading this file using the InputStream class.
By definition, the InputStream class reads an input stream of bytes. How does this work? The file doesn't contain bytes, it contains characters, doesn't it?
I am trying to understand how a stream that reads bytes ends up reading characters from the file.
public class MarkDemo {
    public static void main(String args[]) throws Exception {
        InputStream is = null;
        try {
            is = new FileInputStream("C:\\Users\\s\\Documents\\EB\\EB_02_09_2020_with_page_number_and_quote_number\\Old_images\\mark_method.json");
        }
        catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (is != null) {
                is.close();
            }
        }
    }
}
All data on a computer is stored in bits and bytes, and the content of a file is stored as bytes too.
We have programs which convert those bytes into human-readable form, which is why we see the mark_method.json file as containing characters and not bytes.
A character is a byte (at least in ASCII).
Each byte from 0 to 127 has a character value. For example, 0 is the null character, 0x0a is '\n', 0x0d is '\r', 0x41 is 'A', and so on.
The implementation only knows bytes. It doesn't know that the char 0x2709 is ✉; encoded as UTF-16BE it sees only two bytes, 0x27 and 0x09.
Only the text editor interprets the bytes and shows the matching symbol or letter.
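To make the byte-versus-character point concrete, a small sketch decoding the same three bytes with two different charsets (the byte values are the UTF-8 encoding of ✉, U+2709):
byte[] bytes = {(byte) 0xE2, (byte) 0x9C, (byte) 0x89}; // UTF-8 bytes of U+2709
System.out.println(new String(bytes, StandardCharsets.UTF_8)); // ✉
System.out.println(new String(bytes, Charset.forName("windows-1252"))); // âœ‰ (three unrelated characters)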
I think what you are actually asking here is how to convert the bytes you read from a file using FileInputStream into a Java String object you can print and manipulate.
FileInputStream does not have any read methods that directly produce a String object, so if that is what you want, you need to further process the input you get.
Option one is to use the Scanner class:
Scanner scanner = new Scanner(is);
String word = scanner.next();
Another option is to read the bytes and use the constructor of the String class that works with a byte array:
byte [] bytes = new byte[10];
is.read(bytes);
String text = new String(bytes);
Note that for simplicity I just assumed you can read 10 valid bytes from your file.
In real code you would need some logic to make sure you are reading the correct number of bytes.
Also, if your file is not stored using your system default character set, you will need to specify the character set as a parameter to the String constructor.
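A short sketch combining both caveats, capturing the actual byte count and naming the charset explicitly (UTF-8 here is an assumption about the file):
byte[] bytes = new byte[10];
int n = is.read(bytes); // may be fewer than 10, or -1 at end of stream
String text = n > 0 ? new String(bytes, 0, n, StandardCharsets.UTF_8) : "";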
Finally, you can use another wrapper class, BufferedReader that has a readLine function which takes care of all the logic needed to read bytes representing a line of text from a file and return them in a String.
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
String line = in.readLine();

java - modify and return a BufferedInputStream

I have a BufferedInputStream that I got from a FileInputStream object, like:
BufferedInputStream bufferedInputStream = new BufferedInputStream(fileInputStream);
now, I want to remove the chars { and } from the bufferedInputStream (I know the file has those chars in it).
I thought that I could easily do it somehow like a string replace, but I saw that there is no simple way of doing that with BufferedInputStream.
any ideas how I can remove those specific chars from the BufferedInputStream and return the new, modified BufferedInputStream?
EDIT:
At the end I want to detect the charset of a file, but the chars {} are causing me some issues, so I want to remove them before detecting the charset. This is how I am trying to detect it:
static String detectCharset(File file) {
    try (FileInputStream fileInputStream = new FileInputStream(file);
         BufferedInputStream bufferedInputStream = new BufferedInputStream(fileInputStream)) {
        CharsetDetector charsetDetector = new CharsetDetector();
        charsetDetector.setText(bufferedInputStream);
        charsetDetector.enableInputFilter(true);
        CharsetMatch cm = charsetDetector.detect();
        return cm.getName();
    } catch (Exception e) {
        return null;
    }
}
NB: Adding a note to respond to the edit you have done to your question: You can't really filter } from a bag of bytes unless you know the encoding, so if you want to filter } out in order to guess at encoding you're in a chicken-and-egg situation. I do not understand how removing { and } would somehow help a charset encoding detector, though. That sounds like the detector is buggy or you're misinterpreting what it is doing. If you must, rewrite your brain to treat this as 'removing byte 123 and 125 from an inputstream' instead of 'remove chars { and } from an inputstream' and you're closer to a workable job definition. The same principle applies, except you'd write a FilterInputStream instead of a FilterReader with almost the same methods, except 123 and 125 instead of '{' and '}'.
-- original answer --
[1] InputStream refers to bytes; Reader is the same concept, except for characters. It does not make sense to say "filter all { from an InputStream". It would make sense to say "filter all occurrences of byte 123 from an InputStream". If it's UTF-8 or ASCII, the two are equivalent, but there's no guarantee, and it's not 'nice' code in any fashion. To read files as text, this is how:
import java.nio.file.*;

Path p = Paths.get("/path/to/file");
try (BufferedReader br = Files.newBufferedReader(p)) {
    // operate on the reader here
}
Note that unlike most java methods, the methods in Files assume UTF-8. You can specify the encoding explicitly instead (Files.newBufferedReader(p, [ENCODING HERE])). You should never rely on the system default encoding being the right one; you cannot read a file as text unless you know which text encoding it was written in!
If you must use old API:
try (FileInputStream fis = new FileInputStream("/path/to/file");
     InputStreamReader isr = new InputStreamReader(fis, StandardCharsets.UTF_8);
     BufferedReader br = new BufferedReader(isr)) {
}
note that you MUST specify the charset here or things break in subtle ways.
[2] to filter out certain characters, you can either do it 'inline' (in the code that reads chars from the reader), which is trivial, or you can create a wrapper stream that can do it. Something like:
class RemoveBracesReader extends java.io.FilterReader {
    public RemoveBracesReader(java.io.Reader in) {
        super(in);
    }

    // NB: only single-character reads are filtered here; for real use you would
    // also override read(char[], int, int), which FilterReader otherwise
    // delegates straight to the underlying reader.
    public int read() throws java.io.IOException {
        while (true) {
            int c = in.read();
            if (c != '{' && c != '}') return c; // EOF (-1) falls through too
        }
    }
}
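Following the NB above, here is a sketch of the byte-level variant you could put in front of the charset detector: it filters bytes 123 and 125 before any decoding happens (the class name and wiring are mine, not from the question):
class RemoveBracesInputStream extends java.io.FilterInputStream {
    RemoveBracesInputStream(java.io.InputStream in) {
        super(in);
    }

    @Override
    public int read() throws java.io.IOException {
        while (true) {
            int b = in.read();
            if (b != 123 && b != 125) return b; // skip '{' and '}'; EOF (-1) passes through
        }
    }

    // FilterInputStream's bulk read bypasses read(), so route it through the filter too
    @Override
    public int read(byte[] buf, int off, int len) throws java.io.IOException {
        int n = 0;
        while (n < len) {
            int b = read();
            if (b == -1) return n == 0 ? -1 : n;
            buf[off + n++] = (byte) b;
        }
        return n;
    }
}
When wiring it into detectCharset, keep the BufferedInputStream on the outside (new BufferedInputStream(new RemoveBracesInputStream(fileInputStream))) so the stream still supports mark/reset, which CharsetDetector.setText expects.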

How to use AsynchronousFileChannel to read to a StringBuffer efficiently

So you know you can use AsynchronousFileChannel to read an entire file to a String:
AsynchronousFileChannel fileChannel = AsynchronousFileChannel.open(filePath, StandardOpenOption.READ);
long len = fileChannel.size();
ReadAttachment readAttachment = new ReadAttachment();
readAttachment.byteBuffer = ByteBuffer.allocate((int) len);
readAttachment.asynchronousChannel = fileChannel;

CompletionHandler<Integer, ReadAttachment> completionHandler = new CompletionHandler<Integer, ReadAttachment>() {
    @Override
    public void completed(Integer result, ReadAttachment attachment) {
        String content = new String(attachment.byteBuffer.array());
        try {
            attachment.asynchronousChannel.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        completeCallback.accept(content);
    }

    @Override
    public void failed(Throwable exc, ReadAttachment attachment) {
        exc.printStackTrace();
        exceptionError(errorCallback, completeCallback, String.format("error while reading file [%s]: %s", path, exc.getMessage()));
    }
};

fileChannel.read(
        readAttachment.byteBuffer,
        0,
        readAttachment,
        completionHandler);
Suppose that now I don't want to allocate an entire ByteBuffer, but instead read line by line. I could use a ByteBuffer of fixed size and keep calling read repeatedly, always copying and appending to a StringBuffer until I reach a new line... My only concern is: because the encoding of the file that I am reading could use multiple bytes per character (some UTF variant), the read bytes may end with an incomplete character. How can I make sure that I'm converting the right bytes into strings and not messing up the encoding?
UPDATE: answer is in the comment of the selected answer, but it basically points to CharsetDecoder.
If you have a clear ASCII separator, which you do in your case (\n), you don't need to worry about incomplete characters: '\n' maps to a single byte, and in UTF-8 (and other ASCII-compatible encodings) that byte value never occurs inside a multi-byte sequence.
So just search for the '\n' byte in your input, and read and convert everything before it into a String. Loop until no more newlines are found. Then compact the buffer and reuse it for the next read. If you don't find a newline you'll have to allocate a bigger buffer, copy the content of the old one over, and only then call read again.
EDIT: As mentioned in the comment, you can pass the ByteBuffer to a CharsetDecoder on the fly and translate it into a CharBuffer (then append to a StringBuilder or whatever the preferred sink is).
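A minimal sketch of that incremental decode (the decoder, buffers, and StringBuilder are assumed to live across calls; names are illustrative):
// bytes: the fixed-size ByteBuffer just filled by read(); endOfInput: true on the last call
static void decodeChunk(CharsetDecoder decoder, ByteBuffer bytes,
                        CharBuffer chars, StringBuilder out, boolean endOfInput) {
    bytes.flip();                             // switch from filling to draining
    decoder.decode(bytes, chars, endOfInput); // a split character's bytes stay behind
    chars.flip();
    out.append(chars);
    chars.clear();
    bytes.compact();                          // keep the incomplete tail for the next read
}
(The sketch assumes chars is large enough for one chunk; real code should check the CoderResult that decode returns.)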
Try Scanner:
Scanner sc = new Scanner(FileChannel.open(filePath, StandardOpenOption.READ));
String line = sc.nextLine();
FileChannel is InterruptibleChannel

All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"?

I'm creating a simple wordcount program in Java that reads through a directory's text-based files.
However, I keep on getting the error:
java.nio.charset.MalformedInputException: Input length = 1
from this line of code:
BufferedReader reader = Files.newBufferedReader(file,Charset.forName("UTF-8"));
I know I probably get this because I used a Charset that didn't include some of the characters in the text files, some of which included characters of other languages. But I want to include those characters.
I later learned from the JavaDocs that the Charset is optional and only used for a more efficient reading of the files, so I changed the code to:
BufferedReader reader = Files.newBufferedReader(file);
But some files still throw the MalformedInputException. I don't know why.
I was wondering if there is an all-inclusive Charset that will allow me to read text files with many different types of characters?
Thanks.
You probably want to have a list of supported encodings. For each file, try each encoding in turn, maybe starting with UTF-8. Every time you catch the MalformedInputException, try the next encoding.
Creating BufferedReader from Files.newBufferedReader
Files.newBufferedReader(Paths.get("a.txt"), StandardCharsets.UTF_8);
when running the application it may throw the following exception:
java.nio.charset.MalformedInputException: Input length = 1
But
new BufferedReader(new InputStreamReader(new FileInputStream("a.txt"),"utf-8"));
works well.
The difference is that the former uses the CharsetDecoder's default action.
The default action for malformed-input and unmappable-character errors is to report them.
while the latter uses the REPLACE action.
cs.newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE)
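So if you want to keep the java.nio file API but restore the lenient behavior, one sketch is to build the decoder yourself and wrap the stream (UTF-8 and the file name are assumptions):
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
BufferedReader reader = new BufferedReader(
        new InputStreamReader(Files.newInputStream(Paths.get("a.txt")), decoder));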
ISO-8859-1 is an all-inclusive charset, in the sense that it's guaranteed not to throw MalformedInputException. So it's good for debugging, even if your input is not in this charset. So:
req.setCharacterEncoding("ISO-8859-1");
I had some double-right-quote/double-left-quote characters in my input, and both US-ASCII and UTF-8 threw MalformedInputException on them, but ISO-8859-1 worked.
I also encountered this exception with error message,
java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at sun.nio.cs.StreamEncoder.implWrite(Unknown Source)
at sun.nio.cs.StreamEncoder.write(Unknown Source)
at java.io.OutputStreamWriter.write(Unknown Source)
at java.io.BufferedWriter.flushBuffer(Unknown Source)
at java.io.BufferedWriter.write(Unknown Source)
at java.io.Writer.write(Unknown Source)
and found that some strange bug occurs when trying to use
BufferedWriter writer = Files.newBufferedWriter(Paths.get(filePath));
to write a String "orazg 54" cast from a generic type in a class.
//key is of generic type <Key extends Comparable<Key>>
writer.write(item.getKey() + "\t" + item.getValue() + "\n");
This String is of length 9 and contains chars with the following code points: 111, 114, 97, 122, 103, 9, 53, 52, 10.
However, if the BufferedWriter in the class is replaced with:
FileOutputStream outputStream = new FileOutputStream(filePath);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(outputStream));
it can successfully write this String without exceptions. In addition, if I write the same String created directly from those characters, it still works OK.
String string = new String(new char[] {111, 114, 97, 122, 103, 9, 53, 52, 10});
BufferedWriter writer = Files.newBufferedWriter(Paths.get("a.txt"));
writer.write(string);
writer.close();
Previously I had never encountered any exception when using the first BufferedWriter to write any Strings. It's a strange bug that occurs with the BufferedWriter created from java.nio.file.Files.newBufferedWriter(path, options), likely the same report-versus-replace difference described in the answer above.
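If that diagnosis is right, a writer-side sketch that keeps the NIO API but restores the lenient REPLACE behavior (UTF-8 and the same filePath are assumed):
CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
BufferedWriter writer = new BufferedWriter(
        new OutputStreamWriter(Files.newOutputStream(Paths.get(filePath)), encoder));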
Try this; I had the same issue, and the implementation below worked for me:
Reader reader = Files.newBufferedReader(Paths.get(<yourfilewithpath>), StandardCharsets.ISO_8859_1);
then use the Reader wherever you want.
For example:
CsvToBean<anyPojo> csvToBean = null;
try {
    Reader reader = Files.newBufferedReader(Paths.get(csvFilePath),
            StandardCharsets.ISO_8859_1);
    csvToBean = new CsvToBeanBuilder(reader)
            .withType(anyPojo.class)
            .withIgnoreLeadingWhiteSpace(true)
            .withSkipLines(1)
            .build();
} catch (IOException e) {
    e.printStackTrace();
}
ISO_8859_1 worked for me! I was reading a text file with comma-separated values.
I wrote the following to print a list of results to standard out based on the available charsets. Note that it also tells you which line fails, as a 0-based line number, in case you are troubleshooting which character is causing issues.
public static void testCharset(String fileName) {
    SortedMap<String, Charset> charsets = Charset.availableCharsets();
    for (String k : charsets.keySet()) {
        int line = 0;
        boolean success = true;
        try (BufferedReader b = Files.newBufferedReader(Paths.get(fileName), charsets.get(k))) {
            while (b.ready()) {
                b.readLine();
                line++;
            }
        } catch (IOException e) {
            success = false;
            System.out.println(k + " failed on line " + line);
        }
        if (success)
            System.out.println("************************* Success " + k);
    }
}
Well, the problem is that Files.newBufferedReader(Path path) is implemented like this:
public static BufferedReader newBufferedReader(Path path) throws IOException {
    return newBufferedReader(path, StandardCharsets.UTF_8);
}
so basically there is no point in specifying UTF-8 unless you want to be descriptive in your code.
If you want to try a "broader" charset you could try with StandardCharsets.UTF_16, but you can't be 100% sure to get every possible character anyway.
UTF-8 works for me with Polish characters
Adding an additional answer for the Quarkus mailer and Qute templates, as this is always the first result in Google no matter which part of the stack trace I searched for:
If you're using the Quarkus mailer with a Qute template and get this MalformedInputException, check whether your templates folder contains files other than template files. In my case I had a .png file that I wanted to include in the mail, and it was automatically read as a template, which is why this encoding issue appeared.
You can try something like this, or just copy and paste the piece below.
List<String> values = new ArrayList<>();
boolean exception = true;
Charset charset = Charset.defaultCharset(); // try the default one first
int index = 0;
while (exception) {
    try {
        List<String> lines = Files.readAllLines(f.toPath(), charset);
        for (String line : lines) {
            line = line.trim();
            if (line.contains(keyword))
                values.add(line);
        }
        // no exception, just returns
        exception = false;
    } catch (IOException e) {
        // try the next charset, and give up once they are all exhausted
        if (index < Charset.availableCharsets().size()) {
            charset = (Charset) Charset.availableCharsets().values().toArray()[index];
            index++;
        } else {
            break; // no available charset could read the file
        }
    }
}

Reading UTF-8 characters using Scanner

public boolean isValid(String username, String password) {
    boolean valid = false;
    DataInputStream file = null;
    try {
        Scanner files = new Scanner(new BufferedReader(new FileReader("files/students.txt")));
        while (files.hasNext()) {
            System.out.println(files.next());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return valid;
}
How come, when I read a file that was written in UTF-8 (by another Java program), it displays weird symbols followed by the String itself?
I wrote it using this:
private static void addAccount(String username, String password) {
    File file = new File(file_name);
    try {
        DataOutputStream dos = new DataOutputStream(new FileOutputStream(file, true));
        dos.writeUTF((username + "::" + password + "\n"));
    } catch (Exception e) {
    }
}
Here is a simple way to do that:
File words = new File(path);
Scanner s = new Scanner(words,"utf-8");
From the FileReader Javadoc:
Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.
So perhaps something like new InputStreamReader(new FileInputStream(file), "UTF-8"))
When using DataOutput.writeUTF/DataInput.readUTF, the first 2 bytes form an unsigned 16-bit big-endian integer denoting the size of the string.
First, two bytes are read and used to construct an unsigned 16-bit integer in exactly the manner of the readUnsignedShort method . This integer value is called the UTF length and specifies the number of additional bytes to be read. These bytes are then converted to characters by considering them in groups. The length of each group is computed from the value of the first byte of the group. The byte following a group, if any, is the first byte of the next group.
This is likely the cause of your issue. You'd need to skip the first 2 bytes and then tell your Scanner to use UTF-8 to read the payload properly (strictly speaking, writeUTF emits modified UTF-8, but it agrees with standard UTF-8 for most text).
That being said, I do not see any reason to use DataOutput/DataInput here. You can simply use FileReader and FileWriter instead. These will use the default system encoding.
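Since the file was written with writeUTF, the symmetric fix is to read it back with DataInputStream.readUTF, which consumes the 2-byte length prefix itself. A sketch (the file name is taken from the question):
try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(new FileInputStream("files/students.txt")))) {
    while (true) {
        System.out.println(in.readUTF()); // reads one length-prefixed record
    }
} catch (EOFException e) {
    // readUTF signals end of file with EOFException
}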
