java - modify and return a BufferedInputStream

I have a BufferedInputStream that I got from a FileInputStream object, like:
BufferedInputStream bufferedInputStream = new BufferedInputStream(fileInputStream);
Now, I want to remove the chars { and } from the bufferedInputStream (I know the file has those chars in it).
I thought that I could easily do it somehow like a string replace, but I saw that there is no simple way of doing it with BufferedInputStream.
Any ideas how I can remove those specific chars from the BufferedInputStream and return the new, modified BufferedInputStream?
EDIT:
At the end I want to detect the charset of a file, but the chars {} are causing me some issues, so I want to remove them before detecting the charset. This is how I am trying to detect the charset:
static String detectCharset(File file) {
    try (FileInputStream fileInputStream = new FileInputStream(file);
         BufferedInputStream bufferedInputStream = new BufferedInputStream(fileInputStream)) {
        CharsetDetector charsetDetector = new CharsetDetector();
        charsetDetector.setText(bufferedInputStream);
        charsetDetector.enableInputFilter(true);
        CharsetMatch cm = charsetDetector.detect();
        return cm.getName();
    } catch (Exception e) {
        return null;
    }
}

NB: Adding a note to respond to the edit you have done to your question: You can't really filter } from a bag of bytes unless you know the encoding, so if you want to filter } out in order to guess at the encoding, you're in a chicken-and-egg situation. I do not understand how removing { and } would somehow help a charset detector, though; that sounds like the detector is buggy or you're misinterpreting what it is doing. If you must, rewrite your brain to treat this as 'removing bytes 123 and 125 from an InputStream' instead of 'removing chars { and } from an InputStream' and you're closer to a workable job definition. The same principle applies, except you'd write a FilterInputStream instead of a FilterReader with almost the same methods, except 123 and 125 instead of '{' and '}' (a byte-level sketch follows the reader example below).
-- original answer --
[1] InputStream refers to bytes, Reader is the same concept, except, for characters. It does not make sense to say: "filter all { from an inputstream". It would make sense to say "filter all occurrences of byte '123' from an inputstream". If it's UTF-8 or ASCII, these two are equivalent, but there's no guarantee, and it's not 'nice' code in any fashion. To read files as text, this is how:
import java.nio.file.*;

Path p = Paths.get("/path/to/file");
try (BufferedReader br = Files.newBufferedReader(p)) {
    // operate on the reader here
}
Note that unlike most Java methods, the methods in Files default to UTF-8. You can specify the encoding explicitly (Files.newBufferedReader(p, [ENCODING HERE])) instead. You should never rely on the system default encoding being the right one; you cannot read a file as text unless you know which text encoding it was written in!
If you must use the old API:
try (FileInputStream fis = new FileInputStream("/path/to/file");
     InputStreamReader isr = new InputStreamReader(fis, StandardCharsets.UTF_8);
     BufferedReader br = new BufferedReader(isr)) {
    // operate on the reader here
}
Note that you MUST specify the charset here or things break in subtle ways.
[2] To filter out certain characters, you can either do it 'inline' (in the code that reads chars from the reader), which is trivial, or you can create a wrapper reader that does it. Something like:
class RemoveBracesReader extends java.io.FilterReader {
    public RemoveBracesReader(Reader in) {
        super(in);
    }

    public int read() throws java.io.IOException {
        // Keep reading until the next non-brace character;
        // end-of-stream (-1) is returned as-is, since it is not a brace.
        while (true) {
            int c = in.read();
            if (c != '{' && c != '}') return c;
        }
    }
}
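For the byte-level variant mentioned in the note above, here is a minimal sketch (mine, untested; read(byte[], int, int) is implemented naively on top of the single-byte read() just to keep it short):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class RemoveBracesInputStream extends FilterInputStream {
    RemoveBracesInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        // Skip bytes 123 ('{') and 125 ('}'); end-of-stream (-1) falls through.
        while (true) {
            int b = in.read();
            if (b != 123 && b != 125) return b;
        }
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // Naive: delegate to the filtering single-byte read().
        for (int i = 0; i < len; i++) {
            int b = read();
            if (b == -1) return i == 0 ? -1 : i;
            buf[off + i] = (byte) b;
        }
        return len;
    }
}

If you feed this to the CharsetDetector from the question, keep the BufferedInputStream wrapper around it; if I remember the ICU API right, setText(InputStream) needs a stream that supports mark/reset.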

Related

How to split a byte array that contains multiple "lines" in Java?

Say we have a file like so:
one
two
three
(but this file got encrypted)
My crypto method returns the whole file in memory, as a byte[] type.
I know byte arrays don't have a concept of "lines"; that's something a Scanner (for example) could have.
I would like to traverse each line, convert it to a String and perform my operation on it, but I don't know how to:
Find lines in a byte array
Slice the original byte array to "lines" (I would convert those slices to String, to send to my other methods)
Correctly traverse a byte array, where each iteration is a new "line"
Also: do I need to consider the different OS the file might have been composed in? I know that there is some difference between new lines in Windows and Linux and I don't want my method to work only with one format.
Edit: Following some tips from answers here, I was able to write some code that gets the job done. I still wonder whether this code is worth keeping or whether I am doing something that can fail in the future:
byte[] decryptedBytes = doMyCrypto(fileName, accessKey);
ByteArrayInputStream byteArrInStrm = new ByteArrayInputStream(decryptedBytes);
InputStreamReader inStrmReader = new InputStreamReader(byteArrInStrm);
BufferedReader buffReader = new BufferedReader(inStrmReader);
String delimRegex = ",";
String line;
String[] values = null;
while ((line = buffReader.readLine()) != null) {
    values = line.split(delimRegex);
    if (Objects.equals(values[0], tableKey)) {
        return values;
    }
}
System.out.println(String.format("No entry with key %s in %s", tableKey, fileName));
return values;
In particular, I was advised to explicitly set the encoding, but I was unable to see exactly where.
If you want to stream this, I'd suggest:
Create a ByteArrayInputStream to wrap your array
Wrap that in an InputStreamReader to convert binary data to text - I suggest you explicitly specify the text encoding being used
Create a BufferedReader around that to read a line at a time
Then you can just use:
String line;
while ((line = bufferedReader.readLine()) != null)
{
    // Do something with the line
}
BufferedReader handles line breaks from all operating systems.
So something like this:
byte[] data = ...;
ByteArrayInputStream stream = new ByteArrayInputStream(data);
InputStreamReader streamReader = new InputStreamReader(stream, StandardCharsets.UTF_8);
BufferedReader bufferedReader = new BufferedReader(streamReader);
String line;
while ((line = bufferedReader.readLine()) != null)
{
    System.out.println(line);
}
Note that in general you'd want to use try-with-resources blocks for the streams and readers - but it doesn't matter in this case, because it's just in memory.
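For reference, a sketch of that same pipeline wrapped in try-with-resources anyway (my variant, with hard-coded sample bytes standing in for the decrypted data):

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class LinesFromBytes
{
    public static void main(String[] args) throws IOException
    {
        // Sample bytes standing in for the decrypted data.
        byte[] data = "one\r\ntwo\nthree".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)))
        {
            String line;
            while ((line = bufferedReader.readLine()) != null)
            {
                System.out.println(line); // both \r\n and \n line breaks are handled
            }
        }
    }
}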
As Scott states, I would like to see what you came up with so we can help you alter it to fit your needs.
Regarding your last comment about the OS: if you want to support multiple file types, you should consider making several functions that support those different file extensions. As far as I know, you do need to specify which file and what type of file you are reading in your code.

How to use AsynchronousFileChannel to read to a StringBuffer efficiently

So you know you can use AsynchronousFileChannel to read an entire file to a String:
AsynchronousFileChannel fileChannel = AsynchronousFileChannel.open(filePath, StandardOpenOption.READ);
long len = fileChannel.size();
ReadAttachment readAttachment = new ReadAttachment();
readAttachment.byteBuffer = ByteBuffer.allocate((int) len);
readAttachment.asynchronousChannel = fileChannel;
CompletionHandler<Integer, ReadAttachment> completionHandler = new CompletionHandler<Integer, ReadAttachment>() {
    @Override
    public void completed(Integer result, ReadAttachment attachment) {
        String content = new String(attachment.byteBuffer.array());
        try {
            attachment.asynchronousChannel.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        completeCallback.accept(content);
    }

    @Override
    public void failed(Throwable exc, ReadAttachment attachment) {
        exc.printStackTrace();
        exceptionError(errorCallback, completeCallback, String.format("error while reading file [%s]: %s", path, exc.getMessage()));
    }
};
fileChannel.read(
        readAttachment.byteBuffer,
        0,
        readAttachment,
        completionHandler);
Suppose that now, I don't want to allocate an entire ByteBuffer, but read line by line. I could use a ByteBuffer of fixed width and keep calling read many times, always copying and appending to a StringBuffer until I get to a new line... My only concern is: because the encoding of the file that I am reading could be multi-byte per character (some UTF), it may happen that the read bytes end with an incomplete character. How can I make sure that I'm converting the right bytes into strings and not messing up the encoding?
UPDATE: answer is in the comment of the selected answer, but it basically points to CharsetDecoder.
If you have a clear ASCII separator, which you have in your case (\n), you won't need to care about incomplete characters, as this character maps to a single byte in ASCII-compatible encodings such as UTF-8 (and vice versa).
So just search for the '\n' byte in your input, and read and convert everything before it into a String. Loop until no more newlines are found. Then compact the buffer and reuse it for the next read. If you don't find a newline, you'll have to allocate a bigger buffer, copy the content of the old one, and only then call read again.
EDIT: As mentioned in the comment, you can pass the ByteBuffer to a CharsetDecoder on the fly and translate it into a CharBuffer (then append to a StringBuilder or whatever your preferred solution is), as sketched below.
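A sketch of what that decode-on-the-fly can look like (names are mine; the ByteBuffer is assumed to be the one you hand to AsynchronousFileChannel.read and refill between calls):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class IncrementalDecoder {
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    private final CharBuffer chars = CharBuffer.allocate(1024);
    private final StringBuilder out = new StringBuilder();

    // Call after each completed read; byteBuffer is in fill mode (position = bytes read so far).
    void feed(ByteBuffer byteBuffer, boolean endOfInput) {
        byteBuffer.flip(); // switch to draining mode
        while (true) {
            // The decoder stops before an incomplete trailing character and reports underflow,
            // so multi-byte sequences are never split across reads.
            CoderResult result = decoder.decode(byteBuffer, chars, endOfInput);
            chars.flip();
            out.append(chars);
            chars.clear();
            if (result.isUnderflow()) break; // need more input (or finished)
        }
        byteBuffer.compact(); // keep the undecoded tail for the next read
    }

    String content() {
        return out.toString();
    }
}

Call feed(buf, true) for the last chunk; a fully robust version would also call decoder.flush(...) afterwards to drain any remaining decoder state.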
Try Scanner (note that Scanner has no readLine(); the method is nextLine()):
Scanner sc = new Scanner(FileChannel.open(filePath, StandardOpenOption.READ));
String line = sc.nextLine();
FileChannel is an InterruptibleChannel.

All inclusive Charset to avoid "java.nio.charset.MalformedInputException: Input length = 1"?

I'm creating a simple wordcount program in Java that reads through a directory's text-based files.
However, I keep on getting the error:
java.nio.charset.MalformedInputException: Input length = 1
from this line of code:
BufferedReader reader = Files.newBufferedReader(file, Charset.forName("UTF-8"));
I know I probably get this because I used a Charset that didn't include some of the characters in the text files, some of which included characters of other languages. But I want to include those characters.
I later learned from the JavaDocs that the Charset is optional and only used for more efficient reading of the files, so I changed the code to:
BufferedReader reader = Files.newBufferedReader(file);
But some files still throw the MalformedInputException. I don't know why.
I was wondering if there is an all-inclusive Charset that will allow me to read text files with many different types of characters?
Thanks.
You probably want to have a list of supported encodings. For each file, try each encoding in turn, maybe starting with UTF-8. Every time you catch the MalformedInputException, try the next encoding.
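A minimal sketch of that loop (names are mine; pick your own candidate list, and note that ISO-8859-1 never throws, so it makes a sensible last resort):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class CharsetFallback {
    static List<String> readWithFallback(Path file, List<Charset> candidates) throws IOException {
        for (Charset cs : candidates) {
            try {
                return Files.readAllLines(file, cs);
            } catch (MalformedInputException e) {
                // The bytes are not valid text in this charset; try the next one.
            }
        }
        throw new IOException("No candidate charset could decode " + file);
    }
}

For example: readWithFallback(path, List.of(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)).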
Creating a BufferedReader via Files.newBufferedReader:
Files.newBufferedReader(Paths.get("a.txt"), StandardCharsets.UTF_8);
When running the application, it may throw the following exception:
java.nio.charset.MalformedInputException: Input length = 1
But
new BufferedReader(new InputStreamReader(new FileInputStream("a.txt"),"utf-8"));
works well.
The difference is that the former uses the CharsetDecoder's default action.
The default action for malformed-input and unmappable-character errors is to report them.
while the latter uses the REPLACE action.
cs.newDecoder().onMalformedInput(CodingErrorAction.REPLACE).onUnmappableCharacter(CodingErrorAction.REPLACE)
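So, to get the latter's forgiving behavior while still naming the charset explicitly, you can build the decoder yourself; a sketch (file name assumed):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
// Bad byte sequences become U+FFFD replacement characters instead of throwing.
BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("a.txt"), decoder));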
ISO-8859-1 is an all-inclusive charset, in the sense that it's guaranteed not to throw MalformedInputException. So it's good for debugging, even if your input is not in this charset. So:
req.setCharacterEncoding("ISO-8859-1");
I had some double-right-quote/double-left-quote characters in my input, and both US-ASCII and UTF-8 threw MalformedInputException on them, but ISO-8859-1 worked.
I also encountered this exception with the error message:
java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(Unknown Source)
    at sun.nio.cs.StreamEncoder.implWrite(Unknown Source)
    at sun.nio.cs.StreamEncoder.write(Unknown Source)
    at java.io.OutputStreamWriter.write(Unknown Source)
    at java.io.BufferedWriter.flushBuffer(Unknown Source)
    at java.io.BufferedWriter.write(Unknown Source)
    at java.io.Writer.write(Unknown Source)
and found that a strange bug occurs when trying to use
BufferedWriter writer = Files.newBufferedWriter(Paths.get(filePath));
to write a String "orazg 54" cast from a generic type in a class.
//key is of generic type <Key extends Comparable<Key>>
writer.write(item.getKey() + "\t" + item.getValue() + "\n");
This String is of length 9 containing chars with the following code points:
111
114
97
122
103
9
53
52
10
However, if the BufferedWriter in the class is replaced with:
FileOutputStream outputStream = new FileOutputStream(filePath);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(outputStream));
it can successfully write this String without exceptions. In addition, if I write the same String created from the characters, it still works OK.
String string = new String(new char[] {111, 114, 97, 122, 103, 9, 53, 52, 10});
BufferedWriter writer = Files.newBufferedWriter(Paths.get("a.txt"));
writer.write(string);
writer.close();
Previously I had never encountered any exception when using the first BufferedWriter to write any Strings. It's a strange bug that occurs with a BufferedWriter created by java.nio.file.Files.newBufferedWriter(path, options).
Try this. I had the same issue; the implementation below worked for me:
Reader reader = Files.newBufferedReader(Paths.get(<yourfilewithpath>), StandardCharsets.ISO_8859_1);
Then use the Reader wherever you want. For example:
CsvToBean<anyPojo> csvToBean = null;
try {
    Reader reader = Files.newBufferedReader(Paths.get(csvFilePath),
            StandardCharsets.ISO_8859_1);
    csvToBean = new CsvToBeanBuilder(reader)
            .withType(anyPojo.class)
            .withIgnoreLeadingWhiteSpace(true)
            .withSkipLines(1)
            .build();
} catch (IOException e) {
    e.printStackTrace();
}
ISO_8859_1 worked for me! I was reading a text file with comma-separated values.
I wrote the following to print a list of results to standard out based on the available charsets. Note that it also tells you which line fails, as a zero-based line number, in case you are troubleshooting which character is causing issues.
public static void testCharset(String fileName) {
    SortedMap<String, Charset> charsets = Charset.availableCharsets();
    for (String k : charsets.keySet()) {
        int line = 0;
        boolean success = true;
        try (BufferedReader b = Files.newBufferedReader(Paths.get(fileName), charsets.get(k))) {
            while (b.ready()) {
                b.readLine();
                line++;
            }
        } catch (IOException e) {
            success = false;
            System.out.println(k + " failed on line " + line);
        }
        if (success)
            System.out.println("************************* Success " + k);
    }
}
Well, the problem is that Files.newBufferedReader(Path path) is implemented like this:
public static BufferedReader newBufferedReader(Path path) throws IOException {
    return newBufferedReader(path, StandardCharsets.UTF_8);
}
so basically there is no point in specifying UTF-8 unless you want to be descriptive in your code.
If you want to try a "broader" charset you could try with StandardCharsets.UTF_16, but you can't be 100% sure to get every possible character anyway.
UTF-8 works for me with Polish characters
Adding an additional answer for Quarkus Mailer and Qute templates, as this is always the first result on Google no matter which parts of the stack trace I searched for:
If you're using Quarkus Mailer and a Qute template and get this MalformedInputException, check whether your templates folder contains files other than template files. In my case I had a .png file that I wanted to include in the mail, and it was automatically read as a template, which is where this encoding issue came from.
You can try something like this, or just copy and paste the piece below.
boolean exception = true;
Charset charset = Charset.defaultCharset(); // try the default one first
int index = 0;
while (exception) {
    try {
        lines = Files.readAllLines(f.toPath(), charset);
        for (String line : lines) {
            line = line.trim();
            if (line.contains(keyword))
                values.add(line);
        }
        // No exception, just returns
        exception = false;
    } catch (IOException e) {
        // Try the next charset
        if (index >= Charset.availableCharsets().values().size()) {
            throw e; // every available charset failed; give up rather than loop forever
        }
        charset = (Charset) Charset.availableCharsets().values().toArray()[index];
        index++;
    }
}

How to read a string stream in Java discarding illegal characters?

I have to parse a stream of bytes coming from a TCP connection that's supposed to only give me printable characters, but in reality that's not always the case. I've seen some binary zeros in there, at the start and end of some fields. I have no control over the source of the data and I need to process the "dirty" lines. If I could just filter out the invalid characters, that'd be OK. The relevant code is as such:
srvr = new ServerSocket(myport);
skt = srvr.accept();
// Tried with no encoding argument too
in = new Scanner(skt.getInputStream(), "ISO-8859-1");
in.useDelimiter("[\r\n]");
for (;;) {
    String myline = in.next();
    if (!myline.equals(""))
        ProcessRecord(myline);
}
I get an exception at every line that has "dirt." What's a good way to filter out invalid characters while still being able to obtain the rest of the string?
You have to decode the stream with a CharsetDecoder that ignores invalid data (note that a CharsetDecoder cannot consume an InputStream directly; wrap both in an InputStreamReader):
// let's create a decoder for ISO-8859-1 which will just ignore invalid data
CharsetDecoder decoder = Charset.forName("ISO-8859-1").newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
// let's wrap the input stream in a reader driven by the decoder
InputStream is = skt.getInputStream();
in = new Scanner(new InputStreamReader(is, decoder));
You can also use CodingErrorAction.REPLACE together with CharsetDecoder.replaceWith(...) to define your own replacement in case of a coding error.
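For example, a variant (sketch) that substitutes a visible placeholder instead of silently dropping bytes:

CharsetDecoder replacing = Charset.forName("ISO-8859-1").newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE)
        .replaceWith("?");
in = new Scanner(new InputStreamReader(skt.getInputStream(), replacing));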
The purest solution is to filter the InputStream (binary bytes-level I/O).
in = new Scanner(new DirtFilterInputStream(skt.getInputStream()), "Windows-1252");
public class DirtFilterInputStream extends InputStream {
    private InputStream in;

    public DirtFilterInputStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        int ch = in.read();
        if (ch != -1) {
            if (ch == 0) {
                ch = read(); // skip NUL bytes by recursing to the next byte
            }
        }
        return ch;
    }
}
(You need to override all methods, and delegate to the original stream.)
Windows-1252 is Windows Latin-1, an extension of Latin-1 (ISO-8859-1) that assigns printable characters to the 0x80 - 0x9F range.
I was completely off base. I get the "dirty" strings no problem (and NO, I have NO option to clean up the data source; it's from a closed system and I have to just grin and deal with it), but trying to store them in PostgreSQL is what gets me the exception. That means I have total freedom to clean them up before processing.

Read special characters in java

I have a question: I'm trying to read a set of key and value pairs (like a dictionary) from a file. For this I'm using the following code:
InputStream is = this.getClass().getResourceAsStream(PROPERTIES_BUNDLE);
properties = new Hashtable();
InputStreamReader isr = new InputStreamReader(is);
LineReader lineReader = new LineReader(isr);
try {
    while (lineReader.hasLine()) {
        String line = lineReader.readLine();
        if (line.length() > 1 && line.substring(0, 1).equals("#")) continue;
        if (line.indexOf("=") != -1) {
            String key = line.substring(0, line.indexOf("="));
            String value = line.substring(line.indexOf("=") + 1, line.length());
            properties.put(key, value);
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
And the readLine function (note: the original look-ahead compared tmp instead of nextChar, which is always true at that point; fixed below):
public String readLine() throws IOException {
    int tmp;
    StringBuffer out = new StringBuffer();
    // Read in data
    while (true) {
        // Check the bucket first. If empty, read from the input stream.
        if (bucket != -1) {
            tmp = bucket;
            bucket = -1;
        } else {
            tmp = in.read();
            if (tmp == -1) break;
        }
        // If new line, then discard it. If we get a \r, we need to look ahead, so we can use the bucket.
        if (tmp == '\r') {
            int nextChar = in.read();
            if (nextChar != '\n') bucket = nextChar; // ignores \r\n, but not \r\r
            break;
        } else if (tmp == '\n') {
            break;
        } else {
            // Otherwise just append the character
            out.append((char) tmp);
        }
    }
    return out.toString();
}
Everything is fine; however, I want it to be able to parse special characters. For example, ó, which would be encoded as \u00F3; in this case it's not being replaced with the correct character... What would be the way to do it?
EDIT: Forgot to say that since I'm using Java ME, the Properties class or anything similar does not exist, which is why it may seem that I'm reinventing the wheel...
If it's encoded with UTF-16, can you not just use
InputStreamReader isr = new InputStreamReader(is, "UTF-16");
This would recognize your special characters right from the get-go and you wouldn't need to do any replacements.
You need to ensure that the character encoding set on your InputStreamReader matches that of the file. If it doesn't match, some characters may be decoded incorrectly.
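For instance, a sketch of the question's setup with an explicit encoding (assuming the file is saved as UTF-8; on Java ME you pass the charset by name):

InputStream is = this.getClass().getResourceAsStream(PROPERTIES_BUNDLE);
// Decode the bytes as UTF-8 rather than the platform default,
// so characters like ó survive the byte-to-char conversion.
InputStreamReader isr = new InputStreamReader(is, "UTF-8");
LineReader lineReader = new LineReader(isr);

Note that if the file literally contains the six characters \u00F3 (a properties-style escape), no charset will turn them into ó; you would have to decode that escape yourself after reading the line.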
