I know I can use AsynchronousFileChannel to read an entire file into a String:
AsynchronousFileChannel fileChannel = AsynchronousFileChannel.open(filePath, StandardOpenOption.READ);
long len = fileChannel.size();
ReadAttachment readAttachment = new ReadAttachment();
readAttachment.byteBuffer = ByteBuffer.allocate((int) len);
readAttachment.asynchronousChannel = fileChannel;
CompletionHandler<Integer, ReadAttachment> completionHandler = new CompletionHandler<Integer, ReadAttachment>() {
    @Override
    public void completed(Integer result, ReadAttachment attachment) {
        String content = new String(attachment.byteBuffer.array());
        try {
            attachment.asynchronousChannel.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        completeCallback.accept(content);
    }

    @Override
    public void failed(Throwable exc, ReadAttachment attachment) {
        exc.printStackTrace();
        exceptionError(errorCallback, completeCallback,
                String.format("error while reading file [%s]: %s", path, exc.getMessage()));
    }
};

fileChannel.read(
        readAttachment.byteBuffer,
        0,
        readAttachment,
        completionHandler);
Suppose that now I don't want to allocate an entire ByteBuffer, but instead read line by line. I could use a fixed-size ByteBuffer and keep calling read repeatedly, copying and appending to a StringBuffer until I hit a newline... My only concern is: because the encoding of the file I am reading may use multiple bytes per character (some UTF variant), the read bytes may end with an incomplete character. How can I make sure that I'm converting the right bytes into strings and not corrupting the encoding?
UPDATE: the answer is in the comments on the accepted answer, but it basically points to CharsetDecoder.
If you have a clear ASCII separator, which you do in your case (\n), you don't need to worry about incomplete characters, since that separator maps to a single byte (and, in UTF-8, that byte can only mean that character).
So just search for the '\n' byte in your input, and read and convert everything before it into a String. Loop until no more newlines are found. Then compact the buffer and reuse it for the next read. If you don't find a newline, you'll have to allocate a bigger buffer, copy the content of the old one, and only then call read again.
EDIT: As mentioned in the comments, you can pass the ByteBuffer to a CharsetDecoder on the fly and translate it into a CharBuffer (then append to a StringBuilder or whatever your preferred solution is).
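For illustration, here is a minimal sketch of that decoder-based approach (the variable names are illustrative and it assumes UTF-8): bytes that end mid-character simply stay in the buffer and are completed by the next read.
CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
ByteBuffer bytes = ByteBuffer.allocate(4096);
CharBuffer chars = CharBuffer.allocate(4096);
StringBuilder text = new StringBuilder();

// ... run this each time a read completes with fresh data in 'bytes':
bytes.flip();                         // switch the buffer to draining mode
decoder.decode(bytes, chars, false);  // 'false' signals that more input will follow
                                      // (a full version would check the returned CoderResult for overflow)
chars.flip();
text.append(chars);                   // append the complete characters decoded so far
chars.clear();
bytes.compact();                      // keep any incomplete trailing bytes for the next read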
Try Scanner:
Scanner sc = new Scanner(FileChannel.open(filePath, StandardOpenOption.READ));
String line = sc.nextLine();
Note that FileChannel is an InterruptibleChannel.
I have a file named mark_method.json containing ABCDE, and I am reading this file using the InputStream class.
By definition, the InputStream class reads an input stream of bytes. How does this work? The file doesn't contain bytes, it contains characters, doesn't it?
I am trying to understand how a stream that reads bytes ends up reading characters from the file.
public class MarkDemo {
    public static void main(String args[]) throws Exception {
        InputStream is = null;
        try {
            is = new FileInputStream("C:\\Users\\s\\Documents\\EB\\EB_02_09_2020_with_page_number_and_quote_number\\Old_images\\mark_method.json");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (is != null) {
                is.close();
            }
        }
    }
}
All data on a computer is stored in bits and bytes, and the content of a file is stored in bytes as well.
We have programs which convert these bytes into human-readable form, which is why we see the mark_method.json file as containing characters rather than bytes.
A character is a byte (at least in ASCII).
Each byte from 0 to 127 has a character value. For example, 0 is the null character, 0x0a is '\n', 0x0d is '\r', 0x41 is 'A', and so on.
The implementation only knows bytes. It doesn't know that 0x2709 is ✉; it just sees two bytes, 0x27 and 0x09.
Only the text editor interprets the bytes and shows the matching symbol or letter.
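For example, here is a short sketch of that mapping (the byte values are from the ASCII table):
byte[] bytes = { 0x41, 0x42, 0x43, 0x44, 0x45 }; // the raw bytes on disk
String text = new String(bytes, StandardCharsets.US_ASCII);
System.out.println(text); // prints "ABCDE"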
I think what you are actually asking here is how to convert the bytes you read from a file using FileInputStream into a Java String object you can print and manipulate.
FileInputStream does not have any read methods that directly produce a String object, so if that is what you want, you need to further process the input you get.
Option one is to use the Scanner class:
Scanner scanner = new Scanner(is);
String word = scanner.next();
Another option is to read the bytes and use the constructor of the String class that works with a byte array:
byte [] bytes = new byte[10];
is.read(bytes);
String text = new String(bytes);
Note that for simplicity I just assumed you can read 10 valid bytes from your file.
In real code you would need some logic to make sure you are reading the correct number of bytes.
Also, if your file is not stored using your system default character set, you will need to specify the character set as a parameter to the String constructor.
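As a sketch, that "correct number of bytes" logic could look like this (it reuses the is stream from above and assumes UTF-8):
byte[] bytes = new byte[10];
int filled = 0;
while (filled < bytes.length) {
    int n = is.read(bytes, filled, bytes.length - filled);
    if (n == -1) {
        break; // the stream ended before the buffer was full
    }
    filled += n;
}
String text = new String(bytes, 0, filled, StandardCharsets.UTF_8);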
Finally, you can use another wrapper class, BufferedReader, whose readLine method takes care of all the logic needed to read the bytes representing a line of text from a file and return them as a String.
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
String line = in.readLine();
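If you also want an explicit character set with this approach, a sketch (the file name is illustrative):
BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream("foo.in"), StandardCharsets.UTF_8));
String line;
while ((line = in.readLine()) != null) {
    System.out.println(line);
}
in.close();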
I need to read from a file that contains 9000 words. What is the best way to read from this file, and what is the difference between a BufferedReader and a regular Scanner? Or is there another good class to use?
Thanks
If you are doing "efficient" reading, there is no benefit to buffering. If, on the other hand, you are doing "inefficient" reading, then having a buffer will improve performance.
What do I mean by "efficient" reading? Efficient reading means pulling bytes off the InputStream / Reader as fast as they appear; imagine you wanted to load a whole text file at once to display in an IDE or other editor. "Inefficient" reading is when you read information off the stream piecemeal, e.g. Scanner.nextDouble() is inefficient reading, as it reads in a few bytes (until the double's digits end) and then converts the number from text to binary. In this case, having a buffer improves performance, as the next call to nextDouble() will read out of the buffer (memory) instead of from disk.
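For example, here is a sketch of putting a buffer in front of this kind of piecemeal reading (the file name is illustrative):
Scanner sc = new Scanner(new BufferedInputStream(new FileInputStream("words.txt")));
while (sc.hasNext()) {
    String word = sc.next(); // each small read is served from the in-memory buffer
}
sc.close();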
If you have any questions on this, please ask
Open the file using an input stream, then read its content into a string with this code:
public static void main(String args[]) throws IOException {
    FileInputStream in = new FileInputStream("input.txt");
    String text = inputStreamToString(in);
    in.close();
}

// Reads an InputStream and converts it to a String.
public static String inputStreamToString(InputStream stream) throws IOException {
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    byte[] buffer = new byte[1024];
    int length;
    while ((length = stream.read(buffer)) != -1) {
        byteArrayOutputStream.write(buffer, 0, length);
    }
    return byteArrayOutputStream.toString("UTF-8");
}
Check this answer for comparisons between buffered readers:
Read/convert an InputStream to a String
I normally use Scanners when I want to read a file line by line, or based on a delimiter. For example:
try {
    File file = new File("file.txt");
    Scanner fileScanner = new Scanner(file);
    while (fileScanner.hasNextLine()) {
        String line = fileScanner.nextLine();
        System.out.println(line);
    }
    fileScanner.close();
} catch (Exception ex) {
    ex.printStackTrace();
}
To scan based on a delimiter, you can use something similar to this:
fileScanner.useDelimiter("\\s"); // \s matches a whitespace character
while (fileScanner.hasNext()) {
    // do something with the scanned data
    String word = fileScanner.next();
    //Double num = fileScanner.nextDouble();
}
I have a servlet which receives a large JSON string (> 10,000 characters) via the POST method.
If I read the content of the request like this:
try (Reader reader = new InputStreamReader(new BufferedInputStream(request.getInputStream()), StandardCharsets.UTF_8))
{
    char[] buffer = new char[request.getContentLength()];
    reader.read(buffer);
    System.out.println(new String(buffer));
}
I don't get the entire content! The buffer size is correct, but the length of the created string is not.
But if I do it like this:
try (BufferedInputStream input = new BufferedInputStream(request.getInputStream()))
{
    byte[] buffer = new byte[request.getContentLength()];
    input.read(buffer);
    System.out.println(new String(buffer, StandardCharsets.UTF_8));
}
it works perfectly.
So where am I going wrong in the first case?
The way you are using InputStreamReader is not really the intended way. A single call to read is not guaranteed to fill the buffer (it depends on the stream you are reading from), which is why its return value is the number of characters that were actually read. You would need to keep reading from the stream and buffering until it indicates it has reached the end (read will return -1). Some good examples of how to do this can be found here: Convert InputStream to byte array in Java
But since you want this as character data, you should probably use request.getReader() instead. A good example of how to do this can be found here: Retrieving JSON Object Literal from HttpServletRequest
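A sketch of that getReader() approach, accumulating until the reader reports end of stream:
StringBuilder body = new StringBuilder();
BufferedReader reader = request.getReader();
char[] chunk = new char[8192];
int n;
while ((n = reader.read(chunk)) != -1) {
    body.append(chunk, 0, n); // append only the characters actually read
}
String json = body.toString();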
I have to parse a stream of bytes coming from a TCP connection that's supposed to only give me printable characters, but in reality that's not always the case. I've seen some binary zeros in there, at the start and end of some fields. I have no control over the source of the data and I need to process the "dirty" lines. If I could just filter out the invalid characters, that'd be OK. The relevant code is as follows:
srvr = new ServerSocket(myport);
skt = srvr.accept();
// Tried with no encoding argument too
in = new Scanner(skt.getInputStream(), "ISO-8859-1");
in.useDelimiter("[\r\n]");
for (;;) {
    String myline = in.next();
    if (!myline.equals(""))
        ProcessRecord(myline);
}
I get an exception at every line that has "dirt." What's a good way to filter out invalid characters while still being able to obtain the rest of the string?
You have to wrap your InputStream in a Reader that uses a CharsetDecoder configured to ignore invalid data:
// let's create a decoder for ISO-8859-1 which will just ignore invalid data
CharsetDecoder decoder = Charset.forName("ISO-8859-1").newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
// let's wrap the input stream into a reader driven by the decoder
InputStream is = skt.getInputStream();
in = new Scanner(new InputStreamReader(is, decoder));
You can also pick a different CodingErrorAction (REPLACE, or REPORT if you want to handle coding errors yourself).
The purest solution is to filter the InputStream itself (binary, byte-level I/O).
in = new Scanner(new DirtFilterInputStream(skt.getInputStream()), "Windows-1252");
public class DirtFilterInputStream extends InputStream {
    private InputStream in;

    public DirtFilterInputStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        int ch = in.read();
        // skip zero bytes; -1 (end of stream) passes through unchanged
        while (ch == 0) {
            ch = in.read();
        }
        return ch;
    }
}
(You need to override all methods, and delegate to the original stream.)
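For instance, a sketch of the bulk-read override that note refers to (an untested illustration; it drops zero bytes from whatever the wrapped stream returns):
@Override
public int read(byte[] b, int off, int len) throws IOException {
    if (len == 0) {
        return 0;
    }
    int kept = 0;
    while (kept == 0) { // never return 0 for a non-empty request
        int n = in.read(b, off, len);
        if (n == -1) {
            return -1;
        }
        // compact the chunk in place, skipping zero bytes
        for (int i = off; i < off + n; i++) {
            if (b[i] != 0) {
                b[off + kept++] = b[i];
            }
        }
    }
    return kept;
}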
Windows-1252 is Windows Latin-1, an extension of Latin-1 (ISO-8859-1) that assigns printable characters to the 0x80 - 0x9F range.
I was completely off base. I get the "dirty" strings with no problem (and no, I have no option to clean up the data source; it's from a closed system and I just have to grin and deal with it), but trying to store them in PostgreSQL is what gets me the exception. That means I have total freedom to clean them up before processing.
I have a problem with this code:
private static String compress(String str)
{
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try
    {
        byte b[] = str.getBytes();
        GZIPOutputStream gz = new GZIPOutputStream(bos, b.length);
        gz.write(b, 0, b.length);
        gz.close(); // close the GZIP stream first so it flushes its trailer
        bos.close();
    }
    catch (Exception e) {
        System.out.println(e);
        e.printStackTrace();
    }
    byte b1[] = bos.toByteArray();
    return new String(b1);
}
private static String deCompress(String str)
{
    String s1 = null;
    try
    {
        byte b[] = str.getBytes();
        InputStream bais = new ByteArrayInputStream(b);
        GZIPInputStream gs = new GZIPInputStream(bais);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        int numBytesRead = 0;
        byte[] tempBytes = new byte[6000];
        while ((numBytesRead = gs.read(tempBytes, 0, tempBytes.length)) != -1)
        {
            baos.write(tempBytes, 0, numBytesRead);
        }
        s1 = baos.toString();
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    return s1;
}
public String test() throws Exception
{
    String str = "teststring";
    String cmpr = compress(str);
    String dcmpr = deCompress(cmpr);
    return dcmpr;
}
This code throws java.io.IOException: unknown format (magic number ef1f) at:
GZIPInputStream gs = new GZIPInputStream(bais);
It turns out that the bytes get "spoiled" in the round trip through new String(b1) and str.getBytes(): the string version contains more bytes than the original. If you avoid the conversion to a String and work directly with the bytes, everything works.
public String unZip(String zipped) throws DataFormatException, IOException {
    byte[] bytes = zipped.getBytes("WINDOWS-1251");
    Inflater decompressed = new Inflater();
    decompressed.setInput(bytes);
    byte[] result = new byte[100];
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int length;
    while ((length = decompressed.inflate(result)) != 0) {
        buffer.write(result, 0, length); // write only the bytes actually inflated
    }
    decompressed.end();
    return new String(buffer.toByteArray(), "WINDOWS-1251");
}
I used this function to decompress the server response. Thanks for the help.
You have two problems:
You're using the default character encoding to convert the original string into bytes. That will vary by platform. It's better to specify an encoding - UTF-8 is usually a good idea.
You're trying to represent the opaque binary data of the result of the compression as a string by just calling the String(byte[]) constructor. That constructor is only meant for data which is encoded text... which this isn't. You should use base64 for this. There's a public domain base64 library which makes this easy. (Alternatively, don't convert the compressed data to text at all - just return a byte array.)
Fundamentally, you need to understand how different text and binary data are - when you want to convert between the two, you should do so carefully. If you want to represent "non text" binary data (i.e. bytes which aren't the direct result of encoding text) in a string you should use something like base64 or hex. When you want to encode a string as binary data (e.g. to write some text to disk) you should carefully consider which encoding to use. If another program is going to read your data, you need to work out what encoding it expects - if you have full control over it yourself, I'd usually go for UTF-8.
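To illustrate, here is a minimal sketch of that suggestion using java.util.Base64 (available since Java 8) rather than a third-party library; the binary GZIP output is base64-encoded so it can travel safely inside a String:
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

static String compress(String str) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
        gz.write(str.getBytes(StandardCharsets.UTF_8)); // explicit encoding in
    }
    return Base64.getEncoder().encodeToString(bos.toByteArray());
}

static String deCompress(String base64) throws IOException {
    byte[] compressed = Base64.getDecoder().decode(base64);
    GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = gz.read(buf)) != -1) {
        out.write(buf, 0, n);
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8); // explicit encoding out
}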
Additionally, the exception handling in your code is poor:
You should almost never catch Exception; catch more specific exceptions
You shouldn't just catch an exception and continue as if it had never happened. If you can't really handle the exception and still complete your method successfully, you should let the exception bubble up the stack (or possibly catch it and wrap it in a more appropriate exception type for your abstraction)
When you GZIP compress data, you always get binary data. This data cannot be converted into a string, as it is not valid character data (in any encoding).
So your compress method should return a byte array, and your decompress method should take a byte array as its parameter.
Furthermore, I recommend you use an explicit encoding when you convert the string into a byte array before compression and when you turn the decompressed data into a string again.
"When you GZIP compress data, you always get binary data. This data cannot be converted into a string, as it is not valid character data (in any encoding)."
Codo is right, thanks a lot for enlightening me. I was trying to decompress a string (converted from the binary data). What I changed was to use InflaterInputStream directly on the input stream returned by my HTTP connection. (My app was retrieving a large JSON of strings.)
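A sketch of that final approach (the URL is illustrative, and it assumes the server sends raw DEFLATE data, which is what InflaterInputStream expects):
import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.InflaterInputStream;

HttpURLConnection conn = (HttpURLConnection) new URL("https://example.com/data").openConnection();
InputStream in = new InflaterInputStream(conn.getInputStream());
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buf = new byte[4096];
int n;
while ((n = in.read(buf)) != -1) {
    out.write(buf, 0, n);
}
in.close();
String json = new String(out.toByteArray(), StandardCharsets.UTF_8);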