I am building a history parser; there's an application that already does the logging (text based).
Now my supervisor wants me to create an application to read that log.
The log is created at the end of the month and is separated by [date]:
[19-11-2014]
- what goes here
- what goes here
[20-11-2014]
- what goes here
- what goes here
etc...
If the log file is small, there's no problem processing the content: read it with a DataInputStream to get a byte[], convert that to a String, and then do the filtering (with substring and such).
But when the file is large (about 100 MB), it throws a Java heap space error. I assume this is because holding the byte[] and the converted String in memory at the same time exceeds the heap; when I don't convert the byte[] into a String, no exception is thrown.
Now the question is: how do I split the byte[] into several smaller byte[] arrays,
where each new byte[] contains only a single:
[date]
- what goes here
So if within a month there are 9 dates in the log, it would be split into 9 byte[] arrays.
The splitting marker would be based on [\\d{2}-\\d{2}-\\d{4}]; if it were a String I could just use a regex to find all the markers, get the indexOf of each, and then substring.
But how do I do this without converting to a String first, since that throws the heap space error?
I think there are several concepts here that you're missing.
First, an InputStream is a stream, meaning a flow of bytes. What you do with that flow is up to you, but saving the whole stream to memory defeats the purpose of the stream construct altogether.
Second, a DataInputStream is used to read Java primitive values from a binary stream that were written there by a DataOutputStream. Using it just to read text is the wrong tool for the job, since a plain InputStream (or better, a Reader) can do that.
As for your specific problem, I would use a BufferedReader wrapped around a FileReader, and read one line at a time until reaching the next date. At that point you can do whatever processing you need on the chunk of lines you just read and then let it be garbage collected, thus not running into the same problem.
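A minimal sketch of that approach, assuming the date-marker pattern from the question; the "processing" here just collects each section, where a real implementation would filter or transform it:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LogSplitter {

    private static final Pattern DATE_MARKER =
            Pattern.compile("\\[\\d{2}-\\d{2}-\\d{4}\\]");

    // Reads line by line; only one date section is held in memory at a time.
    public static List<String> splitSections(Reader source) throws IOException {
        List<String> sections = new ArrayList<>();
        StringBuilder current = null;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (DATE_MARKER.matcher(line).matches()) {
                    if (current != null) {
                        sections.add(current.toString()); // process previous chunk
                    }
                    current = new StringBuilder();        // start a new chunk
                }
                if (current != null) {
                    current.append(line).append('\n');
                }
            }
            if (current != null) {
                sections.add(current.toString());         // last chunk
            }
        }
        return sections;
    }
}
```

Because each chunk is released before the next one is built, peak memory use is bounded by the largest single section, not the whole file.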
Related
For a web project I'm writing large sections of text to a webpage (table), or even bigger amounts (could be several MB) to CSV files for download.
The Java method dealing with this receives the content as a StringBuilder, which originally (by the creator of this module) was being sent char by char in a loop:
response.getOutputStream().write(content.charAt(i))
Upon questioning the loop, the reason given was that he thought the string might be too big to write in one go (this is on Java 1.6).
I can't find any size restriction anywhere, and there's also the question of which method to use instead: print(), or getWriter()?
The data in the string is all text.
He assumed wrong. If anything, writing one character at a time is inefficient, or at least pointless. If you have the String in memory already, you can write it out in one go without worrying.
If you're only writing text, use a Writer. OutputStream is for binary data (although you can wrap it in an OutputStreamWriter to convert between the two). See Writer or OutputStream?
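For example, with any Writer (a servlet response's getWriter() behaves the same way) the whole string goes out in one call; the class and method names here are made up for illustration:

```java
import java.io.IOException;
import java.io.Writer;

public class OneShotWrite {

    // Writes the whole content in a single call instead of char by char.
    public static void writeAll(Writer out, CharSequence content) throws IOException {
        out.write(content.toString()); // one call, no per-character loop
        out.flush();
    }
}
```

The single write lets the underlying stream move the data in large chunks, instead of paying per-call overhead for every character.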
Right now I am using a socket and a DatagramPacket. This program is for a LAN and sends at least 30 packets a second, of 500 bytes maximum.
This is how I am receiving my data:
payload = new String(incomingPacket.getData(), incomingPacket.getOffset(), incomingPacket.getLength(), "UTF-8");
Currently I am using no offset and I parse through the characters one by one. I use the first 2 characters to determine what type of message it is (though that is subject to change), then I break the rest down into variables, using an exclamation mark to tell me where the next variable begins. At the end I parse it and apply it to my program. Is there a faster way to break down and interpret datagram packets? Will there be a performance difference if I put the length of the variables in the offset? Maybe an example would be useful. Also, I think my variables are too small to benefit from StringBuilder, so I use normal concatenation.
What you are talking about here is setting up your own protocol for communication. While I have this planned as the fourth part of my socket tutorial (I'm currently working on part 3, non-blocking sockets), I can explain some things here already.
There are several ways of setting up such a protocol, depending on your needs.
One way of doing it is having a byte in front of each piece of data declaring its size in bytes. That way, you know the length of the byte array containing the next variable's value, which makes it easy to read out whole variables in one go via System.arraycopy. This is a method I've used before. If the object being sent is always the same, this is all you need to do: write the variables in a standardized order, prefix each value with its size, and you're done.
If you have to send multiple types of objects through the stream, you might want to add a bit of metadata telling the receiver what kind of object is being sent and the order of its variables. This metadata can be put in a header that you add before the actual message. Once again, the values in the header are preceded by a size byte.
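A rough sketch of that size-prefix idea, assuming single-byte lengths and UTF-8 string fields (the class and method names are made up for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SizePrefixed {

    // Each field is written as one length byte followed by that many data bytes.
    public static byte[] encode(List<String> fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String field : fields) {
            byte[] value = field.getBytes(StandardCharsets.UTF_8);
            if (value.length > 255) {
                throw new IllegalArgumentException("field too long for one size byte");
            }
            out.write(value.length);           // size byte
            out.write(value, 0, value.length); // field bytes
        }
        return out.toByteArray();
    }

    // Walks the packet, reading each size byte and slicing out the field after it.
    public static List<String> decode(byte[] packet) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (pos < packet.length) {
            int len = packet[pos++] & 0xFF;    // size byte is unsigned
            fields.add(new String(packet, pos, len, StandardCharsets.UTF_8));
            pos += len;
        }
        return fields;
    }
}
```

Unlike the exclamation-mark delimiter from the question, the length prefix never clashes with the payload, so fields can contain any character.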
In my tutorial I'll write up a complete code example.
Don't use a String at all. Just process the underlying byte array directly: scan it for delimiters, counts, what have you. You can use a DataInputStream wrapped around a ByteArrayInputStream wrapped around the byte array if you want an API oriented to Java primitive datatypes.
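For example, a sketch of that wrapping, assuming the byte[] holds big-endian ints (the layout DataOutputStream would produce); the class name is made up:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class PrimitiveView {

    // Reads `count` ints straight out of a raw byte[] without ever
    // building a String.
    public static int[] readInts(byte[] data, int count) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = in.readInt(); // big-endian 4-byte int
        }
        return values;
    }
}
```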
I'm working on a Java program where I'm reading from a file in dynamic, unknown blocks. That is, each block of data will not always be the same size and the size is determined as data is being read. For I/O I'm using a MappedByteBuffer (the file inputs are on the order of MB).
My goal:
Find an efficient way to store each complete block during the input phase so that I can process it.
My constraints:
I am reading one byte at a time from the buffer
My processing method takes a primitive byte array as input
Each block gets processed before the next block is read
What I've tried:
I played around with dynamic structures like Lists, but they don't have primitive backing arrays, and the conversion time to a primitive array concerns me
I also thought about using a String to store each block and then getBytes() to get the byte[], but it's too slow
Reading the file multiple times to find the block size first, and then grabbing the relevant bytes
I am trying to find an approach that doesn't defeat the purpose of fast I/O. Any advice would be greatly appreciated.
Additional Info:
I'm using a rolling hash to decide where blocks should end
Here's a bit of pseudo-code:
circular_buffer[] = read first 128 bytes
rolling_hash = hash(circular_buffer[])
block_storage = ???  // this is the data structure I'd like to use
while file has more data
    b = next byte
    add b to block_storage
    add b to next index in circular_buffer (if reached end, start adding/overwriting at the front)
    shift rolling_hash one byte to the right
    if hash has a certain characteristic
        process block_storage as a byte[]  // should contain the entire block of data
As you can see, I'm reading one byte at a time, and storing/overwriting that one byte repeatedly. However, once I get to the processing stage, I want to be able to access all of the info in the block. There is no predetermined max size of a block either, so I can't pre-allocate.
It seems to me that you require a dynamically growing buffer. You can use the built-in ByteArrayOutputStream to achieve that. It will automatically grow to store all data written to it. You can use write(int b) and toByteArray() to realize add b to block_storage and process block_storage as a byte[].
But take care: this stream will grow unbounded. You should implement some sanity checks around it to avoid using up all memory (e.g. count the bytes written to it and break by throwing an exception when it exceeds a reasonable amount). Also make sure to close the stream and throw away the reference to it after consuming each block, to allow the GC to free up the memory.
Edit: as @marcman pointed out, the buffer can be reset().
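A small sketch of that idea, with a hypothetical sanity limit; collectBlock stands in for the byte-at-a-time read loop from the pseudo-code:

```java
import java.io.ByteArrayOutputStream;

public class BlockStorageDemo {

    private static final int MAX_BLOCK = 1 << 20; // sanity limit: 1 MiB, pick your own

    // Appends one byte at a time (like `add b to block_storage`), then hands
    // the finished block back as a primitive byte[].
    public static byte[] collectBlock(byte[] input, int from, int to) {
        ByteArrayOutputStream blockStorage = new ByteArrayOutputStream();
        for (int i = from; i < to; i++) {
            if (blockStorage.size() >= MAX_BLOCK) {
                throw new IllegalStateException("block exceeds sanity limit");
            }
            blockStorage.write(input[i]);          // add b to block_storage
        }
        byte[] block = blockStorage.toByteArray(); // process block_storage as byte[]
        blockStorage.reset();                      // reuse the buffer for the next block
        return block;
    }
}
```

reset() keeps the already-grown internal array, so after the first few blocks no further allocation happens on the hot path.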
Sorry if this question is a duplicate, but I didn't get the answer I was looking for.
The Java docs say this:
In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream. It is therefore advisable to wrap a BufferedReader around any Reader whose read() operations may be costly, such as FileReaders and InputStreamReaders. For example,
BufferedReader in = new BufferedReader(new FileReader("foo.in"));
will buffer the input from the specified file. Without buffering, each invocation of read() or readLine() could cause bytes to be read from the file, converted into characters, and then returned, which can be very inefficient.
My first question is: if BufferedReader can read bytes, then why can't we work on images (which are bytes) using BufferedReader?
My second question is: does BufferedReader store characters in a BUFFER, and what is the meaning of this line:
will buffer the input from the specified file.
My third question is: what is the meaning of this line:
In general, each read request made of a Reader causes a corresponding read request to be made of the underlying character or byte stream.
There are two questions here.
1. Buffering
Imagine you lived a mile from your nearest water source, and you drink a cup of water every hour. Well, you wouldn't walk all the way to the water for every cup. Go once a day, and come home with a bucket full of enough water to fill the cup 24 times.
The bucket is a buffer.
Imagine your village is supplied water by a river. But sometimes the river runs dry for a month; other times the river brings so much water that the village floods. So you build a dam, and behind the dam there is a reservoir. The reservoir fills up in the rainy season and gradually empties in the dry season. The village gets a steady flow of water all year round.
The reservoir is a buffer.
Data streams in computing are similar to both those scenarios. For example, you can get several kilobytes of data from a filesystem in a single OS system call, but if you want to process one character at a time, you need something similar to a reservoir.
A BufferedReader contains within it another Reader (for example a FileReader), which is the river, and an array of characters, which is the reservoir. Every time you read from it, it does something like:
if there are not enough characters in the "reservoir" to fulfil this request
    top up the "reservoir" by reading from the underlying Reader
endif
return some characters from the "reservoir"
However when you use a BufferedReader, you don't need to know how it works, only that it works.
2. Suitability for images
It's important to understand that BufferedReader and FileReader are examples of Readers. You might not have covered polymorphism in your programming education yet, so when you do, remember this. It means that if you have code which uses FileReader -- but only the aspects of it that conform to Reader -- then you can substitute a BufferedReader and it will work just the same.
It's a good habit to declare variables as the most general class that works:
Reader reader = new FileReader(file);
... because then this would be the only change you need to add buffering:
Reader reader = new BufferedReader(new FileReader(file));
I took that detour because it's all Readers that are less suitable for images.
Reader has two read methods:
int read(); // returns one character, cast to an int
int read(char[] block); // reads into block, returns how many chars it read
The second form is unsuitable for images because it definitely reads chars, not bytes.
The first form looks as if it might be OK -- after all, it reads ints. And indeed, if you just use a FileReader, it might well work.
However, think about how a BufferedReader wrapped around a FileReader will work. The first time you call BufferedReader.read(), it will call FileReader.read(buffer) to fill its buffer. Then it will cast the first char of the buffer back to an int, and return that.
Especially when you bring multi-byte charsets into the picture, that can cause problems.
So if you want to read integers, use InputStream not Reader. InputStream has int read(byte[] buf, int offset, int length) -- bytes are much more reliably cast back and forth from int than chars.
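For instance, a minimal sketch showing that an InputStream hands back the raw byte values unchanged, with no charset decoding in between (the class name is made up):

```java
import java.io.IOException;
import java.io.InputStream;

public class RawBytes {

    // Each read() returns the next byte as an unsigned int in 0..255,
    // or -1 at end of stream; no character decoding is involved.
    public static int[] readUnsigned(InputStream in, int count) throws IOException {
        int[] values = new int[count];
        for (int i = 0; i < count; i++) {
            values[i] = in.read();
        }
        return values;
    }
}
```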
Readers (and Writers) in Java are specialized classes for dealing with text (character) streams; the concept of a line is meaningless in any other type of stream.
For the general I/O equivalent, have a look at BufferedInputStream.
So, to answer your questions:
While the reader does eventually read bytes, it converts them to characters. It is not intended to read anything else (like images); use the InputStream family of classes for that.
A buffered reader will read large blocks of data from the underlying stream (which may be a file, socket, or anything else) into an in-memory buffer and will then serve read requests from this buffer until the buffer is emptied. Reading large chunks instead of many small ones improves performance.
It means that if you don't wrap a reader in a buffered reader, then every time you want to read a single character, it will access the disk/network to get just that single character. Doing I/O in such small chunks is usually terrible for performance.
The default behaviour is to convert bytes to characters, but an image is not character data; it is pixel/byte data, so you cannot use a Reader for it.
It is buffering, meaning it reads a chunk of data into a char array. You can see this behaviour in the code:
public BufferedReader(Reader in) {
    this(in, defaultCharBufferSize);
}
and the defaultCharBufferSize is as mentioned below:
private static int defaultCharBufferSize = 8192;
3. Every read operation returns only one character, but that character is served from the in-memory buffer, not directly from the underlying stream.
So in a nutshell, buffered means it reads a chunk of character data into a char array, serves read requests from that array, and reads the next chunk when the array is exhausted, until it reaches the end of the stream.
You can refer to the following to learn more:
BufferedReader
The question may be quite vague; let me expound on it here.
I'm developing an application in which I'll be reading data from a file. I have a FileReader class which opens the file in the following fashion:
currentFileStream = new FileInputStream(currentFile);
fileChannel = currentFileStream.getChannel();
Data is read as follows:
bytesRead = fileChannel.read(buffer); // Data is buffered using a ByteBuffer
I'm processing the data in one of two forms: one is binary and the other is character.
If it's processed as character data, I do an additional step of decoding the ByteBuffer into a CharBuffer:
CoderResult result = decoder.decode(byteBuffer, charBuffer, false);
Now my problem is that I need to reposition the file to some offset during recovery mode, in case of a failure or crash in the application.
For this, I maintain a byteOffset which keeps track of the number of bytes processed in binary mode, and I persist this variable.
If something happens, I reposition the file like this:
fileChannel.position(byteOffset);
which is pretty straightforward.
But if the processing mode is character, I maintain a recordOffset which keeps track of the character position/offset in the file. During recovery I make repeated read() calls internally until I reach the persisted recordOffset + 1.
Is there any way to get the number of bytes that were needed to decode those characters? For instance, if recordOffset is 400, its corresponding byteOffset might be 410 or 480 (depending on the charset). Then while repositioning I could do this:
fileChannel.position(recordOffset); // recordOffset's equivalent value in number of bytes
instead of making repeated calls internally in my application.
Another approach I could think of was using InputStreamReader's skip method.
If there is a better approach, or if it is possible to get a byte-to-character mapping, please let me know.