Java: How to find string patterns in a LARGE binary file?

I'm trying to write a program that will read a VERY LARGE binary file, find the occurrences of 2 different strings, and then print the indexes that match the patterns. For example's sake, let's assume the character sequences are [H,e,l,l,o] and [H,e,l,l,o, ,W,o,r,l,d].
I was able to code this for small binary files because I was reading each character as a byte and saving it in an ArrayList. Then, starting from the beginning of the ArrayList, I was comparing the byte array (byte[] data) with the byte[] pattern.
I need a way to do the same WITHOUT reading the entire binary file into memory for the comparison. That means I should be able to compare while reading each character (I should not keep the entire binary file in memory). Assume the binary file only contains characters.
Any suggestions on how this can be achieved? Thank you all in advance.

It seems like you are really looking for the Aho-Corasick string-matching algorithm.
The algorithm builds an automaton from your dictionary of patterns, and then lets you find all matches in a single scan of the input.
The Wikipedia article links to a Java implementation.
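For illustration, here is a minimal sketch of such an automaton built directly over bytes (the class and method names are my own, not taken from the linked implementation):
import java.io.*;
import java.util.*;

// Minimal Aho-Corasick sketch: a goto/fail automaton over raw bytes.
class AhoCorasick {
    static final int ALPHABET = 256;
    private final List<int[]> next = new ArrayList<>();        // transition table
    private final List<Integer> fail = new ArrayList<>();      // failure links
    private final List<List<Integer>> out = new ArrayList<>(); // pattern ids ending at a node
    private int state = 0;                                     // current scan state

    AhoCorasick(byte[][] patterns) {
        newNode();
        for (int id = 0; id < patterns.length; id++) {         // build the trie
            int s = 0;
            for (byte b : patterns[id]) {
                int c = b & 0xFF;
                if (next.get(s)[c] == -1) next.get(s)[c] = newNode();
                s = next.get(s)[c];
            }
            out.get(s).add(id);
        }
        Deque<Integer> q = new ArrayDeque<>();                 // BFS to fill in failure links
        for (int c = 0; c < ALPHABET; c++) {
            int u = next.get(0)[c];
            if (u == -1) next.get(0)[c] = 0;
            else { fail.set(u, 0); q.add(u); }
        }
        while (!q.isEmpty()) {
            int u = q.remove();
            for (int c = 0; c < ALPHABET; c++) {
                int v = next.get(u)[c];
                if (v == -1) next.get(u)[c] = next.get(fail.get(u))[c];
                else {
                    fail.set(v, next.get(fail.get(u))[c]);
                    out.get(v).addAll(out.get(fail.get(v)));
                    q.add(v);
                }
            }
        }
    }

    private int newNode() {
        int[] row = new int[ALPHABET];
        Arrays.fill(row, -1);
        next.add(row); fail.add(0); out.add(new ArrayList<>());
        return next.size() - 1;
    }

    // Advance by one byte; returns ids of patterns ending at this byte.
    List<Integer> step(byte b) {
        state = next.get(state)[b & 0xFF];
        return out.get(state);
    }

    // Usage: scan a stream and print the start offset of every match.
    static void scan(InputStream in, AhoCorasick ac, byte[][] pats) throws IOException {
        for (long pos = 0; ; pos++) {
            int b = in.read();
            if (b == -1) break;
            for (int id : ac.step((byte) b))
                System.out.println("pattern " + id + " found at " + (pos - pats[id].length + 1));
        }
    }
}
Feeding the stream through step() one byte at a time reports every pattern ending at the current offset, so the whole file is scanned exactly once with no buffering of the file itself.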

Google "finite state machine".
Or, read the file one byte at a time. If the byte doesn't match the first character of the search term, go on to the next byte. If it does match, you're now looking for the next character in the sequence; that is, your state has gone from 0 to 1. If your state reaches the length of the search string, you found it!
Implementation/debugging left to the reader.
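Here is a hedged sketch of that state machine (all names are illustrative). One subtlety the description glosses over: on a partial-match failure you cannot simply drop the consumed bytes, or you would miss overlapping candidates such as "aab" in "aaab"; a PushbackInputStream lets you re-examine them:
import java.io.*;

// Sketch of the byte-at-a-time state machine. On a partial-match failure,
// the already-consumed bytes are pushed back (minus one) so overlapping
// candidates are not skipped.
public class FsmSearch {
    public static long indexOf(InputStream raw, byte[] pat) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, pat.length + 1);
        byte[] seen = new byte[pat.length];
        int state = 0;      // how many pattern bytes matched so far
        long pos = 0;       // offset of the start of the current attempt
        int b;
        while ((b = in.read()) != -1) {
            if ((byte) b == pat[state]) {
                seen[state++] = (byte) b;
                if (state == pat.length) return pos;   // full match
            } else if (state > 0) {
                in.unread(b);                          // re-examine this byte
                in.unread(seen, 1, state - 1);         // and all but the first matched byte
                state = 0;
                pos++;                                 // retry one byte further on
            } else {
                pos++;
            }
        }
        return -1;
    }
}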

There are specialised algorithms for this but let's try a simple one first.
You can start by making the comparison on the fly, always after reading the next byte. Once you do that, it's easy to spot that you never need to keep any bytes older than your longest pattern.
So you can just use a buffer that is as long as your longest pattern, put new bytes in at one end and drop them at the other.
As I said, there are more efficient algorithms than this, but it's a good start.
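A rough sketch of that idea, shown for a single pattern (with several patterns, size the window to the longest one and compare each pattern against the window's tail):
import java.io.*;
import java.util.*;

// Sliding-window sketch: keep only the last pattern.length bytes,
// compare the window against the pattern after each read.
public class SlidingWindowSearch {
    public static List<Long> find(InputStream in, byte[] pattern) throws IOException {
        List<Long> hits = new ArrayList<>();
        byte[] window = new byte[pattern.length];
        int filled = 0;      // how much of the window is valid so far
        long pos = 0;        // absolute offset of the current byte
        for (int b; (b = in.read()) != -1; pos++) {
            // drop the oldest byte at one end, append the new byte at the other
            System.arraycopy(window, 1, window, 0, window.length - 1);
            window[window.length - 1] = (byte) b;
            if (filled < window.length) filled++;
            if (filled == window.length && Arrays.equals(window, pattern))
                hits.add(pos - pattern.length + 1);
        }
        return hits;
    }
}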

Use a FileInputStream wrapped in a BufferedInputStream and compare each byte. Keep a buffer the length of the sequence you're looking for so you can backtrack if it doesn't match at some point. If the sequence you're looking for is too large for that, you could save the offset and re-open the file for reading.
Working with streams: http://docs.oracle.com/javase/tutorial/essential/io/
String matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm
Or if you just want something to copy and paste you could look at this SO question.
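In the meantime, here is a hedged sketch of the mark()/reset() backtracking described above (the file name is illustrative):
import java.io.*;

// Sketch of buffered backtracking: mark() before each match attempt,
// reset() on a mismatch, then advance a single byte and try again.
public class MarkResetSearch {
    public static long indexOf(String file, byte[] pat) throws IOException {
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(file))) {
            long pos = 0;                        // offset of the current attempt
            while (true) {
                in.mark(pat.length + 1);         // remember where this attempt starts
                int i = 0;
                while (i < pat.length) {
                    int b = in.read();
                    if (b == -1) return -1;      // too few bytes left for any match
                    if (b != (pat[i] & 0xFF)) break;
                    i++;
                }
                if (i == pat.length) return pos; // full match
                in.reset();                      // back to the start of the attempt
                in.read();                       // skip one byte, retry at pos + 1
                pos++;
            }
        }
    }
}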

Related

Is there any encoding that would let me safely write and read any 8-bit code (all 256 values, not just the 128 ASCII ones) to and from a file?

I am trying to implement Huffman tree compression. It works by giving codes shorter than 8 bits to the most common characters in a text document and longer codes to the less common characters. Then a binary tree is encoded that lets you navigate down, with 1's telling you to go left and 0's telling you to go right, until you reach a character.
So obviously there are chunks that aren't 8 bits long. I have been padding them with 0's at the end as needed and converting them to characters. However, I just found that Java writes up to 3 bytes per character. Because this is about compression, I obviously want one byte.
The problem is that I don't know which bytes will end up being written: three different sub-8-bit codes might get mushed together. I need to be able to write any code to the file, but some byte sequences are invalid as characters, so my entire approach is gummed up.
Is there any way to let any byte sequence be valid in a certain section of the file and store it literally as it is, without worrying about a character ending the file prematurely or causing other mischief? I am coding on a Mac, so that is a problem, unlike on Windows where the length of the file is kept at the beginning so no end-of-file character is needed.
If there is no direct solution here, then perhaps I could make my own encoding that would not end the file prematurely and nest it inside a more common one?
This looks like a good use case for Java's BitSet: https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html
When writing the data out to a file, output the number of bits that were encoded; after that you only need the serialized stream of bits.
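For example, something along these lines (a sketch assuming Java 9+ for readAllBytes(); the key point is to use raw byte streams, which never reinterpret byte values the way character Writers do):
import java.io.*;
import java.util.BitSet;

// Sketch of the BitSet suggestion: prefix the bit count, then dump raw bytes.
public class BitSetIO {
    static void write(File f, BitSet bits, int nBits) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.writeInt(nBits);            // how many bits are meaningful
            out.write(bits.toByteArray());  // raw bytes, all 256 values are legal
        }
    }

    static BitSet read(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            int nBits = in.readInt();
            BitSet bits = BitSet.valueOf(in.readAllBytes());
            return bits.get(0, nBits);      // drop any padding bits at the end
        }
    }
}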

How does a parser's buffer work? Matching a regex

One of my students has a task, part of which is to check whether a file contains a string matching a regex.
The trick is that his teacher has forbidden reading the whole file at once and then parsing it. Instead, he said, a buffer is supposed to be used. The problem is that you never know how much of the input you are supposed to read from the file: there might be a matching sequence if you read just one more character.
So the teacher wrote (translated):
Use a technique known from parsers:
rewrite the second half of the buffer to the first half of the buffer
read the next part of the file into the second half
check if the whole buffer contains a matching sequence
So how is this supposed to be done (the idea)? In my opinion it does not solve the problem stated above, and it seems pretty stupid and wasteful.
A Matcher does use an internal buffer of some kind, certainly. But if you look at the prototype to build a Matcher, you see that the only thing it takes as an argument is a simple CharSequence, which has only three operations:
knowing its length,
getting one character at a given offset,
getting a subsequence (another CharSequence).
When reading from a file, one possibility is to map the whole file using FileChannel.map(), then use an appropriate CharsetDecoder to read into a CharBuffer (which implements CharSequence). Or do that in chunks...
... Or use yours truly's crazy idea: this! I have tested it on 800+ MiB files and it works...
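A minimal sketch of the mapping approach mentioned above, assuming an ASCII file that fits in a single 2 GiB mapping (for anything bigger, decode in chunks as suggested; note that decoding still allocates a char buffer roughly the size of the file):
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MappedRegex {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("big.txt"), StandardOpenOption.READ)) {
            MappedByteBuffer bytes = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // CharBuffer implements CharSequence, so Matcher accepts it directly
            CharBuffer chars = StandardCharsets.US_ASCII.decode(bytes);
            Matcher m = Pattern.compile("Hello").matcher(chars);
            while (m.find()) System.out.println("match at " + m.start());
        }
    }
}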
What your teacher is saying:
The regex will never need to match anything longer than half the length of the buffer.
The match could lie on a buffer boundary, hence you need to shift.
That seems realistic.
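A hedged sketch of the teacher's technique (buffer size and names are my own); under the stated assumption it finds any match of length at most half the buffer, because every such match fits entirely inside some snapshot of the buffer:
import java.io.*;
import java.nio.CharBuffer;
import java.util.regex.Pattern;

// Half-buffer shifting: scan the buffer, move the second half to the
// first, refill the second half, and repeat until EOF.
public class HalfBufferScan {
    public static boolean containsMatch(Reader r, Pattern p, int half) throws IOException {
        char[] buf = new char[2 * half];
        int valid = fill(r, buf, 0);                       // initial fill of both halves
        while (true) {
            if (p.matcher(CharBuffer.wrap(buf, 0, valid)).find()) return true;
            if (valid < buf.length) return false;          // EOF reached, nothing found
            System.arraycopy(buf, half, buf, 0, half);     // shift second half to first
            valid = half + fill(r, buf, half);             // refill the second half
        }
    }

    private static int fill(Reader r, char[] buf, int off) throws IOException {
        int total = 0, n;
        while (off + total < buf.length
                && (n = r.read(buf, off + total, buf.length - off - total)) != -1)
            total += n;
        return total;
    }
}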
A BufferedReader reading line-wise does not seem entirely fitting here, since the input is not line-oriented. You might consider a byte array and a BufferedInputStream instead.

Regex search pattern in very large file

I'd like to search for a pattern in a very large file (e.g. above 1 GB) that consists of a single line.
It is not possible to load it into memory. Currently, I use a BufferedReader to read into buffers (1024 chars).
The main steps:
Read data into two buffers
Search for the pattern in those buffers
Increment a counter if the pattern was found
Copy the second buffer into the first
Load data into the second buffer
Search for the pattern in both buffers
Increment a counter if the pattern was found
Repeat the above steps (starting from step 4) until EOF
That two-buffer algorithm lets me avoid the situation where the searched piece of text is split across chunks. It works like a charm as long as the match is smaller than two buffers' length. For example, I can't manage the case when the match is longer, say as long as 3 buffers (I only have data in two buffers, so the match will fail!). What's more, I can construct such a case:
Prepare a 1 GB single-line file that consists of "baaaaaaa(....)aaaaab"
Search for the pattern ba*b.
The whole file matches the pattern!
I don't have to print the result; I only have to be able to say: "Yes, I was able to find the pattern" or "No, I wasn't able to find it".
Is this possible with Java? I mean:
The ability to determine whether a pattern is present in the file (without loading the whole line into memory, see the case above)
Finding a way to handle the case when the match is longer than a chunk.
I hope my explanation is pretty clear.
I think the solution for you would be to implement CharSequence as a wrapper over very large text files.
Why? Because building a Matcher from a Pattern takes a CharSequence as an argument.
Of course, easier said than done... But then you only have three methods to implement, so that shouldn't be too hard...
EDIT I took the plunge and ate my own dog food. The "worst part" is that it actually works!
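A deliberately unoptimized sketch of such a wrapper, assuming a single-byte charset so one byte equals one char; a production version would cache decoded blocks instead of seeking per character, and note that length() is capped at Integer.MAX_VALUE:
import java.io.*;

// CharSequence over a large ASCII file, backed by a RandomAccessFile.
public class FileCharSequence implements CharSequence {
    private final RandomAccessFile raf;
    private final long start, end;

    public FileCharSequence(RandomAccessFile raf) throws IOException {
        this(raf, 0, raf.length());
    }
    private FileCharSequence(RandomAccessFile raf, long start, long end) {
        this.raf = raf; this.start = start; this.end = end;
    }
    @Override public int length() { return (int) (end - start); }
    @Override public char charAt(int index) {
        try {
            raf.seek(start + index);
            return (char) raf.read();   // single-byte charset: byte value is the char
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
    @Override public CharSequence subSequence(int s, int e) {
        return new FileCharSequence(raf, start + s, start + e);
    }
}
A Matcher built from any Pattern will then happily walk the file through this wrapper without the file ever being loaded whole.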
It seems like you may need to break that search-pattern down into pieces, since, given your restrictions, searching for it in its entirety is failing.
Can you determine that a buffer contains the beginning of a match? If so, save that state and then search the next portion for the next part of the match. Continue until the entire search-term is found.

find string in a very big single-line file

I have a file I need to read that's over 50 GB in size, with all characters on one line.
Now comes the tricky part:
I have to split it on all double-quote characters, find a substring (srsName) and get the element after it, which, in a for loop over the split substrings, has index i+1 ("value").
Question:
Are there some progressive search implementations or other methods that I could use instead of filling up my memory?
To simplify:
There are quite a lot of those srsName substrings inside the file but I need to read just one of those as all of them have the same value following them.
Something about the file:
It's an XML file being prepared for an XSL transformation. I can't use an XSLT that creates indentation because I need to do this with as little disk/memory usage as possible.
This is how the value presents itself inside the file.
<sometag:sometext srsName="value">
One way to speed up your search in a massive file is adapting a fast in-memory search algorithm to searching in a file.
One particularly fast algorithm is Knuth–Morris–Pratt: it looks at each character at most twice, and requires a small preprocessing step to construct the "jump table" that tells you to what position you should move to continue your search. That table is constructed in such a way as to not have you jump too far back, so you can do your search by keeping a small "search window" of your file in memory: since you are looking for a word of only seven characters, it is sufficient to keep only the last six characters in memory as your search progresses through the file.
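A hedged sketch of streaming Knuth–Morris–Pratt over a Reader (names are illustrative): one pass over the file, memory proportional only to the pattern:
import java.io.*;

// Streaming KMP: the failure table ("jump table") means no input
// character is ever re-read, so no file buffering is needed at all.
public class KmpStream {
    public static long indexOf(Reader in, String pattern) throws IOException {
        char[] pat = pattern.toCharArray();
        int[] fail = new int[pat.length];          // longest proper prefix-suffix lengths
        for (int i = 1, k = 0; i < pat.length; i++) {
            while (k > 0 && pat[i] != pat[k]) k = fail[k - 1];
            if (pat[i] == pat[k]) k++;
            fail[i] = k;
        }
        long pos = 0;
        int state = 0;                             // chars of the pattern matched so far
        for (int c; (c = in.read()) != -1; pos++) {
            while (state > 0 && c != pat[state]) state = fail[state - 1];
            if (c == pat[state]) state++;
            if (state == pat.length) return pos - pat.length + 1;
        }
        return -1;
    }
}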
You could try using a BufferedReader - http://download.oracle.com/javase/6/docs/api/java/io/BufferedReader.html
This would allow you to specify the number of characters to read into memory at once using the read method.
I've done it like this:
import java.io.*;

String value = null;
try (BufferedReader br = new BufferedReader(new FileReader("huge.xml"))) { // file name is illustrative
    StringBuilder window = new StringBuilder();
    int c;
    while ((c = br.read()) != -1) {
        window.append((char) c);
        if (window.length() > 30) window.deleteCharAt(0);        // keep a 30-char sliding window
        if (window.length() == 30 && window.indexOf("srsName") == 0) {
            value = window.toString().split("\"")[1];            // the text between the quotes
            break;                                               // first hit is enough
        }
    }
}

PushbackReader Equivalents

A somewhat vague question, I apologize in advance.
I'm building the tokenizing portion of a small parser with the help of the book Building Parsers with Java. It uses a PushbackReader, and the String contained within, to detect the first character of the given string and then send the PushbackReader to the appropriate state (the state then builds the token as a separate object containing a String).
The PushbackReader seems to be used only when no further characters of use are found in the stream; it then unreads the last character.
Is it possible to do the same thing with a CharBuffer's append? Preferably something that doesn't require the buffer to be predefined.
Based on what I see, he chose PushbackReader for two reasons:
He needed a reader that could handle individual characters.
He needed to back up in the stream, because when tokenizing he needed to look one or more characters ahead to decide whether the current char was part of the token.
For example with the method WhitespaceState.nextToken he is skipping whitespace characters. He pulls off a character and looks at it. If it is a whitespace char he pulls the next char. When he finally pulls a character that is not whitespace, he puts it back into the stream so the next method that looks at the stream will be looking at the correct character.
While you could replace it with something simpler that has just two methods, read() and unread(), remember that by doing so you will probably be either:
Reading in the entire input and then processing it, so a large file will eat up memory, or
Reading the input once as a stream, but storing the char(s) from unread() and passing them around in a separate structure.
With a PushbackReader he reads and processes the input once; he does not have to buffer the entire input, nor does he have to store the unread() characters and pass them around separately.
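For illustration, a minimal sketch of the whitespace-skipping idea described above (names are my own, not the book's):
import java.io.*;

// Consume whitespace, then push the first non-whitespace character
// back so the next tokenizer state sees it as its first character.
public class WhitespaceSkipper {
    static void skipWhitespace(PushbackReader in) throws IOException {
        int c;
        while ((c = in.read()) != -1 && Character.isWhitespace(c)) {
            // keep consuming whitespace
        }
        if (c != -1) in.unread(c);   // not whitespace: hand it back to the stream
    }

    public static void main(String[] args) throws IOException {
        PushbackReader r = new PushbackReader(new StringReader("   foo"));
        skipWhitespace(r);
        System.out.println((char) r.read());   // prints 'f'
    }
}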
