Regex search pattern in very large file - java

I'd like to search for a pattern in a very large file (e.g. above 1 GB) that consists of a single line.
It is not possible to load it into memory. Currently, I use a BufferedReader to read into buffers (1024 chars).
The main steps:
1. Read data into two buffers
2. Search for the pattern in those buffers
3. Increment a variable if the pattern was found
4. Copy the second buffer into the first
5. Load data into the second buffer
6. Search for the pattern in both buffers
7. Increment a variable if the pattern was found
8. Repeat the above steps (starting from 4) until EOF
That algorithm (two buffers) lets me avoid the situation where the searched piece of text is split across chunks. It works like a charm as long as the match is shorter than the combined length of the two buffers. However, I can't handle the case where the match is longer, say as long as 3 buffers (I only have data in two buffers, so the match will fail!). What's more, I can imagine such a case:
Prepare a 1 GB single-line file that consists of "baaaaaaa(....)aaaaab"
Search for the pattern ba*b.
The whole file matches the pattern!
I don't have to print the result; I only have to be able to say: "Yea, I was able to find pattern" or "No, I wasn't able to find that".
Is it possible with Java? I mean:
The ability to determine whether a pattern is present in the file (without loading the whole line into memory, see the case above)
A way to handle the case where the match is longer than a chunk.
I hope my explanation is pretty clear.

I think the solution for you would be to implement CharSequence as a wrapper over very large text files.
Why? Because building a Matcher from a Pattern takes a CharSequence as an argument.
Of course, easier said than done... But then you only have three methods to implement, so that shouldn't be too hard...
EDIT I took the plunge and I ate my own dog's food. The "worst part" is that it actually works!
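A minimal sketch of the CharSequence idea, not the answerer's actual implementation. It assumes the file uses a single-byte encoding (each byte is treated as one ISO-8859-1 character) and is smaller than 2 GiB, because CharSequence indexes with int and a single mapping cannot be larger. The class name MappedFileCharSequence and its open method are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical read-only CharSequence view over a memory-mapped file (assumes one byte per char). */
final class MappedFileCharSequence implements CharSequence {
    private final ByteBuffer bytes; // the mapped file contents, paged in by the OS on demand
    private final int start;        // inclusive start of this view
    private final int end;          // exclusive end of this view

    private MappedFileCharSequence(ByteBuffer bytes, int start, int end) {
        this.bytes = bytes;
        this.start = start;
        this.end = end;
    }

    static MappedFileCharSequence open(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size > Integer.MAX_VALUE) {
                throw new IOException("file too large for a single mapping: " + size);
            }
            // The mapping stays valid after the channel is closed.
            ByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            return new MappedFileCharSequence(mapped, 0, (int) size);
        }
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return (char) (bytes.get(start + index) & 0xFF); // ISO-8859-1: byte value == code point
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        return new MappedFileCharSequence(bytes, start + from, start + to);
    }

    @Override
    public String toString() {
        // Copies this view onto the heap; only sensible for small subsequences.
        StringBuilder sb = new StringBuilder(length());
        for (int i = 0; i < length(); i++) {
            sb.append(charAt(i));
        }
        return sb.toString();
    }
}
```

With this in place, Pattern.compile("ba*b").matcher(MappedFileCharSequence.open(path)).find() answers the yes/no question without copying the file onto the Java heap. Multi-byte encodings such as UTF-8 would need a real decoder instead of the byte-to-char cast.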

It seems like you may need to break that search-pattern down into pieces, since, given your restrictions, searching for it in its entirety is failing.
Can you determine that a buffer contains the beginning of a match? If so, save that state and then search the next portion for the next part of the match. Continue until the entire search-term is found.
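For the question's example pattern ba*b, the "save that state" idea can be sketched as a hand-written two-state automaton whose state survives from one chunk to the next. This is only an illustration for that one pattern, not a general regex engine; the class name StreamingBaStarB, the method containsMatch and the chunkSize parameter are hypothetical, and chunkSize is assumed to be positive.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical streaming matcher for the single pattern ba*b. */
final class StreamingBaStarB {
    /** Returns true if the input contains a 'b', then zero or more 'a', then a 'b'. */
    static boolean containsMatch(Reader in, int chunkSize) throws IOException {
        char[] chunk = new char[chunkSize];
        boolean afterB = false; // true while we are inside "b a a a ..." waiting for the closing b
        int n;
        while ((n = in.read(chunk)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = chunk[i];
                if (afterB) {
                    if (c == 'b') {
                        return true;     // closing b found: the match is complete
                    }
                    if (c != 'a') {
                        afterB = false;  // anything else breaks the run of a's
                    }
                } else if (c == 'b') {
                    afterB = true;       // a possible match starts here
                }
            }
            // afterB is kept across iterations, so a match may span any number of chunks.
        }
        return false;
    }
}
```

Because only one boolean is carried between chunks, memory use does not depend on how long the match is, which covers the 1 GB "baaa...ab" case.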

Related

Pattern matching in file or string

I have been working on a program which amongst other things will search for repeating patterns within a string.
Finding and counting the matches for each pattern type is the easy part and I can sort from the highest scoring to the lowest scoring based on number of matches found.
Choosing which of the overlapping matches to keep is a bit more difficult: should I remove the leftmost or the rightmost?
Let's say I keep the first match found and remove the rightmost overlapping one, and so on. The issue arises when I move on to the next pattern type and find that it would have been better to remove the leftmost match of the previous pattern type instead. This would have allowed this pattern to fit into the space, etc.
However, when I get to the next set of patterns, it could turn out that leaving things as they were the first time would have been better, etc.
This swinging to and fro could repeat for the entire file.
My question is: are there any algorithms or techniques to calculate best fit for every single pattern while maintaining the most repeated patterns at the top of the list?
Any advice would be much appreciated ;)
Ed
Try to show an example
The only thing that you should do would be (in my opinion):
-Instead of erasing the leftmost or rightmost match, keep them all, and decide what to do only after analyzing all the matches. Deleting without certainty is not a good option.

Java : How to find string patterns in a LARGE binary file?

I'm trying to write a program that will read a VERY LARGE binary file and try to find the occurrence of 2 different strings and then print the indexes that matches the patterns. For the example's sake let's assume the character sequences are [H,e,l,l,o] and [H,e,l,l,o, ,W,o,r,l,d].
I was able to code this for small binary files because I was reading each character as a byte and then saving it in an ArrayList. Then, starting from the beginning of the ArrayList, I was comparing the stored bytes (byte[] data) with the byte[] pattern.
I need to find a way to do the same but WITHOUT writing the entire binary file in memory for comparison. That means I should be able to compare while reading each character (I should not save the entire binary file in memory). Assume the binary file only contains characters.
Any suggestions on how this can be achieved ? Thank you all in advance.
Seems like you are really looking for the Aho-Corasick string matching algorithm.
The algorithm builds an automaton from the dictionary of patterns you have, and then lets you find all matches in a single scan of your input.
The Wikipedia article links to this Java implementation
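A compact sketch of the Aho-Corasick idea, written for this page and not the implementation the answer links to. It builds a trie with failure links and then reads the stream one byte at a time, so only the automaton is held in memory. The class name AhoCorasickSketch and the "offset:pattern" result strings are hypothetical choices; pattern bytes are assumed to be printable as ISO-8859-1.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical minimal Aho-Corasick matcher over a byte stream. */
final class AhoCorasickSketch {
    private static final class Node {
        final Map<Integer, Node> next = new HashMap<>(); // outgoing trie edges, keyed by byte value
        Node fail;                                        // longest proper suffix that is also a trie path
        final List<byte[]> outputs = new ArrayList<>();   // patterns that end at this node
    }

    private final Node root = new Node();

    AhoCorasickSketch(List<byte[]> patterns) {
        // 1. Insert every pattern into the trie.
        for (byte[] pattern : patterns) {
            Node node = root;
            for (byte b : pattern) {
                node = node.next.computeIfAbsent(b & 0xFF, k -> new Node());
            }
            node.outputs.add(pattern);
        }
        // 2. Compute failure links breadth-first, so shallower nodes are finished first.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Integer, Node> edge : current.next.entrySet()) {
                Node child = edge.getValue();
                Node f = current.fail;
                while (f != root && !f.next.containsKey(edge.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(edge.getKey());
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs); // inherit shorter patterns ending here
                queue.add(child);
            }
        }
    }

    /** Scans the stream once and returns "startOffset:pattern" for every occurrence. */
    List<String> search(InputStream in) throws IOException {
        List<String> hits = new ArrayList<>();
        Node state = root;
        long position = 0; // number of bytes consumed so far
        int b;
        while ((b = in.read()) != -1) {
            while (state != root && !state.next.containsKey(b)) {
                state = state.fail;
            }
            Node target = state.next.get(b);
            state = (target != null) ? target : root;
            position++;
            for (byte[] pattern : state.outputs) {
                hits.add((position - pattern.length) + ":"
                        + new String(pattern, StandardCharsets.ISO_8859_1));
            }
        }
        return hits;
    }
}
```

For a real file, wrapping the FileInputStream in a BufferedInputStream keeps the byte-at-a-time read() calls cheap.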
Google "finite state machine".
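Or read the file one byte at a time. If the byte doesn't match the first character of the search term, go on to the next byte. If it does match, you are now looking for the next character in the sequence, i.e. your state has gone from 0 to 1. If a later byte fails to match, fall back to an earlier state and carry on (simply resetting to 0 can miss a match that overlaps the failed attempt, e.g. "aab" in "aaab"). If your state equals (or passes) the length of the search string, you found it!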
Implementation/debugging left to the reader.
There are specialised algorithms for this but let's try a simple one first.
You can start with making the comparison on the fly, always after reading the next byte. Once you do that, it's easy to spot that you don't need to keep any bytes that are from earlier than your longest pattern.
So you can just use a buffer that is as long as your longest pattern, put new bytes in at one end and drop them at the other.
As I said, there are algorithms more effective than this but it's a good start.
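The simple approach described above could be sketched as follows, for a single pattern and with a circular array as the buffer. The names WindowSearch and offsetsOf are hypothetical; to look for several patterns, the same scan could be run once per pattern, or the window sized to the longest one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical naive search that keeps only the last pattern.length bytes in memory. */
final class WindowSearch {
    /** Returns the start offsets of every occurrence of pattern in the stream. */
    static List<Long> offsetsOf(InputStream in, byte[] pattern) throws IOException {
        List<Long> offsets = new ArrayList<>();
        int m = pattern.length;
        if (m == 0) {
            return offsets; // nothing sensible to report for an empty pattern
        }
        byte[] window = new byte[m]; // circular buffer holding the most recent m bytes
        long count = 0;              // total number of bytes read so far
        int b;
        while ((b = in.read()) != -1) {
            window[(int) (count % m)] = (byte) b; // overwrite the oldest byte
            count++;
            if (count >= m && windowEquals(window, count, pattern)) {
                offsets.add(count - m);
            }
        }
        return offsets;
    }

    /** Compares the window, read from its oldest byte onwards, with the pattern. */
    private static boolean windowEquals(byte[] window, long count, byte[] pattern) {
        int m = pattern.length;
        int oldest = (int) (count % m); // slot that will be overwritten next
        for (int i = 0; i < m; i++) {
            if (window[(oldest + i) % m] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

This does up to pattern.length comparisons per input byte, which is the price of its simplicity compared with the specialised algorithms.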
Use a FileInputStream wrapped in a BufferedInputStream and compare each byte. Keep a buffer the length of the sequence you're looking for so you backtrack if it doesn't match at some point. If the sequence you're looking for is too large, you could save the offset and re-open the file for reading.
Working with streams: http://docs.oracle.com/javase/tutorial/essential/io/
String matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm
Or if you just want something to copy and paste you could look at this SO question.

How parser's buffer works? Matching the regex

One of my students has a task, part of which is to check whether a file contains a string matching a regex.
The trick is that his teacher has forbidden reading the whole file at once and then parsing it. Instead, he said the student is supposed to use a buffer. The problem is that you never know how much input you are supposed to read from the file: there might be a matching sequence if you read just one more character.
So the teacher wrote (translated):
Use a technique known from parsers:
rewrite the second half of the buffer to the first half of the buffer
read the next part of the file into the second half
check whether the whole buffer contains the matching sequence
So how is it supposed to be done (the idea)? In my opinion it does not solve the problem stated above, and it is pretty stupid and wasteful.
A Matcher does use an internal buffer of some kind, certainly. But if you look at the prototype to build a Matcher, you see that the only thing it takes as an argument is a simple CharSequence, which has only three operations:
knowing its length,
getting one character at a given offset,
getting a subsequence (another CharSequence).
When reading from a file, one possibility is to map the whole file using FileChannel.map(), then use an appropriate CharsetDecoder to read into a CharBuffer (which implements CharSequence). Or do that in chunks...
... Or use yours truly's crazy idea: this! I have tested it on 800+ MiB files and it works...
What your teacher is saying:
The regex will never need to match anything longer than half the length of the buffer.
The match could lie on a buffer boundary, hence you need to shift.
That seems realistic.
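The teacher's scheme could be sketched like this, under the assumption stated above that no match is longer than half the buffer. The names SlidingBufferSearch, find, fill and the half parameter are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.CharBuffer;
import java.util.regex.Pattern;

/** Hypothetical half-buffer shifting search; assumes every match fits in `half` characters. */
final class SlidingBufferSearch {
    static boolean find(Reader in, Pattern pattern, int half) throws IOException {
        char[] buffer = new char[2 * half];
        int length = fill(in, buffer, 0, buffer.length); // valid characters in the buffer
        while (true) {
            // CharBuffer is a CharSequence, so the regex runs directly on the window.
            if (pattern.matcher(CharBuffer.wrap(buffer, 0, length)).find()) {
                return true;
            }
            if (length < buffer.length) {
                return false; // the last fill hit end of input, nothing more to read
            }
            // Move the second half to the front, then refill the second half.
            System.arraycopy(buffer, half, buffer, 0, half);
            int read = fill(in, buffer, half, half);
            if (read == 0) {
                return false; // input ended exactly on a window boundary
            }
            length = half + read;
        }
    }

    /** Reads up to `wanted` chars into buffer at `offset`; returns how many were actually read. */
    private static int fill(Reader in, char[] buffer, int offset, int wanted) throws IOException {
        int total = 0;
        while (total < wanted) {
            int n = in.read(buffer, offset + total, wanted - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```

Every stretch of at most half characters lies entirely inside one of the overlapping windows, so such matches cannot be lost on a boundary. A match longer than half can be missed, and anchors or lookbehind may behave differently at window edges than on the whole file.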
A BufferedReader reading line by line does not seem entirely fitting. You might consider a byte array with a BufferedInputStream instead.

find string in a very big single lined file

I have a file I need to read that's over 50 GB in size, with all characters on one line.
Now comes the tricky part:
I have to split it on every double-quote character, find the substring containing srsName, and take the element after it, which in a for loop over the split substrings has index i+1 ("value").
Question:
Are there some progressive search implementations or other methods that I could use instead of filling up my memory?
To simplify:
There are quite a lot of those srsName substrings inside the file but I need to read just one of those as all of them have the same value following them.
Something about the file:
It's an XML file being prepared for an XSL transformation. I can't use an XSLT that creates indentation because I need to do it with as little disk/memory usage as possible.
This is how the value presents itself inside the file.
<sometag:sometext srsName="value">
One way to speed up your search in a massive file is adapting a fast in-memory search algorithm to searching in a file.
One particularly fast algorithm is Knuth–Morris–Pratt: it looks at each character at most twice, and requires a small preprocessing step to construct the "jump table" that tells you to what position you should move to continue your search. That table is constructed in such a way as to not have you jump too far back, so you can do your search by keeping a small "search window" of your file in memory: since you are looking for a word of only seven characters, it is sufficient to keep only the last six characters in memory as your search progresses through the file.
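A sketch of how Knuth-Morris-Pratt might be adapted to a stream. In this version the only state carried between characters is the number of pattern characters matched so far, so no window of the file has to be stored at all. The names StreamingKmp, failureTable and indexOf are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical streaming Knuth-Morris-Pratt search over a Reader. */
final class StreamingKmp {
    /** failure[i] = length of the longest proper prefix of pattern[0..i] that is also its suffix. */
    static int[] failureTable(String pattern) {
        int[] failure = new int[pattern.length()];
        int k = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) {
                k = failure[k - 1];
            }
            if (pattern.charAt(i) == pattern.charAt(k)) {
                k++;
            }
            failure[i] = k;
        }
        return failure;
    }

    /** Returns the offset of the first occurrence of pattern in the stream, or -1 if absent. */
    static long indexOf(Reader in, String pattern) throws IOException {
        if (pattern.isEmpty()) {
            return 0;
        }
        int[] failure = failureTable(pattern);
        int matched = 0;   // how many pattern characters currently match
        long position = 0; // characters consumed so far
        int c;
        while ((c = in.read()) != -1) {
            // On a mismatch, fall back to the longest prefix that can still match.
            while (matched > 0 && c != pattern.charAt(matched)) {
                matched = failure[matched - 1];
            }
            if (c == pattern.charAt(matched)) {
                matched++;
            }
            position++;
            if (matched == pattern.length()) {
                return position - matched;
            }
        }
        return -1;
    }
}
```

After indexOf returns, the reader is positioned just past "srsName", so the caller could keep reading the same reader up to the second double quote to pick up the value. Wrapping the underlying reader in a BufferedReader keeps the per-character read() calls cheap.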
You could try using a BufferedReader - http://download.oracle.com/javase/6/docs/api/java/io/BufferedReader.html
This would allow you to specify the number of characters to read in to memory at once using the read method.
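I've done it like this:
// br.read() may throw IOException, which is left to the caller here.
StringBuilder window = new StringBuilder();
String value = null;
int c;
// Slide a 30-character window over the file, one character at a time.
while ((c = br.read()) != -1) { // stop at end of file instead of looping forever
    window.append((char) c);
    if (window.length() > 30) window.deleteCharAt(0);
    // Wait until the window is full, so the quoted value after srsName is inside it.
    if (window.length() == 30 && window.indexOf("srsName") == 0) {
        value = window.toString().split("\"")[1];
        break;
    }
}
// value stays null if srsName was not found (or only starts in the last 29 characters);
// a value too long to fit in the window is cut off.
where br is my BufferedReader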

Need some ideas on how to accomplish this in Java (parsing strings)

Sorry I couldn't think of a better title, but thanks for reading!
My ultimate goal is to read a .java file, parse it, and pull out every identifier. Then store them all in a list. Two preconditions are there are no comments in the file, and all identifiers are composed of letters only.
Right now I can read the file, parse it by spaces, and store everything in a list. If anything in the list is a java reserved word, it is removed. Also, I remove any loose symbols that are not attached to anything (brackets and arithmetic symbols).
Now I am left with a bunch of weird strings, but at least they have no spaces in them. I know I am going to have to re-parse everything with a . delimiter in order to pull out identifiers like System.out.print, but what about strings like this example:
Logger.getLogger(MyHash.class.getName()).log(Level.SEVERE,
After re-parsing by . I will be left with more crazy strings like:
getLogger(MyHash
getName())
log(Level
SEVERE,
How am I going to be able to pull out all the identifiers while leaving out all the trash? Just keep re-parsing by every symbol that could exist in java code? That seems rather lame and time consuming. I am not even sure if it would work completely. So, can you suggest a better way of doing this?
There are several solutions that you can use, other than hacking together your own parser:
Use an existing parser, such as this one.
Use BCEL to read bytecode, which includes all fields and variables.
Hack into the compiler or run-time, using annotation processing or mirrors - I'm not sure you can find all identifiers this way, but fields and parameters for sure.
I wouldn't separate the entire file at once according to whitespace. Instead, I would scan the file letter-by-letter, saving every character in a buffer until I'm sure an identifier has been reached.
In pseudo-code:
clean buffer
for each letter l in file:
    if l is '
        toggle "character mode"
    if l is "
        toggle "string mode"
    if l is a letter AND "character mode" is off AND "string mode" is off
        add l to end of buffer
    else
        if buffer is NOT a keyword or a literal
            add buffer to list of identifiers
        clean buffer
Notice some lines here hide further complexity - for example, to check if the buffer is a literal you need to check for all of true, false, and null.
In addition, there are more bugs in the pseudo-code - it will identify things like the e and L parts of literals (e in floating-point literals, L in long literals) as well. I suggest adding additional "modes" to take care of them, but it's a bit tricky.
Also there are a few more things if you want to make sure it's accurate - for example you have to make sure you work with unicode. I would strongly recommend investigating the lexical structure of the language, so you won't miss anything.
EDIT:
This solution can easily be extended to deal with identifiers with numbers, as well as with comments.
Small bug above - you need to handle \" differently than ", same with \' and '.
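One way the pseudo-code could look in Java, as a sketch under the question's preconditions (no comments, identifiers made of letters only). It also folds in the fixes mentioned above: escaped quotes inside literals are skipped, and letters that belong to numeric literals (10L, 1e5, 0xFF) are not reported. The class name IdentifierScanner and its reserved-word list are assumptions made for this example, not a complete treatment of Java's lexical structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical letter-by-letter identifier scanner (no comments, letters-only identifiers). */
final class IdentifierScanner {
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
            "abstract", "assert", "boolean", "break", "byte", "case", "catch", "char", "class",
            "const", "continue", "default", "do", "double", "else", "enum", "extends", "final",
            "finally", "float", "for", "goto", "if", "implements", "import", "instanceof", "int",
            "interface", "long", "native", "new", "package", "private", "protected", "public",
            "return", "short", "static", "strictfp", "super", "switch", "synchronized", "this",
            "throw", "throws", "transient", "try", "void", "volatile", "while",
            "true", "false", "null"));

    static List<String> identifiers(String source) {
        List<String> found = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        char quote = 0; // 0 outside a literal, otherwise the quote character that opened it
        // One extra iteration with a '\0' sentinel flushes a word that ends the input.
        for (int i = 0; i <= source.length(); i++) {
            char c = (i < source.length()) ? source.charAt(i) : '\0';
            if (quote != 0) {                 // inside a string or character literal
                if (c == '\\') {
                    i++;                      // skip the escaped character, e.g. \" or \'
                } else if (c == quote) {
                    quote = 0;                // closing quote
                }
                continue;
            }
            if (Character.isLetter(c)) {
                buffer.append(c);
                continue;
            }
            if (buffer.length() > 0) {        // a run of letters just ended
                String word = buffer.toString();
                if (!RESERVED.contains(word)) {
                    found.add(word);
                }
                buffer.setLength(0);
            }
            if (Character.isDigit(c)) {
                // Numeric literal: swallow trailing letters and digits such as L, e5 or xFF.
                while (i + 1 < source.length() && Character.isLetterOrDigit(source.charAt(i + 1))) {
                    i++;
                }
            } else if (c == '"' || c == '\'') {
                quote = c;                    // opening quote
            }
        }
        return found;
    }
}
```

On the example line from the question this yields Logger, getLogger, MyHash, getName, log, Level and SEVERE. Duplicates are kept, so a caller who wants each identifier once could collect the result into a set.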
Wow, ok. Parsing is hard -- really hard -- to do right. Rolling your own java parser is going to be incredibly difficult to do right. You'll find there are a lot of edge cases you're just not prepared for. To really do it right, and handle all the edge cases, you'll need to write a real parser. A real parser is composed of a number of things:
A lexical analyzer to break the input up into logical chunks
A grammar to determine how to interpret the aforementioned chunks
The actual "parser" which is generated from the grammar using a tool like ANTLR
A symbol table to store identifiers in
An abstract syntax tree to represent the code you've parsed
Once you have all that, you can have a real parser. Of course you could skip the abstract syntax tree, but you need pretty much everything else. That leaves you with writing about 1/3 of a compiler. If you truly want to complete this project yourself, you should see if you can find an example for ANTLR which contains a preexisting java grammar definition. That'll get you most of the way there, and then you'll need to use ANTLR to fill in your symbol table.
Alternately, you could go with the clever solutions suggested by Little Bobby Tables (awesome name, btw Bobby).
