A somewhat vague question, I apologize in advance.
I'm building the tokenizing portion of a small parser with help of the book Building Parsers with Java. It uses PushbackReader and the String contained within as a way to first detect the first character of the given string then sends the PushbackReader to the appropriate state (the state then builds the token as a separate object containing a String).
PushbackReader seems to only be used if no other characters of use are found within the the stream. It then unreads the last character.
Is it possible to do the same thing with a CharBuffer's append? Preferably something that doesn't require the buffer to be predefined.
Based on what I see, he chose PushbackReader for two reasons:
He needed a reader that could handle individual characters.
He needed to backup in the stream because when tokenizing he needed to see one character or more ahead to decide if the current char was part of the token.
For example with the method WhitespaceState.nextToken he is skipping whitespace characters. He pulls off a character and looks at it. If it is a whitespace char he pulls the next char. When he finally pulls a character that is not whitespace, he puts it back into the stream so the next method that looks at the stream will be looking at the correct character.
While you could replace it with something more simple that has just two methods, read(), and unread(), you have to remember that by doing so you will probably be
Reading in the entire input, and then processing the input. So if you have a large file you will be eating up memory to store it.
Reading the input once as a stream, but storing the char(s) from unread() and passing them around in a separate structure.
With PushbackReader, he is reading and processing through the input once, he does not have to buffer the entire input, nor is he having to store the unread() characters and pass them around separately
Related
I need to remove comments from code, but in this case I'll have to do it without using
System.out.println(sourceCode.replaceAll("//.*|/\\*((.|\\n)(?!=*/))+\\*/", ""));
The program needs to check the code character by character to look for "/" and then proceed to check if the next character is "/" or "*".
I'm looking for a good way to read through the code and check characters letter by letter
This is a classic problem given to new learners in Java. I would suggest to go for a simple approach as it is intended to help you practice your coding skills
Read the java source code as a file in your program char by char.
Search for comments beginning. In this case, there are 2, /* and //.
Open a string buffer and start writing the read contents into it.
If its /*, then don't write it in buffer. Keep on moving to next character till you find */.
Repeat till end of file is reached.
If single line comments need to be removed, then same algorithm can be followed till you get a new line character.
If you need help in reading from file char by char, refer to Java documentation.
When end of file is reached, then write the string buffer back to the file.
I'm trying to write a program that will read a VERY LARGE binary file and try to find the occurrence of 2 different strings and then print the indexes that matches the patterns. For the example's sake let's assume the character sequences are [H,e,l,l,o] and [H,e,l,l,o, ,W,o,r,l,d].
I was able to code this for small binary files because I was reading each character as a byte and then saving it in an Arraylist. Then starting from the beginning of the Arraylist, I was comparing the byte arraylist(byte[] data) with the byte[] pattern.
I need to find a way to do the same but WITHOUT writing the entire binary file in memory for comparison. That means I should be able to compare while reading each character (I should not save the entire binary file in memory). Assume the binary file only contains characters.
Any suggestions on how this can be achieved ? Thank you all in advance.
Seems like you are really looking for Aho-Corasick string matching algorithm.
The algorithm builds an automaton from the given dictionary you have, and then allows you to find matches using a single scan of your input string.
The wikipedia article links to this java implementation
Google "finite state machine".
Or, read the file one byte at a time, if the byte just doesn't match the first character of the search term, go on to the next byte. If it does match, now you're looking for the next character in the sequence. I.e., your state has gone from 0, to 1. If your state equals (or passes) the length of the search string, you found it!
Implementation/debugging left to the reader.
There are specialised algorithms for this but let's try a simple one first.
You can start with making the comparison on the fly, always after reading the next byte. Once you do that, it's easy to spot that you don't need to keep any bytes that are from earlier than your longest pattern.
So you can just use a buffer that is as long as your longest pattern, put new bytes in at one end and drop them at the other.
As I said, there are algorithms more effective than this but it's a good start.
Use a FileInputStream wrapped in a BufferedInputStream and compare each byte. Keep a buffer the length of the sequence you're looking for so you backtrack if it doesn't match at some point. If the sequence you're looking for is too large, you could save the offset and re-open the file for reading.
Working with streams: http://docs.oracle.com/javase/tutorial/essential/io/
String matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm
Or if you just want something to copy and paste you could look at this SO question.
One of my students have a task to do which part is to check if there is a regex matching string inside of a file.
The trick is that his teacher has forbidden reading whole file at once then parse it. Instead he said that he supposed to use buffer. The problem is that you never know how much of input you suppose to read from the file: there might be a matching sequence if you read just one character more from the file.
So the teacher wrote(translated):
Use technique known from parsers:
rewrite second half of the buffer to the first part of buffer
read next part of file to the second half
check if whole buffer contains the matching sequence
So how it suppose to be done(idea)? In my opinion it does not solve the problem stated above and it is pretty stupid and wasteful.
A Matcher does use an internal buffer of some kind, certainly. But if you look at the prototype to build a Matcher, you see that the only thing it takes as an argument is a simple CharSequence, which has only three operations:
knowing its length,
getting one character at a given offset,
getting a subsequence (another CharSequence).
When reading from a file, one possibility is to map the whole file using FileChannel.map(), then use an appropriate CharsetDecoder to read into a CharBuffer (which implements CharSequence). Or do that in chunks...
... Or use yours truly's crazy idea: this! I have tested it on 800+ MiB files and it works...
What your teacher is saying:
The regex will never need to match anything longer than half the length of the buffer.
The match could lie on a buffer boundary, hence you need to shift:
That seems realistic.
A BufferedReader reading line wise seems not entirely fitting. Maybe you might consider a byte array, BufferedInputStream.
I'd like to search pattern in very large file (f.e above 1 GB) that consists of single line.
It is not possible to load it into memory. Currently, I use BufferedReaderto read into buffers (1024 chars).
The main steps:
Read data into two buffers
Search pattern in that buffers
Increment variable if pattern was found
Copy second buffer into first
Load data into second buffers
Search pattern in both buffers.
Increment variable if pattern was found
Repeat above steps (start from 4) until EOF
That algorithm (two buffers) lets me to avoid situation, where searched piece of text is split by chunks. It works like a chram unless pattern result is smaller that two buffers length. For example I can't manage with case, when result is longer - let's say long as 3 buffers (but I've only data in two buffers, so match will fail!). What's more, I can realize such a case:
Prepare 1 GB single line file, that consits of "baaaaaaa(....)aaaaab"
Search for pattern ba*b.
The whole file match pattern!
I don't have to print the result, I've only to be able to say: "Yea, I was able to find pattern" or "No, I wasn't able to find that".
It's possible with java? I mean:
Ability to determine, whether a pattern is present in file (without loading whole line into memory, see case above
Find the way handle the case, when match result is longer than chunk.
I hope my explanation is pretty clear.
I think the solution for you would be to implement CharSequence as a wrapper over very large text files.
Why? Because building a Matcher from a Pattern takes a CharSequence as an argument.
Of course, easier said than done... But then you only have three methods to implement, so that shouldn't be too hard...
EDIT I took the plunge and I ate my own dog's food. The "worst part" is that it actually works!
It seems like you may need to break that search-pattern down into pieces, since, given your restrictions, searching for it in its entirety is failing.
Can you determine that a buffer contains the beginning of a match? If so, save that state and then search the next portion for the next part of the match. Continue until the entire search-term is found.
I have a file I need to read that's over 50gb large with all characters in one line.
Now comes the tricky part:
I have to split it on all double quote characters, find a substring (srsName) and get the element behind it which in a for loop over split substrings has the i+1 index ("value").
Question:
Are there some progressive search implementations or other methods that I could use instead of filling up my memory?
To simplify:
There are quite a lot of those srsName substrings inside the file but I need to read just one of those as all of them have the same value following them.
Something about the file:
It's a xml being prepared for a xsl transformation. I can't use a xslt that creates indentation because I need to do it with as little disk/memory usage as possible.
This is how the value presents itself inside the file.
<sometag:sometext srsName="value">
One way to speed up your search in a massive file is adapting a fast in-memory search algorithm to searching in a file.
One particularly fast algorithm is Knuth–Morris–Pratt: it looks at each character at most twice, and requires a small preprocessing step to construct the "jump table" that tells you to what position you should move to continue your search. That table is constructed in such a way as to not have you jump too far back, so you can do your search by keeping a small "search window" of your file in memory: since you are looking for a word of only seven characters, it is sufficient to keep only the last six characters in memory as your search progresses through the file.
You could try using a BufferedReader - http://download.oracle.com/javase/6/docs/api/java/io/BufferedReader.html
This would allow you to specify the number of characters to read in to memory at once using the read method.
I've done it like this:
String myBuff = "";
char charBuff;
while(myBuff.length()<30)myBuff+=(char)br.read();
charBuff=(char)br.read();
try{
while(true){
myBuff=myBuff.substring(1)+charBuff;
if(myBuff.startsWith("srsName"))break;
charBuff=(char)br.read();
}
}
catch(Exception e){}
value = myBuff.split("\"")[1];
where br is my BufferedReader