Removing comments from code character by character [Java] - java

I need to remove comments from code, but in this case I'll have to do it without using
System.out.println(sourceCode.replaceAll("//.*|/\\*((.|\\n)(?!=*/))+\\*/", ""));
The program needs to check the code character by character to look for "/" and then proceed to check if the next character is "/" or "*".
I'm looking for a good way to read through the code and check characters letter by letter

This is a classic problem given to new learners in Java. I would suggest to go for a simple approach as it is intended to help you practice your coding skills
Read the java source code as a file in your program char by char.
Search for comments beginning. In this case, there are 2, /* and //.
Open a string buffer and start writing the read contents into it.
If its /*, then don't write it in buffer. Keep on moving to next character till you find */.
Repeat till end of file is reached.
If single line comments need to be removed, then same algorithm can be followed till you get a new line character.
If you need help in reading from file char by char, refer to Java documentation.
When end of file is reached, then write the string buffer back to the file.

Related

How parser's buffer works? Matching the regex

One of my students have a task to do which part is to check if there is a regex matching string inside of a file.
The trick is that his teacher has forbidden reading whole file at once then parse it. Instead he said that he supposed to use buffer. The problem is that you never know how much of input you suppose to read from the file: there might be a matching sequence if you read just one character more from the file.
So the teacher wrote(translated):
Use technique known from parsers:
rewrite second half of the buffer to the first part of buffer
read next part of file to the second half
check if whole buffer contains the matching sequence
So how it suppose to be done(idea)? In my opinion it does not solve the problem stated above and it is pretty stupid and wasteful.
A Matcher does use an internal buffer of some kind, certainly. But if you look at the prototype to build a Matcher, you see that the only thing it takes as an argument is a simple CharSequence, which has only three operations:
knowing its length,
getting one character at a given offset,
getting a subsequence (another CharSequence).
When reading from a file, one possibility is to map the whole file using FileChannel.map(), then use an appropriate CharsetDecoder to read into a CharBuffer (which implements CharSequence). Or do that in chunks...
... Or use yours truly's crazy idea: this! I have tested it on 800+ MiB files and it works...
What your teacher is saying:
The regex will never need to match anything longer than half the length of the buffer.
The match could lie on a buffer boundary, hence you need to shift:
That seems realistic.
A BufferedReader reading line wise seems not entirely fitting. Maybe you might consider a byte array, BufferedInputStream.

Java parsing text file

I need to write a parser for textfiles (at least 20 kb), and I need to determine if words out of a set of words appear in this textfile (about 400 words and numbers). So I am looking for the most efficient possibilitie to do this (if a match is found, i need to do some further processing of this and it's previous line).
What I currently do, is to exclude lines that do not contain any information for sure (kind of metadata lines) and then compare word by word - but i don't think that only comparing word by word is the most efficient possibility.
Can anyone please provide some tips/hints/ideas/...
Thank you very much
It depends on what you mean with "efficient".
If you want a very straightforward way to code it, keep in mind that the String object in java has method String.contains(CharSequence sequence).
Then, you could put the file content into a String and then iterate on your keywords you want to check to see if any of those appear in String, using the method contains().
How about the following:
Put all your keywords in a HashSet (Set<String> keywords;)
Read the file one line at once
For each line in file:
Tokenize to words
For each word in line:
If word is contained in keywords (keywords.containes(word))
Process actual line
If previous line is available
Process previous line
Keep track of previous line (prevLine = line;)

PushbackReader Equivalents

A somewhat vague question, I apologize in advance.
I'm building the tokenizing portion of a small parser with help of the book Building Parsers with Java. It uses PushbackReader and the String contained within as a way to first detect the first character of the given string then sends the PushbackReader to the appropriate state (the state then builds the token as a separate object containing a String).
PushbackReader seems to only be used if no other characters of use are found within the the stream. It then unreads the last character.
Is it possible to do the same thing with a CharBuffer's append? Preferably something that doesn't require the buffer to be predefined.
Based on what I see, he chose PushbackReader for two reasons:
He needed a reader that could handle individual characters.
He needed to backup in the stream because when tokenizing he needed to see one character or more ahead to decide if the current char was part of the token.
For example with the method WhitespaceState.nextToken he is skipping whitespace characters. He pulls off a character and looks at it. If it is a whitespace char he pulls the next char. When he finally pulls a character that is not whitespace, he puts it back into the stream so the next method that looks at the stream will be looking at the correct character.
While you could replace it with something more simple that has just two methods, read(), and unread(), you have to remember that by doing so you will probably be
Reading in the entire input, and then processing the input. So if you have a large file you will be eating up memory to store it.
Reading the input once as a stream, but storing the char(s) from unread() and passing them around in a separate structure.
With PushbackReader, he is reading and processing through the input once, he does not have to buffer the entire input, nor is he having to store the unread() characters and pass them around separately

How to remove particular string from .asp file, using Java?

I'm currently writing something which is validating our vbscript files. Right at the start I wish to remove all lines of code which are comments. I was expecting to be able to use the "'" (comment symbol in vbscript) and '\n'. However, when I write the content of the file to screen, the new lines are not formatting. Does this mean there are actually no new lines in the original vbscript file and if not, how could I remove comments?
first read whole file in string example
then use regex or simply substring for removing extra syntax
How are you parsing the file? Are you also taking the '\r' into consideration when removing the comments? Or maybe you are accidentally removing all newline characters.
I would create some state flags to tell the parser when I was in a comment or not.

Scanner cuts off my String after about 2400 characters

I've got some very basic code like
while (scan.hasNextLine())
{
String temp = scan.nextLine();
System.out.println(temp);
}
where scan is a Scanner over a file.
However, on one particular line, which is about 6k chars long, temp cuts out after something like 2470 characters. There's nothing special about when it cuts out; it's in the middle of the word "Australia." If I delete characters from the line, the place where it cuts out changes; e.g. if I delete characters 0-100 in the file then Scanner will get what was previously 100-2570.
I've used Scanner for larger strings before. Any idea what could be going wrong?
At a guess, you may have a rogue character at the cut-off point: look at the file in a hex editor instead of just a text editor. Perhaps there's an embedded null character, or possibly \r in the middle of the string? It seems unlikely to me that Scanner.nextLine() would just chop it arbitrarily.
As another thought, are you 100% sure that it's not all there? Perhaps System.out.println is chopping the string - again due to some "odd" character embedded in it? What happens if you print temp.length()?
EDIT: I'd misinterpreted the bit about what happens if you cut out some characters. Sorry about that. A few other things to check:
If you read the lines with BufferedReader.readLine() instead of Scanner, does it get everything?
Are you specifying the right encoding? I can't see why this would show up in this particular way, but it's something to think about...
If you replace all the characters in the line with "A" (in the file) does that change anything?
If you add an extra line before this line (or remove a line before it) does that change anything?
Failing all of this, I'd just debug into Scanner.nextLine() - one of the nice things about Java is that you can debug into the standard libraries.

Categories