Java Scanner with empty delimiter - java

I'd like to parse some text using a hand-written recursive-descent parser. I used Scanner with the following delimiter: "\\s*". Unfortunately, because this pattern can match the empty string, every hasNextFoo and nextFoo call seems to stop matching anything.
The documentation doesn't say anything about delimiters that can match the empty string.

You have some objection to the '+' character?
Are you sure you want to use a regular expression at all, and not just an if statement testing for space characters? You say 'runtime'. Is your data in a string, or coming on a stream, or what?

Yes, because I want to use the scanner as a runtime lexer. In short, I want to be able to call scanner.next(pattern), which would either return the matched string or throw an exception without consuming any input. Spaces should be ignored. If there is a better class for this than Scanner, I would be glad to use it.
I cannot think of any off-the-shelf library class that will do this for you. The normal model of a scanner / lexer is that any invalid character sequence (i.e. one that results in an exception) will be consumed. So, I think you are going to have to implement your own scanner by hand, taking care to treat the read-ahead characters as unconsumed. You could do this with a "pushback" reader or (if that model is not convenient) by explicitly buffering the characters yourself with some kind of mark / reset model. If all you are doing is splitting into tokens separated by one or more spaces, then the pushback reader approach should be fine.
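A minimal sketch of such a hand-written scanner, assuming the whole input is already in memory as a CharSequence (for a stream, the same idea would need a buffer with mark / reset). It uses Matcher.lookingAt on a region instead of a pushback reader, and the class name SimpleLexer and its methods are hypothetical, not a library API:

```java
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleLexer {
    private static final Pattern SPACES = Pattern.compile("\\s*");
    private final CharSequence input;
    private int pos = 0; // offset of the first unconsumed character

    public SimpleLexer(CharSequence input) {
        this.input = input;
    }

    /** True if the pattern matches right after any leading whitespace. Consumes nothing. */
    public boolean hasNext(Pattern pattern) {
        int start = skipSpaces(pos);
        return pattern.matcher(input).region(start, input.length()).lookingAt();
    }

    /**
     * Returns the text matched right after any leading whitespace and consumes it.
     * On failure it throws and leaves the position untouched.
     */
    public String next(Pattern pattern) {
        int start = skipSpaces(pos);
        Matcher m = pattern.matcher(input).region(start, input.length());
        if (!m.lookingAt()) {
            throw new NoSuchElementException("no match for " + pattern + " at offset " + start);
        }
        pos = m.end(); // commit only after a successful match
        return m.group();
    }

    private int skipSpaces(int from) {
        Matcher m = SPACES.matcher(input).region(from, input.length());
        return m.lookingAt() ? m.end() : from;
    }
}
```

Because the position is only updated after a successful match, a failed next(pattern) can be followed by another attempt with a different pattern at the same place.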

You might also consider StreamTokenizer. Here is an example of using it for one-symbol look-ahead in a recursive-descent parser.
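As a rough illustration of one-symbol look-ahead with StreamTokenizer, here is a small sketch for the toy grammar expr := NUMBER ('+' NUMBER)*. The grammar and the class name SumParser are assumptions made for this example; the look-ahead itself relies on StreamTokenizer.pushBack(), which makes the next nextToken() call return the same token again:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class SumParser {
    private final StreamTokenizer st;

    public SumParser(String input) {
        // Default settings: whitespace is skipped and numbers are parsed into nval.
        st = new StreamTokenizer(new StringReader(input));
    }

    /** expr := NUMBER ('+' NUMBER)* */
    public double expr() {
        double value = number();
        while (advance() == '+') {
            value += number();
        }
        st.pushBack(); // the look-ahead token was not '+': leave it for the caller
        return value;
    }

    /** Returns the type of the next token without consuming it. */
    public int peek() {
        int type = advance();
        st.pushBack();
        return type;
    }

    private double number() {
        if (advance() != StreamTokenizer.TT_NUMBER) {
            throw new IllegalStateException("number expected, got " + st);
        }
        return st.nval;
    }

    private int advance() {
        try {
            return st.nextToken();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that with the default settings StreamTokenizer treats '-' and '.' as part of numbers and '/' as a comment character, so a real expression grammar would need those reconfigured.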

It's possible to use lookbehinds/lookaheads to explicitly define which delimiters are omittable.
For instance this scanner uses whitespaces as a delimiter but doesn't need them between numbers and words:
Scanner scanner = new Scanner("1A.23 4 BC-5")
        .useDelimiter("\\s+|(?<=\\d)(?=[A-Z])|(?<=[A-Z])(?=[-+.\\d])");
while (scanner.hasNext()) System.out.println(scanner.next());
It produces:
1
A
.23
4
BC
-5
The regex consists of three alternatives:
\s+ : consecutive whitespace characters are a delimiter.
(?<=\d)(?=[A-Z]) : an empty string between a digit and a letter is a delimiter.
(?<=[A-Z])(?=[-+.\d]) : an empty string between a letter and '-', '+', '.' or a digit is a delimiter.
(Note: \w can't be used here as it matches digits.)

Related

When using a string tokenizer , how do I check the next character after a token if it is a delimiter?

Say for example I have a string which is the following "3*x+(b[3]+c)+x"
And my delimiters are " \t*+-/()[]"
When I use the string tokenizer to split the string and I get to token "b", how can I check to make sure that the next delimiter is "[".
I need to do this so that token "x" is different from token "b", as token x represents a simple variable and token b represents an array in the program I am trying to construct.
I am trying to write this code in Java.
If you use the three-argument constructor for StringTokenizer, the third argument says whether or not the delimiters are returned as tokens.
However [1], this still leaves you with the awkward problem of dealing with whitespace. For example, "1+1" gives you 3 tokens, but "1 + 1" gives you 5 tokens. You either need to:
pick a better way [2] to tokenize the input,
filter out any whitespace "tokens" before you start analyzing them, or
remove any whitespace from the input string before you tokenize it.
[1] I'm inclined to agree with @EJP that StringTokenizer is the wrong tool for this job. You could make it work, but IMO you shouldn't. (Unless you have been explicitly directed to do this.)
[2] Better ways might be StreamTokenizer, String.split (with a cunning regex involving lookaheads or lookbehinds), a tokenizer generated using ANTLR, JavaCC or some other parser generator, or ... a hand-built tokenizer.
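If StringTokenizer has to be used anyway, the approach above could be sketched like this: return the delimiters as tokens, drop the whitespace ones, and then look one token ahead for "[". The class and method names are hypothetical, and the letter check for "looks like a variable" is an assumption about the input:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class ArrayNameFinder {
    /** Splits the expression, keeping delimiters as tokens and dropping whitespace tokens. */
    static List<String> tokenize(String expr) {
        // Third argument true: each delimiter character is returned as its own token.
        StringTokenizer st = new StringTokenizer(expr, " \t*+-/()[]", true);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (!token.trim().isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Returns every name token that is directly followed by "[". */
    static List<String> arrayNames(List<String> tokens) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String current = tokens.get(i);
            boolean isName = Character.isLetter(current.charAt(0));
            if (isName && tokens.get(i + 1).equals("[")) {
                names.add(current);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("3*x+(b[3]+c)+x");
        System.out.println(arrayNames(tokens)); // prints [b]
    }
}
```

Collecting the tokens into a list first makes the look-ahead a simple index check, which StringTokenizer itself does not offer.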

How to get rid of things that are not words (periods etc.) for a small word count method

I just want to use Java to write a simple word count method for an essay. But how can I get rid of things that are not words, such as periods? Thanks!
Assuming that your definition of words includes only the letters of the alphabet, you can just use replaceAll with the appropriate regex. For example, the line below will remove all characters except spaces and letters.
String output = input.replaceAll("[^a-zA-Z ]", "");
You can do this with the default Scanner, provided every punctuation mark is attached to a word; otherwise you can set the characters the Scanner skips. Alternatively you could use regular expressions, but that is a bit harder.
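A small sketch of a word count built on the replaceAll idea. One deliberate change from the regex shown above: it keeps all whitespace (\s) rather than only spaces, so that words on different lines are not glued together. The class and method names are made up for the example:

```java
public class WordCount {
    /** Counts words, where a word is a run of ASCII letters. */
    static int countWords(String text) {
        // Drop everything that is neither a letter nor whitespace.
        String lettersOnly = text.replaceAll("[^a-zA-Z\\s]", "").trim();
        if (lettersOnly.isEmpty()) {
            return 0; // split would otherwise return one empty string
        }
        return lettersOnly.split("\\s+").length;
    }

    public static void main(String[] args) {
        System.out.println(countWords("Hello, world. This is a test!")); // prints 6
    }
}
```

Under this definition "don't" counts as one word ("dont") and a number such as "42" is not counted at all.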

Java useDelimiter issue

I am working on an assignment that requires me to read in a text file of sentences. After this I am trying to use the delimiters as specified to restrict what is coming in and place that into an array.
scannerInput.useDelimiter("\\p{Punct}|\\p{Digit}|\\p{javaWhitespace}");
My problem is that when I read in the text file and place the words into an array there are large gaps of what appears to be whitespace between indexes in the array.
For example the output of the array would look like:
array[0] =
array[1] = tony
array[2] =
array[3] = sue
I am assuming there are some formatting characters or other I am missing in my delimiter list. I am wondering what I am missing to remove all additional whitespace so that I may be able to have only the words in the array. As of now my first 30 indexes are essentially blank.
Or if there is an easy way to find out what is really behind what appears to be whitespace. I assume it isn't just empty. Thanks for your help.
Your delimiter matches only a single character, so every additional consecutive delimiter character produces an empty token. Perhaps you need to allow multiple characters:
scannerInput.useDelimiter("\\p{Punct}+|\\p{Digit}+|\\p{javaWhitespace}+");
and, if there may be a mix of delimiter types between words (not just whitespace or just digits), then change it to the regex suggested by @David Ehrmann.
Try:
scannerInput.useDelimiter("[\\p{Punct}\\p{Digit}\\p{javaWhitespace}]+");
It'll gobble consecutive delimiters. I also switched from alternation to a character class because you're only matching single characters. \p{Punct} is, itself, a character class, and character classes match faster than a group with alternation.
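For illustration, a short sketch that applies this delimiter to a sample line and collects the words. The sample text and the names WordScanner and words are assumptions for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class WordScanner {
    /** Returns the words of the text, treating any run of punctuation, digits or whitespace as one delimiter. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner in = new Scanner(text)) {
            in.useDelimiter("[\\p{Punct}\\p{Digit}\\p{javaWhitespace}]+");
            while (in.hasNext()) {
                result.add(in.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("  tony, 42 sue!!\n")); // prints [tony, sue]
    }
}
```

Because the whole run ", 42 " is consumed as a single delimiter, no empty strings end up between "tony" and "sue".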

Can newlines be replaced with spaces? (lexer)

I'm currently in the process of developing a parser for a subset of Java, and I was wondering:
Are there any cases in which newlines are more than token separators?
That is, where they couldn't just be replaced by a space.
Should I ignore newlines, in the same way that I ignore white-space?
That is, just use them to detect token separation.
Yes, all newline characters in Java source code can be replaced by a space or be removed. However, do not remove \n (backslash n), because those are the escape sequences for newlines inside a String literal.
And yes, to the parser newlines are the same as spaces, as long as you are outside String literals. If you removed a newline inside a String literal, you would suppress a syntax error, because Java does not allow newline characters in a String literal. So, this is wrong:
String str = "first line
same line";
So, it depends on whether you want your parser to detect syntax errors or not. Do you only parse valid code or not? That is the question you should ask yourself.
The only situation I can think of where it makes a difference is within String-literals.
If there is a linebreak between two "s it would cause a syntax error while a space would not.
Note that \n can also appear as an escape sequence inside a string. And if you do make this replacement, you still have to increment the line number by one for each newline, because you will need line numbers in the later phases of your project.
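The line-number point can be sketched as a whitespace skipper that treats line breaks like any other whitespace but still counts them. This is only an illustration outside of string literals and comments; the class name and the choice to count "\r\n" as one line break are assumptions:

```java
public class WhitespaceSkipper {
    private final String src;
    private int pos = 0;  // offset of the next unread character
    private int line = 1; // 1-based line number at pos

    public WhitespaceSkipper(String src) {
        this.src = src;
    }

    /** Skips spaces, tabs, form feeds and line breaks, counting each line break once. */
    public void skipWhitespace() {
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (c == '\n') {
                line++;
                pos++;
            } else if (c == '\r') {
                line++;
                pos++;
                // Treat "\r\n" as a single line break.
                if (pos < src.length() && src.charAt(pos) == '\n') {
                    pos++;
                }
            } else if (c == ' ' || c == '\t' || c == '\f') {
                pos++;
            } else {
                break; // start of the next token
            }
        }
    }

    public int line() {
        return line;
    }

    public int position() {
        return pos;
    }
}
```

The lexer can then attach line() to every token it produces, so later phases can report errors with a line number.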

use of delimiter function from scanner for "abc-def"

I'm currently trying to filter a text-file which contains words that are separated with a "-". I want to count the words.
scanner.useDelimiter(("[.,:;()?!\" \t\n\r]+"));
The problem which occurs simply is: words that contain a "-" will get separated and counted for being two words. So just escaping with \- isn't the solution of choice.
How can I change the delimiter-expression, so that words like "foo-bar" will stay, but the "-" alone will be filtered out and ignored?
Thanks ;)
OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense, are separated by punctuation and the like, etc., right?
Example:
This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.
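So, what you need as a delimiter is something that is either any amount of whitespace and/or punctuation (which you already have covered with the regex you showed), or a hyphen that is surrounded by at least one whitespace character on each side. The regex character for "or" is "|". There is a shortcut for the whitespace character class (spaces, tabs, and newlines) in many regex implementations: "\s" (written "\\s" in a Java string literal). The hyphen alternative has to come first, because the regex engine takes the first alternative that matches, and the character class would otherwise consume the whitespace in front of the hyphen on its own and leave "-" behind as a token.
"\\s+-\\s+|[.,:;()?!\"\\s]+"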
If possible try to use the pre-defined classes... makes the regex much easier to read. See java.util.regex.Pattern for options.
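A minimal sketch of a delimiter of this kind in use, with the whitespace-hyphen-whitespace alternative placed first so that it wins over the character class. The sample sentence and the names HyphenAwareWords and words are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class HyphenAwareWords {
    /** Splits prose into words, keeping "foo-bar" whole but dropping a free-standing " - ". */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner in = new Scanner(text)) {
            // First alternative: a hyphen with whitespace on both sides.
            // Second alternative: any run of punctuation and whitespace.
            in.useDelimiter("\\s+-\\s+|[.,:;()?!\"\\s]+");
            while (in.hasNext()) {
                result.add(in.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [foo-bar, is, as, far, as, we, can, tell, one, word]
        System.out.println(words("foo-bar is - as far as we can tell - one word."));
    }
}
```

A hyphen that directly follows punctuation (for example ", - ") is not covered by this pattern and would still come out as a token.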
Maybe this is what you are looking for:
string.split("\\s+(\\W*\\s)?");
Reads: Match 1 or more whitespace chars optionally followed by zero or more non-word characters and a whitespace character.
This is not very simple. One thing to try would be {current-delimiter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimiter-chars-or-hyphen}.
It might be easier to just ignore words returned by scanner consisting entirely of hyphens
Scanner scanner = new Scanner("one two2 - (three) four-five - ,....|");
scanner.useDelimiter("(\\B+-\\B+|[.,:;()?!\" \t|])+");
while (scanner.hasNext()) {
System.out.println(scanner.next("\\w+(-\\w+)*"));
}
NB
the next(String) method asserts that you get only words since the original useDelimiter() method misses "|"
NB
you have used the regular expression "\r\n|\n" as line terminator. The JavaDocs for java.util.regex.Pattern shows other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]"
This should be simple enough: [^\\w-]\\W*|-\\W+
But of course if it's prose, and you want to exclude underscores:
[^\\p{Alnum}-]\\P{Alnum}*|-\\P{Alnum}+
or if you don't expect numerics:
[^\\p{Alpha}-]\\P{Alpha}*|-\\P{Alpha}+
EDIT: These are the simpler forms. Keep in mind that a complete solution, one that also handles dashes at the beginning and end of lines, would follow this pattern: (?:^|[^\\w-])\\W*|-(?:\\W+|$)
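As a quick check of the first, simpler form, here is a sketch that uses it as a Scanner delimiter on the sample input from the earlier answer. The class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DashDelimiterDemo {
    /** Tokenizes with the delimiter [^\w-]\W*|-\W+ so that inner hyphens stay inside words. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner in = new Scanner(text)) {
            in.useDelimiter("[^\\w-]\\W*|-\\W+");
            while (in.hasNext()) {
                result.add(in.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [one, two2, three, four-five]
        System.out.println(words("one two2 - (three) four-five - ,....|"));
    }
}
```

The hyphen in "four-five" survives because "-" only starts a delimiter when a non-word character follows it.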
