I admit, not the best title.
I'm having the following problem. I need to use my scanner to parse every word (without the delimiters) into separate strings.
Example: the input Poker; Blackjack; LasVegas, NewYork should become Poker Blackjack LasVegas NewYork
Now, for the first part, I would just use a delimiter like so: sc.useDelimiter("; ") which would work fine.
Second part is where I get trouble. If I switch to sc.useDelimiter(", ") after I'm done with Blackjack, I would still include that first ; and a whitespace so the string would output ; LasVegas.
I tried going over it by first resetting the delimiter and eating up the first token which is kind of a bad way of solving it, but then the string would still turn out to be "whitespace"LasVegas instead of LasVegas.
Would really appreciate some help.
There are a number of ways to deal with this, depending on your actual requirements [1]:
Don't change the delimiter. For the example line, the token after "Blackjack" will be "LasVegas, NewYork". Create another scanner to parse that token. (Or use String::split.)
Use a delimiter regex that will match either delimiter; e.g. "[;,]\\s*".
Parse like this:
String line = scanner.nextLine();
String[] parts = line.split(";\\s*");
String[] parts2 = parts[2].split(",\\s*");
This is assuming that ; is a primary delimiter and , is a secondary delimiter.
Change the input file syntax so that it uses only one delimiter character. (This assumes that you are free to do that, AND that an alternative syntax would "make more sense".)
1 - Obviously, we cannot infer the syntax of the file that you are trying to parse from a single line of input. Or, in general, from a single example input file.
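A minimal sketch of the two-level split, assuming (as above) that ; is the primary delimiter and , the secondary one. The class and method names are made up for illustration; adapt them to your real input format.

```java
import java.util.ArrayList;
import java.util.List;

class TwoLevelSplit {
    // Splits on ';' first (primary), then on ',' inside each part (secondary).
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        for (String part : line.split(";\\s*")) {
            for (String word : part.split(",\\s*")) {
                String w = word.trim();
                if (!w.isEmpty()) {
                    out.add(w); // skip empty pieces from stray delimiters
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", tokens("Poker; Blackjack; LasVegas, NewYork")));
    }
}
```

If the nesting of the two delimiters carries no meaning for you, the single-regex delimiter is probably simpler.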
Using a regular expression to match both types of punctuation, including any trailing whitespace, should do the trick.
sc.useDelimiter("[;,]\\s*");
                     ^^^^ Followed by 0 or more whitespace chars
                 ^^^^ Either of these
This will fail to split off the last token on a line (NewYork in this case) if there is no semicolon or comma after it, because the line break itself is not a delimiter. If these 4-tuples of games and cities come in this format (where no delimiter comes after the last token), then you can additionally match a newline character:
sc.useDelimiter("\\n|[;,]\\s*");
                     ^^^^^^^^ semi/comma delimiters
                    ^ OR
                 ^^^ New-line character
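For reference, a small self-contained sketch of that delimiter in use. The two input lines and the class name are my own assumptions, not taken from the real file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Collects every token, treating a newline or a ';' / ',' (plus any
    // whitespace that follows it) as the delimiter.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter("\\n|[;,]\\s*");
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical two-line input in the format from the question.
        System.out.println(tokens("Poker; Blackjack; LasVegas, NewYork\nChess; Go; Paris, Rome"));
    }
}
```

Note that with Windows line endings a \r would stay attached to the last token of each line; replacing \\n with \\R (Java 8+) is one way around that.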
I have a text file containing numbers and characters, broken into 3 columns, and I can't figure out which regular expression I need. The columns are separated by ; and after the third column is written the file skips to the next line and goes on. I know the majority of my code is working properly, and I've narrowed the problem down to this section of code.
I've tried looking up Java regular expressions, but I can't seem to find what I'm trying to accomplish.
while ((line = br.readLine()) != null) {
    // Searches the file that matches a specific value
    if (!line.isEmpty() || line.matches("Need regular expression here that skips over the two columns and reads the last")) {
        if (isValid(line)) {
            System.out.println(line + "IS Valid");
        } else {
            System.out.println(line + "IS NOT VALID");
        }
    }
}
In the console after reading the file it should say
"12345";"12";"tacobell#yahoo.com"; IS valid
"123456";"31";"Taco . bell#yahoo.com"; IS NOT VALID
It must contain the whole line when writing out to the console not just the third column.
^[^;]*;[^;]*;([^ ]*);$
That will give you a match only if the third column contains no spaces (so it will match "12345";"12";"tacobell#yahoo.com";, but it will not match "123456";"31";"Taco . bell#yahoo.com";).
The parentheses are a capture group, so you can extract that column by grabbing group #1 (not group #0) from the capture results.
The ^ at the beginning means that this pattern has to start at the beginning of a line, and the $ at the end means that this pattern has to end at the end of a line. If that's not the case for your input, you will have to adjust it. For example, if you had trailing whitespace after the last column, you might do:
^[^;]*;[^;]*;([^ ]*);[ ]*$
If you had trailing whitespace and the last semicolon was optional, you'd do:
^[^;]*;[^;]*;([^ ]*);?[ ]*$
One last thing: I'm using [ ] to indicate whitespace, but that only includes the basic space character. It doesn't include tabs, newlines, or any other type of whitespace. It's better to use \s if you want to include all of those, but in Java string syntax you have to escape the backslash, so it would look like this:
Pattern.compile("^[^;]*;[^;]*;([^ ]*);?\\s*$")
This is the reason why well-designed programming languages have a specialized regular expression syntax. It gets even crazier if you want to match a literal backslash:
Pattern.compile("\\\\")
In JavaScript, this would just be:
/\\/
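To tie this back to the question, here is a rough sketch that checks each line with the mandatory-semicolon pattern and prints the whole line, not just the third column. The class and method names are hypothetical, and "valid" here only means "third column contains no spaces", as in the regex above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThirdColumnCheck {
    // Two columns of anything but ';', then a third column without spaces,
    // a mandatory ';' and optional trailing whitespace.
    private static final Pattern LINE =
            Pattern.compile("^[^;]*;[^;]*;([^ ]*);\\s*$");

    // Returns the third column (group 1), or null when the line does not match.
    static String thirdColumn(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "\"12345\";\"12\";\"tacobell#yahoo.com\";",
            "\"123456\";\"31\";\"Taco . bell#yahoo.com\";"
        };
        for (String line : lines) {
            String verdict = thirdColumn(line) != null ? " IS valid" : " IS NOT VALID";
            System.out.println(line + verdict);
        }
    }
}
```

One caveat: in the variant with the optional semicolon (;?), the greedy group [^ ]* can swallow the trailing semicolon itself, so [^ ;]* may be a safer class if you need the captured column there.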
Say for example I have a string which is the following "3*x+(b[3]+c)+x"
And my delimiters are " \t*+-/()[]"
When I use the string tokenizer to split the string and I get to token "b", how can I check to make sure that the next delimiter is "[".
I need to do this so that token "x" is different from token "b", as token x represents a simple variable and token b represents an array in the program I am trying to construct.
I am trying to write this code in Java.
When using a string tokenizer, how do I check whether the next character after a token is a delimiter?
If you use the 3-argument constructor for StringTokenizer, then the 3rd argument says whether or not the delimiters are returned as tokens.
However [1], this still leaves you with the awkward problem of dealing with whitespace. For example "1+1" gives you 3 tokens, but "1 + 1" gives you 5 tokens. You either need to:
pick a better way [2] to tokenize the input,
filter out any white-space "tokens" before you start analyzing them, or
remove any whitespace from the input string before you tokenize it.
1 - I'm inclined to agree with @EJP that StringTokenizer is the wrong tool for this job. You could make it work, but IMO you shouldn't. (Unless you have been explicitly directed to do this.)
2 - Better ways might be StreamTokenizer, String.split (with a cunning regex involving lookaheads or lookbehinds), a tokenizer generated using ANTLR, javacc or some other parser generator, or ... a hand built tokenizer.
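If you do have to stay with StringTokenizer, the following is a minimal sketch of the "return delimiters and filter out whitespace" route. It assumes a name counts as an array whenever the next non-whitespace token is "["; the class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class ArrayVsScalar {
    static final String DELIMS = " \t*+-/()[]";

    // Returns the names that are directly followed by '[' (treated as arrays).
    static List<String> arrayNames(String expr) {
        // Third argument 'true': delimiters are returned as tokens too.
        StringTokenizer st = new StringTokenizer(expr, DELIMS, true);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            String t = st.nextToken();
            if (!t.trim().isEmpty()) {
                tokens.add(t); // drop whitespace "tokens"
            }
        }
        List<String> arrays = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String t = tokens.get(i);
            // A name followed by "[" is an array; anything else is not.
            if (Character.isLetter(t.charAt(0)) && tokens.get(i + 1).equals("[")) {
                arrays.add(t);
            }
        }
        return arrays;
    }

    public static void main(String[] args) {
        System.out.println(arrayNames("3*x+(b[3]+c)+x"));
    }
}
```

Collecting the tokens into a list first is what makes the one-token look-ahead easy; StringTokenizer itself has no peek method.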
Is there an easy way to parse quoted text as a string in Java? I have lines like this to parse:
author="Tolkien, J.R.R." title="The Lord of the Rings"
publisher="George Allen & Unwin" year=1954
and all I want is Tolkien, J.R.R.,The Lord of the Rings,George Allen & Unwin, 1954 as strings.
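You could use a regex like
"(.+)"
It will match any characters between quotes. In Java that would be:
// needs java.util.regex.Pattern and java.util.regex.Matcher
Pattern p = Pattern.compile("\"(.+)\"");
Matcher m = p.matcher("author=\"Tolkien, J.R.R.\"");
while (m.find()) {
    System.out.println(m.group(1));
}
Note that group(1) is used: it is the first capture group, i.e. the text between the quotes. group(0) is the whole match, including the quotes. Also note that .+ is greedy, so with several quoted values on one line it runs from the first quote to the last one; "([^"]*)" avoids that.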
Of course, you could also use substring to select everything except the first and last char:
String quoted = "\"Tolkien, J.R.R.\"";
String unquoted;
if (quoted.indexOf("\"") == 0 && quoted.lastIndexOf("\"") == quoted.length() - 1) {
    unquoted = quoted.substring(1, quoted.length() - 1);
} else {
    unquoted = quoted;
}
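Since the lines in the question mix quoted values with a bare one (year=1954), here is a hedged sketch of a regex that handles both. It assumes every value follows a key= prefix and that quoted values contain no escaped quotes; the class name and pattern are my own, not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueLine {
    // key= followed by either "quoted text" (group 2) or a bare word (group 3).
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=(?:\"([^\"]*)\"|(\\S+))");

    // Returns the values of all key=value pairs on the line, without quotes.
    static List<String> values(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            out.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(values("author=\"Tolkien, J.R.R.\" title=\"The Lord of the Rings\""));
        System.out.println(values("publisher=\"George Allen & Unwin\" year=1954"));
    }
}
```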
There are some fancy pattern regex nonsense things that fancy people and fancy programmers like to use.
I like to use String.split(). It's a simple function and does what you need it to do.
So if I have a String word: "hello" and I want to take out "hello", I can simply do this:
myStr = string.split("\"")[1];
This will cut the string into bits based on the quote marks.
If I want to be more specific, I can do
myStr = string.split("word: \"")[1].split("\"")[0];
That way I cut it with word: " and "
Of course, you run into problems if word: " is repeated twice, which is what patterns are for. I don't think you'll have to deal with that problem for your specific question.
Also, be cautious around characters like the dot (.). split takes a regex, so such characters have a special meaning and will trigger funny behavior. I think that a backslash in front of them, written "\\" in a Java string literal, will escape those funny rules. Someone correct me if I'm wrong.
Best of luck!
Can you presume your document is well-formed and does not contain syntax errors? If so, you are simply interested in every other token after using String.split().
If you need something more robust, you may need to use the Scanner class (or a StringBuffer and a for loop ;-)) to pick out the valid tokens, taking into account additional criteria beyond "I saw a quotation mark somewhere".
For example, here are some reasons you might need a more robust solution than splitting the string blindly on quotation marks: perhaps it's only a valid token if the quotation mark starting it comes immediately after an equals sign. Or perhaps you need to handle unquoted values as well as quoted ones? Will \" need to be handled as an escaped quotation mark, or does that count as the end of the string? Can values have either single or double quotes (e.g. HTML), or will they always be correctly formatted with double quotes?
One robust way would be to think like a compiler and use a Java-based lexer (such as JFlex), but that might be overkill for what you need.
If you prefer a low-level approach, you could iterate through your input stream character by character using a while loop, and when you see an =" start copying the characters into a StringBuffer until you find another non-escaped ", either concatenating to the various wanted parsed values or adding them to a List of some sort (depending on what you plan to do with your data). Then continue reading until you encounter your start token (e.g. =") again, and repeat.
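The low-level loop could look roughly like the sketch below. It works on a String rather than a stream, uses StringBuilder in place of StringBuffer, and assumes a backslash escapes the character after it; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedValueReader {
    // Collects every value that starts after =" and ends at the next
    // non-escaped double quote.
    static List<String> quotedValues(String input) {
        List<String> values = new ArrayList<>();
        int i = 0;
        while (i < input.length() - 1) {
            if (input.charAt(i) == '=' && input.charAt(i + 1) == '"') {
                StringBuilder sb = new StringBuilder();
                i += 2; // skip the =" start token
                while (i < input.length() && input.charAt(i) != '"') {
                    char c = input.charAt(i);
                    if (c == '\\' && i + 1 < input.length()) {
                        i++;
                        c = input.charAt(i); // keep the escaped char literally
                    }
                    sb.append(c);
                    i++;
                }
                values.add(sb.toString());
            }
            i++; // move past the closing quote or the current char
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(quotedValues("author=\"Tolkien, J.R.R.\" title=\"The Lord of the Rings\""));
    }
}
```

Unquoted values such as year=1954 are simply skipped by this loop, so they would need their own branch.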
I'd like to parse some text using a hand-written recursive-descent parser. I used Scanner with the following delimiter: "\\s*". Unfortunately, the fact that this pattern matches an empty string seems to make every hasNextFoo and nextFoo match nothing any more.
The documentation doesn't say anything about possibly empty delimiters.
You have some objection to the '+' character?
Are you sure you want to use a regular expression at all, and not just an if statement testing for space characters? You say 'runtime'. Is your data in a string, or coming on a stream, or what?
Yes, because I want to use the scanner as a runtime lexer. In short, I want to be able to call scanner.next(pattern), which would either return the matched string or throw an exception without consuming the stream. Spaces should be ignored. If there is a better class for this than Scanner, I would be glad to use it.
I cannot think of any off-the-shelf library class that will do this for you. The normal model of a scanner / lexer is that any invalid character sequence (i.e. one that results in an exception) will be consumed. So, I think you are going to have to implement your own scanner by hand, taking care to treat the read-ahead characters as unconsumed. You could do this with a "pushback" reader or (if that model is not convenient) by explicitly buffering the characters yourself with some kind of mark / reset model. If all you are doing is splitting into tokens separated by one or more spaces, then the pushback reader approach should be fine.
You might also consider StreamTokenizer. Here is an example of using it for one-symbol look-ahead in a recursive-descent parser.
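As a rough illustration of the "buffer it yourself" idea, here is a sketch of a tiny hand-written lexer that works on in-memory text rather than on a stream. It skips leading whitespace, and a failed next(pattern) throws without moving the position. MiniLexer and its methods are invented for this example; it is not a drop-in replacement for Scanner.

```java
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniLexer {
    private static final Pattern WS = Pattern.compile("\\s*");
    private final CharSequence input;
    private int pos = 0; // index of the first unconsumed character

    MiniLexer(CharSequence input) {
        this.input = input;
    }

    // Returns the text matching 'p' at the current position (after optional
    // whitespace), or throws and leaves the position untouched.
    String next(Pattern p) {
        int start = skipWhitespace(pos);
        Matcher m = p.matcher(input).region(start, input.length());
        if (!m.lookingAt()) {
            throw new NoSuchElementException("expected " + p + " at index " + start);
        }
        pos = m.end(); // consume only on success
        return m.group();
    }

    private int skipWhitespace(int from) {
        Matcher m = WS.matcher(input).region(from, input.length());
        return m.lookingAt() ? m.end() : from;
    }
}
```

A parser can then try alternatives in turn, e.g. call next with an identifier pattern, catch the exception, and retry with a number pattern at the same position.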
It's possible to use lookbehinds/lookaheads to explicitly define which delimiters are omittable.
For instance this scanner uses whitespaces as a delimiter but doesn't need them between numbers and words:
new Scanner("1A.23 4 BC-5")
.useDelimiter("\\s+|(?<=\\d)(?=[A-Z])|(?<=[A-Z])(?=[-+.\\d])");
It produces:
1
A
.23
4
BC
-5
The regex consists of three alternatives:
\s+ consecutive whitespace characters are a delimiter.
(?<=\d)(?=[A-Z]) an empty string between a digit and a letter is a delimiter.
(?<=[A-Z])(?=[-+.\d]) an empty string between a letter and '-', '+', '.' or a digit is a delimiter.
(Note: \w can't be used here as it matches digits.)
I'm currently trying to filter a text-file which contains words that are separated with a "-". I want to count the words.
scanner.useDelimiter(("[.,:;()?!\" \t\n\r]+"));
The problem is simply this: if I add "-" to the delimiters, words that contain a "-" get split and counted as two words. So just adding an escaped \- isn't the solution of choice.
How can I change the delimiter-expression, so that words like "foo-bar" will stay, but the "-" alone will be filtered out and ignored?
Thanks ;)
OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense, are separated by punctuation and the like, etc., right?
Example:
This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.
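So, what you need as a delimiter is something that is either any amount of whitespace and/or punctuation (which you already have covered with the regex you showed), or a hyphen that is surrounded by at least one whitespace character on each side. The regex character for "or" is "|". There is a shortcut for the whitespace character class (spaces, tabs, and newlines) in many regex implementations: "\s". Written as a Java string literal (hence the doubled backslashes):
"\\s+-\\s+|[.,:;()?!\"\\s]+"
The hyphen alternative comes first because alternatives are tried from left to right; otherwise the character class would consume the space in front of a lone hyphen on its own and the hyphen would come back as a token.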
If possible try to use the pre-defined classes... makes the regex much easier to read. See java.util.regex.Pattern for options.
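Here is a small sketch of a word counter built on the "punctuation/whitespace, or a hyphen surrounded by whitespace" idea. The hyphen alternative is placed first because Java tries alternatives from left to right. The class name, method name, and test sentence are my own assumptions.

```java
import java.util.Scanner;

class WordCount {
    // A lone hyphen (whitespace on both sides) or any run of
    // punctuation/whitespace separates words; "foo-bar" stays whole.
    private static final String DELIM = "\\s+-\\s+|[.,:;()?!\"\\s]+";

    static int countWords(String text) {
        int count = 0;
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter(DELIM);
            while (sc.hasNext()) {
                sc.next();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("well-known - or so, they say."));
    }
}
```

This does not cover a hyphen at the very start or end of the text, or one that directly follows punctuation; those cases would need extra alternatives.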
Maybe this is what you are looking for:
string.split("\\s+(\\W*\\s)?")
Reads: Match 1 or more whitespace chars optionally followed by zero or more non-word characters and a whitespace character.
This is not very simple. One thing to try would be {current-delimiter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimiter-chars-or-hyphen}.
It might be easier to just ignore words returned by scanner consisting entirely of hyphens
Scanner scanner = new Scanner("one two2 - (three) four-five - ,....|");
scanner.useDelimiter("(\\B+-\\B+|[.,:;()?!\" \t|])+");
while (scanner.hasNext()) {
    System.out.println(scanner.next("\\w+(-\\w+)*"));
}
NB
the next(String) method asserts that you get only words since the original useDelimiter() method misses "|"
NB
you have used the regular expression "\r\n|\n" as line terminator. The JavaDocs for java.util.regex.Pattern shows other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]"
This should be simple enough: [^\\w-]\\W*|-\\W+
But of course if it's prose, and you want to exclude underscores:
[^\\p{Alnum}-]\\P{Alnum}*|-\\P{Alnum}+
or if you don't expect numerics:
[^\\p{Alpha}-]\\P{Alpha}*|-\\P{Alpha}+
EDIT: These are the simpler forms. Keep in mind that a complete solution, one that also handles dashes at the beginning and end of lines, would follow this pattern: (?:^|[^\\w-])\\W*|-(?:\\W+|$)
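A quick way to try the first (simple) form is to hand it to a Scanner as the delimiter. This is only a sketch: the class name is made up, and the sample text is borrowed from another answer in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class HyphenWords {
    // Any non-word, non-hyphen char plus the non-word chars after it, or a
    // hyphen followed by non-word chars, counts as a delimiter.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            sc.useDelimiter("[^\\w-]\\W*|-\\W+");
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("one two2 - (three) four-five - ,....|"));
    }
}
```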