use of delimiter function from scanner for "abc-def" - java

I'm currently trying to filter a text-file which contains words that are separated with a "-". I want to count the words.
scanner.useDelimiter(("[.,:;()?!\" \t\n\r]+"));
The problem which occurs simply is: words that contain a "-" will get separated and counted for being two words. So just escaping with \- isn't the solution of choice.
How can I change the delimiter-expression, so that words like "foo-bar" will stay, but the "-" alone will be filtered out and ignored?
Thanks ;)

OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense, are separated by punctuation and the like, etc., right?
Example:
This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.
So, what you need as delimiter is something that is either any amount of whitespace and/or punctuation (which you already have covered with the regex you showed), or a hyphen that is surrounded by at least one whitespace on each side. The regex character for "or" is "|". There is a shortcut for the whitespace character class (spaces, tabs, and newlines) in many regex implementations: "\s".
"[.,:;()?!\"\s]+|\s+-\s+"

If possible try to use the pre-defined classes... makes the regex much easier to read. See java.util.regex.Pattern for options.
Maybe this is what you are looking for:
string.split("\\s+(\\W*\\s)?"
Reads: Match 1 or more whitespace chars optionally followed by zero or more non-word characters and a whitespace character.

This is not very simple. One thing to try would be {current-delimeter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimeter-chars-or-hyphen}.
It might be easier to just ignore words returned by scanner consisting entirely of hyphens

Scanner scanner = new Scanner("one two2 - (three) four-five - ,....|");
scanner.useDelimiter("(\\B+-\\B+|[.,:;()?!\" \t|])+");
while (scanner.hasNext()) {
System.out.println(scanner.next("\\w+(-\\w+)*"));
}
NB
the next(String) method asserts that you get only words since the original useDelimiter() method misses "|"
NB
you have used the regular expression "\r\n|\n" as line terminator. The JavaDocs for java.util.regex.Pattern shows other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]"

This should be a simple enough: [^\\w-]\\W*|-\\W+
But of course if it's prose, and you want to exclude underscores:
[^\\p{Alnum}-]\\P{Alnum}*|-\\P{Alnum}+
or if you don't expect numerics:
[^\\p{Alpha}-]\\P{Alpha}*|-\\P{Alpha}+
EDIT: These are easier forms. Keep in mind the complete solution, that would handle dashes at the beginning and end of lines would follow this pattern. (?:^|[^\\w-])\\W*|-(?:\\W+|$)

Related

Is it better to use regular expressions to match on content or on delimiter?

For a concrete example, I want to break a text document into sentences. I'm considering using the follow regular expressions (still might need tweaking):
[!?][\s]*|[.\n][!?\s]+[.!?\s]* which matches on punctuation/whitespace (stuff between sentences - delimiters)
(.|\n)*?([!?.]\s+|[\n]{2,}|$): which captures any string of characters followed by punctuation or newline (full sentences - the content I want itself)
Generally, which of these methods are preferred? In my specific context, I'd like to keep track of the begin and end indices of each sentence, so I can't do something as simple as String#split.
Thanks.
Assuming you intend to use vanilla Pattern and Matcher processing, the first regex will usually be capturing much, much fewer characters (ending punctuation and some whitespace characters at most) and as such should be the fastest. This can make a difference if you're parsing a very huge document.
However, it might be clumsier to extract start and end indexes for each sentence, because you'll need information from two consecutive matches to be able to delimit a whole sentence. The second regex directly maps sentences to individual matches and enables the simplest code.
So no free lunch here. Both will get the job done, although you can probably make them more robust depending on the type of document you're targeting. In particular, beware of unexpected punctuation characters in the middle of sentences, as in :
... the "?" character can be used to...

How to remove duplicate characters in a string using regex?

I need to replace the duplicate characters in a string. I tried using
outputString = str.replaceAll("(.)(?=.*\\1)", "");
This replaces the duplicate characters but the position of the characters changes as shown below.
input
haih
output
aih
But I need to get an output hai. That is the order of the characters that appear in the string should not change. Given below are the expected outputs for some inputs.
input
aaaassssddddd
output
asd
input
cdddddggggeeccc
output
cdge
How can this be achieved?
It seems like your code is leaving the last character, so how about this?
outputString = new StringBuilder(str).reverse().toString();
// outputString is now hiah
outputString = outputString.replaceAll("(.)(?=.*\\1)", "");
// outputString is now iah
outputString = new StringBuilder(outputString).reverse().toString();
// outputString is now hai
Overview
It's possible with Oracle's implementation, but I wouldn't recommend this answer for many reasons:
It relies on a bug in the implementation, which interprets *, + or {n,} as {0, 0x7FFFFFFF}, {1, 0x7FFFFFFF}, {n, 0x7FFFFFFF} respectively, which allows the look-behind to contains such quantifiers. Since it relies on a bug, there is no guarantee that it will work similarly in the future.
It is unmaintainable mess. Writing normal code and any people who have some basic Java knowledge can read it, but using the regex in this answer limits the number of people who can understand the code at a glance to people who understand the in and out of regex implementation.
Therefore, this answer is for educational purpose, rather than something to be used in production code.
Solution
Here is the one-liner replaceAll regex solution:
String output = input.replaceAll("(.)(?=(.*))(?<=(?=\\1.*?\\1\\2$).+)","")
Printing out the regex:
(.)(?=(.*))(?<=(?=\1.*?\1\2$).+)
What we want to do is to look-behind to see whether the same character has appeared before or not. The capturing group (.) at the beginning captures the current character, and the look-behind group is there to check whether the character has appeared before. So far, so good.
However, since backreferences \1 doesn't have obvious length, it can't appear in the look-behind directly.
This is where we make use of the bug to look-behind up to the beginning of the string, then use a look-ahead inside the look-behind to include the backreference, as you can see (?<=(?=...).+).
This is not the end of the problem, though. While the non-assertion pattern inside look-behind .+ can't advance past the position after the character in (.), the look-ahead inside can. As a simple test:
"haaaaaaaaa".replaceAll("h(?<=(?=(.*)).*)","$1")
> "aaaaaaaaaaaaaaaaaa"
To make sure that the search doesn't spill beyond the current character, I capture the rest of the string in a look-ahead (?=(.*)) and use it to "mark" the current position (?=\\1.*?\\1\\2$).
Can this be done in one replacement without using look-behind?
I think it is impossible. We need to differentiate the first appearance of a character with subsequent appearance of the same character. While we can do this for one fixed character (e.g. a), the problem requires us to do so for all characters in the string.
For your information, this is for removing all subsequent appearance of a fixed character (h is used here):
.replaceAll("^([^h]*h[^h]*)|(?!^)\\Gh+([^h]*)","$1$2")
To do this for multiple characters, we must keep track of whether the character has appeared before or not, across matches and for all characters. The regex above shows the across matches part, but the other condition kinda makes this impossible.
We obviously can't do this in a single match, since subsequent occurrences can be non-contiguous and arbitrary in number.

how to get rid of things not words, say period etc for a small word count method

I just want to use java to write a simple word count method for an essay. But how can I get rid of things not as a word, say period etc. Thanks!
Assuming that your definition of words includes only the letters of the alphabet, you can just use replaceAll with the appropriate regex. For example, the line below will remove all characters except spaces and letters.
String output = input.replaceAll("[^a-zA-Z ]", "")​
You can do this with the default Scanner provided there is no punctuation not attached to a word, otherwise you can set the characters the Scanner skips. Alternatively you could use regular expressions but that is a bit harder.

Regex to accept only alphabets and spaces and disallowing spaces at the beginning and the end of the string

I have the following requirements for validating an input field:
It should only contain alphabets and spaces between the alphabets.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] by the Unicode code property for letters \p{L}). Then there can be a space followed by at least one letter, this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem in your expression ^(?!\s*$) is, that lookahead will fail, if there is only whitespace till the end of the string. If you want to disallow leading whitespace, just remove the end of string anchor inside the lookahead ==> ^(?!\s)[-a-zA-Z ]*$. But this still allows the string to end with whitespace. To avoid this look back at the end of the string ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think for this task a look around is not needed.
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it would be equivalent to
[ \t\n\x0B\f\r]
Which includes horizontal tab (09), line feed (10), carriage return (13), form feed (12), backspace (08), space (32).
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses negative lookahead and negative lookbehind to disallow spaces at the beginning or at the end of the string, and requiring the match of the entire string.
I think the problem is there's a ? before the negation of white spaces, which means it is optional
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
at least one sequence of letters, then optional string with spaces but always ends with letters
I don't know if words in your accepted string can be seperated by more then one space. If they can:
^[a-zA-Z]+(( )+[a-zA-z]+)*$
If can't:
^[a-zA-Z]+( [a-zA-z]+)*$
String must start with letter (or few letters), not space.
String can contain few words, but every word beside first must have space before it.
Hope I helped.

Java Scanner with empty delimiter

I'd like to parse some text using an hand-written descending parser. I used Scanner with the following delimiter : "\\s*". Unfortunately, the fact that this pattern matches an empty String seems to make every hasNextFoo and nextFoo matching nothing any more.
The documention doesn't say anything about possibly empty delimiters.
You have some objection to the '+' character?
Are you sure you want to use a regular expression at all, and not just an if statement testing for space characters? You say 'runtime'. Is your data in a string, or coming on a stream, or what?
Yes, because i want to use the scanner as a runtime lexer. In short, I want to be able to ask scanner.next(pattern), that would either return the matched string, or return an exception while not consuming the stream. Spaces should be ignored. If there is a better class to do this than scanner, I would be glad to use it.
I cannot think of any off-the-shelf library class that will do this for you. The normal model of a scanner / lexer is that any invalid character sequence (i.e. one that results in an exception) will be consumed. So, I think you are going to have to implement your own scanner by hand, taking care to treat the read-ahead characters as unconsumed. You could do this with a "pushback" reader or (if that model is not convenient) by explicitly buffering the characters yourself with some kind of mark / reset model. If all you are doing is splitting into tokens separated by one or more spaces, then the pushback reader approach should be fine.
You might also consider StreamTokenizer. Here is an example of using it for one-symbol look-ahead in a recursive-descent parser.
It's possible to use lookbehinds/lookaheads to explicitly define which delimiters are omittable.
For instance this scanner uses whitespaces as a delimiter but doesn't need them between numbers and words:
new Scanner("1A.23 4 BC-5")
.useDelimiter("\\s+|(?<=\\d)(?=[A-Z])|(?<=[A-Z])(?=[-+.\\d])");
It produces:
1
A
.23
4
BC
-5
The regex consists of three alternations:
\s+ consecutive whitespaces are a delimiter.
(?<=\d)(?=[A-Z]) an empty string between a digit and a letter is a delimiter.
(?<=[A-Z])(?=[-+.\d]) an empty string between a letter and '-', '+', '.' or
a digit is a delimiter.
(Note: \w can't be used here as it matches digits.)

Categories