Complex regex pattern for replacing commas - java

I have a problem with a regex pattern. It has to replace two commas with one comma followed by a space, and if there is only one comma with no space after it, I want it to add that space.
Currently I am using the pattern /([,]+)/g, but when there is a single comma that is already followed by a space, it adds one more space after it.
Cases:
text,,text -> text, text
text, text -> text, text
text,text -> text, text
text,,text,,text -> text, text, text
(I am using Java)
Do you have any suggestions for how this regex pattern should look? I am still a bit confused.
Thanks.

What you want is to ensure a space behind every comma-chain, not create one in every case. You can do this with a lookahead, but I prefer the inclusive approach: match any spaces that already follow the comma chain and replace them as well.
str = str.replaceAll(",+ *", ", ");
The pattern above consumes all (literal) space characters after the comma(s) and replaces them too, so the single space that you insert is the only one. Note that replaceAll returns a new string, so the result has to be assigned. This will NOT work for line breaks: if the comma(s) are followed by a line break, you will end up with a space between the (new) comma and the line break. If you have line breaks and want to handle them explicitly, you need to proceed differently.
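To check the pattern against the four cases from the question, here is a minimal sketch. The class and method names are made up for illustration; only String.replaceAll is the real API.

```java
class CommaNormalizer {
    // Replaces each run of commas, plus any literal spaces after it,
    // with exactly one comma followed by one space.
    static String normalizeCommas(String input) {
        return input.replaceAll(",+ *", ", ");
    }

    public static void main(String[] args) {
        String[] samples = {"text,,text", "text, text", "text,text", "text,,text,,text"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + normalizeCommas(sample));
        }
    }
}
```

Because the existing spaces are part of the match, running the method twice on the same string gives the same result as running it once.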

Related

Java - How to split a string on "space" with a fraction char like "1 ½"?

So I want to be able to split this string by spaces:
"1 ½ cups fat-free half-and-half, divided "
I wrote my code like this:
String trimmed;
String[] words = trimmed.split(" ");
But it doesn't work! The 1 and the ½ end up in the same position of the array.
I also tried the approach from "How to split a string with any whitespace chars as delimiters", but it does not split the string either. Looking in a text editor there is clearly some sort of "space", but I don't get how to split on it. Is it because of "½"?
You've got a thin space there instead of a "regular" space character.
Regex capturing of this is not trivial, as there are other character classes you need to capture. You would at a minimum want to capture it as an additional grouping...
System.out.println(Arrays.toString(s.split("(\\s|\\u2009)")));
...but you would also need to include all the other non-standard white space characters in this search just to be sure you don't miss any. The above works for your case.
The reason for this is that the space between 1 and ½ is not a regular space (U+0020) but instead a "thin space" (U+2009).
Since String.split(String) accepts a regex pattern, you could for example use the pattern \h instead which represents a "horizontal whitespace character", see Pattern documentation, and matches U+2009.
Or you could use the pattern " |\u2009".
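A small sketch of the \h approach, assuming Java 8 or later (where \h is available) and assuming the input really has U+2009 between "1" and "½" as described above. The class and method names are illustrative, and the special characters are written as Unicode escapes so the source file encoding does not matter.

```java
import java.util.Arrays;

class ThinSpaceSplit {
    static String[] splitOnHorizontalWhitespace(String s) {
        // trim() strips the ordinary trailing space; \h+ then splits on runs of
        // horizontal whitespace, which covers both U+0020 and U+2009.
        return s.trim().split("\\h+");
    }

    public static void main(String[] args) {
        // Assumed input: a thin space between "1" and the fraction character.
        String s = "1\u2009\u00BD cups fat-free half-and-half, divided ";
        // Splitting on a plain space keeps "1" and the fraction together.
        System.out.println(Arrays.toString(s.split(" ")));
        // Splitting on horizontal whitespace separates them.
        System.out.println(Arrays.toString(splitOnHorizontalWhitespace(s)));
    }
}
```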

Using .split() for multiple characters in Java

I'm trying to split an input by ".,:;()[]"'\/!? " chars and add the words to a list. I've tried .split("\\W+?") and .split("\\W"), but both of them are returning empty elements in the list.
Additionally, I've tried .split("\\W+"), which returns only words without any special characters that should go along with them (for instance, if one of the input words is "C#", it writes "C" in the list). Lastly, I've also tried to put all of the special chars above into the .split() method: .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? "), but this isn't splitting the input at all. Could anyone advise please?
split() function accepts a regex.
This is not the regex you're looking for .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? ")
Try creating a character class like [.,:;()\[\]'\\\/!\?\s"] and add + to match one or more occurrences.
I also suggest replacing the space character with the generic \s, which matches all whitespace variations such as \t.
If you're sure about the list of characters you have selected as splitters, this should be your correct split with the correct Java string literal, as @Andreas suggested:
.split("[.,:;()\\[\\]'\\\\\\/!\\?\\s\"]+")
BTW: I've found a particularly useful eclipse editor option which escapes the string when you're pasting them into the quotes. Go to Window/Preferences, under Java/Editor/Typing/, check the box next to Escape text when pasting into a string literal
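Here is a sketch of the character-class split, using the delimiter set from the question. The class and method names are hypothetical. A delimiter at the very start of the input still produces one empty leading element with String.split, so the sketch filters empty strings out.

```java
import java.util.ArrayList;
import java.util.List;

class DelimiterSplit {
    // Character class with the delimiters from the question; \s stands in for the space.
    private static final String DELIMITERS = "[.,:;()\\[\\]'\\\\/!?\\s\"]+";

    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        for (String token : input.split(DELIMITERS)) {
            // A leading delimiter yields an empty first token; skip it.
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("(I like) C#, Java: and more!"));
    }
}
```

Since "#" is not in the character class, "C#" stays in one piece, unlike with \W+.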

How to match n number of lines with regex

I have some text like this
Notes:
He jumped the sea-horse
but it looked ropey
Then he left
but sometimes it's like this
Notes:
Clay is only green
When it is seen
I need to capture 4 lines of text after "Notes" only so the output should be
Notes:
He jumped the sea-horse
but it looked ropey
Then he left
and for the second example
Notes:
I have tried matching the newlines but it only matches after the rest of the regex
Notes:.*\n{4}
How can I create a regex that allows me to repeat the match for the whole line and a newline four times? (is this a non-capturing group??)
You were close - in Notes:.*\n{4} the {4} binds only to the newline, so it only matches "Notes:" followed by anything on the same line, followed by 4 consecutive newline characters.
You're looking for something like Notes:\n((?:.*\n){1,4})
If I understand the question correctly, and you want to capture the first 4 lines, not caring if they are blank or not, it may be better to not use regex at all and just split the text on the newline so you get an array of strings. Something like this:
String[] lines = text.split("\\r?\\n");
Then simply cherry pick the first four elements of the array.
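For the regex route, a minimal sketch of the Notes:\n((?:.*\n){1,4}) pattern in Java. It assumes Unix line endings and that every line to be captured ends with a newline; the class name, method name and sample text are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NotesCapture {
    // The "Notes:" line, then between one and four complete lines (each ending in \n).
    private static final Pattern NOTES = Pattern.compile("Notes:\\n((?:.*\\n){1,4})");

    // Returns the captured lines after "Notes:", or an empty string if nothing matches.
    static String linesAfterNotes(String text) {
        Matcher m = NOTES.matcher(text);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String longNote = "Notes:\nline 1\nline 2\nline 3\nline 4\nline 5\n";
        String shortNote = "Notes:\nClay is only green\nWhen it is seen\n";
        System.out.print(linesAfterNotes(longNote));   // stops after the fourth line
        System.out.print(linesAfterNotes(shortNote));  // takes the two lines available
    }
}
```

The group (?:.*\n) is non-capturing, so the quantifier repeats the whole "line plus newline" unit while group 1 still holds everything that was matched.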

Regex to accept only alphabets and spaces and disallowing spaces at the beginning and the end of the string

I have the following requirements for validating an input field:
It should only contain alphabets and spaces between the alphabets.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] with the Unicode property for letters, \p{L}.) Then there can be a space followed by at least one letter; this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem in your expression ^(?!\s*$) is that the lookahead only fails if there is nothing but whitespace until the end of the string. If you want to disallow leading whitespace, just remove the end-of-string anchor inside the lookahead: ^(?!\s)[-a-zA-Z ]*$. But this still allows the string to end with whitespace. To avoid this, add a lookbehind at the end of the string: ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think for this task a lookaround is not needed.
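A quick way to try the \p{L} version in Java is sketched below. The class and method names are illustrative, and the pattern allows exactly one space between words, as in the expression above.

```java
import java.util.regex.Pattern;

class LetterSpaceValidator {
    // One or more letters, then any number of "single space + letters" groups.
    private static final Pattern NAME = Pattern.compile("^\\p{L}+(?: \\p{L}+)*$");

    static boolean isValid(String input) {
        return NAME.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"John Smith", " John", "John ", "John  Smith", "John3", ""};
        for (String sample : samples) {
            System.out.println("[" + sample + "] -> " + isValid(sample));
        }
    }
}
```

Only the first sample passes; the others fail because of a leading space, a trailing space, a double space, a digit, and empty input respectively.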
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it would be equivalent to
[ \t\n\x0B\f\r]
Which includes space (32), horizontal tab (09), line feed (10), vertical tab (11), form feed (12) and carriage return (13).
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses negative lookahead and negative lookbehind to disallow spaces at the beginning or at the end of the string, and requiring the match of the entire string.
I think the problem is there's a ? before the negation of white spaces, which means it is optional
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
at least one sequence of letters, then optional string with spaces but always ends with letters
I don't know if words in your accepted string can be separated by more than one space. If they can:
^[a-zA-Z]+(( )+[a-zA-Z]+)*$
If they can't:
^[a-zA-Z]+( [a-zA-Z]+)*$
The string must start with one or more letters, not a space.
The string can contain several words, but every word except the first must be preceded by a space.
Hope I helped.
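To see how the ASCII-only variants from these answers differ, here is a short sketch comparing the \s form with the space-only possessive form, both used with String.matches. The class and method names are made up for the comparison.

```java
class WhitespaceStrictness {
    // Accepts any whitespace (space, tab, newline, ...) between words.
    static boolean anyWhitespace(String s) {
        return s.matches("[a-zA-Z]+(\\s+[a-zA-Z]+)*");
    }

    // Accepts only U+0020 between words; possessive quantifiers avoid backtracking.
    static boolean spacesOnly(String s) {
        return s.matches("[a-zA-Z]++(?: ++[a-zA-Z]++)*+");
    }

    public static void main(String[] args) {
        String tabbed = "first\tsecond";
        System.out.println(anyWhitespace(tabbed)); // the tab counts as \s
        System.out.println(spacesOnly(tabbed));    // the tab is not a space
        System.out.println(spacesOnly("first  second"));
        System.out.println(spacesOnly("first "));
    }
}
```

String.matches always tests the entire input, which is why neither pattern needs ^ or $ here.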

use of delimiter function from scanner for "abc-def"

I'm currently trying to filter a text-file which contains words that are separated with a "-". I want to count the words.
scanner.useDelimiter(("[.,:;()?!\" \t\n\r]+"));
The problem is that words containing a "-" get separated and counted as two words. So just adding an escaped \- to the delimiter isn't the solution of choice.
How can I change the delimiter-expression, so that words like "foo-bar" will stay, but the "-" alone will be filtered out and ignored?
Thanks ;)
OK, I'm guessing at your question here: you mean that you have a text file with some "real" prose, i.e. sentences that actually make sense, are separated by punctuation and the like, etc., right?
Example:
This situation is ameliorated - as far as we can tell - by the fact that our most trusted allies, the Vorgons, continue to hold their poetry slam contests; the enemy has little incentive to interfere with that, even with their Mute-O-Matic devices.
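So, what you need as a delimiter is something that is either any amount of whitespace and/or punctuation (which you already have covered with the regex you showed), or a hyphen that is surrounded by at least one whitespace character on each side. The regex character for "or" is "|". There is a shortcut for the whitespace character class (spaces, tabs, and newlines) in many regex implementations: "\s". The hyphen alternative has to be tried first, and the whole group repeated, so that the space in front of a lone hyphen is not consumed on its own. As a Java string literal (backslashes doubled):
"(?:\\s+-\\s+|[.,:;()?!\"\\s])+"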
If possible try to use the pre-defined classes... makes the regex much easier to read. See java.util.regex.Pattern for options.
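A sketch of a hyphen-aware delimiter with Scanner, using the punctuation set from the question. The class and method names are illustrative. The pattern tries the whitespace-hyphen-whitespace branch before the punctuation branch and repeats the whole group, because a plain alternation with the punctuation class first would match only the space in front of a lone hyphen and leave "-" behind as a token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class HyphenAwareWords {
    // Either a hyphen surrounded by whitespace, or one punctuation/whitespace
    // character; the group repeats so adjacent delimiters form a single match.
    private static final String DELIMITER = "(?:\\s+-\\s+|[.,:;()?!\"\\s])+";

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(DELIMITER);
            while (scanner.hasNext()) {
                result.add(scanner.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> found = words("Mute-O-Matic devices - as far as we can tell - work.");
        System.out.println(found.size() + " words: " + found);
    }
}
```

A hyphen with whitespace on only one side (as in "well -known") is not treated as a delimiter by this sketch and stays attached to the neighbouring word.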
Maybe this is what you are looking for:
string.split("\\s+(\\W*\\s)?");
Reads: Match 1 or more whitespace chars optionally followed by zero or more non-word characters and a whitespace character.
This is not very simple. One thing to try would be {current-delimiter-chars}{zero-or-more-hyphens}{zero-or-more-current-delimiter-chars-or-hyphen}.
It might be easier to just ignore words returned by the scanner that consist entirely of hyphens.
Scanner scanner = new Scanner("one two2 - (three) four-five - ,....|");
scanner.useDelimiter("(\\B+-\\B+|[.,:;()?!\" \t|])+");
while (scanner.hasNext()) {
    System.out.println(scanner.next("\\w+(-\\w+)*"));
}
NB
the next(String) method asserts that you get only words since the original useDelimiter() method misses "|"
NB
you have used the regular expression "\r\n|\n" as line terminator. The JavaDocs for java.util.regex.Pattern shows other possible line terminators, so a more complete check would use the expression "\r\n|[\r\n\u2028\u2029\u0085]"
This should be simple enough: [^\\w-]\\W*|-\\W+
But of course if it's prose, and you want to exclude underscores:
[^\\p{Alnum}-]\\P{Alnum}*|-\\P{Alnum}+
or if you don't expect numerics:
[^\\p{Alpha}-]\\P{Alpha}*|-\\P{Alpha}+
EDIT: These are the easier forms. Keep in mind that a complete solution, one that also handles dashes at the beginning and end of lines, would follow this pattern: (?:^|[^\\w-])\\W*|-(?:\\W+|$)
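The complete pattern can be tried with String.split as sketched below. The class and method names are hypothetical, and the sketch drops empty pieces because a delimiter at the start of the line leaves an empty first element.

```java
import java.util.ArrayList;
import java.util.List;

class ProseTokenizer {
    // Delimiter: (start of line or a non-word, non-hyphen character) plus any
    // following non-word characters, or a hyphen followed by non-word
    // characters or the end of the line.
    private static final String DELIMITER = "(?:^|[^\\w-])\\W*|-(?:\\W+|$)";

    static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        for (String piece : line.split(DELIMITER)) {
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("- foo-bar - baz, qux -"));
    }
}
```

A hyphen between two word characters never matches either branch, so "foo-bar" stays whole while the free-standing dashes disappear.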
