I'm trying to split an input by ".,:;()[]"'\/!? " chars and add the words to a list. I've tried .split("\\W+?") and .split("\\W"), but both of them are returning empty elements in the list.
Additionally, I've tried .split("\\W+"), which returns only words without any special characters that should go along with them (for instance, if one of the input words is "C#", it writes "C" in the list). Lastly, I've also tried to put all of the special chars above into the .split() method: .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? "), but this isn't splitting the input at all. Could anyone advise please?
The split() method accepts a regex.
This is not the regex you're looking for: .split("\\.,:;\\(\\)\\[]\"'\\\\/!\\? "). It matches only that exact sequence of characters, one after another, which never occurs in your input.
Try creating a character class like [.,:;()\[\]'\\\/!\?\s"] and add + to match one or more occurrences.
I also suggest replacing the space character with the generic \s, which matches all whitespace variations such as \t.
If you're sure about the list of characters you have selected as splitters, this should be the correct split call, written as a proper Java string literal as @Andreas suggested:
.split("[.,:;()\\[\\]'\\\\\\/!\\?\\s\"]+")
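As a quick check of the character class above, here is a minimal sketch. The class name PunctuationSplitDemo, the helper splitWords and the sample sentence are made up for illustration; only the regex comes from the answer.

```java
import java.util.Arrays;

final class PunctuationSplitDemo {
    // Character class from the answer, written as a Java string literal.
    static final String DELIMITERS = "[.,:;()\\[\\]'\\\\\\/!\\?\\s\"]+";

    // Hypothetical helper: splits on runs of the listed delimiter characters.
    static String[] splitWords(String input) {
        return input.split(DELIMITERS);
    }

    public static void main(String[] args) {
        // '#' is not in the class, so "C#" survives as a single token.
        System.out.println(Arrays.toString(splitWords("Hello, world! C# (is) [fun]")));
        // prints [Hello, world, C#, is, fun]
    }
}
```

Note that if the input starts with a delimiter, split() still returns one empty string at index 0; only trailing empty strings are dropped.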
BTW: I've found a particularly useful Eclipse editor option that escapes text when you paste it between the quotes of a string literal. Go to Window/Preferences, under Java/Editor/Typing/, and check the box next to "Escape text when pasting into a string literal".
So I want to be able do split this string by spaces:
"1 ½ cups fat-free half-and-half, divided "
I wrote my code like this:
String trimmed;
String[] words = trimmed.split(" ");
But it doesn't work! The 1 and the ½ end up in the same position of the array.
I also tried the approach from "How to split a string with any whitespace chars as delimiters", but it does not split the string either. Looking in a text editor, there is clearly some sort of "space", but I don't get how to split on it. Is it because of "½"?
You've got a thin space there instead of a "regular" space character.
Matching this with a regex is not trivial, because by default \s does not cover it, and there are other whitespace characters you would need to cover too. At a minimum you would want to add it as an additional alternative...
System.out.println(Arrays.toString(s.split("(\\s|\\u2009)")));
...but you would also need to include all the other non-standard white space characters in this search just to be sure you don't miss any. The above works for your case.
The reason for this is that the space between 1 and ½ is not a regular space (U+0020) but instead a "thin space" (U+2009).
Since String.split(String) accepts a regex pattern, you could for example use the pattern \h instead which represents a "horizontal whitespace character", see Pattern documentation, and matches U+2009.
Or you could use the pattern " |\u2009".
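A minimal sketch of the \h approach, assuming the input really contains U+2009 between "1" and "½" (written here with Unicode escapes so the source file encoding does not matter). The class and method names are invented for the example.

```java
import java.util.Arrays;

final class ThinSpaceSplitDemo {
    // Hypothetical helper: splits on runs of horizontal whitespace.
    // \h covers the regular space (U+0020), tab, and the thin space (U+2009), among others.
    static String[] splitOnHorizontalWhitespace(String s) {
        return s.split("\\h+");
    }

    public static void main(String[] args) {
        // "1", thin space, "½" (U+00BD), then regular spaces for the rest.
        String s = "1\u2009\u00bd cups fat-free half-and-half, divided ";
        String[] words = splitOnHorizontalWhitespace(s);
        System.out.println(words.length);           // prints 6
        System.out.println(Arrays.toString(words));
    }
}
```

The trailing space does not produce an extra element, because split() discards trailing empty strings.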
To remove quotation marks in Java, I understand I can use
replaceAll("\"", "");
Ex: "Hello World" becomes Hello World.
However, it only removes this type of quotation marks "". Is there a way to remove quotes like this “Hello World” ?
If you simply want to remove those 3 kinds of double-quotes, irrespective of the context:
replaceAll("[\"“”]", "");
If there are other kinds of quote characters that you want to remove, just add them before the ].
These pages list some of the other quote characters that you might encounter:
https://unicode-table.com/en/sets/quotation-marks/
https://en.wikipedia.org/wiki/Quotation_mark
And also see:
Is there a regex to grab all quotation marks?
which talks about the difficulty in creating a regex to match all of them in a future-proof fashion.
Note that since we are including some "funky" characters (non-ASCII) in the source code (above), it is important that the Java compiler is aware of the character encoding that the source code uses. We could avoid that by using Unicode escapes instead. For example:
replaceAll("[\"\u201c\u201d]", "");
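For illustration, a small runnable sketch of the escaped variant. The class name QuoteStripDemo and the helper stripDoubleQuotes are hypothetical; the character class is the one shown above.

```java
final class QuoteStripDemo {
    // Removes U+0022 (straight quote), U+201C (left curly quote) and U+201D (right curly quote).
    static String stripDoubleQuotes(String s) {
        return s.replaceAll("[\"\u201c\u201d]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDoubleQuotes("\u201cHello World\u201d")); // prints Hello World
        System.out.println(stripDoubleQuotes("\"Hello World\""));         // prints Hello World
    }
}
```

Unlike the capturing-group approach, this removes the quote characters wherever they appear, paired or not.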
You may try a regex replacement here, e.g.
String input = "“Hello World”";
System.out.println(input.replaceAll("“(.*?)”", "$1")); // prints Hello World
I have a problem that I can't seem to find an answer here for, so I'm asking it.
The thing is that I have a string and I have delimiters. I want to create an array of strings from the things which are between those delimiters (might be words, numbers, etc). However, if I have two delimiters next to one another, the split method will return an empty string for one of the instances.
I tested this against even more delimiters that are in succession. I found out that if I have n delimiters, I will have n-1 empty strings in the result array. In other words, if I have both "," and " " as delimiters, and the sentence "This is a very nice day, isn't it", then the array with results would be like:
{... , "day", "", "isn't" ...}
I want to get those extra empty strings out and I can't figure out how to do that. A sample regex for the delimiters that I have is:
"[\\s,.-\\'\\[\\]\\(\\)]"
Also can you explain why there are extra empty strings in the result array?
P.S. I read some of the similar posts, which included information about the second parameter of split(). I tried negative, zero, and positive values, and I didn't get the result that I'm looking for. (One of the questions had an answer saying that -1 as a parameter might solve the problem, but it didn't.)
You can use this regex for splitting:
[\\s,.'\\[\\]()-]+
Keep an unescaped hyphen at the first or last position in the character class; otherwise it is treated as a range operator, as in A-Z or 0-9.
You must use the quantifier + to match one or more delimiters in a row.
I think your problem is just the regex itself. You should use a greedy quantifier, and escape the hyphen so that .-' is not read as a character range:
"[\\s,.\\-'\\[\\]\\(\\)]+"
See http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#sum
X+ ... X, one or more times
Your regular expression describes just one single character. If you want it to match multiple separators at once, use a quantifier:
String s = "This is a very nice day, isn't it";
String[] tokens = s.split("[\\s,.\\-\\[\\]()']+");
(Note the '+' at the end of the expression)
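To see where the empty strings come from, here is a small sketch comparing the two patterns on the sentence from the question. It uses a reduced delimiter set (whitespace and comma only) to keep the output easy to read; the class and method names are made up.

```java
import java.util.Arrays;

final class ConsecutiveDelimiterDemo {
    // Each single delimiter character is its own match, so ", " is two matches
    // with an empty token between them.
    static String[] splitSingle(String s) {
        return s.split("[\\s,]");
    }

    // With '+', the whole run ", " is one match, so no empty token appears.
    static String[] splitRuns(String s) {
        return s.split("[\\s,]+");
    }

    public static void main(String[] args) {
        String s = "This is a very nice day, isn't it";
        System.out.println(Arrays.toString(splitSingle(s)));
        // prints [This, is, a, very, nice, day, , isn't, it]
        System.out.println(Arrays.toString(splitRuns(s)));
        // prints [This, is, a, very, nice, day, isn't, it]
    }
}
```

This is also why n adjacent delimiters yield n-1 empty strings: every match ends one token and starts the next, and the tokens between neighbouring matches have length zero.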
If you want to get rid of empty strings, you can use the Guava project Splitter class.
on method:
Returns a splitter that uses the given fixed string as a separator.
Example (ignoring empty strings):
System.out.println(
Splitter.on(',')
.trimResults()
.omitEmptyStrings()
.split("foo,bar,, qux")
);
Output:
[foo, bar, qux]
onPattern method:
Returns a splitter that considers any subsequence matching a given
pattern (regular expression) to be a separator.
Example (ignoring empty strings):
System.out.println(
Splitter
.onPattern("([,.|])")
.trimResults()
.omitEmptyStrings()
.split("foo|bar,, qux.hi")
);
Output:
[foo, bar, qux, hi]
For more details, consult Splitter documentation.
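If adding Guava is not an option, a similar result can be sketched with the standard library alone. This is not Guava's API, just an alternative built on Pattern.splitAsStream (Java 8+); the class name NonEmptyTokensDemo and the method tokens are invented for the example.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class NonEmptyTokensDemo {
    // Same separators as the onPattern example: comma, dot, or pipe.
    private static final Pattern SEPARATORS = Pattern.compile("[,.|]");

    // Splits, trims each piece, and drops pieces that end up empty.
    static List<String> tokens(String s) {
        return SEPARATORS.splitAsStream(s)
                .map(String::trim)
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokens("foo|bar,, qux.hi")); // prints [foo, bar, qux, hi]
    }
}
```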
I have a list of words wordsList and a string text.
I need to remove from the text words that are in wordsList.
Example: "You have knowlegde in Java, PHP, Oracle, etc."
In this case, you, have, in and etc are some of the words in wordsList.
So, I need to remove them and replace by a whitespace.
Please, how do I do that?
I think I should visit every item from the list, then check if the text contains it.
What regular expression can I use to replace the words (to remove)?
They can be followed by whitespace or punctuation.
Expected output for this example: "knowlegde Java, PHP, Oracle, ."
PS: I can not remove punctuation!
I am using Java.
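Try this one:
String text = "bla bla ....";
for (String word : wordsList) {
    // \\b is a word boundary, so only whole words match, even next to punctuation.
    // Pattern.quote (java.util.regex.Pattern) keeps the word from being read as regex syntax.
    text = text.replaceAll("(?i)\\b" + Pattern.quote(word) + "\\b", " ");
}
// Collapse the runs of whitespace left behind by the removals.
text = text.replaceAll("\\s+", " ").trim();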
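As a single-pass variant, the words can be joined into one alternation so the text is scanned once. This is only a sketch under a few assumptions: the list is non-empty, matching should ignore case (so "You" is removed by "you"), and the names WordRemovalDemo and removeWords are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class WordRemovalDemo {
    // Assumes 'words' is non-empty; an empty alternation would match at every word boundary.
    static String removeWords(String text, List<String> words) {
        // Quote each word so regex metacharacters in it are treated literally.
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern wholeWords = Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
        // Replace each whole-word match with a space, then collapse whitespace runs.
        String replaced = wholeWords.matcher(text).replaceAll(" ");
        return replaced.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        List<String> wordsList = Arrays.asList("you", "have", "in", "etc");
        System.out.println(removeWords("You have knowlegde in Java, PHP, Oracle, etc.", wordsList));
        // prints knowlegde Java, PHP, Oracle, .
    }
}
```

Punctuation is left untouched because \b matches between a word character and a comma or period without consuming either.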
I am working on an assignment that requires me to read in a text file of sentences. After this I am trying to use the delimiters as specified to restrict what is coming in and place that into an array.
scannerInput.useDelimiter("\\p{Punct}|\\p{Digit}|\\p{javaWhitespace}");
My problem is that when I read in the text file and place the words into an array there are large gaps of what appears to be whitespace between indexes in the array.
For example the output of the array would look like:
array[0] =
array[1] = tony
array[2] =
array[3] = sue
I am assuming there are some formatting characters or other I am missing in my delimiter list. I am wondering what I am missing to remove all additional whitespace so that I may be able to have only the words in the array. As of now my first 30 indexes are essentially blank.
Or if there is an easy way to find out what is really behind what appears to be whitespace. I assume it isn't just empty. Thanks for your help.
Your delimiter matches only a single character, so every additional delimiter in a row produces an empty token. You probably need it to match runs of characters:
scannerInput.useDelimiter("\\p{Punct}+|\\p{Digit}+|\\p{javaWhitespace}+");
and, if there may be multiple types of delimiter between words (not just whitespace or just digits), then change it to the regex suggested by @David Ehrmann.
Try:
scannerInput.useDelimiter("[\\p{Punct}\\p{Digit}\\p{javaWhitespace}]+");
It'll gobble consecutive delimiters. I also switched from alternation to a character class because you're only matching single characters. \p{Punct} is, itself, a character class, and character classes match faster than a group with alternation.
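A minimal sketch of the combined delimiter in action, reading from a String instead of a file. The class name ScannerDelimiterDemo, the method words and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class ScannerDelimiterDemo {
    // Collects the tokens that remain after treating any run of punctuation,
    // digits, or whitespace as a single delimiter.
    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            sc.useDelimiter("[\\p{Punct}\\p{Digit}\\p{javaWhitespace}]+");
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("  tony, 42 sue!!\n")); // prints [tony, sue]
    }
}
```

Mixed runs such as ", 42 " are consumed as one delimiter here, which the alternation of three separate + groups would not guarantee.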