How to parse a string by words

How to parse a string by words - java

I need to parse a string by extracting all words. So far I have figured out how to split words on any symbols. But how can I rewrite the code to discard words that contain digits or any other non-letter characters? Here is my code:
String s = "AaA bbd cDef d1s s/4 +xx_x asdgag 34545rtrtr.";
Pattern p = Pattern.compile("\\b[A-Za-z]+\\b");
System.out.println(Arrays.asList(s.split(p.pattern())));
Invalid words:
“d1s”, “s/4”, “+xx_x”, “34545rtrtr.”
Valid words:
“AaA”, “bbd”, “cDef”, “asdgag”

Try something like:
"\\b[A-Za-z]+\\b"
Where:
\b marks a word boundary.
[A-Za-z] matches any ASCII letter, upper or lower case.
+ means "one or more".
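One caveat: \b also treats "/" as a boundary, so a find() with this pattern would still pick up the "s" in "s/4", and split() returns the text between matches rather than the matches themselves. Below is a minimal sketch of one alternative, assuming a valid word is a whitespace-delimited run of ASCII letters. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterWords {
    // Assumption: a valid word is a run of ASCII letters with whitespace
    // (or the start/end of the input) on both sides.
    private static final Pattern WORD = Pattern.compile("(?<!\\S)[A-Za-z]+(?!\\S)");

    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {          // collect the matches, not the gaps between them
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String s = "AaA bbd cDef d1s s/4 +xx_x asdgag 34545rtrtr.";
        System.out.println(extract(s)); // [AaA, bbd, cDef, asdgag]
    }
}
```

The lookarounds (?<!\S) and (?!\S) replace \b so that a token is accepted only when nothing but whitespace touches it.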

Related

Split String end with special characters - Java

I have a string which I want to first split by space, and then separate the words from the special characters.
For Example, let's say the input is:
Hi, How are you???
I already wrote the logic to split by space here:
String input = "Hi, How are you???";
String[] words = input.split("\\s+");
Now, I want to separate each word from the special characters.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
If the string does not end with any special characters, just ignore it.
Can you please help me with the regular expression and code for this in Java?

The following regex should help you out:
(\s+|[^A-Za-z0-9]+)
This is not written as a Java string literal, so in Java you need to escape each backslash with another backslash (write \\s).
It matches runs of whitespace (\s+) and runs of characters other than A-Z, a-z and 0-9. This is a workaround, since there isn't (or at least I do not know of) a regex class for "special characters".
If you use this regex with the split function, it will return the words, not the special characters and whitespace it matched on.
UPDATE
According to another answer here on SO, Java has \P{Alpha}, which matches any non-alphabetic character. So you could try:
(\s|\P{Alpha})+

I want to separate each word from the special character.
For example: "Hi," to {"Hi", ","} and "you???" to {"you", "???"}
A regex that achieves the above behavior:
String stringToSearch ="Hi, you???";
Pattern p1 = Pattern.compile("[a-z]{0}\\b");
String[] str = p1.split(stringToSearch);
System.out.println(Arrays.asList(str));
Output:
[Hi, , , you, ???]

@mike is right: we need to split the sentence on special characters, leaving only the words. Here is the code:
`public static void main(String[] args) {
String match = "Hi, How are you???";
String[] words = match.split("\\P{Alpha}+");
for(String word: words) {
System.out.print(word + " ");
}
}`
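To get the pairs from the question ({"Hi", ","} and {"you", "???"}) directly, one option is to split each whitespace-separated token at the point where the word ends. This is a minimal sketch, assuming a "word" is a run of ASCII letters or digits and everything after it counts as special characters. TokenSplitter and splitTokens are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenSplitter {
    // Zero-width split point: a letter/digit on the left, anything else on the right.
    private static final String WORD_TO_SPECIAL = "(?<=\\p{Alnum})(?=\\P{Alnum})";

    static List<String[]> splitTokens(String input) {
        List<String[]> result = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            // A token without trailing special characters stays a one-element array.
            result.add(token.split(WORD_TO_SPECIAL));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] parts : splitTokens("Hi, How are you???")) {
            System.out.println(Arrays.toString(parts));
        }
        // [Hi, ,]
        // [How]
        // [are]
        // [you, ???]
    }
}
```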

Regex add space between all punctuation

I need to add spaces between all punctuation in a string.
// "Hello: World." -> "Hello : World ."
// "It's 9:00?" -> "It ' s 9 : 00 ?"
// "1.B,3.D!" -> "1 . B , 3 . D !"
I think a regex is the way to go, matching all non-punctuation [a-zA-Z\\d]+, adding a space before and/or after, then extracting the remainder matching all punctuation [^a-zA-Z\\d]+.
But I don't know how to (recursively?) call this regex. Looking at the first example, the regex will only match the "Hello". I was thinking of just building a new string by continuously removing and appending the first instance of the matched regex, while the original string is not empty.
private String addSpacesBeforePunctuation(String s) {
StringBuilder builder = new StringBuilder();
final String nonpunctuation = "[a-zA-Z\\d]+";
final String punctuation = "[^a-zA-Z\\d]+";
String found;
while (!s.isEmpty()) {
// regex stuff goes here
found = ???; // found group from respective regex goes here
builder.append(found);
builder.append(" ");
s = s.replaceFirst(found, "");
}
return builder.toString().trim();
}
However, this doesn't feel like the right way to go. I think I'm overcomplicating things.

You can use a lookaround-based regex with the \p{Punct} punctuation class in Java:
str = str.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
(?<=\\S) asserts that the previous character is not whitespace
(?<=\\p{Punct}) asserts that the previous character is a punctuation character
(?=\\p{Punct}) asserts that the next character is a punctuation character
(?=\\S) asserts that the next character is not whitespace

When you see a punctuation mark, there are four possibilities:
1. The punctuation is surrounded by spaces.
2. The punctuation is preceded by a space.
3. The punctuation is followed by a space.
4. The punctuation is neither preceded nor followed by a space.
Here is code that does the replacement properly:
String ss = s
.replaceAll("(?<=\\S)\\p{Punct}", " $0")
.replaceAll("\\p{Punct}(?=\\S)", "$0 ");
It uses two expressions: the first adds the missing space before a punctuation mark (case 3), and the second adds the missing space after it (case 2). Since the expressions are applied one after the other, they take care of case 4 as well. Case 1 requires no change.
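As a quick check, the single-pass lookaround regex and the two-pass replacement can be run side by side on the three examples from the question. This is only a sketch for comparison. The class and method names are invented, and the patterns are copied from the two answers.

```java
class PunctuationSpacer {
    // Single pass: insert a space at every zero-width position that touches
    // punctuation and has non-whitespace on both sides.
    static String withLookarounds(String s) {
        return s.replaceAll("(?<=\\S)(?:(?<=\\p{Punct})|(?=\\p{Punct}))(?=\\S)", " ");
    }

    // Two passes: first add the missing space before each punctuation mark,
    // then the missing space after it.
    static String withTwoPasses(String s) {
        return s.replaceAll("(?<=\\S)\\p{Punct}", " $0")
                .replaceAll("\\p{Punct}(?=\\S)", "$0 ");
    }

    public static void main(String[] args) {
        String[] samples = {"Hello: World.", "It's 9:00?", "1.B,3.D!"};
        for (String sample : samples) {
            System.out.println(withLookarounds(sample) + " | " + withTwoPasses(sample));
        }
    }
}
```

For these three inputs both methods produce the strings listed in the question, for example "It ' s 9 : 00 ?".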

Specific Regex Pattern

I wish to take a string input from the user and extract words or numbers like so:
String problem = "I'm lo#o#king t%o ext!r$act a^ll 6 su*bs(tr]i{ngs.";
String[] solve = {"I'm", "looking", "to", "extract", "all", "6", "substrings"};
Basically, I want to extract numbers and words with complete disregard to punctuation except apostrophes. I know how to get words and strings but I can't seem to figure out this tricky part.

You could do something like the following:
String s = "I'm lo#o#king t%o ext!r$act a^ll 6 su*bs(tr]i{ngs.";
String parts[] = s.replaceAll("[^\\s\\w']|(?<!\\b)'|'(?!\\b)", "").split("\\s+");
System.out.println(Arrays.toString(parts));
Output:
[I'm, looking, to, extract, all, 6, substrings]
Explanation:
[^\\s\\w'] matches any character that is not whitespace, a word character, or an apostrophe.
(?<!\\b)'|'(?!\\b) matches an apostrophe that is not between two word characters, i.e. one with no word boundary directly before it or directly after it.
The replaceAll function replaces all matched characters with an empty string.
Finally, we split the resulting string on one or more whitespace characters.
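The same idea can be wrapped in a small helper. This is a sketch rather than a definitive implementation. The regex is the one from the answer, while the class name, the method name, and the extra trim/empty-input handling are assumptions added here so that leading whitespace does not produce an empty first element.

```java
import java.util.Arrays;

class ApostropheWords {
    // Removes all punctuation except apostrophes that sit between two word
    // characters, then splits the remainder on whitespace.
    static String[] extract(String text) {
        String cleaned = text
                .replaceAll("[^\\s\\w']|(?<!\\b)'|'(?!\\b)", "")
                .trim();
        // Assumption: blank input should yield an empty array, not [""].
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        String problem = "I'm lo#o#king t%o ext!r$act a^ll 6 su*bs(tr]i{ngs.";
        System.out.println(Arrays.toString(extract(problem)));
        // [I'm, looking, to, extract, all, 6, substrings]
    }
}
```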

How to replace last letter to another letter in java using regular expression

I have seen "," replaced with "." by using ".$"|",$", but this logic does not work with letters.
I need to replace the last letter of every word in the string EXAMPLE_TEST with another letter, using Java.
This is my code:
Pattern replace = Pattern.compile("n$"); // here is the real problem
matcher2 = replace.matcher(EXAMPLE_TEST);
EXAMPLE_TEST = matcher2.replaceAll("k");
I also tried "//n$", "\n$", etc.
Please help me find a solution.
Input text: njan ayman
Output text: njak aymak

Instead of the end-of-string anchor $, use a word boundary \b:
String s = "njan ayman";
s = s.replaceAll("n\\b", "k");
System.out.println(s); //=> "njak aymak"
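To see why the original "n$" attempt changes only one word: $ anchors at the end of the whole input, whereas \b matches at the end of every word. A small sketch comparing the two (the class and method names are invented for illustration):

```java
class LastLetterDemo {
    // "$" matches only at the end of the input, so at most one "n" is replaced.
    static String replaceAtEndOfInput(String s) {
        return s.replaceAll("n$", "k");
    }

    // "\\b" matches after the last letter of every word.
    static String replaceAtEndOfWords(String s) {
        return s.replaceAll("n\\b", "k");
    }

    public static void main(String[] args) {
        String text = "njan ayman";
        System.out.println(replaceAtEndOfInput(text)); // njan aymak
        System.out.println(replaceAtEndOfWords(text)); // njak aymak
    }
}
```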

You can use a lookahead and a capturing group:
String EXAMPLE_TEST = "njan ayman";
String s = EXAMPLE_TEST.replaceAll("(n)(?=\\s|$)", "k");
System.out.println("s = " + s); // prints: s = njak aymak
Explanation:
(n) - the character to match
(?=\\s|$) - a lookahead requiring that it is followed by whitespace or the end of the input
The above is only an example! If you want to replace every comma at the end of a word with a period, the middle line should be changed to:
s = s.replaceAll("(,)(?=\\s|$)", "\\.");

Here's how I would set it up:
(?=.\b)\w
Which in Java would need to be escaped as following:
(?=.\\b)\\w
It translates to something like "a word character (\w) at a position where, looking ahead (?=), a single character (.) is immediately followed by a word boundary (\b)", i.e. the last character of each word.
String s = "njan ayman aowkdwo wdonwan. wadawd,.. wadwdawd;";
s = s.replaceAll("(?=.\\b)\\w", "");
System.out.println(s); //nja ayma aowkdw wdonwa. wadaw,.. wadwdaw;
This removes the last character of every word but leaves any following non-alphanumeric characters. You can restrict the match to specific characters by changing the . to something else.
However, the other answers are perfectly good and might achieve exactly what you are looking for.

// oldLetter and newLetter are placeholder String variables, e.g. "n" and "k"
if (word.endsWith(oldLetter)) {
word = word.substring(0, word.length() - 1) + newLetter;
}
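Applied to a whole sentence, the non-regex approach could look like the sketch below. It assumes words are separated by single spaces and that the old letter is non-empty. LastLetterSwap and both method names are hypothetical.

```java
import java.util.StringJoiner;

class LastLetterSwap {
    // Swaps the ending of one word if it matches oldLetter; otherwise returns it unchanged.
    static String swapLast(String word, String oldLetter, String newLetter) {
        if (word.endsWith(oldLetter)) {
            return word.substring(0, word.length() - oldLetter.length()) + newLetter;
        }
        return word;
    }

    // Assumption: words are separated by single spaces.
    static String swapLastInEachWord(String text, String oldLetter, String newLetter) {
        StringJoiner out = new StringJoiner(" ");
        for (String word : text.split(" ")) {
            out.add(swapLast(word, oldLetter, newLetter));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(swapLastInEachWord("njan ayman", "n", "k")); // njak aymak
    }
}
```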

regex delete heading and tailing punctuation

I am trying to write a regex in Java to get rid of all leading and trailing punctuation characters except for "-" in a String, while keeping the punctuation within words intact.
I tried replacing the punctuation with "" using String regex = "[\\p{Punct}+&&[^-]]";, but that deletes the punctuation within words too.
I also tried matching with String regex = "[(\\w+\\p{Punct}+\\w+)]"; and Matcher.matches() to extract a group, but it gives me null for the input String word = "#(*&wor(&d#)(".
I am wondering what the right way is to handle regex group matching in this case.
Examples:
Input: #)($&word#)($& Output: word
Input: #)($)word#google.com#)(*$&$ Output: word#google.com

Pattern p = Pattern.compile("^\\p{Punct}*(.*?)\\p{Punct}*$");
Matcher m = p.matcher("#)($)word#google.com#)(*$&$");
if (m.matches()) {
System.out.println(m.group(1));
}
To give some more info: the key is to anchor the regex to the beginning and end of the string (^ and $) and to make the middle part match non-greedily (using *? instead of just *).
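The pattern above strips "-" as well, since \p{Punct} includes it. If hyphens at the edges should survive, as the question asks, one possible variation is to combine the asker's character-class intersection with the two anchors. This is a sketch under that assumption, and the names are made up for illustration.

```java
class PunctuationTrimmer {
    // Any punctuation character except '-', using character-class intersection.
    private static final String EDGE = "[\\p{Punct}&&[^-]]+";

    // Removes a run of such characters at the start and at the end of the input only.
    static String trim(String word) {
        return word.replaceAll("^" + EDGE + "|" + EDGE + "$", "");
    }

    public static void main(String[] args) {
        System.out.println(trim("#)($&word#)($&"));               // word
        System.out.println(trim("#)($)word#google.com#)(*$&$"));  // word#google.com
        System.out.println(trim("#-word-!"));                     // -word-
    }
}
```

Because the class is anchored with ^ and $, punctuation inside the word (such as the "#" and "." in word#google.com) is never touched.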
