Java regex to preserve ngrams in square brackets - java

I am a bit of a newbie with Java regex so I wonder if anyone can help where I need a regex to split text based on ngrams. So if I have text like this:
dyson [salisbury matheson beaumont] clarke [carstairs morden] vaughan
To return the following ngrams:
Unigram: dyson
Trigram: salisbury matheson beaumont
Unigram: clarke
Bigram: carstairs morden
Unigram: vaughan
The contents of the square brackets are preserved as bigrams or trigrams?
The split would be based upon spaces outside the brackets.

That's pretty easy:
\w+|\[([\w\s]+)\]
Demo
Explanation:
\w+ matches a word (a series of alphanumeric characters or an underscore)
or: \[([\w\s]+)\]
\[ matches a [
[\w\s]+ matches a series of words and spaces, this is captured
\] matches a ]
If you have a capture it means you have something in brackets, else it means you have a single word. You can then apply the simple \w+ regex to the contents of the brackets to extract the words.
To use it in Java you have to escape the backslashes to pass them as-is to the regex engine:
String pattern = "\\w+|\\[([\\w\\s]+)\\]";

Related

How to extract and replace a String with specific format?

I have input String like;
(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)
What I want to do is find all words starting with "rm" and replace them with remove function.
(remove(01ADS21212), 'adfffddd', remove(Adssssss), '1231232131', remove(2321312322))
I am trying to use replaceAll function but I don't know how to extract parts after "rm" literal.
statement.replaceAll("\\(rm*.,", "remove($1)");
Is there any way to get these parts?
You have not captured any substring with a capturing group, thus $1 is null.
You may use
.replaceAll("\\brm(\\w*)", "remove($1)")
See the regex demo
Details
\b - a word boundary (to start matching only at the start of a word)
rm - a literal part
(\w*) - Group 1: 0+ word chars (letters, digits or underscores)
The $1 in the replacement pattern stands for Group 1 value.
If you mean to match any chars other than a comma and whitespace after rm, use "\\brm([^\\s,]*)", see this regex demo.
Use "Replace" with empty string .
Eg;
string str = "(rm01ADS21212, 'adfffddd', rmAdssssss, '1231232131', rm2321312322)";
Console.WriteLine(str.Replace("rm", ""));
Output : (01ADS21212, 'adfffddd', Adssssss, '1231232131', 2321312322)

RegEx for combining multiple sequences

As many people ,i am struggling with what it seems a "trivial" regex issue.
in a given text, whenever I encounter a word within {} brackets i need to extract it.At first i used
"\\{-?(\\w{3,})\\}"
and it worked ok:
as long as the word didnt have any white space or special character like ' .
For example {Project} returns Project.But {Project Test} or {Project D'arce} don't return anything.
i know that for white characters i need to use \s.But it is absolutely not clear for me how to add to the above , i tried :
"%\\{-?(\\w(\\s{3,})\\)\\}"))
but not working.Also what if i want to add words containing a special characters like ' ??? Its really frustrating
How about matching any character inside {..} which is not }?
To do so you can use negated character class [^..] like [^}]. So your regex can look like
"\\{[^}]{3,}\\}"
But if you want to limit your regex only to some specific alphabet you can also use character class to combine many characters and even predefined shorthand character classes like \w \s \d and so on.
So if you want to accept any word character \w or whitespace \s or ' your regex can look like
"\\{[\\w\\s']{3,}\\}"
You could use a character class [\w\s']and add to it what you could allow to match:
\{-?([\w\s']{3,})}
In Java
String regex = "\\{-?([\\w\\s']{3,})}";
Regex demo
If you want to prevent matching only 3 whitespace chars, you could use a repeating group:
\{-?\h*([\w']{3,}(?:\h+[\w']+)*)\h*}
About the pattern
\{ Match { char
-? Optional hyphen
\h* Match 0+ times a horizontal whitespace char
([\w\s']{3,}) Capture in a group matching 3 or more times either a word char, whitespace char or '
(?:\h[\w']+)* Repeat 0+ times matching 1+ horizontal whitespace chars followed by what is listed in the character class
\h* Match 0+ times a horizontal whitespace char
} Match }
In Java
String regex = "\\{-?\\h*([\\w']{3,}(?:\\h+[\\w']+)*)\\h*}";
Regex demo

Replace substring of text matching regexp

I have text that looks like something like this:
1. Must have experience in Java 2. Team leader...
I want to render this in HTML as an ordered list. Now adding the </li> tag to the end is simple enough:
s = replace(s, ". ", "</li>");
But how do I go about replacing the 1., 2. etc with <li>?
I have the regular expression \d*\.$ which matches a number with a period, but the problem is is that is a substring so matching 1. Must have experience in Java 2. Team leader with \d*\.$ returns false.
Code
See regex in use here
\d+\.\s+(.*?)\s*(?=\d+\.\s+|$)
Replace
<li>$1</li>\n
Results
Input
Must have experience in Java 2. Team leader...
Output
<li>Must have experience in Java</li>
<li>Team leader...</li>
Explanation
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
(.*?) Capture any character any number of times, but as few as possible, into capture group 1
\s* Match any number of whitespace characters
(?=\d+\.\s+|$) Positive lookahead ensuring either of the following doesn't match
\d+\.\s+
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
$ Assert position at the end of the line
But how do I go about replacing the 1., 2. etc with <li>?
You can use String#replaceAll which can allow regex instead of replace :
s = s.replaceAll("\\d+\\.\\s", "</li>");
Note
You don't need to use $ in the end of your regex.
You have to escape dot . because it's mean any character in regex
You can use \s for one space or \s* for zero or more spaces or \s+ for one or more space
We want
<ol>
<li>one</li>
<li>two<li>
</ol>
This can be done as:
s = s.replaceAll("(?s)(\\d+\\.)\\s+(.*\\.)\\s*", "<li>$2</li></ol>");
s = s.replaceFirst("<li>", "<ol><li>");
s = s.replaceAll("(?s)</li></ol><li>", "</li>\n<li>");
The trick is to first add </li></ol> with a spurious </ol> that should only remain after the last list item.
(?s) is the DOTALL notation, causing . to also match line breaks.
In case of more than one numbered list this will not do. Also it assumes one single sentence per list item.

java regex matching &[text(text - text text) !text]

I am currently working on creating a regex to split out all occurrences of Strings that match the following format: &[text(text - text text) !text]. Here text can be any char really. and the spacing is important. The text will be listed as shown.
I have tried the following regex but I cannot seem to get it to work:
&\\[([^\\]]*)\\]
Any help would be greatly appreciated.
You replace text with \w+ to capture 1 or more word characters.
Assuming everything else was a literal, the following regular expression should work:
&\[\w+\(\w+ - \w+ \w+\) !\w+\]
You could also use [a-zA-Z] in place of \w if you would like. It is sometimes easier to understand since it explicitly describes the characters to match, a-z and A-Z inclusive.
&\[[a-zA-Z]+\([a-zA-Z]+ - [a-zA-Z]+ [a-zA-Z]+\) ![a-zA-Z]+\]
And for one character only, remove the +
&\[\w\(\w - \w \w\) !\w\]
&\[[a-zA-Z]\([a-zA-Z] - [a-zA-Z] [a-zA-Z]\) ![a-zA-Z]\]
P.S - I cant remember if -, &, or ! are coutned as regex symbols and if they are you can make them literals by using \-, \&, or \!.
P.P.S - In java you have to escape \ so \w becomes \\w in a string.
If you want to extract text as groups to work with them after:
&\\[(\\w+)\\((\w+)\\s\\-\\s(\\w+)\\s(\\w+)\\)\\s!(\\w+)]
example

Regex for First word and last word of a string separates with

I'm trying to get a regex for the following expression but can't make it:
String have 4 words separated with dots(.).
First word matches a given one (HELLO for example).
Second and third words could have any character but dot itself (.).
Last word matches a given one again(csv for example).
So:
HELLO.something.Somethi#gElse.csv should match.
something.HELLO.?.csv shouldn't match.
HELLO.something...csv shouldn't match.
HELLO.something.somethingelse.notcsv shouldn't match
I can do it with split(.) and then check for individual words, but I'm trying to get it working with Regex and Pattern class.
Any help would be really appreciated.
This is relatively straightforward, as long as you understand character classes. A regex with square brackets [xyz] matches any character from the list {x, y, z}; a regex [^xyz] matches any character except {x, y, z}.
Now you can construct your expression:
^HELLO\.[^.]+\.[^.]+\.csv$
+ means "one or more of the preceding expression"; \. means "dot itself". ^ means "the beginning of the string"; $ means "the end of the string". These anchors prevent regex from matching
blahblahHELLO.world.world.csvblahblah
Demo.
A common goal for writing regular expressions like that is to capture some content, for example, the string between the first and the second dot, and the string between the second and the third dot. Use capturing groups to bring the content of these strings into your Java program:
^HELLO\.([^.]+)\.([^.]+)\.csv$
Each pair of parentheses defines a capturing group, indexed from 1 (group at index zero represents the capture of the entire expression). Once you obtain a match object from the pattern, you can query it for the groups, and extract the corresponding strings.
Note that backslashes in Java regex need to be doubled.
(^HELLO\.[^.]+\.[^.]+\.csv$)
Here is the same regex with token explanation on regex101.

Categories