Regex to split by special characters with exceptions JAVA


I am very new to regular expressions and I'm having difficulties with this one:
I want to split a String into tokens of three kinds: a bracketed token like <\herewouldbeurl> in the example below, a quoted string like "text here", and a quoted string followed by ^^ and more text, like "text here"^^ (this should be treated as one token in the output).
Note these symbols: ^^
Each of the three cases can be repeated many times, they can follow one another, and they are always separated by spaces.
Example:
<\herewouldbeurl> "HEY THERE" "Asioc-project.org/."^^<\anotherurl/>
would produce:
1.<\herewouldbeurl>
2."HEY THERE"
3."Asioc-project.org/."^^<\anotherurl/>
I've found this: "\s+(?=(?:(?<=[a-zA-Z])\"(?=[A-Za-z])|\"[^\"]\"|[^\"])$)" but it does not work for the third case.
Any ideas?

Don't use split(). Use a find() loop.
    String input = "<\\herewouldbeurl> \"HEY THERE\" \"Asioc-project.org/.\"^^<\\anotherurl/>";
    Pattern p = Pattern.compile("\".*?\"(?:\\^\\^)?");
    Matcher m = p.matcher(input);
    int start = 0;
    while (m.find()) {
        String s = input.substring(start, m.start()).trim();
        if (! s.isEmpty())
            System.out.println(s);
        System.out.println(m.group());
        start = m.end();
    }
    String s = input.substring(start).trim();
    if (! s.isEmpty())
        System.out.println(s);
Output
<\herewouldbeurl>
"HEY THERE"
"Asioc-project.org/."^^
<\anotherurl/>
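Note that the loop above prints "Asioc-project.org/."^^ and <\anotherurl/> on separate lines, whereas the question asks for them as one token. Below is a minimal sketch of one way to keep them together. It assumes, as the question states, that tokens are separated by whitespace, and additionally that quoted strings contain no escaped quotes. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class TokenSplitter {
    // Either a quoted string, optionally followed by ^^ and a run of
    // non-space characters, or any other run of non-space characters.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"]*\"(?:\\^\\^\\S+)?|\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "<\\herewouldbeurl> \"HEY THERE\" \"Asioc-project.org/.\"^^<\\anotherurl/>";
        // Prints the three tokens from the question, one per line.
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

Matching the tokens directly, instead of printing the text between matches, avoids the bookkeeping with start and the trailing substring.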

Related

Regex to get value between two colon excluding the colons

I have a string like this:
something:POST:/some/path
Now I want to extract just POST from the string. I did this using this regex:
:([a-zA-Z]+):
But this gives me the value along with the colons, i.e. I get this:
:POST:
but I need this
POST
My code to match the same and replace it is as follows:
    String ss = "something:POST:/some/path/";
    Pattern pattern = Pattern.compile(":([a-zA-Z]+):");
    Matcher matcher = pattern.matcher(ss);
    if (matcher.find()) {
        System.out.println(matcher.group());
        ss = ss.replaceFirst(":([a-zA-Z]+):", "*");
    }
    System.out.println(ss);
EDIT:
I've decided to use the lookahead/lookbehind regex since I did not want to use replace with colons such as :*:. This is my final solution.
    String s = "something:POST:/some/path/";
    String regex = "(?<=:)[a-zA-Z]+(?=:)";
    Matcher matcher = Pattern.compile(regex).matcher(s);
    if (matcher.find()) {
        s = s.replaceFirst(matcher.group(), "*");
        System.out.println("replaced: " + s);
    }
    else {
        System.out.println("not replaced: " + s);
    }

There are two approaches:
1. Keep your Java code, and use lookahead/lookbehind (?<=:)[a-zA-Z]+(?=:), or
2. Change your Java code to replace the result with ":*:"
Note: You may want to define a String constant for your regex, since you use it in different calls.

As pointed out, the regex's captured group can be used for the replacement.
The following code did it:
    String ss = "something:POST:/some/path/";
    Pattern pattern = Pattern.compile(":([a-zA-Z]+):");
    Matcher matcher = pattern.matcher(ss);
    if (matcher.find()) {
        ss = ss.replaceFirst(matcher.group(1), "*");
    }
    System.out.println(ss);

UPDATE
Looking at your update, you only need replaceFirst:
    String result = s.replaceFirst(":[a-zA-Z]+:", ":*:");
When you use (?<=:)[a-zA-Z]+(?=:), the regex engine checks each location inside the string for a : before it and, once one is found, tries to match 1+ ASCII letters and then asserts that there is a : after them. With :[A-Za-z]+:, the checking only starts once the regex engine has found a : character. Then, after matching :POST:, the replacement pattern replaces the whole match. It is totally OK to hardcode colons in the replacement pattern since they are hardcoded in the regex pattern.
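As a quick check of the comparison above, here is a small sketch that runs both patterns on the string from the question. The class and method names are made up for illustration. Both calls should give the same result, because the lookarounds match only the letters while the second pattern consumes the colons and puts them back in the replacement.

```java
// Hypothetical demo class, not from the original answers.
class ColonReplaceDemo {
    // Lookbehind/lookahead: only the letters are part of the match.
    static String withLookarounds(String s) {
        return s.replaceFirst("(?<=:)[a-zA-Z]+(?=:)", "*");
    }

    // Literal colons: they are consumed and restored by the replacement.
    static String withLiteralColons(String s) {
        return s.replaceFirst(":[a-zA-Z]+:", ":*:");
    }

    public static void main(String[] args) {
        String s = "something:POST:/some/path/";
        System.out.println(withLookarounds(s));   // something:*:/some/path/
        System.out.println(withLiteralColons(s)); // something:*:/some/path/
    }
}
```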
Original answer
You just need to access Group 1:
    if (matcher.find()) {
        System.out.println(matcher.group(1));
    }
Your :([a-zA-Z]+): regex contains a capturing group (see (....) subpattern). These groups are numbered automatically: the first one has an index of 1, the second has the index of 2, etc.
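To make the group numbering concrete, here is a minimal sketch with a made-up class name. It shows that group(0), which is the same as group(), is the whole match including the colons, while group(1) is only what the parentheses captured.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answers.
class GroupDemo {
    // Returns { whole match, first capturing group }, or an empty array if nothing matches.
    static String[] groups(String s) {
        Matcher m = Pattern.compile(":([a-zA-Z]+):").matcher(s);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] g = groups("something:POST:/some/path/");
        System.out.println(g[0]); // :POST:
        System.out.println(g[1]); // POST
    }
}
```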
To replace it, use Matcher#appendReplacement():
    String s = "something:POST:/some/path/";
    StringBuffer result = new StringBuffer();
    Matcher m = Pattern.compile(":([a-zA-Z]+):").matcher(s);
    while (m.find()) {
        m.appendReplacement(result, ":*:");
    }
    m.appendTail(result);
    System.out.println(result.toString());

This is your solution:
regex = (:)([a-zA-Z]+)(:)
And code is:
String ss = "something:POST:/some/path/";
ss = ss.replaceFirst("(:)([a-zA-Z]+)(:)", "$1*$3");
ss now contains:
something:*:/some/path/
Which I believe is what you are looking for...

Regular expression to remove everything but words. java

This code doesn't seem to do the right job. It removes the spaces between the words!
input = scan.nextLine().replaceAll("[^A-Za-z0-9]", "");
I want to remove all extra spaces and all numbers or abbreviations from a string, except words and this character: '.
For Example:
input: 34 4fF$##D one 233 r # o'clock 329riewio23
returns: one o'clock

    public static String filter(String input) {
        return input.replaceAll("[^A-Za-z0-9' ]", "").replaceAll(" +", " ");
    }
The first replaceAll removes every character except ASCII letters, digits, the single quote, and spaces; because 0-9 is in the character class, digits are kept. The second replaceAll collapses each run of one or more spaces into a single space.

Your solution doesn't work because you don't replace numbers and you also replace the ' character.
Check out this solution:
    Pattern pattern = Pattern.compile("[^| ][A-Za-z']{2,} ");
    String input = scan.nextLine();
    Matcher matcher = pattern.matcher(input);
    StringBuilder result = new StringBuilder();
    while (matcher.find()) {
        result.append(matcher.group());
    }
    System.out.println(result.toString());
Note that [^| ] is a negated character class: it matches any single character other than | or a space, not "the beginning of the string or a space" (that would be written (^| )). The pattern therefore matches one such character followed by two or more letters or apostrophes ([A-Za-z']{2,}) and a trailing space. As a result, a word is only taken if it is at least three characters long and is followed by a space, and its first character may also be a digit or a symbol.
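An alternative that avoids these edge cases is to split on whitespace and keep only the tokens that consist entirely of letters and apostrophes. This is a sketch, not the original answer's method. It assumes that a "word" is a whitespace-separated token of two or more such characters, which is inferred from the expected output (the single letter r is dropped). The class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original answers.
class WordFilter {
    static String keepWords(String input) {
        List<String> kept = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            // Keep tokens made only of letters/apostrophes, at least 2 characters long.
            if (token.matches("[A-Za-z']{2,}")) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Prints: one o'clock
        System.out.println(keepWords("34 4fF$##D one 233 r # o'clock 329riewio23"));
    }
}
```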

If you want to just extract that time information use this regex group match:
    input = scan.nextLine();
    Pattern p = Pattern.compile("([a-zA-Z]{3,})\\s.*?(o'clock)");
    Matcher m = p.matcher(input);
    if (m.find()) {
        input = m.group(1) + " " + m.group(2);
    }
The regex is quite naive though, and will only work if the input is always of a similar format.

split string to parts by regex

I need to split a string into parts using a regex.
The string is AA2 DE3 or AA2, and I need the 2.
String code = "AA2 DE3";
String[] parts = code.split("^(AA(\\d)+){1}( )?(\\w*)?$");
and here length of parts is 0.
I tried
String[] parts = code.split("^((AA){1}(\\d)+){1}( )?(\\w*)?$");
but also 0.
It looks like the regex is wrong, although it works fine in PHP.
Edit:
In fact, I need to get the number after "AA", but there may be an additional word after it.

Assuming that you only want to extract the number and don't care to validate the rest:
    Pattern pattern = Pattern.compile("^AA(\\d+)");
    Matcher matcher = pattern.matcher(code);
    String id = null;
    if (matcher.find()) {
        id = matcher.group(1);
    }
Note that I rewrote (\d)+ as (\d+) to capture all the digits. When there is more than one digit, your regex captures only the last digit.
If you want to keep your validation:
Pattern pattern = Pattern.compile("^AA(\\d+) ?\\w*$");
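To illustrate the difference between (\d)+ and (\d+) noted above, here is a small sketch. The input AA23 DE3 and the class name are made up for this example. With a repeated group, the group keeps only its last iteration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not from the original answers.
class RepeatedGroupDemo {
    // Returns the text of capturing group 1, or null if the regex does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String code = "AA23 DE3"; // made-up input with a two-digit number
        System.out.println(firstGroup("^AA(\\d)+", code)); // 3  (last repetition only)
        System.out.println(firstGroup("^AA(\\d+)", code)); // 23 (all digits)
    }
}
```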

With String.split, the regex specifies the delimiter, i.e. what goes in between the parts. In your case, the regex matches the entire string, so nothing is left over and split returns an empty array.
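A short sketch of this behaviour, using the regex and input from the question (the class and method names are made up): when the delimiter regex matches the whole string, only empty strings remain, and split drops trailing empty strings, so the array has length 0. Splitting on a space gives the two parts.

```java
import java.util.Arrays;

// Hypothetical demo class, not from the original answers.
class SplitDemo {
    // The delimiter regex from the question: it matches the entire input.
    static String[] splitWholeMatch(String code) {
        return code.split("^(AA(\\d)+){1}( )?(\\w*)?$");
    }

    // A delimiter that only matches the separator between the parts.
    static String[] splitOnSpace(String code) {
        return code.split(" ");
    }

    public static void main(String[] args) {
        String code = "AA2 DE3";
        System.out.println(splitWholeMatch(code).length);         // 0
        System.out.println(Arrays.toString(splitOnSpace(code)));  // [AA2, DE3]
    }
}
```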
If you want to match this regex, use:
    Pattern pattern = Pattern.compile("^(AA(\\d)+){1}( )?(\\w*)?$");
    Matcher matcher = pattern.matcher(code);
    if(!matcher.matches()) {
        // the string doesn't match your regex; handle this
    } else {
        String part1 = matcher.group(1);
        String part2 = matcher.group(2);
        // repeat the above line similarly for the third and fourth groups
        // do something with part1/part2/...
    }

It is indeed better to use the Pattern and Matcher APIs for this.
The following is purely for academic purposes, in case you must use String#split only. You can use this lookbehind-based regex for the split:
(?<=AA\\d{1,999}) *
Code:
String[] toks = "AA2 DE3".split( "(?<=AA\\d{1,999}) *" ); // [AA2, DE3]
OR
String[] toks = "AA2".split( "(?<=AA\\d{1,999}) *" ); // [AA2]

If you'd like String#split() to handle the Pattern/Matcher for you, you can use:
    String[] inputs = { "AA2 DE3", "AA3", "BB45 FG6", "XYZ321" };
    try {
        for (String input : inputs) {
            System.out.println(
                input.split(" ")[0].split("(?=\\d+$)", 2)[1]
            );
        }
    } catch (ArrayIndexOutOfBoundsException e) {
        System.err.println("Input format is incorrect.");
    }
Output :
2
3
45
321
If the input is guaranteed to start with AA, you can also use
System.out.println(
input.split(" ")[0].split("(?<=^AA)")[1]
);

Get what was removed by String.replaceAll()

So, let's say I have my regular expression
    String regex = "\\d*";
for finding any digits.
Now I also have an input string, for example
    String input = "We got 34 apples and too much to do";
Now I want to replace all digits with "", like this:
    input = input.replaceAll(regex, "");
When I now print input I get "We got  apples and too much to do". It works: it replaced the 3 and the 4 with "".
Now my question: is there any way (maybe an existing library?) to get what was actually replaced?
The example here is very simple, just to show how it works. I want to use it for more complex inputs and regexes.
Thanks for your help.

You can use a Matcher with the append-and-replace procedure:
    String regex = "\\d*";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    StringBuffer sb = new StringBuffer();
    StringBuffer replaced = new StringBuffer();
    while(matcher.find()) {
        replaced.append(matcher.group());
        matcher.appendReplacement(sb, "");
    }
    matcher.appendTail(sb);
    System.out.println(sb.toString()); // prints the replacement result
    System.out.println(replaced.toString()); // prints what was replaced
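If the individual removed pieces are needed rather than one concatenated string, a variation is to collect each match into a list before calling replaceAll. This is a sketch with made-up names, and it skips the empty matches that a pattern like \d* produces at every non-digit position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answer.
class RemovalTracker {
    // Returns every non-empty match of regex in input, in order of appearance.
    static List<String> removed(String input, String regex) {
        List<String> removed = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                removed.add(m.group());
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        String input = "We got 34 apples and too much to do";
        String regex = "\\d*";
        System.out.println(removed(input, regex));        // [34]
        System.out.println(input.replaceAll(regex, ""));  // the string without the digits
    }
}
```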

Particular Regular expression in Java

I have a text string that looks as follows:
word word word {{t:word word|word}} word word {{t:word|word}} word word...
I'm interested in extracting all strings that start with "{{t" and end with "}}". I don't care about the rest. I don't know in advance the number of words in "{{..|..}}". If the words inside weren't separated by spaces, splitting the text on spaces would work. I'm not sure how to write a regular expression to get this done. I thought about running over the text, char by char, and storing everything between "{{t:" and "}}", but I would like to know a cleaner way to do the same.
Thank you!
EDIT
Expected output from above:
An array of strings String[] a where a[0] is {{t:word word|word}} and a[1] is {{t:word|word}}.

How about this, using non-greedy matching so that it doesn't find ":word word|word}} word word {{t:word|word" as a single group:
    String s = "word word word {{t:word word|word}} word word {{t:word|word}} word word";
    Pattern p = Pattern.compile("\\{\\{t:(.*?)\\}\\}");
    Matcher m = p.matcher(s);
    while (m.find()) {
        //System.out.println(m.group(1));
        System.out.println(m.group());
    }
Edit:
Changed to m.group() so that the results contain the delimiters.
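Since the question asks for a String[], the same loop can collect the matches into a list and convert it at the end. A minimal sketch, with a made-up class and method name:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original answers.
class TemplateExtractor {
    // Returns every {{t:...}} occurrence, delimiters included.
    static String[] extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("\\{\\{t:.*?\\}\\}").matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String s = "word word word {{t:word word|word}} word word {{t:word|word}} word word";
        // Prints: [{{t:word word|word}}, {{t:word|word}}]
        System.out.println(Arrays.toString(extract(s)));
    }
}
```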

Using the java.util.regex package works miracles here:
    Pattern p = Pattern.compile("\\{\\{t(.*?)\\}\\}");//escaping + capturing group
    Matcher m = p.matcher(str);
    Set<String> result = new HashSet<String>();//can also be a list or whatever
    while(m.find()){
        result.add(m.group(1));
    }
The capturing group can also span the entire regex, to include the {{ and }}, like so: "(\\{\\{t.*?\\}\\})"

This worked for me:
    import java.util.regex.*;
    class WordTest {
        public static void main( String ... args ) {
            String input = "word word word {{t:word word|word}} word word {{t:word|word}} word word...";
            Pattern p = Pattern.compile("(\\{\\{.*?\\}\\})");
            Matcher m = p.matcher( input );
            while( m.find() ) {
                System.out.println( m.group(1) );
            }
        }
    }
