Regex look ahead to seperate string into tokens

Regex look ahead to seperate string into tokens - java

I currently have the following code which allows me to find matches from a String.
I need to be able to find all words similar to 64xand split them up into tokens, so I'll get 64 and x as the output.
I have looked at regexs lookahead and this does not solve the issue, is there a way to do this without creating a new arraylist to store matches similar to 64x then splitting them up?
String input = "Hello world 65x";
ArrayList<String> userInput = new ArrayList<>();
Matcher isMatch = Pattern.compile("[0-9]*+[a-zA-Z]")
.matcher(input);
while (isMatch.find()) {
userInput.add(isMatch.group());
}

You can try the following regular expression:
\b(\p{Digit}+)(\p{Alpha})\b
Additionally, if you plan to use the regular expression very often, it is recommended to use a constant in order to avoid recompile it each time, e.g.:
private static final Pattern REGEX_PATTERN =
Pattern.compile("\\b(\\p{Digit}+)(\\p{Alpha})\\b");
public static void main(String[] args) {
String input = "Hello world 65x";
Matcher matcher = REGEX_PATTERN.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
}
}
Output:
65
x

No need of lookaheads, you can use nested captured groups:
Matcher isMatch = Pattern.compile("\\b([0-9]+)([a-zA-Z])\\b");
Group #1 will contain 65 and group #2 will contain x.
Better to add \\b (word boundary) on either side to avoid matching abc56xyz

You just need to use Matcher.group(int). This lets you extract pieces of the matched text. Read about caputring groups here. A regex that contains capturing groups is \\b([0-9]+)([a-zA-Z])\\b (as given by anubhava).

Related

Regex to replace a repeating string pattern

I need to replace a repeated pattern within a word with each basic construct unit. For example
I have the string "TATATATA" and I want to replace it with "TA". Also I would probably replace more than 2 repetitions to avoid replacing normal words.
I am trying to do it in Java with replaceAll method.

I think you want this (works for any length of the repeated string):
String result = source.replaceAll("(.+)\\1+", "$1")
Or alternatively, to prioritize shorter matches:
String result = source.replaceAll("(.+?)\\1+", "$1")
It matches first a group of letters, and then it again (using back-reference within the match pattern itself). I tried it and it seems to do the trick.
Example
String source = "HEY HEY duuuuuuude what'''s up? Trololololo yeye .0.0.0";
System.out.println(source.replaceAll("(.+?)\\1+", "$1"));
// HEY dude what's up? Trolo ye .0

You had better use a Pattern here than .replaceAll(). For instance:
private static final Pattern PATTERN
= Pattern.compile("\\b([A-Z]{2,}?)\\1+\\b");
//...
final Matcher m = PATTERN.matcher(input);
ret = m.replaceAll("$1");
edit: example:
public static void main(final String... args)
{
System.out.println("TATATA GHRGHRGHRGHR"
.replaceAll("\\b([A-Za-z]{2,}?)\\1+\\b", "$1"));
}
This prints:
TA GHR

Since you asked for a regex solution:
(\\w)(\\w)(\\1\\2){2,};
(\w)(\w): matches every pair of consecutive word characters ((.)(.) will catch every consecutive pair of characters of any type), storing them in capturing groups 1 and 2. (\\1\\2) matches anytime the characters in those groups are repeated again immediately afterward, and {2,} matches when it repeats two or more times ({2,10} would match when it repeats more than one but less than ten times).
String s = "hello TATATATA world";
Pattern p = Pattern.compile("(\\w)(\\w)(\\1\\2){2,}");
Matcher m = p.matcher(s);
while (m.find()) System.out.println(m.group());
//prints "TATATATA"

Regular expression match a-alphanumeric&b-digits&c-digits

I have query about java regular expressions. Actually, I am new to regular expressions.
So I need help to form a regex for the statement below:
Statement: a-alphanumeric&b-digits&c-digits
Possible matching Examples: 1) a-90485jlkerj&b-34534534&c-643546
2) A-RT7456ffgt&B-86763454&C-684241
Use case: First of all I have to validate input string against the regular expression. If the input string matches then I have to extract a value, b value and c value like
90485jlkerj, 34534534 and 643546 respectively.
Could someone please share how I can achieve this in the best possible way?
I really appreciate your help on this.

you can use this pattern :
^(?i)a-([0-9a-z]++)&b-([0-9]++)&c-([0-9]++)$
In the case what you try to match is not the whole string, just remove the anchors:
(?i)a-([0-9a-z]++)&b-([0-9]++)&c-([0-9]++)
explanations:
(?i) make the pattern case-insensitive
[0-9]++ digit one or more times (possessive)
[0-9a-z]++ the same with letters
^ anchor for the string start
$ anchor for the string end
Parenthesis in the two patterns are capture groups (to catch what you want)

Given a string with the format a-XXX&b-XXX&c-XXX, you can extract all XXX parts in one simple line:
String[] parts = str.replaceAll("[abc]-", "").split("&");
parts will be an array with 3 elements, being the target strings you want.
The simplest regex that matches your string is:
^(?i)a-([\\da-z]+)&b-(\\d+)&c-(\\d+)
With your target strings in groups 1, 2 and 3, but you need lot of code around that to get you the strings, which as shown above is not necessary.

Following code will help you:
String[] texts = new String[]{"a-90485jlkerj&b-34534534&c-643546", "A-RT7456ffgt&B-86763454&C-684241"};
Pattern full = Pattern.compile("^(?i)a-([\\da-z]+)&b-(\\d+)&c-(\\d+)");
Pattern patternA = Pattern.compile("(?i)([\\da-z]+)&[bc]");
Pattern patternB = Pattern.compile("(\\d+)");
for (String text : texts) {
if (full.matcher(text).matches()) {
for (String part : text.split("-")) {
Matcher m = patternA.matcher(part);
if (m.matches()) {
System.out.println(part.substring(m.start(), m.end()).split("&")[0]);
}
m = patternB.matcher(part);
if (m.matches()) {
System.out.println(part.substring(m.start(), m.end()));
}
}
}
}

Java Regexp capturing group includes space, why?

I am trying to parse this string,
"斬釘截鐵 斩钉截铁 [zhan3 ding1 jie2 tie3] /to chop the nail and slice the iron (idiom)/resolute and decisive/unhesitating/definitely/without any doubt/";
With this code
private static final Pattern TRADITIONAL = Pattern.compile("(.*?) ");
private String extractSinglePattern(String row, Pattern pattern) {
Matcher matcher = pattern.matcher(row);
if (matcher.find()) {
return matcher.group();
}
return null;
}
However, for some reason the string returned contains a space at the end
org.junit.ComparisonFailure: expected:<斬釘截鐵[]> but was:<斬釘截鐵[ ]>
Is there something wrong with my pattern?
I have also tried
private static final Pattern TRADITIONAL = Pattern.compile("(.*?)\\s");
but to no avail
I have also tried matching with two spaces at the end of the pattern, but it doesn't match (there is only one space).

You're using Matcher.group() which is documented as:
Returns the input subsequence matched by the previous match.
The match includes the space. The capturing group within the match doesn't, but you haven't asked for that.
If you change your return statement to:
return matcher.group(1);
then I believe it'll do what you want.

use this regular expression (.+?)(?=\s+)

regex - matching between a literal string and a quotation mark

I'm terrible at Regex and would greatly appreciate any help with this issue, which I think will be newb stuff for anyone familiar.
I'm getting a response like this from a REST call
{"responseData":{"translatedText":"Ciao mondo"},"responseDetails":"","responseStatus":200,"matches":[{"id":"424913311","segment":"Hello World","translation":"Ciao mondo","quality":"74","reference":"","usage-count":50,"subject":"All","created-by":"","last-updated-by":null,"create-date":"2011-12-29 19:14:22","last-update-date":"2011-12-29 19:14:22","match":1},{"id":"0","segment":"Hello World","translation":"Ciao a tutti","quality":"70","reference":"Machine Translation provided by Google, Microsoft, Worldlingo or the MyMemory customized engine.","usage-count":1,"subject":"All","created-by":"MT!","last-updated-by":null,"create-date":"2012-05-14","last-update-date":"2012-05-14","match":0.85}]}
All I need is the 'Ciao mondo' in between those quotations. I was hoping with Java's Split feature I could do this but unfortunately it doesn't allow two separate delimiters as then I could have specified the text before the translation.
To simplify, what I'm stuck with is the regex to gather whatever is inbetween translatedText":" and the next "
I'd be very grateful for any help

You can use \"translatedText\":\"([^\"]*)\" expression to capture the match.
The expression meaning is as follows: find quoted translatedText followed by a colon and an opening quote. Then match every character before the following quote, and capture the result in a capturing group.
String s = " {\"responseData\":{\"translatedText\":\"Ciao mondo\"},\"responseDetails\":\"\",\"responseStatus\":200,\"matches\":[{\"id\":\"424913311\",\"segment\":\"Hello World\",\"translation\":\"Ciao mondo\",\"quality\":\"74\",\"reference\":\"\",\"usage-count\":50,\"subject\":\"All\",\"created-by\":\"\",\"last-updated-by\":null,\"create-date\":\"2011-12-29 19:14:22\",\"last-update-date\":\"2011-12-29 19:14:22\",\"match\":1},{\"id\":\"0\",\"segment\":\"Hello World\",\"translation\":\"Ciao a tutti\",\"quality\":\"70\",\"reference\":\"Machine Translation provided by Google, Microsoft, Worldlingo or the MyMemory customized engine.\",\"usage-count\":1,\"subject\":\"All\",\"created-by\":\"MT!\",\"last-updated-by\":null,\"create-date\":\"2012-05-14\",\"last-update-date\":\"2012-05-14\",\"match\":0.85}]}";
System.out.println(s);
Pattern p = Pattern.compile("\"translatedText\":\"([^\"]*)\"");
Matcher m = p.matcher(s);
if (!m.find()) return;
System.out.println(m.group(1));
This fragment prints Ciao mondo.

use look-ahead and look-behind to gather strings inside quotations:
(?<=[,.{}:]\").*?(?=\")
class Test
{
public static void main(String[] args)
{
Scanner scanner = new Scanner(System.in);
String in = scanner.nextLine();
Matcher matcher = Pattern.compile("(?<=[,.{}:]\\\").*?(?=\\\")").matcher(in);
while(matcher.find())
System.out.println(matcher.group());
}
}

Try this regular expression -
^.*translatedText":"([^"]*)"},"responseDetails".*$
The matching group will contain the text Ciao mondo.
This assumes that translatedText and responseDetails will always occur in the positions specified in your sample.

java regular expression

Can anyone please help me do the following in a java regular expression?
I need to read 3 characters from the 5th position from a given String ignoring whatever is found before and after.
Example : testXXXtest
Expected result : XXX

You don't need regex at all.
Just use substring: yourString.substring(4,7)
Since you do need to use regex, you can do it like this:
Pattern pattern = Pattern.compile(".{4}(.{3}).*");
Matcher matcher = pattern.matcher("testXXXtest");
matcher.matches();
String whatYouNeed = matcher.group(1);
What does it mean, step by step:
.{4} - any four characters
( - start capturing group, i.e. what you need
.{3} - any three characters
) - end capturing group, you got it now
.* followed by 0 or more arbitrary characters.
matcher.group(1) - get the 1st (only) capturing group.

You should be able to use the substring() method to accomplish this:
string example = "testXXXtest";
string result = example.substring(4,7);

This might help: Groups and capturing in java.util.regex.Pattern.
Here is an example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Example {
public static void main(String[] args) {
String text = "This is a testWithSomeDataInBetweentest.";
Pattern p = Pattern.compile("test([A-Za-z0-9]*)test");
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Matched: " + m.group(1));
} else {
System.out.println("No match.");
}
}
}
This prints:
Matched: WithSomeDataInBetween
If you don't want to match the entire pattern rather to the input string (rather than to seek a substring that would match), you can use matches() instead of find(). You can continue searching for more matching substrings with subsequent calls with find().
Also, your question did not specify what are admissible characters and length of the string between two "test" strings. I assumed any length is OK including zero and that we seek a substring composed of small and capital letters as well as digits.

You can use substring for this, you don't need a regex.
yourString.substring(4,7);
I'm sure you could use a regex too, but why if you don't need it. Of course you should protect this code against null and strings that are too short.

Use the String.replaceAll() Class Method
If you don't need to be performance optimized, you can try the String.replaceAll() class method for a cleaner option:
String sDataLine = "testXXXtest";
String sWhatYouNeed = sDataLine.replaceAll( ".{4}(.{3}).*", "$1" );
References
https://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html#using-regular-expressions-with-string-methods

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Regex look ahead to seperate string into tokens - java

No need of lookaheads, you can use nested captured groups: Matcher isMatch = Pattern.compile("\\b([0-9]+)([a-zA-Z])\\b"); Group #1 will contain 65 and group #2 will contain x. Better to add \\b (word boundary) on either side to avoid matching abc56xyz

You just need to use Matcher.group(int). This lets you extract pieces of the matched text. Read about caputring groups here. A regex that contains capturing groups is \\b([0-9]+)([a-zA-Z])\\b (as given by anubhava).

Related

Regex to replace a repeating string pattern

Regular expression match a-alphanumeric&b-digits&c-digits

Java Regexp capturing group includes space, why?

regex - matching between a literal string and a quotation mark

java regular expression

Categories

Resources