Regex - extract indefinite number of hits - java

The method getPolygonPoints() (see below) receives a String name as its parameter, which looks something like this:
points={{-100,100},{-120,60},{-80,60},{-100,100},{-100,100}}
The first number is the x-coordinate, the second the y-coordinate. For example, the first point is
x=-100
y=100
The second point is
x=-120
y=60
and so on.
Now I want to extract the points from the String and put them in an ArrayList, which has to look like this at the end:
[-100, 100, -120, 60, -80, 60, -100, 100, -100, 100]
The catch is that the number of points in the given String varies and is not always the same.
I have written the following code:
private ArrayList<Integer> getPolygonPoints(String name) {
// the regular expression
String regGroup = "[-]?[\\d]{1,3}";
// compile the regular expression into a pattern
Pattern regex = Pattern.compile("\\{(" + regGroup + ")");
// the matcher
Matcher matcher;
ArrayList<Integer> points = new ArrayList<Integer>();
// matcher that will match the given input against the pattern
matcher = regex.matcher(name);
int i = 1;
while(matcher.find()) {
System.out.println(Integer.parseInt(matcher.group(i)));
i++;
}
return points;
}
The first x coordinate is extracted correctly, but then an IndexOutOfBoundsException is thrown. I think that happens because group 2 is not defined.
I think I first have to count the points and then iterate over that number. Inside the loop I would put the int values into the ArrayList with a simple add(). But I don't know how to do this. Maybe I don't understand the regex part at this point, especially how the groups work.
Please help!

String points = "{{-100,100},{-120,60},{-80,60},{-100,100},{-100,100}}";
String[] strs = points.replaceAll("(\\{|\\})", "").split(",");
ArrayList<Integer> list = new ArrayList<Integer>(strs.length);
for (String s : strs)
{
list.add(Integer.valueOf(s));
}

The part you don't seem to understand about the regex API is that the capture group numbers "reset" with every call to find(). Or, to put it another way: the number of a capture group is its position in the pattern, not in the input string.
You're also going about this the wrong way. You should match the whole construct you're looking for, in this case the {x,y} pairs. I'm assuming you don't want to validate the format of the whole string, so we can ignore the outside brackets and comma:
Pattern p = Pattern.compile("\\{(-?\\d+),(-?\\d+)\\}");
Matcher m = p.matcher(name);
while (m.find()) {
String x = m.group(1);
String y = m.group(2);
// parse and add to list
}
Alternately, since you don't care about which coordinate is X and which is Y, you can even do:
Matcher m = Pattern.compile("-?\\d+").matcher(name);
while (m.find()) {
String xOrY = m.group();
// parse etc.
}
Now, if you want to validate the input as well, I'd say that's a separate concern. I wouldn't necessarily try to do it in the same step as the parsing, to keep the regex readable. (It might be possible in this case, but if you don't need it, why bother in the first place?)
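As a minimal sketch, the pair-matching loop above could be completed like this. The class name PolygonPoints and the constant PAIR are made up for illustration, and the method returns List rather than ArrayList:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PolygonPoints {
    // One {x,y} pair. On every successful find(), group 1 is x and group 2 is y.
    private static final Pattern PAIR = Pattern.compile("\\{(-?\\d+),(-?\\d+)\\}");

    static List<Integer> getPolygonPoints(String name) {
        List<Integer> points = new ArrayList<>();
        Matcher m = PAIR.matcher(name);
        // find() advances through the input; the group numbers stay 1 and 2.
        while (m.find()) {
            points.add(Integer.parseInt(m.group(1)));
            points.add(Integer.parseInt(m.group(2)));
        }
        return points;
    }

    public static void main(String[] args) {
        System.out.println(getPolygonPoints(
                "points={{-100,100},{-120,60},{-80,60},{-100,100},{-100,100}}"));
    }
}
```

The loop runs once per pair, so the number of points never has to be counted up front.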

You can also try this regex:
((-?\d+)\s*,\s*(-?\d+))
It will give you three groups, numbered by the position of their opening parenthesis:
Group 1 : x,y
Group 2 : x
Group 3 : y
Use whichever one you need.
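To check which group holds what, here is a small sketch. The class GroupNumbering and the method groupsOf are hypothetical names. Java numbers capturing groups by counting opening parentheses from left to right, so the outer pair of parentheses is group 1:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    private static final Pattern PAIR = Pattern.compile("((-?\\d+)\\s*,\\s*(-?\\d+))");

    // Returns groups 1..3 of the first match, or an empty array if nothing matches.
    static String[] groupsOf(String input) {
        Matcher m = PAIR.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        for (String g : groupsOf("{-120, 60}")) {
            System.out.println("[" + g + "]");
        }
    }
}
```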

How about doing it in just one line:
List<String> list = Arrays.asList(name.replaceAll("(^\\w+=\\{+)|(\\}+$)", "").split("\\}?,\\{?"));
Your whole method (converting each token to an Integer) would then be:
private List<Integer> getPolygonPoints(String name) {
return Arrays.stream(name.replaceAll("(^\\w+=\\{+)|(\\}+$)", "").split("\\}?,\\{?"))
.map(Integer::valueOf)
.collect(Collectors.toList());
}
This works by first stripping off the leading and trailing text, then splitting on commas optionally surrounded by braces.
BTW You really should return the abstract type List, not the concrete implementation ArrayList.

Related

Delete regex matches placed inside other regex matches

I have two regexes. I want to delete all matches of second one if they are placed inside matches of first one. Basically, nothing can be matched in what was already matched. Example:
First regex (bold) - c\w+ finds words beginning with c
Second regex (underlined) - me finds me
Result: cam̲e̲l crim̲e̲ care cool m̲e̲dium m̲e̲lt hom̲e̲
The me in c-words is matched too. What I want is: camel crime care cool m̲e̲dium m̲e̲lt hom̲e̲
Two results of second regex are in results of first regex, I want to delete them, or just don't match them at all. Here's what I tried:
String text = "camel crime care cool medium melt home";
static final Pattern PATTERN_FIRST = Pattern.compile("c\\w+");
static final Pattern PATTERN_SECOND = Pattern.compile("me");
// Save all matches
List<int[]> firstRegexMatches = new ArrayList<>();
for (Matcher m = PATTERN_FIRST.matcher(text); m.find();) {
firstRegexMatches.add(new int[]{m.start(), m.end()});
}
List<int[]> secondRegexMatches = new ArrayList<>();
for (Matcher m = PATTERN_SECOND.matcher(text); m.find();) {
secondRegexMatches.add(new int[]{m.start(), m.end()});
}
// Remove matches of second inside matches of first
for (int[] pos : firstRegexMatches) {
Iterables.removeIf(secondRegexMatches, p -> p[0] > pos[0] && p[1] < pos[1]);
}
In this code I store all matches of both regexes in lists, then try to remove from the second list the matches that lie inside matches from the first list.
Not only does this not work, but I'm not sure it's very efficient. Note that this is a simplified version of my situation, which involves more regexes and a large text. Iterables is from Guava.
First of all, you can achieve something like this by merging both expressions into one.
(^c\w+)|\s(c\w+)|(\w*me\w*)
If you match against this regex every match will be either a word starting with "c" followed by some word-characters or a word containing "me". For every match you then either get the group:
(1) or (2) indicating a word starting with "c" or
(3) indicating a word containing "me"
However note that this only works in case you know the delimiter of the words, in this case a \s character.
Example code:
String text = "camel crime care cool medium melt home";
final Pattern PATTERN = Pattern.compile("(^c\\w+)|\\s(c\\w+)|(\\w*me\\w*)");
// Save all matches
List<String> wordsStartingWithC = new ArrayList<>();
List<String> wordsIncludingMe = new ArrayList<>();
for (Matcher m = PATTERN.matcher(text); m.find();) {
if(m.group(1) != null) {
wordsStartingWithC.add(m.group(1));
} else if(m.group(2) != null) {
wordsStartingWithC.add(m.group(2));
} else if(m.group(3) != null) {
wordsIncludingMe.add(m.group(3));
}
}
System.out.println(wordsStartingWithC);
System.out.println(wordsIncludingMe);
I'd recommend simplifying this by taking a somewhat different approach.
As you seem to know the word delimiter, namely the whitespace character, you can get a collection of all words simply by splitting the original string.
String[] words = "camel crime care cool medium melt home".split(" ");
You then simply iterate over all of these.
for(String word: words) {
if(word.startsWith("c")) {
// put in your list for words starting with "c"
} else if (word.contains("me")) {
// put in your list for words containing "me"
}
}
This will result in two lists without duplicate entries, as the second if statement will only be executed in case the first one fails.
Isn't it possible to combine the two Regexes? For example, the me after c can be found using one Regex with this code:
((?<=c)|(?<=c\w)|(?<=c\w{2})|(?<=c\w{3})|(?<=c\w{4})|(?<=c\w{5}))me
Check it out here: https://regex101.com/r/bfNkvF/2
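If you would rather keep the interval-based approach from the question, here is a hedged sketch using only java.util (Collection.removeIf instead of Guava). The class and method names are invented for illustration. The comparison is inclusive: the me in crime ends exactly where crime ends, so a strict p[1] < pos[1] check never removes it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NestedMatches {
    // Collects the [start, end) span of every match of p in text.
    static List<int[]> spans(Pattern p, String text) {
        List<int[]> result = new ArrayList<>();
        for (Matcher m = p.matcher(text); m.find();) {
            result.add(new int[] { m.start(), m.end() });
        }
        return result;
    }

    // Start offsets of the matches of 'second' that are not contained in any match of 'first'.
    static List<Integer> startsOutside(String text, Pattern first, Pattern second) {
        List<int[]> outer = spans(first, text);
        List<int[]> inner = spans(second, text);
        // Inclusive bounds: a match touching the edge of an outer match still counts as inside.
        inner.removeIf(in -> outer.stream()
                .anyMatch(out -> in[0] >= out[0] && in[1] <= out[1]));
        List<Integer> starts = new ArrayList<>();
        for (int[] span : inner) {
            starts.add(span[0]);
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "camel crime care cool medium melt home";
        System.out.println(startsOutside(text, Pattern.compile("c\\w+"), Pattern.compile("me")));
    }
}
```

This compares every inner span against every outer span. For a large text with many patterns, sorting the spans and sweeping once would scale better, but that is beyond this sketch.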

Split a string based on pattern and merge it back

I need to split a string based on a pattern and then merge part of it back together.
For example, below are the actual and expected strings.
String actualstr="abc.def.ghi.jkl.mno";
String expectedstr="abc.mno";
When I use the code below, I can store the parts in an array and iterate over it to rebuild the string. Is there a simpler and more efficient way to do this?
String[] splited = actualstr.split("[\\.\\.\\.\\.\\.\\s]+");
Though I can access the parts by index, is there an easier way to do this? Please advise.
You do not understand how regexes work.
Here is your regex without the escapes: [\.\.\.\.\.\s]+
You have a character class ([]). Which means there is no reason to have more than one . in it. You also don't need to escape .s in a char class.
Here is an equivalent regex to your regex: [.\s]+. As a Java String that's: "[.\\s]+".
You can do .split("regex") on your string to get an array. It's very simple to get a solution from that point.
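A quick way to convince yourself that the two character classes behave the same is to split the same input with both and compare the results. This is only a sketch; the class name CharClassEquivalence and the method sameSplit are made up:

```java
import java.util.Arrays;

class CharClassEquivalence {
    // Splits the input with the original and the simplified regex and compares the results.
    static boolean sameSplit(String input) {
        String[] verbose = input.split("[\\.\\.\\.\\.\\.\\s]+");
        String[] simple = input.split("[.\\s]+");
        return Arrays.equals(verbose, simple);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("abc.def.ghi.jkl.mno".split("[.\\s]+")));
        System.out.println(sameSplit("abc.def.ghi.jkl.mno"));
    }
}
```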
I would use a replaceAll in this case
String actualstr="abc.def.ghi.jkl.mno";
String str = actualstr.replaceAll("\\..*\\.", ".");
This will replace everything from the first . through the last . with a single .
You could also use split
String[] parts = actualstr.split("\\.");
String str = parts[0]+"."+parts[parts.length-1]; // first and last word
public static String merge(String string, String delimiter, int... partnumbers)
{
String[] parts = string.split(delimiter);
String result = "";
for ( int x = 0 ; x < partnumbers.length ; x ++ )
{
result += result.length() > 0 ? delimiter.replaceAll("\\\\","") : "";
result += parts[partnumbers[x]];
}
return result;
}
and then use it like:
merge("abc.def.ghi.jkl.mno", "\\.", 0, 4);
I would do it this way
Pattern pattern = Pattern.compile("(\\w*\\.).*\\.(\\w*)");
Matcher matcher = pattern.matcher("abc.def.ghi.jkl.mno");
if (matcher.matches()) {
System.out.println(matcher.group(1) + matcher.group(2));
}
If you can cache the result of
Pattern.compile("(\\w*\\.).*\\.(\\w*)")
and reuse "pattern" over and over, this code will be very efficient, as pattern compilation is the most expensive part. The java.lang.String.split() method that other answers suggest uses the same Pattern.compile() internally if the pattern length is greater than 1, meaning that it performs this expensive compilation on each invocation of the method. See java.util.regex - importance of Pattern.compile()?. So it is much better to have the Pattern compiled once, cached and reused.
matcher.group(1) refers to the first group of () which is "(\w*\.)"
matcher.group(2) refers to the second one which is "(\w*)"
Even though we don't use it here, note that group(0) is the match for the whole regex.
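A sketch of the caching idea: the Pattern is compiled once into a static final field and reused on every call. The names FirstLastMerger, FIRST_LAST and merge are illustrative assumptions, and returning the input unchanged when nothing matches is just one possible choice:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstLastMerger {
    // Compiled once when the class is loaded, then shared by all calls.
    private static final Pattern FIRST_LAST = Pattern.compile("(\\w*\\.).*\\.(\\w*)");

    // Keeps the first part (with its dot) and the last part; returns the input as-is if it does not match.
    static String merge(String input) {
        Matcher m = FIRST_LAST.matcher(input);
        return m.matches() ? m.group(1) + m.group(2) : input;
    }

    public static void main(String[] args) {
        System.out.println(merge("abc.def.ghi.jkl.mno"));
    }
}
```

Pattern instances are immutable and safe to share between threads. A Matcher is not, which is why a new one is created per call.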

Scanning 2 Different Data Types Java

I have a data file that is a list of names followed by "*****" and then continues with integers. How do I scan the names and then break with the asterisks, followed by scanning the integers?
This question might help : Splitting up data file in Java Scanner
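Use the Scanner.useDelimiter() method with "*****" as the delimiter. The argument is a regular expression, so the asterisks must be escaped (or quoted with Pattern.quote), like this for example :
sc.useDelimiter("\\*{5}");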
OR
Alternative :
Read the whole string
Split the string using String.split()
Resulting String array will have index 0 contain the names and index 1 contain the integers.
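The read-and-split alternative could look roughly like the sketch below. It assumes the file content has already been read into a String, that there is one name per line, and that the integers are separated by whitespace. None of this is stated in the question, and the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class NamesAndNumbers {
    // Pattern.quote makes split treat the asterisks literally instead of as regex quantifiers.
    private static final String SEPARATOR = Pattern.quote("*****");

    static List<String> names(String content) {
        String namePart = content.split(SEPARATOR, 2)[0];
        List<String> names = new ArrayList<>();
        for (String line : namePart.split("\\r?\\n")) {
            if (!line.trim().isEmpty()) {
                names.add(line.trim());
            }
        }
        return names;
    }

    static List<Integer> numbers(String content) {
        String[] parts = content.split(SEPARATOR, 2);
        List<Integer> numbers = new ArrayList<>();
        if (parts.length < 2) {
            return numbers; // no separator found, so there is no integer section
        }
        Scanner sc = new Scanner(parts[1]);
        while (sc.hasNextInt()) {
            numbers.add(sc.nextInt());
        }
        return numbers;
    }

    public static void main(String[] args) {
        String content = "Alice\nBob\n*****\n1\n2\n3\n";
        System.out.println(names(content));
        System.out.println(numbers(content));
    }
}
```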
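The code below should work for you. The delimiter is a regex: [*****] would be a character class matching a single *, so \\*+ is used to treat the whole run of asterisks as one delimiter.
Scanner scanner = new Scanner(<INPUT_STR>).useDelimiter("\\*+");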
while (scanner.hasNext()) {
if (scanner.hasNextInt()) {
// For Integer
} else {
// For String
}
}
Although this seems a tedious thing, I think this would solve the issue without worrying about whether the split returns anything or about going out of bounds.
final String x = "abc****12354";
final Pattern p = Pattern.compile("[A-Z]*[a-z]*\\*{4}");
final Matcher m = p.matcher(x);
while (m.find()) {
System.out.println(m.group());
}
final Pattern p1 = Pattern.compile("\\*{4}[0-9]*");
final Matcher m1 = p1.matcher(x);
while (m1.find()) {
System.out.println(m1.group());
}
The first pattern's match minus the trailing 4 stars (which can be removed with substring) and the second pattern's match minus the leading 4 stars (which can also be removed) would give the requested fields.

Get a substring of a string made of xCharsxInts

I have a list of constants:
public static final String INSTANCE_PREFIX = "in";
public static final String INDICATOR_PREFIX = "i";
public static final String MODEL_PREFIX = "m";
...
They have variable lengths, which are put in front of a number and the result is a variable's id. For example, it could be in30 or i2 or m4353. I am trying to make the method as abstract as possible to account for x letters x numbers. The letters are always going to be some prefix that is inside of my Constants.java so I know that much, but the method won't know with which combination it's working with.
I just want the number attached to the end. For example, I want to pass in the m4353 from above and just get back the 4353. Whether it uses the constants file or not is not relevant, but I include them as they may be useful for some approach.
It seems to me like you don't care about the prefixes at all, so I have ignored them in this answer. If you do care about the prefixes, please scroll down to the second half of this answer:
This code uses regular expressions to extract the trailing numbers at the end of a string.
() represents a capturing group (used by m.group(1));
[0-9]+ represents a String of digits of at least 1 in length
$ represents the end of the string, guaranteeing the numbers are only the ones at the end.
Here is the code:
private static final Pattern p = Pattern.compile("([0-9]+)$");
public static int extractNumber(String value) {
Matcher m = p.matcher(value);
if(m.find()) {
return Integer.parseInt(m.group(1));
} else {
return Integer.MIN_VALUE; // error code
}
}
If you want to capture the prefix, you could use Pattern.compile("^([a-z]+)([0-9]+)$") instead.
Note that the numbers are now the second group, so they would be captured in m.group(2), and the prefix would be captured in m.group(1).
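A minimal sketch of the two-group variant follows. The class name IdParts and its methods are hypothetical, and throwing IllegalArgumentException for malformed ids is an assumption rather than something the question asks for:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdParts {
    // Group 1: lowercase letter prefix, group 2: trailing digits.
    private static final Pattern ID = Pattern.compile("^([a-z]+)([0-9]+)$");

    private static Matcher match(String id) {
        Matcher m = ID.matcher(id);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a prefix+number id: " + id);
        }
        return m;
    }

    static String prefixOf(String id) {
        return match(id).group(1);
    }

    static int numberOf(String id) {
        return Integer.parseInt(match(id).group(2));
    }

    public static void main(String[] args) {
        System.out.println(prefixOf("in30") + " / " + numberOf("in30"));
        System.out.println(prefixOf("m4353") + " / " + numberOf("m4353"));
    }
}
```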
Try the String replaceAll method
For example:
String x = "prefix1111111";
x = x.replaceAll("\\D", "");
int justNum = Integer.parseInt(x);
where "\\D" is any non-digit character. So it deletes all non-digits in your string.
Note: you might want to use Long.parseLong or Double.parseDouble and the associated primitive types instead if your numbers can be longer than 9 digits, as Java ints can only hold values up to 2147483647.

Java String- How to get a part of package name in android?

It's basically about getting the string value between two characters. SO has many questions related to this, like:
How to get a part of a string in java?
How to get a string between two characters?
Extract string between two strings in java
and more.
But I found it quite confusing when dealing with multiple dots in the string and getting the value between two particular dots.
I have got the package name as :
au.com.newline.myact
I need to get the value between "com." and the next "dot(.)". In this case "newline". I tried
Pattern pattern = Pattern.compile("com.(.*).");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
int ct = matcher.group();
I also tried using substring and indexOf, but couldn't get the intended result. Because package names in Android vary in the number of dots and characters, I cannot use a fixed index. Please suggest any idea.
As you probably know (based on .* part in your regex) dot . is special character in regular expressions representing any character (except line separators). So to actually make dot represent only dot you need to escape it. To do so you can place \ before it, or place it inside character class [.].
Also to get only part from parenthesis (.*) you need to select it with proper group index which in your case is 1.
So try with
String beforeTask = "au.com.newline.myact";
Pattern pattern = Pattern.compile("com[.](.*)[.]");
Matcher matcher = pattern.matcher(beforeTask);
while (matcher.find()) {
String ct = matcher.group(1);//remember that regex finds Strings, not int
System.out.println(ct);
}
Output: newline
If you want to get only one element before next . then you need to change greedy behaviour of * quantifier in .* to reluctant by adding ? after it like
Pattern pattern = Pattern.compile("com[.](.*?)[.]");
// the ? right after .* makes the quantifier reluctant
Another approach is instead of .* accepting only non-dot characters. They can be represented by negated character class: [^.]*
Pattern pattern = Pattern.compile("com[.]([^.]*)[.]");
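The difference between the three patterns only shows up when more than one dot follows com., so this sketch runs all of them against a longer package name. The helper class AfterCom and its extract method are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AfterCom {
    // Returns group 1 of the first match of regex in input, or null if there is no match.
    static String extract(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String name = "au.com.newline.myact.modelact";
        System.out.println(extract("com[.](.*)[.]", name));     // greedy: runs up to the last dot
        System.out.println(extract("com[.](.*?)[.]", name));    // reluctant: stops at the next dot
        System.out.println(extract("com[.]([^.]*)[.]", name));  // negated class: cannot cross a dot
    }
}
```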
If you don't want to use regex you can simply use indexOf method to locate positions of com. and next . after it. Then you can simply substring what you want.
String beforeTask = "au.com.newline.myact.modelact";
int start = beforeTask.indexOf("com.") + 4; // +4 since we also want to skip 'com.' part
int end = beforeTask.indexOf(".", start); //find next `.` after start index
String result = beforeTask.substring(start, end);
System.out.println(result);
You can use reflection to get the name of any class. For example:
If I have a class Runner in com.some.package, I can run
Runner.class.getName() // string is "com.some.package.Runner"
to get the full name of the class, which happens to have the package name inside. (Runner.class.toString() would return "class com.some.package.Runner" instead.)
To get something after 'com' you can use Runner.class.getName().split("\\.") (the dot must be escaped because split takes a regex) and then iterate over the returned array with a boolean flag.
All you have to do is split the strings by "." and then iterate through them until you find one that equals "com". The next string in the array will be what you want.
So your code would look something like:
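String[] parts = packageName.split("\\.");
int i = 0;
for (String part : parts) {
if (part.equals("com")) {
break;
}
++i; // count the parts that come before "com"
}
String result = parts[i+1]; // throws ArrayIndexOutOfBoundsException if "com" is missing or last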
private String getStringAfterComDot(String packageName) {
String strArr[] = packageName.split("\\.");
for(int i=0; i<strArr.length-1; i++){ // stop before the last element so strArr[i+1] is always valid
if(strArr[i].equals("com"))
return strArr[i+1];
}
return "";
}
I have done heaps of projects dealing with website scraping, and I
just had to create my own function/utils to get the job done. Regex might
be overkill sometimes if you just want to extract a substring from
a given string like the one you have. Below is the function I normally
use for this kind of task.
private String GetValueFromText(String sText, String sBefore, String sAfter)
{
String sRetValue = "";
int nPos = sText.indexOf(sBefore);
if ( nPos > -1 )
{
int nLast = sText.indexOf(sAfter,nPos+sBefore.length()+1);
if ( nLast > -1)
{
sRetValue = sText.substring(nPos+sBefore.length(),nLast);
}
}
return sRetValue;
}
To use it just do the following:
String sValue = GetValueFromText("au.com.newline.myact", ".com.", ".");
