Java create matching string with regex using known parameters - java

I'm trying to make a string reconstruction with parameters based on a regex pattern.
I'm looking for something like this:
String regex = "(?<string>[a-z\\-]+)?(/number/(?<number>\\d+))?";
case 1:
Map<String,String> parameters = new HashMap<>();
parameters.put("string","test");
parameters.put("number","1");
generator.generateString(regex, parameters) // returns "test/number/1"
case 2:
Map<String,String> parameters = new HashMap<>();
parameters.put("string","test");
generator.generateString(regex, parameters) // returns "test"
because this group (/number/(?<number>\\d+))? is optional, it would still match
Maybe there's a library for it; note that I need something that works with regex named groups (introduced in Java 7).
I could write it myself, but at least a library that could detect groups in the pattern would be a big help.
How a library like that could work:
Pattern2 r = Pattern2.compile("(?<string>[a-z\\-]+)?(/number/(?<number>\\d+))?");
To find start and end character of the "string" named group
r.start("string"); //0
r.end("string"); //19
or to find start and end of a group by a number of a group
r.start(2); //20
r.end(2); //44
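The span lookup described above can be sketched with a small scanner over the pattern source. This is only an illustration: `GroupSpans` and `namedGroupSpans` are hypothetical names (there is no `Pattern2` class in the JDK), and the scanner assumes a valid pattern without `\Q...\E` quoting or nested character classes. Offsets refer to the regex text itself, not to the escaped Java string literal.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

class GroupSpans {
    /** Maps each named group to {start, endExclusive} offsets in the pattern source. */
    static Map<String, int[]> namedGroupSpans(String regex) {
        Map<String, int[]> spans = new LinkedHashMap<>();
        Deque<Integer> starts = new ArrayDeque<>(); // position of each open '('
        Deque<String> names = new ArrayDeque<>();   // "" marks an unnamed group
        boolean inClass = false;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\') { i++; continue; }       // skip the escaped character
            if (inClass) { if (c == ']') inClass = false; continue; }
            if (c == '[') { inClass = true; continue; }
            if (c == '(') {
                String name = "";
                // "(?<name>" is a named group; "(?<=" and "(?<!" are lookbehinds
                if (regex.startsWith("(?<", i) && i + 3 < regex.length()
                        && Character.isLetter(regex.charAt(i + 3))) {
                    name = regex.substring(i + 3, regex.indexOf('>', i));
                }
                starts.push(i);
                names.push(name);
            } else if (c == ')') {
                int start = starts.pop();
                String name = names.pop();
                if (!name.isEmpty()) spans.put(name, new int[] {start, i + 1});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String regex = "(?<string>[a-z\\-]+)?(/number/(?<number>\\d+))?";
        namedGroupSpans(regex).forEach((name, span) ->
                System.out.println(name + " -> [" + span[0] + ", " + span[1] + ")"));
    }
}
```

With such spans, a generator could substitute a parameter value for each named group and drop optional groups whose parameter is missing; that part is not shown here.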

Related

Regex replacing everything before a predefined range of chars - Java

I have string values where I want to remove or replace everything that comes before "TV|TH". My problem is that despite using the correct syntax, the string seems to stay the same.
String test = "10TH";
String replaceBeforeSide = test.replaceAll("^\\(TH|TV)+", "");
System.out.println(replaceBeforeSide);
//Desired result = "TH";
Converting my comment to answer so that solution is easy to find for future visitors.
You could use a simple regex with a capture group:
replaceBeforeSide = test.replaceAll(".+?(TH|TV)", "$1");
or even shorter:
replaceBeforeSide = test.replaceAll(".+?(T[HV])", "$1");
Using .+?, we are matching 1+ of any character (non-greedy) before matching (TH|TV) that we capture in group #1.
In the replacement we just put $1 back, so that only the text before (TH|TV) is removed.
We could also use a lookahead and avoid capture group:
replaceBeforeSide = test.replaceAll(".+?(?=T[HV])", "");
If you want case-insensitive matching, use the inline modifier (?i):
replaceBeforeSide = test.replaceAll("(?i).+?(?=T[HV])", "");
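The variants above can be compared side by side in a small self-contained sketch. The class and method names (`StripPrefixDemo`, `viaCaptureGroup`, and so on) are made up for illustration; only the regexes come from the answer.

```java
class StripPrefixDemo {
    // Capture TH/TV in group 1 and put it back in the replacement.
    static String viaCaptureGroup(String s) {
        return s.replaceAll(".+?(T[HV])", "$1");
    }

    // Lookahead: TH/TV is asserted but not consumed, so nothing is put back.
    static String viaLookahead(String s) {
        return s.replaceAll(".+?(?=T[HV])", "");
    }

    // Same as above, but (?i) makes the match case-insensitive.
    static String viaLookaheadIgnoreCase(String s) {
        return s.replaceAll("(?i).+?(?=T[HV])", "");
    }

    public static void main(String[] args) {
        System.out.println(viaCaptureGroup("10TH"));        // TH
        System.out.println(viaLookahead("10TV"));           // TV
        System.out.println(viaLookaheadIgnoreCase("10th")); // th
    }
}
```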

Regex pattern error on API 21(android 5) and below

On Android 5 and below, my regex pattern throws an error at runtime:
java.util.regex.PatternSyntaxException: Syntax error in regexp pattern near index 4:
(?<g1>(http|ftp)(s)?://)?(?<g2>[\w-:#])+(?<TLD>\.[\w\-]+)+(:\d+)?((|\?)([\w\-._~:/?#\[\]#!$&'()*+,;=.%])*)*
Here is code sample:
val urlRegex = "(?<g1>(http|ftp)(s)?://)?(?<g2>[\\w-:#])+(?<TLD>\\.[\\w\\-]+)+(:\\d+)?((|\\?)([\\w\\-._~:/?#\\[\\]#!$&'()*+,;=.%])*)*"
val sampleUrl = "https://www.google.com"
val urlMatchers = Pattern.compile(urlRegex).matcher(sampleUrl)
assert(urlMatchers.find())
This pattern works really fine on all APIs above 21.
It seems the earlier versions do not support named groups. As per this source, the named groups were introduced in Kotlin 1.2. Remove them if you do not need those submatches and only use the regex for validation.
Your regex is very inefficient as it contains a lot of nested quantified groups. See a "cleaner" version of it below.
Also, it seems you want to check if there is a regex match inside your input string. Use Regex#containsMatchIn():
val urlRegex = "(?:(?:http|ftp)s?://)?[\\w:#.-]+\\.[\\w-]+(?::\\d+)?\\??[\\w.~:/?#\\[\\]#!$&'()*+,;=.%-]*"
val sampleUrl = "https://www.google.com"
val urlMatchers = Regex(urlRegex).containsMatchIn(sampleUrl)
println(urlMatchers) // => true
See the Kotlin demo and the regex demo.
If you need to check the whole string match use matches:
Regex(urlRegex).matches(sampleUrl)
See another Kotlin demo.
Note that to define a regex, you need to use the Regex class constructor.
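For readers using java.util.regex directly rather than Kotlin's Regex class, the same distinction between "contains a match" and "whole string matches" maps to Matcher.find() and Matcher.matches(). The sketch below reuses the cleaned-up pattern from this answer, which has no named groups; the class and method names are illustrative assumptions.

```java
import java.util.regex.Pattern;

class UrlMatchDemo {
    // The named-group-free pattern from the answer, written as a Java string literal.
    static final Pattern URL = Pattern.compile(
            "(?:(?:http|ftp)s?://)?[\\w:#.-]+\\.[\\w-]+(?::\\d+)?\\??[\\w.~:/?#\\[\\]#!$&'()*+,;=.%-]*");

    // Counterpart of Regex#containsMatchIn: a match anywhere in the input.
    static boolean containsUrl(String input) {
        return URL.matcher(input).find();
    }

    // Counterpart of Regex#matches: the entire input must match.
    static boolean isUrl(String input) {
        return URL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsUrl("https://www.google.com"));             // true
        System.out.println(isUrl("https://www.google.com"));                   // true
        System.out.println(isUrl("visit https://www.google.com today"));       // false
        System.out.println(containsUrl("visit https://www.google.com today")); // true
    }
}
```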

Extract attributes of a string

I have to deal with a problem caused by a dirty design. I get a list of strings and want to parse attributes out of them. Unfortunately, I can't change the source where these strings are created.
Example:
String s = "type=INFO, languageCode=EN-GB, url=http://www.stackoverflow.com, ref=1, info=Text, that may contain all kind of chars., deactivated=false"
Now I want to extract the attributes type, languageCode, url, ref, info and deactivated.
The problem here is the info field, whose text is not delimited by quotation marks. Commas may also occur in this field, so I can't use the comma after the value to find out where it ends.
Additionally, these strings do not always contain all attributes: type, info and deactivated are always present, the rest are optional.
Any suggestions how I can solve this problem?
One possible solution is to search for = characters in the input and then take the single word immediately before it as the field name - it seems that all your field names are single words (no whitespace). If that's the case, you can then take everything after the = until the next field name (accounting for separating ,) as the value.
This assumes that the value cannot contain =.
Edit:
As a possible way to handle embedded =, you can check whether the word in front of it is one of your known field names; if not, you can treat the = as an embedded character rather than an operator. This, however, assumes that you have a fixed set of known fields (some of which may not always appear). This assumption may be eased if you know that the field names are case-sensitive.
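A minimal sketch of this idea, assuming a fixed set of known field names and ", " as the separator: every word followed by = is a candidate, only known names start a new field, and each value runs up to the next known field. `KnownFieldSplitter` and `parse` are hypothetical names, not part of any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KnownFieldSplitter {
    static Map<String, String> parse(String s, Set<String> knownKeys) {
        // Candidate field names: a single word immediately followed by '='.
        Matcher m = Pattern.compile("(\\w+)=").matcher(s);
        List<String> names = new ArrayList<>();
        List<int[]> marks = new ArrayList<>(); // {nameStart, valueStart}
        while (m.find()) {
            // An '=' after an unknown word is treated as part of a value.
            if (knownKeys.contains(m.group(1))) {
                names.add(m.group(1));
                marks.add(new int[] {m.start(), m.end()});
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < marks.size(); i++) {
            boolean last = (i + 1 == marks.size());
            int valueEnd = last ? s.length() : marks.get(i + 1)[0];
            String value = s.substring(marks.get(i)[1], valueEnd);
            // Drop the ", " separator that precedes the next field name.
            if (!last && value.endsWith(", ")) {
                value = value.substring(0, value.length() - 2);
            }
            result.put(names.get(i), value);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> keys = new HashSet<>(Arrays.asList(
                "type", "languageCode", "url", "ref", "info", "deactivated"));
        String s = "type=INFO, languageCode=EN-GB, url=http://www.stackoverflow.com, ref=1, "
                + "info=Text, that may contain all kind of chars., deactivated=false";
        parse(s, keys).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

The sketch still misreads a value that itself contains a known field name followed by =, which is the ambiguity described above.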
Assuming that the order of elements is fixed, you could write a solution using a regex like this one:
String s = "type=INFO, languageCode=EN-GB, url=http://www.stackoverflow.com, ref=1, info=Text, that may contain all kind of chars., deactivated=false";
String regex = // type, info and deactivated are always present
"type=(?<type>.*?)"
+ "(?:, languageCode=(?<languageCode>.*?))?" // optional group
+ "(?:, url=(?<url>.*?))?" // optional group
+ "(?:, ref=(?<ref>.*?))?" // optional group
+ ", info=(?<info>.*?)"
+ ", deactivated=(?<deactivated>.*?)";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(s);
// matches() requires the whole string to match; optional groups that are absent return null
if (m.matches()) {
    System.out.println("type -> " + m.group("type"));
    System.out.println("languageCode -> " + m.group("languageCode"));
    System.out.println("url -> " + m.group("url"));
    System.out.println("ref -> " + m.group("ref"));
    System.out.println("info -> " + m.group("info"));
    System.out.println("deactivated -> " + m.group("deactivated"));
}
Output:
type -> INFO
languageCode -> EN-GB
url -> http://www.stackoverflow.com
ref -> 1
info -> Text, that may contain all kind of chars.
deactivated -> false
EDIT: Version2 regex searching for oneOfPossibleKeys=value where value ends with:
, oneOfPossibleKeys=
or has end of string after it (represented by $).
Code:
String s = "type=INFO, languageCode=EN-GB, url=http://www.stackoverflow.com, ref=1, info=Text, that may contain all kind of chars., deactivated=false";
String[] possibleKeys = {"type","languageCode","url","ref","info","deactivated"};
String keysStrRegex = String.join("|", possibleKeys);
//above will contain type|languageCode|url|ref|info|deactivated
String regex = "(?<key>\\b(?:"+keysStrRegex+")\\b)=(?<value>.*?(?=, (?:"+keysStrRegex+")=|$))";
// (?<key>\b(?:type|languageCode|url|ref|info|deactivated)\b)
// =
// (?<value>.*?(?=, (?:type|languageCode|url|ref|info|deactivated)=|$))
System.out.println(regex);
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(s);
while(m.find()){
System.out.println(m.group("key")+" -> "+m.group("value"));
}
Output:
type -> INFO
languageCode -> EN-GB
url -> http://www.stackoverflow.com
ref -> 1
info -> Text, that may contain all kind of chars.
deactivated -> false
You could use a regular expression, capturing all the "fixed" groups and using whatever remains for info. This should even work if the info part contains , or = characters. Here's some quick example (using Python, but that should not be a problem...).
>>> p = r"(type=[A-Z]+), (languageCode=[-A-Z]+), (url=[^,]+), (ref=\d), (info=.+?), (deactivated=(?:true|false))"
>>> s = "type=INFO, languageCode=EN-GB, url=http://www.stackoverflow.com, ref=1, info=Text, that may contain all kind of chars, even deactivated=true., deactivated=false"
>>> re.search(p, s).groups()
('type=INFO',
'languageCode=EN-GB',
'url=http://www.stackoverflow.com',
'ref=1',
'info=Text, that may contain all kind of chars, even deactivated=true.',
'deactivated=false')
If any of those elements are optional, you can put a ? after those groups, and make the comma optional. If the order can be different, then it's more complicated. In this case, instead of using one RegEx to capture everything at once, use several RegExes to capture the individual attributes and then remove (replace with '') those in the string before matching the next attribute. Finally, match info.
On further consideration, given that those attributes could have any order, it may be more promising to capture just everything spanning from one keyword to the next, regardless of its actual content, very similar to Pshemo's solution:
keys = "type|languageCode|url|ref|info|deactivated"
p = r"({0})=(.+?)(?=\, (?:{0})=|$)".format(keys)
matches = re.findall(p, s)
But this, too, might fail in some very obscure cases, e.g. if the info attribute contains something like ', ref=foo', including the comma. However, there seems to be no way around those ambiguities. If you had a string like info=in this string, ref=1, and in another, ref=2, ref=1, does it contain one ref attribute, or three, or none at all?

Using Elasticsearch Java API to match any of several words, case insensitive

I'm using Elasticsearch for the first time. I'm sure this must be easy, but so far a solution using the Java API has eluded me:
I have an array of search terms, and I'd like to return hits matching any of these terms in a case insensitive way.
This code works except it's case sensitive, but I'd like it to be case insensitive:
String[] terms = {"orange", "peach"};
SearchResponse response = client.prepareSearch("orders")
.setTypes("fruit")
.setQuery(
QueryBuilders.termsQuery("description", terms).minimumMatch(1)
)
.setFrom(0).setSize(10)
.setExplain(true)
.execute()
.actionGet();
for ( SearchHit hit : response.getHits()) {
String source = hit.sourceAsString();
//only case sensitive matches found...
}
You're using Terms query - this doesn't get analysed.
Try replacing it with a Match query or a Query String query. Both of these get analysed and, assuming you're using the Standard Analyzer, the terms will be converted to lowercase.

extracting a particular field from url

I want to extract particular fields from the URL of a Facebook page. I am not able to extract them because the link format is not static. For example, given the input below, it should produce the output shown:
1) https://www.facebook.com/pages/Ice-cream/109301862430120?rf=102173023157556
Output: 109301862430120
What about this type of link
Can anyone help me?
So in short, you want to get the name after the last / and (if there is one) before the ? mark.
You can do this using the URI and File classes:
String data = "https://www.facebook.com/pages/Anti-Christian-sentiment/149675731889496?ref=br_tf";
System.out.println(new File(new URI(data).getRawPath()).getName());
Output: 149675731889496
If you need to use regex then you can use
([^/?]+)(\\?|$)
and just read the content of group 1 (the one in the first pair of parentheses).
If you don't want to use groups and want the regex to match only the digit part (without including ? in the match), you can use a look-around mechanism such as look-ahead (?=...). The regex would then look like
[^/?]+(?=\\?|$)
Code example:
String data = "https://www.facebook.com/pages/Anti-Christian-sentiment/149675731889496?ref=br_tf";
Pattern p = Pattern.compile("([^/?]+)(\\?|$)");
Matcher m = p.matcher(data);
if (m.find()){
System.out.println(m.group(1));
}
Output:
149675731889496
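The look-ahead variant mentioned above can be used the same way. In this sketch the whole match is the wanted segment, so no group index is needed; `LastSegmentDemo` and `lastPathSegment` are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastSegmentDemo {
    // One or more characters other than '/' and '?', followed by '?' or end of input.
    private static final Pattern SEGMENT = Pattern.compile("[^/?]+(?=\\?|$)");

    /** Returns the first segment that is followed by '?' or the end, or null if none. */
    static String lastPathSegment(String url) {
        Matcher m = SEGMENT.matcher(url);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastPathSegment(
                "https://www.facebook.com/pages/Ice-cream/109301862430120?rf=102173023157556"));
        // prints 109301862430120
    }
}
```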
