How to get "something" from <em>something</em> using Java regular expressions

In the following code, I need to extract a value:
String regex = "Item#: <em>.*</em>";
String content = "xxx Item#: <em>something</em> yyy";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(content);
if( matcher.find() ) {
System.out.println(matcher.group());
}
It prints:
Item#: <em>something</em>
but I only need the value "something".
I know I can use .substring(begin, end) to get the value,
but is there a more elegant way?

It prints the whole match because matcher.group() returns the complete match. To get a specific part of the matched string, change your regex so that it captures the content between the tags in a group:
String regex = "Item#: <em>(.*?)</em>";
Also, use a reluctant quantifier (.*?) so that the match stops at the first </em> that is encountered.
Then, inside the if, print group(1) instead of group():
if( matcher.find() ) {
System.out.println(matcher.group(1));
}
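Putting the pieces above together, a minimal self-contained sketch could look like this. The class name EmExtractor and the helper extractItem are made up for illustration, and returning null when nothing matches is just one possible convention.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmExtractor {
    // Group 1 captures the text between <em> and </em>.
    // The reluctant .*? stops at the first </em> instead of the last one.
    private static final Pattern ITEM = Pattern.compile("Item#: <em>(.*?)</em>");

    // Hypothetical helper: returns the captured text, or null if there is no match.
    static String extractItem(String content) {
        Matcher matcher = ITEM.matcher(content);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractItem("xxx Item#: <em>something</em> yyy"));
    }
}
```

With a greedy (.*) the capture would run up to the last </em> in the input, which matters as soon as a line contains more than one tag.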
That said, you should not use regex to parse HTML; regex is not powerful enough for that task. You should probably use an HTML parser such as HTML Cleaner. Also see the link provided in one of the comments on the question. That post explains very well the problems you can run into.

Related

How can I push regex matches to array in java?

I've currently got a string, of which I want to use certain parts. With these parts I want to do various things, like pushing them to an array or showing them in a text area.
First I tried the split method. It deletes my regex matches and prints the other parts of the string. I want the opposite: to discard the other parts and print the regex matches.
How can I do this?
For example:
There are a lot of YouTube links like this:
https://www.youtube.com/watch?v=qJuoXM7G322&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7
I want to take only the simple video link, using this expression:
"https:\\/\\/www.youtube.com\\/watch\\?v=.{11}"
When I use this code:
String ytLink = linkArea.getText();
String regexp = "https:\\/\\/www.youtube.com\\/watch\\?v=.{11}";
String[] tokenVal;
tokenVal = ytLink.split(regexp);
System.out.println("Count of Links : "+tokenVal.length);
for (String t : tokenVal) {
System.out.println(t);
}
It prints:
"&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7"
I want the output to be like this:
"https://www.youtube.com/watch?v=SATL2mTfZO0"
You are splitting the string with that regular expression, which is not the correct tool for the job.
It is dividing your example string into:
"" // The bit before the separator.
"https://www.youtube.com/watch?v=qJuoXM7G322" // The separator
"&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7" // The bit after the separator
but then discarding the separator, so you'd get back a 2-element array containing:
"" // The bit before the separator.
"&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7" // The bit after the separator
If you want to get the thing that matches the regex, you'd need to use Pattern and Matcher:
Pattern pattern = Pattern.compile("https:\\/\\/www.youtube.com\\/watch\\?v=.{11}");
Matcher matcher = pattern.matcher(ytLink);
if (matcher.find()) {
System.out.println(matcher.group());
}
(I don't entirely trust the escaping in your regular expression: the forward slashes do not need to be escaped, while the dots do. However, the exact pattern is not really important to the principle.)
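Since the question asks about pushing the matches into an array, here is a rough sketch of the same Pattern/Matcher idea that collects every match into a list. The class and method names are invented for the example, and the tightened pattern (escaped dots, [\w-]{11} for the video id) is an assumption about what the links look like, not something taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkCollector {
    // Assumption: a video id is 11 characters from [A-Za-z0-9_-].
    // Dots are escaped so they only match a literal '.'.
    private static final Pattern VIDEO_LINK =
            Pattern.compile("https://www\\.youtube\\.com/watch\\?v=[\\w-]{11}");

    // Hypothetical helper: returns every match found in the text, in order.
    static List<String> collectLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = VIDEO_LINK.matcher(text);
        while (matcher.find()) {          // each call advances to the next match
            links.add(matcher.group());
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "https://www.youtube.com/watch?v=qJuoXM7G322"
                + "&list=PLRfAW_jVDn06M7qxHIwlowgLY3Io1pG6z&index=7";
        // Convert to an array if an array is really what is needed.
        String[] asArray = collectLinks(text).toArray(new String[0]);
        for (String link : asArray) {
            System.out.println(link);
        }
    }
}
```

Using while instead of if is the only real difference from the snippet above: find() keeps returning true until there are no more matches.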
You can negate your regex using a negative lookahead: (?!pattern)
See also : How to negate the whole regex?

Java Match string with optional hyphen

I am trying to match a series of strings that look like this:
item1 = "some value"
item2 = "some value"
I have some strings, though, that look like this:
item-one = "some new value"
item-two = "some new value"
I am trying to parse it using regular expressions, but I can't get it to match the optional hyphen.
Here is my regex string:
Pattern p = Pattern.compile("^(\\w+[-]?)\\w+?\\s+=\\s+\"(.*)\"");
Matcher m = p.matcher(line);
m.find();
String option = m.group(1);
String value = m.group(2);
Could someone please tell me what I am doing wrong?
Thank you.
I suspect the main cause of your problem is that you expect \\w+? to make \\w+ optional, whereas in reality it makes the + quantifier reluctant. The regex will therefore still need to match one or more \\w characters there, taking them away from what ^(\\w+ could have captured.
Maybe try it this way:
Pattern.compile("^(\\w+(?:-\\w+)?)\\s+=\\s+\"(.*?)\"");
In (\\w+(?:-\\w+)?), the (?:-\\w+) part creates a non-capturing group (the regex won't count it as a group, so (.*?) will still be group(2) even when this part is present), and the ? after it makes this part optional.
In \"(.*?)\", the *? is a reluctant quantifier, which makes the regex look for the shortest match between the quotation marks.
Demo
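To check the pattern from this answer against both kinds of lines, something along these lines can be used. OptionLineParser and parse are names invented for the sketch, and returning a two-element array (or null) is just a convenient choice here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionLineParser {
    // Group 1: the name, with an optional "-suffix" part (non-capturing).
    // Group 2: the quoted value, matched reluctantly.
    private static final Pattern OPTION =
            Pattern.compile("^(\\w+(?:-\\w+)?)\\s+=\\s+\"(.*?)\"");

    // Hypothetical helper: returns {name, value}, or null if the line does not fit.
    static String[] parse(String line) {
        Matcher m = OPTION.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] lines = { "item1 = \"some value\"", "item-one = \"some new value\"" };
        for (String line : lines) {
            String[] parsed = parse(line);
            System.out.println(parsed[0] + " -> " + parsed[1]);
        }
    }
}
```

Checking the result of find() before calling group() avoids the IllegalStateException that the code in the question would throw on a line that does not match.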
Your problem is that you have the ? in the wrong place:
Try this regex:
^((\\w+-)?\\w+)\\s*=\\s*\"([^\"]+)\"
Note that the name and the value are then in groups 1 and 3.
I've also cleaned up the regex a bit.
This regex should work for you:
^\w[\w-]*(?<=\w)\s*=\s*\"([^"]*)\"
In Java:
Pattern p = Pattern.compile("^\\w[\\w-]*(?<=\\w)\\s*=\\s*\"([^\"]*)\"");
Live Demo: http://www.rubular.com/r/0CvByDnj5H
You want something like this:
([\w\-]+)\s*=\s*"([^"]*)"
With extra backslashes for Java:
([\\w\\-]+)\\s*=\\s*\"([^\"]*)\"
If you expect other symbols to start appearing in the variable name, you could make it a character class like [^=\s] to accept any characters not = or whitespace, for example.

Extract substring after a certain pattern

I have the following string:
http://xxx/Content/SiteFiles/30/32531a5d-b0b1-4a8b-9029-b48f0eb40a34/05%20%20LEISURE.mp3?&mydownloads=true
How can I extract the part after 30/? In this case, it's 32531a5d-b0b1-4a8b-9029-b48f0eb40a34. I have other strings that share the same part up to 30/, and after that each string has a different id, up to the next /, which is what I want.
You can do it like this:
String s = "http://xxx/Content/SiteFiles/30/32531a5d-b0b1-4a8b-9029-b48f0eb40a34/05%20%20LEISURE.mp3?&mydownloads=true";
System.out.println(s.substring(s.indexOf("30/")+3, s.length()));
The split function of the String class won't help you in this case, because it discards the delimiter, and that's not what we want here. You need a pattern that looks behind. The lookbehind syntax is:
(?<=X)Y
which matches any Y that is preceded by X.
So in your case you need this pattern:
(?<=30/).*
Compile the pattern, match it against your input, find the match, and retrieve it:
String input = "http://xxx/Content/SiteFiles/30/32531a5d-b0b1-4a8b-9029-b48f0eb40a34/05%20%20LEISURE.mp3?&mydownloads=true";
Matcher matcher = Pattern.compile("(?<=30/).*").matcher(input);
matcher.find();
System.out.println(matcher.group());
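The pattern above returns everything after 30/, including the file name. If only the id up to the next slash is wanted, as the question describes, one possible variation is to replace .* with [^/]+. This is a sketch of that variant, not the pattern from the answer; the class name, the helper name and the extra leading slash in the lookbehind are choices made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdExtractor {
    // (?<=/30/) asserts that "/30/" comes right before the match without consuming it.
    // [^/]+ then takes characters up to, but not including, the next slash.
    private static final Pattern ID_AFTER_30 = Pattern.compile("(?<=/30/)[^/]+");

    // Hypothetical helper: returns the path segment after /30/, or null if absent.
    static String extractId(String url) {
        Matcher matcher = ID_AFTER_30.matcher(url);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "http://xxx/Content/SiteFiles/30/32531a5d-b0b1-4a8b-9029-b48f0eb40a34"
                + "/05%20%20LEISURE.mp3?&mydownloads=true";
        System.out.println(extractId(input));
    }
}
```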
Just for this one, or do you want a generic way to do it?
String[] out = mystring.split("/");
return out[out.length - 2];
I think the / is definitely the delimiter you are searching for.
I can't see the problem you are talking about, Alex.
EDIT: OK, Python got me with the indexes.
A regular expression is the answer, I think. However, how the expression is written depends on the format of the data (URL) you want to process. For example:
Pattern pat = Pattern.compile("/Content/SiteFiles/30/([a-z0-9\\-]+)/.*");
Matcher m = pat.matcher("http://xxx/Content/SiteFiles/30/32531a5d-b0b1-4a8b-9029-b48f0eb40a34/05%20%20LEISURE.mp3?&mydownloads=true");
if (m.find()) {
System.out.println(m.group(1));
}

How to find the exact word using a regex in Java?

Consider the following code snippet:
String input = "Print this";
System.out.println(input.matches("\\bthis\\b"));
Output
false
What could be possibly wrong with this approach? If it is wrong, then what is the right solution to find the exact word match?
PS: I have found a variety of similar questions here but none of them provide the solution I am looking for.
Thanks in advance.
When you use the matches() method, it is trying to match the entire input. In your example, the input "Print this" doesn't match the pattern because the word "Print" isn't matched.
So you need to add something to the regex to match the initial part of the string, e.g.
.*\\bthis\\b
And if you want to allow extra text at the end of the line too:
.*\\bthis\\b.*
Alternatively, use a Matcher object and use Matcher.find() to find matches within the input string:
Pattern p = Pattern.compile("\\bthis\\b");
Matcher m = p.matcher("Print this");
m.find();
System.out.println(m.group());
Output:
this
If you want to find multiple matches in a line, you can call find() and group() repeatedly to extract them all.
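As a rough illustration of calling find() repeatedly, the sketch below records where each whole-word occurrence of "this" starts. The class name, the method name and the sample sentence are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordOccurrences {
    private static final Pattern THIS_WORD = Pattern.compile("\\bthis\\b");

    // Hypothetical helper: start index of every whole-word "this" in the text.
    static List<Integer> findStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = THIS_WORD.matcher(text);
        while (m.find()) {        // continues from the end of the previous match
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // "thistle" is skipped because there is no word boundary after "this".
        System.out.println(findStarts("Print this, then this again, not thistle"));
    }
}
```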
Full example method using matches():
public static String REGEX_FIND_WORD="(?i).*?\\b%s\\b.*?";
public static boolean containsWord(String text, String word) {
String regex=String.format(REGEX_FIND_WORD, Pattern.quote(word));
return text.matches(regex);
}
Explanation:
(?i) - ignore case
.*? - allow (optionally) any characters before
\b - word boundary
%s - placeholder filled in by String.format (quoted to avoid regex errors)
\b - word boundary
.*? - allow (optionally) any characters after
For a good explanation, see: http://www.regular-expressions.info/java.html
myString.matches("regex") returns true or false depending whether the
string can be matched entirely by the regular expression. It is
important to remember that String.matches() only returns true if the
entire string can be matched. In other words: "regex" is applied as if
you had written "^regex$" with start and end of string anchors. This
is different from most other regex libraries, where the "quick match
test" method returns true if the regex can be matched anywhere in the
string. If myString is abc then myString.matches("bc") returns false.
bc matches abc, but ^bc$ (which is really being used here) does not.
This writes "true":
String input = "Print this";
System.out.println(input.matches(".*\\bthis\\b"));
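The difference between matching the whole string and finding a match somewhere inside it can be seen side by side in a small sketch like this one. The two helper names are invented for the comparison; the calls they wrap are String.matches and Matcher.find.

```java
import java.util.regex.Pattern;

class MatchVsFind {
    // True only if the regex matches the entire input.
    static boolean wholeStringMatches(String input, String regex) {
        return input.matches(regex);
    }

    // True if the regex matches anywhere inside the input.
    static boolean containsMatch(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "Print this";
        System.out.println(wholeStringMatches(input, "\\bthis\\b"));   // whole-string test
        System.out.println(containsMatch(input, "\\bthis\\b"));        // substring test
        System.out.println(wholeStringMatches(input, ".*\\bthis\\b")); // padded pattern
    }
}
```

If the same pattern is used many times, compiling it once and reusing the Pattern object avoids recompiling it on every call.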
You can use groups to find the exact word. The regex API defines groups with parentheses. For example:
A(B(C))D
This expression contains three groups, which are indexed from 0:
Group 0 - ABCD
Group 1 - BC
Group 2 - C
So if you need to find a specific word, you can use two methods of the Matcher class: find() to locate the text matched by the regex, and group(int) to get the String captured by a given group number:
String statement = "Hello, my beautiful world";
Pattern pattern = Pattern.compile("Hello, my (\\w+).*");
Matcher m = pattern.matcher(statement);
m.find();
System.out.println(m.group(1));
The code above prints "beautiful".
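To see the group numbering for A(B(C))D directly, a short sketch like the following dumps every group of the first match. The helper name groups is invented here. Note that Matcher.groupCount() does not count group 0, so it reports 2 for this pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Hypothetical helper: returns group 0..groupCount() of the first match,
    // or an empty array if the regex does not match at all.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1]; // +1 for group 0
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] found = groups("A(B(C))D", "ABCD");
        for (int i = 0; i < found.length; i++) {
            System.out.println("group " + i + " = " + found[i]);
        }
    }
}
```

Groups are numbered by the position of their opening parenthesis, counted from left to right.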
Is your search string going to be a regular expression? If not, simply use String.contains(CharSequence s).
System.out.println(input.matches(".*\\bthis$"));
also works. Here the .* matches everything before the space, and then this is matched as a word at the end of the string.

Strip all reluctant curly braces using regex

Note: This is a Java-only question (i.e. no Javascript, sed, Perl, etc.)
I need to filter out all the "reluctant" curly braces ({}) in a long string of text.
(by "reluctant" I mean as in reluctant quantifier).
I have been able to come up with the following regex which correctly finds and lists all such occurrences:
Pattern pattern = Pattern.compile("(\\{)(.*?)(\\})", Pattern.DOTALL);
Matcher matcher = pattern.matcher(originalString);
while (matcher.find()) {
Log.d("WITHIN_BRACES", matcher.group(2));
}
My problem now is how to replace every found matcher.group(0) with the corresponding matcher.group(2).
Intuitively I tried:
while (matcher.find()) {
String noBraces = matcher.replaceAll(matcher.group(2));
}
But that replaced all found matcher.group(0) with only the first matcher.group(2), which is of course not what I want.
Is there an expression or a method in Java's regex to perform this "corresponding replaceAll" that I need?
ANSWER: Thanks to the tip below, I was able to come up with two fixes that did the trick:
if (matcher.find()) {
String noBraces = matcher.replaceAll("$2");
}
Fix #1: Use "$2" instead of matcher.group(2)
Fix #2: Use if instead of while.
Works now like a charm.
You can use the special backreference syntax:
String noBraces = matcher.replaceAll("$2");
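For completeness, here is a small self-contained sketch of the "$2" replacement using the pattern from the question. The class and method names are made up. Because Matcher.replaceAll resets the matcher before scanning, no prior find() call is needed in this version.

```java
import java.util.regex.Pattern;

class BraceStripper {
    // Same pattern as in the question: group 2 is the text between the braces.
    // DOTALL lets . match line terminators, so braces may span several lines.
    private static final Pattern BRACED =
            Pattern.compile("(\\{)(.*?)(\\})", Pattern.DOTALL);

    // Hypothetical helper: replaces every {...} with its inner text.
    static String stripBraces(String text) {
        // "$2" is resolved separately for each match, so every pair of braces
        // is replaced by its own content.
        return BRACED.matcher(text).replaceAll("$2");
    }

    public static void main(String[] args) {
        System.out.println(stripBraces("a {b} c {d} e"));
    }
}
```

Passing matcher.group(2) instead hands replaceAll a fixed string taken from one match, which is why every occurrence ended up with the same replacement in the original attempt.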
