Java: Obtain matched string from an input

I am trying to obtain the string that my matcher is able to find using my provided expression. Something like this..
if(matcher.find())
System.out.println("Matched string is: " + ?);
What would be the appropriate code for this? According to the Oracle documentation, the
matcher.group();
method returns only the provided input, the same as
matcher.group(0);
Thanks in advance..
Edit:
Example follows below:
private static String fileExtensionPattern = ".*<input type=\"hidden\" name=\".*\" value=\".*\" />.*";
private static Pattern fileXtensionExp = Pattern.compile(fileExtensionPattern);
private static Matcher fileXtensionMatcher;
private static String input = "text \"<html><body><table width=\"96\"><tr><td><img src=\"file:/test\" /><input type=\"hidden\" name=\"docExt\" value=\".doc\" />Employee Trv Log 2011 Training Trip.doc</td></tr></table></body></html>\"";
private static void findFileExtension() {
System.out.println("** Searching for file extension **");
System.out.println("Looking for pattern: " + fileExtensionPattern);
fileXtensionMatcher = fileXtensionExp.matcher(input);
if(fileXtensionMatcher.find()) {
//the extension expression is contained in the string
System.out.println("Extension expression found.");
System.out.println(fileXtensionMatcher.group());
}
}
The obtained result is:
text "<html><body><table width="96"><tr><td><img src="file:/test" /><input type="hidden" name="docExt" value=".doc" />Employee Trv Log 2011 Training Trip.doc</td></tr></table></body></html>"

Why do you think that group() returns the input?
According to the JavaDoc:
Returns the input subsequence matched by the previous match.
In other words: it returns that part of the input that was matched.
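A minimal sketch of that behaviour, using a made-up input string and a hypothetical helper called firstMatch (neither comes from the question): group() gives back only the part of the input the pattern covered, and it gives back the whole input only when the pattern itself covers the whole input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns the text matched by the first find(), or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String input = "name=\"docExt\" value=\".doc\"";
        // The pattern covers only the value attribute, so only that part is returned.
        System.out.println(firstMatch("value=\"[^\"]*\"", input));
        // With .* on both sides the pattern covers everything, so group() equals the input.
        System.out.println(firstMatch(".*value=\"[^\"]*\".*", input));
    }
}
```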

After you added the source code, I can assure you that group() returns the whole input string because your regular expression matches the whole input: the leading and trailing .* consume everything around the <input> element. If you want just the <input> element use:
private static String fileExtensionPattern = "<input type=\"hidden\" name=\".*\" value=\".*\" />";
Or use:
private static String fileExtensionPattern = ".*(<input type=\"hidden\" name=\".*\" value=\".*\" />).*";
. . .
System.out.println(fileXtensionMatcher.group(1));

After seeing your update it seems like you need matcher groups. Also you need to make your matches non-greedy (.*? instead of .*). Try this:
private static String fileExtensionPattern =
".*<input type=\"hidden\" name=\".*?\" value=\"(.*?)\" />([^<]*)";
// etc.
private static void findFileExtension() {
// etc.
if(fileXtensionMatcher.find()) {
// etc.
System.out.println(fileXtensionMatcher.group(1));
System.out.println(fileXtensionMatcher.group(2));
}
}
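Putting the pieces together, a self-contained sketch might look like the following. The class name, the extract helper and the dropped leading .* are my own choices, not part of the answer above; the input is the HTML from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileExtensionDemo {
    // Lazy quantifiers (.*?) stop at the first closing quote instead of the last one.
    private static final Pattern FILE_EXTENSION = Pattern.compile(
            "<input type=\"hidden\" name=\".*?\" value=\"(.*?)\" />([^<]*)");

    // Hypothetical helper: returns {extension, file name}, or null when the element is absent.
    static String[] extract(String html) {
        Matcher matcher = FILE_EXTENSION.matcher(html);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2) };
    }

    public static void main(String[] args) {
        String input = "<html><body><table width=\"96\"><tr><td><img src=\"file:/test\" />"
                + "<input type=\"hidden\" name=\"docExt\" value=\".doc\" />"
                + "Employee Trv Log 2011 Training Trip.doc</td></tr></table></body></html>";
        String[] parts = extract(input);
        System.out.println(parts[0]); // group(1): the value attribute
        System.out.println(parts[1]); // group(2): the text after the element, up to the next tag
    }
}
```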

Related

How to remove text between <script></script> tags

I want to remove the content between <script></script> tags. I'm manually checking for the pattern and iterating with a while loop, but I'm getting a StringIndexOutOfBoundsException at this line:
String script = source.substring(startIndex,endIndex-startIndex);
Below is the complete method:
public static String getHtmlWithoutScript(String source) {
String START_PATTERN = "<script>";
String END_PATTERN = " </script>";
while (source.contains(START_PATTERN)) {
int startIndex=source.lastIndexOf(START_PATTERN);
int endIndex=source.indexOf(END_PATTERN,startIndex);
String script=source.substring(startIndex,endIndex);
source.replace(script,"");
}
return source;
}
Am I doing anything wrong here? I'm also getting endIndex = -1. Can anyone help me identify why my code is breaking?
String text = "<script>This is dummy text to remove </script> dont remove this";
StringBuilder sb = new StringBuilder(text);
String startTag = "<script>";
String endTag = "</script>";
//removing the text between script
sb.replace(text.indexOf(startTag) + startTag.length(), text.indexOf(endTag), "");
System.out.println(sb.toString());
If you want to remove the script tags too, add the following line:
String result = sb.toString().replace(startTag, "").replace(endTag, "");
UPDATE:
If you don't want to use StringBuilder you can do this:
String text = "<script>This is dummy text to remove </script> dont remove this";
String startTag = "<script>";
String endTag = "</script>";
//removing the text between script
String textToRemove = text.substring(text.indexOf(startTag) + startTag.length(), text.indexOf(endTag));
text = text.replace(textToRemove, "");
System.out.println(text);
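Coming back to the loop in the question, two things look like the probable cause of the failure, though I can only judge from the code as posted: END_PATTERN starts with a space, so indexOf returns -1 whenever the closing tag is not preceded by one, and the result of source.replace(...) is thrown away, so source never changes. A sketch of a corrected loop, assuming plain <script> tags without attributes (the class name is made up):

```java
class ScriptStripper {
    // Removes every <script>...</script> block, tags included.
    static String getHtmlWithoutScript(String source) {
        final String start = "<script>";
        final String end = "</script>"; // no leading space
        StringBuilder sb = new StringBuilder(source);
        int startIndex;
        while ((startIndex = sb.indexOf(start)) != -1) {
            int endIndex = sb.indexOf(end, startIndex + start.length());
            if (endIndex == -1) {
                break; // unterminated tag: leave the rest untouched instead of throwing
            }
            sb.delete(startIndex, endIndex + end.length());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(getHtmlWithoutScript(
                "<script>This is dummy text to remove </script> dont remove this"));
    }
}
```

Checking endIndex for -1 before using it is what avoids the StringIndexOutOfBoundsException.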
You can use a regex to remove the script tag content:
public String removeScriptContent(String html) {
if(html != null) {
String re = "<script>(.*)</script>";
Pattern pattern = Pattern.compile(re);
Matcher matcher = pattern.matcher(html);
if (matcher.find()) {
return html.replace(matcher.group(1), "");
}
}
return html; // unchanged when there is no <script> block (null stays null)
}
You have to add these two imports:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
I know I'm probably late to the party, but I would like to offer a regex solution that I have actually tested.
What you have to note here is that regex quantifiers are greedy by default. So a pattern such as <script>(.*)</script> will match from the first <script> all the way to the last </script> on the line (or in the whole input, if the option that lets . match line breaks is enabled), swallowing any other script blocks and the text between them.
To match each block on its own, you can use a "lazy" quantifier instead.
Lazy matching:
<script>(.*?)<\/script>
With that, each match ends at the nearest closing tag.
You can read more about lazy and greedy matching in this answer.
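To see the difference in Java, here is a small sketch with a made-up input. The strip helper and the use of Pattern.DOTALL (so that . also matches line breaks inside a script) are my assumptions, not part of the answer above.

```java
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: deletes every match of the given regex from the HTML.
    static String strip(String html, String regex) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<script>a()</script>keep<script>b()</script>";
        // Greedy: one match from the first <script> to the last </script>, so "keep" is lost.
        System.out.println("[" + strip(html, "<script>.*</script>") + "]");
        // Lazy: two separate matches, so the text between them survives.
        System.out.println("[" + strip(html, "<script>.*?</script>") + "]");
    }
}
```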
This worked for me:
private static String removeScriptTags(String message) {
String scriptRegex = "<(/)?[ ]*script[^>]*>";
Pattern pattern2 = Pattern.compile(scriptRegex);
if(message != null) {
Matcher matcher2 = pattern2.matcher(message);
StringBuffer str = new StringBuffer(message.length());
while(matcher2.find()) {
matcher2.appendReplacement(str, Matcher.quoteReplacement(" "));
}
matcher2.appendTail(str);
message = str.toString();
}
return message;
}
Credit goes to nealvs: https://nealvs.wordpress.com/2010/06/01/removing-tags-from-a-string-in-java/

complex regular expression in Java

I have a rather complex (to me it seems rather complex) problem that I'm using regular expressions in Java for:
I can get any text string that must be of the format:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
I started with a regular expression for extracting the text between the M:/:D:/:C:/:Q: as:
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
And that works fine if the <either a url or string> is just an alphanumeric string. But it all falls apart when the embedded string is a url of the format:
tcp://someurl.something:port
Can anyone help me adjust the above regular expression so that the text extracted after :D: can be either a URL or an alphanumeric string?
Here's an example:
public static void main(String[] args) {
String name = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
boolean matchFound = false;
ArrayList<String> values = new ArrayList<>();
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
Matcher m3 = Pattern.compile(pattern2).matcher(name);
while (m3.find()) {
matchFound = true;
String m = m3.group(2);
System.out.println("regex found match: " + m);
values.add(m);
}
}
In the above example, the results I want would be:
myString1
tcp://someurl.com:8989
myString2
1
Note that the strings can be of variable length and are alphanumeric, but some other characters must be allowed as well (such as the :// of the URL format and/or the . and - characters).
You mention that the format is constant:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
Capture groups can do this for you with the pattern:
"M:(.*):D:(.*):C:(.*):Q:(.*)"
Or you can do a String.split() with a pattern of "M:|:D:|:C:|:Q:". However, the split will return an empty element at the first index. Everything else will follow.
public static void main(String[] args) throws Exception {
System.out.println("Regex: ");
String data = "M:<some text>:D:tcp://someurl.something:port:C:<some more text>:Q:<a number>";
Matcher matcher = Pattern.compile("M:(.*):D:(.*):C:(.*):Q:(.*)").matcher(data);
if (matcher.matches()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println(matcher.group(i));
}
}
System.out.println();
System.out.println("String.split(): ");
String[] pieces = data.split("M:|:D:|:C:|:Q:");
for (String piece : pieces) {
System.out.println(piece);
}
}
Results:
Regex:
<some text>
tcp://someurl.something:port
<some more text>
<a number>
String.split():
<some text>
tcp://someurl.something:port
<some more text>
<a number>
To extract the URL/text part you don't need the regular expression. Use
int startPos = input.indexOf(":D:")+":D:".length();
int endPos = input.indexOf(":C:", startPos);
String urlOrText = input.substring(startPos, endPos);
Assuming you need to do some validation along with the parsing:
break the regex into different parts like this:
String m_regex = "[\\w.]+"; // in Java, a . inside [] is just a plain dot
String url_regex = "."; // placeholder: there are plenty of URL regexes online, pick your favorite
String d_regex = "(?:" + url_regex + "|\\p{Alnum}+)"; // a URL or a sequence of alphanumeric characters
String c_regex = "[\\w.]+"; // I'm assuming you want this to be a bit more restrictive, but I'm not sure
String q_regex = "\\d+"; // what sort of number exactly? assuming any string of digits here
String regex = "M:(?<M>" + m_regex + "):"
+ "D:(?<D>" + d_regex + "):"
+ "C:(?<C>" + c_regex + "):" // each named group needs a unique name
+ "Q:(?<Q>" + q_regex + ")";
Pattern p = Pattern.compile(regex);
It might be a good idea to keep the compiled pattern in a static field and build it in a static initializer block, so that the temporary regex strings don't clutter the class as otherwise useless fields.
Then you can retrieve each part by its name:
Matcher m = p.matcher( input );
if (m.matches()) {
String m_part = m.group( "M" );
...
String q_part = m.group( "Q" );
}
You can go a step further and create a RegexGroup interface whose implementing objects each represent one part of the regex, holding its name and the actual regex. You do lose the simplicity, though, which makes the code harder to understand at a quick glance. (I wouldn't do this; I'm just pointing out that it's possible and has its own benefits.)
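For reference, the named-group approach can be sketched end to end as below. The class and method names are made up, and the :D: part uses a deliberately permissive .+? as a stand-in for a real URL regex, which is an assumption on my side. Named groups need Java 7 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MessageParser {
    // Compiled once and kept in a static field.
    // The D part is a permissive placeholder (.+?), not a real URL validator.
    private static final Pattern MESSAGE = Pattern.compile(
            "M:(?<M>[\\w.]+):D:(?<D>.+?):C:(?<C>[\\w.]+):Q:(?<Q>\\d+)");

    // Returns the four parts keyed by group name, or null if the input does not match.
    static Map<String, String> parse(String input) {
        Matcher m = MESSAGE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String name : new String[] { "M", "D", "C", "Q" }) {
            parts.put(name, m.group(name));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1"));
    }
}
```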

get substring from string Java(Android)

Could you help me with this problem? I have String like this:
<p class="youtube_sc mobile "><img src="http://i.ytimg.com/vi/bMHJODdp7-U/hqdefault.jpg" /><span class="play-button-outer" title="Click to play video"><span class="play-button"></span></span></p>
I need to get only what comes after "youtube:", which in this example is this part of the line:
bMHJODdp7-U
I use this code of my example:
String path = strin.replaceAll(".*(youtube:\\S+)", "$ title");
Here strin is the String line. But this code only replaces the part I need with " title".
Any ideas?
Explanation - youtube:(.*?)\" matches for text in between youtube: and "
public static void main(String[] args) {
String s = "<p class=\"youtube_sc mobile \"><img src=\"http://i.ytimg.com/vi/bMHJODdp7-U/hqdefault.jpg\" /><span class=\"play-button-outer\" title=\"Click to play video\"><span class=\"play-button\"></span></span></p>";
Pattern pattern = Pattern.compile("youtube:(.*?)\"");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
System.out.println(matcher.group(1));
}
}
Output
bMHJODdp7-U
You can use this regex for search:
^.*?youtube:([^"]+).*$
and replace it by:
$1
Code:
String path = strin.replaceFirst("^.*?youtube:([^\"]+).*$", "$1");
//=> bMHJODdp7-U
Change your regular expression to
(?:youtube:)(\S+)(?:")
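One caveat: the string as it is printed in the question contains no "youtube:" at all, so the patterns above would find nothing in it. The sketch below therefore assumes a hypothetical variant of the markup in which the ID appears as href="youtube:bMHJODdp7-U"; that attribute, the class name and the helper are my own assumptions, not the original input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YoutubeId {
    // Captures everything after "youtube:" up to (not including) the next double quote.
    private static final Pattern ID = Pattern.compile("youtube:([^\"]+)");

    // Returns the captured ID, or null when the input has no "youtube:" marker.
    static String extractId(String html) {
        Matcher matcher = ID.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical markup: the href attribute is assumed, not taken from the question.
        String html = "<a href=\"youtube:bMHJODdp7-U\" class=\"youtube_sc mobile \">";
        System.out.println(extractId(html));
    }
}
```

Returning null for a missing marker makes the "no match" case explicit, which replaceAll and replaceFirst do not do: they hand back the input unchanged.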

Replace String in Java with regex and replaceAll

Is there a simple solution to parse a String by using regex in Java?
I have to adapt a HTML page. Therefore I have to parse several strings, e.g.:
href="/browse/PJBUGS-911"
=>
href="PJBUGS-911.html"
The pattern of the strings is only different corresponding to the ID (e.g. 911). My first idea looks like this:
String input = "";
String output = input.replaceAll("href=\"/browse/PJBUGS\\-[0-9]*\"", "href=\"PJBUGS-???.html\"");
I want to replace everything except the ID. How can I do this?
Would be nice if someone can help me :)
You can capture substrings that were matched by your pattern, using parentheses. And then you can use the captured things in the replacement with $n where n is the number of the set of parentheses (counting opening parentheses from left to right). For your example:
String output = input.replaceAll("href=\"/browse/PJBUGS-([0-9]*)\"", "href=\"PJBUGS-$1.html\"");
Or if you want:
String output = input.replaceAll("href=\"/browse/(PJBUGS-[0-9]*)\"", "href=\"$1.html\"");
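A runnable sketch of the capture-group replacement, with a made-up class name and sample markup. I used [0-9]+ rather than [0-9]* so that an ID with no digits is not rewritten; that is my choice, not part of the answer above.

```java
class HrefRewriter {
    // $1 in the replacement refers to whatever the first pair of parentheses captured.
    static String rewrite(String html) {
        return html.replaceAll("href=\"/browse/(PJBUGS-[0-9]+)\"", "href=\"$1.html\"");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/browse/PJBUGS-911\">x</a> <a href=\"/browse/PJBUGS-34234\">y</a>";
        System.out.println(rewrite(html));
    }
}
```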
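This does not use regexp, but maybe it still solves your problem. It assumes input holds a single href attribute such as href="/browse/PJBUGS-911".
output = "href=\"" + input.substring(input.lastIndexOf('/') + 1, input.lastIndexOf('"')) + ".html\""; // skip the slash, stop before the closing quote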
This is how I would do it:
public static void main(String[] args)
{
String text = "href=\"/browse/PJBUGS-911\" blahblah href=\"/browse/PJBUGS-111\" " +
"blahblah href=\"/browse/PJBUGS-34234\"";
Pattern ptrn = Pattern.compile("href=\"/browse/(PJBUGS-[0-9]+?)\"");
Matcher mtchr = ptrn.matcher(text);
while(mtchr.find())
{
String match = mtchr.group(0);
String insMatch = mtchr.group(1);
String repl = match.replaceFirst(match, "href=\"" + insMatch + ".html\"");
System.out.println("orig = <" + match + "> repl = <" + repl + ">");
}
}
This just shows the regex and replacements, not the final formatted text, which you can get by using Matcher.replaceAll:
String allRepl = mtchr.replaceAll("href=\"$1.html\"");
If just interested in replacing all, you don't need the loop -- I used it just for debugging/showing how regex does business.

How can I extract all substring by matching a regular expression?

I want to extract the values of all src attributes in this string. How can I do that?
<p>Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/1.jpg" />
Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/2.jpg" />
</p>
Here you go:
String data = "<p>Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/1.jpg\" />\n" +
"Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/2.jpg\" />\n" +
"</p>";
Pattern p0 = Pattern.compile("src=\"([^\"]+)\"");
Matcher m = p0.matcher(data);
while (m.find())
{
System.out.printf("found: %s%n", m.group(1));
}
Most regex flavors have a shorthand for grabbing all matches, like Ruby's scan method or .NET's Matches(), but in Java versions before 9 you have to spell it out with a find() loop (Java 9 added Matcher.results()).
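On Java 9 or later, Matcher.results() gives a stream of matches, which comes close to such a shorthand. A sketch, with a made-up class and method name:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SrcExtractor {
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]+)\"");

    // Requires Java 9+: Matcher.results() streams one MatchResult per match.
    static List<String> srcValues(String html) {
        return SRC.matcher(html)
                .results()
                .map(result -> result.group(1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String data = "<p>Test \n"
                + "<img alt=\"70\" src=\"/adminpanel/userfiles/image/1.jpg\" />\n"
                + "<img alt=\"70\" src=\"/adminpanel/userfiles/image/2.jpg\" />\n"
                + "</p>";
        srcValues(data).forEach(System.out::println);
    }
}
```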
Idea - split around the '"' char, look at each part if it contains the attribute name src and - if yes - store the next value, which is a src attribute.
String[] parts = thisString.split("\""); // splits at " char
List<String> srcAttributes = new ArrayList<String>();
boolean nextIsSrcAttrib = false;
for (String part:parts) {
if (part.trim().endsWith("src=")) {
nextIsSrcAttrib = true;
} else if (nextIsSrcAttrib) {
srcAttributes.add(part);
nextIsSrcAttrib = false;
}
}
Better idea - feed it into a usual html parser and extract the values of all src attributes from all img elements. But the above should work as an easy solution, especially in non-production code.
sorry for not coding it (short of time)
how about:
1. (Assuming that the file size is reasonable) read the entire file into a String.
2. Split the String around "src=\"" (assume that the resulting array is called strArr).
3. Loop over the resulting array of Strings and store strArr[i].substring(0,strArr[i].indexOf("\" />")) in some collection of image sources.
Aviad
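The steps above could be coded roughly like this. The class and method names are made up, the file reading from step 1 is left out, and I cut each piece at the next double quote rather than at "\" />" so that src does not have to be the last attribute; those are my assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSrc {
    static List<String> srcValues(String html) {
        List<String> sources = new ArrayList<>();
        String[] strArr = html.split("src=\"");
        // strArr[0] is the text before the first src attribute, so start at index 1.
        for (int i = 1; i < strArr.length; i++) {
            int end = strArr[i].indexOf('"');
            if (end != -1) { // skip a piece with no closing quote
                sources.add(strArr[i].substring(0, end));
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<p>Test <img alt=\"70\" src=\"/adminpanel/userfiles/image/1.jpg\" />"
                + " Test <img alt=\"70\" src=\"/adminpanel/userfiles/image/2.jpg\" /></p>";
        System.out.println(srcValues(html));
    }
}
```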
since you've requested a regex implementation ...
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
private static String input = "....your html.....";
public static void main(String[] args) {
Pattern pattern = Pattern.compile("src=\"[^\"]*\""); // [^\"]* stops at the closing quote; .* could run on to a later quote on the same line
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
You may have to tweak the regex if your src attributes are not double quoted
