How can I extract all substrings by matching a regular expression?

I want to extract the values of all src attributes in this string. How can I do that?
<p>Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/1.jpg" />
Test
<img alt="70" width="70" height="50" src="/adminpanel/userfiles/image/2.jpg" />
</p>

Here you go:
String data = "<p>Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/1.jpg\" />\n" +
"Test \n" +
"<img alt=\"70\" width=\"70\" height=\"50\" src=\"/adminpanel/userfiles/image/2.jpg\" />\n" +
"</p>";
Pattern p0 = Pattern.compile("src=\"([^\"]+)\"");
Matcher m = p0.matcher(data);
while (m.find())
{
System.out.printf("found: %s%n", m.group(1));
}
Most regex flavors have a shorthand for grabbing all matches, like Ruby's scan method or .NET's Matches(), but in Java you have to spell out the find() loop yourself (Java 9 later added Matcher.results(), which returns the matches as a stream).
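For readers on Java 9 or later, here is a minimal sketch of a scan-like helper built on Matcher.results(). The class name SrcScan and the method allSrc are made up for illustration; the pattern is the same one used in the loop above.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SrcScan {
    // Same pattern as in the answer: capture the text between the quotes of a src attribute.
    private static final Pattern SRC = Pattern.compile("src=\"([^\"]+)\"");

    // Returns capture group 1 of every match, in order of appearance.
    static List<String> allSrc(String html) {
        return SRC.matcher(html)
                  .results()               // Stream<MatchResult>, available since Java 9
                  .map(r -> r.group(1))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String data = "<img src=\"/a/1.jpg\" /> text <img src=\"/a/2.jpg\" />";
        System.out.println(allSrc(data)); // prints [/a/1.jpg, /a/2.jpg]
    }
}
```

On older Java versions the while (m.find()) loop remains the way to go.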

Idea - split around the '"' character, check whether each part ends with the attribute name src= and - if yes - store the next part, which is the value of a src attribute.
String[] parts = thisString.split("\""); // splits at the " character
List<String> srcAttributes = new ArrayList<String>();
boolean nextIsSrcAttrib = false;
for (String part : parts) {
    if (part.trim().endsWith("src=")) {
        // the part right after this one is the attribute value
        nextIsSrcAttrib = true;
    } else if (nextIsSrcAttrib) {
        srcAttributes.add(part);
        nextIsSrcAttrib = false;
    }
}
Better idea - feed it into a proper HTML parser and extract the values of all src attributes from all img elements. But the above should work as an easy solution, especially in non-production code.

Sorry for not coding it (short on time).
How about:
1. (Assuming the file size is reasonable) read the entire file into a String.
2. Split the String around "src=\"" (assume the resulting array is called strArr).
3. Loop over the resulting array, skipping strArr[0] (the text before the first src), and store strArr[i].substring(0, strArr[i].indexOf("\" />")) in some collection of image sources.
Aviad
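The steps above can be sketched roughly as follows. This is only an illustration, not the poster's code: the class name SplitSrc is made up, and the sketch cuts each piece at the first closing quote rather than at "\" />", so it also works when src is not the last attribute of the tag.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSrc {
    static List<String> extract(String html) {
        // split() takes a regex; src=" contains no regex metacharacters, so it is matched literally.
        String[] strArr = html.split("src=\"");
        List<String> sources = new ArrayList<>();
        // strArr[0] is whatever precedes the first src attribute, so start at index 1.
        for (int i = 1; i < strArr.length; i++) {
            int end = strArr[i].indexOf('"'); // closing quote of the attribute value
            if (end >= 0) {
                sources.add(strArr[i].substring(0, end));
            }
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<p>Test <img alt=\"70\" src=\"/img/1.jpg\" /> Test <img src=\"/img/2.jpg\" width=\"70\" /></p>";
        System.out.println(extract(html)); // prints [/img/1.jpg, /img/2.jpg]
    }
}
```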

Since you asked for a regex implementation:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
private static String input = "....your html.....";
public static void main(String[] args) {
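Pattern pattern = Pattern.compile("src=\"([^\"]*)\""); // [^\"]* stops at the closing quote; a greedy .* can run on to a later quote on the same line
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group(1)); // group(1) is the attribute value without the surrounding src="..."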
}
}
}
You may have to tweak the regex if your src attributes are not double-quoted.
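One possible tweak, as a sketch: capture the opening quote character and require the same character to close the value by using a backreference. The class name QuotedSrc is made up for illustration, and unquoted attribute values are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedSrc {
    // Group 1 is the quote character (" or '), group 2 the value; \1 demands the same closing quote.
    private static final Pattern SRC = Pattern.compile("src=([\"'])(.*?)\\1");

    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = SRC.matcher(html);
        while (m.find()) {
            values.add(m.group(2));
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(extract("<img src='/a.jpg'> <img src=\"/b.jpg\">")); // prints [/a.jpg, /b.jpg]
    }
}
```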

Related

Two separate patterns and matchers (java)

I'm working on a simple bot for Discord. The first pattern (reading) works fine and gives me the results I'm looking for, but the second one doesn't seem to work and I can't figure out why.
Any help would be appreciated.
public void onMessageReceived(MessageReceivedEvent event) {
if (event.getMessage().getContent().startsWith("!")) {
String output, newUrl;
String word, strippedWord;
String url = "http://jisho.org/api/v1/search/words?keyword=";
Pattern reading;
Matcher matcher;
word = event.getMessage().getContent();
strippedWord = word.replace("!", "");
newUrl = url + strippedWord;
//Output contains the raw text from jisho
output = getUrlContents(newUrl);
//Searching through the raw text to pull out the first "reading: "
reading = Pattern.compile("\"reading\":\"(.*?)\"");
matcher = reading.matcher(output);
//Searching through the raw text to pull out the first "english_definitions: "
Pattern def = Pattern.compile("\"english_definitions\":[\"(.*?)]");
Matcher matcher2 = def.matcher(output);
event.getTextChannel().sendMessage(matcher2.toString());
if (matcher.find() && matcher2.find()) {
event.getTextChannel().sendMessage("Reading: "+matcher.group(1)).queue();
event.getTextChannel().sendMessage("Definition: "+matcher2.group(1)).queue();
}
else {
event.getTextChannel().sendMessage("Word not found").queue();
}
}
}

You have to escape the [ character as \\[ (once for the Java String and once for the regex). You also forgot the closing \".
The correct pattern looks like this:
Pattern def = Pattern.compile("\"english_definitions\":\\[\"(.*?)\"]");
In the output, you might want to re-add the \" at the start and end:
event.getTextChannel().sendMessage("Definition: \""+matcher2.group(1) + "\"").queue();
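To see both patterns in action without the bot, here is a small standalone sketch. The JSON fragment is a made-up sample for illustration, not actual output from jisho, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JishoPatterns {
    static final Pattern READING = Pattern.compile("\"reading\":\"(.*?)\"");
    // \\[ in the Java source becomes \[ in the regex, which matches a literal [.
    static final Pattern DEFINITION = Pattern.compile("\"english_definitions\":\\[\"(.*?)\"]");

    // Returns group 1 of the first match, or null if the pattern does not occur.
    static String first(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String json = "{\"reading\":\"neko\",\"senses\":[{\"english_definitions\":[\"cat\"]}]}";
        System.out.println(first(READING, json));    // prints neko
        System.out.println(first(DEFINITION, json)); // prints cat
    }
}
```

Note that when the array holds several definitions, such as ["cat","feline"], the lazy group runs up to the first "] and therefore captures cat","feline, inner quotes and comma included.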

Parsing Inner <p> tags

I need to parse some XML content and find the inner <p> tags inside other <p> tags:
<p><span>test</span></p> <p><span>test12</span></p> <p>Some text<p><span>test</span></p></p>
In the sample above, the last p tag has another p tag inside it. I need to find the inner p tags of a p tag. I tried the following:
public static void main(String[] args) {
String text= "<p><span>test</span></p> <p><span>test12</span></p> <p>Some text<p><span>test</span></p></p>";
Pattern pattern = Pattern.compile("<p>.*?</p>");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
String match = matcher.group();
//System.out.println("matcher group:"+match);
if (match.lastIndexOf("<p>") > 0) {
//System.out.println("Substring:"+match.substring(match.indexOf("<p>") + "<p>".length(), match.indexOf("</p>")));
text = text.replace(match, "<p>" +match.substring(match.indexOf("<p>") + "<p>".length(), match.indexOf("</p>")).replaceAll("<p>", ""));
}
}
System.out.println("text:"+text);
}
Let me know if there is an easier way to do this.

Have a look at JAXB.
As suggested by others, don't do this manually and instead use an existing library like JAXB.
An easy to understand JAXB hello world example can be found here.

complex regular expression in Java

I have a rather complex (to me it seems rather complex) problem that I'm using regular expressions in Java for:
I can get any text string that must be of the format:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
I started with a regular expression for extracting the text between the M:/:D:/:C:/:Q: as:
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
And that works fine if the <either a url or string> is just an alphanumeric string. But it all falls apart when the embedded string is a url of the format:
tcp://someurl.something:port
Can anyone help me adjust the above regex so that the text extracted after :D: can be either a URL or an alphanumeric string?
Here's an example:
public static void main(String[] args) {
String name = "M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1";
boolean matchFound = false;
ArrayList<String> values = new ArrayList<>();
String pattern2 = "(M:|:D:|:C:|:Q:.*?)([a-zA-Z_\\.0-9]+)";
Matcher m3 = Pattern.compile(pattern2).matcher(name);
while (m3.find()) {
matchFound = true;
String m = m3.group(2);
System.out.println("regex found match: " + m);
values.add(m);
}
}
In the above example, my results would be:
myString1
tcp://someurl.com:8989
myString2
1
Note that the strings can be of variable length and are alphanumeric, but some other characters are allowed (such as the :// of the URL format and/or the . and - characters).

You mention that the format is constant:
M:<some text>:D:<either a url or string>:C:<some more text>:Q:<a number>
Capture groups can do this for you with the pattern:
"M:(.*):D:(.*):C:(.*):Q:(.*)"
Or you can do a String.split() with a pattern of "M:|:D:|:C:|:Q:". However, the split will return an empty element at the first index. Everything else will follow.
public static void main(String[] args) throws Exception {
System.out.println("Regex: ");
String data = "M:<some text>:D:tcp://someurl.something:port:C:<some more text>:Q:<a number>";
Matcher matcher = Pattern.compile("M:(.*):D:(.*):C:(.*):Q:(.*)").matcher(data);
if (matcher.matches()) {
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println(matcher.group(i));
}
}
System.out.println();
System.out.println("String.split(): ");
String[] pieces = data.split("M:|:D:|:C:|:Q:");
for (String piece : pieces) {
System.out.println(piece);
}
}
Results:
Regex:
<some text>
tcp://someurl.something:port
<some more text>
<a number>
String.split():
<some text>
tcp://someurl.something:port
<some more text>
<a number>

To extract the URL/text part you don't need a regular expression. Use
int startPos = input.indexOf(":D:")+":D:".length();
int endPos = input.indexOf(":C:", startPos);
String urlOrText = input.substring(startPos, endPos);

Assuming you need to do some validation along with the parsing, break the regex into separate parts like this:
String m_regex = "[\\w.]+"; // in Java, a . inside [] is just a plain dot
String url_regex = "."; // placeholder: there are plenty of URL regexes online, pick your favorite
String d_regex = "(?:" + url_regex + "|\\p{Alnum}+)"; // a URL or a sequence of alphanumeric characters
String c_regex = "[\\w.]+"; // I'm assuming you want this to be a bit more restrictive; not sure
String q_regex = "\\d+"; // what sort of number exactly? Assuming any string of digits here
String regex = "M:(?<M>" + m_regex + "):"
+ "D:(?<D>" + d_regex + "):"
+ "C:(?<C>" + c_regex + "):" // every named group needs its own unique name
+ "Q:(?<Q>" + q_regex + ")";
Pattern p = Pattern.compile(regex);
It might be a good idea to keep the pattern as a static field and compile it in a static block, so that the temporary regex strings don't crowd the class with basically useless fields.
Then you can retrieve each part by its name:
Matcher m = p.matcher( input );
if (m.matches()) {
String m_part = m.group( "M" );
...
String q_part = m.group( "Q" );
}
You can go even a step further by making a RegexGroup interface whose implementing objects each represent one part of the regex, with a name and the actual regex. You definitely lose simplicity, though, which makes the code harder to understand at a quick glance. (I wouldn't do this; I'm just pointing out that it's possible and has its own benefits.)
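The named-group approach from this answer can be sketched end to end as follows. This is an illustration under assumptions: the class name MessageParser is made up, and the URL sub-pattern is a deliberately loose stand-in (scheme://host with an optional :port), not a real URL validator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MessageParser {
    // Loose stand-in for "a URL or an alphanumeric string"; swap in a stricter URL regex as needed.
    private static final String D_REGEX = "(?:\\w+://[\\w.-]+(?::\\d+)?|\\p{Alnum}+)";

    // Compiled once and kept as a static field, as suggested above.
    private static final Pattern FORMAT = Pattern.compile(
            "M:(?<M>[\\w.]+):"
          + "D:(?<D>" + D_REGEX + "):"
          + "C:(?<C>[\\w.]+):"
          + "Q:(?<Q>\\d+)");

    // Returns the four parts in order M, D, C, Q, or null if the input does not fit the format.
    static String[] parse(String input) {
        Matcher m = FORMAT.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("M"), m.group("D"), m.group("C"), m.group("Q") };
    }

    public static void main(String[] args) {
        String[] parts = parse("M:myString1:D:tcp://someurl.com:8989:C:myString2:Q:1");
        System.out.println(String.join(" | ", parts));
        // prints myString1 | tcp://someurl.com:8989 | myString2 | 1
    }
}
```

Because matches() must consume the whole input, anything that deviates from the format (for example a non-numeric Q part) is rejected rather than partially parsed.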

Replace String in Java with regex and replaceAll

Is there a simple solution to parse a String by using regex in Java?
I have to adapt a HTML page. Therefore I have to parse several strings, e.g.:
href="/browse/PJBUGS-911"
=>
href="PJBUGS-911.html"
The strings differ only in the ID (e.g. 911). My first idea looks like this:
String input = "";
String output = input.replaceAll("href=\"/browse/PJBUGS\\-[0-9]*\"", "href=\"PJBUGS-???.html\"");
I want to replace everything except the ID. How can I do this?
Would be nice if someone can help me :)

You can capture substrings that were matched by your pattern, using parentheses. And then you can use the captured things in the replacement with $n where n is the number of the set of parentheses (counting opening parentheses from left to right). For your example:
String output = input.replaceAll("href=\"/browse/PJBUGS-([0-9]*)\"", "href=\"PJBUGS-$1.html\"");
Or if you want:
String output = input.replaceAll("href=\"/browse/(PJBUGS-[0-9]*)\"", "href=\"$1.html\"");

This does not use regexp. But maybe it still solves your problem.
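output = "href=\"" + input.substring(input.lastIndexOf("/") + 1, input.lastIndexOf("\"")) + ".html\""; // +1 skips the slash, the end index drops the closing quote; assumes input holds exactly one href attribute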

This is how I would do it:
public static void main(String[] args)
{
String text = "href=\"/browse/PJBUGS-911\" blahblah href=\"/browse/PJBUGS-111\" " +
"blahblah href=\"/browse/PJBUGS-34234\"";
Pattern ptrn = Pattern.compile("href=\"/browse/(PJBUGS-[0-9]+?)\"");
Matcher mtchr = ptrn.matcher(text);
while(mtchr.find())
{
String match = mtchr.group(0);
String insMatch = mtchr.group(1);
String repl = match.replaceFirst(match, "href=\"" + insMatch + ".html\"");
System.out.println("orig = <" + match + "> repl = <" + repl + ">");
}
}
This just shows the regex and replacements, not the final formatted text, which you can get by using Matcher.replaceAll:
String allRepl = mtchr.replaceAll("href=\"$1.html\"");
If you're just interested in replacing all occurrences, you don't need the loop. I used it only for debugging and to show how the regex does its business.

Java: Obtain matched string from an input

I am trying to obtain the string that my matcher is able to find using my provided expression. Something like this..
if(matcher.find())
System.out.println("Matched string is: " + ?);
What would be the appropriate code for this? According to Oracle the
matcher.group();
method returns only the provided input same as
matcher.group(0);
Thanks in advance..
Edit:
Example follows below:
private static String fileExtensionPattern = ".*<input type=\"hidden\" name=\".*\" value=\".*\" />.*";
private static Pattern fileXtensionExp = Pattern.compile(fileExtensionPattern);
private static Matcher fileXtensionMatcher;
private static String input = "text \"<html><body><table width=\"96\"><tr><td><img src=\"file:/test\" /><input type=\"hidden\" name=\"docExt\" value=\".doc\" />Employee Trv Log 2011 Training Trip.doc</td></tr></table></body></html>\"";
private static void findFileExtension() {
System.out.println("** Searching for file extension **");
System.out.println("Looking for pattern: " + fileExtensionPattern);
fileXtensionMatcher = fileXtensionExp.matcher(input);
if(fileXtensionMatcher.find()) {
//the extension expression is contained in the string
System.out.println("Extension expression found.");
System.out.println(fileXtensionMatcher.group());
}
}
The obtained result is:
text "<html><body><table width="96"><tr><td><img src="file:/test" /><input type="hidden" name="docExt" value=".doc" />Employee Trv Log 2011 Training Trip.doc</td></tr></table></body></html>"

Why do you think that group() returns the input?
According to the JavaDoc:
Returns the input subsequence matched by the previous match.
In other words: it returns that part of the input that was matched.
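A small sketch that makes the difference visible; the class name GroupDemo and the sample strings are made up for illustration. With a pattern that covers only part of the input, group() returns just that part. Once the pattern is wrapped in .* on both sides, the match itself spans the whole input, so group() returns everything.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns the text matched by the first find(), or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("b+", "aabbbcc"));     // prints bbb
        System.out.println(firstMatch(".*b+.*", "aabbbcc")); // prints aabbbcc
    }
}
```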

After you added the source code, I can assure you that group() returns the whole input string because the whole string matches your regular expression: the leading and trailing .* consume everything around the <input> element. If you want just the <input> element use:
private static String fileExtensionPattern = "<input type=\"hidden\" name=\".*\" value=\".*\" />";
Or use:
private static String fileExtensionPattern = ".*(<input type=\"hidden\" name=\".*\" value=\".*\" />).*";
. . .
System.out.println(fileXtensionMatcher.group(1));

After seeing your update it seems like you need matcher groups. Also you need to make your matches non-greedy (.*? instead of .*). Try this:
private static String fileExtensionPattern =
".*<input type=\"hidden\" name=\".*?\" value=\"(.*?)\" />([^<]*)";
// etc.
private static void findFileExtension() {
// etc.
if(fileXtensionMatcher.find()) {
// etc.
System.out.println(fileXtensionMatcher.group(1));
System.out.println(fileXtensionMatcher.group(2));
}
}
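To illustrate why the non-greedy quantifier matters, here is a minimal sketch with two hidden inputs on one line. The class name GreedyVsLazy and the sample HTML are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Builds value="(<valueRegex>)" and returns group 1 of the first match, or null.
    static String firstValue(String valueRegex, String html) {
        Matcher m = Pattern.compile("value=\"(" + valueRegex + ")\"").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<input name=\"a\" value=\".doc\" /><input name=\"b\" value=\".pdf\" />";
        // Greedy: .* runs to the last quote on the line.
        System.out.println(firstValue(".*", html));  // prints .doc" /><input name="b" value=".pdf
        // Lazy: .*? stops at the first closing quote.
        System.out.println(firstValue(".*?", html)); // prints .doc
    }
}
```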
