Search for a substring between certain characters at unknown index - java

I have a String such as:
<div class="photo-box biz-photo-box pb-60s">
<a href="/biz/the-kerry-piper-willowbrook">
<img class="photo-img" alt="" height="60" src="http://s3-media3.ak.yelpcdn.com/bphoto/rCz-uF_qwqyb5Nnq74JeVQ/60s.jpg" width="60">
</a>
How can I retrieve the url
http://s3-media3.ak.yelpcdn.com/bphoto/rCz-uF_qwqyb5Nnq74JeVQ/60s.jpg
from this String?
I thought about using string.indexOf(), but the number of characters before and after the URL can vary, so I don't know at which index the substring starts, and this could get messy. What is the best approach?

Use Jsoup to scrape/parse HTML from a URL, file, or string, and use its jQuery-like selector syntax.
String htmlStr="<div class=\"photo-box biz-photo-box pb-60s\">"
+ "<a href=\"/biz/the-kerry-piper-willowbrook\">"
+ "<img class=\"photo-img\" alt=\"\" height=\"60\" src=\"http://s3-media3.ak.yelpcdn.com/bphoto/rCz-uF_qwqyb5Nnq74JeVQ/60s.jpg\" width=\"60\">"
+ "</a>";
org.jsoup.nodes.Document doc=org.jsoup.Jsoup.parse(htmlStr);
String src=doc.select("img").attr("src");
System.out.println(src);

If you don't want to use an HTML parser, you could construct a regular expression and use the java.util.regex package to match only the data you need.
Something like,
// data holds the HTML string to search
Pattern pattern = Pattern.compile("<img.*?src=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
Matcher m = pattern.matcher(data);
String srcUrl = null;
while (m.find()) {
    srcUrl = m.group(1); // ends up holding the src of the last <img> matched
}
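As a self-contained sketch of the regex approach above, the snippet below wraps the pattern in a small helper. The class name ImgSrcRegex and the method firstImgSrc are hypothetical names chosen for illustration, and the pattern is tightened to [^>]*? so a match cannot run past the end of the <img> tag; that tightening is my assumption, not part of the original answer.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ImgSrcRegex {
    // [^>]*? keeps the match inside a single <img ...> tag.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the src of the first <img> tag, or empty if none is found. */
    static Optional<String> firstImgSrc(String html) {
        Matcher m = IMG_SRC.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div class=\"photo-box biz-photo-box pb-60s\">"
                + "<a href=\"/biz/the-kerry-piper-willowbrook\">"
                + "<img class=\"photo-img\" alt=\"\" height=\"60\" src=\"http://s3-media3.ak.yelpcdn.com/bphoto/rCz-uF_qwqyb5Nnq74JeVQ/60s.jpg\" width=\"60\">"
                + "</a>";
        System.out.println(firstImgSrc(html).orElse("(no image found)"));
    }
}
```

Returning an Optional makes the "no image" case explicit instead of leaving a variable unset. This still only handles double-quoted src values, so a real HTML parser remains the safer choice for arbitrary pages.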

Related

Regex get text between tags

I am trying to get the text between tags in Java.
`
<td colspan="2" style="font-weight:bold;">HELLO TOTO</td>
<td>Function :</td>
`
I would like to use a regex to extract "HELLO TOTO" but not "Function :".
I already tried something like this:
`
String btwTags = "<td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td>\n" + "<td>Function :</td>";
Pattern pattern = Pattern.compile("<td(.*?)>(.*?)</td>");
Matcher matcher = pattern.matcher(btwTags);
while (matcher.find()) {
String group = matcher.group();
System.out.println(group);
}
`
but the result is the same as the input.
Any ideas?
I also tried the regex (?<=<td>)(.*?)(?=</td>), but it only catches "Function :".
I don't know how to allow for attributes after the opening <td ...>.
Thanks in advance.
Don't use regex to parse HTML; it's a very bad idea.
To see why, check this link:
RegEx match open tags except XHTML self-contained tags
You can use Jsoup to achieve this:
String html; // your html code
Document doc = Jsoup.parse(html);
System.out.println(doc.select("td[colspan=2]").text());
You can use a regex for very basic HTML parsing. Here's the simplest Java regex I could find:
"(?i)<td[^>]+>([^<]+)<\\/td>"
It matches the first td tag with attributes and a value. "HELLO TOTO" is in group 1.
Here's an example.
For anything more complex, a parser like Jsoup would be better.
But even a parser could fail if the HTML isn't valid or if the structure for which you wrote the code has been changed.
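To make the difference between the whole match and the capture group concrete, here is a minimal sketch using the pattern quoted above (without the optional escaping of the slash). The class and method names are hypothetical. It also shows why the code in the question printed the input back: matcher.group() with no argument returns the entire match, tags included, while group(1) returns only the captured text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TdTextRegex {
    // [^>]+ requires at least one attribute character after "<td",
    // so a bare <td> is skipped.
    private static final Pattern TD_WITH_ATTRS =
            Pattern.compile("(?i)<td[^>]+>([^<]+)</td>");

    /** Returns the text of the first <td> that has attributes, or null. */
    static String firstAttributedCellText(String html) {
        Matcher m = TD_WITH_ATTRS.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td>\n"
                + "<td>Function :</td>";
        Matcher m = TD_WITH_ATTRS.matcher(html);
        while (m.find()) {
            System.out.println("group(0): " + m.group());  // whole match, tags included
            System.out.println("group(1): " + m.group(1)); // captured text only
        }
    }
}
```

With the sample input, only the first cell matches, because the second <td> has no attributes.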
I have provided a solution without using regex; I hope it is helpful.
public class Solution{
public static void main(String ...args){
String str = "<td colspan=\"2\" style=\"font-weight:bold;\">HELLO TOTO</td><td>Function :</td>";
String [] garray = str.split(">|</td>");
for(int i = 1;i < garray.length;i+=2){
System.out.println(garray[i]);
}
}
}
Output :: HELLO TOTO
Function :
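I am just using the split function to break the string at the given delimiters. Regex can be slow and is often confusing. (Note that String.split itself interprets its argument as a regular expression, which is why the alternation ">|</td>" works here.)
Cheers, happy coding.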

RSS Feed - Parse/Extract src image tag inside Description tag in JAVA

This extends the question
How to extract an image src from RSS feed
to Java. That question already has an answer for iOS, but there are not enough solutions for making it work in Java.
I know how to parse a top-level tag in an RSS feed, but parsing a tag nested inside another tag, as below, is more complicated:
<description>
<![CDATA[
<img width="745" height="410" src="http://example.com/image.png" class="attachment-large wp-post-image" alt="alt tag" style="margin-bottom: 15px;" />description text
]]>
</description>
How can I extract just the src attribute?
Take a look at jsoup. I think it's what you need.
EDIT:
private String extractImageUrl(String description) {
Document document = Jsoup.parse(description);
Elements imgs = document.select("img");
for (Element img : imgs) {
if (img.hasAttr("src")) {
return img.attr("src");
}
}
// no image URL
return "";
}
You could try using a regular expression to get the value.
Take a look at this little example; I hope it helps.
You can find more information about regular expressions here:
http://www.tutorialspoint.com/java/java_regular_expressions.htm
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test{
public static void main(String []args){
String regularExpression = "src=\"(.*)\" class";
String html = "<description> <![CDATA[ <img width=\"745\" height=\"410\" src=\"http://example.com/image.png\" class=\"attachment-large wp-post-image\" alt=\"alt tag\" style=\"margin-bottom: 15px;\" />description text ]]> </description>";
// Create a Pattern object
Pattern pattern = Pattern.compile(regularExpression);
// Now create matcher object.
Matcher matcher = pattern.matcher(html);
if (matcher.find( )) {
System.out.println("Found value: " + matcher.group(1) );
// Prints: Found value: http://example.com/image.png
}
}
}
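The regex above is applied to the raw feed text and depends on class following src. A possible alternative, sketched here under the assumption that the item is well-formed XML, is to let the JDK's built-in DOM parser unwrap the CDATA section first and then search only the description's HTML. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RssImageSrc {
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the first img src inside the first <description>, or "" if none. */
    static String imageSrcFromItem(String itemXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(itemXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        NodeList nodes = doc.getElementsByTagName("description");
        if (nodes.getLength() == 0) {
            return "";
        }
        // getTextContent() returns the CDATA payload as plain text.
        String description = nodes.item(0).getTextContent();
        Matcher m = IMG_SRC.matcher(description);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String xml = "<description><![CDATA[<img width=\"745\" height=\"410\" "
                + "src=\"http://example.com/image.png\" class=\"attachment-large wp-post-image\" "
                + "alt=\"alt tag\" />description text]]></description>";
        System.out.println(imageSrcFromItem(xml));
    }
}
```

The extracted description string could equally be handed to the jsoup-based extractImageUrl method shown earlier instead of the regex.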

extract element from xml string

<media:thumbnail url="http:// mysite.com/wp-content/uploads/2013/11/mes1-300x186.png" width="320" length="125399" type="image/jpg"/>
How can I extract the URL from this XML, given that I have the above as a string?
Supposing you have the XML string stored in String str, one lazy and simple way to retrieve the URL (if you always expect the same input text format) could be:
String url = str.split("\"")[1];
Sometimes, as a quick one-time solution, you can use regular expressions.
String xml="<media:thumbnail url=\"http:// mysite.com/wp-content/uploads/2013/11/ mes1-300x186.png\" width=\"320\" length=\"125399\" type=\"image/jpg\"/>";
Pattern pattern=Pattern.compile("url\\s*=\\s*\\\"(.*?)\\\"");
Matcher m=pattern.matcher(xml);
if(m.find()){
String urlValue=m.group(1);
System.out.println(urlValue);
}
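The two approaches above behave differently once the attribute order changes, which is the caveat behind "if you always expect the same input text format". The sketch below illustrates this; the class name, method names, and the example.com URLs are made up for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlAttribute {
    private static final Pattern URL_ATTR =
            Pattern.compile("\\burl\\s*=\\s*\"(.*?)\"");

    /** Positional approach: the text between the first pair of double quotes. */
    static String bySplit(String xml) {
        return xml.split("\"")[1];
    }

    /** Name-based approach: the value of the url attribute, or null if absent. */
    static String byRegex(String xml) {
        Matcher m = URL_ATTR.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String urlFirst = "<media:thumbnail url=\"http://example.com/thumb.png\" width=\"320\"/>";
        String widthFirst = "<media:thumbnail width=\"320\" url=\"http://example.com/thumb.png\"/>";

        System.out.println(bySplit(urlFirst));   // the URL, because url comes first
        System.out.println(bySplit(widthFirst)); // 320, because width comes first
        System.out.println(byRegex(widthFirst)); // the URL, regardless of position
    }
}
```

The split version is fine for a fixed, known format; the regex version tolerates reordered attributes but still assumes double-quoted values.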

How can I adjust this regex to filter out "

I got the following regex working to search for video links in a page
(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)
Unfortunately, it does not stop at the end of the link if another match follows right behind it. For example, this video link
<a href="http://somevideo.flv">somevideoname.avi</a>
would, after the regex is applied, return this:
http://somevideo.flv">somevideoname.avi
How can I adjust the regex to avoid this? I would like to learn more about regex; it's fascinating but so complex!
Here is how you can do something similar with JSoup parser.
Scanner scanner = new Scanner(new File("input.txt"));
scanner.useDelimiter("\\Z");
String htmlString = scanner.next();
scanner.close();
Document doc = Jsoup.parse(htmlString);
// or, to fetch the content of a page over the network, use
// Document doc = Jsoup.connect("http://example.com/").get();
Elements elements = doc.select("a[href]");//find all anchors with href attribute
for (Element el : elements) {
URL url = new URL(el.attr("href"));
if (url.getPath().matches(".*\\.(?:avi|flv|mp4)")) {
System.out.println("url: " + url);
//System.out.println("file: " + url.getPath());
System.out.println("file name: "
+ new File(url.getPath()).getName());
System.out.println("------");
}
}
I'm not sure I understand the groupings in your regexp. At any rate, this one should work:
\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b
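As a quick check, the sketch below runs the pattern from the question and the pattern just given against the sample link. The class and helper names are hypothetical. Excluding the double quote with [^\"] is what stops the match at the end of the href value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VideoLinkRegex {
    // Pattern from the question: [^/]+ and \S+ can both run past the closing quote.
    static final Pattern ORIGINAL =
            Pattern.compile("(http(s?):/)(/[^/]+)\\S+.\\.(?:avi|flv|mp4)");
    // Suggested pattern: never crosses a double quote.
    static final Pattern BOUNDED =
            Pattern.compile("\\bhttps?://[^\"]+?\\.(?:avi|flv|mp4)\\b");

    /** Returns the first match of the pattern in the input, or null. */
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a>";
        System.out.println(firstMatch(ORIGINAL, html)); // http://somevideo.flv">somevideoname.avi
        System.out.println(firstMatch(BOUNDED, html));  // http://somevideo.flv
    }
}
```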
If you only want to extract href attribute values then you're better off matching against the following pattern:
href=("|')(.*?)\.(avi|flv|mp4)\1
This should match "href" followed by either a double-quote or single-quote character, then capture everything up to (and including) the next character which matches the starting quote character. Then your href attribute can be extracted by
matcher.group(2) + "." + matcher.group(3)
to concatenate the file path and name with a period and then the file extension.
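A minimal sketch of that href-based pattern in use follows; the class and method names are hypothetical. One caveat I would assume applies: because (.*?) may cross quote characters, a non-video href followed later on the same line by a video href can produce an over-long match, so this is best treated as a quick extraction aid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefVideoRegex {
    // \1 refers back to whichever quote character opened the attribute value.
    private static final Pattern HREF_VIDEO =
            Pattern.compile("href=(\"|')(.*?)\\.(avi|flv|mp4)\\1");

    /** Collects every href value that ends in .avi, .flv, or .mp4. */
    static List<String> videoLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF_VIDEO.matcher(html);
        while (m.find()) {
            // group(2) is the path without extension, group(3) the extension.
            links.add(m.group(2) + "." + m.group(3));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://somevideo.flv\">somevideoname.avi</a> "
                + "<a href='/clips/b.mp4'>b</a> "
                + "<a href=\"/page.html\">p</a>";
        System.out.println(videoLinks(html)); // [http://somevideo.flv, /clips/b.mp4]
    }
}
```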
Your regex is greedy.
Limit its greediness by making the quantifiers reluctant (+? instead of +), including the one on \\S:
(http(s?):/)(/[^/]+?)\\S+?.\\.(?:avi|flv|mp4)

Java Matcher Class

I need a pattern matcher to get the page id value from the text below, which comes from an HTTP response body.
<meta name="ajs-page-id" content="262250">
What I'm after is the content value from this line, which will always be present in the response body.
Pattern pat = Pattern.compile("<meta\\sname=\"ajs-page-id\"\\scontent=\"(\\d+)\">");
That is obviously a very literal pattern... but group(1) should return the number as a string.
Haven't tested.
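Since the pattern above is marked as untested, here is a small runnable sketch of it. The class and method names are hypothetical, and I have relaxed the pattern slightly (\\s+ between attributes, no requirement on what follows the closing quote) as an assumption about what the response body may contain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PageIdRegex {
    private static final Pattern PAGE_ID = Pattern.compile(
            "<meta\\s+name=\"ajs-page-id\"\\s+content=\"(\\d+)\"");

    /** Returns the page id as a string, or null if the meta tag is missing. */
    static String pageId(String responseBody) {
        Matcher m = PAGE_ID.matcher(responseBody);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String body = "<html><head><meta name=\"ajs-page-id\" content=\"262250\"></head></html>";
        System.out.println(pageId(body)); // 262250
    }
}
```

The pattern still depends on name appearing before content, so it would need adjusting if the attribute order ever changed.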
Use an HTML parser like jsoup to parse and search for the part. You should not be using regular expressions for this.
e.g.,
String htmlStr = "<meta name=\"ajs-page-id\" content=\"262250\">";
Document doc = Jsoup.parse(htmlStr);
Element meta = doc.select("meta[name=ajs-page-id]").first();
if (meta != null)
{
System.out.println(meta.attr("content"));
}
