Match img src name in HTML - java

I have a list of images, and some of them are used on a website.
I need statistics on which images are used on the website, on which pages, etc.
How can I "match" my images?
The rules are:
I only have the file name, e.g. "mypic.png".
Here is the regex I want to build: <img[anything]src=("or')[anything]mypic.png[anything]("or')[anything]>
Here is a dump of the HTML I have:
<figure class="gr_col gr_2of3">
<div class="mll mrm mbs md_pic_wrap1">
<a href="http://mydomain/nice-page" title="title test">
<img alt="alt text" class="mbm" src="http://mydomain/file-pic2/mypic.png" width="95" height="95">
</a>
</div>
</figure>
Thanks!

HTML and regex are terrible together in almost all cases. Use a tool that was meant to perform the job you need done e.g. JSoup.
Document document = Jsoup.parse(htmlStringOrFile);
for (Element img : document.select("img")) {
    if (img.attr("src").contains("mypic.png")) {
        System.out.println(img.attr("alt"));
    }
}
This will print the value of the alt attribute of all img elements containing mypic.png in their src. Replace alt with name or id or whatever is the most appropriate for your case.
[As noted by Pshemo]
The selector can be any CSS selector, so you can cut the condition checking and even the loop itself by replacing it with img[src*=mypic.png] which essentially has the same semantics.

To match an image use:
(?i)<img.*?src=["'].*?(mypic\.png).*?["'].*?>
In capturing group 1 there is the name of the image that matches.
// Requires: import java.util.regex.Pattern;
public String buildRegex(String... nameList) {
    StringBuilder regex = new StringBuilder();
    regex.append("(?i)<img.*?src=[\"'].*?(");
    for (int i = 0; i < nameList.length; i++) {
        if (i > 0) {
            regex.append("|"); // alternation between file names
        }
        // Pattern.quote escapes every regex metacharacter, not only the dot
        regex.append(Pattern.quote(nameList[i]));
    }
    regex.append(").*?[\"'].*?>");
    return regex.toString();
}
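As a rough sketch of how such a pattern could be used to collect the statistics asked for, the class below returns which of the candidate file names occur in an img src. The class name ImageUsage and the method findUsedImages are made up for this illustration, and the pattern is a slightly tightened variant of the one above (it uses [^>] and [^"'] instead of .*? so a match stays inside one tag and one attribute value). It is still a regex over HTML, so treat it as an approximation rather than a parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageUsage {

    // Returns the file names (as written in the HTML) that match one of the
    // given candidates inside the src attribute of an <img> tag.
    static List<String> findUsedImages(String html, String... names) {
        List<String> quoted = new ArrayList<>();
        for (String name : names) {
            quoted.add(Pattern.quote(name)); // escape '.', '+', etc.
        }
        // [^>] keeps the match inside a single tag; [^"'] keeps it inside one attribute value.
        Pattern p = Pattern.compile(
                "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"'][^\"']*?(" + String.join("|", quoted) + ")[^\"']*[\"']",
                Pattern.CASE_INSENSITIVE);
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample markup taken from the question
        String html = "<a href=\"http://mydomain/nice-page\" title=\"title test\">"
                + "<img alt=\"alt text\" class=\"mbm\" src=\"http://mydomain/file-pic2/mypic.png\" width=\"95\" height=\"95\">"
                + "</a>";
        System.out.println(findUsedImages(html, "mypic.png", "other.gif"));
    }
}
```

Running findUsedImages once per page and counting the results per file name would give the per-page usage statistics.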

Related

Convert String to arraylist using split

Is it possible to convert below String content to an arraylist using split, so that you get something like in point A?
<a class="postlink" href="http://test.site/i7xt1.htm">http://test.site/i7xt1.htm<br/>
</a>
<br/>Mirror:<br/>
<a class="postlink" href="http://information.com/qokp076wulpw">http://information.com/qokp076wulpw<br/>
</a>
<br/>Additional:<br/>
<a class="postlink" href="http://additional.com/qokdsfsdwulpw">http://additional.com/qokdsfsdwulpw<br/>
</a>
Point A (desired arraylist content):
http://test.site/i7xt1.htm
Mirror:
http://information.com/qokp076wulpw
Additional:
http://additional.com/qokdsfsdwulpw
I am currently using the code below, but it doesn't produce the desired output (for instance, "Mirror" is added multiple times).
Document doc = Jsoup.parse(string);
Elements links = doc.select("a[href]");
for (Element link : links) {
    Node previousSibling = link.previousSibling();
    while (!(previousSibling.nodeName().equals("u") || previousSibling.nodeName().equals("#text"))) {
        previousSibling = previousSibling.previousSibling();
    }
    String identifier = previousSibling.toString();
    if (identifier.contains("Mirror")) {
        totalUrls.add("MIRROR(s):");
    }
    totalUrls.add(link.attr("href"));
}
Fix your links first. As cricket_007 mentioned, having proper HTML would make this a lot easier.
// get rid of bad HTML; \\s* allows for the line break between <br/> and </a>
String html = yourHtml.replaceAll("<br/>\\s*</a>", "</a>");
String[] lines = html.split("<br/>");
for (String str : lines) {
    String text = Jsoup.parse(str).text(); // plain text of this fragment
    ... // you can go further here, check if it has a link or not to display your semi-colon;
}
Now that the errant <br> tags are out of the links, you can split the string on the <br> tags that remain and print out your html result. It's not pretty, but it should work.
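The split-on-<br/> idea can be sketched end to end with the JDK alone. This is only an illustration under assumptions: the class name LinkListSplitter and the method toLines are invented here, and a crude tag-removing regex stands in for Jsoup.parse(str).text(), which is good enough for this very regular input but not for arbitrary HTML.

```java
import java.util.ArrayList;
import java.util.List;

public class LinkListSplitter {

    // Splits the markup on <br/> and returns the non-empty text of each piece.
    static List<String> toLines(String html) {
        // Move the stray <br/> out of each link, as the answer suggests.
        String cleaned = html.replaceAll("<br/>\\s*</a>", "</a>");
        List<String> result = new ArrayList<>();
        for (String piece : cleaned.split("<br/>")) {
            // Crude tag removal; stands in for Jsoup.parse(piece).text().
            String text = piece.replaceAll("<[^>]*>", "").trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // First two links of the input from the question
        String html = "<a class=\"postlink\" href=\"http://test.site/i7xt1.htm\">http://test.site/i7xt1.htm<br/>\n"
                + "</a>\n"
                + "<br/>Mirror:<br/>\n"
                + "<a class=\"postlink\" href=\"http://information.com/qokp076wulpw\">http://information.com/qokp076wulpw<br/>\n"
                + "</a>\n";
        for (String line : toLines(html)) {
            System.out.println(line);
        }
    }
}
```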

Extracting html tags between header tags using jsoup or regex

Hi, I have a scenario in HTML file parsing. I am parsing the HTML file using jsoup. After parsing, I want to extract the header tags (h1, h3, h4). I used doc.select(), but it returns only the header tag's value; my requirement is to extract the tags between an h1 and the next h3 or h4, and vice versa.
<h4>SECTION 2</h4>
<p>some thing h4.....</p>
<p>some thing h4.....</p>
<p>some thing h4.....</p>
<h3>lawsuit</h3>
<p>some thing h3.....</p>
<p>some thing h3.....</p>
<p>some thing h3.....</p>
<h1>header one </h1>
So here, first check whether the HTML string contains any h1, h3, or h4.
Here we have an h4, so starting from (and including) the h4 it should search for the next h1 or h3; everything up to the h3 is extracted and put in a separate HTML file.
First html file contains
<h4>SECTION 2</h4>
<p>some thing h4.....</p>
<p>some thing h4.....</p>
<p>some thing h4.....</p>
Second html file contains
<h3>lawsuit</h3>
<p>some thing h3.....</p>
<p>some thing h3.....</p>
<p>some thing h3.....</p>
Third html file contains
<h1>header one </h1>
....
....
....
The HTML string is dynamic, so I want to write a regular expression that achieves this. As I am new to Java, I don't know how to do it.
Right now I use substring, but I need a generic approach, either a regular expression or jsoup itself.
The code I tried is:
try {
    File sourceFile = new File("E://data1.html");
    org.jsoup.nodes.Document doc = Jsoup.parse(sourceFile, "UTF-8");
    org.jsoup.nodes.Element elements = doc.body();
    String elementString = StringUtils.substringBetween(elements.toString(),"<h4>", "<h3>");
    System.out.println("elementString::"+elementString);
    File destinationFile = new File("E://sample.html");
    BufferedWriter htmlWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(destinationFile), "UTF-8"));
    htmlWriter.write(elementString);
    htmlWriter.close();
    System.out.println("Completed!!!");
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Please help me to achieve this.
Please don't use regex to extract elements from an XML or HTML document. Regex has limitations, especially on large documents.
Use XPath instead to query the document. For example, have a look at this Stack Overflow question. You can use the pipe operator | to combine several expressions with OR.
Something similar to this should work:
//h1/following-sibling::p |
//h2/following-sibling::p |
//h3/following-sibling::p |
...
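A small sketch of how such a union expression could be evaluated with the JDK's built-in XPath support. It assumes the input is well-formed XML (XHTML) with lowercase element names, which real-world HTML often is not; the class name HeadingXPath, the helper select, and the shortened sample markup are invented for the example. The second expression is one possible way to keep only the paragraphs whose nearest preceding heading is the h4, since the plain union does not record which heading a paragraph belongs to.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HeadingXPath {

    // Evaluates an XPath expression against well-formed markup and returns the matching nodes.
    static NodeList select(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body>"
                + "<h4>SECTION 2</h4><p>first</p><p>second</p>"
                + "<h3>lawsuit</h3><p>third</p>"
                + "<h1>header one</h1>"
                + "</body>";

        // Union: every <p> that follows an h4 or an h3 sibling.
        NodeList all = select(xml, "//h4/following-sibling::p | //h3/following-sibling::p");
        System.out.println("paragraphs after any heading: " + all.getLength());

        // Only the <p> elements whose nearest preceding heading is the h4.
        NodeList h4Only = select(xml,
                "//h4/following-sibling::p[preceding-sibling::*[self::h1 or self::h3 or self::h4][1][self::h4]]");
        for (int i = 0; i < h4Only.getLength(); i++) {
            System.out.println(h4Only.item(i).getTextContent());
        }
    }
}
```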
You are probably looking for this. You can use this function after you select the desired element(s).
If you are using Jsoup, you don't have to (in fact don't need to) use regex in the case of dom operations.
Elements heads = body.select("h1");
// iterate and get the inner HTML of each element
for (Element head : heads) {
    String html = head.html();
}
-- edit --
Misunderstood the question;
You can determine the h tag's index and use getElementsByIndexGreaterThan. The rest will be the same.
-- edit 2 --
For your particular case; you can iterate after finding first h element:
Elements elements = doc.select("h1,h2,h3,h4,h5");
for (Element element : elements) {
    // start each section with the heading itself
    StringBuilder sb = new StringBuilder(element.toString());
    Element next = element.nextElementSibling();
    // collect siblings until the next heading (h1-h6); a plain startsWith("h") would also stop at tags like hr
    while (next != null && !next.tagName().matches("h[1-6]")) {
        sb.append(next.toString()).append("\n");
        next = next.nextElementSibling();
    }
    System.out.println(sb);
}
Should work for you.

what is wrong with the printing, it stop at someplace not loading to print

Here is my code to delete everything inside the <>.
public static void main (String [] args) throws FileNotFoundException{
    Scanner console = new Scanner(System.in);
    Scanner Theinput = GetUserInput (console);
    while (Theinput.hasNextLine()){
        String Input = Theinput.nextLine();
        Scanner text = new Scanner(Input);
        if (text.hasNext()){
            String MyNewText = Input;
            while(MyNewText.contains("<") || MyNewText.contains(">") ){
                int Max = MyNewText.indexOf ( ">" );
                int Min = MyNewText.indexOf ( "<" );
                String Replacement = "";
                String ToReplacement = MyNewText.substring (Min,Max+1);
                MyNewText = MyNewText.replaceAll(ToReplacement,Replacement);
            }
            System.out.println (MyNewText);
        }
        else {
            System.out.println();
        }
    }
}
I am basically trying to convert this text:
<HEAD>
<TITLE>Basic HTML Sample Page</TITLE>
</HEAD>
<BODY BGCOLOR="WHITE">
<CENTER>
<H1>A Simple Sample Web Page</H1>
<IMG SRC="http://sheldonbrown.com/images/scb_eagle_contact.jpeg">
<H4>By Sheldon Brown</H4>
<H2>Demonstrating a few HTML features</H2>
</CENTER>
HTML is really a very simple language. It consists of ordinary text, with
commands that are enclosed by "<" and ">" characters, or bewteen an "&" and a ";". <P>
You don't really need to know much HTML to create a page, because you can copy bits
of HTML from other pages that do what you want, then change the text!<P>
This page shows on the left as it appears in your browser, and the corresponding HTML
code appears on the right. The HTML commands are linked to explanations of what they do.
<H3>Line Breaks</H3>
HTML doesn't normally use line breaks for ordinary text. A white space of any size is
treated as a single space. This is because the author of the page has no way of knowing
the size of the reader's screen, or what size type they will have their browser set for.<P>
If you want to put a line break at a particular place, you can use the "<BR>" command,
or, for a paragraph break, the "<P>" command, which will insert a blank line.
The heading command ("<4></4>") puts a blank line above and below the heading text.
<H4>Starting and Stopping Commands</H4>
Most HTML commands come in pairs: for example, "<H4>" marks the beginning of a size 4
heading, and "</H4>" marks the end of it. The closing command is always the same as the
opening command, except for the addition of the "/".<P>
Modifiers are sometimes included along with the basic command, inside the opening
command's < >. The modifier does not need to be repeated in the closing command.
<H1>This is a size "1" heading</H1>
<H2>This is a size "2" heading</H2>
<H3>This is a size "3" heading</H3>
<H4>This is a size "4" heading</H4>
<H5>This is a size "5" heading</H5>
<H6>This is a size "6" heading</H6>
<center>
<H4>Copyright ?1997, by
Sheldon Brown
</H4>
If you would like to make a link or bookmark to this page, the URL is:
<BR> http://sheldonbrown.com/web_sample1.html</body>
Everything else works, but my output stops at the point shown below.
I have no idea what is happening with println():
This is a size "5" heading
This i
String out = "<TITLE>Basic HTML Sample Page</TITLE>".replaceAll("</?[a-zA-Z0-9]+?>", "");
System.out.println(out);
You can try a regular expression, but it can't handle something like
<a
href="http://google.com"
target="_blank"
>google</a>
Maybe you should consider using a parser, for example jsoup.
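If a regular expression is used anyway, one variation that does cope with a tag spread over several lines is to apply it to the whole document as a single string and match "anything except >" between the angle brackets. This is only a sketch: the class name TagStripper and the method stripTags are made up here, and the approach is still not a parser. A literal > inside an attribute value, or text such as "<" and ">" in the sample page above, would be treated as a tag.

```java
public class TagStripper {

    // Removes everything from each '<' up to the next '>', including tags that span several lines.
    static String stripTags(String html) {
        // [^>] also matches line breaks, so multi-line tags are covered.
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<a\nhref=\"http://google.com\"\ntarget=\"_blank\"\n>google</a>";
        System.out.println(stripTags(html)); // prints "google"
    }
}
```

Note that String.replaceAll interprets its first argument as a regular expression, so passing raw tag text to it (as the loop in the question does) can misbehave when the tag contains regex metacharacters; String.replace takes the text literally.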

Extract text from html: looking for a good sax-like parser or advices with a dom parser

I have an html document formatted this way:
<p>
some plain text <em>some emphatized text</em>, <strong> some strong text</strong>
</p>
<p>
just some plain text
</p>
<p>
<strong>strong text </p> followed by plain, <a>with a link at the end!</a>
</p>
I'd like to extract the text. With DOM-like parsers I could extract each paragraph, but the problem is what is inside: I'd have to extract the text from the inner tags too and end up with a string in the same order. In the example above, for the first paragraph, I want to extract:
some plain text some emphatized text, some strong text
For this purpose I guess a SAX-like parser would be better than a DOM one, given that I can't know the number or sequence of inner tags: a paragraph can have zero or more inner tags of different types.
You can use a DOM parser, get the content of each p tag (including child HTML elements) into a string variable, and use some other functionality to strip all the HTML tags out of the resulting string. This should leave you with all of the content between the p tags without any of the child element tags.
Example
<p>
some plain text <em>some emphatized text</em>, <strong> some strong text</strong>
</p>
<p>
just some plain text
</p>
<p>
<strong>strong text </p> followed by plain, <a>with a link at the end!</a>
</p>
Use some dom parser to extract the p tags to strings, you would then have a string like so:
String content = "some plain text <em>some emphatized text</em>, <strong> some strong text</strong>";
content = stripHtmlTags( content );
println( content ); // some plain text some emphatized text, some strong text
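A DOM parser can also return the text directly, in document order and with nested tags included, so no separate tag-stripping step is needed. The sketch below uses the JDK's XML parser and therefore assumes well-formed markup; the third paragraph in the question (with its misplaced </p>) is not well-formed and would need a lenient HTML parser instead. The class name ParagraphText and the wrapping div element are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ParagraphText {

    // Returns the text of every <p>, nested tags included, in document order.
    static List<String> paragraphTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList paragraphs = doc.getElementsByTagName("p");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                // getTextContent() concatenates all descendant text nodes in order.
                String text = paragraphs.item(i).getTextContent();
                texts.add(text.replaceAll("\\s+", " ").trim()); // collapse whitespace
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div>"
                + "<p>\nsome plain text <em>some emphatized text</em>, <strong> some strong text</strong>\n</p>"
                + "<p>\njust some plain text\n</p>"
                + "</div>";
        for (String text : paragraphTexts(xml)) {
            System.out.println(text);
        }
    }
}
```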
String extractedText = Html.fromHtml(yourHtmlString).toString();
This gives you the extracted text (Html here is Android's android.text.Html class).
Hope this helps.
When parsing with DOM, add a check so that CDATA sections are read as well:
childNode.getNodeType() == Node.CDATA_SECTION_NODE
If you are using XMLUtils, modify it like this:
public static String getNodeValue(Node node) {
    node.normalize();
    String response = node.getNodeValue();
    if (response != null) {
        return response;
    } else {
        NodeList list = node.getChildNodes();
        int size = list == null ? 0 : list.getLength();
        for (int j = 0; j < size; j++) {
            Node childNode = list.item(j);
            if (childNode.getNodeType() == Node.TEXT_NODE
                    || childNode.getNodeType() == Node.CDATA_SECTION_NODE) {
                response = childNode.getNodeValue();
                return response;
            }
        }
    }
    return "";
}

How to parse and return a list of links to separate strings[] or strings?

I have an HTML div formatted as follows:
<div class="latest-media-images">
<div class="hdr-article">LATEST IMAGES</div>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg1" src="http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg2" src="http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg3" src="http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
</div>
I've been trying to think of different ways to do this.
I want to parse each URL into a separate string.
I was thinking of somehow parsing them into a list and then selecting each one by passing a position.
(If anyone wants to answer this, please feel free to.)
Or I could do something such as navigating to the div class:
Element latest_images = doc.select("div.latest-media-images");
Elements links = latest_images.getElementsByTag("img");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
I was thinking of this but haven't tried it out yet. I will when I get the chance.
But how would I parse each one into a separate string, or into a whole list, using this code (if it's correct)?
Feel free to leave suggestions or answers =) or let me know if the code above will do the trick.
Thanks,
coder-For-Life22
Here is a code sample that extracts all img URLs from your HTML using RegEx:
// I used your HTML with some obfuscations to test some fringe cases.
final String HTML
= "<div class=\"latest-media-images\">\n"
+ "<div class=\"hdr-article\">LATEST IMAGES</div>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg1\" \n "
+ "src=\"http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg2\" src= \n"
+ "\"http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg3\" src "
+ "= \t \n "
+ "\"http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "</div>";
Pattern pattern = Pattern.compile("<img[^>]*?src\\s*?=\\s*?\\\"([^\\\"]*?)\\\"");
Matcher matcher = pattern.matcher(HTML);
List<String> imgUrls = new ArrayList<String>();
while (matcher.find()) {
    imgUrls.add(matcher.group(1));
}
for (String imgUrl : imgUrls) {
    System.out.println(imgUrl);
}
The output is the same as Sahil Muthoo posted:
http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg
If by "using a link to get the HTML first" you mean that you have a URL, then the only change is that instead of using a hard-coded String you'll need to load the HTML first. For example, you can use Java's built-in URL class:
new URL ("http://some_address").openConnection ().getInputStream ();
Elements thumbs = doc.select("div.latest-media-images img.latestMediaThumb");
List<String> thumbLinks = new ArrayList<String>();
for (Element thumb : thumbs) {
    thumbLinks.add(thumb.attr("src"));
}
for (String thumb : thumbLinks) {
    System.out.println(thumb);
}
Output
http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg
Obviously you can parse the HTML into a DOM tree and extract all "img" nodes using XPath or a CSS selector, and then iterate through them to fill an array of links.
Your code doesn't quite do the trick, though.
The loop is written to work with "a" nodes, while the code before it extracts img nodes.
There's also another way: you can extract the required data using RegEx, which may perform better and use less memory than building a DOM tree.
