Is it possible to convert below String content to an arraylist using split, so that you get something like in point A?
<a class="postlink" href="http://test.site/i7xt1.htm">http://test.site/i7xt1.htm<br/>
</a>
<br/>Mirror:<br/>
<a class="postlink" href="http://information.com/qokp076wulpw">http://information.com/qokp076wulpw<br/>
</a>
<br/>Additional:<br/>
<a class="postlink" href="http://additional.com/qokdsfsdwulpw">http://additional.com/qokdsfsdwulpw<br/>
</a>
Point A (desired arraylist content):
http://test.site/i7xt1.htm
Mirror:
http://information.com/qokp076wulpw
Additional:
http://additional.com/qokdsfsdwulpw
I am currently using the code below, but it doesn't produce the desired output (for instance, "Mirror" is added multiple times).
Document doc = Jsoup.parse(string);
Elements links = doc.select("a[href]");
for (Element link : links) {
Node previousSibling = link.previousSibling();
while (!(previousSibling.nodeName().equals("u") || previousSibling.nodeName().equals("#text"))) {
previousSibling = previousSibling.previousSibling();
}
String identifier = previousSibling.toString();
if (identifier.contains("Mirror")) {
totalUrls.add("MIRROR(s):");
}
totalUrls.add(link.attr("href"));
}
Fix your links first. As cricket_007 mentioned, having proper HTML would make this a lot easier.
String html = yourHtml.replaceAll("<br/>\\s*</a>", "</a>"); // move the stray <br/> (and the line break after it) out of each link
String[] lines = html.split("<br/>");
for (String str : lines) {
String text = Jsoup.parse(str).text(); // visible text of this piece
... // you can go further here, e.g. check whether the piece holds a link or a label such as "Mirror:"
}
Now that the errant <br/> tags are out of the links, you can split the string on the <br/> tags that remain and collect the text of each piece. It's not pretty, but it should work.
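For a quick check of the split idea without any library, something like the following might do. It is only a sketch under the assumption that the markup looks like the sample in the question; BrSplitSketch and toLines are made-up names, and the regex tag stripping is a shortcut standing in for Jsoup.parse(str).text().

```java
import java.util.ArrayList;
import java.util.List;

public class BrSplitSketch {
    // Hypothetical helper: splits the markup on <br/> and keeps the visible text of each piece.
    public static List<String> toLines(String html) {
        // Drop the stray <br/> that sits just before each closing </a>.
        String cleaned = html.replaceAll("<br/>\\s*</a>", "</a>");
        List<String> result = new ArrayList<>();
        for (String piece : cleaned.split("<br/>")) {
            // Crude tag stripping; an HTML parser is the sturdier choice for real pages.
            String text = piece.replaceAll("<[^>]+>", "").trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a class=\"postlink\" href=\"http://test.site/i7xt1.htm\">http://test.site/i7xt1.htm<br/>\n</a>\n"
                + "<br/>Mirror:<br/>\n"
                + "<a class=\"postlink\" href=\"http://information.com/qokp076wulpw\">http://information.com/qokp076wulpw<br/>\n</a>";
        for (String line : toLines(html)) {
            System.out.println(line);
        }
    }
}
```

Each label and each link ends up as its own list entry, in document order, which is the shape asked for in point A.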
I have a list of images, and some of these images are used on the web.
I need to gather statistics about which images are used on the website, on which pages, etc.
How can I "match" my images?
Rules are:
I've only filename i.e. "mypic.png"
Here is a regex I want to build <img[anything]src=("or')[anything]mypic.png[anything]("or')[anything]>
Here is a dump of the HTML I have:
<figure class="gr_col gr_2of3">
<div class="mll mrm mbs md_pic_wrap1">
<a href="http://mydomain/nice-page" title="title test">
<img alt="alt text" class="mbm" src="http://mydomain/file-pic2/mypic.png" width="95" height="95">
</a>
</div>
</figure>
Thanks!
HTML and regex are terrible together in almost all cases. Use a tool that was meant to perform the job you need done e.g. JSoup.
Document document = Jsoup.parse(htmlStringOrFile);
for(Element img : document.select("img")) {
if(img.attr("src").contains("mypic.png")) {
System.out.println(img.attr("alt"));
}
}
This will print the value of the alt attribute of all img elements containing mypic.png in their src. Replace alt with name or id or whatever is the most appropriate for your case.
[As noted by Pshemo]
The selector can be any CSS selector, so you can cut the condition checking and even the loop itself by replacing it with img[src*=mypic.png] which essentially has the same semantics.
To match an image use:
(?i)<img.*?src=["'].*?(mypic\.png).*?["'].*?>
In capturing group 1 there is the name of the image that matches.
public String buildRegex(String... nameList) {
StringBuilder regex = new StringBuilder();
regex.append("(?i)<img.*?src=[\"'].*?(");
for (int i = 0; i < nameList.length - 1; i++) {
regex.append(nameList[i].replaceAll("\\.", "\\\\.")).append("|");
}
regex.append(nameList[nameList.length - 1].replaceAll("\\.", "\\\\."));
regex.append(").*?[\"'].*?>");
return regex.toString();
}
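A variant of the builder above, as a rough sketch: it uses Pattern.quote so that every regex metacharacter in a file name is escaped (not only the dot), and it keeps the match inside a single tag by using [^>] instead of .*?. The names ImgRegexSketch, buildPattern and findUsed are hypothetical, and the sketch assumes at least one file name is passed in. Like any regex over HTML, it can be fooled by unusual markup (for example an attribute such as data-src).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ImgRegexSketch {
    // Builds one case-insensitive pattern that matches an <img> whose src contains any of the names.
    public static Pattern buildPattern(String... names) {
        String alternatives = Arrays.stream(names)
                .map(Pattern::quote) // escape every metacharacter in the file name
                .collect(Collectors.joining("|"));
        return Pattern.compile(
                "<img[^>]*?src\\s*=\\s*[\"'][^\"']*?(" + alternatives + ")[^\"']*[\"']",
                Pattern.CASE_INSENSITIVE);
    }

    // Returns the matched file names in the order they appear in the HTML.
    public static List<String> findUsed(String html, String... names) {
        List<String> found = new ArrayList<>();
        Matcher matcher = buildPattern(names).matcher(html);
        while (matcher.find()) {
            found.add(matcher.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<img alt=\"alt text\" class=\"mbm\" src=\"http://mydomain/file-pic2/mypic.png\" width=\"95\" height=\"95\">";
        System.out.println(findUsed(html, "mypic.png", "other.jpg"));
    }
}
```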
I am trying to parse this URL: http://online.wsj.com/public/page/news-wall-street-heard.html?dsk=y
Document jDoc1 = Jsoup.connect(url1).get();
System.out.println(jDoc1.text());
But the output of the second line above contains all the tags inside the textarea plus the text of the other tags. The output looks like this:
..
..
<ul class="">
<li><a data-time="1dy" data-frequency="1mi" class="mdm_time">1 Day</a></li>
<li><a data-time="5dy" data-frequency="15mi" class="mdm_time">5 Days</a></li>
..
..
All the HTML inside the textarea is printed along with the text of the other tags. I either want to remove this tag from the Document or get it as an element so that I can remove it by hand.
I hope I have explained everything clearly. Please help me solve this.
EDIT :
As per suggestion, I did this :
System.out.println(jDoc1.select("textarea"));
And the output is:
<textarea id="wsj_autocomplete_template" style="display:none">
<div>
<div class="acHeadline hidden" >
</div>
<div class="dropdownContainerClass">
<div class="suggestionblock hidden" templateType="C1">
....
...
..
It is certainly selecting the textarea, but it is not able to parse the inner elements, possibly because they are escaped as &lt; instead of <. Is there any workaround for this?
If you want to remove the entire textarea tag, use doc.select("textarea").remove();. If you want to get the content of the textarea, use doc.select("textarea").text(). Note that I'm using the text() method here instead of toString() or html(). This gives the exact text rather than the HTML escape codes.
If you then want to manipulate this HTML, you can parse it again, like Document textareaDoc = Jsoup.parseBodyFragment(doc.select("textarea").text());
Example
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class WSJParser {
public static void main(String[] args) {
String url = "http://online.wsj.com/public/page/news-wall-street-heard.html?dsk=y";
try {
Document doc = Jsoup.connect(url).get();
//doc.select("textarea").remove(); // Removes the entire text area tag
Document textareaDoc = Jsoup.parseBodyFragment(doc.select("textarea").text());
System.out.println(textareaDoc);
} catch (IOException e) {
e.printStackTrace();
}
}
}
If I understand correctly, what you want is this
Elements textareas = Jsoup.connect(url1).get().select("textarea");
for (Element textarea : textareas) {
Elements elements = textarea.select("*");
for (Element element : elements) {
System.out.println(element.ownText());
}
}
Here is the html I'm trying to parse:
<div class="entry">
<img src="http://www.example.com/image.jpg" alt="Image Title">
<p>Here is some text</p>
<p>Here is some more text</p>
</div>
I want to get the text within the <p>'s into one ArrayList. I've tried using Jsoup for this.
Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");
for (Element desc : descs) {
String text = desc.getElementsByTag("p").first().text();
myArrayList.add(text);
}
But this doesn't work at all. I'm quite new to Jsoup but it seems it has its limitations. If I can get the text within <p> into one ArrayList using Jsoup, how can I accomplish that? If I must use some other means to parse the html, let me know.
I'm using a BufferedReader to read the html file one line at a time.
You could change your approach to the following:
Document doc = Jsoup.parse(line);
Elements pElems = doc.select("div.entry > p");
for (Element pElem : pElems) {
myArrayList.add(pElem.text()); // text() returns the paragraph text; data() only covers script/style content
}
Not sure why you are reading the html line by line. However if you want to read the whole html use the code below:
String line = "<div class=\"entry\">" +
"<img src=\"http://www.example.com/image.jpg\" alt=\"Image Title\">" +
"<p>Here is some text</p>" +
"<p>Here is some more text</p>" +
"</div>";
Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");
List<String> myArrayList = new ArrayList<String>();
for (Element desc : descs) {
Elements paragraphs = desc.getElementsByTag("p");
for (Element paragraph : paragraphs) {
myArrayList.add(paragraph.text());
}
}
In your for-loop:
Elements ps = desc.select("p");
(http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#select(java.lang.String))
Try this:
Document doc = Jsoup.parse(line);
String text = doc.select("p").first().text();
I have an HTML div formatted like this:
<div class="latest-media-images">
<div class="hdr-article">LATEST IMAGES</div>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg1" src="http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg2" src="http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg3" src="http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
</div>
I've been trying to think of different ways to do this.
I want to parse each URL into a separate string.
I was thinking of somehow parsing them into a list and then selecting each one by passing a position.
(If anyone wants to answer this, please feel free to.)
Or I could do something such as navigating to the div class...
Element latest_images = doc.select("div.latest-media-images");
Elements links = latest_images.getElementsByTag("img");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
}
I was thinking of this but haven't tried it out yet. I will when I get the chance.
But how would I parse each URL into a separate string, or all of them into a list, using this code (if it's correct)?
Feel free to leave suggestions or answers =) or let me know if the code I have above will do the trick.
Thanks,
coder-For-Life22
Here goes code sample to extract all img urls from your html using RegEx:
//I used your html with some obfuscations to test some fringe cases.
final String HTML
= "<div class=\"latest-media-images\">\n"
+ "<div class=\"hdr-article\">LATEST IMAGES</div>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg1\" \n "
+ "src=\"http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg2\" src= \n"
+ "\"http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg3\" src "
+ "= \t \n "
+ "\"http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "</div>";
Pattern pattern = Pattern.compile ("<img[^>]*?src\\s*?=\\s*?\\\"([^\\\"]*?)\\\"");
Matcher matcher = pattern.matcher (HTML);
List<String> imgUrls = new ArrayList<String> ();
while (matcher.find ())
{
imgUrls.add (matcher.group (1));
}
for (String imgUrl : imgUrls) System.out.println (imgUrl);
The output is the same as Sahil Muthoo posted:
http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg
If by "using a link to get the html first" you mean that you have a URL, then the only change is that instead of using a hard-coded String you'll need to load the HTML first. For example, you can use the URL class that ships with Java:
new URL ("http://some_address").openConnection ().getInputStream ();
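To turn that stream into a String that the Pattern can be run on, the bytes still have to be read and decoded. A minimal sketch, assuming the page is UTF-8 encoded (real pages may declare another charset); HtmlLoaderSketch, readAll and load are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HtmlLoaderSketch {
    // Reads the whole stream and decodes it as UTF-8 (an assumption about the page encoding).
    public static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens the address and returns the response body as a String.
    public static String load(String address) {
        try (InputStream in = new URL(address).openConnection().getInputStream()) {
            return readAll(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned String can then be passed to pattern.matcher(...) in place of the hard-coded HTML constant.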
Elements thumbs = doc.select("div.latest-media-images img.latestMediaThumb");
List<String> thumbLinks = new ArrayList<String>();
for(Element thumb : thumbs) {
thumbLinks.add(thumb.attr("src"));
}
for(String thumb : thumbLinks) {
System.out.println(thumb);
}
Output
http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg
Obviously you can parse the HTML into a DOM tree and extract all "img" nodes using XPath or a CSS selector, then iterate through them to fill a list of links.
Your code doesn't quite do the trick, though.
The loop is written to work with "a" nodes (it reads href), while the code before it extracts img nodes.
There's also another way: you can extract the required data using RegEx, which should have better performance and a lower memory cost.
<div style="height:240px;"><br>test: example<br>test1:example1</div>
Elements size = doc.select("div:contains(test:)");
How can I extract the values example and example1 from this HTML tag using Jsoup?
Since this HTML is not semantic enough for your final purpose (a <br> cannot have children, and : is not HTML), you can't do much with an HTML parser like Jsoup. An HTML parser isn't intended to do the job of specific text extraction/tokenizing.
The best you can do is get the HTML content of the <div> using Jsoup and then extract it further using the usual java.lang.String or maybe java.util.Scanner methods.
Here's a kickoff example:
String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
String[] parts = div.html().split("<br\\s*/?>"); // Jsoup prints <br> as <br /> or <br> depending on the version, so accept both.
for (String part : parts) {
int colon = part.indexOf(':');
if (colon > -1) {
System.out.println(part.substring(colon + 1).trim());
}
}
This results in
example
example1
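If the labels are needed as well as the values, the same split-and-indexOf idea can fill a map. This is a plain java.lang.String sketch that takes the inner HTML of the div as input (for instance what div.html() returns); LabelValueSketch and parse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LabelValueSketch {
    // Splits on any <br> variant and stores "label: value" pieces in document order.
    public static Map<String, String> parse(String innerHtml) {
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String part : innerHtml.split("(?i)<br\\s*/?>")) {
            int colon = part.indexOf(':');
            if (colon > -1) {
                String label = part.substring(0, colon).trim();
                String value = part.substring(colon + 1).trim();
                pairs.put(label, value);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(parse("<br>test: example<br>test1:example1"));
    }
}
```

A LinkedHashMap keeps the entries in the order they appear in the markup; a repeated label would overwrite the earlier value.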
If I were the HTML author, I would have used a definition list for this, e.g.
<dl id="mydl">
<dt>test:</dt><dd>example</dd>
<dt>test1:</dt><dd>example1</dd>
</dl>
This is more semantic and thus easier to parse:
String html = "<dl id=\"mydl\"><dt>test:</dt><dd>example</dd><dt>test1:</dt><dd>example1</dd></dl>";
Document document = Jsoup.parse(html);
Elements dds = document.select("#mydl dd"); // the <dd> elements hold the values
for (Element dd : dds) {
System.out.println(dd.text());
}