Here is the html I'm trying to parse:
<div class="entry">
<img src="http://www.example.com/image.jpg" alt="Image Title">
<p>Here is some text</p>
<p>Here is some more text</p>
</div>
I want to collect the text of every <p> element into one ArrayList. I've tried using Jsoup for this:
Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");
for (Element desc : descs) {
String text = desc.getElementsByTag("p").first().text();
myArrayList.add(text);
}
But this doesn't work at all. I'm quite new to Jsoup, but it seems to have its limitations. If it is possible to get the text of every <p> into one ArrayList using Jsoup, how can I do that? If I must use some other means to parse the HTML, let me know.
I'm using a BufferedReader to read the HTML file one line at a time.
You could change your approach to the following:
Document doc = Jsoup.parse(line);
Elements pElems = doc.select("div.entry > p");
for (Element pElem : pElems) {
myArrayList.add(pElem.text()); // text() returns the visible text; data() is meant for script/style content
}
I'm not sure why you are reading the HTML line by line: a <div> that spans several lines cannot be matched when each line is parsed on its own. If you parse the whole HTML at once, the code below works:
String line = "<div class=\"entry\">" +
"<img src=\"http://www.example.com/image.jpg\" alt=\"Image Title\">" +
"<p>Here is some text</p>" +
"<p>Here is some more text</p>" +
"</div>";
Document doc = Jsoup.parse(line);
Elements descs = doc.getElementsByClass("entry");
List<String> myArrayList = new ArrayList<String>();
for (Element desc : descs) {
Elements paragraphs = desc.getElementsByTag("p");
for (Element paragraph : paragraphs) {
myArrayList.add(paragraph.text());
}
}
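If the HTML lives in a file, one option is to let Jsoup read the whole file instead of feeding it one line at a time. This is only a minimal sketch: the class name EntryParagraphs and the method paragraphTexts are made up for illustration, and it assumes the file is UTF-8 encoded and a Jsoup version recent enough to have Elements.eachText().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class EntryParagraphs {

    // Parses the whole file in one go and returns the text of every <p>
    // that is a direct child of <div class="entry">.
    // Assumption: the file is UTF-8 encoded.
    static List<String> paragraphTexts(File htmlFile) throws IOException {
        Document doc = Jsoup.parse(htmlFile, "UTF-8");
        return doc.select("div.entry > p").eachText();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the HTML file (hypothetical usage).
        for (String text : paragraphTexts(new File(args[0]))) {
            System.out.println(text);
        }
    }
}
```

Because the document is parsed as a whole, it no longer matters whether the <div> and its paragraphs sit on one line or on several.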
In your for-loop:
Elements ps = desc.select("p");
(http://jsoup.org/apidocs/org/jsoup/nodes/Element.html#select(java.lang.String))
Try this:
Document doc = Jsoup.parse(line);
String text = doc.select("p").first().text();
I have been trying for the last 3 days to extract certain information with jsoup in Java. This is my code:
Document document = Jsoup.connect(urlofpage).get();
Elements links = document.select(".contentBox");
for (Element link : links) {
// String name = link.text();
String title = link.select("h2").text();
String content = link.select("p").text();
System.out.println(title);
System.out.println(content);
}
It fetches the data as directed, returning the h2 and p content separately. The problem is that I want the text of the <p> tags that come right after each <h2> tag, grouped per heading.
For example (HTML content):
<h2>main content</h2>
<div class="acx"><div>
<p>content</p>
<p>content 2</p>
<h2>content 2</h2>
<div class="acx"><div>
<p>new content od 2</p>
<p>new 2</p>
It should then produce an array like this:
array[0] = "content content 2",
array[1] = "new content od 2 new 2",
Any solutions?
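You can experiment with the "~" sibling selector, which matches every <p> that follows an <h2> under the same parent. For example, assuming the tags in your sample are all siblings:
link.select("h2 ~ p").get(0).text(); // returns "content"
link.select("h2 ~ p").get(1).text(); // returns "content 2"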
Just use your initial approach and iterate over all the tags you need within each selected .contentBox element:
Document document = Jsoup.connect(urlofpage).get();
Elements links = document.select(".contentBox");
for (Element link : links) {
for (Element h2Tag : link.select("h2"))
{
System.out.println(h2Tag.text());
}
for (Element pTag : link.select("p"))
{
System.out.println(pTag.text());
}
}
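To get one string per heading, as the question asks, one possible approach is to start at each <h2> and walk its following siblings until the next <h2>. This is a sketch, not a tested answer for the real page: it assumes the <h2>, <div class="acx"> and <p> tags are siblings inside .contentBox (with the divs properly closed), and the names ParagraphsPerHeading and groupByHeading are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ParagraphsPerHeading {

    // For every <h2> in the container, joins the text of the <p> siblings
    // that follow it, stopping at the next <h2>.
    static List<String> groupByHeading(Element container) {
        List<String> groups = new ArrayList<>();
        for (Element h2 : container.select("h2")) {
            StringBuilder joined = new StringBuilder();
            for (Element sibling = h2.nextElementSibling();
                 sibling != null && !sibling.tagName().equals("h2");
                 sibling = sibling.nextElementSibling()) {
                if (sibling.tagName().equals("p")) {
                    if (joined.length() > 0) {
                        joined.append(' ');
                    }
                    joined.append(sibling.text());
                }
            }
            groups.add(joined.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample markup from the question, with the divs assumed to be closed.
        String html = "<div class=\"contentBox\">"
                + "<h2>main content</h2><div class=\"acx\"></div>"
                + "<p>content</p><p>content 2</p>"
                + "<h2>content 2</h2><div class=\"acx\"></div>"
                + "<p>new content od 2</p><p>new 2</p>"
                + "</div>";
        Document document = Jsoup.parse(html);
        for (Element box : document.select(".contentBox")) {
            System.out.println(groupByHeading(box));
        }
    }
}
```

If the paragraphs are actually nested inside the acx divs on the real page, the sibling walk would have to descend into those divs instead.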
I want to find the important links in a site using the Jsoup library. Suppose we have the following code:
<h1>This is important </h1>
Now, while parsing, how can we find out that an a tag is inside the h1 tag?
You can do it this way:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements headlinesCat1 = doc.getElementsByTag("h1");
for (Element headline : headlinesCat1) {
Elements importantLinks = headline.getElementsByTag("a");
for (Element link : importantLinks) {
String linkHref = link.attr("href");
String linkText = link.text();
System.out.println(linkHref);
}
}
Taken from the JSoup Cookbook.
Use selector:
Elements elements = doc.select("h1 > a");
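Note that "h1 > a" only matches links that are direct children of the <h1>, while "h1 a" also matches links nested deeper inside it. The sketch below contrasts the two; the sample markup and the names HeadingLinks, directLinks and allLinks are made up for illustration, and Elements.eachAttr() assumes a reasonably recent Jsoup version.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

public class HeadingLinks {

    // href values of links that are direct children of an <h1>.
    static List<String> directLinks(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("h1 > a").eachAttr("href");
    }

    // href values of links anywhere inside an <h1>, at any depth.
    static List<String> allLinks(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("h1 a").eachAttr("href");
    }

    public static void main(String[] args) {
        // Hypothetical sample: one direct link, one nested link, one outside the heading.
        String html = "<h1><a href=\"/direct\">Direct</a> "
                + "<span><a href=\"/nested\">Nested</a></span></h1>"
                + "<p><a href=\"/other\">Other</a></p>";
        System.out.println(directLinks(html)); // [/direct]
        System.out.println(allLinks(html));    // [/direct, /nested]
    }
}
```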
I have an HTML div formatted as follows:
<div class="latest-media-images">
<div class="hdr-article">LATEST IMAGES</div>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg1" src="http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg2" src="http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
<a class="lnk-thumb" href="http://media.pc.ign.com/media/093/093395/imgs_1.html"><img id="thumbImg3" src="http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg" class="latestMediaThumb" alt="" height="109" width="145"></a>
</div>
I've been trying to think of different ways to do this.
I want to parse each URL into a separate string.
I was thinking of somehow parsing them into a list and then selecting each one by its position.
(If anyone wants to answer this, please feel free to.)
Or I could do something such as navigating to the div class:
Element latest_images = doc.select("div.latest-media-images");
Elements links = latest_images.getElementsByTag("img");
for (Element link : links) {
String linkHref = link.attr("href");
String linkText = link.text();
}
I was thinking of this but haven't tried it out yet. I will when I get the chance.
But how would I parse each URL into a separate string, or all of them into a list, using this code (if it's correct)?
Feel free to leave suggestions or answers =) or let me know if the code I have above will do the trick.
Thanks,
coder-For-Life22
Here is a code sample that extracts all img URLs from your HTML using a regex:
// I used your HTML with some extra whitespace and line breaks around "src" to test some fringe cases.
final String HTML
= "<div class=\"latest-media-images\">\n"
+ "<div class=\"hdr-article\">LATEST IMAGES</div>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg1\" \n "
+ "src=\"http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg2\" src= \n"
+ "\"http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "<a class=\"lnk-thumb\" href=\"http://media.pc.ign.com/media/093/093395/imgs_1.html\"><img id=\"thumbImg3\" src "
+ "= \t \n "
+ "\"http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg\" class=\"latestMediaThumb\" alt=\"\" height=\"109\" width=\"145\"></a>\n"
+ "</div>";
Pattern pattern = Pattern.compile ("<img[^>]*?src\\s*?=\\s*?\\\"([^\\\"]*?)\\\"");
Matcher matcher = pattern.matcher (HTML);
List<String> imgUrls = new ArrayList<String> ();
while (matcher.find ())
{
imgUrls.add (matcher.group (1));
}
for (String imgUrl : imgUrls) System.out.println (imgUrl);
The output is the same as Sahil Muthoo posted:
http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg
If by "using a link to get the html first" you mean that you have a URL, then the only change is that instead of using a hard-coded String you need to load the HTML first. For example, you can use Java's built-in URL class:
new URL ("http://some_address").openConnection ().getInputStream ();
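As a rough sketch of that loading step, the stream can be read fully into a String, which then takes the place of the hard-coded HTML constant. The names HtmlLoader, readAll and fetch are invented for this example, and it assumes Java 9+ (for readAllBytes) and a page encoded as UTF-8; a real page may declare a different charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HtmlLoader {

    // Reads the whole stream into a String and closes it.
    // Assumption: the content is UTF-8 encoded.
    static String readAll(InputStream in) throws IOException {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Downloads the page at the given address (placeholder address supplied by the caller).
    static String fetch(String address) throws IOException {
        return readAll(URI.create(address).toURL().openConnection().getInputStream());
    }

    public static void main(String[] args) throws IOException {
        // Demonstrates readAll without touching the network.
        byte[] sample = "<img src=\"a.jpg\">".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(sample)));
    }
}
```

The String returned by fetch can then be passed to pattern.matcher(...) exactly like the hard-coded one.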
Elements thumbs = doc.select("div.latest-media-images img.latestMediaThumb");
List<String> thumbLinks = new ArrayList<String>();
for(Element thumb : thumbs) {
thumbLinks.add(thumb.attr("src"));
}
for(String thumb : thumbLinks) {
System.out.println(thumb);
}
Output
http://media.ignimgs.com/media/thumb/351/3513804/the-elder-scrolls-v-skyrim-20110824023151748_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513803/the-elder-scrolls-v-skyrim-20110824023149685_thumb_ign.jpg
http://media.ignimgs.com/media/thumb/351/3513802/the-elder-scrolls-v-skyrim-20110824023147685_thumb_ign.jpg
Obviously you can parse the HTML into a DOM tree and extract all "img" nodes using XPath or a CSS selector, then iterate through them to fill a list of links.
Your code doesn't quite do the trick, though.
The loop is written to work with "a" nodes (it reads href), while the code before it extracts img nodes.
There's also another way: you can extract the required data using a regex, which should have better performance and lower memory cost.
I have an HTML file which contains a specific tag, e.g. <TABLE cellspacing=0>, and the end tag is </TABLE>. I want to get everything between those tags. I am using the Jericho HTML parser in Java to parse the HTML. Is it possible to get the text and other tags between specific tags with Jericho?
For example:
<TABLE cellspacing=0>
<tr><td>HELLO</td>
<td>How are you</td></tr>
</TABLE>
Answer:
<tr><td>HELLO</td>
<td>How are you</td></tr>
Once you have found the Element of your table, all you have to do is call getContent().toString(). Here's a quick example using your sample HTML:
Source source = new Source("<TABLE cellspacing=0>\n" +
" <tr><td>HELLO</td> \n" +
" <td>How are you</td></tr>\n" +
"</TABLE>");
Element table = source.getFirstElement();
String tableContent = table.getContent().toString();
System.out.println(tableContent);
Output:
<tr><td>HELLO</td>
<td>How are you</td></tr>
Aby, this code walks through all the <td> elements and prints each one to the screen. Maybe it helps you.
List<Element> elementListTd = source.getAllElements(HTMLElementName.TD);
// Loop over every <td> element on the page.
for (Element element : elementListTd) {
if (element.getAttributes() != null) {
// String form of this element together with the elements nested in it.
String td = element.getAllElements().toString();
System.out.println("TD: " + td);
System.out.println(element.getContent());
// Plain text of the cell, with the markup stripped.
String conteudoAtributo = element.getTextExtractor().toString();
System.out.println(conteudoAtributo);
// palavraCompara (the word to look for) and tabela (the result list) are assumed to be declared elsewhere.
if (td.contains(palavraCompara)) {
tabela.add(conteudoAtributo);
}
}
}
<div style="height:240px;"><br>test: example<br>test1:example1</div>
Elements size = doc.select("div:contains(test:)");
How can I extract the values example and example1 from this HTML using jsoup?
Since this HTML is not semantic enough for your final purpose (a <br> cannot have children and : is not HTML), you can't do much with an HTML parser like Jsoup. An HTML parser isn't intended to do the job of specific text extraction or tokenizing.
The best you can do is get the HTML content of the <div> using Jsoup and then process it further using the usual java.lang.String or maybe java.util.Scanner methods.
Here's a kickoff example:
String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();
String[] parts = div.html().split("<br\\s*/?>"); // Matches both <br> and <br />; which one Jsoup emits depends on its version.
for (String part : parts) {
int colon = part.indexOf(':');
if (colon > -1) {
System.out.println(part.substring(colon + 1).trim());
}
}
This results in
example
example1
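The java.util.Scanner route mentioned above could look roughly like this. It uses only the JDK and takes the inner HTML of the <div> as a plain String (for instance the result of div.html()); the names BrValueScanner and values are made up for the example, and the delimiter regex is written to accept both <br> and <br />.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class BrValueScanner {

    // Splits the inner HTML of the <div> on <br> tags and returns
    // whatever follows the first colon in each piece.
    static List<String> values(String divHtml) {
        List<String> result = new ArrayList<>();
        try (Scanner scanner = new Scanner(divHtml)) {
            // Treat both "<br>" and "<br />" as token delimiters.
            scanner.useDelimiter("<br\\s*/?>");
            while (scanner.hasNext()) {
                String part = scanner.next();
                int colon = part.indexOf(':');
                if (colon > -1) {
                    result.add(part.substring(colon + 1).trim());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(values("<br>test: example<br>test1:example1"));
    }
}
```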
If I was the HTML author, I would have used a definition list for this. E.g.
<dl id="mydl">
<dt>test:</dt><dd>example</dd>
<dt>test1:</dt><dd>example1</dd>
</dl>
This is more semantic and thus easier to parse:
String html = "<dl id=\"mydl\"><dt>test:</dt><dd>example</dd><dt>test1:</dt><dd>example1</dd></dl>";
Document document = Jsoup.parse(html);
Elements dds = document.select("#mydl dd"); // the <dd> elements hold the values
for (Element dd : dds) {
System.out.println(dd.text());
}
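With a definition list it also becomes straightforward to keep each label together with its value. The following is a sketch under the assumption that every <dt> is immediately followed by its <dd>; the names DefinitionListMap and toMap are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class DefinitionListMap {

    // Maps the text of each <dt> to the text of the <dd> that directly follows it.
    // dlSelector identifies the <dl>, e.g. "#mydl".
    static Map<String, String> toMap(String html, String dlSelector) {
        Document document = Jsoup.parse(html);
        Map<String, String> pairs = new LinkedHashMap<>();
        for (Element dt : document.select(dlSelector + " > dt")) {
            Element dd = dt.nextElementSibling();
            if (dd != null && dd.tagName().equals("dd")) {
                pairs.put(dt.text(), dd.text());
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String html = "<dl id=\"mydl\"><dt>test:</dt><dd>example</dd>"
                + "<dt>test1:</dt><dd>example1</dd></dl>";
        System.out.println(toMap(html, "#mydl"));
    }
}
```

A LinkedHashMap is used here so the pairs come back in document order.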