Iterate over HTML tags - java

I am developing an RSS application: the app downloads the content provided by an RSS feed and shows it to the user.
The post's content has tags like p, img and h2. I want to iterate over them in order and create TextViews and ImageViews depending on the tag.
For example, I want to show this HTML code:
<body>
<h2>Some text</h2>
<img src="image1.jpg">
<p>A lot of text</p>
</body>
as
<TextView />
<ImageView />
<TextView />
I think Jsoup is an option, but I am not sure how to use it or if Android includes a native solution.
I also want to incorporate lazy loading for images. I've found the Ion library, but maybe there are simpler solutions for my use case.
EDIT:
As @Vogabe suggested, I am iterating over the tags using Jsoup. This is the code; maybe someone will find it useful:
Document document = Jsoup.parse(htmlContent);
Elements elements = document.getAllElements();
for (Element element:elements) {
Tag tag = element.tag();
if (tag.getName().equalsIgnoreCase("p")) {
// ...
}
}

JSoup is a good solution for parsing HTML pages and retrieving data from them. The select() method accepts a CSS selector and returns the HTML elements that match it.
These 2 links should get you started:
http://jsoup.org/cookbook/extracting-data/selector-syntax
http://jsoup.org/cookbook/extracting-data/dom-navigation
There are other parsers out there, but I do not have experience with them.
JSoup is widely adopted and very easy to use.
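As a minimal sketch of the selector approach applied to the question's example, the code below selects h2, p and img elements with one grouped selector and walks them in document order. It assumes the jsoup library is on the classpath. The class name FeedContentWalker, the method describeViews and the "TextView(...)"/"ImageView(...)" strings are made up for illustration; in an Android app you would create real views at those points instead.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FeedContentWalker {

    /**
     * Returns one descriptor per h2, p or img element, in the order
     * the elements appear in the document.
     * (Hypothetical helper; the descriptor strings stand in for real views.)
     */
    public static List<String> describeViews(String html) {
        Document document = Jsoup.parse(html);
        List<String> views = new ArrayList<>();
        // A grouped selector matches any of the listed tags.
        for (Element element : document.select("h2, p, img")) {
            if (element.tagName().equals("img")) {
                // Images carry their location in the src attribute.
                views.add("ImageView(" + element.attr("src") + ")");
            } else {
                // h2 and p are both rendered as text.
                views.add("TextView(" + element.text() + ")");
            }
        }
        return views;
    }

    public static void main(String[] args) {
        String html = "<body><h2>Some text</h2>"
                + "<img src=\"image1.jpg\"><p>A lot of text</p></body>";
        for (String view : describeViews(html)) {
            System.out.println(view);
        }
    }
}
```

Compared with getAllElements(), the grouped selector skips html, head and body, so the loop only sees the tags it knows how to render.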

Related

Parsing modern web pages (javascript/html5/json) using java

I used to have a tool that parsed the Yahoo Finance web page using Jsoup.
Recently Yahoo changed the layout of their pages, and now the page is full of JavaScript and what looks like JSON data.
Please see an example here:
http://finance.yahoo.com/quote/AAPL/financials?ltr=1
Inspecting the page in Chrome shows a different view (after the JavaScript has executed and the DOM has been built) than what the document looks like in Jsoup:
Document d = Jsoup.connect(link).get();// link same as above
Element body = d.body();
returns an Element (body) that contains a huge document that looks like:
<div class="footer Py(10px) Ta(c) Bgc(#fff) Py(0) BdT Bdc($lightGray)" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer">
<div class="Fz(s) Py(20px) " data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0">
<div class="Pb(10px) D(b)" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0.0">
<a class="Mend(10px)" href="http://help.yahoo.com/kb/index?page=content&y=PROD_FIN&locale=en-US&id=SLN2310&pir=Zm7qO7BibUkC.4dK5GxJ95B3DCru2iA5odBNM0pj" data-reactid=".1vh5ojua4n4.1.$0.0.0.3.1.$main-0-Quote-Proxy.$main-0-Quote.0.2.1.3.0.$footer.0.0.0">
Any idea how I can parse this type of document in Java? I suspect I need to run it through a JavaScript engine first and then parse the result, or maybe there is another way.

Extract image id with Jsoup

I am trying to extract a specific captcha image id using the Jsoup API. The HTML image tag looks like this:
<img id="wlspispHIPBimg03256465465dsd5456" style="display: inline; width: 200px; height: 100px;" aria-hidden="true" src="https://users/hip/data/rnd=435cb60d0a6b63ef4">
This is my code to obtain the attribute id="wlspispHIPBimg03256465465dsd5456":
doc = Jsoup.connect("http://go.microsoft.com/fwlink/?LinkID=614866&clcid")
.timeout(0).get();
Elements images = doc.select("img[src~=(?i)]");
for (Element image : images) {
System.out.println(image.attr("id"));
}
The problem is that I can't get the id of the captcha image.
You need to find something in the HTML that distinguishes this img tag from every other tag in the document. That can't be deduced from the code you posted, so I am guessing here:
Element imageEl = doc.select("img[src*=rnd]").first();
This exploits the fact that the image source contains "rnd" in its path. To get the best solution you must inspect the page yourself. It also helps a lot to learn Jsoup's CSS selectors.
I think you simply can't accomplish this using only Jsoup: the DOM is modified at runtime with JavaScript, and Jsoup does not execute it.
See also this other question.

Using Jsoup to find elements troubles

I don't know much about HTML parsers (I'm currently using Jsoup). I have tried many times and cannot get it to work because of my poor understanding, so please bear that in mind.
Anyway I am trying to grab certain parts of an HTML document. This is what I want to be extracted:
<div class ="detNane" >
<a class="detLink" title="Details for Hock part3">Hock part3</a></div>
Obviously the HTML document has multiple [div class="detName"] elements, and I want to extract all the text in each detName div. I would greatly appreciate any help.
Thank you for your time.
You can use a selector for this:
// Parse your document here (html is assumed to hold the page source),
// or connect to a website with Jsoup.connect(url).get()
Document doc = Jsoup.parse(html);
for (Element element : doc.select("div.detNane")) {
    System.out.println(element.text()); // Print the text of that element
}

How do I get this text using Jsoup?

How do I get "this text" from the following HTML code using Jsoup?
<h2 class="link title"><a href="myhref.html">this text<img width=10
height=10 src="img.jpg" /><span class="blah">
<span>Other texts</span><span class="sometime">00:00</span></span>
</a></h2>
When I try
String s = document.select("h2.title").select("a[href]").first().text();
it returns
this textOther texts00:00
I tried to read the API docs for Selector in Jsoup but could not figure out much.
Also, how do I get an element with class="link title blah" (multiple classes)? Forgive me, I only know a little about Jsoup and CSS.
Use Element#ownText() instead of Element#text().
String s = document.select("h2.link.title a[href]").first().ownText();
Note that you can select elements with multiple classes by concatenating the class selectors, as in h2.link.title, which selects <h2> elements that have at least both the link and title classes.
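To make the difference concrete, here is a small sketch that runs both calls against the markup from the question. It assumes the jsoup library is on the classpath. The class name OwnTextDemo and its helper methods are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {

    // The markup from the question, joined onto one logical line.
    public static final String HTML =
            "<h2 class=\"link title\"><a href=\"myhref.html\">this text"
            + "<img width=10 height=10 src=\"img.jpg\" /><span class=\"blah\">"
            + "<span>Other texts</span><span class=\"sometime\">00:00</span></span>"
            + "</a></h2>";

    /** Finds the link inside an h2 that has both the link and title classes. */
    private static Element findLink(String html) {
        Document document = Jsoup.parse(html);
        return document.select("h2.link.title a[href]").first();
    }

    /** Text owned directly by the link, excluding text of child elements. */
    public static String linkOwnText(String html) {
        Element link = findLink(html);
        return link == null ? "" : link.ownText();
    }

    /** Combined text of the link and all of its descendants. */
    public static String linkFullText(String html) {
        Element link = findLink(html);
        return link == null ? "" : link.text();
    }

    public static void main(String[] args) {
        System.out.println("ownText(): " + linkOwnText(HTML));
        System.out.println("text():    " + linkFullText(HTML));
    }
}
```

The null check matters because first() returns null when the selector matches nothing.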

How do I use ColdFusion to replace text in HTML without replacing HTML tags?

I have an HTML source as a String variable,
and a word in another variable that should be highlighted in that HTML source.
I need a regular expression that does not highlight tags, but only text within the tags.
For example, I have an HTML source like
<cfset html = "<span>Text goes here, forr example it container also **span** </span>" />
<cfset wordToReplace = "span" />
<cfset html = ReReplace(html ,"[^(<#wordToReplace#\b[^>]*>)]","replaced","ALL")>
and what I want to get is
<span>Text goes here, forr example it container also **replaced** </span>
But I get an error. Any tips?
I need a regular expression that does not highlight tags, but only text within the tags.
You won't find one. Not one that is fully reliable against all legal/wild HTML.
The simple reason is that Regular Expressions match Regular languages, and HTML is not even remotely a Regular language.
Even if you're very careful, you run the risk of replacing stuff you didn't want to, and not replacing stuff you did want to, simply due to how complicated HTML syntax can be.
The correct way to parse HTML is using a purpose-built HTML DOM parser.
Annoyingly CF doesn't have one built in, though if your HTML is XHTML, then you can use XmlParse and XmlSearch to do an XPath search for only text (not tags) that matches your text... something like //*[contains(text(), 'span')] should do (more details here).
If you've not got XHTML then you'll need to look at using an HTML DOM parser for Java. Google turns up plenty (I've not tried any yet, so I can't give any specific recommendations).
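For illustration, here is a minimal sketch of the same parser-based idea in Java rather than ColdFusion, using only the XML APIs that ship with the JDK. It assumes the input is well-formed XHTML. It selects every text node with the XPath //text() and does the replacement in Java code, a variation on the expression above. The class name XhtmlTextReplacer and the method replaceInText are made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XhtmlTextReplacer {

    /**
     * Replaces a word in text nodes only; element names and attributes
     * are never touched because they are not text nodes.
     * Assumes the input is well-formed XML (for example XHTML).
     */
    public static String replaceInText(String xhtml, String word, String replacement) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Select every text node in the document.
            NodeList textNodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//text()", doc, XPathConstants.NODESET);

            for (int i = 0; i < textNodes.getLength(); i++) {
                Node text = textNodes.item(i);
                text.setNodeValue(text.getNodeValue().replace(word, replacement));
            }

            // Serialize the modified DOM back to a string.
            Transformer out = TransformerFactory.newInstance().newTransformer();
            out.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            out.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<span class='me'>Text with span inside</span>";
        System.out.println(replaceInText(html, "span", "replaced"));
    }
}
```

Because the replacement runs on parsed text nodes, a tag name such as span can never be rewritten by accident, which is the guarantee a regular expression cannot give.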
What you have to do is use a lookahead to make sure that your text isn't contained within a tag. Granted, this could probably be written better, but it will get you the results you want. It will even handle tags that have attributes.
<cfset html = "<span class='me'>Text goes here, forr example it container also **span** </span>" />
<cfset wordToReplace = "span" />
<cfset html = ReReplace(html ,"(?!/?<)(#wordToReplace#)(?![^.*>]*>)","replaced","ALL")>
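The lookahead idea can be sketched in Java with java.util.regex. This is a simplified variant of the pattern above, not a translation of it: the word is replaced only when the next angle bracket after it is not a closing ">", which means the match is not sitting inside a tag. The class name TagAwareReplace and the method replaceOutsideTags are made up for this example. It remains a heuristic and can still misfire on unusual HTML, such as a literal ">" inside an attribute value.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagAwareReplace {

    /**
     * Replaces whole-word occurrences of word that are not inside a tag.
     * The negative lookahead (?![^<>]*>) rejects a match when a '>' comes
     * before any '<', which is the case for text inside a tag.
     */
    public static String replaceOutsideTags(String html, String word, String replacement) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b(?![^<>]*>)");
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return pattern.matcher(html)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<span class='me'>Text goes here, forr example "
                + "it container also **span** </span>";
        System.out.println(replaceOutsideTags(html, "span", "replaced"));
        // prints: <span class='me'>Text goes here, forr example it container also **replaced** </span>
    }
}
```

Pattern.quote keeps the search word literal, so a word that contains regex metacharacters does not change the meaning of the pattern.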
