Screen scraping, extract text data from DOM tree - java

I'm trying to scrape specific text from a website using jsoup. The text doesn't appear to be in the HTML that I'm downloading, but rather in the DOM tree. I'm not exactly sure how this works because I'm new to it. When I view the source of the website I can't find the text I'm looking for, but I can find it in the DOM tree. How can I extract the data?

Related

Get XPath from specific content

I am trying to parse websites and get the XPath for specific pieces of content.
For example:
On www.stackoverflow.com I want to get the XPath to the 'Questions' button. Using a Chrome extension I find that the following XPath can grab 'Questions':
/html/body[@class='home-page new-topbar']/div[@class='container']/div[@id='header']/div[@id='hmenus']/div[@class='nav mainnavs']/ul/li[1]/a[@id='nav-questions']
Now I want to know: is there a way to programmatically get the XPath for a given piece of content on a webpage?

Html search validation android

I have an Android app with search functionality. The search loops through locally stored HTML files and wraps words that equal the entered word in a span with a background color, the same as when you press Ctrl+F on your desktop. The problem I am having is that if the user searches for head, body, div, span, etc., it also adds a span inside the HTML tags. My question: is there an Android validation library that deals with this issue, or do I need to make my own blacklist? I am aware of Android form validator libraries, but I am not sure they are built for what I am looking for.
I've used jsoup before to strip out unwanted HTML tags. You could do this to make the HTML data more "searchable". Also look at Android's Html.escapeHtml(CharSequence), which returns an HTML-escaped version of the given text as a String.
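As an illustration of the underlying idea (only search the text, never the markup), here is a minimal sketch that does not use jsoup. It splits the HTML into tags and text with a simple regex and highlights whole-word matches in the text parts only. The class name SafeHighlighter and the yellow span style are assumptions made for this example, and the regex is not a full HTML parser: comments, script bodies, and a '>' inside an attribute value are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SafeHighlighter {
    // Matches one tag such as <div class="x"> or </span>.
    // Simplification: comments, <script> bodies and '>' inside attributes are not handled.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Wraps whole-word, case-insensitive matches of term in a span, leaving tags untouched. */
    public static String highlight(String html, String term) {
        Pattern word = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        StringBuilder out = new StringBuilder();
        Matcher tags = TAG.matcher(html);
        int pos = 0;
        while (tags.find()) {
            // Text between the previous tag and this one: safe to highlight.
            out.append(highlightText(html.substring(pos, tags.start()), word));
            // The tag itself is copied verbatim, so <head>, <div>, ... stay intact.
            out.append(tags.group());
            pos = tags.end();
        }
        // Trailing text after the last tag.
        out.append(highlightText(html.substring(pos), word));
        return out.toString();
    }

    private static String highlightText(String text, Pattern word) {
        // $0 is the matched word itself, so the original casing is preserved.
        return word.matcher(text)
                .replaceAll("<span style=\"background-color: yellow\">$0</span>");
    }
}
```

With this approach a search for "head" leaves the <head> tag alone and only wraps the word where it appears in visible text, so no blacklist of tag names is needed.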

How to catch one specific text from html source code using Jsoup?

I tried the solution from:
How to extract text of paragraph from html using Jsoup?
jsoup how to extract this text
but both examples are working with texts from tags.
I have this unique piece of code on my html web search:
and what I need is to take the link that comes with the d.href variable.
I tried code like:
Elements link = jSoupConnection.select(":contains(d.href)");
Elements link = jSoupConnection.select("#d.href");
Elements link = jSoupConnection.getElementsByAttributeValueContaining("d.href","google");
but so far none of them has worked.
I also searched http://jsoup.org/cookbook/ without success. Could anyone more experienced with Jsoup help me, please?
Thanks in advance
If your text doesn't come inside any tag that you could target with Jsoup's select methods, you should take the whole page (which you can do with Elements link = jSoupConnection.select("*");) and then open it in your application as a plain text file to retrieve whatever you want. If the downloaded file is too big, which was my problem, try to limit the download size; you can find more details at these links:
Limiting file size creation with java
How to limit the file size in Java
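The "treat the page as plain text" idea can be sketched with a regular expression. The original snippet is not shown in the question, so this example assumes the script contains an assignment of the form d.href = "..." (or with single quotes); the class name HrefExtractor and the sample URL are hypothetical. With jsoup, the page source string could come from Document.outerHtml().

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // Assumed shape of the script text: d.href = "..." or d.href = '...'.
    // Adjust the pattern to whatever the real page source contains.
    private static final Pattern D_HREF =
            Pattern.compile("d\\.href\\s*=\\s*([\"'])(.*?)\\1");

    /** Returns the first URL assigned to d.href in the page source, if there is one. */
    public static Optional<String> findHref(String pageSource) {
        Matcher m = D_HREF.matcher(pageSource);
        // Group 1 is the opening quote, group 2 the text up to the matching closing quote.
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample; the real script from the question is not available.
        String sample = "<script>var d = document.createElement('a'); "
                + "d.href = \"http://www.google.com/search?q=jsoup\";</script>";
        System.out.println(findHref(sample).orElse("not found"));
    }
}
```

Because the value lives inside JavaScript rather than in an element or attribute, CSS selectors such as "#d.href" cannot match it, which is why a text search is used here.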

how to identify corresponding html object using label using java

I have an HTML file and I know the label name; now I need to identify the corresponding HTML object.
Can you please help me identify the object?
I am using jsoup to parse it.
I could not attach the screenshot.
The page has a top row of labels, and below them are the HTML objects:
program, study, study status, study manager (these are all labels, with the HTML objects below them)
text box, drop-down, drop-down, text box (the HTML objects)
When working with HTML, it pays to be aware of the structure of the HTML elements, not how they are rendered on-screen. In your case, you will need to find a way to identify the elements you seek in the HTML code, and then use one of the Document#getElement(s|)By(.+) methods to find them.

creating java help using single HTML file

I have one HTML file which contains 200 definitions, and I don't want to create 200 HTML files. I want to create Java help from that file so that when the user clicks an entry in the TOC (table of contents), they reach that particular definition without scrolling through the HTML documentation.
First study how to make a tree-like structure using JTree, and then study how to show an HTML page in a JFrame using JEditorPane.
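A minimal sketch of how the two pieces could fit together is below. It assumes each definition in the single HTML file is marked with a named anchor such as <a name="alpha">; the class names SingleFileHelp and TocItem, the sample titles, and the inline HTML are all made up for illustration. Selecting a tree node calls JEditorPane.scrollToReference with that node's anchor name.

```java
import java.awt.GraphicsEnvironment;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.JEditorPane;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JSplitPane;
import javax.swing.JTree;
import javax.swing.SwingUtilities;
import javax.swing.tree.DefaultMutableTreeNode;

public class SingleFileHelp {
    /** One TOC entry: the title shown in the tree and the anchor name in the HTML. */
    public static final class TocItem {
        final String title;
        final String anchor;
        TocItem(String title, String anchor) { this.title = title; this.anchor = anchor; }
        @Override public String toString() { return title; } // JTree renders nodes via toString()
    }

    /** Builds a flat TOC tree from a title-to-anchor map (insertion order is kept). */
    public static DefaultMutableTreeNode buildToc(Map<String, String> titleToAnchor) {
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("Definitions");
        for (Map.Entry<String, String> e : titleToAnchor.entrySet()) {
            root.add(new DefaultMutableTreeNode(new TocItem(e.getKey(), e.getValue())));
        }
        return root;
    }

    public static void main(String[] args) {
        if (GraphicsEnvironment.isHeadless()) return; // no display available, nothing to show
        SwingUtilities.invokeLater(() -> {
            Map<String, String> toc = new LinkedHashMap<>();
            toc.put("Alpha", "alpha");
            toc.put("Beta", "beta");
            // Stand-in for the real file; a file could be loaded with setPage(URL) instead.
            String html = "<html><body>"
                    + "<h2><a name=\"alpha\">Alpha</a></h2><p>First definition.</p>"
                    + "<h2><a name=\"beta\">Beta</a></h2><p>Second definition.</p>"
                    + "</body></html>";
            JEditorPane pane = new JEditorPane("text/html", html);
            pane.setEditable(false);
            JTree tree = new JTree(buildToc(toc));
            tree.addTreeSelectionListener(ev -> {
                Object node = tree.getLastSelectedPathComponent();
                if (node instanceof DefaultMutableTreeNode) {
                    Object item = ((DefaultMutableTreeNode) node).getUserObject();
                    // Jump to the definition's anchor instead of loading another file.
                    if (item instanceof TocItem) pane.scrollToReference(((TocItem) item).anchor);
                }
            });
            JFrame frame = new JFrame("Help");
            frame.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
            frame.add(new JSplitPane(JSplitPane.HORIZONTAL_SPLIT,
                    new JScrollPane(tree), new JScrollPane(pane)));
            frame.setSize(600, 400);
            frame.setVisible(true);
        });
    }
}
```

With 200 definitions, the title-to-anchor map would be filled from the file's own anchors rather than typed by hand, but the tree and viewer wiring stays the same.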
