Page scrape for a particular div - Java

I am wondering if there is a way to read the html output of a given webpage using Java?
I know that in PHP you can do something like:
$handle = @fopen("http://www.google.com", "r");
$source_code = fread($handle, 9000);
I am looking for the Java equivalent.
Additionally, once I have the rendered HTML, are there any Java utilities that would allow me to strip out a single div by its id?
Thanks for any help with this.

Use jsoup.
You have the choice between a tree model and a powerful query syntax similar to CSS or jQuery selectors, plus utility methods to quickly get the source of a webpage.
To quote from their website:
Fetch the Wikipedia homepage, parse it to a DOM, and select the
headlines from the In the news section into a list of Elements:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Once you have found the Element representing the div you want to remove, just call remove() on it.
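Putting the two together for the original question, here is a minimal sketch; the URL and the div id "content" are placeholders, not anything mandated by jsoup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivScraper {
    public static void main(String[] args) throws Exception {
        // Fetch the page over HTTP and parse it into a Document
        Document doc = Jsoup.connect("http://www.google.com").get();

        // Look up a single div by its id ("content" is a hypothetical id)
        Element div = doc.getElementById("content");
        if (div != null) {
            System.out.println(div.outerHtml()); // the div's own HTML
            div.remove();                        // or strip it from the page
        }
    }
}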

Related

How to find specific information in HTTP Java response? [duplicate]

I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate both the tasks.
I want to use a lightweight HTML parser, because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need
Speed
Ease of locating any HtmlElement by its "id", "name", or "tag type".
It would be OK for me if it doesn't clean the dirty HTML code. I don't need to clean any HTML source; I just need an easy way to move across HtmlElements and harvest data from them.
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");          // all <a> elements
Element head = doc.select("head").first(); // the first <head> element
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!
The best I've seen so far is HtmlCleaner:
HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed, and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to the tags, attributes, and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules to those that most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.
With HtmlCleaner you can locate any element using XPath.
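A minimal sketch of what that looks like; the div id is hypothetical, and evaluateXPath returns its matches as a plain Object[]:

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

static String divText(String htmlString) throws XPatherException {
    // Clean the raw HTML into a well-formed tree
    TagNode root = new HtmlCleaner().clean(htmlString);

    // Locate the div by id with an XPath expression
    Object[] divs = root.evaluateXPath("//div[@id='content']");
    return divs.length > 0 ? ((TagNode) divs[0]).getText().toString() : null;
}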
For other html parsers see this SO question.
I suggest Validator.nu's parser, which is based on the HTML5 parsing algorithm; it is the parser Mozilla has used since 2010-05-03.

Screen Scraping Using Jsoup to Extract Sentences

I want to do some screen scraping, and after doing a little research, it appears that JSoup is the best tool for this task. I want to be able to extract all the sentences on a web page; so, for example, given this Wikipedia page, http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping, I want to be able to get all the sentences on that page and print them out to the console. I'm still not familiar with how JSoup works, though, so if somebody could help me out that would be greatly appreciated. Thanks!
First download Jsoup and include it in your project. Then the best place to start is the Jsoup cookbook (http://jsoup.org/cookbook/) as it provides examples for the most common methods you will use with Jsoup. I recommend that you spend some time working through those examples to familiarize yourself with the API. Another good resource is the javadocs.
Here is a quick example to pull some text from the Wikipedia link you provided:
String url = "http://en.wikipedia.org/wiki/Data_scraping#Screen_scraping";
// Download the HTML and store in a Document
Document doc = Jsoup.connect(url).get();
// Select the <p> Elements from the document
Elements paragraphs = doc.select("p");
// For each selected <p> element, print out its text
for (Element e : paragraphs) {
    System.out.println(e.text());
}

Java library using CSS selectors to parse XML

Is there a jQuery-like Java/Android library that uses CSS selectors to parse XML?
Like :
String desc = myXML.find("bloc[type=pro]").get(0).attr("description");
Chainability is also what I'm looking for, in the same way as jQuery...
I hope this exists !
While initially designed as an HTML parser with CSS selector support, Jsoup works fine for XML documents as well, if your sole intent is to extract data, not to manipulate it.
Document document = Jsoup.parse(xmlString);
String desc = document.select("bloc[type=pro]").get(0).attr("description");
// ...
You see, the syntax is almost identical to what you had in the question.
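If your Jsoup version ships Parser.xmlParser() (assumption: a reasonably recent release), you can also ask for a pure XML parse, so no HTML-specific elements get injected into the tree:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Parse as XML: no <html>/<head>/<body> scaffolding is added
Document document = Jsoup.parse(xmlString, "", Parser.xmlParser());
String desc = document.select("bloc[type=pro]").first().attr("description");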
Apache Jericho is what you are looking for.
You example would look like
String desc = source.getFirstElement("type", "pro", false).getAttributeValue("description"); // false: case-insensitive value match
It's a charm to parse HTML with Jericho, so I guess it's even easier for well-structured XML.
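For context, a minimal sketch of the surrounding setup, assuming the XML is already in a string:

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

// Build a Source directly from the raw markup
Source source = new Source(xmlString);

// getFirstElement(attributeName, value, valueCaseSensitive)
Element bloc = source.getFirstElement("type", "pro", false);
String desc = bloc.getAttributeValue("description");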
Since there are some bugs in other libraries like Jsoup, and Jericho works differently from what I was expecting, I wrote a class extending org.xml.sax.helpers.DefaultHandler that parses the XML. I then wrote two other classes, modeled on Jsoup's Element and Elements, containing two functions: find, which handles CSS3 selectors, and attr, which returns the attribute value.
I'm now cleaning and commenting that code... I'll post the library later for whoever is interested.
xmlDoc.find("bloc[type=Pro]>act").attr("label");
is now possible like in jQuery !
Edit!
Here is the link to the code for whoever is interested: Google Code Project
Moved to GitHub: https://github.com/ChristopheCVB/JavaXMLQuery
I use XPath to solve that issue. XML parsers like JDOM can evaluate XPath expressions; maybe jQuery took a look at how XPath works :p
//bloc[@type="pro"][1]/@description
Note that XPath indices start from 1, not 0.
https://www.w3schools.com/xml/xpath_syntax.asp
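A minimal JDOM 2 sketch of evaluating that expression; the variable names are illustrative and exception handling is omitted:

import org.jdom2.Attribute;
import org.jdom2.Document;
import org.jdom2.filter.Filters;
import org.jdom2.input.SAXBuilder;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;
import java.io.StringReader;

// Parse the XML, then evaluate the XPath against the document
Document doc = new SAXBuilder().build(new StringReader(xmlString));
XPathExpression<Attribute> xp = XPathFactory.instance().compile(
        "//bloc[@type='pro'][1]/@description", Filters.attribute());
Attribute descAttr = xp.evaluateFirst(doc); // null if nothing matches
String desc = descAttr == null ? null : descAttr.getValue();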
The droidQuery library can do many of the things you are looking for. Although the syntax is a little different, you can:
Get view attributes using chained, jQuery-style commands, such as:
CharSequence text = $.with(this, R.id.myTextView).attr("text");
Parse XML:
Document dom = $.parseXML(myXMLString);
If you are a fan of jQuery, you will be pleased to see that nearly all of the features it provides are included in droidQuery. Although the syntax may differ at times, droidQuery's major goal is to be as syntactically close to jQuery as possible.

What library to use for building HTML documents?

Could anybody please recommend libraries that are able to do the opposite of what these libraries do?
HtmlCleaner, TagSoup, HtmlParser, HtmlUnit, jSoup, jTidy, nekoHtml, WebHarvest or Jericho.
I need to build HTML pages - to build the DOM model from String content.
EDIT: I need it for testing purposes. I have various types of input/strings that might appear in the HTML page in various places... so I need to build it up dynamically. I then process the HTML page based on various criteria that must be fulfilled or not.
I will show you why I asked this question; consider HtmlCleaner for this job:
List<String> paragraphs = getParagraphs(entity.getFile());
List<TagNode> pNodes = new ArrayList<TagNode>();
TagNode html = cleaner.clean("<html/>");
for (String paragraph : paragraphs) {
    TagNode p = new TagNode("p");
    pNodes.add(p);
    // CANNOT setText() ?
}
html.addChildren(pNodes);
The problem is that TagNode has a getText() method, but no setText() method...
Please add more comments about how vague this question is...
Jsoup, Jsoup, Jsoup! I've used all of those, and it's my favorite by a long shot. You can use it to build documents, plus it brings a lot of the magic of Jquery-style traversing alongside the best HTML document parsing I've seen to date in a Java library. I'm so happy with it that I don't mind shamelessly promoting it. ;)
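For the document-building side, a minimal sketch of what that can look like in Jsoup; the title and paragraph text are placeholders:

import org.jsoup.nodes.Document;

// createShell gives an empty <html><head></head><body></body></html>
Document doc = Document.createShell("");
doc.title("Generated page");
doc.body().appendElement("p").text("First paragraph");
doc.body().appendElement("p").text("Second paragraph");
System.out.println(doc.outerHtml());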
There are a lot of template libraries for Java, from JSP to FreeMarker, and from framework-specific implementations (Spring?) to generic libraries like StringTemplate.
The most difficult task is... to make a choice.
In general, these libraries let you make a skeleton of a Web page, with "holes" to fill with variables. It is the simplest approach, and it often works well with tools.
If you really want to build from Dom, you can just use an XML library and generate XHTML.
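For instance, a minimal sketch using only the JDK's DOM and Transformer APIs; the element names and text are illustrative, and exception handling is omitted:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import java.io.StringWriter;

// Build an XHTML document with the standard DOM API
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element html = doc.createElement("html");
doc.appendChild(html);
Element body = doc.createElement("body");
html.appendChild(body);
Element p = doc.createElement("p");
p.setTextContent("Hello, world");
body.appendChild(p);

// Serialize the DOM back to a String
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.INDENT, "yes");
StringWriter out = new StringWriter();
t.transform(new DOMSource(doc), new StreamResult(out));
System.out.println(out);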
If you are interested in HtmlCleaner particularly, it is actually a very convenient choice for building html documents.
But you must know that, to set content on a TagNode, you append a child ContentNode element :-)
List<String> paragraphs = getParagraphs(entity.getFile());
List<TagNode> pNodes = new ArrayList<TagNode>();
TagNode html = new TagNode("html");
for (String paragraph : paragraphs) {
    TagNode p = new TagNode("p");
    p.addChild(new ContentNode(paragraph));
    pNodes.add(p);
}
html.addChildren(pNodes);
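To get the markup back out, HtmlCleaner ships serializer classes that render the tree to a String; a minimal sketch (note: some versions declare a checked IOException on this call):

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.SimpleHtmlSerializer;

// Render the TagNode tree built above back to an HTML string
String markup = new SimpleHtmlSerializer(new CleanerProperties()).getAsString(html);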
jwebutils -- A library for creating HTML 5 markup using Java. It also contains support for creating JSON and CSS 3 markup.
Jakarta Element Construction Set (ECS) - a Java API for generating elements for various markup languages; it directly supports HTML 4.0 and XML. Now retired, but some folks really like it.

How to find URLs in HTML using Java

I have the following... I wouldn't say problem, but situation.
I have some HTML with tags and everything. I want to search the HTML for every URL. I'm doing it now by checking where it says 'h', then 't', then 't', then 'p', but I don't think that is a great solution.
Any good ideas?
Added: I'm looking for some kind of pseudocode, but just in case, I'm using Java for this particular project.
Try using an HTML parsing library, then search for <a> tags in the HTML document.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
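Iterating over that selection then prints the URLs; abs:href resolves relative links against the base URI passed to parse above:

for (Element link : links) {
    System.out.println(link.attr("abs:href")); // absolute URL of each link
}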
Not all URLs are in <a> tags; some are plain text, and some are in <link> or other tags.
You shouldn't scan the HTML source to achieve this.
You will end up with link elements that are not necessarily in the 'text' of the page; i.e., you could end up with 'links' to JS scripts in the page, for example.
Best way is still that you use a tool made for the job.
You should grab HTML tags and cover the ones most likely to have 'links' inside them (say <h1>, <p>, <div>, etc.). HTML parsers provide regex-like functionality to filter the content of tags, similar to your 'starts with http' logic:
[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. select("[href*=/path/]")
See: jSoup.
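Put together, a minimal sketch that grabs any element whose href starts with http; htmlString stands in for the page source:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse(htmlString);
// [href^=http] matches href attributes that start with "http"
for (Element e : doc.select("[href^=http]")) {
    System.out.println(e.attr("href"));
}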
You may want to have a look at XPath or Regular Expressions.
Use a DOM parser to extract all <a href> tags, and, if desired, additionally scan the source for http:// outside of those tags.
The best way should be to google for regexes. One example is this one:
/^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)(?:\?((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?(?:#((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?$/i
found in a Hacker News article. As far as I can follow it, it looks good, but as far as I know there is no formal regex for this problem, so the best solution is to google for a few and try which one matches most of what you want.
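If you go that route, a minimal Java sketch with a deliberately simpler pattern; the pattern is illustrative, not exhaustive:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A permissive URL pattern; trades precision for readability
Pattern urlPattern = Pattern.compile("https?://[\\w.-]+(?:/[\\w./%&=?#-]*)?");
Matcher m = urlPattern.matcher(htmlString); // htmlString: the page source
while (m.find()) {
    System.out.println(m.group()); // each URL-looking substring
}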
