Java library for HTML to Java (POJO) conversion

Using the Apache Velocity API, we can combine Java objects (Lists, POJOs, etc.) with an (HTML) template to create the (HTML) output.
Is there any Java API that can reverse this process? Its input would be the HTML output and the template used; its output should be the data (in Java/XML format) that was used to generate that output.
I am aware of the HttpUnit API, but it only lets me extract HTML elements (like tables). I am looking for something that extracts the data based on a template.
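For very simple templates, one way to picture the idea is to turn the template itself into a regular expression and read the data back from the groups. The sketch below is only an illustration under strong assumptions: placeholders are written as ${name} (a simplification, not real Velocity syntax with loops or conditionals), each placeholder appears only once, and the names TemplateExtractor and extract are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: recovers placeholder values from rendered output.
class TemplateExtractor {
    // Java named groups only allow letters and digits, so placeholder names are limited to that.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z][A-Za-z0-9]*)\\}");

    static Map<String, String> extract(String template, String output) {
        StringBuilder regex = new StringBuilder();
        Matcher placeholder = PLACEHOLDER.matcher(template);
        int last = 0;
        while (placeholder.find()) {
            // Literal template text must match exactly; each placeholder becomes a named group.
            regex.append(Pattern.quote(template.substring(last, placeholder.start())));
            regex.append("(?<").append(placeholder.group(1)).append(">.*?)");
            last = placeholder.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Map<String, String> data = new LinkedHashMap<>();
        Matcher rendered = Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(output);
        if (rendered.matches()) {
            Matcher names = PLACEHOLDER.matcher(template);
            while (names.find()) {
                data.put(names.group(1), rendered.group(names.group(1)));
            }
        }
        return data; // empty if the output does not fit the template
    }

    public static void main(String[] args) {
        String template = "<h1>${name}</h1><p>in ${city}</p>";
        String output = "<h1>A la bonne Franquette</h1><p>in London</p>";
        System.out.println(extract(template, output));
    }
}
```

Anything beyond flat placeholders (repeated sections, conditionals, whitespace differences) would need a real parser or one of the selector-based libraries from the answers.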

You can use Google protobuf to convert messages between different formats, and it is very easy to define templates as well. I create JavaScript objects using JSON.parse(), and in Java you can use protobuf to convert JSON to Java objects.
http://code.google.com/p/protobuf/
http://code.google.com/p/protobuf-java-format/

My answer probably won't be useful to the author of this question (I am 5 years late, so the timing isn't great), but since this is the first result I found on Google when searching for HTML to POJO, I think it will be useful to the many other developers who might come across it.
Today, I released (on behalf of my company) a complete HTML to POJO framework that you can use to map HTML to any POJO class with just a few annotations. The library itself is quite handy and offers many other features while remaining very pluggable. You can have a look at it right here: https://github.com/whimtrip/jwht-htmltopojo
How to use: Basics
Imagine we need to parse the following HTML page:
<html>
<head>
<title>A Simple HTML Document</title>
</head>
<body>
<div class="restaurant">
<h1>A la bonne Franquette</h1>
<p>French cuisine restaurant for gourmet of fellow french people</p>
<div class="location">
<p>in <span>London</span></p>
</div>
<p>Restaurant n*18,190. Ranked 113 out of 1,550 restaurants</p>
<div class="meals">
<div class="meal">
<p>Veal Cutlet</p>
<p rating-color="green">4.5/5 stars</p>
<p>Chef Mr. Frenchie</p>
</div>
<div class="meal">
<p>Ratatouille</p>
<p rating-color="orange">3.6/5 stars</p>
<p>Chef Mr. Frenchie and Mme. French-Cuisine</p>
</div>
</div>
</div>
</body>
</html>
Let's create the POJOs we want to map it to:
public class Restaurant {

    @Selector(value = "div.restaurant > h1")
    private String name;

    @Selector(value = "div.restaurant > p:nth-child(2)")
    private String description;

    @Selector(value = "div.restaurant > div:nth-child(3) > p > span")
    private String location;

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        indexForRegexPattern = 1,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number, as they are shown in this format: 18,190
    @ReplaceWith(value = ",", with = "")
    private Long id;

    @Selector(
        value = "div.restaurant > p:nth-child(4)",
        format = "^Restaurant n\\*([0-9,]+). Ranked ([0-9,]+) out of ([0-9,]+) restaurants$",
        // This time, we want the second regex group and not the first one anymore
        indexForRegexPattern = 2,
        useDeserializer = true,
        deserializer = ReplacerDeserializer.class,
        preConvert = true,
        postConvert = false
    )
    // so that the number becomes a valid number, as they are shown in this format: 18,190
    @ReplaceWith(value = ",", with = "")
    private Integer rank;

    @Selector(value = ".meal")
    private List<Meal> meals;

    // getters and setters
}
And now the Meal class as well:
public class Meal {

    @Selector(value = "p:nth-child(1)")
    private String name;

    @Selector(
        value = "p:nth-child(2)",
        format = "^([0-9.]+)/5 stars$",
        indexForRegexPattern = 1
    )
    private Float stars;

    @Selector(
        value = "p:nth-child(2)",
        // the custom rating-color attribute can be used as well
        attr = "rating-color"
    )
    private String ratingColor;

    @Selector(value = "p:nth-child(3)")
    private String chefs;

    // getters and setters
}
We provide more explanations of the above code on our GitHub page.
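To make the format, indexForRegexPattern and ReplaceWith combination more concrete, here is roughly the same extraction written with the plain JDK. This is only a sketch of the idea (pick a regex group, strip the thousands separator, convert to a number); it is not how the library is implemented, and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-JDK illustration; names are hypothetical and unrelated to the library.
class RankLineParser {
    private static final Pattern LINE = Pattern.compile(
            "^Restaurant n\\*([0-9,]+)\\. Ranked ([0-9,]+) out of ([0-9,]+) restaurants$");

    // Returns the requested regex group as a number, with "," separators removed.
    static long numberAt(String text, int groupIndex) {
        Matcher m = LINE.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected text: " + text);
        }
        return Long.parseLong(m.group(groupIndex).replace(",", ""));
    }

    public static void main(String[] args) {
        String text = "Restaurant n*18,190. Ranked 113 out of 1,550 restaurants";
        System.out.println(numberAt(text, 1)); // the restaurant id
        System.out.println(numberAt(text, 2)); // the rank
    }
}
```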
For now, let's see how to scrape this.
private static final String MY_HTML_FILE = "my-html-file.html";

// main must declare IOException because getHtmlBody() can throw it
public static void main(String[] args) throws IOException {
    HtmlToPojoEngine htmlToPojoEngine = HtmlToPojoEngine.create();
    HtmlAdapter<Restaurant> adapter = htmlToPojoEngine.adapter(Restaurant.class);
    // If there were several restaurants on the same page,
    // you would need to create a parent POJO containing
    // a list of Restaurants, as shown with the meals here
    Restaurant restaurant = adapter.fromHtml(getHtmlBody());
    // That's it, do some magic now!
}

private static String getHtmlBody() throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(MY_HTML_FILE));
    return new String(encoded, Charset.forName("UTF-8"));
}
Another short example can be found here
Hope this will help someone out there!

Related

Scrape information from Web Pages with Java?

I'm trying to extract data from a web page. For example, let's say I wish to fetch information from chess.org.
I know the player's ID is 25022, which means I can request
http://www.chess.org.il/Players/Player.aspx?Id=25022
In that page I can see that this player's fide ID = 2821109.
From that, I can request this page:
http://ratings.fide.com/card.phtml?event=2821109
And from that I can see that stdRating=1602.
How can I get the "stdRating" output from a given "localID" input in Java?
(localID, fideID and stdRating are just names that I use to clarify the question)

You could try the univocity-html-parser, which is very easy to use and avoids a lot of spaghetti code.
To get the standard rating for example you can use this code:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://ratings.fide.com/card.phtml?event={EVENT}");
url.getRequest().setUrlParameter("EVENT", 2821109);
HtmlElement doc = HtmlParser.parseTree(url);
String rating = doc.query()
.match("small").withText("std.")
.match("br").getFollowingText()
.getValue();
System.out.println(rating);
}
Which produces the value 1602.
But getting data by querying individual nodes and trying to stitch all pieces together is not exactly easy.
I expanded the code to illustrate how you can use the parser to get more information into records. Here I created records for the player and her rank details which are available in the table of the second page. It took me less than 1h to get this done:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://www.chess.org.il/Players/Player.aspx?Id={PLAYER_ID}");
url.getRequest().setUrlParameter("PLAYER_ID", 25022);
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings player = entities.configureEntity("player");
player.addField("id").match("b").withExactText("מספר שחקן").getFollowingText().transform(s -> s.replaceAll(": ", ""));
player.addField("name").match("h1").followedImmediatelyBy("b").withExactText("מספר שחקן").getText();
player.addField("date_of_birth").match("b").withExactText("תאריך לידה:").getFollowingText();
player.addField("fide_id").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getText();
HtmlLinkFollower playerCard = player.addField("fide_card_url").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getAttribute("href").followLink();
playerCard.addField("rating_std").match("small").withText("std.").match("br").getFollowingText();
playerCard.addField("rating_rapid").match("small").withExactText("rapid").match("br").getFollowingText();
playerCard.addField("rating_blitz").match("small").withExactText("blitz").match("br").getFollowingText();
playerCard.setNesting(Nesting.REPLACE_JOIN);
HtmlEntitySettings ratings = playerCard.addEntity("ratings");
configureRatingsBetween(ratings, "World Rank", "National Rank ISR", "world");
configureRatingsBetween(ratings, "National Rank ISR", "Continent Rank Europe", "country");
configureRatingsBetween(ratings, "Continent Rank Europe", "Rating Chart", "continent");
Results<HtmlParserResult> results = new HtmlParser(entities).parse(url);
HtmlParserResult playerData = results.get("player");
String[] playerFields = playerData.getHeaders();
for(HtmlRecord playerRecord : playerData.iterateRecords()){
for(int i = 0; i < playerFields.length; i++){
System.out.print(playerFields[i] + ": " + playerRecord.getString(playerFields[i]) +"; ");
}
System.out.println();
HtmlParserResult ratingData = playerRecord.getLinkedEntityData().get("ratings");
for(HtmlRecord ratingRecord : ratingData.iterateRecords()){
System.out.print(" * " + ratingRecord.getString("rank_type") + ": ");
System.out.println(ratingRecord.fillFieldMap(new LinkedHashMap<>(), "all_players", "active_players", "female", "u16", "female_u16"));
}
}
}
private static void configureRatingsBetween(HtmlEntitySettings ratings, String startingHeader, String endingHeader, String rankType) {
Group group = ratings.newGroup()
.startAt("table").match("b").withExactText(startingHeader)
.endAt("b").withExactText(endingHeader);
group.addField("rank_type", rankType);
group.addField("all_players").match("tr").withText("World (all", "National (all", "Rank (all").match("td", 2).getText();
group.addField("active_players").match("tr").followedImmediatelyBy("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("female").match("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("u16").match("tr").withText("U-16 Rank (active players):").match("td", 2).getText();
group.addField("female_u16").match("tr").withText("Female U-16 Rank (active players):").match("td", 2).getText();
}
The output will be:
id: 25022; name: יעל כהן; date_of_birth: 02/02/2003; fide_id: 2821109; rating_std: 1602; rating_rapid: 1422; rating_blitz: 1526;
* world: {all_players=195907, active_players=94013, female=5490, u16=3824, female_u16=586}
* country: {all_players=1595, active_players=1024, female=44, u16=51, female_u16=3}
* continent: {all_players=139963, active_players=71160, female=3757, u16=2582, female_u16=372}
Hope it helps
Disclosure: I'm the author of this library. It's commercial closed source but it can save you a lot of development time.

As @Alex R pointed out, you'll need a web scraping library for this.
The one he recommended, JSoup, is quite robust and is pretty commonly used for this task in Java, at least in my experience.
You'd first need to construct a Document by fetching your page, e.g.:
int localID = 25022; //your player's ID.
Document doc = Jsoup.connect("http://www.chess.org.il/Players/Player.aspx?Id=" + localID).get();
From this Document object, you can fetch a lot of information, for example the FIDE ID you requested. Unfortunately, the web page you linked isn't very simple to scrape, and you'll basically need to go through every link on the page to find the relevant one, for example:
Elements fidelinks = doc.select("a[href*=fide.com]");
This Elements object should give you a list of all links whose href contains the text fide.com, but you probably only want the first one, e.g.:
Element fideurl = doc.selectFirst("a[href*=fide.com]");
From that point on, I don't want to write all the code for you, but hopefully this answer serves as a good starting point!
You can get the ID alone by calling the text() method on your Element object, but you can also get the link itself by calling attr("href") on it.
The CSS selector you can use to get the other value is
div#main-col table.contentpaneopen tbody tr td table tbody tr td table tbody tr:nth-of-type(4) td table tbody tr td:first-of-type, which will get you the std score specifically, at least with standard CSS, so this should work with jsoup as well.
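If you only have the link's href rather than its text, the FIDE ID can also be read from the event query parameter. This is a small plain-JDK sketch; the class and method names are made up for illustration, and it assumes the href looks like the card URL from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for moving between the FIDE card link and the FIDE ID.
class FideLink {
    private static final Pattern EVENT_PARAM = Pattern.compile("[?&]event=(\\d+)");

    // Returns the numeric ID from the "event" parameter, or null if it is missing.
    static String fideIdFromHref(String href) {
        Matcher m = EVENT_PARAM.matcher(href);
        return m.find() ? m.group(1) : null;
    }

    // Builds the card URL that the second request would go to.
    static String cardUrl(String fideId) {
        return "http://ratings.fide.com/card.phtml?event=" + fideId;
    }

    public static void main(String[] args) {
        String href = "http://ratings.fide.com/card.phtml?event=2821109";
        System.out.println(fideIdFromHref(href));
    }
}
```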

Html Slurping in Groovy

I am trying to parse HTML that comes to me as a giant String. When I get to Line 13, NodeChild page = it.parent()
I am able to find the key that I am looking for, but the data comes to me like This Is Value One In My KeyThis is Value Two in my KeyThis is Value Three In My Key and so on. I see a recurring pattern where the separator between two values is always UppercaseUppercase (without spaces).
I would like to put it into an ArrayList one way or another. Is there a method that I am missing from the docs that is able to automatically do this? Is there a better way to parse this together?
class htmlParsingStuff{
private def slurper = new XmlSlurper(new Parser())
private void slurpItUp(String rawHTMLString){
ArrayList urlList = []
def htmlParser = slurper.parseText(rawHTMLString)
htmlParser.depthFirst().findAll() {
//Loop through all of the HTML Tags to get to the key that I am looking for
//EDIT: I see that I am able to iterate through the parent object, I just need a way to figure out how to get into that object
boolean trigger = it.text() == 'someKey'
if (trigger){
//I found the key that I am looking for
NodeChild page = it.parent()
page = page.replace('someKey', '')
LazyMap row = ["page": page, "type": "Some Type"]
urlList.add(row)
}
}
}
}

I can't provide you with working code since I don't know your specific html.
But: don't use XmlSlurper for parsing HTML. HTML is often not well-formed XML, and therefore XmlSlurper is not the right tool for the job.
For HTML, use a library like JSoup. You will find it much easier to use, especially if you have some jQuery knowledge. Since you didn't post your HTML snippet, I made up my own example:
@Grab(group='org.jsoup', module='jsoup', version='1.10.1')
import org.jsoup.Jsoup
def html = """
<html>
<body>
<table>
<tr><td>Key 1</td></tr>
<tr><td>Key 2</td></tr>
<tr><td>Key 3</td></tr>
<tr><td>Key 4</td></tr>
<tr><td>Key 5</td></tr>
</table>
</body>
</html>"""
def doc = Jsoup.parse(html)
def elements = doc.select('td')
def result = elements.collect {it.text()}
// contains ['Key 1', 'Key 2', 'Key 3', 'Key 4', 'Key 5']
To manipulate the document you would use
import org.jsoup.nodes.Element
import org.jsoup.parser.Tag

def doc = Jsoup.parse(html)
def elements = doc.select('td')
elements.each { oldElement ->
    // build a fresh <td> and swap it in for the old one
    def newElement = new Element(Tag.valueOf('td'), '')
    newElement.text('Another key')
    oldElement.replaceWith(newElement)
}
println doc.outerHtml()
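If you are still stuck with the run-together text from the question, a fallback is to split wherever a lowercase letter is immediately followed by an uppercase one. This is a sketch in Java (callable from Groovy as well); the class name is made up, and it assumes the values themselves never contain such a boundary, so a word like "iPhone" would be split too.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits text such as "...My KeyThis is..." into separate values.
class RunTogetherSplitter {
    // Zero-width split point: a lowercase letter directly followed by an uppercase letter.
    private static final String BOUNDARY = "(?<=[a-z])(?=[A-Z])";

    static List<String> split(String text) {
        return Arrays.asList(text.split(BOUNDARY));
    }

    public static void main(String[] args) {
        String text = "This Is Value One In My KeyThis is Value Two in my Key"
                + "This is Value Three In My Key";
        for (String value : split(text)) {
            System.out.println(value);
        }
    }
}
```

Selecting the individual elements with a parser, as shown above, is still the more reliable route because it does not depend on capitalization.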

Setting id attribute on input field using Wicket MultiFileUploadField

In my panel class I have the following code:
private Fragment fileUploadField(String id, UploadFeedbackPanel feedbackPanel, ComponentFeedbackPanel componentFeedbackPanel) {
String uploadType = isJSEnabled ? "multiple" : "single";
Fragment uploadFragment = new Fragment( "uploadContainer", uploadType, this );
if (isJSEnabled) {
multipleUpload = new MultiFileUploadField( id, new PropertyModel<Collection<FileUpload>>( this, "multiUploads" ), MAX_FILES );
uploadFragment.add( multipleUpload );
multipleUpload.add( newOnChangeAjaxBehavior( feedbackPanel, componentFeedbackPanel ) );
} else {
uploadFragment.add( singleUpload = new FileUploadField( id ) );
singleUpload.add( newOnChangeAjaxBehavior( feedbackPanel, componentFeedbackPanel ) );
}
return uploadFragment;
}
I want to add a label for this field, but I'm unable to get the actual input field's ID. You can see this working for the single upload field because the input field itself is rendered without any surrounding elements. However, this doesn't seem to be exposed when using MultiFileUploadField.
An acceptable alternative would be to use FileUploadField with a collection of files and the multiple=true attribute. However, I am unsure how to limit the number of files to MAX_FILES.
<label wicket:for="file"><wicket:msg key="file">File:</wicket:msg></label>
<div wicket:id="uploadContainer" class="col-right">[upload fragment shows here]</div>
<wicket:fragment wicket:id="single">
<input wicket:id="file" type="file"/>
</wicket:fragment>
<wicket:fragment wicket:id="multiple">
<div wicket:id="file" class="mfuex"></div>
</wicket:fragment>
Wicket version 6.15.0.

MultiFileUploadField uses JavaScript to generate the input fields: https://github.com/apache/wicket/blob/master/wicket-core/src/main/java/org/apache/wicket/markup/html/form/upload/MultiFileUploadField.js#L91
See whether you can hook in there somehow. If you find an elegant way, we would be glad to include it in the next version of Wicket!
If you use the 'multiple' attribute, then check:
How do I limit the number of file upload in html?

Difficulty Parsing JSON with JQuery

I have developed an application to write twitter search results as JSON objects to a results page as such:
for (Status tweet : tweets) {
Map<String, String> tweetResult = new LinkedHashMap<String, String>();
tweetResult.put("username", tweet.getUser().getScreenName());
tweetResult.put("status", tweet.getText());
tweetResult.put("date", tweet.getCreatedAt().toString());
tweetResult.put("retweets", String.valueOf(tweet.getRetweetCount()));
String resultJson = new Gson().toJson(tweetResult);
response.getWriter().write(resultJson);
}
This is called with AJAX/JQuery in the following:
$(document).ready(function() {
$.getJSON('SearchServlet', function(list) {
var table = $('#resultsTable');
$.each(list, function(index, tweet) {
$('<tr>').appendTo(table)
.append($('<td>').text(tweet.username))
.append($('<td>').text(tweet.status))
.append($('<td>').text(tweet.date))
.append($('<td>').text(tweet.retweets));
});
});
});
With the intention of populating a table with the results:
<body>
<div id="wrapper">
<div id="contentArea">
<div id="content">
<h2>Results:</h2>
<table id="resultsTable"></table>
</div>
</div>
</div>
</body>
The GET call is working perfectly and the results show up in the firebug console without a problem, however they're not appearing on the actual document itself as intended. I've tried a number of different approaches to this (including the answers here and here ).
Example of the JSON output:
{"username":"Dineen_","status":"RT #TwitterAds: Learn how to put Twitter to work for your small business! Download our small biz guide now: https://t.co/gdnMMYLI","date":"Tue Feb 26 08:37:11 GMT 2013","retweets":"22"}
Thanks in advance.

It seems your serialization is wrong: you are generating a sequence of concatenated JSON objects that is not properly enclosed in an array.
Current invalid JSON response:
{ ... } { ... } { ... } { ... }
Whereas the expected JSON response should be:
[ { ... }, { ... }, { ... }, { ... } ]
No need to do this by hand. Gson can do it for you if you construct the proper object. For example, something like the following (untested):
List<Map<String, String>> tweetList = new LinkedList<Map<String, String>>();
for (Status tweet : tweets) {
Map<String, String> tweetResult = new LinkedHashMap<String, String>();
tweetResult.put("username", tweet.getUser().getScreenName());
tweetResult.put("status", tweet.getText());
tweetResult.put("date", tweet.getCreatedAt().toString());
tweetResult.put("retweets", String.valueOf(tweet.getRetweetCount()));
tweetList.add(tweetResult);
}
String resultJson = new Gson().toJson(tweetList);
response.getWriter().write(resultJson);
After this fix you should be able to use your original code.
Based on your example JSON output, the returned value is an Object, not an Array, so you don't need to use $.each here.
$(document).ready(function () {
$.getJSON('SearchServlet', function(tweet) {
var table = $('#resultsTable');
$('<tr>').appendTo(table)
.append($('<td>').text(tweet.username))
.append($('<td>').text(tweet.status))
.append($('<td>').text(tweet.date))
.append($('<td>').text(tweet.retweets));
});
});
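To see the difference between the two response shapes without a servlet, here is a tiny plain-JDK sketch. It only illustrates the target format (objects separated by commas inside square brackets); the class name is invented, the input strings are assumed to be valid JSON objects already, and in the real servlet letting Gson serialize the list is the better choice.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Illustration only: shows the array shape that $.each expects to iterate over.
class JsonArrayShape {
    // Joins already-serialized JSON objects into one JSON array string.
    static String wrapAsArray(List<String> jsonObjects) {
        StringJoiner array = new StringJoiner(",", "[", "]");
        for (String object : jsonObjects) {
            array.add(object);
        }
        return array.toString();
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "{\"username\":\"a\",\"retweets\":\"1\"}",
                "{\"username\":\"b\",\"retweets\":\"2\"}");
        System.out.println(wrapAsArray(tweets));
    }
}
```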

I think the issue is with your use of $.each. Since you are passing in an object, each is iterating over the key-value pairs of the object. (see http://api.jquery.com/jQuery.each/)
You might want to return a JSON object that is wrapped in square brackets, just so it iterates over an array.
[{"username":"Dineen_","status":"RT #TwitterAds: Learn how to put Twitter to work for your small business! Download our small biz guide now: https://t.co/gdnMMYLI","date":"Tue Feb 26 08:37:11 GMT 2013","retweets":"22"}]
EDIT: As Alexander points out, you can just return the same object as you already do, but NOT use the $.each at all. My answer assumes you want to be able to pass back several objects and insert every one in a table row.

Dynamically add SWFObject using Wicket

I am trying to add a Flash (*.swf) file to my Wicket application. I found some information here, but unfortunately it is not working, and I don't know why. On the web page, the tag
<object wicket:id="swf" data="resources/test.swf" width="700" height="70" style="float: right; margin: 15px 0 0 0;"></object>
renders as
<object height="70" style="float: right; margin: 15px 0 0 0;" width="140" data="../../resources/wicketapp.ViewPanel/resources/test.swf" type="application/x-shockwave-flash"><param name="movie" value="../../resources/wicketapp.ViewPanel/resources/test.swf">
</object>
Clearly, this is not the path of my Flash file. Also, I want to load the file dynamically, but the method of embedding Flash discussed in the above link is static. How can I load swf files dynamically?

Looking at the linked implementation, if you want an absolute path you should precede it with a slash:
// if it's an absolute path, return it:
if( src.startsWith( "/" ) || src.startsWith( "http://" ) || src.startsWith( "https://" ) )
return(src);
Otherwise a wicket resource path is generated.
I'd actually recommend using swfobject for embedding flash - there is some nice wicket integration code at the start of this page, along with a flash-based component that uses it.

As I understand your question, you want to change the swf file at runtime. I solved this problem as shown below (this is Scala code, but I suppose you can follow it):
class SWFObject(id: String) extends WebComponent(id)
with LoggerSupport {
def script: String = """
var swfVersionStr = "10.0.0";
var xiSwfUrlStr = "flash/playerProductInstall.swf";
var flashvars = {};
var params = {};
params.quality = "high";
params.bgcolor = "#ebf4ff";
params.allowscriptaccess = "sameDomain";
params.allowfullscreen = "true";
var attributes = {};
attributes.align = "middle";
swfobject.embedSWF(
"${name}", "flashContent",
"100%", "100%",
swfVersionStr, xiSwfUrlStr,
flashvars, params, attributes);
swfobject.createCSS("#flashContent", "display:block;text-align:left;");
"""
/**
* Path to SWF file.
*/
var swfFile: String = _;
override def onComponentTag(tag: ComponentTag) = {
checkComponentTag(tag, "script")
}
override def onComponentTagBody(markupStream: MarkupStream, openTag: ComponentTag) = {
val relativeName = getRequestCycle()
.getProcessor()
.getRequestCodingStrategy()
.rewriteStaticRelativeUrl(swfFile)
val body = script.replace("${name}", relativeName)
replaceComponentTagBody(markupStream, openTag, body)
}
}
Here is an example of its use:
private val gameObject = new SWFObject("game");
gameObject.swfFile = "flash/" + swfFile;
The HTML uses the swfobject script and is based on the standard FlashBuilder export.
