Html Slurping in Groovy

Html Slurping in Groovy - java

I am trying to parse HTML that comes to me as a giant String. When I get to Line 13, NodeChild page = it.parent()
I am able to find the key that I am looking for, but the data comes to me like This Is Value One In My KeyThis is Value Two in my KeyThis is Value Three In My Key and so on. I see a recurring trend where the seperator between the two is always UppercaseUppercase (withoutSpaces).
I would like to put it into an ArrayList one way or another. Is there a method that I am missing from the docs that is able to automatically do this? Is there a better way to parse this together?
class htmlParsingStuff{
private def slurper = new XmlSlurper(new Parser())
private void slurpItUp(String rawHTMLString){
ArrayList urlList = []
def htmlParser = slurper.parseText(rawHTMLString)
htmlParser.depthFirst().findAll() {
//Loop through all of the HTML Tags to get to the key that I am looking for
//EDIT: I see that I am able to iterate through the parent object, I just need a way to figure out how to get into that object
boolean trigger = it.text() == 'someKey'
if (trigger){
//I found the key that I am looking for
NodeChild page = it.parent()
page = page.replace('someKey', '')
LazyMap row = ["page": page, "type": "Some Type"]
urlList.add(row)
}
}
}
}

I can't provide you with working code since I don't know your specific html.
But: don't use XmlSlurper for parsing HTML, HTML is not well formed and therefor XmlSlurper is not the right tool for the job.
For HTML use a library like JSoup. You will find it much easier to use especially if you have some JQuery knowledge. Since you didn't post your HTML snippet I made up my own example:
#Grab(group='org.jsoup', module='jsoup', version='1.10.1')
import org.jsoup.Jsoup
def html = """
<html>
<body>
<table>
<tr><td>Key 1</td></tr>
<tr><td>Key 2</td></tr>
<tr><td>Key 3</td></tr>
<tr><td>Key 4</td></tr>
<tr><td>Key 5</td></tr>
</table>
</body>
</html>"""
def doc = Jsoup.parse(html)
def elements = doc.select('td')
def result = elements.collect {it.text()}
// contains ['Key 1', 'Key 2', 'Key 3', 'Key 4', 'Key 5']
To manipulate the document you would use
def doc = Jsoup.parse(html)
def elements = doc.select('td')
elements.each { oldElement ->
def newElement = new Element(Tag.valueOf('td'), '')
newElement.text('Another key')
oldElement.replaceWith(newElement)
}
println doc.outerHtml()

Related

Scrape information from Web Pages with Java?

I'm trying to extract data from a webpage, for example, lets say I wish to fetch information from chess.org.
I know the player's ID is 25022, which means I can request
http://www.chess.org.il/Players/Player.aspx?Id=25022
In that page I can see that this player's fide ID = 2821109.
From that, I can request this page:
http://ratings.fide.com/card.phtml?event=2821109
And from that I can see that stdRating=1602.
How can I get the "stdRating" output from a given "localID" input in Java?
(localID, fideID and stdRating are aid parameters that I use to clarify the question)

You could try the univocity-html-parser, which is very easy to use and avoids a lot of spaghetti code.
To get the standard rating for example you can use this code:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://ratings.fide.com/card.phtml?event={EVENT}");
url.getRequest().setUrlParameter("EVENT", 2821109);
HtmlElement doc = HtmlParser.parseTree(url);
String rating = doc.query()
.match("small").withText("std.")
.match("br").getFollowingText()
.getValue();
System.out.println(rating);
}
Which produces the value 1602.
But getting data by querying individual nodes and trying to stitch all pieces together is not exactly easy.
I expanded the code to illustrate how you can use the parser to get more information into records. Here I created records for the player and her rank details which are available in the table of the second page. It took me less than 1h to get this done:
public static void main(String... args) {
UrlReaderProvider url = new UrlReaderProvider("http://www.chess.org.il/Players/Player.aspx?Id={PLAYER_ID}");
url.getRequest().setUrlParameter("PLAYER_ID", 25022);
HtmlEntityList entities = new HtmlEntityList();
HtmlEntitySettings player = entities.configureEntity("player");
player.addField("id").match("b").withExactText("מספר שחקן").getFollowingText().transform(s -> s.replaceAll(": ", ""));
player.addField("name").match("h1").followedImmediatelyBy("b").withExactText("מספר שחקן").getText();
player.addField("date_of_birth").match("b").withExactText("תאריך לידה:").getFollowingText();
player.addField("fide_id").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getText();
HtmlLinkFollower playerCard = player.addField("fide_card_url").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getAttribute("href").followLink();
playerCard.addField("rating_std").match("small").withText("std.").match("br").getFollowingText();
playerCard.addField("rating_rapid").match("small").withExactText("rapid").match("br").getFollowingText();
playerCard.addField("rating_blitz").match("small").withExactText("blitz").match("br").getFollowingText();
playerCard.setNesting(Nesting.REPLACE_JOIN);
HtmlEntitySettings ratings = playerCard.addEntity("ratings");
configureRatingsBetween(ratings, "World Rank", "National Rank ISR", "world");
configureRatingsBetween(ratings, "National Rank ISR", "Continent Rank Europe", "country");
configureRatingsBetween(ratings, "Continent Rank Europe", "Rating Chart", "continent");
Results<HtmlParserResult> results = new HtmlParser(entities).parse(url);
HtmlParserResult playerData = results.get("player");
String[] playerFields = playerData.getHeaders();
for(HtmlRecord playerRecord : playerData.iterateRecords()){
for(int i = 0; i < playerFields.length; i++){
System.out.print(playerFields[i] + ": " + playerRecord.getString(playerFields[i]) +"; ");
}
System.out.println();
HtmlParserResult ratingData = playerRecord.getLinkedEntityData().get("ratings");
for(HtmlRecord ratingRecord : ratingData.iterateRecords()){
System.out.print(" * " + ratingRecord.getString("rank_type") + ": ");
System.out.println(ratingRecord.fillFieldMap(new LinkedHashMap<>(), "all_players", "active_players", "female", "u16", "female_u16"));
}
}
}
private static void configureRatingsBetween(HtmlEntitySettings ratings, String startingHeader, String endingHeader, String rankType) {
Group group = ratings.newGroup()
.startAt("table").match("b").withExactText(startingHeader)
.endAt("b").withExactText(endingHeader);
group.addField("rank_type", rankType);
group.addField("all_players").match("tr").withText("World (all", "National (all", "Rank (all").match("td", 2).getText();
group.addField("active_players").match("tr").followedImmediatelyBy("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("female").match("tr").withText("Female (active players):").match("td", 2).getText();
group.addField("u16").match("tr").withText("U-16 Rank (active players):").match("td", 2).getText();
group.addField("female_u16").match("tr").withText("Female U-16 Rank (active players):").match("td", 2).getText();
}
The output will be:
id: 25022; name: יעל כהן; date_of_birth: 02/02/2003; fide_id: 2821109; rating_std: 1602; rating_rapid: 1422; rating_blitz: 1526;
* world: {all_players=195907, active_players=94013, female=5490, u16=3824, female_u16=586}
* country: {all_players=1595, active_players=1024, female=44, u16=51, female_u16=3}
* continent: {all_players=139963, active_players=71160, female=3757, u16=2582, female_u16=372}
Hope it helps
Disclosure: I'm the author of this library. It's commercial closed source but it can save you a lot of development time.

As #Alex R pointed out, you'll need a Web Scraping library for this.
The one he recommended, JSoup, is quite robust and is pretty commonly used for this task in Java, at least in my experience.
You'd first need to construct a document that fetches your page, eg:
int localID = 25022; //your player's ID.
Document doc = Jsoup.connect("http://www.chess.org.il/Players/Player.aspx?Id=" + localID).get();
From this Document Object, you can fetch a lot of information, for example the FIDE ID you requested, unfortunately the web page you linked inst very simple to scrape, and you'll need to basically go through every link on the page to find the relevant link, for example:
Elements fidelinks = doc.select("a[href*=fide.com]");
This Elements object should give you a list of all links that link to anything containing the text fide.com, but you probably only want the first one, eg:
Element fideurl = doc.selectFirst("a[href=*=fide.com]");
From that point on, I don't want to write all the code for you, but hopefully this answer serves as a good starting point!
You can get the ID alone by calling the text() method on your Element object, but You can also get the link itself by just calling Element.attr('href')
The css selector you can use to get the other value is
div#main-col table.contentpaneopen tbody tr td table tbody tr td table tbody tr:nth-of-type(4) td table tbody tr td:first-of-type, which will get you the std score specifically, at least with standard css, so this should work with jsoup as well.

Save Wikipedia image as an ImageIcon [duplicate]

Is there any way I can access the thumbnail picture of any wikipedia page by using an API? I mean the image on the top right side in box. Is there any APIs for that?

You can get the thumbnail of any wikipedia page using prop=pageimages. For example:
http://en.wikipedia.org/w/api.php?action=query&titles=Al-Farabi&prop=pageimages&format=json&pithumbsize=100
And you will get the thumbnail full URL.

http://en.wikipedia.org/w/api.php
Look at prop=images.
It returns an array of image filenames that are used in the parsed page. You then have the option of making another API call to find out the full image URL, e.g.:
action=query&titles=Image:INSERT_EXAMPLE_FILE_NAME_HERE.jpg&prop=imageinfo&iiprop=url
or to calculate the URL via the filename's hash.
Unfortunately, while the array of images returned by prop=images is in the order they are found on the page, the first can not be guaranteed to be the image in the info box because sometimes a page will include an image before the infobox (most of the time icons for metadata about the page: e.g. "this article is locked").
Searching the array of images for the first image that includes the page title is probably the best guess for the infobox image.

This is good way to get the Main Image of a page in wikipedia
http://en.wikipedia.org/w/api.php?action=query&prop=pageimages&format=json&piprop=original&titles=India

Check out the MediaWiki API example for getting the main picture of a wikipedia page: https://www.mediawiki.org/wiki/API:Page_info_in_search_results.
As other's have mentioned, you would use prop=pageimages in your API query.
If you also want the image description, you would use prop=pageimages|pageterms instead in your API query.
You can get the original image using piprop=original. Or you can get a thumbnail image with a specified width/height. For a thumbnail with width/height=600, piprop=thumbnail&pithumbsize=600. If you omit either, the image returned in the API callback will default to a thumbnail with width/height of 50px.
If you are requesting results in JSON format, you should always use formatversion=2 in your API query (i.e., format=json&formatversion=2) because it makes retrieving the image from the query easier.
Original Size Image:
https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages|pageterms&piprop=original&titles=Albert Einstein
Thumbnail Size (600px width/height) Image:
https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages|pageterms&piprop=thumbnail&pithumbsize=600&titles=Albert Einstein

Way 1: You can try some query like this:
http://en.wikipedia.org/w/api.php?action=opensearch&limit=5&format=xml&search=italy&namespace=0
in the response, you can see the Image tag.
<Item>
<Text xml:space="preserve">Italy national rugby union team</Text>
<Description xml:space="preserve">
The Italy national rugby union team represent the nation of Italy in the sport of rugby union.
</Description>
<Url xml:space="preserve">
http://en.wikipedia.org/wiki/Italy_national_rugby_union_team
</Url>
<Image source="http://upload.wikimedia.org/wikipedia/en/thumb/4/46/Italy_rugby.png/43px-Italy_rugby.png" width="43" height="50"/>
</Item>
Way 2: use query http://en.wikipedia.org/w/index.php?action=render&title=italy
then you can get a raw html code, you can get the image use something like PHP Simple HTML DOM Parser
http://simplehtmldom.sourceforge.net
I have no time write it to you. just give you some advice, thanks.

I'm sorry for not answering specifically your question about the main image. But here's some code to get a list of all images:
function makeCall($url) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
return curl_exec($curl);
}
function wikipediaImageUrls($url) {
$imageUrls = array();
$pathComponents = explode('/', parse_url($url, PHP_URL_PATH));
$pageTitle = array_pop($pathComponents);
$imagesQuery = "http://en.wikipedia.org/w/api.php?action=query&titles={$pageTitle}&prop=images&format=json";
$jsonResponse = makeCall($imagesQuery);
$response = json_decode($jsonResponse, true);
$imagesKey = key($response['query']['pages']);
foreach($response['query']['pages'][$imagesKey]['images'] as $imageArray) {
if($imageArray['title'] != 'File:Commons-logo.svg' && $imageArray['title'] != 'File:P vip.svg') {
$title = str_replace('File:', '', $imageArray['title']);
$title = str_replace(' ', '_', $title);
$imageUrlQuery = "http://en.wikipedia.org/w/api.php?action=query&titles=Image:{$title}&prop=imageinfo&iiprop=url&format=json";
$jsonUrlQuery = makeCall($imageUrlQuery);
$urlResponse = json_decode($jsonUrlQuery, true);
$imageKey = key($urlResponse['query']['pages']);
$imageUrls[] = $urlResponse['query']['pages'][$imageKey]['imageinfo'][0]['url'];
}
}
return $imageUrls;
}
print_r(wikipediaImageUrls('http://en.wikipedia.org/wiki/Saturn_%28mythology%29'));
print_r(wikipediaImageUrls('http://en.wikipedia.org/wiki/Hans-Ulrich_Rudel'));
I got this for http://en.wikipedia.org/wiki/Saturn_%28mythology%29:
Array
(
[0] => http://upload.wikimedia.org/wikipedia/commons/1/10/Arch_of_SeptimiusSeverus.jpg
[1] => http://upload.wikimedia.org/wikipedia/commons/8/81/Ivan_Akimov_Saturn_.jpg
[2] => http://upload.wikimedia.org/wikipedia/commons/d/d7/Lucius_Appuleius_Saturninus.jpg
[3] => http://upload.wikimedia.org/wikipedia/commons/2/2c/Polidoro_da_Caravaggio_-_Saturnus-thumb.jpg
[4] => http://upload.wikimedia.org/wikipedia/commons/b/bd/Porta_Maggiore_Alatri.jpg
[5] => http://upload.wikimedia.org/wikipedia/commons/6/6a/She-wolf_suckles_Romulus_and_Remus.jpg
[6] => http://upload.wikimedia.org/wikipedia/commons/4/45/Throne_of_Saturn_Louvre_Ma1662.jpg
)
And for the second URL (http://en.wikipedia.org/wiki/Hans-Ulrich_Rudel):
Array
(
[0] => http://upload.wikimedia.org/wikipedia/commons/e/e9/BmRKEL.jpg
[1] => http://upload.wikimedia.org/wikipedia/commons/3/3f/BmRKELS.jpg
[2] => http://upload.wikimedia.org/wikipedia/commons/2/2c/Bundesarchiv_Bild_101I-655-5976-04%2C_Russland%2C_Sturzkampfbomber_Junkers_Ju_87_G.jpg
[3] => http://upload.wikimedia.org/wikipedia/commons/6/62/Bundeswehr_Kreuz_Black.svg
[4] => http://upload.wikimedia.org/wikipedia/commons/9/99/Flag_of_German_Reich_%281935%E2%80%931945%29.svg
[5] => http://upload.wikimedia.org/wikipedia/en/6/64/HansUlrichRudel.jpeg
[6] => http://upload.wikimedia.org/wikipedia/commons/8/82/Heinkel_He_111_during_the_Battle_of_Britain.jpg
[7] => http://upload.wikimedia.org/wikipedia/commons/6/66/Regulation_WW_II_Underwing_Balkenkreuz.png
)
Note that the URL changed a bit on the 6th element of the second array. It's what #JosephJaber was warning about in his comment above.
Hope this helps someone.

I have written some code that gets main image (full URL) by Wikipedia article title. It's not perfect, but overall I'm very pleased with the results.
The challenge was that when queried for a specific title, Wikipedia returns multiple image filenames (without path). Furthermore, the secondary search (I used the code varatis posted in this thread - thanks!) returns URLs of all images found based on the image filename that was searched, regardless of the original article title. After all this, we may end up with a generic image irrelevant to the search, so we filter those out. The code iterates over filenames and URLs until it finds (hopefully the best) match... a bit complicated, but it works :)
Note on the generic filter: I've been compiling a list of generic image strings for the isGeneric() function, but the list just keeps growing. I am considering maintaining it as a public list - if there is any interest let me know.
Pre:
protected static $baseurl = "http://en.wikipedia.org/w/api.php";
Main function - get image URL from title:
public static function getImageURL($title)
{
$images = self::getImageFilenameObj($title); // returns JSON object
if (!$images) return '';
foreach ($images as $image)
{
// get object of image URL for given filename
$imgjson = self::getFileURLObj($image->title);
// return first image match
foreach ($imgjson as $img)
{
// get URL for image
$url = $img->imageinfo[0]->url;
// no image found
if (!$url) continue;
// filter generic images
if (self::isGeneric($url)) continue;
// match found
return $url;
}
}
// match not found
return '';
}
== The following functions are called by the main function above ==
Get JSON object (filenames) by title:
public static function getImageFilenameObj($title)
{
try // see if page has images
{
// get image file name
$json = json_decode(
self::retrieveInfo(
self::$baseurl . '?action=query&titles=' .
urlencode($title) . '&prop=images&format=json'
))->query->pages;
/** The foreach is only to get around
* the fact that we don't have the id.
*/
foreach ($json as $id) { return $id->images; }
}
catch(exception $e) // no images
{
return NULL;
}
}
Get JSON object (URLs) by filename:
public static function getFileURLObj($filename)
{
try // resolve URL from filename
{
return json_decode(
self::retrieveInfo(
self::$baseurl . '?action=query&titles=' .
urlencode($filename) . '&prop=imageinfo&iiprop=url&format=json'
))->query->pages;
}
catch(exception $e) // no URLs
{
return NULL;
}
}
Filter out generic images:
public static function isGeneric($url)
{
$generic_strings = array(
'_gray.svg',
'icon',
'Commons-logo.svg',
'Ambox',
'Text_document_with_red_question_mark.svg',
'Question_book-new.svg',
'Canadese_kano',
'Wiki_letter_',
'Edit-clear.svg',
'WPanthroponymy',
'Compass_rose_pale',
'Us-actor.svg',
'voting_box',
'Crystal_',
'transportation_inv',
'arrow.svg',
'Quill_and_ink-US.svg',
'Decrease2.svg',
'Rating-',
'template',
'Nuvola_apps_',
'Mergefrom.svg',
'Portal-',
'Translation_to_',
'/School.svg',
'arrow',
'Symbol_',
'stub',
'Unbalanced_scales.svg',
'-logo.',
'P_vip.svg',
'Books-aj.svg_aj_ashton_01.svg',
'Film',
'/Gnome-',
'cap.svg',
'Missing',
'silhouette',
'Star_empty.svg',
'Music_film_clapperboard.svg',
'IPA_Unicode',
'symbol',
'_highlighting_',
'pictogram',
'Red_pog.svg',
'_medal_with_cup',
'_balloon',
'Feature',
'Aiga_'
);
foreach ($generic_strings as $str)
{
if (stripos($url, $str) !== false) return true;
}
return false;
}
Comments welcome.

Lets take Example of Page http://en.wikipedia.org/wiki/index.html?curid=57570
to get Main Pic
Check out
prop=pageprops
action=query&pageids=57570&prop=pageprops&format=json
Results Page Data Eg.
{ "pages" : { "57570":{
"pageid":57570,
"ns":0,
"title":"Sachin Tendulkar",
"pageprops" : {
"defaultsort":"Tendulkar,Sachin",
"page_image":"Sachin_at_Castrol_Golden_Spanner_Awards_(crop).jpg",
"wikibase_item":"Q9488"
}
}
}
}}
We get main Pic file name this result as
** (wikiId).pageprops.page_image = Sachin_at_Castrol_Golden_Spanner_Awards_(crop).jpg**
Now as we have Image file name we will have to make another Api Call to get full image path from file name as follows
action=query&titles=Image:INSERT_EXAMPLE_FILE_NAME_HERE.jpg&prop=imageinfo&iiprop=url
Eg.
action=query&titles=Image:Sachin_at_Castrol_Golden_Spanner_Awards_(crop).jpg&prop=imageinfo&iiprop=url
Returns Array of Image Data having url in it as
http://upload.wikimedia.org/wikipedia/commons/3/35/Sachin_at_Castrol_Golden_Spanner_Awards_%28crop%29.jpg

I there is a way to reliably get a main image for a wikipedia page - the Extension called PageImages
The PageImages extension collects information about images used on a page.
Its aim is to return the single most appropriate thumbnail associated
with an article, attempting to return only meaningful images, e.g. not
those from maintenance templates, stubs or flag icons. Currently it
uses the first non-meaningless image used in the page.
https://www.mediawiki.org/wiki/Extension:PageImages
Just add the prop pageimages to your API Query:
/w/api.php?action=query&prop=pageimages&titles=Somepage&format=xml
This reliably filters out annoying default images and prevents you from having to filter them yourself! The extension is installed on all the main wikipedia pages...

Like Anuraj mentioned, the pageimages parameter is it. Look at the following url that'll bring about some nifty stuff:
https://en.wikipedia.org/w/api.php?action=query&prop=info|extracts|pageimages|images&inprop=url&exsentences=1&titles=india
Her are some interesting parameters:
The two parameters extracts and exsentences gives you a short
description you can use. (exsentences is the number of sentences you want to include in the excerpt)
The info and the inprop=url parameters gives you the url of the page
The prop property has multiple parameters separated by a bar symbol
And if you insert the format=json in there, it is even better

See this related question on an API for Wikipedia. However, I would not know if it is possible to retrieve the thumbnail picture through an API.
You can also consider just parsing the web page to find the image URL, and retrieve the image that way.

Here is my list of XPaths I have found work for 95 percent of articles. the main ones are 1, 2 3 and 4. A lot of articles are not formatted correctly and these would be edge cases:
You can use a DOM parsing lib to fetch image using the XPath.
static NSString *kWikipediaImageXPath2 = #"//*[#id=\"mw-content-text\"]/div[1]/div/table/tr[2]/td/a/img";
static NSString *kWikipediaImageXPath3 = #"//*[#id=\"mw-content-text\"]/div[1]/table/tr[1]/td/a/img";
static NSString *kWikipediaImageXPath1 = #"//*[#id=\"mw-content-text\"]/div[1]/table/tr[2]/td/a/img";
static NSString *kWikipediaImageXPath4 = #"//*[#id=\"mw-content-text\"]/div[2]/table/tr[2]/td/a/img";
static NSString *kWikipediaImageXPath5 = #"//*[#id=\"mw-content-text\"]/div[1]/table/tr[2]/td/p/a/img";
static NSString *kWikipediaImageXPath6 = #"//*[#id=\"mw-content-text\"]/div[1]/table/tr[2]/td/div/div/a/img";
static NSString *kWikipediaImageXPath7 = #"//*[#id=\"mw-content-text\"]/div[1]/table/tr[1]/td/div/div/a/img";
I used a ObjC wrapper called Hpple around libxml2.2 to pull out the image url. Hope this helps

You can also use cocoa Pod called SDWebImage
Code sample (remember to also add import SDWebImage):
func requestInfo(flowerName: String) {
let parameters : [String:String] = [
"format" : "json",
"action" : "query",
"prop" : "extracts|pageimages",//pageimages allows fetch imagePath
"exintro" : "",
"explaintext" : "",
"titles" : flowerName,
"indexpageids" : "",
"redirects" : "1",
"pithumbsize" : "500"//specify image size in px
]
AF.request(wikipediaURL, method: .get, parameters: parameters).responseJSON { (response) in
switch response.result {
case .success(let value):
print("Got the wikipedia info.")
print(response)
let flowerJSON : JSON = JSON(response.value!)
let pageid = flowerJSON["query"]["pageids"][0].stringValue
let flowerDescription = flowerJSON["query"]["pages"][pageid]["extract"].stringValue
let flowerImageURL = flowerJSON["query"]["pages"][pageid]["thumbnail"]["source"].stringValue //fetching Image URL
self.wikiInfoLabel.text = flowerDescription
self.imageView.sd_setImage(with: URL(string : flowerImageURL))//imageView updated with Wiki Image
case .failure(let error):
print(error)
}
}
}

I think not, but you can capture the image using a link parser HTML documents

Which Java-Datatype is required when calling JavaScript-Method

I'm trying to call a JavaScript-Method from within my JavaFX-WebView via:
JSObject win = (JSObject)getEngine().executeScript("window");
win.call("showPDF", myPDFFile /*typeof java.io.File*/);
the result of this call is:
Invalid parameter object: need either .data, .range or .url
This is the JavaScript-Part (not by me), which throws the error:
var source;
if (typeof src === 'string') {
source = { url: src };
} else if (isArrayBuffer(src)) {
source = { data: src };
} else if (src instanceof PDFDataRangeTransport) {
source = { range: src };
} else {
if (typeof src !== 'object') {
error('Invalid parameter in getDocument, need either Uint8Array, ' +
'string or a parameter object');
}
if (!src.url && !src.data && !src.range) {
error('Invalid parameter object: need either .data, .range or .url');
}
}
Implementation of isArrayBuffer:
function isArrayBuffer(v) {
return typeof v === 'object' && v !== null && v.byteLength !== undefined;
}
So my question is:
What type of (java) object could be used so that this call might work?
win.call("showPDF", ???);
EDIT 1:
String cannot be used, because it will be treated as a URL.
I would like to commit something like a ByteArray (my concrete file), but using a byte[] (instead of java.io.File) causes the same error.
Here are some comments from the JS function above:
https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L234
This is the main entry point for loading a PDF and interacting with it.
NOTE: If a URL is used to fetch the PDF data a standard XMLHttpRequest(XHR)
is used, which means it must follow the same origin rules that any XHR does
e.g. No cross domain requests without CORS.
#param {string|TypedArray|DocumentInitParameters|PDFDataRangeTransport} src
Can be a url to where a PDF is located, a typed array (Uint8Array)
already populated with data or parameter object.
What Datatype i have to use (in JAVA), so it will be a (e.g) TypedArray (Uint8Array) in JAVASCRIPT?
EDIT 2:
I was trying Aarons suggestion:
String encoded = Base64.getEncoder().encodeToString(Files.readAllBytes(myPDFFile.toPath()));
engine.executeScript("var src = _base64ToArrayBuffer('"+encoded+"'); showPDF(src);");
This causes a new problem:
Error: Invalid PDF binary data: either typed array, string or array-like object is expected in the data property.
This (new) error is thrown (few lines later) here: https://github.com/mozilla/pdf.js/blob/master/src/display/api.js#L291
console.log(_base64ToArrayBuffer(encoded)) returns: [object ArrayBuffer]
Solution:
I managed to make this work (with help of Aaron):
pdfViewer.html
<!doctype html>
<html>
<head>
<script src="web/compatibility.js"></script>
<script src="build/pdf.js"></script>
<script>
function _base64ToBinaryString(base64) {
var binary_string = window.atob(base64);
return binary_string;
}
function showPDF(pdfFile) {
console.log('calling showPDF...');
'use strict';
PDFJS.disableWorker = true; /* IMPORTANT TO DISABLE! */
PDFJS.workerSrc = 'build/pdf.worker.js';
console.log('TRYING TO GET DOCUMENT FROM BINARY DATA...');
PDFJS.getDocument({data: pdfFile}).then(function(pdf)
{
console.log('PDF LOADED.');
console.log('TRYING TO GET PAGE 1...');
pdf.getPage(1).then(function(page) {
console.log('PAGE 1 LOADED.');
var scale = 1.0;
var viewport = page.getViewport(scale);
var canvas = document.getElementById('the-canvas');
var context = canvas.getContext('2d');
canvas.height = viewport.height;
canvas.width = viewport.width;
var renderContext = {
canvasContext: context,
viewport: viewport
};
console.log('RENDERING PAGE 1...');
page.render(renderContext);
});
});
};
</script>
</head>
<body>
<canvas id="the-canvas" style="border:1px solid black;"/>
</body>
</html>
After loading the above HTML-Page into the WebView,
following Java-Code is used to load a PDF-File into the Page:
String encoded = Base64.getEncoder().encodeToString(Files.readAllBytes(myPdfFile.toPath()));
webEngine.executeScript("var src = _base64ToBinaryString('"+encoded+"'); showPDF(src);");

Try myPDFFile.getURI().getURL().toString() since the method accepts URLs in the form of Strings. Since WebKit is running locally, you should be able to read file:// URLs.
If you want to try the ArrayBuffer, then you can find some examples in the MDN JavaScript documentation at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/ArrayBuffer
That means you need to create a FileReader, pass the URL, handle the asynchronous events. I only recommend this if you know JavaScript fairly well.
Unfortunately, the examples are all geared towards handling files in a web page, not passing data into a page.
[EDIT] Here is an example how to convert a Base64 encoded string to ArrayBuffer: Convert base64 string to ArrayBuffer
So you need the load the file into a Java byte[] array. Then convert the array to a Base64 encoded Java String. You can then paste this into dynamic JavaScript:
String encoded = ...byte[] -> Base64...
getEngine().executeScript("var src = _base64ToArrayBuffer('"+encoded+"'); showPDF(src);");

I want to pull Facebook posts from a public page to a Java application

I am creating an app in Java that will take all the information from a public website and load it in the app for people to read using jsoup. I was trying the same kind of function with Facebook but it wasn't working the same way. Does anyone have a good idea about how I should go about this?
Thanks,
Calland
public String[] scrapeEvents(String... args) throws Exception {
Document doc = Jsoup.connect("http://www.facebook.com/cedarstreettimes?fref=ts").get();
Elements elements = doc.select("div._wk");
String s = elements.toString();
return s;
}
edit: I found this link of information,but I'm a little confused on how to manipulate it to get me only the content of what the specific user posts on their wall. http://developers.facebook.com/docs/getting-started/graphapi/

I had a look at the source of that page -- the thing that is tripping up the parse is that all the real content is wrapped in comments, like this:
<code class="hidden_elem" id="u_0_42"><!-- <div class="fbTimelineSection ...> --></code>
There is JS on the page that lifts that data into the real DOM, but as jsoup doesn't execute JS it stays as comments. So before extracting the content, we need to emulate that JS and "un-hide" those elements. Here's an example to get you started:
String url = "https://www.facebook.com/cedarstreettimes?fref=ts";
String ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.33 (KHTML, like Gecko) Chrome/27.0.1438.7 Safari/537.33";
Document doc = Jsoup.connect(url).userAgent(ua).timeout(10*1000).get();
// move the hidden commented out html into the DOM proper:
Elements hiddenElements = doc.select("code.hidden_elem");
for (Element hidden: hiddenElements) {
for (Node child: hidden.childNodesCopy()) {
if (child instanceof Comment) {
hidden.append(((Comment) child).getData()); // comment data parsed as html
}
}
}
Elements articles = doc.select("div[role=article]");
for (Element article: articles) {
if (article.select("span.userContent").size() > 0) {
String text = article.select("span.userContent").text();
String imgUrl = article.select("div.photo img").attr("abs:src");
System.out.println(String.format("%s\n%s\n\n", text,imgUrl));
}
}
That example pulls out the article text and any photo that is associated with it.
(It's possibly better to use the FB API that this method; I wanted to show how you can emulate little bits of JS to make a scrape work properly.)

XMLParser is eating my whitespace

I am losing significant whitespace from a wiki page I am parsing and I'm thinking it's because of the parser. I have this in my Groovy script:
#Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2' )
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
slurper.keepWhitespace = true
inputStream.withStream{ doc = slurper.parse(it)
println "originalContent = " + doc.'**'.find{ it.#id == 'editpageform' }.'**'.find { it.#name=='originalContent'}.#value
}
Where inputStream is initialized from a URL GET request to edit a confluence wiki page.
Later on in the withInputStream block where I do this:
println "originalContent = " + doc.'**'.find{ it.#id == 'editpageform' }.'**'.find { it.#name=='originalContent'}.#value
I notice all the original content of the page is stripped of its newlines. I originally thought it was a server-side thing but when I went to make the same req in my browser and view source I could see newlines in the "originalContent" hidden parameter. Is there an easy way to disable the whitespace normalization and preserve the contents of the field? The above was run against a internal Confluence wiki page but could most likely be reproved when editing any arbitrary wiki page.
Updated above
I added a call to "slurped.keepWhitespace = true" in an attempt to preserve whitespace but that still doesn't work. I'm thinking this method is intended for elements and not attributes? Is there a way to easily tweak flags on the underlying Java XMLParser? Is there a specific setting to set for whitespace in attribute values?

I first tried to reproduce this with some confluence page of my own, but there was no value attribute and no text content in the input node, so I created my own test html.
Now, I figured the tagsoup parser would need to be configured to preserve whitespace too, just setting this on the slurper won't help because the default is to ignore whitespace.
So I've done just this, the tagsoup feature ignorable-whitespace is documented btw. (search for whitespace on the page)
Anyway, it doesn't work. Whitespace from attributes is preserved as you can see from the example and preserving text whitespace doesn't seem to work despite setting the extra feature. Maybe this is a bug in tagsoup or the xml slurper?
I suggest you have a closer look at your html too, is there really a value attribute present?
#Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2' )
String html = """\
<html><head><title>test</title></head><body>
<p>
<form id="editpageform">
<p>
<input name="originalContent" value=" ">
</input>
</p>
</form>
</p>
</body></html>
"""
def inputStream = new ByteArrayInputStream(html.getBytes())
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature("http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace", true)
def slurper = new XmlSlurper(parser)
slurper.keepWhitespace = true
inputStream.withStream{ doc = slurper.parse(it)
def parse = { doc.'**'.find{ it.#id == 'editpageform' }.'**'.find { it.#name=='originalContent'} }
println "originalContent (name) = '${parse().#name}'"
println "originalContent (value) = '${parse().#value}'"
println "originalContent (text) = '${parse().text()}'"
}

It seems the newlines are not preserved in the value attribute. See below:
#Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2' )
String html = """\
<html><head><title>test</title></head><body>
<p>
<form id="editpageform">
<p>
<input name="originalContent" value="
">
</input>
</p>
</form>
</p>
</body></html>
"""
def inputStream = new ByteArrayInputStream(html.getBytes())
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature("http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace", true)
def slurper = new XmlSlurper(parser)
slurper.keepWhitespace = true
inputStream.withStream{ doc = slurper.parse(it)
def parse = { doc.'**'.find{ it.#id == 'editpageform' }.'**'.find { it.#name=='originalContent'} }
println "originalContent (name) = '${parse().#name}'"
println "originalContent (value) = '${parse().#value}'"
println "originalContent (text) = '${parse().text()}'"
assert parse().#value.toString().contains('\n') : "Should contain a newline"
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Html Slurping in Groovy - java

Related

Scrape information from Web Pages with Java?

Save Wikipedia image as an ImageIcon [duplicate]

Which Java-Datatype is required when calling JavaScript-Method

I want to pull Facebook posts from a public page to a Java application

XMLParser is eating my whitespace

Categories

Resources