Wikipedia content parsing JSON - java

I want to get the contents of a Wikipedia page and then do some fun stuff with it.
The idea is that I want to get it in XML/JSON format, and at the moment I can't find a way to do it.
So far I have managed to get this far:
https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&prop=revisions&titles=April_1&rvprop=content&rvcontentformat=text%2Fx-wiki
But I receive the content in the text/x-wiki (wikitext) format, and I cannot change it to JSON because the API does not support that conversion.
How can I parse the wikitext into JSON, or how else can I get the contents of the page?
Thanks!

Yes, you can use the HTML parser inside XWiki Rendering to parse the HTML generated by Wikipedia. This gives you an AST on which you can do whatever you wish.
See http://rendering.xwiki.org/xwiki/bin/view/Main/WebHome for more details.
You just need to find a way to get the Wikipedia content as HTML.
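For that last step, the MediaWiki API's action=parse module returns the page's rendered HTML wrapped in JSON. A minimal sketch, assuming Java 11+'s built-in HttpClient (the April_1 title is taken from the question's URL):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WikipediaHtml {
        public static void main(String[] args) throws Exception {
            // action=parse renders the page; prop=text asks for its HTML
            String api = "https://en.wikipedia.org/w/api.php"
                    + "?action=parse&page=April_1&prop=text&format=json";
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(api)).build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // The HTML sits under parse.text["*"] in the JSON; pull it out
            // with any JSON library before handing it to the HTML parser.
            System.out.println(response.body());
        }
    }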

Related

How to fix hanging HTML tags in an HTML fragment?

I am getting a possibly ill-composed HTML fragment from an external source:
<p>Include all the information someone would need to answer your <i><i>question<p>
How to make it safe for rendering within a bigger HTML document, closing all hanging HTML tags in Java?
You can try to parse the incoming string as XML - there are plenty of tools that do that. If it fails, it means the HTML is badly formed (for instance, not all tags are correctly closed).
If you need stronger validation, you can additionally validate it against an XSD.
You can achieve that by writing your own custom Java parser that fixes the tags.
The idea would be: collect all open tags and find each one's matching closing tag in the string; where no closing tag is found, insert one. You also need to handle duplicated tags, and tags that are already valid before and after the broken spot.
Otherwise, you can try one of these handy open-source parsers, which help with exactly that:
http://java-source.net/open-source/html-parsers
http://htmlcleaner.sourceforge.net/ looks like a good option.
Hope this helps.
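As a concrete illustration, jsoup (which also comes up in the answers below) will balance the question's fragment; a minimal sketch:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class FixFragment {
        public static void main(String[] args) {
            String broken = "<p>Include all the information someone would need "
                    + "to answer your <i><i>question<p>";
            // parseBodyFragment balances the tree: the unclosed <p> and <i>
            // tags get closed, so the output is safe to embed elsewhere.
            Document doc = Jsoup.parseBodyFragment(broken);
            System.out.println(doc.body().html());
        }
    }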

Retrieve HTML tag and plain text from URL

I want to know whether it is possible to retrieve HTML tags and plain text such as
<p>This is text </p> or <div> or This is text
by using XmlPullParser? I read here that it is not recommended. So is there an alternative way, or simple code, that would let me retrieve HTML and plain text as above? I'm still a beginner in Android. Thank you for your help.
I think your best option (which I have also used) is jsoup.
jsoup provides a very convenient API for extracting and manipulating data, using DOM traversal, CSS selectors, and jQuery-like methods. It lets you scrape and parse HTML from a URL, a file, or a string, and more.
jsoup: https://jsoup.org/
There is a nice tutorial here (not mine):
http://www.androidbegin.com/tutorial/android-basic-jsoup-tutorial/
jsoup is a great parser and one of the most commonly used.
Another thing that might be helpful is an HTML cleaner: a common problem when writing parsers is errors caused by malformed HTML files. This happens more often than you would expect, so cleaning the input first can reduce the number of errors.
A good cleaner I have used is Tidy.
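A minimal jsoup sketch of the kind of retrieval asked about (the URL is just a placeholder). On Android, remember to run the network call off the main thread:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class TagsAndText {
        public static void main(String[] args) throws Exception {
            // Fetch and parse the page; example.com is a placeholder URL
            Document doc = Jsoup.connect("https://example.com/").get();
            for (Element p : doc.select("p")) {
                System.out.println(p.outerHtml()); // the tag with its markup
                System.out.println(p.text());      // just the plain text
            }
        }
    }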

Java - Extract html information from string

All of the guides out there tell me how to remove the HTML tags from the text so as to extract the text between them. What I am after is extracting the data that is within the HTML tags themselves.
e.g.
If I have a string:
"<FONT SIZE="5">Hello World</FONT>"
I want to get the font size information to update other variables. How do I go about this?
I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware of trying to parse the page as "standard" XML: XML parsing is strict by nature and will fail if the page does not conform to the XML markup specs (which few HTML pages do).
You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.
You can use a library like Jericho HTML Parser, which enables you to search for HTML tags as well as their attributes, or you can build a DOM of your own.
Take a look at this:
http://en.wikipedia.org/wiki/Java_API_for_XML_Processing
If you parse the HTML you should be able to extract the values from the DOM tree.
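Following the jsoup suggestion above, a minimal sketch that pulls the SIZE attribute out of the question's example string:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class FontSize {
        public static void main(String[] args) {
            String html = "<FONT SIZE=\"5\">Hello World</FONT>";
            Document doc = Jsoup.parse(html);
            // jsoup normalizes tag and attribute names to lower case,
            // so "font" and "size" match the upper-case originals
            String size = doc.select("font").attr("size");
            System.out.println(size); // prints 5
        }
    }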

How do I filter a HTTP get response?

I have learnt how to make an HTTP GET request to retrieve data from a URL, but I would like to filter the response so that it gives me only a list of the links on the web page.
For example, if the HTML contained the following text:
<link href="http://www.thompsons.co.uk">
then it should print out:
http://www.thompsons.co.uk
I would strongly recommend that you DO NOT use regexes to "parse" HTML. Unless you have control over the formatting of the web pages you are processing, a solution based on regexes is liable to be fragile and buggy.
Instead, use a permissive HTML parser. This Question gives a number of alternatives: HTML/XML Parser for Java
You can use jsoup:
http://jsoup.org/cookbook/extracting-data/attributes-text-html
Read the whole response in, then parse it with a regexp to extract the links. Read more here: http://www.mkyong.com/regular-expressions/how-to-extract-html-links-with-regular-expression/
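Following the jsoup cookbook approach from the answer above, a minimal sketch (the URL is a placeholder):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ListLinks {
        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect("https://example.com/").get();
            // Match both <link href=...> and <a href=...> elements;
            // abs:href resolves relative URLs against the page's base URL
            for (Element link : doc.select("link[href], a[href]")) {
                System.out.println(link.attr("abs:href"));
            }
        }
    }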

Extract All Images From HTML Using Java

I want to get a list of all the image URLs from the HTML source of a web page (both absolute and relative URLs). I used jsoup to parse the HTML, but it's not giving me all the images. For example, when I parse google.com's HTML source it shows zero images. In google.com's HTML source the image links appear in forms like:
background:url(/intl/en_com/images/srpr/logo1w.png)
And on rediff.com the image links appear like this:
videoArr[j]=new Array("http://ishare.rediff.com/video/entertainment/bappi-da-the-first-indian-in-grammy-jury/2684982","http://datastore.rediff.com/h86-w116/thumb/5E5669666658606D6A6B6272/v3np2zgbla4vdccf.D.0.bappi.jpg","Bappi Da - the first Indian In Grammy jury","http://mypage.rediff.com/profile/getprofile/LehrenTV/12669275","LehrenTV","(2:33)");
j = 1
videoArr[j]=new Array("http://ishare.rediff.com/video/entertainment/bebo-shahid-jab-they-met-again-/2681664","http://datastore.rediff.com/h86-w116/thumb/5E5669666658606D6A6B6272/ra8p9eeig8zy5qvd.D.0.They-Met-Again.jpg","Bebo-Shahid : Jab they met again!","http://mypage.rediff.com/profile/getprofile/LehrenTV/12669275","LehrenTV","(2:17)");
Not all the images are within "img" tags. I also want to extract images that are not inside "img" tags, as in the HTML source shown above.
How can I do this? Please help me with this.
Thanks
This is going to be a bit difficult, I think. You basically need a library that will download a web page, construct the page's DOM, and execute any JavaScript that may alter the DOM. After all that is done, you have to extract all the possible images from the DOM. Another option is to intercept every call the library makes to download a resource, examine the URL, and record it if it is an image.
My suggestion would be to start by playing with HtmlUnit (http://htmlunit.sourceforge.net/gettingStarted.html). It does a good job of building the DOM. I'm not sure what hooks it has for intercepting the methods that download resources. Of course, if it doesn't provide the hooks, you can always use AspectJ or simply modify the HtmlUnit source code. Good luck; this sounds like a reasonably interesting problem. You should post your solution when you figure it out.
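A minimal HtmlUnit sketch of the DOM-building part, assuming the classic com.gargoylesoftware package of HtmlUnit 2.x (the URL is a placeholder; intercepting resource downloads would take the extra work described above):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlImage;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class DomImages {
        public static void main(String[] args) throws Exception {
            try (WebClient client = new WebClient()) {
                // Run the page's JavaScript so dynamically added images
                // end up in the DOM as well
                client.getOptions().setJavaScriptEnabled(true);
                HtmlPage page = client.getPage("https://example.com/");
                // This still only sees <img> elements; CSS backgrounds and
                // script strings need the interception approach instead
                for (Object img : page.getByXPath("//img")) {
                    System.out.println(((HtmlImage) img).getSrcAttribute());
                }
            }
        }
    }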
If you just want every image referred to in the page, can't you just scan the HTML and any linked JavaScript or CSS with a simple regex? How likely is it that [-:_./%a-zA-Z0-9]*(\.jpg|\.png|\.gif) would match something in the HTML/JS/CSS that's not an image? I'd guess not very likely. And you should be allowing for broken links anyway.
Karthik's suggestion would be more correct, but I imagine it's more important to you to just get absolutely everything and then filter out the uninteresting images.
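A minimal sketch of that regex scan with java.util.regex, run against the google.com snippet from the question:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ImageUrlScan {
        public static void main(String[] args) {
            String source =
                    "<div style=\"background:url(/intl/en_com/images/srpr/logo1w.png)\">";
            // The character class and suffix list suggested in the answer
            Pattern p = Pattern.compile("[-:_./%a-zA-Z0-9]*(\\.jpg|\\.png|\\.gif)");
            Matcher m = p.matcher(source);
            while (m.find()) {
                System.out.println(m.group()); // each candidate image URL
            }
        }
    }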
