Parse data from webpage to android app using Jsoup - java

My Android app has a part where I need to parse data from Wikipedia and use it in the application. When I go to https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data I can see the COVID-19 cases. I want to retrieve the numbers from that table.
I am using Jsoup, and I am able to get the HTML data by using https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data. Can you guide me on how to extract the India cases and deaths from the HTML, given that the document is huge and there are no attributes on the tr elements? There's not much information about this on the internet. Here is what I have tried so far:
private void getWebsite() {
    new Thread(new Runnable() {
        @Override
        public void run() {
            final StringBuilder builder = new StringBuilder();
            String web_link = "https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data";
            try {
                Document doc = Jsoup.connect(web_link).get();
                String title = doc.title();
                Elements links = doc.select("tr");
                builder.append(title).append("\n");
                for (Element link : links) {
                    builder.append(link);
                }
            } catch (IOException e) {
                builder.append("Error : ").append(e.getMessage()).append("\n");
            }
            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    textView.setText(builder.toString());
                }
            });
        }
    }).start();
}

The problem is related to the format of the data (XML). When you navigate down the XML elements, what appears to be in the document when viewed in your browser is:
<someTag>...</someTag>
But what's actually present is the XML-encoded version of the string:
&lt;someTag&gt;...&lt;/someTag&gt;
Jsoup won't work well with this, and I'd imagine you'll need further processing to convert that output back into real markup to get it working. You can test this yourself by viewing the result of:
doc.getElementsByTag("text")
And you'd need to replace all &lt; and &gt; tokens with < and >, respectively.
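In practice, Jsoup will do that decoding for you if you read the payload back out as text and parse it a second time. A minimal sketch of that idea (it assumes the encoded markup sits inside the API response's <text> element, and web_link is the api.php URL from the question):
// Sketch only: re-parse the entity-encoded HTML carried inside the API response's <text> element.
Document apiDoc = Jsoup.connect(web_link).get();
String innerHtml = apiDoc.getElementsByTag("text").text(); // text() yields the decoded "<table>..." markup
Document tableDoc = Jsoup.parse(innerHtml);                // now an ordinary HTML document
Elements rows = tableDoc.select("table tr");               // normal selectors work again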
Here's what I tried, plus some minor edits after failing to pull tbody/thead/th. I then started trying to pull from the top-level tag, starting with api and moving deeper into the DOM.
final StringBuilder builder = new StringBuilder();
String url = "https://en.wikipedia.org/w/api.php?format=xml&action=parse&page=Template:COVID-19_pandemic_data";
try {
    Document doc = Jsoup.connect(url).get();
    String title = doc.getElementsByTag("parse").attr("title");
    // ...never got much deeper than the top-level elements this way
} catch (IOException e) {
    e.printStackTrace();
}
Also worth mentioning: there are some really good examples in the documentation here: https://jsoup.org/cookbook/extracting-data/dom-navigation
And finally, for what it's worth, I'd change the URL to https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data to make life easier with Jsoup, so you can pull the relevant bits of data straight from HTML rather than XML.
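For example, a rough sketch along those lines (the :contains(India) selector and the cell positions are assumptions about the template's current table layout and may need adjusting if Wikipedia changes it):
// Sketch only: scrape the India row from the HTML version of the template.
Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data").get();
Element indiaRow = doc.select("table.wikitable tr:contains(India)").first();
if (indiaRow != null) {
    Elements cells = indiaRow.select("td");
    String cases = cells.get(0).text();   // assumed: first data cell holds cases
    String deaths = cells.get(1).text();  // assumed: second data cell holds deaths
    System.out.println("India cases: " + cases + ", deaths: " + deaths);
}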
In my view, if you have the choice, HtmlUnit would be a better tool for this since you can simply specify an XPath for the HTML element you want to extract without having to use multiple method calls to get what you want... the more concise format means there's less room for errors to hide.
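For comparison, a minimal HtmlUnit sketch (the XPath is an assumption about the page's current markup, and asNormalizedText() is asText() in older HtmlUnit releases):
// Sketch only: locate the row whose country link points at India and print its text.
try (WebClient webClient = new WebClient()) {
    webClient.getOptions().setJavaScriptEnabled(false);
    HtmlPage page = webClient.getPage("https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data");
    HtmlTableRow indiaRow = page.getFirstByXPath("//tr[.//a[@title='India']]");
    if (indiaRow != null) {
        System.out.println(indiaRow.asNormalizedText());
    }
} catch (IOException e) {
    e.printStackTrace();
}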

Related

How to load local html file in java swing?

I have a tree list which will open a specific HTML file when I click on a node. I tried loading my HTML into a JEditorPane, but I can't seem to get it to work.
Here's my code from main file:
private void treeItemMouseClicked(java.awt.event.MouseEvent evt) {
    DefaultMutableTreeNode selectedNode = (DefaultMutableTreeNode) treeItem.getSelectionPath().getLastPathComponent();
    String checkLeaf = selectedNode.getUserObject().toString();
    if (checkLeaf.equals("Java Turtorial 1")) { // compare strings with equals(), not ==
        String htmlURL = "/htmlFILE/javaTurtorial1.html";
        new displayHTML(htmlURL).setVisible(true);
    }
}
Where I want to display it:
public displayHTML(String htmlURL) {
    initComponents();
    try {
        // Display html file
        editorHTML.setPage(htmlURL);
    } catch (IOException ex) {
        Logger.getLogger(displayHTML.class.getName()).log(Level.SEVERE, null, ex);
    }
}
My files:
One simple way to render HTML with JEditorPane is using its setText method:
JEditorPane editorPane =...
editorPane.setContentType( "text/html" );
editorPane.setText( "<html><body><h1>I'm an html to render</h1></body></html>" );
Note that only certain HTML pages (relatively simple ones) can be rendered with JEditorPane; if you need something more complicated, you'll have to use third-party components.
Based on OP's comment, I'm adding an update to the answer:
Update
Since the HTML files that you're trying to load are inside the JAR, you should read the file into a string variable and use the aforementioned setText method.
Note that you shouldn't use java.io.File, because it is used to identify resources on the filesystem, and you're trying to access something inside the artifact.
Reading the resource like this can be done with the following construction:
InputStream is = getClass().getResourceAsStream("/htmls/myhtml.html");
// ...then read it in any of a variety of ways, depending on the Java version of your choice
// and the availability of auxiliary third-party libraries.
// Here is the simplest way, IMO, for Java 9+:
String htmlString = new String(is.readAllBytes(), StandardCharsets.UTF_8);
Read here about the many different ways to read an InputStream into a String.
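Putting the pieces together, a minimal sketch of the constructor from the question (it assumes Java 9+ for readAllBytes and that htmlURL is a classpath resource path such as /htmlFILE/javaTurtorial1.html):
public displayHTML(String htmlURL) {
    initComponents();
    try (InputStream is = getClass().getResourceAsStream(htmlURL)) {
        if (is == null) {
            throw new IOException("Resource not found on classpath: " + htmlURL);
        }
        String htmlString = new String(is.readAllBytes(), StandardCharsets.UTF_8);
        editorHTML.setContentType("text/html");
        editorHTML.setText(htmlString);
    } catch (IOException ex) {
        Logger.getLogger(displayHTML.class.getName()).log(Level.SEVERE, null, ex);
    }
}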

Jsoup not seeing some text on website

Currently I am making a program (in Java) that grabs all the streamers on Twitch (a video game streaming site) from a given URL (e.g. the Hearthstone directory page used in the code below) and lists them in a text file using Jsoup.
However, no matter what I try, it seems like I can't get the streamers' names. After a while I discovered that the page source for some reason does not contain the streamers' names, which I think could be the problem?
Here is my code currently.
public static void main(String[] args) throws IOException {
    int i = 0;
    PrintWriter streamerwriter = new PrintWriter("streamer.txt", "UTF-8");
    Document doc = Jsoup.connect("https://www.twitch.tv/directory/game/Hearthstone%3A%20Heroes%20of%20Warcraft").get();
    Elements streamers = doc.getElementsByClass("js-profile-link");
    for (Element streamer : streamers) {
        i++;
        System.out.println(i + "." + streamer.text());
        streamerwriter.println(i + "." + streamer.text());
    }
    streamerwriter.close();
}
Any help would be greatly appreciated.
You don't need to parse the webpage, because Twitch has an API to list streamers:
https://streams.twitch.tv/kraken/streams?limit=20&offset=0&game=Hearthstone%3A+Heroes+of+Warcraft&broadcaster_language=&on_site=1
So you should parse the JSON data instead.
If you want to know why you don't see the streamers with Jsoup: it's lazy loading. The part of the page you want to parse is loaded lazily, so you have to find that lazy request and parse its URL instead, which is the Twitch API URL I found and wrote up above.
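A rough sketch of fetching that JSON (the endpoint and field names belong to Twitch's legacy kraken API and may change; the parsing below assumes the org.json library is on the classpath):
// Sketch only: fetch the API response as raw JSON via Jsoup.
// ignoreContentType(true) is needed because the response is not HTML.
String apiUrl = "https://streams.twitch.tv/kraken/streams?limit=20&offset=0"
        + "&game=Hearthstone%3A+Heroes+of+Warcraft&broadcaster_language=&on_site=1";
String json = Jsoup.connect(apiUrl).ignoreContentType(true).execute().body();

// "streams" -> "channel" -> "display_name" is how the kraken API structured its response.
JSONArray streams = new JSONObject(json).getJSONArray("streams");
for (int i = 0; i < streams.length(); i++) {
    String name = streams.getJSONObject(i).getJSONObject("channel").getString("display_name");
    System.out.println((i + 1) + "." + name);
}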
Please check this question out:
how to use Jsoup in site that has lazyload scrollLoader.js

ghost4j class cast exception during joining two PostScripts

I am trying to join two PostScript files to one with ghost4j 0.5.0 as follows:
final PSDocument[] psDocuments = new PSDocument[2];
psDocuments[0] = new PSDocument();
psDocuments[0].load("1.ps");
psDocuments[1] = new PSDocument();
psDocuments[1].load("2.ps");
psDocuments[0].append(psDocuments[1]);
psDocuments[0].write("3.ps");
During this simplified process I got the following exception message for the above "append" line:
org.ghost4j.document.DocumentException: java.lang.ClassCastException:
org.apache.xmlgraphics.ps.dsc.events.UnparsedDSCComment cannot be cast to
org.apache.xmlgraphics.ps.dsc.events.DSCCommentPage
So far I have not managed to find out what the problem is. Maybe it's some kind of problem within one of the PostScript files?
So help would be appreciated.
EDIT:
I tested with ghostScript commandline tool:
gswin32.exe -dQUIET -dBATCH -dNOPAUSE -sDEVICE=pswrite -sOutputFile="test.ps" --filename "1.ps" "2.ps"
which results in a document where 1.ps and 2.ps are merged into one(!) page (i.e. overlaid).
When I remove --filename, the resulting document is a PostScript file with two pages, as expected.
The exception occurs because one of the 2 documents does not follow the Adobe Document Structuring Convention (DSC), which is mandatory if you want to use the Document append method.
Use the SafeAppenderModifier instead. There is an example here: http://www.ghost4j.org/highlevelapisamples.html (Append a PDF document to a PostScript document)
I think something is wrong in the document or in the XMLGraphics library as it seems it cannot parse a part of it.
Here is the code in ghost4j that I think is failing (link):
DSCParser parser = new DSCParser(bais);
Object tP = parser.nextDSCComment(DSCConstants.PAGES);
while (tP instanceof DSCAtend)
    tP = parser.nextDSCComment(DSCConstants.PAGES);
DSCCommentPages pages = (DSCCommentPages) tP;
And here you can see why XMLGraphics may be responsible (link):
private DSCComment parseDSCComment(String name, String value) {
    DSCComment parsed = DSCCommentFactory.createDSCCommentFor(name);
    if (parsed != null) {
        try {
            parsed.parseValue(value);
            return parsed;
        } catch (Exception e) {
            //ignore and fall back to unparsed DSC comment
        }
    }
    UnparsedDSCComment unparsed = new UnparsedDSCComment(name);
    unparsed.parseValue(value);
    return unparsed;
}
It seems parsed.parseValue(value) threw an exception; it was swallowed by the catch block, and an UnparsedDSCComment that ghost4j didn't expect was returned instead.

How to extract headline titles followed by respective text from Wikipedia

I am trying to use Jsoup in order to extract text from Wikipedia articles.
My idea is simply to extract every headline and its respective text paragraphs.
I am having some trouble understanding how I can take only the specific text of each section. Here's what I have:
public static void main(String[] args) {
    String url = "http://en.wikipedia.org/wiki/Albert_Einstein";
    Document doc;
    try {
        doc = Jsoup.connect(url).get();
        doc = Jsoup.parse(doc.toString());
        Elements titles = doc.select(".mw-headline");
        PrintStream out = new PrintStream(new FileOutputStream("output.txt"));
        System.setOut(out);
        for (Element h3 : doc.select(".mw-headline")) {
            String title = h3.text();
            String titleID = h3.id();
            Elements paragraphs = doc.select("p#" + titleID);
            //Element nextEle = h3.nextElementSibling();
            System.out.println(title);
            System.out.println("----------------------------------------");
            System.out.println(titleID);
            System.out.print("\n");
            System.out.println(paragraphs.text());
            System.out.print("\n");
        }
    } catch (IOException e) {
        System.out.println("deu merda");
        e.printStackTrace();
    }
}
With this I can extract every headline, but I can't figure out how to get the text of each section in order to print it accordingly. I was thinking maybe the headline's ID would help, but no dice.
Thank you for any help!
Depending on the tag structure of the page (if any), that could be complicated. A better alternative is to iterate over all the elements, detecting headlines. Every time you detect a new headline (or you reach the end of the elements), a new section starts; all the elements seen up to that point belong to the previous headline (or to the lead of the article if there is no previous headline). A sketch of this approach follows.
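A rough sketch with Jsoup (it assumes the article body lives in div.mw-parser-output and that headings are h2/h3 elements containing a .mw-headline span, which is how Wikipedia currently renders articles; adjust the selectors if the markup differs):
// Sketch only: walk the article body in document order and group text under the
// most recently seen headline.
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Albert_Einstein").get();
Element body = doc.selectFirst("div.mw-parser-output"); // assumed container of the article content

String currentTitle = "(lead section)";
StringBuilder currentText = new StringBuilder();

for (Element el : body.children()) {
    if (el.is("h2, h3")) {                       // a new headline: flush the previous section
        System.out.println(currentTitle);
        System.out.println("----------------------------------------");
        System.out.println(currentText + "\n");
        currentTitle = el.select(".mw-headline").text();
        currentText = new StringBuilder();
    } else if (el.is("p")) {                     // paragraph text belongs to the current section
        currentText.append(el.text()).append("\n");
    }
}
// flush the last section
System.out.println(currentTitle);
System.out.println("----------------------------------------");
System.out.println(currentText);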

Java program to download images from a website and display the file sizes

I'm creating a Java program that will read an HTML document from a URL and display the sizes of the images it references. I'm not sure how to go about achieving this, though.
I wouldn't need to actually download and save the images; I just need the sizes and the order in which they appear on the webpage.
for example:
a webpage has 3 images
<img src="dog.jpg" /> //which is 54kb
<img src="cat.jpg" /> //which is 75kb
<img src="horse.jpg"/> //which is 80kb
i would need the output of my java program to display
54kb
75kb
80kb
Any ideas where I should start?
P.S. I'm a bit of a Java newbie.
If you're new to Java you may want to leverage an existing library to make things a bit easier. Jsoup allows you to fetch an HTML page and extract elements using CSS-style selectors.
This is just a quick and very dirty example, but I think it will show how easy Jsoup can make such a task. Please note that error handling and response-code handling were omitted; I merely wanted to pass on the general idea:
Document doc = Jsoup.connect("http://stackoverflow.com/questions/14541740/java-program-to-download-images-from-a-website-and-display-the-file-sizes").get();
Elements imgElements = doc.select("img[src]");
Map<String, String> fileSizeMap = new HashMap<String, String>();
for(Element imgElement : imgElements){
String imgUrlString = imgElement.attr("abs:src");
URL imgURL = new URL(imgUrlString);
HttpURLConnection httpConnection = (HttpURLConnection) imgURL.openConnection();
String contentLengthString = httpConnection.getHeaderField("Content-Length");
if(contentLengthString == null)
contentLengthString = "Unknown";
fileSizeMap.put(imgUrlString, contentLengthString);
}
for(Map.Entry<String, String> mapEntry : fileSizeMap.entrySet()){
String imgFileName = mapEntry.getKey();
System.out.println(imgFileName + " ---> " + mapEntry.getValue() + " bytes");
}
You might also consider looking at Apache HttpClient. I find it generally preferable over the raw URLConnection/HttpURLConnection approach.
You should break your problem into 3 sub-problems:
Download the HTML document
Parse the HTML document and find the images
Download the images and determine its size
You can use regular expressions to find the img tags and get the image URLs. After that you'll need the HttpURLConnection class to get the image data and measure its size. A quick sketch of the regex part is below.
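A minimal sketch of that regex idea (crude by design; it assumes the src attribute is quoted, and a real HTML parser would handle the edge cases this misses):
// Sketch only: pull src attributes out of <img> tags in document order.
Pattern imgPattern = Pattern.compile("<img[^>]+src\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);
Matcher matcher = imgPattern.matcher(html); // "html" is the page source you downloaded
while (matcher.find()) {
    System.out.println(matcher.group(1));   // the image URL
}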
You can do this:
try {
    URL urlConn = new URL("http://yoururl.com/cat.jpg");
    URLConnection urlC = urlConn.openConnection();
    System.out.println(urlC.getContentLength());
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
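If you want the size without transferring the image body at all, a small variation (a sketch; some servers omit the Content-Length header, in which case this prints -1) is to issue a HEAD request:
// Sketch only: a HEAD request asks for headers only, so the image bytes are never downloaded.
try {
    HttpURLConnection conn = (HttpURLConnection) new URL("http://yoururl.com/cat.jpg").openConnection();
    conn.setRequestMethod("HEAD");
    System.out.println(conn.getContentLength()); // -1 if the server sends no Content-Length
    conn.disconnect();
} catch (IOException e) {
    e.printStackTrace();
}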
