How to get a table from an html page using JAVA


I am working on a project that fetches financial statements from the internet and uses them in a Java application to automatically create ratios and charts.
The site I am using requires a login and password to get to the tables.
The table I need is in a TBODY tag, but there are two other TBODY tags in the HTML.
How can I use Java to write my table to a text file that I can then use in my application?
What would be the best way to go about this, and what should I read up on?

If this were my project, I'd look into using an HTML parser such as jsoup (although others are available). The jsoup site has a tutorial, and after playing with it for a while, you'll likely find it pretty easy to use.
For example, for an HTML table with a header row (ACCOUNT, NAME, BALANCE) followed by data rows, like the one at the URL used below, jsoup could parse it like so:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TableEg {
    public static void main(String[] args) {
        String url = "http://publib.boulder.ibm.com/infocenter/iadthelp/v7r1/topic/" +
                "com.ibm.etools.iseries.toolbox.doc/htmtblex.htm";
        try {
            // Fetch the page and parse it into a DOM
            Document doc = Jsoup.connect(url).get();
            Elements tableElements = doc.select("table");

            // Print the header cells
            Elements tableHeaderEles = tableElements.select("thead tr th");
            System.out.println("headers");
            for (Element header : tableHeaderEles) {
                System.out.println(header.text());
            }
            System.out.println();

            // Print the cells of every row outside the header
            Elements tableRowElements = tableElements.select(":not(thead) tr");
            for (Element row : tableRowElements) {
                System.out.println("row");
                Elements rowItems = row.select("td");
                for (Element item : rowItems) {
                    System.out.println(item.text());
                }
                System.out.println();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Resulting in the following output:
headers
ACCOUNT
NAME
BALANCE
row
0000001
Customer1
100.00
row
0000002
Customer2
200.00
row
0000003
Customer3
550.00
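The question also asks how to get the table into a text file, which the example above does not cover. Here is a minimal sketch of one way to do it, assuming jsoup is on the classpath. The class name `TableToText`, the sample markup and the file name `table.txt` are made up for illustration. Each table row becomes one tab-separated line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableToText {

    // Turns every <tr> of the first <table> into one tab-separated line.
    static List<String> tableToLines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        Element table = doc.select("table").first();
        if (table == null) {
            return lines; // no table found in the markup
        }
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            // th matches header cells, td matches data cells
            for (Element cell : row.select("th, td")) {
                cells.add(cell.text());
            }
            lines.add(String.join("\t", cells));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample markup standing in for the real page
        String html = "<table><thead><tr><th>ACCOUNT</th><th>NAME</th><th>BALANCE</th></tr></thead>"
                + "<tbody><tr><td>0000001</td><td>Customer1</td><td>100.00</td></tr></tbody></table>";
        Path out = Paths.get("table.txt"); // assumed output file name
        Files.write(out, tableToLines(html), StandardCharsets.UTF_8);
    }
}
```

If the page holds several tables or TBODY sections, as in the question, the `select("table")` call would probably need a narrower selector (an id, a class, or an index) to pick the right one.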

Related

How to find specific elements within a larger element in HTML Java

Document doc = Jsoup.parse(url1, 3*1000);
// At this point I have already parsed the HTML, found all the h2 headings and analysed them.
// Now I want to go further and analyse all h4 headings within one h2 section.
String subHead = "A h2 heading";
print("Printing h4 titles of : " + subHead);
Elements sibHead; // stores all elements between this h2 title and the next
String bodySelect = "h2";
Elements kpageE = doc.select(bodySelect);
for (Element e : kpageE) {
    String estring = e.text();
    print(estring + "--------------------------------------------");
    if (estring.contentEquals(subHead)) {
        // this prints all elements in the h2 title section, but I want only the h4 titles
        sibHead = e.nextElementSiblings();
        for (Element ei : sibHead) {
            String eistr = ei.text();
            print(eistr);
        }
    }
}
I have already parsed the HTML and have a list of all h2 elements. Now I want specific elements between one h2 element and the next; more specifically, I want all h4 elements.

With jsoup you can use the getElementsByTag method (defined on Element and inherited by Document), which retrieves all the elements with a given tag name.
Here is an example of its use:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class App {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://inscription.devlab.umontp.fr/").get();
            // Every h4 in the whole document, regardless of which h2 section it is in
            Elements h4elements = doc.getElementsByTag("h4");
            for (Element h4 : h4elements) {
                System.out.println(h4.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
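The example above returns every h4 in the document, whereas the question asks only for those between one h2 and the next. A possible sketch of that narrower search follows, assuming jsoup is on the classpath and that the h2 headings and their section content are siblings under the same parent. The class name `H4UnderH2` and the sample markup are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class H4UnderH2 {

    // Collects the text of every h4 found after the h2 whose text equals
    // sectionTitle and before the next sibling h2.
    static List<String> h4TitlesInSection(Document doc, String sectionTitle) {
        List<String> titles = new ArrayList<>();
        for (Element h2 : doc.select("h2")) {
            if (!h2.text().equals(sectionTitle)) {
                continue;
            }
            Element sibling = h2.nextElementSibling();
            while (sibling != null && !sibling.tagName().equals("h2")) {
                // select() can match the sibling itself as well as its descendants
                for (Element h4 : sibling.select("h4")) {
                    titles.add(h4.text());
                }
                sibling = sibling.nextElementSibling();
            }
            break; // only the first matching h2 is handled
        }
        return titles;
    }

    public static void main(String[] args) {
        // Hypothetical markup: two h2 sections with h4 titles inside them
        String html = "<h2>First</h2><h4>a</h4><p>text</p><div><h4>b</h4></div>"
                + "<h2>Second</h2><h4>c</h4>";
        System.out.println(h4TitlesInSection(Jsoup.parse(html), "First"));
    }
}
```

If the headings are wrapped in separate containers rather than being siblings, the walk would have to start from the container instead of the h2 itself.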

Jsoup Check if a tag exists

I'm using jsoup to extract some ads from a page and I need to check whether a class exists, but I'm not doing it right.
Here is the HTML:
I need to check whether the class .large-4.medium-5.large-text-right.medium-text-right.columns exists and, if so, extract the element inside it, but I'm stuck at checking whether that class exists.
Here is my code:
Elements pageSearchPrice = page2
        .select("li[itemtype=https://schema.org/Offer] > div[class=listing-data]");
for (int j = 0; j < pageSearchTitle.size(); j++) {
    if (pageSearchPrice.get(j).hasClass(".large-4.medium-5.large-text-right.medium-text-right.columns")) {
        String price = pageSearchPrice.get(j).select("strong[itemprop=price]").text();
        list.get(index1).setPrice(price);
        index1++;
    } else {
        list.get(index1).setPrice("No price");
        index1++;
    }
}
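One likely cause of the problem in the code above: jsoup's hasClass expects a single bare class name such as "columns", not a chained CSS selector with dots, so the check never succeeds. Here is a hedged sketch of an alternative that runs the chained selector through select and tests whether anything matched. It assumes jsoup is on the classpath and that the price block sits inside the listing element. The class name `PriceLookup` and the sample markup are made up, since the original HTML is not shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class PriceLookup {

    // CSS selector for an element carrying all five classes from the question
    static final String PRICE_BOX =
            ".large-4.medium-5.large-text-right.medium-text-right.columns";

    // Returns the price text inside the listing, or "No price" when the block is absent.
    static String priceOf(Element listing) {
        Elements priceBox = listing.select(PRICE_BOX);
        if (priceBox.isEmpty()) {
            return "No price";
        }
        return priceBox.select("strong[itemprop=price]").text();
    }

    public static void main(String[] args) {
        // Hypothetical markup: one listing with a price block, one without
        String html = "<div class='listing-data'>"
                + "<div class='large-4 medium-5 large-text-right medium-text-right columns'>"
                + "<strong itemprop='price'>100</strong></div></div>"
                + "<div class='listing-data'><span>no price here</span></div>";
        Document doc = Jsoup.parse(html);
        for (Element listing : doc.select("div.listing-data")) {
            System.out.println(priceOf(listing));
        }
    }
}
```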

using Jsoup to extract a table inside several divs

I am trying to use jsoup to access a table embedded inside multiple divs of an HTML page. The table is under the outer div with id "content-top". The inner divs leading to the table are: content-top -> center -> middle-right-col -> result.
Under the div result is the table round. This is the table that I want to access and whose rows I need to traverse, printing out the data contained in them. Below is the Java code I have been trying to use, but it yields no results:
Document doc = Jsoup.connect("http://www.calculator.com/#").data("express", "sin(x)").data("calculate","submit").post();
// give the application time to calculate result before retrieving result from results table
try {
    Thread.sleep(10000);
} catch (InterruptedException ex) {
    Thread.currentThread().interrupt();
}
Elements content = doc.select("div#result");
Element tables = content.get(0);
Elements table_rows = tables.select("tr");
Iterator iterRows = table_rows.iterator();
while (iterRows.hasNext()) {
    Element tr = (Element) iterRows.next();
    Elements table_data = tr.select("td");
    Iterator iterData = table_data.iterator();
    int tdCount = 0;
    String f_x_value = null;
    String result = null;
    // process new line
    while (iterData.hasNext()) {
        Element td = (Element) iterData.next();
        switch (tdCount++) {
            case 1:
                f_x_value = td.text();
                f_x_value = td.select("a").text();
                break;
            case 2:
                result = td.text();
                result = td.select("a").text();
                break;
        }
    }
    System.out.println(f_x_value + " " + result);
}
The above code crashes and does not do what I want it to do. Can anyone please help me?

public static String do_conversion(String str) {
    char c;
    String output = "{";
    for (int i = 0; i < str.length(); i++) {
        c = str.charAt(i);
        if (c == 'e')
            output += "{mathrm{e}}";
        else if (c == '(')
            output += '{';
        else if (c == ')')
            output += '}';
        else if (c == '+')
            output += "{cplus}";
        else if (c == '-')
            output += "{cminus}";
        else if (c == '*')
            output += "{cdot}";
        else if (c == '/')
            output += "{cdivide}";
        else
            output += c; // else copy the character normally
    }
    output += ", mathrm{d}x}";
    return output;
}
@Syam S

The page doesn't directly give you a table in a div with id "result". It makes an AJAX call to a PHP file to get the work done. So what you need to do first is build a JSON like
{"expression":"sin(x)","intVar":"x","upperBound":"","lowerBound":"","simplifyExpressions":false,"latex":"\\displaystyle\\int\\limits^{}_{}{\\sin\\left(x\\right)\\, \\mathrm{d}x}"}
The expression key holds the expression that you want to evaluate, and latex is a MathJax expression. You then post this to int.php, which expects two arguments: q, which is the above JSON, and v, which seems to be a constant value, 1380119311. I didn't understand what this is.
Now this will return a response like
<html>
<head></head>
<body>
<table class="round">
<tbody>
<tr class="">
<th>$f(x) =$</th>
<td>$\sin\left(x\right)$</td>
</tr>
<tr class="sep odd">
<th>$\displaystyle\int{f(x)}\, \mathrm{d}x =$</th>
<td>$-\cos\left(x\right)$</td>
</tr>
</tbody>
</table>
<!-- Finished in 155 ms -->
<p id="share"> <img src="layout/32x32xshare.png.pagespeed.ic.i3iroHP5fI.png" width="32" height="32" /> <a id="share-link" href="http://www.integral-calculator.com/#expr=sin%28x%29" onclick="window.prompt(&quot;To copy this link to the clipboard, press Ctrl+C, Enter.&quot;, $(&quot;share-link&quot;).href); return false;">Direct link to this calculation (for sharing)</a> </p>
</body>
</html>
The table in this response gives you the result, and the site uses MathJax to display it.
A sample program would be
import java.io.IOException;

import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupParser6 {
    public static void main(String[] args) {
        try {
            // Integral
            String url = "http://www.integral-calculator.com/int.php";
            String q = "{\"expression\":\"sin(4x) * e^(-x)\",\"intVar\":\"x\",\"upperBound\":\"\",\"lowerBound\":\"\",\"simplifyExpressions\":false,\"latex\":\"\\\\displaystyle\\\\int\\\\limits^{}_{}{\\\\sin\\\\left(4x\\\\right){\\\\cdot}{\\\\mathrm{e}}^{-x}\\\\, \\\\mathrm{d}x}\"}";
            Document integralDoc = Jsoup.connect(url).data("q", q).data("v", "1380119311").post();
            System.out.println(integralDoc);
            System.out.println("\n*******************************\n");

            // Differential
            url = "http://www.derivative-calculator.net/diff.php";
            q = "{\"expression\":\"sin(x)\",\"diffVar\":\"x\",\"diffOrder\":1,\"simplifyExpressions\":false,\"showSteps\":false,\"latex\":\"\\\\dfrac{\\\\mathrm{d}}{\\\\mathrm{d}x}\\\\left(\\\\sin\\\\left(x\\\\right)\\\\right)\"}";
            Document differentialDoc = Jsoup.connect(url).data("q", q).data("v", "1380119305").post();
            System.out.println(differentialDoc);
            System.out.println("\n*******************************\n");

            // Calculus
            url = "http://calculus-calculator.com/calculation/integrate.php";
            Document calculusDoc = Jsoup.connect(url).data("expression", "sin(x)").data("intvar", "x").post();
            // The response arrives escaped, so unescape it before parsing it again
            String outStr = StringEscapeUtils.unescapeJava(calculusDoc.toString());
            Document formattedOutPut = Jsoup.parse(outStr);
            formattedOutPut.body().html(formattedOutPut.select("div.isteps").toString());
            System.out.println(formattedOutPut);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Update based on comment.
The unescape works perfectly well. In MathJax you can right-click a formula and view its TeX commands. So if you go to your site http://calculus-calculator.com/, try the sin(x) equation there, then right-click the result and view the TeX commands,
you can see that the commands are exactly the ones we get after the unescape. The demo site is just not rendering them. That may be a limitation of the demo site, that's all.
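The original question was about walking the rows of the result table, so here is a rough sketch of that last step, applied to a response shaped like the one shown above. It assumes jsoup is on the classpath and that each row holds one th label and one td value. The class name `ResultTableReader` is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ResultTableReader {

    // Maps each row's <th> label to its <td> value for the table with class "round".
    static Map<String, String> readRows(String responseHtml) {
        Document doc = Jsoup.parse(responseHtml);
        Map<String, String> rows = new LinkedHashMap<>(); // keeps the row order
        for (Element tr : doc.select("table.round tr")) {
            Element label = tr.select("th").first();
            Element value = tr.select("td").first();
            if (label != null && value != null) {
                rows.put(label.text(), value.text());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Trimmed copy of the response shown above, used as a stand-in
        String html = "<table class=\"round\"><tbody>"
                + "<tr><th>$f(x) =$</th><td>$\\sin\\left(x\\right)$</td></tr>"
                + "</tbody></table>";
        readRows(html).forEach((label, value) -> System.out.println(label + " " + value));
    }
}
```

In a real run the string passed to readRows would be the body of the document returned by the post() call rather than a literal.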

Java - HTML code: extract part of the tag

I have to extract some integers from a tag in some HTML code.
For example, if I have:
<tag blabla="title"><a href="/test/tt123"> TEST 1 </a></tag>
I did that by removing all the characters and leaving only the digits, and it worked until the title also contained a digit, so I got "1231".
str.replaceAll("[^\\d.]", "");
How can I extract only the "123" integer? Thanks for your help!

jsoup is a good API for working with HTML. Using it, you could do something like
String html = "<tag blabla=\"title\"><a href=\"/test/tt123\"> TEST 1 <tag>";
Document doc = Jsoup.parseBodyFragment(html);
String value = doc.select("a").get(0).attr("href").replaceAll("[^\\d.]", "");
System.out.println(value);
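The key idea in the snippet above is to apply the digit filter to the href value only, not to the whole tag. The same idea can be sketched with nothing but java.util.regex, if adding a library is not an option. Note that this swaps jsoup for plain regular expressions, which are fragile on real-world HTML. The class name `HrefDigits` is made up for illustration, and the sketch assumes the href value is wrapped in double quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefDigits {

    // Captures the value of the first href="..." attribute
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");
    // Matches a run of one or more digits
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the first run of digits inside the first href, or "" if there is none.
    static String digitsInHref(String html) {
        Matcher href = HREF.matcher(html);
        if (!href.find()) {
            return "";
        }
        Matcher digits = DIGITS.matcher(href.group(1));
        return digits.find() ? digits.group() : "";
    }

    public static void main(String[] args) {
        String html = "<tag blabla=\"title\"><a href=\"/test/tt123\"> TEST 1 </a></tag>";
        System.out.println(digitsInHref(html)); // prints 123
    }
}
```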

You could do this (a method that removes all duplicate digits from a string of digits):
// needs: import java.util.HashSet; import java.util.Set;
int[] foo = new int[str.length()];
for (int i = 0; i < str.length(); i++) {
    // Character.getNumericValue turns the char '7' into the int 7
    foo[i] = Character.getNumericValue(str.charAt(i));
}
Set<Integer> set = new HashSet<Integer>();
for (int i = 0; i < foo.length; i++) {
    set.add(foo[i]);
}
Now you have a set where all duplicate digits from the string are removed. I only just saw your last comment, so this answer might not be very useful to you. What you could do instead is take the first three digits in the foo array, which will give you 123.

First use XPath to pull out only the href value, then apply your replaceAll to get what you want.
You don't have to download any additional frameworks or libraries for this to work.
Here's a quick demo class showing how this works:
package com.example.test;

import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class Test {
    public static void main(String[] args) {
        // XPath needs well-formed XML, so both <a> and <tag> are closed here
        String xml = "<tag blabla=\"title\"><a href=\"/test/tt123\"> TEST 1 </a></tag>";
        XPath xPath = XPathFactory.newInstance().newXPath();
        InputSource source = new InputSource(new StringReader(xml));
        String hrefValue = "";
        try {
            // //@href selects the href attributes; as a STRING it yields the first one's value
            hrefValue = (String) xPath.evaluate("//@href", source, XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }
        String numbers = hrefValue.replaceAll("[^\\d.]", "");
        System.out.println(numbers);
    }
}

get all the children for a given xml in java

I am basically following the example here:
http://www.mkyong.com/java/how-to-read-xml-file-in-java-jdom-example/
So right now I am doing something like
node.getChildText("firstname")
and this works fine.
But is there a way to get all the "keys" so that I can then query them to get the values, just like we do when parsing JSON,
JSONObject json = (JSONObject) parser.parse(value);
for (Object key : json.keySet()) {
    Object val = json.get(key);
}
rather than hardcoding keys and values?
Thanks
Code for reference:
package org.random_scripts;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.JDOMException;
import org.jdom2.input.SAXBuilder;

public class XMLReader {
    public static void main(String[] args) {
        SAXBuilder builder = new SAXBuilder();
        File xmlFile = new File("data.xml");
        try {
            Document document = builder.build(xmlFile);
            Element rootNode = document.getRootElement();
            List<Element> list = rootNode.getChildren("staff");
            List<Element> children = rootNode.getChildren();
            System.out.println(children);
            for (int i = 0; i < list.size(); i++) {
                Element node = list.get(i);
                System.out.println("First Name : " + node.getChildText("firstname"));
                System.out.println("Last Name : " + node.getChildText("lastname"));
                System.out.println("Nick Name : " + node.getChildText("nickname"));
                System.out.println("Salary : " + node.getChildText("salary"));
            }
        } catch (IOException io) {
            System.out.println(io.getMessage());
        } catch (JDOMException jdomex) {
            System.out.println(jdomex.getMessage());
        }
    }
}

Well, if you wanted to write out all of the children of each node, you could do something like this:
List<Element> children = rootNode.getChildren();
for (int i = 0; i < children.size(); i++) {
    Element node = children.get(i);
    // Walk the node's own children without knowing their names in advance
    List<Element> dataNodes = node.getChildren();
    for (int j = 0; j < dataNodes.size(); ++j) {
        Element dataNode = dataNodes.get(j);
        System.out.println(dataNode.getName() + " : " + dataNode.getText());
    }
}
This would let you write out all of the children without knowing the names, with the only downside being that you wouldn't have "pretty" names for the fields (i.e. "First Name" instead of "firstname"). Of course, you'd have the same limitation in JSON - I don't know of an easy way to get pretty names for the fields unless your program has some knowledge about what the children are, which is the thing you seem to be trying to avoid.

The above code only provides the list of first-level children under the tag.
For example:
<parent>
    <child1>
        <childinternal></childinternal>
    </child1>
    <child2></child2>
</parent>
The above code only prints child1 and child2; if you want to print the internal nodes in depth as well, you have to make a recursive call.
To find out whether a child has more nodes in it, use the JDOM API child.getContentSize(); if it is greater than 1, that means it has more nodes. (Content also counts text nodes, so checking whether child.getChildren() is non-empty is a more direct test for child elements.)
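The recursive call mentioned above could look roughly like the sketch below. To keep it runnable without extra dependencies it uses the JDK's built-in org.w3c.dom API instead of JDOM, so it is a stand-in rather than JDOM code; the same recursion should carry over to JDOM by looping over getChildren(). The class name `AllElements` and the path format are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AllElements {

    // Adds the path of this element, then recurses into each child element.
    static void collect(Element element, String prefix, List<String> out) {
        String path = prefix.isEmpty() ? element.getTagName() : prefix + "/" + element.getTagName();
        out.add(path);
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) { // skip text and comments
                collect((Element) child, path, out);
            }
        }
    }

    // Parses the XML string and returns the path of every element, depth first.
    static List<String> paths(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), "", out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<parent><child1><childinternal/></child1><child2/></parent>";
        paths(xml).forEach(System.out::println);
    }
}
```

For the sample document above this lists parent, parent/child1, parent/child1/childinternal and parent/child2, one per line.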
