How t get specific value from html in java?

How t get specific value from html in java? - java

I am developing one Application which show Gold rate and create graph for this.
I find one website which provide me this gold rate regularly.My question is how to extract this specific value from html page.
Here is link which i need to extract = http://www.todaysgoldrate.co.in/todays-gold-rate-in-pune/ and this html page have following tag and content.
<p><em>10 gram gold Rate in pune = Rs.31150.00</em></p>
Here is my code which i use for extracting but i didn't find way to extract specific content.
public class URLExtractor {
private static class HTMLPaserCallBack extends HTMLEditorKit.ParserCallback {
private Set<String> urls;
public HTMLPaserCallBack() {
urls = new LinkedHashSet<String>();
}
public Set<String> getUrls() {
return urls;
}
#Override
public void handleSimpleTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
#Override
public void handleStartTag(Tag t, MutableAttributeSet a, int pos) {
handleTag(t, a, pos);
}
private void handleTag(Tag t, MutableAttributeSet a, int pos) {
if (t == Tag.A) {
Object href = a.getAttribute(HTML.Attribute.HREF);
if (href != null) {
String url = href.toString();
if (!urls.contains(url)) {
urls.add(url);
}
}
}
}
}
public static void main(String[] args) throws IOException {
InputStream is = null;
try {
String u = "http://www.todaysgoldrate.co.in/todays-gold-rate-in-pune/";
//Here i need to extract this content by tag wise or content wise....
Thanks in Advance.......

You can use library like Jsoup
You can get it from here --> Download Jsoup
Here is its API reference --> Jsoup API Reference
Its really very easy to parse HTML content using Jsoup.
Below is a sample code which might be helpful to you..
public class GetPTags {
public static void main(String[] args){
Document doc = Jsoup.parse(readURL("http://www.todaysgoldrate.co.intodays-gold-rate-in-pune/"));
Elements p_tags = doc.select("p");
for(Element p : p_tags)
{
System.out.println("P tag is "+p.text());
}
}
public static String readURL(String url) {
String fileContents = "";
String currentLine = "";
try {
BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
fileContents = reader.readLine();
while (currentLine != null) {
currentLine = reader.readLine();
fileContents += "\n" + currentLine;
}
reader.close();
reader = null;
} catch (Exception e) {
JOptionPane.showMessageDialog(null, e.getMessage(), "Error Message", JOptionPane.OK_OPTION);
e.printStackTrace();
}
return fileContents;
}
}

http://java-source.net/open-source/crawlers
You can use any of that's apis, but don't parse the HTML with the pure JDK, because it's too painfull.

Related

Google Search with JSoup

i trie to search in google with JSoup. The problem that i have is, that the variable query shows not the URL i want when i start searching.
Also, how does Jsoup search ? Looking for Title or URL or what ?
public class Start {
public static void main(String[] args) {
try {
new Google().Searching("Möbel Beck GmbH & Co.KG");
} catch (Exception e) {
System.out.println(e.getMessage());
}
}
}
public class Google implements Serializable {
private static final long serialVersionUID = 1L;
private static Pattern patternDomainName;
private Matcher matcher;
private static final String DOMAIN_NAME_PATTERN = "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,6}";
static {
patternDomainName = Pattern.compile(DOMAIN_NAME_PATTERN);
}
public void Searching(String searchstring) throws IOException {
Google obj = new Google();
Set<String> result = obj.getDataFromGoogle(searchstring);
for (String temp : result) {
if (temp.contains(searchstring)) {
System.out.println(temp + " ----> CONTAINS");
} else {
System.out.println(temp);
}
}
System.out.println(result.size());
}
public String getDomainName(String url) {
String domainName = "";
matcher = patternDomainName.matcher(url);
if (matcher.find()) {
domainName = matcher.group(0).toLowerCase().trim();
}
return domainName;
}
private Set<String> getDataFromGoogle(String query) {
Set<String> result = new HashSet<String>();
String request = "https://www.google.com/search?q=" + query;
System.out.println("Sending request..." + request);
try {
// need http protocol, set this as a Google bot agent :)
Document doc = Jsoup.connect(request)
.userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)").timeout(6000)
.get();
// get all links
Elements links = doc.select("a[href]");
for (Element link : links) {
String temp = link.attr("href");
if (temp.startsWith("/url?q=")) {
// use regex to get domain name
result.add(getDomainName(temp));
}
}
} catch (IOException e) {
e.printStackTrace();
}
return result;
}
}

Parsing google sites directly is not a good idea. You can try Google API
https://developers.google.com/web-search/docs/#java-access

Output issues: Passing from BufferedReader to array method

I've compiled and debugged my program, but there is no output. I suspect an issue passing from BufferedReader to the array method, but I'm not good enough with java to know what it is or how to fix it... Please help! :)
public class Viennaproj {
private String[] names;
private int longth;
//private String [] output;
public Viennaproj(int length, String line) throws IOException
{
this.longth = length;
this.names = new String[length];
String file = "names.txt";
processFile("names.txt",5);
sortNames();
}
public void processFile (String file, int x) throws IOException, FileNotFoundException{
BufferedReader reader = null;
try {
//File file = new File("names.txt");
reader = new BufferedReader(new FileReader(file));
String line;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
public void sortNames()
{
int counter = 0;
int[] lengths = new int[longth];
for( String name : names)
{
lengths[counter] = name.length();
counter++;
}
for (int k = 0; k<longth; k++)
{
int counter2 = k+1;
while (lengths[counter2]<lengths[k]){
String temp2;
int temp;
temp = lengths[counter2];
temp2 = names[counter2];
lengths[counter2] = lengths[k];
names[counter2] = names[k];
lengths[k] = temp;
names[k] = temp2;
counter2++;
}
}
}
public String toString()
{
String output = new String();
for(String name: names)
{
output = name + "/n" + output;
}
return output;
}
public static void main(String[] args)
{
String output = new String ();
output= output.toString();
System.out.println(output+"");
}
}

In Java, the public static void main(String[] args) method is the starting point of the application.
You should create an object of Viennaproj in your main method. Looking at your implementation, just creating an object of Viennaproj will fix your code.
Your main method should look like below
public static void main(String[] args) throws IOException
{
Viennaproj viennaproj = new Viennaproj(5, "Sample Line");
String output= viennaproj.toString();
System.out.println(output);
}
And, if you are getting a FileNotFound exception when you execute this, it means that java is not able to find the file.
You must provide complete file path of your file to avoid that issue. (eg: "C:/test/input.txt")

C# dll method call from Java

Has anyone an idea about what is wrong with my attempt to call a method from a C# dll in my Java code?
Here is my example:
Java code:
public class CsDllHandler {
public interface IKeywordRun extends Library {
public String KeywordRun(String action, String xpath, String inputData,
String verifyData);
}
private static IKeywordRun jnaInstance = null;
public void runDllMethod(String action, String xpath, String inputData,
String verifyData) {
NativeLibrary.addSearchPath(${projectDllName},
"${projectPath}/bin/x64/Debug");
jnaInstance = (IKeywordRun) Native.loadLibrary(
${projectDllName}, IKeywordRun.class);
String csResult = jnaInstance.KeywordRun(action, xpath, inputData,
verifyData);
System.out.println(csResult);
}
}
And in C#:
[RGiesecke.DllExport.DllExport]
public static string KeywordRun(string action, string xpath, string inputData, string verifyData) {
return "C# here";
}
The Unmanaged Exports nuget should be enough for me to call this method (in theory) but I have some strange error:
Exception in thread "main" java.lang.Error: Invalid memory access
at com.sun.jna.Native.invokePointer(Native Method)
at com.sun.jna.Function.invokePointer(Function.java:470)
at com.sun.jna.Function.invokeString(Function.java:651)
at com.sun.jna.Function.invoke(Function.java:395)
at com.sun.jna.Function.invoke(Function.java:315)
at com.sun.jna.Library$Handler.invoke(Library.java:212)
at com.sun.proxy.$Proxy0.KeywordRun(Unknown Source)
at auto.test.keywords.utils.CsDllHandler.runDllMethod(CsDllHandler.java:34)
at auto.test.keywords.runner.MainClass.main(MainClass.java:24)

Well, after another day of research and "trial and error" I have found the cause of my problem and a solution.
The cause was that my C# dll had a dependency on log4net.dll. For running a static method from a standalone C# dll the code from the question is all you need.
The solution for using C# dll with dependencies is to create another dll with no dependency and to load the original dll in this adapter with reflection. In Java you should load the adapter dll with jna and call any exported method. I was able not only to execute methods from the adapter but also to configure log4net with reflection and Java
Here is my code:
(C#)
public class CSharpDllHandler {
private static Logger log = Logger.getLogger(CSharpDllHandler.class);
public interface IFrameworkAdapter extends Library {
public String runKeyword(String action, String xpath, String inputData,
String verifyData);
public String configureLog4net(String log4netConfigPath);
public String loadAssemblies(String frameworkDllPath,
String log4netDllPath);
}
private static IFrameworkAdapter jnaAdapterInstance = null;
private String jnaSearchPath = null;
public CSharpDllHandler(String searchPath) {
this.jnaSearchPath = searchPath;
// add to JNA search path
System.setProperty("jna.library.path", jnaSearchPath);
// load attempt
jnaAdapterInstance = (IFrameworkAdapter) Native.loadLibrary(
"FrameworkAdapter", IFrameworkAdapter.class);
}
public String loadAssemblies(String frameworkDllPath, String log4netDllPath) {
String csResult = jnaAdapterInstance.loadAssemblies(frameworkDllPath,
log4netDllPath);
log.debug(csResult);
return csResult;
}
public String runKeyword(String action, String xpath, String inputData,
String verifyData) {
String csResult = jnaAdapterInstance.runKeyword(action, xpath,
inputData, verifyData);
log.debug(csResult);
return csResult;
}
public String configureLogging(String log4netConfigPath) {
String csResult = jnaAdapterInstance
.configureLog4net(log4netConfigPath);
log.debug(csResult);
return csResult;
}
public String getJnaSearchPath() {
return jnaSearchPath;
}
}
In the main method just use something like this:
CSharpDllHandler dllHandler = new CSharpDllHandler(
${yourFrameworkAdapterDllLocation});
dllHandler.loadAssemblies(
${yourOriginalDllPath},${pathToTheUsedLog4netDllFile});
dllHandler.configureLogging(${log4net.config file path});
dllHandler.runKeyword("JAVA Action", "JAVA Xpath", "JAVA INPUT",
"JAVA VERIFY");
dllHandler.runKeyword("JAVA Action2", "JAVA Xpath2", "JAVA INPUT2",
"JAVA VERIFY2");
In C# I have the desired methods on the original dll:
public static string KeywordRun(string action, string xpath, string inputData, string verifyData) {
log.Debug("Action = " + action);
log.Debug("Xpath = " + xpath);
log.Debug("InputData = " + inputData);
log.Debug("VerifyData = " + verifyData);
return "C# UserActions result: "+ action+" "+xpath+" "+inputData+" "+verifyData;
}
and all the magic is in the DLL Adapter:
namespace FrameworkAdapter {
[ComVisible(true)]
public class FwAdapter {
private const String OK="OK";
private const String frameworkEntryClassName = "${nameOfTheDllClass with method to run }";
private const String log4netConfiguratorClassName = "log4net.Config.XmlConfigurator";
private static Assembly frameworkDll = null;
private static Type frameworkEntryClass = null;
private static MethodInfo keywordRunMethod = null;
private static Assembly logDll = null;
private static Type logEntryClass = null;
private static MethodInfo logConfigureMethod = null;
private static String errorMessage = "OK";
[RGiesecke.DllExport.DllExport]
public static string loadAssemblies(string frameworkDllPath, string log4netDllPath) {
try {
errorMessage = LoadFrameworkDll(frameworkDllPath, frameworkEntryClassName);
LoadFrameworkMethods("KeywordRun", "Setup", "TearDown");
errorMessage = LoadLogAssembly(log4netDllPath, log4netConfiguratorClassName);
if (errorMessage.CompareTo(OK) == 0)
errorMessage = LoadLogMethods("Configure");
}
catch (Exception e) {
return e.Message;
}
return errorMessage;
}
[RGiesecke.DllExport.DllExport]
public static string configureLog4net(string log4netConfigPath) {
if (errorMessage.CompareTo("OK") == 0) {
StringBuilder sb = new StringBuilder();
sb.AppendLine("Try to configure Log4Net");
try {
FileInfo logConfig = new FileInfo(log4netConfigPath);
logConfigureMethod.Invoke(null, new object[] { logConfig });
sb.AppendLine("Log4Net configured");
}
catch (Exception e) {
sb.AppendLine(e.InnerException.Message);
}
return sb.ToString();
}
return errorMessage;
}
[RGiesecke.DllExport.DllExport]
public static string runKeyword(string action, string xpath, string inputData, string verifyData) {
StringBuilder sb = new StringBuilder();
object result = null;
try {
result = keywordRunMethod.Invoke(null, new object[] { action, xpath, inputData, verifyData });
sb.AppendLine(result.ToString());
}
catch (Exception e) {
sb.AppendLine(e.InnerException.Message);
}
return sb.ToString();
}
private static String LoadFrameworkDll(String dllFolderPath, String entryClassName) {
try {
frameworkDll = Assembly.LoadFrom(dllFolderPath);
Type[] dllTypes = frameworkDll.GetExportedTypes();
foreach (Type t in dllTypes)
if (t.FullName.Equals(entryClassName)) {
frameworkEntryClass = t;
break;
}
}
catch (Exception e) {
return e.InnerException.Message;
}
return OK;
}
private static String LoadLogAssembly(String dllFolderPath, String entryClassName) {
try {
logDll = Assembly.LoadFrom(dllFolderPath);
Type[] dllTypes = logDll.GetExportedTypes();
foreach (Type t in dllTypes)
if (t.FullName.Equals(entryClassName)) {
logEntryClass = t;
break;
}
}
catch (Exception e) {
return e.InnerException.Message;
}
return OK;
}
private static String LoadLogMethods(String logMethodName) {
try {
logConfigureMethod = logEntryClass.GetMethod(logMethodName, new Type[] { typeof(FileInfo) });
}
catch (Exception e) {
return e.Message;
}
return OK;
}
private static void LoadFrameworkMethods(String keywordRunName, String scenarioSetupName, String scenarioTearDownName) {
///TODO load the rest of the desired methods here
keywordRunMethod = frameworkEntryClass.GetMethod(keywordRunName);
}
}
}
Running this code will provide all the logged messages from the original C# DLL to the Java console output (and to a file if configured). In a similar way, we can load any other needed dll files for runtime.
Please forgive my [very probable wrong] way of doing things in C# with reflection, I'm new to this language.

Convert HTML to plain text in Java

I need to convert HTML to plain text. My only requirement of formatting is to retain new lines in the plain text. New lines should be displayed not only in the case of <br> but other tags, e.g. <tr/>, </p> leads to a new line too.
Sample HTML pages for testing are:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html
http://www.javadb.com/write-to-file-using-bufferedwriter
Note that these are only random URLs.
I have tried out various libraries (JSoup, Javax.swing, Apache utils) mentioned in the answers to this StackOverflow question to convert HTML to plain text.
Example using JSoup:
public class JSoupTest {
#Test
public void SimpleParse() {
try {
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
System.out.print(doc.text());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Example with HTMLEditorKit:
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleText(char[] text, int pos) {
s.append(text);
}
public String getText() {
return s.toString();
}
public static void main (String[] args) {
try {
// the HTML to convert
URL url = new URL("http://www.javadb.com/write-to-file-using-bufferedwriter");
URLConnection conn = url.openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
String finalContents = "";
while ((inputLine = reader.readLine()) != null) {
finalContents += "\n" + inputLine.replace("<br", "\n<br");
}
BufferedWriter writer = new BufferedWriter(new FileWriter("samples/testHtml.html"));
writer.write(finalContents);
writer.close();
FileReader in = new FileReader("samples/testHtml.html");
Html2Text parser = new Html2Text();
parser.parse(in);
in.close();
System.out.println(parser.getText());
}
catch (Exception e) {
e.printStackTrace();
}
}
}

Have your parser append text content and newlines to a StringBuilder.
final StringBuilder sb = new StringBuilder();
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
public boolean readyForNewline;
#Override
public void handleText(final char[] data, final int pos) {
String s = new String(data);
sb.append(s.trim());
readyForNewline = true;
}
#Override
public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
sb.append("\n");
readyForNewline = false;
}
}
#Override
public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
handleStartTag(t, a, pos);
}
};
new ParserDelegator().parse(new StringReader(html), parserCallback, false);

I would guess you could use the ParserCallback.
You would need to add code to support the tags that require special handling. There are:
handleStartTag
handleEndTag
handleSimpleTag
callbacks that should allow you to check for the tags you want to monitor and then append a newline character to your buffer.

Building on your example, with a hint from html to plain text? message:
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
public class TestJsoup
{
public void SimpleParse()
{
try
{
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
// Trick for better formatting
doc.body().wrap("<pre></pre>");
String text = doc.text();
// Converting nbsp entities
text = text.replaceAll("\u00A0", " ");
System.out.print(text);
}
catch (IOException e)
{
e.printStackTrace();
}
}
public static void main(String args[])
{
TestJsoup tjs = new TestJsoup();
tjs.SimpleParse();
}
}

You can use XSLT for this purpose. Take a look at this link which addresses a similar problem.
Hope it is helpful.

I would use SAX. If your document is not well-formed XHTML, I would transform it with JTidy.

JSoup is not FreeMarker (or any other customer/non-HTML tag) compatible. Consider this as the most pure solution for converting Html to plain text.
http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726
My code:
return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();

Problem querying an HTML file using HTMLEditorKit in Java

My HTML contains tags of the following form:
<div class="author">Apple - October 22, 2009 - 01:07</div>
I'd like to extract the date, "October 22, 2009 - 01:07" in this example, from each tag
I've implemented javax.swing.text.html.HTMLEditorKit.ParserCallback as follows:
class HTMLParseListerInner extends HTMLEditorKit.ParserCallback {
private ArrayList<String> foundDates = new ArrayList<String>();
private boolean isDivLink = false;
public void handleText(char[] data, int pos) {
if(isDivLink)
foundDates.add(new String(data)); // Extracts "Apple" instead of the date.
}
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
String divValue = (String)a.getAttribute(HTML.Attribute.CLASS);
if (t.toString() == "div" && divValue != null && divValue.equals("author"))
isDivLink = true;
}
}
However, the above parser returns "Apple" which is inside a hyperlink within the tag. How can I fix the parser to extract the date?

Override handleEndTag and check for "a"?
However, this HTML parser is from the early 90's and these methods are not well specified.

import java.io.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class ParserCallbackDiv extends HTMLEditorKit.ParserCallback
{
private boolean isDivLink = false;
private String divText;
public void handleEndTag(HTML.Tag tag, int pos)
{
if (tag.equals(HTML.Tag.DIV))
{
System.out.println( divText );
isDivLink = false;
}
}
public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos)
{
if (tag.equals(HTML.Tag.DIV))
{
String divValue = (String)a.getAttribute(HTML.Attribute.CLASS);
if ("author".equals(divValue))
isDivLink = true;
}
}
public void handleText(char[] data, int pos)
{
divText = new String(data);
}
public static void main(String[] args)
throws IOException
{
String file = "<div class=\"author\"><a href=\"/user/1\"" +
"title=\"View user profile.\">Apple</a> - October 22, 2009 - 01:07</div>";
StringReader reader = new StringReader(file);
ParserCallbackDiv parser = new ParserCallbackDiv();
try
{
new ParserDelegator().parse(reader, parser, true);
}
catch (IOException e)
{
System.out.println(e);
}
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How t get specific value from html in java? - java

http://java-source.net/open-source/crawlers You can use any of that's apis, but don't parse the HTML with the pure JDK, because it's too painfull.

Related

Google Search with JSoup

Output issues: Passing from BufferedReader to array method

C# dll method call from Java

Convert HTML to plain text in Java

Problem querying an HTML file using HTMLEditorKit in Java

Categories

Resources