Set timeout when loading xml from a URL? - java

Is it possible to set a timeout when loading an xml directly from a URL?
Builder parser = new Builder();
Document doc = parser.build("http://somehost");
This can sometimes take minutes, and it would be really handy to be able to time this out directly in the library.

You need to use the build(InputStream inStream) API instead of build(String systemID):
URL url = new URL("http://somehost");
URLConnection con = url.openConnection();
con.setConnectTimeout(connectTimeout);
con.setReadTimeout(readTimeout);
InputStream inStream = con.getInputStream();
Builder parser = new Builder();
Document doc = parser.build(inStream);

Related

Get website name as displayed on a browser tab?

How can I get the String representation of what is displayed on a tab when opening a website in a browser? Let's say, if I opened http://www.stackoverflow.com, is it possible to extract the "Stack Overflow" String, as shown on the browser tab?
I'm interested in a Java implementation - java.net.URL doesn't seem to have a method for that.
java.net.URL won't do it, no, you need an HTML parser like JSoup. Then you just take the content of the title tag in the head.
E.g., assuming you have a URL:
Document doc = Jsoup.connect(url).get();
Element titleElement = doc.select("head title").first(); // Or just "title", it's always supposed to be in the head
String title = titleElement == null ? null : titleElement.text();
Look for the following pattern in the response:
private static final Pattern TITLE_TAG = Pattern.compile("\\<title>(.*)\\</title>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
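A minimal sketch of how that pattern might be applied (the helper class and method names are made up for illustration; the group is made non-greedy so matching stops at the first closing tag):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {
    // Non-greedy group so matching stops at the first </title>.
    private static final Pattern TITLE_TAG =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the title text, or null when no title tag is present.
    static String extractTitle(String html) {
        Matcher m = TITLE_TAG.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

Bear in mind the caveat below: regex works for quick extraction from mostly well-formed pages, but breaks on commented-out or nested markup.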
One more solution, since parsing HTML with regex is not considered good practice: use javax.swing.text.html.HTMLDocument.
URL url = new URL("http://yourwebsitehere.com");
URLConnection connection = url.openConnection();
InputStream is = connection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
HTMLEditorKit htmlKit = new HTMLEditorKit();
HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
htmlKit.read(br, htmlDoc, 0); // populate the document; without this the title property stays null
String title = (String) htmlDoc.getProperty(HTMLDocument.TitleProperty);
System.out.println("HTMLDocument Title: " + title);

Convert html String to org.w3c.dom.Document in Java

To convert from an HTML String to org.w3c.dom.Document I'm using jtidy-r938.jar. Here is my code:
public static Document getDoc(String html) {
Tidy tidy = new Tidy();
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setWraplen(Integer.MAX_VALUE);
// tidy.setPrintBodyOnly(true);
tidy.setXmlOut(false);
tidy.setShowErrors(0);
tidy.setShowWarnings(false);
// tidy.setForceOutput(true);
tidy.setQuiet(true);
Writer out = new StringWriter();
PrintWriter dummyOut = new PrintWriter(out);
tidy.setErrout(dummyOut);
tidy.setSmartIndent(true);
ByteArrayInputStream inputStream = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
Document doc = tidy.parseDOM(inputStream, null);
return doc;
}
But sometimes the library works incorrectly and some tags are lost.
Please suggest a good open-source library for this task.
Thanks very much!
You don't say why the library sometimes doesn't give a good result.
Nevertheless, I work very regularly with HTML files that I must extract data from, and the main problem encountered is the fact that some tags are not valid, for example because they are not closed.
The best solution I found to resolve this is the HtmlCleaner API (htmlCleaner website).
It allows you to make your HTML file well formed.
Then, transforming it into a w3c Document or another strict format is easier.
With HtmlCleaner, you could do something like:
HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(html);
DomSerializer ser = new DomSerializer(cleaner.getProperties());
Document myW3cDoc = ser.createDOM(node);
DomSerializer comes from HtmlCleaner.

null pointer exception on getNodeValue() when parsing XML web page - android

I looked over several of the answers posted here, but I can't find the answer I need. It may have to do with the web site itself, but I don't think so.
I'm trying to parse an XML file on a web site and I'm getting a NullPointerException.
I run the parsing in a separate thread, as Android requires when reading from the web.
Please see my code and try to help.
class BackgroundTask1 extends AsyncTask<String, Void, String[]> {
protected String[] doInBackground(String... url) {
InputStream is = null;
HttpURLConnection con = null;
try {
//Log.d("eyal", "URL: " + boiUrl);
URL url1 = new URL("http://www.boi.org.il/currency.xml");
con = (HttpURLConnection)url1.openConnection();
con.setRequestMethod("GET");
con.connect();
is = con.getInputStream();
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(is);
NodeList lastVld = doc.getElementsByTagName("LAST_UPDATE");
String lastV = lastVld.item(0).getFirstChild().getNodeValue();
}
catch (Exception e) {
e.printStackTrace();
}
I get the error on the last line.
Thanks for your help.
This code worked for me
InputStream is = null;
HttpURLConnection con = null;
try {
URL url1 = new URL("http://www.boi.org.il/currency.xml");
con = (HttpURLConnection)url1.openConnection();
con.setRequestMethod("GET");
con.connect();
is = con.getInputStream();
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(is);
NodeList lastVld = doc.getElementsByTagName("LAST_UPDATE");
Element elem = (Element) lastVld.item(0);
String lastV = elem.getTextContent();
System.out.println(lastV);
} catch (Exception e) {
e.printStackTrace();
}
I verified I was getting good content by adding a transformer to print out the results to the console.
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer xform = tFactory.newTransformer();
xform.transform(new DOMSource(doc), new StreamResult(System.out));
There were a couple of times when elem came out null, which I think had something to do with bad content being retrieved from the URL. This was the content printed by the transformer:
<html>
<body>
<script>document.cookie='iiiiiii=11a887d6iiiiiii_11a887d6; path=/';window.location.href=window.location.href;</script>
</body>
</html>
I noticed that if I had this file open in my browser, the code would all of a sudden quit working until I refreshed the page, then it started giving me the right output.
I suspect there's an issue with something at this URL, because when it works properly, this code works fine.
Good luck...
You only have one LAST_UPDATE tag in your XML and it has an inner value, so try just using the node value from the Node you get from item(0):
String lastV = lastVld.item(0).getNodeValue();
HTHs
There is no node returned for that tag name. You may want to first check the length of lastVld and only then access the items in it.
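A sketch of that defensive check (the helper class and method are hypothetical): checking getLength() first returns null instead of throwing a NullPointerException when the tag is absent, e.g. when the server sends back an HTML error page instead of the XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SafeXmlLookup {
    // Returns the text of the first LAST_UPDATE element, or null if none exists.
    static String lastUpdate(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("LAST_UPDATE");
        if (nodes.getLength() == 0) {
            return null; // tag absent: bail out instead of dereferencing item(0)
        }
        return nodes.item(0).getTextContent();
    }
}
```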

How to read a InputStream with UTF-8?

Hello all,
I'm developing a Java app that calls a PHP script on the internet, which gives me an XML response.
The response contains the word "Próximo", but when I parse the nodes of the XML and read the response into a String variable, I receive the word like this: "Pr&oacute;ximo".
I'm sure the problem is that I'm using a different encoding in the Java app than the PHP script uses, so I suppose I must set the encoding to the same as the PHP XML: UTF-8.
This is the code I'm using to get the XML file from the PHP.
What should I change in this code to set the encoding to UTF-8? (Note that I'm not using a BufferedReader; I'm using an InputStream.)
InputStream in = null;
String url = "http://www.myurl.com"
try {
URL formattedUrl = new URL(url);
URLConnection connection = formattedUrl.openConnection();
HttpURLConnection httpConnection = (HttpURLConnection) connection;
httpConnection.setAllowUserInteraction(false);
httpConnection.setInstanceFollowRedirects(true);
httpConnection.setRequestMethod("GET");
httpConnection.connect();
if (httpConnection.getResponseCode() == HttpURLConnection.HTTP_OK)
in = httpConnection.getInputStream();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(in);
doc.getDocumentElement().normalize();
NodeList myNodes = doc.getElementsByTagName("myNode");
When you get your InputStream, read byte[]s from it. When you create your Strings, pass in the Charset for "UTF-8". Example:
byte[] buffer = new byte[contentLength];
int bytesRead = inputStream.read(buffer);
String page = new String(buffer, 0, bytesRead, "UTF-8");
Note, you're probably going to want to make your buffer some sane size (like 1024) and continuously call inputStream.read(buffer).
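A sketch of that read loop (the helper class and method names are made up): accumulate all the bytes first, then decode them once as UTF-8, so a multi-byte character split across two read() calls is not corrupted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class Utf8Streams {
    // Read the whole stream with a fixed-size buffer, then decode once as UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Decoding after the loop matters: calling new String(buffer, ..., "UTF-8") per chunk can split a multi-byte sequence in half.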
@Amir Pashazadeh
Yes, you can also use an InputStreamReader, and try changing the parse() line to:
Document doc = db.parse(new InputSource(new InputStreamReader(in, "UTF-8")));

XPath application using tika parser

I want to clean up irregular web content (it may be HTML, PDF, image etc., but mostly HTML). I'm using the Tika parser for that. But I don't know how to apply an XPath the way I do in HtmlCleaner.
The code I use is,
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
URL u = new URL("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
new HtmlParser().parse(u.openStream(),handler, metadata, context);
System.out.println(handler.toString());
But in this case I get no output, while for the URL google.com I do get output.
In either case I don't know how to apply the XPath.
Any ideas please...
I tried building a custom XPath the way BodyContentHandler does:
HttpClient client = new HttpClient();
GetMethod method = new GetMethod("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
int status = client.executeMethod(method);
HtmlParser parse = new HtmlParser();
XPathParser parser = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");
//Matcher matcher = parser.parse("/xhtml:html/xhtml:body/descendant::node()");
Matcher matcher = parser.parse("/html/body//h1");
ContentHandler textHandler = new MatchingContentHandler(new WriteOutContentHandler(), matcher);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parse.parse(method.getResponseBodyAsStream(), textHandler,metadata ,context);
System.out.println("content: " + textHandler.toString());
But I'm not getting the content at the given XPath.
I'd suggest you take a look at the source code of BodyContentHandler, which comes with Tika. BodyContentHandler only returns the XML within the body tag, based on an XPath.
In general, though, you should use a MatchingContentHandler to wrap your chosen ContentHandler with an XPath, which is what BodyContentHandler does internally.