how to differentiate xml from html links in java - java

I have a list of links, containing links to html and xml pages, how can I extract the xml links from the list? in java
thanks

You could use a list of common filename extensions to divine the type of data stored at a given URL, but that often won't be very reliable, particularly with Web 2.0 sites (just look at the URL of this SO question itself). In addition, a link to a PHP script (.php) or other dynamic content site could return either HTML or XML. Or it could return something else entirely, such as a JPG file.
There are a lot of simple heuristics you can use for detecting HTML vs. XML, simply by looking at the beginning of the file. For example, you could look for the <!DOCTYPE ...> declaration, check for the <?xml ...?> directive, and check to see if the file contains a root <html> tag. Of course, these should all be case-insensitive checks.
You can also try to identify the type of file based on its MIME type (for example, text/html or text/xml). Unfortunately, many servers return incorrect or invalid MIME types, so you often have to read the beginning of the file anyway to divine its content, as you can see in my first two inadequate versions of a getMimeType() method below. The third attempt worked better, but the third-party MimeMagic library still provided disappointing results. Nevertheless, you could use the additional heuristics that I mentioned earlier to either replace or improve the getMimeType() method.
package com.example.mimetype;
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.FileNameMap;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import net.sf.jmimemagic.Magic;
import net.sf.jmimemagic.MagicException;
import net.sf.jmimemagic.MagicMatchNotFoundException;
import net.sf.jmimemagic.MagicParseException;
public class MimeUtils {
// After calling this method, you can retrieve a list of URLs for each mimetype.
public static Map<String, List<String>> sortLinksByMimeType(List<String> links) {
Map<String, List<String>> mapMimeTypesToLinks = new HashMap<String, List<String>>();
for (String url : links) {
try {
String mimetype = getMimeType(url);
System.out.println(url + " has mimetype " + mimetype);
// If this mimetype hasn't already been initialized, initialize it.
if (! mapMimeTypesToLinks.containsKey(mimetype)) {
mapMimeTypesToLinks.put(mimetype, new ArrayList<String>());
}
List<String> lst = mapMimeTypesToLinks.get(mimetype);
lst.add(url);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
return mapMimeTypesToLinks;
}
public static String getMimeType(String url) throws MalformedURLException, IOException, MagicParseException, MagicMatchNotFoundException, MagicException {
// first attempt at determining MIME type--returned null for all URLs that I tried
// FileNameMap filenameMap = URLConnection.getFileNameMap();
// return filenameMap.getContentTypeFor(url);
// second attempt at determining MIME type--worked better, but still returned null for many URLs
// URLConnection c = new URL(url).openConnection();
// InputStream in = c.getInputStream();
// String mimetype = URLConnection.guessContentTypeFromStream(in);
// in.close();
// return mimetype;
URLConnection c = new URL(url).openConnection();
BufferedInputStream in = new BufferedInputStream(c.getInputStream());
byte[] content = new byte[100];
in.read(content);
in.close();
return Magic.getMagicMatch(content, false).getMimeType();
}
public static void main(String[] args) {
List<String> links = new ArrayList<String>();
links.add("http://stackoverflow.com/questions/10082568/how-to-differentiate-xml-from-html-links-in-java");
links.add("http://stackoverflow.com");
links.add("http://stackoverflow.com/feeds");
links.add("http://amazon.com");
links.add("http://google.com");
sortLinksByMimeType(links);
}
}

I'm not certain if your links are some sort of Link object, but as long as you can access the value as a string this should work I think.
List<String> xmlLinks = new ArrayList<String>();
for (String link : list) {
if (link.endsWith(".xml") || link.contains(".xml")) {
xmlLinks.add(link);
}
}

Related

Using Jsoup to parse search descriptions in SERPS (Google Results)

I keep running into a problem whenever I try to scrape off searches from Google Search results. I am using Jsoup to pull out the HTML code, but I am unable to pull out the information from the webpage that I need. I am aiming to reach the descriptions of the information under the titles. Here is my code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
public class internetSearch {
public void retrieveFileInfo(String pulling) {
Document doc;
try {
String proxyAdress = "1.2.3.4";
int proxyPort = 1234;
Proxy proxy = new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(proxyAdress, proxyPort));
doc = Jsoup
.connect(pulling)
.userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
.header("Content-Language", "en-US")
.timeout(0)
.get();
System.out.println(doc.toString());
Elements links = doc.select("div[class=g]");
for (Element link : links) {
Elements titles = link.select("h3[class=r]");
String title = titles.text();
Elements bodies = link.select("span[class=st]");
String body = bodies.text();
System.out.println("Title: " + title);
System.out.println("Body: " + body + "\n");
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
I've used many sources across the web in order to get my code. In the past, I used Selenium as well, but to no avail.
I continuously search through my outcome in order to find the class ".st" which it is under (in h3, span, .st), and I do not reach a conclusion.
Is it just simply Google jumbling up the code or am I missing something vital?
Here is a solution with estivate (which is a Java DOM Parser with Annotations compatible with JSoup)
Document doc = // here your JSoup document grabbing
EstivateMapper2 mapper = new EstivateMapper2()
List<GoogleResult> results = mapper.mapToList(doc, GoogleResult .class);
with the defintion of GoogleResult as follow :
#Select("div.g")
public class GoogleResult {
#Text(select = "h3.r")
public String title;
#Text(select = "div.s cite")
public String link;
#Text(select = "span.st")
public String body;
}

How can I generate a list of Reddit saved items using jraw?

I'm trying to generate a list of all my saved reddit items using JRAW.
I've gone through the Quickstart , and successfully managed to login and retrieve information, and I can get a list of items on the Frontpage from the Cookbook, but I can't work out how I would get a list of my saved items (comments and posts) or a list of my own posts (also comments and posts).
The saved items are at https://www.reddit.com/user/<username>/saved/, but I don't know how to get jraw to retrieve and parse that, or if the api uses a different URL.
Edit: I think I probably need to use a UserContributionPaginator, but I haven't quite worked out exactly how to get it to work yet.
Worked it out.
package com.jraw;
import net.dean.jraw.RedditClient;
import net.dean.jraw.http.UserAgent;
import net.dean.jraw.http.oauth.Credentials;
import net.dean.jraw.http.oauth.OAuthData;
import net.dean.jraw.http.oauth.OAuthException;
import net.dean.jraw.models.Contribution;
import net.dean.jraw.models.Listing;
import net.dean.jraw.paginators.UserContributionPaginator;
public class printSaved {
public static void main(String [] args) {
UserAgent myUserAgent = UserAgent.of("desktop", "com.jraw.printSaved", "v0.01", "user");
RedditClient redditClient = new RedditClient(myUserAgent);
String username = "username";
Credentials credentials = Credentials.script(username, "<password>", "<clientId>", "<clientSecret>");
OAuthData authData = null;
try {
authData = redditClient.getOAuthHelper().easyAuth(credentials);
} catch (OAuthException e) {
e.printStackTrace();
}
redditClient.authenticate(authData);
UserContributionPaginator saved = new UserContributionPaginator(redditClient,"saved",username);
Listing<Contribution> savedList = saved.next();
for (Contribution item : savedList) {
System.out.println(item);
}
}
}

regarding encoded strings over Http(Apache)

I have a question regarding UTF-8 encoding when sending strings containing special characters using HttpServiceClient (Apache)
I have this small piece of code below where the method takes string and sends it via Http(which is not fully complete in the code).
Although the decoded string seems to work without problems, I would like to know if the method.addparameter or httpClient.execute(method) encodes the string again. We have the problem that at the client side the strings seem doubly encoded!
eg. strReq = äöü
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.EncoderException;
import org.apache.commons.codec.net.URLCodec;
import org.apache.commons.httpclient.methods.PostMethod;
public class Demo {
public static void Test(String strReq) throws CancellationException, IOException, DecoderException {
PostMethod method = null;
method = new PostMethod("www.example.com");
// Encode the XML document.
URLCodec codec = new URLCodec();
String requestEncoded = new String(strReq);
try {
requestEncoded = codec.encode(strReq);
} catch (EncoderException e) {
}
System.out.println("encoded req = "+requestEncoded);
method.addParameter(Constants.Hdr, requestEncoded);
String str2 = codec.decode(requestEncoded);
System.out.println("str2 ="+str2);
}

get links in a web site

how can i get links in a web page without loading it? (basically what i want is this. a user enters a URL and i want to load all the available links inside that URL.) can you please tell me a way to achieve this
Here is example Java code, specifically:
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
public class Main {
public static void main(String args[]) throws Exception {
URL url = new URL(args[0]);
Reader reader = new InputStreamReader((InputStream) url.getContent());
System.out.println("<HTML><HEAD><TITLE>Links for " + args[0] + "</TITLE>");
System.out.println("<BASE HREF=\"" + args[0] + "\"></HEAD>");
System.out.println("<BODY>");
new ParserDelegator().parse(reader, new LinkPage(), false);
System.out.println("</BODY></HTML>");
}
}
class LinkPage extends HTMLEditorKit.ParserCallback {
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
if (t == HTML.Tag.A) {
System.out.println("<A HREF=\"" + a.getAttribute(HTML.Attribute.HREF) + "\">"
+ a.getAttribute(HTML.Attribute.HREF) + "</A><BR>");
}
}
}
You'll have to load the page on your server and then find the links, preferably by loading up the document in an HTML/XML parser and traversing that DOM. The server could then send the links back to the client.
You can't do it on the client because the browser won't let your Javascript code look at the contents of the page from a different domain.
If you want the content of a page you'll have to load it. But what you can do is loading it in memory and parse it to get all the <a> tags and their content.
You'll be able to parse this XML with tools like JDom or Sax if you're working with java (as your tag says) or with simple DOM tools with javascript.
Resources :
Parse XML with javascript
On the same topic :
get all the href attributes of a web site (javascript)
Just open an URLConnection, gets the page and parse it.
public void extract_link(String site)
{
try {
List<String> links = extractLinks(site);
for (String link : links) {
System.out.println(link);
}
} catch (Exception e) {
System.out.println(e);
}
}
This is a simple function to view all links in a page.
If you want to view link in the inner links , just call it recursively(but make sure you give a limit according to your need).

Full Link Extraction using java

My goal is to always get the same string (which is the URI in my case) while reading the href property from a link. Example:
Suppose think that a html file it have somany links like
a href="index.html"> but base domain is http://www.domainname.com/index.html
a href="../index.html"> but base domain is http://www.domainname.com/dit/index.html
how can i get all the link correctly means the full link including domain name?
how can i do that in java?
the input is HTML,that is,from a bunch of HTML code it need to extract correct link
You can do this using a fullworthy HTML parser like Jsoup. There's a Node#absUrl() which does exactly what you want.
package com.stackoverflow.q3394298;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class Test {
public static void main(String... args) throws Exception {
URL url = new URL("https://stackoverflow.com/questions/3394298/");
Document document = Jsoup.connect(url).get();
Element link = document.select("a.question-hyperlink").first();
System.out.println(link.attr("href"));
System.out.println(link.absUrl("href"));
}
}
which prints (correctly) the following for the title link of your current question:
/questions/3394298/full-link-extraction-using-java
https://stackoverflow.com/questions/3394298/full-link-extraction-using-java
Jsoup may have more other (undiscovered) advantages for your purpose as well.
Related questions:
What are the pros and cons of the leading HTML parsers in Java?
Update: if you want to select all links in the document, then do as follows:
Elements links = document.select("a");
for (Element link : links) {
System.out.println(link.attr("href"));
System.out.println(link.absUrl("href"));
}
Use the URL object:
URL url = new URL(URL context, String spec)
Here's an example:
import java.net.*;
public class Test {
public static void main(String[] args) throws Exception {
URL base = new URL("http://www.java.com/dit/index.html");
URL url = new URL(base, "../hello.html");
System.out.println(base);
System.out.println(url);
}
}
It will print:
http://www.java.com/dit/index.html
http://www.java.com/hello.html

Categories