I'm using the following code for retrieving data from the internet, but I get the HTTP headers as well, which are useless for me.
URL url = new URL(webURL);
URLConnection conn = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}
in.close();
How can I get only the HTML data, without any headers or anything else?
Regards
Retrieving and parsing a document using TagSoup:
Parser p = new Parser();                       // TagSoup's SAX parser for real-world HTML
SAX2DOM sax2dom = new SAX2DOM();               // collects the SAX events into a W3C DOM
URL url = new URL("http://stackoverflow.com");
p.setContentHandler(sax2dom);
p.parse(new InputSource(new InputStreamReader(url.openStream())));
org.w3c.dom.Node doc = sax2dom.getDOM();
The TagSoup and SAX2DOM imports are:
import org.ccil.cowan.tagsoup.Parser;
import org.apache.xalan.xsltc.trax.SAX2DOM;
Writing the contents to System.out:
TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);
These all come from javax.xml.transform.*.
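If you want the output easier to read, the same Transformer accepts optional output properties (set them before calling transform); a small, optional tweak:
transformer.setOutputProperty(OutputKeys.METHOD, "html"); // serialize as HTML rather than XML
transformer.setOutputProperty(OutputKeys.INDENT, "yes");  // pretty-print the result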
You are retrieving the correct data using URLConnection. However, if you want to read or access a particular HTML tag, you have to use an HTML parser. I suggest you use jsoup.
Example:
org.jsoup.nodes.Document doc = org.jsoup.Jsoup.connect("http://your_url/").get();
org.jsoup.nodes.Element head=doc.head(); // <head> tag content
org.jsoup.nodes.Element body=doc.body(); // <body> tag content
System.out.println(doc.text()); // Only text inside the <html>
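The same API reaches any particular tag through CSS selectors; a small sketch (the "p" selector is only an example):
for (org.jsoup.nodes.Element p : doc.select("p")) {
    System.out.println(p.text()); // text of each paragraph
}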
Alternatively, you can scan the complete response yourself and keep only the data between the <html> tags.
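A minimal sketch of that manual approach, assuming the whole response was first read into a String named page (a hypothetical variable):
int start = page.indexOf("<html");
int end = page.lastIndexOf("</html>");
String htmlOnly = (start >= 0 && end >= 0)
        ? page.substring(start, end + "</html>".length())
        : page; // fall back to the raw response if the tags are missing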
Do you mean to translate HTML into text? If so, you can use org.htmlparser.*. Take a look at http://htmlparser.sourceforge.net/
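If plain text extraction is the goal, that library ships a StringBean helper; here is a rough sketch from memory, so treat the exact setter names as assumptions:
import org.htmlparser.beans.StringBean;

StringBean sb = new StringBean();
sb.setLinks(false);                    // keep only visible text, not link URLs
sb.setReplaceNonBreakingSpaces(true);  // turn &nbsp; into ordinary spaces
sb.setCollapse(true);                  // collapse runs of whitespace
sb.setURL("http://www.example.com");   // hypothetical URL
String text = sb.getStrings();         // the page reduced to its text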
I want to parse a simple web site and scrape information from it.
I used to parse XML files with DocumentBuilderFactory. I tried to do the same thing for the HTML file, but it always gets into an infinite loop.
URL url = new URL("http://www.deneme.com");
URLConnection uc = url.openConnection();
InputStreamReader input = new InputStreamReader(uc.getInputStream());
BufferedReader in = new BufferedReader(input);
String inputLine;
FileWriter outFile = new FileWriter("orhancan");
PrintWriter out = new PrintWriter(outFile);
while ((inputLine = in.readLine()) != null) {
    out.println(inputLine);
}
in.close();
out.close();
File fXmlFile = new File("orhancan");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
NodeList prelist = doc.getElementsByTagName("body");
System.out.println(prelist.getLength());
What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?
There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Or if you want the body:
Elements body = doc.select("body");
Or if you want all links:
Elements links = doc.select("body a");
You no longer need to get connections or handle streams. Simple. If you have ever used jQuery then it is very similar to that.
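For instance, printing every link found above is one loop (a small usage sketch building on the links variable; Element is org.jsoup.nodes.Element):
for (Element link : links) {
    System.out.println(link.attr("abs:href") + " : " + link.text());
}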
Definitely JSoup is the answer. ;-)
HTML is not always valid, well-formed XML. Try a dedicated HTML parser instead of an XML parser. There are a couple of different ones available:
http://java-source.net/open-source/html-parsers
How can I get the String representation of what is displayed on a tab when opening a website in a browser? Let's say, if I opened http://www.stackoverflow.com, is it possible to extract the "Stack Overflow" String shown on the tab?
I'm interested in a Java implementation - java.net.URL doesn't seem to have a method for that.
java.net.URL won't do it, no; you need an HTML parser like JSoup. Then you just take the content of the title tag in the head.
E.g., assuming you have a URL:
Document doc = Jsoup.connect(url).get();
Element titleElement = doc.select("head title").first(); // Or just "title", it's always supposed to be in the head
String title = titleElement == null ? null : titleElement.text();
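jsoup also has a one-call shortcut for this exact lookup, which (as far as I recall its API) behaves the same:
String title = doc.title(); // returns an empty string rather than null when the page has no <title>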
Look for the following pattern in the response:
private static final Pattern TITLE_TAG =
        Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL); // non-greedy, so it stops at the first closing tag
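A short usage sketch, assuming the page body was already read into a String named html (a hypothetical variable):
Matcher matcher = TITLE_TAG.matcher(html);
if (matcher.find()) {
    String title = matcher.group(1).trim(); // the text between <title> and </title>
}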
One more solution, since parsing HTML with regex is not considered good practice:
javax.swing.text.html.HTMLDocument
URL url = new URL("http://yourwebsitehere.com");
URLConnection connection = url.openConnection();
InputStream is = connection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
HTMLEditorKit htmlKit = new HTMLEditorKit();
HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
htmlKit.read(br, htmlDoc, 0); // without this read the document stays empty and the title is null
String title = (String) htmlDoc.getProperty(HTMLDocument.TitleProperty);
System.out.println("HTMLDocument Title: " + title);
To convert from an HTML String to org.w3c.dom.Document, I'm using jtidy-r938.jar. Here is my code:
public static Document getDoc(String html) {
Tidy tidy = new Tidy();
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setWraplen(Integer.MAX_VALUE);
// tidy.setPrintBodyOnly(true);
tidy.setXmlOut(false);
tidy.setShowErrors(0);
tidy.setShowWarnings(false);
// tidy.setForceOutput(true);
tidy.setQuiet(true);
Writer out = new StringWriter();
PrintWriter dummyOut = new PrintWriter(out);
tidy.setErrout(dummyOut);
tidy.setSmartIndent(true);
ByteArrayInputStream inputStream = new ByteArrayInputStream(html.getBytes(java.nio.charset.StandardCharsets.UTF_8)); // match the declared UTF-8 input encoding
Document doc = tidy.parseDOM(inputStream, null);
return doc;
}
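For reference, I exercise the helper like this (the snippet HTML is just an illustration):
Document doc = getDoc("<html><head><title>t</title></head><body><p>hello</p></body></html>");
System.out.println(doc.getDocumentElement().getTagName()); // prints "html"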
But sometimes the library works incorrectly and some tags are lost.
Please suggest a good open-source library for this task.
Thanks very much!
You don't say why the library sometimes doesn't give a good result.
Nevertheless, I work very regularly with HTML files that I must extract data from, and the main problem I encounter is that some tags are not valid, for example because they are never closed.
The best solution I found for this is the HtmlCleaner API (htmlCleaner Website).
It allows you to make your HTML file well formed.
Then transforming it into a W3C Document or another strict format is easier.
With HtmlCleaner, you could do something like this:
HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(html);                             // repair the raw HTML
DomSerializer ser = new DomSerializer(cleaner.getProperties());
Document myW3cDoc = ser.createDOM(node);                        // turn the cleaned tree into a W3C DOM
I'm referring to the DomSerializer from HtmlCleaner.
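Once you have myW3cDoc, the standard javax.xml.xpath API can query it; a small follow-up sketch:
XPath xpath = XPathFactory.newInstance().newXPath();
String title = xpath.evaluate("//title", myW3cDoc); // text content of the first <title>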
I have a problem with NekoHTML
I try to parse HTML pages where some tags have a class attribute with multiple values, such as class = 'ui-dialog ui-widget ui-widget-content ui-corner-all ui-draggable'. The problem is that NekoHTML extracts only the first value of the class, leaving: class = 'ui-dialog'
The code I use is as follows, but I cannot understand why this behavior occurs or whether I can change it.
URL newURL = new URL(url);
Reader source = new BufferedReader(new InputStreamReader(newURL.openStream(), "UTF-8"));
doc = parseHTML(source);
EDIT
public Document parseHTML(Reader in) throws SAXException, IOException {
    DOMParser parser = new DOMParser();
    parser.setFeature("http://xml.org/sax/features/namespaces", false);          // no namespace processing
    parser.setFeature("http://cyberneko.org/html/features/balance-tags", true);  // repair unbalanced tags
    XMLDocumentFilter[] filters = { new Purifier() };
    parser.setProperty("http://cyberneko.org/html/properties/filters", filters);
    parser.parse(new InputSource(in));
    return parser.getDocument();
}
An example page is http://www.123premier.com/used-cars/2257/hyundai-matrix/#
Using the XPath expression //A[@class='addthis_button_facebook at300b'], it doesn't find anything. But Firebug in Firefox gives me the correct node.
The code for the XPath expression is the following:
XPath xPath = XPathFactory.newInstance().newXPath();
DTMNodeList result = (DTMNodeList) xPath.evaluate("//A[@class='addthis_button_facebook at300b']", doc, XPathConstants.NODESET);
Thank you very much.