How can I get the String representation of what is displayed on a tab when opening a website in a browser? For example, if I open http://www.stackoverflow.com, is it possible to extract the "Stack Overflow" String shown on the tab?
I'm interested in a Java implementation; java.net.URL doesn't seem to have a method for that.
No, java.net.URL won't do it; you need an HTML parser like JSoup. Then you just take the content of the title tag in the head.
E.g., assuming you have a URL:
Document doc = Jsoup.connect(url).get();
Element titleElement = doc.select("head title").first(); // Or just "title", it's always supposed to be in the head
String title = titleElement == null ? null : titleElement.text();
Look for the following pattern in the response:
private static final Pattern TITLE_TAG = Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
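A minimal sketch of how such a pattern might be applied, assuming the response body has already been read into a String. The class name `TitleRegex` and the method `extractTitle` are made up for illustration. The sketch uses a reluctant quantifier (`.*?`) so the match stops at the first closing tag, and it will not match a title tag that carries attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleRegex {
    // DOTALL lets the title span several lines; CASE_INSENSITIVE also matches <TITLE>.
    private static final Pattern TITLE_TAG =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the trimmed title text, or null when no title element is found. */
    static String extractTitle(String html) {
        Matcher m = TITLE_TAG.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>Stack Overflow</TITLE></head><body></body></html>";
        System.out.println(extractTitle(html)); // prints: Stack Overflow
    }
}
```

Returning null for a missing title mirrors the null check in the JSoup snippet; callers need to handle that case.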
Here is one more solution, since parsing HTML with a regex is not considered good practice:
javax.swing.text.html.HTMLDocument
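// Needs: java.io.*, java.net.URL, java.net.URLConnection,
// javax.swing.text.html.HTMLDocument, javax.swing.text.html.HTMLEditorKit
URL url = new URL("http://yourwebsitehere.com");
URLConnection connection = url.openConnection();
InputStream is = connection.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
HTMLEditorKit htmlKit = new HTMLEditorKit();
HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
// Ignore <meta> charset directives so read() does not throw ChangedCharSetException
htmlDoc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
// Parse the page into the document; the title property is only set after this call
htmlKit.read(br, htmlDoc, 0);
String title = (String) htmlDoc.getProperty(HTMLDocument.TitleProperty);
System.out.println("HTMLDocument Title: " + title);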
I need an SVGDocument object to load into a JSVGCanvas, but I only have the SVG as a string, not a file, so I cannot construct the document from a URI.
You can read your SVG from a StringReader like this:
StringReader reader = new StringReader(svgString);
String uri = "file:make-something-up";
String parser = XMLResourceDescriptor.getXMLParserClassName();
SAXSVGDocumentFactory f = new SAXSVGDocumentFactory(parser);
SVGDocument doc = f.createSVGDocument(uri, reader);
You need to make up a valid URI but it's not important unless you make relative references to other URIs from your SVG.
This question already has answers here:
How can I efficiently parse HTML with Java?
(3 answers)
Closed 6 years ago.
I want to parse a simple web site and scrape information from that web site.
I used to parse XML files with DocumentBuilderFactory. I tried to do the same thing for the HTML file, but it always gets into an infinite loop.
URL url = new URL("http://www.deneme.com");
URLConnection uc = url.openConnection();
InputStreamReader input = new InputStreamReader(uc.getInputStream());
BufferedReader in = new BufferedReader(input);
String inputLine;
FileWriter outFile = new FileWriter("orhancan");
PrintWriter out = new PrintWriter(outFile);
while ((inputLine = in.readLine()) != null) {
out.println(inputLine);
}
in.close();
out.close();
File fXmlFile = new File("orhancan");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
NodeList prelist = doc.getElementsByTagName("body");
System.out.println(prelist.getLength());
What is the problem? Or is there an easier way to scrape data from a web site for a given HTML tag?
There is a much easier way to do this. I suggest using JSoup. With JSoup you can do things like:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
Or if you want the body:
Elements body = doc.select("body");
Or if you want all links:
Elements links = doc.select("body a");
You no longer need to get connections or handle streams. Simple. If you have ever used jQuery then it is very similar to that.
Definitely JSoup is the answer. ;-)
HTML is not always valid, well-formed XML. Try a dedicated HTML parser instead of an XML parser. There are a couple of different ones available:
http://java-source.net/open-source/html-parsers
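To illustrate the point about well-formedness, here is a small sketch that feeds two snippets to the standard XML parser. The class name `XmlVsHtml`, the method `parsesAsXml`, and the sample markup are made up for illustration. It only shows that ordinary HTML with an unclosed `<br>` is rejected by an XML parser, while the XHTML-style equivalent is accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmlVsHtml {
    /** Returns true if the markup parses as well-formed XML, false if the parser rejects it. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed, e.g. an unclosed tag.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello<br>world</body></html>";       // typical HTML
        String xhtml = "<html><body><p>Hello<br/>world</p></body></html>"; // well-formed
        System.out.println(parsesAsXml(html));  // prints: false
        System.out.println(parsesAsXml(xhtml)); // prints: true
    }
}
```

The default parser may also log a "[Fatal Error]" line to standard error for the rejected input; that is expected.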
I am trying to use CSS Parser in a Java project to extract the CSS rules/DOM from a String of text input.
All the examples that I have come across take the CSS file as input. Is there a way to bypass the file reading and work with the string content of the CSS file directly?
The class that I am working on receives only the string content of the CSS file; all the reading has already been taken care of.
Right now I have this, where 'cssfile' is the file path of the CSS file being parsed.
InputStream stream = oParser.getClass().getResourceAsStream(cssfile);
InputSource source = new InputSource(new InputStreamReader(stream));
CSSOMParser parser = new CSSOMParser();
CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
System.out.println("Number of rules: " + ruleList.getLength());
A workaround that I found was to create a Reader using a StringReader with the contents and set it as the character stream of the InputSource. But there should be a better way to do this.
// Wrap the CSS text in a Reader so no file access is needed
InputSource inputSource = new InputSource();
Reader characterStream = new StringReader(cssContent);
inputSource.setCharacterStream(characterStream);
// Parse from the in-memory source
CSSStyleSheet stylesheet = cssParserObj.parseStyleSheet(inputSource, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
I'm using the following code to retrieve data from the internet, but I also get the HTTP headers, which are useless to me.
URL url = new URL(webURL);
URLConnection conn = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
System.out.println(inputLine);
in.close();
How can I get only the HTML data, without any headers?
regards
Retrieving and parsing a document using TagSoup:
Parser p = new Parser();
SAX2DOM sax2dom = new SAX2DOM();
URL url = new URL("http://stackoverflow.com");
p.setContentHandler(sax2dom);
p.parse(new InputSource(new InputStreamReader(url.openStream())));
org.w3c.dom.Node doc = sax2dom.getDOM();
The TagSoup and SAX2DOM packages are:
import org.ccil.cowan.tagsoup.Parser;
import org.apache.xalan.xsltc.trax.SAX2DOM;
Writing the contents to System.out:
TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);
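These come from javax.xml.transform and its subpackages: TransformerFactory, Transformer, Source and Result are in javax.xml.transform, DOMSource is in javax.xml.transform.dom, and StreamResult is in javax.xml.transform.stream.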
You are retrieving the correct data using URLConnection. However, if you want to read or access a particular HTML tag, you have to use an HTML parser. I suggest you use jSoup.
Example:
org.jsoup.nodes.Document doc = org.jsoup.Jsoup.connect("http://your_url/").get();
org.jsoup.nodes.Element head=doc.head(); // <head> tag content
org.jsoup.nodes.Element body=doc.body(); // <body> tag content
System.out.println(doc.text()); // Only text inside the <html>
You can parse the complete data, search for the html tags, and keep only the data between them.
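A rough sketch of that idea, assuming the whole response has been collected into one String. The class name `HtmlSlice`, the method `betweenHtmlTags`, and the sample response are made up for illustration. It keeps everything from the first opening html tag to the last closing one and returns the input unchanged when no such pair is found.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlSlice {
    // Greedy .* with DOTALL spans from the first "<html" to the last "</html>".
    private static final Pattern HTML_BLOCK =
            Pattern.compile("<html\\b.*</html>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the part of the response between the html tags (tags included). */
    static String betweenHtmlTags(String response) {
        Matcher m = HTML_BLOCK.matcher(response);
        return m.find() ? m.group() : response;
    }

    public static void main(String[] args) {
        String response = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
                + "<html><body>hi</body></html>\r\n";
        System.out.println(betweenHtmlTags(response)); // prints: <html><body>hi</body></html>
    }
}
```

This is plain string slicing, not parsing, so it is only suitable for stripping leading or trailing noise; extracting individual tags still calls for an HTML parser.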
Do you mean you want to translate HTML into text? If so, you can use org.htmlparser.*. Take a look at http://htmlparser.sourceforge.net/