I need a program that, given a URL, returns a list of all the images on the webpage, e.g.:
logo.png
gallery1.jpg
test.gif
Is there any open source software available before I try and code something?
The language should be Java. Thanks,
Philip
Just use a simple HTML parser like JTidy, get all elements by tag name img, and collect the src attribute of each into a List<String> (or perhaps List<URI>).
You can obtain an InputStream of a URL using URL#openStream() and then feed it to any HTML parser you like. Here's a kickoff example:
InputStream input = new URL("http://www.stackoverflow.com").openStream();
Document document = new Tidy().parseDOM(input, null);
NodeList imgs = document.getElementsByTagName("img");

List<String> srcs = new ArrayList<String>();
for (int i = 0; i < imgs.getLength(); i++) {
    srcs.add(imgs.item(i).getAttributes().getNamedItem("src").getNodeValue());
}

for (String src : srcs) {
    System.out.println(src);
}
I must admit, however, that HtmlUnit, as suggested by Bozho, indeed looks better.
HtmlUnit has HtmlPage.getElementsByTagName("img"), which will probably suit you.
(Read the short getting-started guide to see how to obtain the correct HtmlPage object.)
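For example, a minimal sketch (assuming an HtmlUnit version where WebClient implements AutoCloseable; the URL is just a placeholder):

try (WebClient client = new WebClient()) {
    HtmlPage page = client.getPage("http://www.stackoverflow.com");
    // getElementsByTagName returns every <img> element in the page
    for (DomElement img : page.getElementsByTagName("img")) {
        System.out.println(img.getAttribute("src"));
    }
}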
This is dead simple with HTML Parser (and any other decent HTML parser):
Parser parser = new Parser("http://www.yahoo.com/");
NodeList list = parser.parse(new TagNameFilter("IMG"));

for (SimpleNodeIterator iterator = list.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    System.out.println(tag.getAttribute("src"));
}
You can use wget, which has a lot of options available, or Google for "java wget" ...
You can parse the HTML and collect all src attributes of img elements in a Collection, then download each resource from its URL and write it to a file. For parsing, there are several HTML parsers available; Cobra is one of them.
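The download step itself is plain java.net/java.nio work; a minimal sketch (the src value and the out/ directory are placeholders for illustration):

String src = "http://example.com/logo.png"; // one entry from your collection
try (InputStream in = new URL(src).openStream()) {
    // derive a file name from the URL and copy the stream to disk (Java 7+)
    String name = src.substring(src.lastIndexOf('/') + 1);
    Files.copy(in, Paths.get("out", name), StandardCopyOption.REPLACE_EXISTING);
}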
With Open Graph tags and HTML Parser (the same library as above), you can extract your data really easily (PageMeta is a simple POJO holding the results):
Parser parser = new Parser(url);

PageMeta pageMeta = new PageMeta();
pageMeta.setUrl(url);

NodeList meta = parser.parse(new TagNameFilter("meta"));
for (SimpleNodeIterator iterator = meta.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();

    if ("og:image".equals(tag.getAttribute("property"))) {
        pageMeta.setImageUrl(tag.getAttribute("content"));
    }
    if ("og:title".equals(tag.getAttribute("property"))) {
        pageMeta.setTitle(tag.getAttribute("content"));
    }
    if ("og:description".equals(tag.getAttribute("property"))) {
        pageMeta.setDescription(tag.getAttribute("content"));
    }
}
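For reference, PageMeta could look something like this (a hypothetical sketch implied by the setters above; the original POJO isn't shown):

public class PageMeta {
    private String url;
    private String imageUrl;
    private String title;
    private String description;

    public void setUrl(String url) { this.url = url; }
    public void setImageUrl(String imageUrl) { this.imageUrl = imageUrl; }
    public void setTitle(String title) { this.title = title; }
    public void setDescription(String description) { this.description = description; }
    // matching getters omitted for brevity
}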
You can simply use a regular expression in Java. Given this HTML:
<html>
  <body>
    <p>
      <img src="38220.png" alt="test" title="test" />
      <img src="32222.png" alt="test" title="test" />
    </p>
  </body>
</html>
String s = "html"; // the HTML content shown above
Pattern p = Pattern.compile("<img [^>]*src=[\\\"']([^\\\"^']*)");
Matcher m = p.matcher(s);

while (m.find()) {
    // group(1) captures everything after the opening quote of the src attribute
    String srcTag = m.group(1);
    System.out.println(srcTag);
}
I'm new to Java, and I have a link, "https://moz.com/blog-sitemap.xml", that contains URLs; I want to get them and save them in a string vector/array.
I tried this first to see how I'm going to get the links:
URL robotFile = new URL("https://moz.com/blog-sitemap.xml");
// read the sitemap line by line
Scanner robotScanner = new Scanner(robotFile.openStream());
while (robotScanner.hasNextLine()) {
    System.out.println(robotScanner.nextLine());
}
This is the sample output.
My question is: is there a simpler, easier way to get these links than looping over each line and checking whether it contains "https" so I can extract the link from it?
You can use Jsoup to do this more easily:
List<String> urlList = new ArrayList<>();

Document doc = Jsoup.connect("https://moz.com/blog-sitemap.xml").get();
Elements urls = doc.getElementsByTag("loc");
for (Element url : urls) {
    urlList.add(url.text());
}
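Since the sitemap is XML rather than HTML, you may also want to tell Jsoup to use its XML parser so the markup isn't normalized as HTML (a small variant, assuming a Jsoup version that has Connection#parser):

Document doc = Jsoup.connect("https://moz.com/blog-sitemap.xml")
        .parser(Parser.xmlParser()) // org.jsoup.parser.Parser
        .get();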
I'm using JDOM 2.0.6 to transform XML into HTML with an XSLT, but I've run into the following problem: sometimes the data is empty, that is, I'll have the following in my XSLT:
<div class="someclass"><xsl:value-of select="somevalue"/></div>
and when somevalue is empty, the output I get is:
<div class="someclass"/>
which may be perfectly valid XML, but is not valid HTML, and causes problems when displaying the resulting page.
Similar problems occur for <span> or <script> tags.
So my question is - how can I tell JDOM not to contract empty elements, and leave them as <div></div>?
Edit
I suspect the problem is not in the actual XSLTTransformer, but later, when using JDOM to write the HTML. Here is the code I use:
XMLOutputter htmlout = new XMLOutputter(Format.getPrettyFormat());
htmlout.getFormat().setEncoding("UTF-8");
Document htmlDoc = transformer.transform(model);
htmlDoc.setDocType(new DocType("html"));
try (OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outHtml), "UTF-8")) {
htmlout.output(htmlDoc, osw);
}
Currently the proposed solution of adding a zero-width space works for me, but I'm interested to know whether there is a way to tell JDOM to treat the document as HTML (be it in the transform stage or the output stage, though I'm guessing the problem lies in the output stage).
You can use a zero-width space between the elements. This doesn't affect the HTML output, but it keeps the open and close tags separated because the element has non-empty content (the &#8203; entity below is the zero-width space, which is otherwise invisible):

<div class="someclass"><xsl:value-of select="somevalue"/>&#8203;</div>

The downside is that the tag is not really empty anymore. That would matter if your output were XML, but for HTML, which is probably the last stage of processing, it should not matter.
In your case, the XSLT transform is happening directly to a file/stream, and it is no longer under the control of JDOM.
In JDOM, you can select whether the output of a document uses expanded or contracted empty elements. Typically, people produce output from JDOM like:
XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
xout.output(document, System.out);
You can modify the output format, though, and expand the empty elements:
Format expanded = Format.getPrettyFormat().setExpandEmptyElements(true);
XMLOutputter xout = new XMLOutputter(expanded);
xout.output(document, System.out);
If you 'recover' (assuming it is valid XHTML?) the XSLT-transformed XML as a new JDOM document, you can output the result with expanded empty elements.
If you want to transform to an HTML file, then consider using a JAXP Transformer with a JDOMSource and a StreamResult; the Transformer will serialize the transformation result as HTML if the output method is html (either as set in your code or as implied by a no-namespace root element named html).
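A sketch of that approach (the stylesheet name page.xsl, the output file out.html, and the source document jdomDoc are placeholders):

// let a JAXP Transformer serialize the result as HTML instead of XML
Transformer transformer = TransformerFactory.newInstance()
        .newTransformer(new StreamSource(new File("page.xsl")));
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.transform(new JDOMSource(jdomDoc), new StreamResult(new File("out.html")));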
In addition to the "expandEmptyElements" option, you could create your own writer and pass it to the XMLOutputter:
XMLOutputter outputter = new XMLOutputter(Format.getPrettyFormat().setExpandEmptyElements(true));
StringWriter writer = new HTML5Writer();
outputter.output(document, writer);
System.out.println(writer.toString());
This writer can then modify all HTML5 void elements. Elements like "script" for example won't be touched:
private static class HTML5Writer extends StringWriter {

    private static String[] VOIDELEMENTS = new String[] { "area", "base", "br", "col", "command",
            "embed", "hr", "img", "input", "keygen", "link", "meta", "param", "source", "track", "wbr" };

    private boolean inVoidTag;
    private StringBuffer voidTagBuffer;

    public void write(String str) {
        if (voidTagBuffer != null) {
            // XMLOutputter emits "></" between the start and end tag of an empty
            // element; replace it with a self-closing " />" and flush the buffer.
            if (str.equals("></")) {
                voidTagBuffer.append(" />");
                super.write(voidTagBuffer.toString());
                voidTagBuffer = null;
            } else {
                voidTagBuffer.append(str);
            }
        } else if (inVoidTag) {
            // swallow the rest of the redundant closing tag, up to its '>'
            if (str.equals(">")) {
                inVoidTag = false;
            }
        } else {
            // start buffering as soon as a void element's tag name is written
            for (int i = 0; i < VOIDELEMENTS.length; i++) {
                if (str.equals(VOIDELEMENTS[i])) {
                    inVoidTag = true;
                    voidTagBuffer = new StringBuffer(str);
                    return;
                }
            }
            super.write(str);
        }
    }
}
I know, this is dirty, but I had the same problem and didn't find any other way.
I have this code that will take in an HTML file, get all the opening HTML tags, and then print them. I was wondering if there was a way to also include the closing tags within this code. Right now it prints:
<html>
<head>
<title>
<body>
<table>
<p>
<a>
<p>
etc. etc.
I'm looking for it to print with the closing tags as well.
<p>
<a>
</a>
</p>
Here's the code I have thus far:
try {
    BufferedReader in = new BufferedReader(new FileReader("test.html"));
    String line;
    StringBuilder stringBuilder = new StringBuilder();
    while ((line = in.readLine()) != null) {
        stringBuilder.append(line);
    }
    String pageContent = stringBuilder.toString();

    // matches opening tags only: skips comments (<!) and closing tags (</)
    Pattern pattern = Pattern.compile("<(?!!)(?!/)\\s*([a-zA-Z0-9]+)(.*?)>");
    Matcher matcher = pattern.matcher(pageContent);
    while (matcher.find()) {
        String tagName = matcher.group(1);
        System.out.println("<" + tagName + ">");
    }
    in.close();
} catch (IOException e) {
    e.printStackTrace();
}
Edit: Is there a way to do it without using an external library like Jsoup?
Edit 2: I changed my Pattern.compile pattern to <([a-zA-Z0-9]+|/[a-zA-Z0-9]+)(.*?)> and it worked. Thanks.
If it's fine to use an external library, you can go with Jsoup, as described here: Extract Tags from a html file using Jsoup
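With Jsoup you can print both opening and closing tags by walking the tree with a NodeVisitor, whose head/tail callbacks fire when a node is entered and left (a minimal sketch, assuming your test.html from above):

Document doc = Jsoup.parse(new File("test.html"), "UTF-8");
doc.traverse(new NodeVisitor() {
    @Override
    public void head(Node node, int depth) { // called when the node is first visited
        if (node instanceof Element && !(node instanceof Document)) {
            System.out.println("<" + node.nodeName() + ">");
        }
    }

    @Override
    public void tail(Node node, int depth) { // called after the node's children
        if (node instanceof Element && !(node instanceof Document)) {
            System.out.println("</" + node.nodeName() + ">");
        }
    }
});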
I'm developing Java code to get data from a website and store it in a file. I want to store the result of an XPath query in a file. Is there any way to save the output of the XPath? Please forgive any mistakes; this is my first question.
public class TestScrapping {

    public static void main(String[] args) throws MalformedURLException, IOException, XPatherException {
        // URL to be fetched; in the URL below you can replace s=cantabil with the company of your choice
        String url_fetch = "http://www.yahoo.com";

        // TagNode object to traverse the document using XPath
        TagNode node;

        // XPath of the data to be fetched; use Firefox's FirePath add-on or Firebug to find the required XPath.
        // The XPath below will select the title of the company you have queried for.
        String name_xpath = "//div[1]/div[2]/div[2]/div[1]/div/div/div/div/table/tbody/tr[1]/td[2]/text()";

        // declarations related to the API
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = new CleanerProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // creating the URL object and opening the connection
        URL url = new URL(url_fetch);
        URLConnection conn = url.openConnection();

        // reading the input stream
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // storing the nodes belonging to the given XPath
        Object[] info_nodes = node.evaluateXPath(name_xpath);

        // checking whether something was returned; if the XPath is invalid, info_nodes.length == 0
        if (info_nodes.length > 0) {
            for (int i = 0; i < info_nodes.length; i++) {
                System.out.println(info_nodes[i]);
            }
        }
    }
}
You can "simply" print the nodes as strings, to console/or a file --
example in Perl:
my $all = $XML_OBJ->find('/'); # selecting all nodes from root
foreach my $node ($all->get_nodelist()) {
print XML::XPath::XMLParser::as_string($node);
}
note: this output however may not be nicely xml-formatted/indented
The output of an XPath evaluation in Java is a node-set, so yes, once you have a node-set you can do anything you want with it: save it to a file, process it some more.
Saving it to a file involves the same steps in Java that saving anything else to a file involves; there is no difference between that and any other data. Select the node-set, iterate through it, get the parts you want from it, and write them to some kind of file stream.
However, if you mean is there a NodeSet.saveToFile(), then no.
I would recommend taking the NodeSet, which is a collection of Nodes, iterating over it, and adding the nodes to a newly created DOM Document object.
After this, you can use a TransformerFactory to get a Transformer object and call its transform method: transform from a DOMSource to a StreamResult object, which can be created on top of a FileOutputStream.
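A sketch under those assumptions (nodes is an org.w3c.dom.NodeList returned by an XPath evaluation; nodes.xml is a placeholder file name):

Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

try (FileOutputStream out = new FileOutputStream("nodes.xml")) {
    for (int i = 0; i < nodes.getLength(); i++) {
        // each node is serialized as an XML fragment and appended to the file
        transformer.transform(new DOMSource(nodes.item(i)), new StreamResult(out));
    }
}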
I need to extract the text between two HTML tags and store it in a string. An example of the HTML I want to parse is as follows:
<div id=\"swiki.2.1\"> THE TEXT I NEED </div>
I have done this in Java using the pattern (swiki\.2\.1\\\")(.*)(\/div) and getting the string I want from group $2. However, this will not work on Android: when I go to print the contents of $2, nothing appears, because the match fails.
Has anyone had a similar problem using regex on Android, or is there a better (non-regex) way to parse the HTML page in the first place? Again, this works fine in a standard Java test program. Any help would be greatly appreciated!
For HTML-parsing-stuff I always use HtmlCleaner: http://htmlcleaner.sourceforge.net/
Awesome lib that works great with Xpath and of course Android. :-)
This shows how you can download an HTML/XML page from a URL and parse it to get a certain value from an attribute (also shown in the docs):
public static String snapFromHtmlWithCookies(Context context, String xPath, String attrToSnap,
        String urlString, String cookies) throws IOException, XPatherException {
    String snap = "";

    // create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();

    // take the default cleaner properties
    CleanerProperties props = cleaner.getProperties();
    props.setAllowHtmlInsideAttributes(true);
    props.setAllowMultiWordAttributes(true);
    props.setRecognizeUnicodeChars(true);
    props.setOmitComments(true);

    URL url = new URL(urlString);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setDoOutput(true);

    // optional cookies
    connection.setRequestProperty(context.getString(R.string.cookie_prefix), cookies);
    connection.connect();

    // use the cleaner to "clean" the HTML and return it as a TagNode object
    TagNode root = cleaner.clean(new InputStreamReader(connection.getInputStream()));

    Object[] foundNodes = root.evaluateXPath(xPath);
    if (foundNodes.length > 0) {
        TagNode foundNode = (TagNode) foundNodes[0];
        snap = foundNode.getAttributeByName(attrToSnap);
    }

    return snap;
}
Just edit it for your needs. :-)
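Usage could then look like this (the XPath, attribute, and URL are made-up examples):

String imgSrc = snapFromHtmlWithCookies(context, "//img[@id='logo']", "src",
        "http://example.com", "");
System.out.println(imgSrc);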