How do I get a cleaned HTML file from HtmlCleaner? - java

My application downloads a certain website as an HTML file the first time it is started. The HTML file is very messy, of course, so I want to clean it with HtmlCleaner so that I can then parse it with Jsoup. But how do I get a new, cleaned HTML file after the cleaning step?
I did some research and this is all I could find:
HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode root = htmlCleaner.clean(url);
// rebuild a full document string: the root tag wrapped around the cleaned inner HTML
String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">";
But I can't see where in this code it writes to a new file. If it doesn't, how do I implement it so that the old file is deleted and a new, cleaned HTML file is created?

You can do something like the following:
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
final String siteUrl = "http://www.themoscowtimes.com/";
TagNode node = cleaner.clean(new URL(siteUrl));
// serialize to an XML file
new PrettyXmlSerializer(props).writeToFile(node, "cleaned.xml", "utf-8");
or
// serialize to an HTML file
SimpleHtmlSerializer serializer = new SimpleHtmlSerializer(cleaner.getProperties());
serializer.writeToFile(node, "c:/temp/cleaned.html");
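To answer the original question directly: the serializer is what writes the new file; nothing in HtmlCleaner deletes the old one, so you do that yourself with java.io.File. A minimal end-to-end sketch, where the file names are placeholders for wherever your app stores its download:
import java.io.File;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleHtmlSerializer;
import org.htmlcleaner.TagNode;

public class CleanAndReplace {
    public static void main(String[] args) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();

        // clean the messy file the app downloaded earlier (placeholder path)
        File dirtyFile = new File("downloaded.html");
        TagNode node = cleaner.clean(dirtyFile);

        // write the cleaned markup to a new file ...
        new SimpleHtmlSerializer(cleaner.getProperties())
                .writeToFile(node, "cleaned.html", "utf-8");

        // ... then delete the original, as the question asks
        dirtyFile.delete();
    }
}
The cleaned.html file can then be handed to Jsoup for parsing.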


Getting info from a webpage in java

Sorry if it's kind of a big question, but I'm just looking for someone to tell me in what direction to learn more, since I have no clue; I have very basic knowledge of HTML and Java.
Someone in my family has to copy every product from a supplier into his own webshop.
The problem is he needs to put in all the articles one by one by hand, so I'm looking for a way to replace him with a program.
I already have a bit going for the price calculation; all I need now is the product info.
http://pastebin.com/WVCy55Dj
From line 1009 to around 1030:
I need 3 separate strings from the three spans with the class "CatalogusListDetailTest".
From line 987 to around 1000:
I need a way to get all these images; they're on the website at www.flamingo.be/Images/Products/Large/"productID"(our first string).jpg.
Sometimes there's a _A or _B, as you can see in this example, so I'm looking for a way to check whether there is one and get those images as well.
If I could get this far I'd be very thankful! I'll figure the rest out myself. Sorry for the long post; I wanted to give as much info as possible.
You can look at the HTML parser library Jsoup; documentation reference: http://jsoup.org/cookbook/
EDIT: Code to get the product code:
Elements classElements = document.getElementsByClass("CatalogusListDetailTextTitel");
for (Element classElement : classElements) {
    if (classElement.text().contains("Productcode :")) {
        System.out.println(classElement.parent().ownText());
    }
}
Instead of document you may have to use a narrower element to get a consistent result; the code above will print all the product codes.
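For the _A/_B image variants asked about, one option is to probe each candidate URL with a HEAD request and download only the ones that exist. A rough sketch, assuming productId holds a product code printed by the loop above (the base URL is taken from the question):
String base = "http://www.flamingo.be/Images/Products/Large/" + productId;
for (String suffix : new String[] { "", "_A", "_B" }) {
    URL imageUrl = new URL(base + suffix + ".jpg");
    HttpURLConnection conn = (HttpURLConnection) imageUrl.openConnection();
    conn.setRequestMethod("HEAD"); // just check whether the image exists
    if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
        System.out.println("Found image: " + imageUrl);
        // download it here, e.g. with the NIO method in the next answer
    }
    conn.disconnect();
}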
You can use JTidy for what you need.
Code Example:
public void downloadSinglePage(String pageLink, String targetDir) throws XPathExpressionException, IOException {
    URL url = new URL(pageLink);
    BufferedInputStream page = new BufferedInputStream(url.openStream());

    // let JTidy turn the messy HTML into a DOM Document
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    Document response = tidy.parseDOM(page, null);

    // pull the image URL out of the DOM with XPath
    XPathFactory factory = XPathFactory.newInstance();
    XPath xPath = factory.newXPath();
    NodeList nodes = (NodeList) xPath.evaluate(IMAGE_PATTERN, response, XPathConstants.NODESET);
    String imageURL = nodes.item(0).getNodeValue();
    saveImageNIO(imageURL, targetDir, "image"); // file name argument added to match the method below
}
where
IMAGE_PATTERN = "//a/img/@src";
but the pattern depends on how the image is nested in the page's HTML code.
Method for saving the image using NIO:
public void saveImageNIO(String imageURL, String targetDir, String imageName) throws IOException {
    URL url = new URL(imageURL);
    // try-with-resources closes both the channel and the stream
    try (ReadableByteChannel rbc = Channels.newChannel(url.openStream());
         FileOutputStream fos = new FileOutputStream(targetDir + "/" + imageName + ".jpg")) {
        fos.getChannel().transferFrom(rbc, 0, 1 << 24); // copies up to 16 MB
    }
}
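A call might then look like this (URL and target directory are placeholders):
downloadSinglePage("http://www.flamingo.be/some-product-page", "/tmp/images");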

Android: Extracting the text between two HTML tags

I need to extract the text between two HTML tags and store it in a string. An example of the HTML I want to parse is as follows:
<div id=\"swiki.2.1\"> THE TEXT I NEED </div>
I have done this in Java using the pattern (swiki\.2\.1\\\")(.*)(\/div) and getting the string I want from group $2. However, this does not work on Android: when I try to print the contents of $2, nothing appears, because the match fails.
Has anyone had a similar problem using regex on Android, or is there a better (non-regex) way to parse the HTML page in the first place? Again, this works fine in a standard Java test program. Any help would be greatly appreciated!
For HTML-parsing stuff I always use HtmlCleaner: http://htmlcleaner.sourceforge.net/
Awesome lib that works great with XPath and, of course, Android. :-)
This shows how you can download a page from a URL and parse it to get a certain value from an attribute (also shown in the docs):
public static String snapFromHtmlWithCookies(Context context, String xPath, String attrToSnap, String urlString,
        String cookies) throws IOException, XPatherException {

    String snap = "";

    // create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();

    // take default cleaner properties
    CleanerProperties props = cleaner.getProperties();
    props.setAllowHtmlInsideAttributes(true);
    props.setAllowMultiWordAttributes(true);
    props.setRecognizeUnicodeChars(true);
    props.setOmitComments(true);

    URL url = new URL(urlString);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setDoOutput(true);

    // optional cookies
    connection.setRequestProperty(context.getString(R.string.cookie_prefix), cookies);
    connection.connect();

    // use the cleaner to "clean" the HTML and return it as a TagNode object
    TagNode root = cleaner.clean(new InputStreamReader(connection.getInputStream()));

    Object[] foundNodes = root.evaluateXPath(xPath);
    if (foundNodes.length > 0) {
        TagNode foundNode = (TagNode) foundNodes[0];
        snap = foundNode.getAttributeByName(attrToSnap);
    }
    return snap;
}
Just edit it for your needs. :-)
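Applied to the HTML in the question, the same TagNode API can pull out an element's text rather than an attribute. A minimal sketch, assuming htmlString holds the page source and the div's id is swiki.2.1 as shown above:
HtmlCleaner cleaner = new HtmlCleaner();
TagNode root = cleaner.clean(htmlString);

// find <div id="swiki.2.1"> anywhere in the tree (recursive, case-insensitive match)
TagNode div = root.findElementByAttValue("id", "swiki.2.1", true, false);
if (div != null) {
    String text = div.getText().toString().trim(); // "THE TEXT I NEED"
}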

call local image in drawable with html (WebView)

This question has been asked a few times in forums, but in my code I can't display my image. I think it's not the right method:
webViewContact.loadData(db.getParametres().get(0).getInformationParam(), "text/html", "utf-8");
getInformationParam() retrieves the HTML code, like:
<img src=\\"file:///android_asset/logoirdes_apropos.jpg\\"/> <b>Test</b>
My image file is in drawable; how can I display it?
There are restrictions on what HTML loaded via loadData() can do. I suggest using loadUrl() instead:
webViewContact.loadUrl("file:///android_asset/" + db.getParametres().get(0).getInformationParam());
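Alternatively, if you want to keep feeding the HTML string from the database to the WebView, loadDataWithBaseURL() lets relative image references resolve against a base URL. A sketch, assuming the image file is placed in the assets folder (as the file:///android_asset/ path in the question implies):
String html = db.getParametres().get(0).getInformationParam();
// the base URL makes <img src="logoirdes_apropos.jpg"> resolve inside assets/
webViewContact.loadDataWithBaseURL("file:///android_asset/", html,
        "text/html", "utf-8", null);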
You can try the following code; your file will be at htmlFile. You can certainly do it on the UI thread for now, but you might consider moving this to an AsyncTask later in real production if the file is huge.
File directory = Environment.getExternalStoragePublicDirectory("html_cache");
Writer output = null;
try {
    directory.mkdir();
    File htmlFile = new File(directory, "give_a_name.html");
    String content = db.getParametres().get(0).getInformationParam();
    // assumes the default encoding is OK!
    output = new BufferedWriter(new FileWriter(htmlFile));
    output.write(content);
} finally {
    if (output != null) {
        output.close();
    }
}
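Once the file is written, you can point the WebView at it; a one-liner, assuming the write above succeeded:
webViewContact.loadUrl("file://" + htmlFile.getAbsolutePath());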

Programmatically load data into solr using solrj and java

How can I load data from an xml file into solr using the solrj API?
Thanks Pascal. I misworded my question; I'm actually using Groovy. But in any event your approach does work. This was my solution:
CommonsHttpSolrServer server = SolrServerSingleton.getInstance().getServer();
def dataDir = System.getProperty("user.dir");
File xmlFile = new File(dataDir + "/book.xml");
def xml = xmlFile.getText(); // read the whole XML file into a string
DirectXmlRequest xmlreq = new DirectXmlRequest("/update", xml);
server.request(xmlreq);
server.commit();
The first argument to DirectXmlRequest is a URL path and must be "/update"; the variable xml is a String containing the XML, for example:
<add>
  <doc>
    <field name="title">blah</field>
  </doc>
</add>
With Java 6, you can use XPath to fetch what you need from your XML file. Then you populate a SolrInputDocument from what you extracted from the XML. When that document contains everything you need, you submit it to Solr using the add method of SolrServer.
SolrClient client = new HttpSolrClient("http://localhost:8983/solr/jiva/");
String dataDir = System.getProperty("user.dir");
File xmlFile = new File(dataDir + "/Alovera-Juice.xml");
if (xmlFile.exists()) {
    InputStream is = new FileInputStream(xmlFile);
    String str = IOUtils.toString(is); // read the whole file into a String
    is.close();
    DirectXmlRequest dxr = new DirectXmlRequest("/update", str);
    client.request(dxr);
    client.commit();
}
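If you go the SolrInputDocument route described above instead of posting raw XML, the shape is roughly this (field names and values are placeholders for whatever your XPath extraction yields):
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "book-1");   // placeholder values, extracted
doc.addField("title", "blah");  // from the XML via XPath
client.add(doc);
client.commit();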

Get all Images from WebPage Program | Java

Currently I need a program that, given a URL, returns a list of all the images on the webpage, e.g.:
logo.png
gallery1.jpg
test.gif
Is there any open source software available before I try to code something? The language should be Java. Thanks,
Philip
Just use a simple HTML parser like jTidy, then get all elements by tag name img and collect the src attribute of each in a List<String> or maybe List<URI>.
You can obtain an InputStream of a URL using URL#openStream() and then feed it to any HTML parser you like. Here's a kickoff example:
InputStream input = new URL("http://www.stackoverflow.com").openStream();
Document document = new Tidy().parseDOM(input, null);
NodeList imgs = document.getElementsByTagName("img");

List<String> srcs = new ArrayList<String>();
for (int i = 0; i < imgs.getLength(); i++) {
    srcs.add(imgs.item(i).getAttributes().getNamedItem("src").getNodeValue());
}

for (String src : srcs) {
    System.out.println(src);
}
I must admit, however, that HtmlUnit, as suggested by Bozho, indeed looks better.
HtmlUnit has HtmlPage.getElementsByTagName("img"), which will probably suit you.
(read the short Get started guide to see how to obtain the correct HtmlPage object)
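A minimal HtmlUnit sketch along those lines (assuming a recent HtmlUnit version where WebClient is AutoCloseable; error handling omitted):
try (WebClient webClient = new WebClient()) {
    HtmlPage page = webClient.getPage("http://www.stackoverflow.com");
    for (DomElement img : page.getElementsByTagName("img")) {
        System.out.println(img.getAttribute("src"));
    }
}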
This is dead simple with HTML Parser (and any other decent HTML parser):
Parser parser = new Parser("http://www.yahoo.com/");
NodeList list = parser.parse(new TagNameFilter("IMG"));
for (SimpleNodeIterator iterator = list.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    System.out.println(tag.getAttribute("src"));
}
You can also use wget, which has a lot of options available, or search for a Java wget port.
You can parse the HTML and collect all src attributes of img elements in a Collection, then download each resource from its URL and write it to a file. For parsing, there are several HTML parsers available; Cobra is one of them.
With Open Graph tags and HTML Parser, you can extract your data really easily (PageMeta is a simple POJO holding the results):
Parser parser = new Parser(url);
PageMeta pageMeta = new PageMeta();
pageMeta.setUrl(url);

NodeList meta = parser.parse(new TagNameFilter("meta"));
for (SimpleNodeIterator iterator = meta.elements(); iterator.hasMoreNodes(); ) {
    Tag tag = (Tag) iterator.nextNode();
    if ("og:image".equals(tag.getAttribute("property"))) {
        pageMeta.setImageUrl(tag.getAttribute("content"));
    }
    if ("og:title".equals(tag.getAttribute("property"))) {
        pageMeta.setTitle(tag.getAttribute("content"));
    }
    if ("og:description".equals(tag.getAttribute("property"))) {
        pageMeta.setDescription(tag.getAttribute("content"));
    }
}
You can also simply use a regular expression in Java. Given this HTML:
<html>
  <body>
    <p>
      <img src="38220.png" alt="test" title="test" />
      <img src="32222.png" alt="test" title="test" />
    </p>
  </body>
</html>
String s = "html"; // placeholder for the HTML content above
Pattern p = Pattern.compile("<img [^>]*src=[\"']([^\"']*)");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group(1)); // group 1 captures the src value directly
}
