xpath: write to a file - java

I'm developing Java code to get data from a website and store it in a file. I want to store the result of the XPath into a file. Is there any way to save the output of the XPath? Please forgive any mistakes; this is my first question.
public class TestScrapping {

    public static void main(String[] args) throws MalformedURLException, IOException, XPatherException {
        // URL to be fetched; in the URL below you can replace s=cantabil with the company of your choice
        String url_fetch = "http://www.yahoo.com";
        // TagNode object to traverse the XML using XPath
        TagNode node;
        // XPath of the data to be fetched; use Firefox's FirePath add-on or Firebug to find the required XPath.
        // The XPath below selects the title of the company you queried for.
        String name_xpath = "//div[1]/div[2]/div[2]/div[1]/div/div/div/div/table/tbody/tr[1]/td[2]/text()";
        // declarations related to the API
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = new CleanerProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);
        // create the URL object and open the connection
        URL url = new URL(url_fetch);
        URLConnection conn = url.openConnection();
        // read the input stream and clean the page
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));
        // store the nodes matching the given XPath
        Object[] info_nodes = node.evaluateXPath(name_xpath);
        // check whether anything was returned; if the XPath is invalid, info_nodes.length == 0
        if (info_nodes.length > 0) {
            for (int i = 0; i < info_nodes.length; i++) {
                System.out.println(info_nodes[i]);
            }
        }
    }
}

You can "simply" print the nodes as strings, to console/or a file --
example in Perl:
my $all = $XML_OBJ->find('/');   # selecting all nodes from the root
foreach my $node ($all->get_nodelist()) {
    print XML::XPath::XMLParser::as_string($node);
}
Note: this output may not be nicely XML-formatted/indented, however.

The output of an XPath evaluation in Java is a node set, so yes, once you have a node set you can do anything you want with it: save it to a file, process it some more, and so on.
Saving it to a file involves the same steps in Java that saving anything else to a file involves; there is no difference between that and any other data. Select the node set, iterate through it, get the parts you want from it, and write them to some kind of file stream.
However, if you mean is there a Nodeset.SaveToFile(), then no.
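A minimal sketch of that approach, reusing the info_nodes array from the question's HtmlCleaner code (the output file name is an assumption):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

// assumes info_nodes was returned by node.evaluateXPath(name_xpath) as in the question
try (BufferedWriter writer = new BufferedWriter(new FileWriter("xpath-output.txt"))) {
    for (Object info_node : info_nodes) {
        // each entry's toString() yields the node's text or markup
        writer.write(info_node.toString());
        writer.newLine();
    }
} catch (IOException e) {
    e.printStackTrace();
}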

I would recommend taking the NodeSet, which is a collection of Nodes, iterating over it, and adding the nodes to a newly created DOM Document object.
After this, you can use TransformerFactory to get a Transformer object and call its transform method. You should transform from a DOMSource to a StreamResult object, which can be created on top of a FileOutputStream.
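A minimal sketch of that approach (the root element name and output file name are assumptions):

import java.io.FileOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class SaveNodesToFile {
    public static void main(String[] args) throws Exception {
        // build a new DOM document and copy the selected nodes into it
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("results");
        doc.appendChild(root);
        // for each Node n in your node set: root.appendChild(doc.importNode(n, true));

        // serialize the document to a file via a Transformer
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        try (FileOutputStream fos = new FileOutputStream("output.xml")) {
            transformer.transform(new DOMSource(doc), new StreamResult(fos));
        }
    }
}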

Related

IBM Integration Bus Java Compute Node: output a w3c.dom.Document or String

I have been working on a Java module to transform XMLs for the last few months. It is supposed to take a SOAP request and fill the soap:header element with additional elements from a metadata repository, for example. The module should be universally implementable in any middleware (my native system is SAP PI).
Now I am tasked with implementing this module as a JAR in a JavaCompute node in IBM Integration Bus. The problem is that to export the resulting XML I need to get the data into the outMessage of the JavaCompute node. However, I did not find a way to convert an org.w3c.dom.Document to an MbElement, or to insert the Document or its content into the MbElement.
Actually, I did not see a way to put anything in there at all (not even an XML string) without using the IBM API as intended, so I would have to write code that reads my already finished Document and builds an MbElement from it.
This looks like the following:
public void evaluate(MbMessageAssembly inAssembly) throws MbException {
    MbOutputTerminal out = getOutputTerminal("out");
    MbOutputTerminal alt = getOutputTerminal("alternate");

    MbMessage inMessage = inAssembly.getMessage();
    // create new empty message
    MbMessage outMessage = new MbMessage();
    MbMessageAssembly outAssembly = new MbMessageAssembly(inAssembly, outMessage);

    try {
        // optionally copy message headers
        // copyMessageHeaders(inMessage, outMessage);
        // ----------------------------------------------------------
        // Add user code below

        // create an example output Document
        String outputContent = "<element><subelement>Value</subelement></element>";
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        InputSource is = new InputSource();
        is.setCharacterStream(new StringReader(outputContent));
        Document outDocument = db.parse(is);

        // Get the Document or its content into the outRoot or outMessage somehow.
        MbElement outRoot = outMessage.getRootElement();
        // Start to iterate over the Document and use methods like this to build up the MbElement?
        MbElement outBody = outRoot.createElementAsLastChild("request");

        // End of user code
    } catch (MbException e) { ...
You can serialize your org.w3c.dom.Document to a byte array (a sketch follows below the code). Then you can use the following code:
MbMessage outMessage = new MbMessage();
// copy message headers if required
MbElement oRoot = outMessage.getRootElement();
MbElement oBody = oRoot.createElementAsLastChild(MbBLOB.PARSER_NAME);
oBody.createElementAsLastChild(MbElement.TYPE_NAME_VALUE, "BLOB", yourXmlAsByteArray);
MbMessageAssembly outAssembly = new MbMessageAssembly(inAssembly, inAssembly.getLocalEnvironment(),
        inAssembly.getExceptionList(), outMessage);
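A minimal sketch of the Document-to-byte-array step using the standard JAXP Transformer (the helper name documentToByteArray is hypothetical):

import java.io.ByteArrayOutputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// hypothetical helper: serializes a DOM Document to a byte array
public static byte[] documentToByteArray(Document doc) throws TransformerException {
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    transformer.transform(new DOMSource(doc), new StreamResult(baos));
    return baos.toByteArray();
}

The returned array can then be passed as yourXmlAsByteArray in the snippet above.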

JDOM Transformer - don't contract empty elements

I'm using JDOM 2.0.6 to transform XML into HTML with an XSLT, but I'm coming across the following problem: sometimes the data will be empty, that is, I'll have the following in my XSLT:
<div class="someclass"><xsl:value-of select="somevalue"/></div>
and when somevalue is empty, the output I get is:
<div class="someclass"/>
which may be perfectly valid XML, but is not valid HTML, and causes problems when displaying the resulting page.
Similar problems occur for <span> or <script> tags.
So my question is - how can I tell JDOM not to contract empty elements, and leave them as <div></div>?
Edit
I suspect the problem is not in the actual XSLTTransformer, but later, when using JDOM to write the HTML. Here is the code I use:
XMLOutputter htmlout = new XMLOutputter(Format.getPrettyFormat());
htmlout.getFormat().setEncoding("UTF-8");
Document htmlDoc = transformer.transform(model);
htmlDoc.setDocType(new DocType("html"));
try (OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outHtml), "UTF-8")) {
    htmlout.output(htmlDoc, osw);
}
Currently the proposed solution of adding a zero-width space works for me, but I'm interested to know if there is a way to tell JDOM to treat the document as HTML (be it in the transform stage or the output stage, but I'm guessing the problem lies in the output stage).
You can use a zero-width space inside the element. This doesn't affect the HTML output, but it keeps the opening and closing tags separated because the element has non-empty content.
<div class="someclass">​<xsl:value-of select="somevalue"/></div>
The downside is: the tag is not really empty anymore. That would matter if your output were XML, but for HTML, which is probably the last stage of processing, it should not matter.
In your case, the XML transform is happening directly to a file/stream, and it is no longer in the control of JDOM.
In JDOM, you can select whether the output of a JDOM document uses expanded or contracted empty elements. Typically, people produce output from JDOM like this:
XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
xout.output(document, System.out);
You can modify the output format, though, and expand the empty elements:
Format expanded = Format.getPrettyFormat().setExpandEmptyElements(true);
XMLOutputter xout = new XMLOutputter(expanded);
xout.output(document, System.out);
If you 'recover' the XSLT-transformed XML (assuming it is valid XHTML?) as a new JDOM document, you can output the result with expanded empty elements.
If you want to transform to an HTML file, then consider using a JAXP Transformer with a JDOMSource and a StreamResult; the Transformer will serialize the transformation result as HTML if the output method is html (either as set in your code or as implied by a no-namespace root element named html), as sketched below.
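A minimal sketch of that approach, assuming JDOM 2 and a stylesheet file named stylesheet.xsl:

import java.io.File;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.jdom2.Document;
import org.jdom2.transform.JDOMSource;

public class HtmlTransform {
    // transforms a JDOM document to an HTML file via JAXP
    public static void transformToHtml(Document jdomDoc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("stylesheet.xsl")));
        // force HTML serialization so empty elements like <div></div> are not contracted
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(new JDOMSource(jdomDoc), new StreamResult(new File("out.html")));
    }
}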
In addition to the "expandEmptyElements" option, you could create your own writer and pass it to the XMLOutputter:
XMLOutputter outputter = new XMLOutputter(Format.getPrettyFormat().setExpandEmptyElements(true));
StringWriter writer = new HTML5Writer();
outputter.output(document, writer);
System.out.println(writer.toString());
This writer then rewrites all HTML5 void elements; elements like "script", for example, won't be touched:
private static class HTML5Writer extends StringWriter {

    private static String[] VOIDELEMENTS = new String[] { "area", "base", "br", "col", "command", "embed",
            "hr", "img", "input", "keygen", "link", "meta", "param", "source", "track", "wbr" };

    private boolean inVoidTag;
    private StringBuffer voidTagBuffer;

    public void write(String str) {
        if (voidTagBuffer != null) {
            if (str.equals("></")) {
                // self-close the void element instead of expanding it
                voidTagBuffer.append(" />");
                super.write(voidTagBuffer.toString());
                voidTagBuffer = null;
            } else {
                voidTagBuffer.append(str);
            }
        } else if (inVoidTag) {
            if (str.equals(">")) {
                inVoidTag = false;
            }
        } else {
            for (int i = 0; i < VOIDELEMENTS.length; i++) {
                if (str.equals(VOIDELEMENTS[i])) {
                    inVoidTag = true;
                    voidTagBuffer = new StringBuffer(str);
                    return;
                }
            }
            super.write(str);
        }
    }
}
I know, this is dirty, but I had the same problem and didn't find any other way.

Getting info from a webpage in java

Sorry if it's kind of a big question, but I'm just looking for someone to tell me in what direction to learn more, since I have no clue; I have very basic knowledge of HTML and Java.
Someone in my family has to copy every product from a supplier into his own webshop.
The problem is that he needs to put in all the articles one by one, by hand; I'm looking for a way to replace him with a program.
I've already got a bit going for the price calculation; all I need now is the product info.
http://pastebin.com/WVCy55Dj
From line 1009 to around 1030.
I need 3 separate strings of the three spans with the class "CatalogusListDetailTest".
From line 987 to around 1000.
I need a way to get all these images; they're on the website at www.flamingo.be/Images/Products/Large/"productID"(our first string).jpg
Sometimes there's an _A or _B suffix, as you can see in this example, so I'm looking for a way to check whether those exist and get these images as well.
If I could get this far I'd be very thankful! I'll figure the rest out myself. Sorry for the long post; I wanted to give as much info as possible.
You can look at the HTML parser library Jsoup; documentation reference: http://jsoup.org/cookbook/
EDIT: Code to get the product code:
Elements classElements = document.getElementsByClass("CatalogusListDetailTextTitel");
for (Element classElement : classElements) {
    if (classElement.text().contains("Productcode :")) {
        System.out.println(classElement.parent().ownText());
    }
}
Instead of document you may have to use an element to get consistent results; the above code will print all the product codes.
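The question also asks how to check whether the _A/_B image variants exist. A minimal sketch using an HTTP HEAD request (the URL pattern is taken from the question; the method name imageExists is hypothetical):

import java.net.HttpURLConnection;
import java.net.URL;

// returns true if the image URL responds with HTTP 200
public static boolean imageExists(String productId, String suffix) {
    try {
        URL url = new URL("http://www.flamingo.be/Images/Products/Large/" + productId + suffix + ".jpg");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // HEAD avoids downloading the image body just to test for existence
        conn.setRequestMethod("HEAD");
        return conn.getResponseCode() == HttpURLConnection.HTTP_OK;
    } catch (Exception e) {
        return false;
    }
}

For a product ID p, the caller would try imageExists(p, ""), imageExists(p, "_A"), imageExists(p, "_B"), and download whichever variants exist.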
You can use JTidy for what you need.
Code Example:
public void downloadSinglePage(String pageLink, String targetDir) throws XPathExpressionException, IOException {
    URL url = new URL(pageLink);
    BufferedInputStream page = new BufferedInputStream(url.openStream());

    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    Document response = tidy.parseDOM(page, null);

    XPathFactory factory = XPathFactory.newInstance();
    XPath xPath = factory.newXPath();
    NodeList nodes = (NodeList) xPath.evaluate(IMAGE_PATTERN, response, XPathConstants.NODESET);
    String imageURL = nodes.item(0).getNodeValue();
    saveImageNIO(imageURL, targetDir, "image"); // an image name is needed to match saveImageNIO's signature below
}
where
IMAGE_PATTERN = "//a/img/@src";
but the pattern depends on how the image is nested in the page's HTML code.
Method for saving Image using NIO:
public void saveImageNIO(String imageURL, String targetDir, String imageName) throws IOException {
    URL url = new URL(imageURL);
    try (ReadableByteChannel rbc = Channels.newChannel(url.openStream());
            FileOutputStream fos = new FileOutputStream(targetDir + "/" + imageName + ".jpg")) {
        // transfer up to 16 MB from the channel into the file
        fos.getChannel().transferFrom(rbc, 0, 1 << 24);
    }
}

java data structure to replace file io

My program goes to my uni's results page, finds all the links, and saves them to a file. Then I read the file and copy only the lines which contain required links into another file. Then I parse that file again to extract the required data.
public class net {

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://jntuconnect.net/results_archive/").get();
        Elements links = doc.select("a");

        File f1 = new File("flink.txt");
        File f2 = new File("rlink.txt");

        // write extracted links to the f1 file
        FileUtils.writeLines(f1, links);

        // store each link from the f1 file in a string list
        List<String> linklist = FileUtils.readLines(f1);
        // second string list to store only the required link elements
        List<String> rlinklist = new ArrayList<String>();

        // loop which finds the required links and stores them in rlinklist
        for (String elem : linklist) {
            if (elem.contains("B.Tech") && (elem.contains("R07") || elem.contains("R09"))) {
                rlinklist.add(elem);
            }
        }

        // store the required links in the f2 file
        FileUtils.writeLines(f2, rlinklist);

        // parse links from the f2 file
        Document rdoc = Jsoup.parse(f2, null);
        Elements rlinks = rdoc.select("a");

        // for storing hrefs and link text
        List<String> rhref = new ArrayList<String>();
        List<String> rtext = new ArrayList<String>();
        for (Element rlink : rlinks) {
            rhref.add(rlink.attr("href"));
            rtext.add(rlink.text());
        }
    } // end main
}
I don't want to create files to do this. Is there a better way to get the hrefs and link texts of only specific URLs without creating files?
It uses Apache Commons FileUtils and Jsoup.
Here's how you can get rid of the first file write/read:
Elements links = doc.select("a");
List<String> linklist = new ArrayList<String>();
for (Element elt : links) {
    linklist.add(elt.toString());
}
The second round trip, if I understand the code, is intended to extract the links that meet a certain test. You can just do that in memory using the same technique.
I see you're relying on Jsoup.parse to extract the href and link text from the selected links. You can do that in memory by writing the selected nodes to a StringBuffer, converting it to a String by calling its toString() method, and then using one of the Jsoup.parse methods that takes a String instead of a File argument.
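Putting it together, a minimal sketch of the fully in-memory version (same filtering logic as the question; the class name is an assumption):

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NetInMemory {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://jntuconnect.net/results_archive/").get();

        // select the links and filter them in memory; no intermediate file
        StringBuffer selected = new StringBuffer();
        for (Element link : doc.select("a")) {
            String elem = link.toString();
            if (elem.contains("B.Tech") && (elem.contains("R07") || elem.contains("R09"))) {
                selected.append(elem);
            }
        }

        // re-parse the selected markup from a String instead of a file
        Document rdoc = Jsoup.parse(selected.toString());
        List<String> rhref = new ArrayList<String>();
        List<String> rtext = new ArrayList<String>();
        for (Element rlink : rdoc.select("a")) {
            rhref.add(rlink.attr("href"));
            rtext.add(rlink.text());
        }
    }
}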

Android: Extracting the text between two HTML tags

I need to extract the text between two HTML tags and store it in a string. An example of the HTML I want to parse is as follows:
<div id=\"swiki.2.1\"> THE TEXT I NEED </div>
I have done this in Java using the pattern (swiki\.2\.1\\\")(.*)(\/div) and getting the string I want from group $2. However, this will not work on Android. When I go to print the contents of $2, nothing appears, because the match fails.
Has anyone had a similar problem with using regex on Android, or is there a better way (non-regex) to parse the HTML page in the first place? Again, this works fine in a standard Java test program. Any help would be greatly appreciated!
For HTML parsing stuff I always use HtmlCleaner: http://htmlcleaner.sourceforge.net/
An awesome lib that works great with XPath and, of course, Android. :-)
This shows how you can download XML from a URL and parse it to get a certain value from an XML attribute (also shown in the docs):
public static String snapFromHtmlWithCookies(Context context, String xPath, String attrToSnap, String urlString,
        String cookies) throws IOException, XPatherException {
    String snap = "";

    // create an instance of HtmlCleaner
    HtmlCleaner cleaner = new HtmlCleaner();

    // take default cleaner properties
    CleanerProperties props = cleaner.getProperties();
    props.setAllowHtmlInsideAttributes(true);
    props.setAllowMultiWordAttributes(true);
    props.setRecognizeUnicodeChars(true);
    props.setOmitComments(true);

    URL url = new URL(urlString);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setDoOutput(true);

    // optional cookies
    connection.setRequestProperty(context.getString(R.string.cookie_prefix), cookies);
    connection.connect();

    // use the cleaner to "clean" the HTML and return it as a TagNode object
    TagNode root = cleaner.clean(new InputStreamReader(connection.getInputStream()));
    Object[] foundNodes = root.evaluateXPath(xPath);
    if (foundNodes.length > 0) {
        TagNode foundNode = (TagNode) foundNodes[0];
        snap = foundNode.getAttributeByName(attrToSnap);
    }
    return snap;
}
Just edit it for your needs. :-)
