I am using Groovy, so a Java implementation would also be fine.
I have
"""<TextFlow fontFamily="Arial" fontSize="20"><span>before</span>Less than 7 days<span>after</span></TextFlow>"""
I would like to wrap the first-level text node in a tag, so that I get
"""<TextFlow fontFamily="Arial" fontSize="20"><span>before</span><span>Less than 7 days</span><span>after</span></TextFlow>"""
I have looked into XmlSlurper, which doesn't deal with text nodes. I have also looked into XmlParser, which can handle text nodes, but I am not sure how to replace a text node with an XML element. Please advise.
This worked for me; hope it helps someone else.
@Grab('org.jdom:jdom2:2.0.5')
@Grab('jaxen:jaxen:1.1.4')
@GrabExclude('jdom:jdom')
import org.jdom2.*
import org.jdom2.input.*
import org.jdom2.xpath.*
import org.jdom2.output.*

def xml = """<TextFlow fontFamily="Arial" fontSize="20"><span>before</span>Less than 7 days<span>after</span></TextFlow>"""
Document doc = new SAXBuilder().build(new StringReader(xml))

// Select the first-level text nodes under TextFlow
def textNodes = XPathFactory.instance().compile('//TextFlow/text()').evaluate(doc)
for (def textNode in textNodes) {
    // Replace the text node in place with a <span> wrapping the same text
    int pos = textNode.parent.content.indexOf(textNode)
    Element span = new Element("span")
    span.text = textNode.text
    textNode.parent.setContent(pos, span)
}

new XMLOutputter().with {
    format = Format.getRawFormat()
    format.setLineSeparator(LineSeparator.NONE)
    // XMLOutputter can write to an OutputStream or Writer, which is sufficient for most cases
    output(doc, System.out)
}
I have the following issue (as everybody does, it seems): I want to replace some items with others in a Word doc.
The issue with the issue is that the doc contains headers and footers which are part of the POIFSFileSystem (I know this because reading the FS and writing the doc back, without any changes, loses this information, whereas reading the FS and writing it back as a new file doesn't).
Currently I do this:
POIFSFileSystem pfs = new POIFSFileSystem(fis);
HWPFDocument document = new HWPFDocument(pfs);
Range r1 = document.getRange();
…
document.write();
ByteArrayOutputStream bos = new ByteArrayOutputStream(50000);
pfs.writeFilesystem(bos);
pfs.close();
However, this fails with the following error:
Opened read-only or via an InputStream, a Writeable File is required
If I don't rewrite the document, it works fine, but my changes are lost.
The other way around: if I only save the document, not the filesystem, I lose the header/footer.
Now the problem is, how can I update the document while "saving as" the entire filesystem, or is there a way to force the document to contain everything from the file system?
The HWPF stuff is still in the scratchpad because the binary DOC file format is the most horrible of all the horrible formats. So it will probably never be fully ready and will be buggy in many cases.
But in your special case, your observations are not reproducible. Using Apache POI 4.0.1, the HWPFDocument contains the header story, which also contains the footer stories, after being created from a *.doc file. So the following works for me:
Code:
import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.poi.hwpf.*;
import org.apache.poi.hwpf.usermodel.*;

public class ReadAndWriteDOCWithHeaderFooter {

    public static void main(String[] args) throws Exception {
        HWPFDocument document = new HWPFDocument(new FileInputStream("TemplateDOCWithHeaderFooter.doc"));

        Range bodyRange = document.getRange();
        System.out.println(bodyRange);
        for (int p = 0; p < bodyRange.numParagraphs(); p++) {
            System.out.println(bodyRange.getParagraph(p).text());
            if (bodyRange.getParagraph(p).text().contains("<<NAME>>"))
                bodyRange.getParagraph(p).replaceText("<<NAME>>", "Axel Richter");
            if (bodyRange.getParagraph(p).text().contains("<<DATE>>"))
                bodyRange.getParagraph(p).replaceText("<<DATE>>", "12/21/1964");
            if (bodyRange.getParagraph(p).text().contains("<<AMOUNT>>"))
                bodyRange.getParagraph(p).replaceText("<<AMOUNT>>", "1,234.56");
            System.out.println(bodyRange.getParagraph(p).text());
        }

        System.out.println("==============================================================================");

        Range overallRange = document.getOverallRange();
        System.out.println(overallRange);
        for (int p = 0; p < overallRange.numParagraphs(); p++) {
            System.out.println(overallRange.getParagraph(p).text()); // contains everything, including headers and footers
        }

        FileOutputStream out = new FileOutputStream("ResultDOCWithHeaderFooter.doc");
        document.write(out);
        out.close();
        document.close();
    }
}
So please check it again and tell us exactly what is not working for you. Because we need to reproduce it, please provide a minimal, complete, and verifiable example, as I have done with my code.
I'm trying to do something fairly simple: read an I-9 PDF form from an incoming FlowFile, parse the first and last name out of it into JSON, then output the JSON to the outgoing FlowFile.
I found no official documentation on how to do this, but someone has written up several cookbooks on doing things in several scripting languages in NiFi here. It seems pretty straightforward, and I'm pretty sure I'm doing what is written there, but I'm not even sure the PDF is being read at all. It simply passes the PDF unmodified out to REL_SUCCESS every time.
Link to sample PDF
import java.awt.Rectangle
import java.nio.charset.StandardCharsets

import org.apache.nifi.processor.io.StreamCallback
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.PDPage
import org.apache.pdfbox.util.PDFTextStripperByArea

import com.google.gson.Gson

def flowFile = session.get()
flowFile = session.write(flowFile, { inputStream, outputStream ->
    try {
        // Load the FlowFile contents
        PDDocument document = PDDocument.load(inputStream)
        // Get the first page
        List<PDPage> allPages = document.getDocumentCatalog().getAllPages()
        PDPage page = allPages.get(0)
        // Define the areas to search and add them as search regions
        PDFTextStripperByArea stripper = new PDFTextStripperByArea()
        Rectangle lname = new Rectangle(25, 226, 240, 15)
        stripper.addRegion("lname", lname)
        Rectangle fname = new Rectangle(276, 226, 240, 15)
        stripper.addRegion("fname", fname)
        // Load the results into a map
        def boxMap = [:]
        stripper.setSortByPosition(true)
        stripper.extractRegions(page)
        def regions = stripper.getRegions()
        for (String region : regions) {
            String box = stripper.getTextForRegion(region)
            boxMap.put(region, box)
        }
        Gson gson = new Gson()
        // Remove random noise from the output
        def json = gson.toJson(boxMap, LinkedHashMap.class)
        json = json.replace('\\n', '')
        json = json.replace('\\r', '')
        json = json.replace(',"', ',\n"')
        // Overwrite the FlowFile contents with the JSON
        outputStream.write(json.getBytes(StandardCharsets.UTF_8))
    } catch (Exception e) {
        System.out.println(e.getMessage())
        session.transfer(flowFile, REL_FAILURE)
    }
} as StreamCallback)
session.transfer(flowFile, REL_SUCCESS)
EDIT:
I was able to confirm that the flowFile object is being read properly by subbing in a txt file. So the problem seems to be that the inputStream is never handed off to the PDDocument, or something happens when it is. I edited the code to try reading it into a File object first, but that resulted in an error:
FlowFileHandlingException: null is not known in this session
EDIT 2:
Solved by moving my try/catch. I don't quite understand why that works, but my code above has been edited and now works properly.
session.get() can return null, so definitely add a line after it: if (!flowFile) return. Also put the try/catch outside the session.write(); that way you can put the session.transfer(flowFile, REL_SUCCESS) after the session.write() (inside the try), and the catch can transfer to failure.
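Something like this minimal sketch of that structure (the PDF handling itself is elided, it assumes the same ExecuteScript bindings and imports as the script above, and the log message is just illustrative):

def flowFile = session.get()
if (!flowFile) return // session.get() can return null, e.g. when the queue is empty

try {
    flowFile = session.write(flowFile, { inputStream, outputStream ->
        // ... PDFBox region extraction and JSON writing, as in the question ...
    } as StreamCallback)
    session.transfer(flowFile, REL_SUCCESS)
} catch (Exception e) {
    log.error('Failed to extract PDF form fields', e)
    session.transfer(flowFile, REL_FAILURE)
}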
Also, I can't tell from the code how the PDFTextStripperByArea is meant to get its info from the incoming document. It looks like all the document stuff is inside the try, so it wouldn't be available to the PDFTextStripper (and isn't passed in).
None of these things explain why you're getting the original flow file on the success relationship, but maybe there's something I'm not seeing that would be magically fixed by the changes above :)
Also, if you use log.info() or log.error() rather than System.out.println(), you will see the output in the NiFi logs; for errors it will also post a bulletin to the processor, and you can see the message by hovering over the top-right corner of the processor (a red square appears there when a bulletin is present).
I'm using JDOM 2.0.6 to transform XML into HTML with XSLT, but I'm coming across the following problem: sometimes the data is empty, that is, I'll have the following in my XSLT:
<div class="someclass"><xsl:value-of select="somevalue"/></div>
and when somevalue is empty, the output I get is:
<div class="someclass"/>
which may be perfectly valid XML, but is not valid HTML, and causes problems when displaying the resulting page.
Similar problems occur for <span> or <script> tags.
So my question is - how can I tell JDOM not to contract empty elements, and leave them as <div></div>?
Edit
I suspect the problem is not in the actual XSLTTransformer, but later, when using JDOM to write the HTML. Here is the code I use:
XMLOutputter htmlout = new XMLOutputter(Format.getPrettyFormat());
htmlout.getFormat().setEncoding("UTF-8");
Document htmlDoc = transformer.transform(model);
htmlDoc.setDocType(new DocType("html"));
try (OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(outHtml), "UTF-8")) {
htmlout.output(htmlDoc, osw);
}
Currently the proposed solution of adding a zero-width space works for me, but I'm interested to know whether there is a way to tell JDOM to treat the document as HTML (be it in the transform stage or the output stage, though I'm guessing the problem lies in the output stage).
You can use a zero-width space between the elements. This doesn't affect the HTML output, but it keeps the open and close tags separated because the element has non-empty content (the zero-width space is written below as the explicit character reference &#8203;, since the character itself is invisible):
<div class="someclass">&#8203;<xsl:value-of select="somevalue"/></div>
The downside is that the tag is not really empty anymore. That would matter if your output were XML, but for HTML, which is probably the last stage of processing, it should not matter.
In your case, the XSLT transform is happening directly to a file/stream, so the result is no longer under the control of JDOM.
In JDOM, you can select whether the output from a JDOM document has expanded or non-expanded empty elements. Typically, people produce output from JDOM like this:
XMLOutputter xout = new XMLOutputter(Format.getPrettyFormat());
xout.output(document, System.out);
You can modify the output format, though, and expand the empty elements:
Format expanded = Format.getPrettyFormat().setExpandEmptyElements(true);
XMLOutputter xout = new XMLOutputter(expanded);
xout.output(document, System.out);
If you 'recover' the XSLT-transformed XML (assuming it is valid XHTML?) as a new JDOM document, you can output the result with expanded empty elements.
If you want to transform to an HTML file, then consider using a JAXP Transformer with a JDOMSource and a StreamResult; the Transformer will serialize the transformation result as HTML if the output method is html (either as set in your code, or as triggered by a no-namespace root element named html).
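For illustration, here is a minimal Groovy sketch of that JAXP route, reusing the model document from the question (the stylesheet and output file names are placeholders):

import javax.xml.transform.OutputKeys
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.StreamResult
import javax.xml.transform.stream.StreamSource

import org.jdom2.transform.JDOMSource

def transformer = TransformerFactory.newInstance()
        .newTransformer(new StreamSource(new File('stylesheet.xsl')))
// Ask the serializer for HTML output, so empty elements are not contracted
transformer.setOutputProperty(OutputKeys.METHOD, 'html')
transformer.transform(new JDOMSource(model), new StreamResult(new File('out.html')))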
In addition to the "expandEmptyElements" option, you could create your own writer and pass it to the XMLOutputter:
XMLOutputter outputter = new XMLOutputter(Format.getPrettyFormat().setExpandEmptyElements(true));
StringWriter writer = new HTML5Writer();
outputter.output(document, writer);
System.out.println(writer.toString());
This writer can then modify all HTML5 void elements. Elements like "script", for example, won't be touched:
private static class HTML5Writer extends StringWriter {

    private static String[] VOIDELEMENTS = new String[] { "area", "base", "br", "col", "command", "embed", "hr",
            "img", "input", "keygen", "link", "meta", "param", "source", "track", "wbr" };

    private boolean inVoidTag;
    private StringBuffer voidTagBuffer;

    public void write(String str) {
        if (voidTagBuffer != null) {
            if (str.equals("></")) {
                voidTagBuffer.append(" />");
                super.write(voidTagBuffer.toString());
                voidTagBuffer = null;
            } else {
                voidTagBuffer.append(str);
            }
        } else if (inVoidTag) {
            if (str.equals(">")) {
                inVoidTag = false;
            }
        } else {
            for (int i = 0; i < VOIDELEMENTS.length; i++) {
                if (str.equals(VOIDELEMENTS[i])) {
                    inVoidTag = true;
                    voidTagBuffer = new StringBuffer(str);
                    return;
                }
            }
            super.write(str);
        }
    }
}
I know this is dirty, but I had the same problem and didn't find any other way.
How can I get the latest tweet from HTML content, either through regex or without any external libraries? I am happy to use external libraries, I would just prefer not to; I mainly want to know how it would be possible. I have written the HTML download part in Java, and if anyone wants, I will post it here.
So I'll do a bit of pseudo code, so that I'm not only targeting Java developers. This is how my program looks so far:
1.) Load site("www.twitter.com/user123")
2.) Get initial string and write it to variable -> buffer
3.) Loop start
4.)   Append string -> buffer
5.)   If there is no more -> break
6.) Print buffer
Obviously the variable buffer will now hold the raw HTML content. How can I sort through this to get the tweet? I have found a way, but it is too inconsistent. What I managed was to find the string which held the tweets and extract the content surrounded by the markup, but there were too many variations in this section; some content inside of it changes, like the font size. I could write multiple if statements, but is there a neater solution?
Let me just start off by saying that jsoup is an amazing, lightweight HTML parsing library. You can use things like CSS selectors and whatnot. If you ever decide to use a library, jsoup will make your life a lot easier.
You can just query for the element with the class TweetTextSize, then get the text content. This will give you all text, hashtags, and links. (The downside being that pictures are also given as links.)
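A minimal sketch of that jsoup query (the jsoup version and profile URL are placeholders, and the TweetTextSize class comes from Twitter's markup at the time, which changes often):

@Grab('org.jsoup:jsoup:1.15.3')
import org.jsoup.Jsoup

// Fetch the profile page and select the tweet text paragraphs by class
def doc = Jsoup.connect('https://twitter.com/user123').get()
def tweets = doc.select('p.TweetTextSize')
if (tweets) {
    println tweets.first().text() // text of the most recent tweet
}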
Otherwise, you'll need to manually traverse the DOM. For example, use regex to find the beginning of the first TweetTextSize, and then just keep all text which is not between a < and a >.
Unfortunately, this second solution is volatile and may not work in the future, and you'll end up with a big glob of code which is overly complex and hard to debug.
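If you do go that route, here is a rough Groovy sketch of the manual approach (hedged: html stands for the page source you already buffered, and the class name is again Twitter's at the time):

// Find the first tweet paragraph by its class, then drop any nested markup
def m = html =~ /(?s)<p[^>]*TweetTextSize[^>]*>(.*?)<\/p>/
if (m.find()) {
    println m.group(1).replaceAll(/<[^>]*>/, '') // keep only text outside < and >
}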
A simple answer, if you want a regex and not a sophisticated third-party library:
<p[^>]+js-tweet-text[^>]*>(.*)</p>
Try the above on the "view-source" of https://twitter.com/a
Thanks.
EDIT:
Source Code:
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetSucker {

    public static void main(String[] args) throws Exception {
        URLConnection urlConnection = new URL("https://twitter.com/a").openConnection();
        InputStream inputStream = urlConnection.getInputStream();
        String encoding = urlConnection.getContentEncoding();

        // Read the whole response into memory
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len = 0;
        while ((len = inputStream.read(buffer)) != -1) {
            byteArrayOutputStream.write(buffer, 0, len);
        }

        String htmlContent = null;
        if (encoding != null) {
            htmlContent = new String(byteArrayOutputStream.toByteArray(), encoding);
        } else {
            htmlContent = new String(byteArrayOutputStream.toByteArray());
        }

        // Capture the contents of each tweet text paragraph
        Pattern TWEET_PATTERN = Pattern.compile("(<p[^>]+js-tweet-text[^>]*>(.*)</p>)", Pattern.CASE_INSENSITIVE);
        Matcher matcher = TWEET_PATTERN.matcher(htmlContent);
        while (matcher.find()) {
            System.out.println("Tweet Found: " + matcher.group(2));
        }
    }
}
I know that you don't want any libraries, but if you want something really quick, this is working code in C#:
using (IE browser = new IE())
{
    browser.GoTo("https://twitter.com/user");
    List tweets = browser.List(Find.ById("stream-items-id"));
    if (tweets != null)
    {
        foreach (var tweet in tweets.ListItems)
        {
            var tweetText = tweet.Paras.FirstOrDefault();
            if (tweetText != null)
            {
                MessageBox.Show(tweetText.Text);
            }
        }
    }
}
This program uses a library called WatiN. If you use Visual Studio, go to the Tools menu, select "NuGet Package Manager", then "Manage NuGet Packages for Solution", select "Browse", and type "WatiN" in the search box. After you find the library, hit "Install". Once it is installed, you just add a reference in your code and a using statement:
using WatiN.Core;
You can just copy and paste the code I wrote above into a button handler and it'll work; you need to change the twitter.com/XXXXXX user name to list that user's tweets. Modify the code accordingly to meet your needs.
Is it necessary to know the structure and tags of an XML file completely before reading it in Java?
areaElement.getElementsByTagName("checked").item(0).getTextContent()
I don't know the field name "checked" before I read the file. Is there any way to list all the tags in the XML file, basically the file structure?
I prepared this DOM parser myself, using recursion; it will parse your XML without knowing any single tag. It will give you each node's text content, if it exists, in sequence. You can uncomment the commented section in the following code to also get the node names. Hope it helps.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RecDOMP {

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);
        DocumentBuilder db = dbf.newDocumentBuilder();
        // replace the following path with your input XML path
        Document doc = db.parse(new FileInputStream(new File("D:\\ambuj\\ATT\\apip\\APIP_New.xml")));

        // replace the following path with your output path
        File outputDOM = new File("D:\\ambuj\\ATT\\apip\\outapip1.txt");
        // if the file doesn't exist, create it
        if (!outputDOM.exists()) {
            outputDOM.createNewFile();
        }
        FileOutputStream fostream = new FileOutputStream(outputDOM);
        OutputStreamWriter oswriter = new OutputStreamWriter(fostream);
        BufferedWriter bwriter = new BufferedWriter(oswriter);

        visitRecursively(doc, bwriter);

        bwriter.close();
        oswriter.close();
        fostream.close();
        System.out.println("Done");
    }

    public static void visitRecursively(Node node, BufferedWriter bw) throws IOException {
        // get all child nodes
        NodeList list = node.getChildNodes();
        for (int i = 0; i < list.getLength(); i++) {
            // get this child node
            Node childNode = list.item(i);
            if (childNode.getNodeType() == Node.TEXT_NODE) {
                // System.out.println("Found Node: " + childNode.getNodeName()
                //         + " - with value: " + childNode.getNodeValue() + " Node type: " + childNode.getNodeType());
                String nodeValue = childNode.getNodeValue();
                nodeValue = nodeValue.replace("\n", "").replaceAll("\\s", "");
                if (!nodeValue.isEmpty()) {
                    System.out.println(nodeValue);
                    bw.write(nodeValue);
                    bw.newLine();
                }
            }
            visitRecursively(childNode, bw);
        }
    }
}
You should definitely check out libraries for this, like dom4j (http://dom4j.sourceforge.net/). They can parse the whole XML document and let you not only list things like elements, but also run XPath queries and other such cool stuff on them.
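For example, here is a minimal Groovy sketch that lists every distinct element name via an XPath query (the dom4j/jaxen versions and the input file name are placeholder assumptions):

@Grab('org.dom4j:dom4j:2.1.3')
@Grab('jaxen:jaxen:1.2.0') // dom4j needs jaxen for XPath
import org.dom4j.io.SAXReader

def doc = new SAXReader().read(new File('input.xml'))
// '//*' matches every element in the document; collect the distinct tag names
println doc.selectNodes('//*')*.name.unique()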
There is a performance hit, especially with large XML documents, so you will want to check the performance impact for your use case before committing to a library. This is especially true if you only need a small bit of the XML document (and you more or less know what you are looking for already).
The answer to your question is no, it is not necessary to know any element names in advance. For example, you can walk the tree to discover the element names. But it all depends on what you are actually trying to do.
For the vast majority of applications, incidentally, the Java DOM is one of the worst ways to solve the problem. But I won't comment further without knowing your project requirements.