markup specific strings in Xml - java

I would like to mark up some strings in an XML document.
For example, I have:
<p> I like to go to Florida </p>
I need to tag the string "go" and have the output as:
<p> I like to <something>go</something> to Florida</p>
What is the best way to do this? I am using Java, and I need to treat the XML file as XML, not as text. I found some solutions that treat the XML file as a text file and use String.replace, but I do not think those are good solutions.
Any suggestions are much appreciated.
Thank you.

Try an XSLT 2.0 transformation like this:
<xsl:template match="@*|*">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="text()">
<xsl:analyze-string select="." regex="go">
<xsl:matching-substring>
<something><xsl:value-of select="."/></something>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
You can of course extend the regular expression, e.g. regex="go|come|walk|run"; if you only want to match whole words, you might want to use tokenize() to split it into words and process each word separately.
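Since the question mentions Java: if no XSLT 2.0 processor is at hand, the same idea can be sketched with the DOM API that ships with the JDK. This is only a rough sketch under assumptions: the class name WordTagger and the method tag are made up for illustration, the wrapper element is the <something> from the question, and the pattern uses Java's \b word boundary (which XPath regular expressions do not offer) so that "gone" is left alone.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: wraps whole-word matches inside text nodes in <something>.
public class WordTagger {
    // Whole-word match for "go"; extend the alternation as needed, e.g. go|come|walk|run.
    private static final Pattern WORD = Pattern.compile("\\b(?:go)\\b");

    public static String tag(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Collect the text nodes first, so the tree is not modified while walking it.
            List<Node> textNodes = new ArrayList<>();
            collectTextNodes(doc.getDocumentElement(), textNodes);
            for (Node text : textNodes) {
                wrapMatches(doc, text);
            }
            // An identity transform serializes the modified DOM back to a string.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static void collectTextNodes(Node node, List<Node> found) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                found.add(child);
            } else {
                collectTextNodes(child, found);
            }
        }
    }

    // Replaces one text node with a sequence of text nodes and wrapper elements.
    private static void wrapMatches(Document doc, Node text) {
        String value = text.getNodeValue();
        Matcher m = WORD.matcher(value);
        if (!m.find()) {
            return;
        }
        Node parent = text.getParentNode();
        int last = 0;
        do {
            parent.insertBefore(doc.createTextNode(value.substring(last, m.start())), text);
            Element wrapper = doc.createElement("something");
            wrapper.setTextContent(m.group());
            parent.insertBefore(wrapper, text);
            last = m.end();
        } while (m.find());
        parent.insertBefore(doc.createTextNode(value.substring(last)), text);
        parent.removeChild(text);
    }

    public static void main(String[] args) {
        System.out.println(tag("<p> I like to go to Florida </p>"));
    }
}
```

Because the document is parsed and only text nodes are touched, element names and attribute values that happen to contain "go" are not affected, which is the main advantage over a plain string replace.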


Issue with the Text displayed on a PDF using XSL [duplicate]

I'm having an issue when I publish my modspecs to PDF (XSL-FO): in my tables, the content of a cell overflows its column into the next one. How do I force a break in the text so that a new line is created instead?
I can't manually insert zero-width space characters since the table entries are entered programmatically. I'm looking for a simple solution that I can add to docbook_pdf.xsl (either as an xsl:param or xsl:attribute).
EDIT:
Here is where I'm at currently:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<xsl:import href="urn:docbkx:stylesheet"/>
...(the beginning of my stylesheet for pdf generation, e.g. header and footer content stuff)
<xsl:template match="text()">
<xsl:call-template name="intersperse-with-zero-spaces">
<xsl:with-param name="str" select="."/>
</xsl:call-template>
</xsl:template>
<xsl:template name="intersperse-with-zero-spaces">
<xsl:param name="str"/>
<xsl:variable name="spacechars">
      
     ​
</xsl:variable>
<xsl:if test="string-length($str) > 0">
<xsl:variable name="c1" select="substring($str, 1, 1)"/>
<xsl:variable name="c2" select="substring($str, 2, 1)"/>
<xsl:value-of select="$c1"/>
<xsl:if test="$c2 != '' and
not(contains($spacechars, $c1) or
contains($spacechars, $c2))">
<xsl:text>&#x200B;</xsl:text>
</xsl:if>
<xsl:call-template name="intersperse-with-zero-spaces">
<xsl:with-param name="str" select="substring($str, 2)"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
With this, the long words are successfully broken up in the table cells! Unfortunately, the side effect is that normal text elsewhere (such as body text under section X) now has its words broken up so that they appear on separate lines. Is there a way to restrict the above process to just tables?
In the long words, try inserting a zero-width space character between the characters where a break is allowed.
You can use XSLT to insert a zero-width space between every character. Here is one way to do it: http://groups.yahoo.com/neo/groups/XSL-FO/conversations/topics/1177.
Here is a mailing list thread where various approaches to the problem are discussed: http://www.stylusstudio.com/xsllist/200201/post80920.html.
The SourceForge DocBook stylesheets include a template for breaking up long URLs in FO output; see http://www.sagehill.net/docbookxsl/Ulinks.html#BreakLongUrls. The template (hyphenate-url) is in xref.xsl.
Since you're using XSLT 2.0:
<xsl:template match="text()">
<xsl:value-of
select="replace(replace(., '(\P{Zs})(\P{Zs})', '$1&#x200B;$2'),
'([^\p{Zs}&#x200B;])([^\p{Zs}&#x200B;])',
'$1&#x200B;$2')" />
</xsl:template>
This is using category escapes (http://www.w3.org/TR/xmlschema-2/#nt-catEsc) rather than an explicit list of characters to match, but you could do it that way instead. It needs two replace() because the inner replace() can only insert the character between every second character. The outer replace() matches on characters that are not either space characters or the character added by the inner replace().
Inserting after every thirteenth non-space character:
<xsl:template match="text()">
<xsl:value-of
select="replace(replace(., '(\P{Zs}{13})', '$1&#x200B;'),
'&#x200B;(\p{Zs})',
'$1')" />
</xsl:template>
The inner replace() inserts the character after every 13 non-space characters, and the outer replace() fixes it if the 14th character was a space character.
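If the table entries are generated by a program anyway, the same two-step replace could also be applied before the text ever reaches the stylesheet. Here is a minimal Java sketch of that idea, assuming the inserted character is the zero-width space U+200B; the class name SoftBreaks and the method breakEvery13 are made up for illustration.

```java
// Hypothetical helper mirroring the two nested replace() calls above.
public class SoftBreaks {
    // Zero-width space, U+200B.
    private static final String ZWSP = "\u200B";

    public static String breakEvery13(String text) {
        // Step 1: insert a zero-width space after every run of 13 non-space characters.
        String inserted = text.replaceAll("(\\P{Zs}{13})", "$1" + ZWSP);
        // Step 2: drop the inserted character again where a real space follows anyway.
        return inserted.replaceAll(ZWSP + "(\\p{Zs})", "$1");
    }

    public static void main(String[] args) {
        String broken = breakEvery13("abcdefghijklmnopqrstuvwxyz");
        // Two break opportunities are added: after "m" and after "z".
        System.out.println(broken.length()); // 28
    }
}
```

As in the XSLT version, \P{Zs} only excludes Unicode space separators, so tabs and newlines count as non-space characters here.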
If you are using AH Formatter, then you can use axf:word-break="break-all" to allow AH Formatter to break anywhere within a word. See https://www.antenna.co.jp/AHF/help/en/ahf-ext.html#axf.word-break.

XSLT gathering data

I have a simple issue that I can't really find a workaround to and I need your help.
The main problem is that while processing an input XML document there are various places where I need to "gather" information. This means all I really have to do is call a special template with parameters, like so:
<xsl:template name="append-section">
<xsl:param name="id" />
<xsl:param name="title" />
<!-- more code here -->
</xsl:template>
Let's say this template is called 12 times during the transformation. At the end of the conversion I want to write this data to a file.
I tried to append this data to a global variable and then write the result to the file, only to realise that variables in XSLT are not really variable: once bound, they cannot be changed. This solution did not work.
The second solution was to use xsl:result-document with one temp file. It would have copied the previous content of the file back into itself each time and then appended the new data, something like this:
<xsl:template name="append-section">
<xsl:param name="id" />
<xsl:param name="title" />
<xsl:result-document method="html" href="tmp/tmp.html">
<xsl:value-of select="document(tmp.html)" />
<xsl:element name="li">
<xsl:element name="a">
<xsl:attribute name="class">
<xsl:value-of select="'so-dropdown-page-menu-list-button'" />
</xsl:attribute>
<xsl:attribute name="href">
<xsl:value-of select="'#'" />
<xsl:value-of select="$id" />
</xsl:attribute>
<xsl:value-of select="$title" />
</xsl:element>
</xsl:element>
</xsl:result-document>
</xsl:template>
This code might not be perfect, but unfortunately I found that the following exception was thrown:
Cannot write more than one result document to the same URI
This solution also seems to be invalid.
So my question is this: how can I gather the data from various places and write it to a file at the end of the transformation?
I use Saxon.
You need to structure your code according to the structure of the output, not the structure of the input. Don't try to do things as you encounter information in the input; do them when you need to generate the relevant piece of the output.
There are cases when this can seem inefficient because it means visiting the same input more than once. Usually these inefficiencies will prove apparent rather than real. But the first thing is to get the transformation working; if it's not fast enough you can come back to us with another question.
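To make that advice concrete, here is a rough sketch of the "pull" style: the menu is produced in one place by selecting every section, instead of appending an entry each time a section is encountered. Everything about the input is assumed, since the question does not show it: the section elements with an id attribute and a title child are hypothetical, as are the class name MenuFromSections and the method buildMenu. The sketch is driven from Java with the XSLT 1.0 processor built into the JDK rather than Saxon, so it writes to a string; with Saxon and XSLT 2.0 the same template body could sit inside a single xsl:result-document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: build the whole menu where the output needs it.
public class MenuFromSections {
    // The list is generated once, by selecting all (assumed) section elements.
    private static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<ul>"
        + "<xsl:for-each select='//section'>"
        + "<li><a class='so-dropdown-page-menu-list-button' href='#{@id}'>"
        + "<xsl:value-of select='title'/>"
        + "</a></li>"
        + "</xsl:for-each>"
        + "</ul>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    public static String buildMenu(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String input = "<doc><section id='s1'><title>First</title></section>"
                     + "<chapter><section id='s2'><title>Second</title></section></chapter></doc>";
        System.out.println(buildMenu(input));
    }
}
```

The sections may be scattered anywhere in the input; the select expression visits them again when the menu is needed, which is exactly the repeated visiting described above.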

unclosed html tag inside xslt

I'd like to emit an unclosed HTML tag as a result of XSLT. I'll add the closing tag later in the XSLT. How can I achieve this? This one doesn't compile:
<xsl:when test="$href">
<xsl:text><a href='{$href}'></xsl:text>
</xsl:when>
Thanx
This is the kind of thing that you should probably avoid at all costs. I do not know your requirements, but perhaps you want either a link or a span tag depending on some condition.
In these instances you can use something like this:
<xsl:apply-templates select="tag"/>
and then two templates, i.e.
<xsl:template match="tag">
<span>hello king dave</span>
</xsl:template>
<xsl:template match="tag[@href]">
<a href="{@href}">link text....</a>
</xsl:template>
It's hard to give a definite answer without a better idea of the precise use case, but it's worth noting that you can use match and name on the same <xsl:template>. For example, if you want to produce some particular output for all <tag> elements, but also wrap this output in an <a> tag in certain cases, then you could use an idiom like
<xsl:template match="tag[@href]">
<a href="{@href}">
<xsl:call-template name="tagbody" />
</a>
</xsl:template>
<xsl:template match="tag" name="tagbody">
Tag content was "<xsl:value-of select="."/>"
</xsl:template>
The idea here is that tag elements with an href will match the first template, which opens the <a> wrapper, calls the general tag template, and then closes the wrapper. Tags without an href will just hit the normal template without the wrapping logic. I.e. for an input like
<root>
<tag>foo</tag>
<tag href="#">bar</tag>
</root>
you would get an output like
Tag content was "foo"
<a href="#">Tag content was "bar"</a>
I had the same problem before and was only able to solve it by copying the entire <a href='{$href}'>...</a> for each when branch.
Maybe you could try setting the doctype of your XSL to some loose XML standard, but afaik XSLT is pretty strict.
Edit: apparently you can set the doctype with a <xsl:output> tag.
Found solution on the net:
<xsl:text disable-output-escaping="yes"><![CDATA[<a href=']]></xsl:text>
<xsl:value-of select="href"/>
<xsl:text disable-output-escaping="yes"><![CDATA['>]]></xsl:text>

Remove special characters from XML via XSLT only for specific tags

I am having a certain issue with special characters in my XML.
Basically, I am splitting an XML document into multiple XML files using the Xalan processor.
When splitting the document, I use the value of each name tag as the name of the generated file. The problem is that the name contains characters that aren't recognized by the XML processor, like ™ (TM) and ® (R). I want to remove those characters ONLY when naming the files.
<xsl:template match="products">
<redirect:write select="concat('..\\xml\\product\\en\\',translate(string(name),'&lt;/> ',''),'.xml')">
The above is the XSL code I have written to split the XML into multiple XML files. As you can see, I am using the translate function to replace '/', '<' and '>' with '' in the name. I was hoping I could do the same with ™ (TM) and ® (R), but it doesn't seem to work.
Please advise me how I could do that.
Thanks for your help in advance.
I don't have Xalan, but with 8 other XSLT processors this transformation:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="text()">
<xsl:value-of select="translate(., '&lt;/>™®', '')"/>
===================
<xsl:value-of select="translate(., '&lt;/>&#x2122;&#xAE;', '')"/>
</xsl:template>
</xsl:stylesheet>
when applied on this XML document:
<t>XXX™ My Trademark®</t>
produces the wanted result:
XXX My Trademark
===================
XXX My Trademark
I suggest that you try to use one of the two expressions above -- at least the second may work successfully.
Following Dimitre's answer: if you are not sure which special characters could appear in the name, maybe you should instead keep only the characters you consider legal in a file name.
For example:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="text()">
<xsl:value-of select="translate(.,
translate(.,
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ',
''),
'')"/>
</xsl:template>
</xsl:stylesheet>
With input:
<t>XXX™ My > Trademark®</t>
Result:
XXX My Trademark
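The nested translate() above works in two steps: the inner call deletes every allowed character, so what is left is exactly the set of unwanted characters, and the outer call then deletes those from the original string. The following sketch shows both steps with the XPath 1.0 engine that ships with the JDK; the class name KeepAllowedChars and the method evaluate are made up for illustration, and the whitelist is the one from the stylesheet above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo of the double-translate whitelist idiom.
public class KeepAllowedChars {
    private static final String ALLOWED =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ";

    // Returns { characters that get removed, cleaned string } for the <t> element.
    public static String[] evaluate(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Inner step: remove everything that is allowed, leaving only the unwanted characters.
            String unwanted = xpath.evaluate("translate(/t, '" + ALLOWED + "', '')", doc);
            // Full expression: remove those unwanted characters from the original value.
            String cleaned = xpath.evaluate(
                    "translate(/t, translate(/t, '" + ALLOWED + "', ''), '')", doc);
            return new String[] { unwanted, cleaned };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+2122 is the trademark sign, U+00AE the registered sign.
        String[] result = evaluate("<t>XXX\u2122 My > Trademark\u00AE</t>");
        System.out.println("removed: " + result[0]);
        System.out.println("cleaned: " + result[1]);
    }
}
```

Note that the spaces on both sides of the removed ">" are kept, since the space is on the whitelist.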

How to preserve Empty XML Tags after XSLT - prevent collapsing them from <B></B> to <B/>

Say I have a very simple XML with an empty tag 'B':
<Root>
<A>foo</A>
<B></B>
<C>bar</C>
</Root>
I'm currently using XSLT to remove a few tags, like 'C' for example:
<?xml version="1.0" ?>
<xsl:stylesheet version="2.0" xmlns="http://www.w3.org/1999/XSL/Transform" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no" encoding="utf-8" omit-xml-declaration="yes" />
<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="@*" />
<xsl:apply-templates />
</xsl:copy>
</xsl:template>
<xsl:template match="C" />
</xsl:stylesheet>
So far OK, but the problem is I end up having an output like this:
<Root>
<A>foo</A>
<B/>
</Root>
when I actually really want:
<Root>
<A>foo</A>
<B></B>
</Root>
Is there a way to prevent 'B' from collapsing?
Thanks.
Ok, so here what worked for me:
<xsl:output method="html"/>
Try this:
<script type="..." src="...">&#160;</script>
Your HTML output will be:
<script type="..." src="..."> </script>
The &#160; prevents the collapsing but translates to a blank space. It's worked for me in the past.
There is no standard way, as they are equivalent; You might be able to find an XSLT engine that has an option for this behaviour, but I'm not aware of any.
If you're passing this to a third party that cannot accept empty tags using this syntax, then you may have to post-process the output yourself (or convince the third party to fix their XML parsing)
It is up to the XSLT engine to decide how the XML tag is rendered, because a parser should see no difference between the two variations. However, when outputting HTML this is a common problem (for <textarea> and <script> tags for example.) The simplest (but ugly) solution is to add a single whitespace inside the tag (this does change the meaning of the tag slightly though.)
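The claim that a parser sees no difference can be checked directly. Below is a small sketch using the DOM parser from the JDK; the class name EmptyTagEquivalence and the method sameTree are made up for illustration. It also shows why the whitespace workaround is not entirely neutral: the added space becomes a text node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical check: do two XML strings parse to the same element tree?
public class EmptyTagEquivalence {
    private static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static boolean sameTree(String first, String second) {
        try {
            // isEqualNode compares names, attributes and children recursively.
            return parse(first).getDocumentElement()
                    .isEqualNode(parse(second).getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Both spellings of the empty element give the same tree.
        System.out.println(sameTree("<Root><B></B></Root>", "<Root><B/></Root>"));  // true
        // A single space inside B adds a text node, so the trees differ.
        System.out.println(sameTree("<Root><B> </B></Root>", "<Root><B/></Root>")); // false
    }
}
```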
This has been a long time issue and I finally made it work with a simple solution.
Add <xsl:text/> if you have a space character. I added a space in my helper class.
<xsl:choose>
<xsl:when test="$textAreaValue=' '">
<xsl:text/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$textAreaValue"/>
</xsl:otherwise>
</xsl:choose>
They are NOT always equivalent. Many browsers can't deal with <script type="..." src="..." /> and want a separate closing tag. I ran into this problem while using xml/xsl with PHP. Output "html" didn't work, I'm still looking for a solution.
No. The 2 are syntactically identical, so you shouldn't have to worry
It should not be a problem whether it is <B></B> or <B/>. However, if you are using another tool which expects empty XML tags only in the <B></B> form, then you have a problem. A not very elegant way to handle this is to add a space between the starting and ending 'B' tags through XSLT code.
<xsl:text disable-output-escaping="yes">
<![CDATA[<div></div>]]>
</xsl:text>
This works fine with C#'s XslCompiledTransform class with .NET 2.0, but may very well fail almost anywhere else. Do not use it unless you are programmatically doing the transform yourself; it is not portable at all.
It's 7 years late, but for future readers I will buck the trend here and propose an actual solution to the original question. A solution that does not modify the original with spaces or the output directive.
The idea was to use an empty variable to trick the parser.
If you only want to do it just for one tag B, my first thought was to use something like this to attach a dummy variable.
<xsl:variable name="dummyempty" select="''"/>
<xsl:template match="B">
<xsl:copy>
<xsl:apply-templates select="@*" />
<xsl:value-of select="concat(., $dummyempty)"/>
</xsl:copy>
</xsl:template>
But I found that in fact, even the dummy variable is not necessary. This preserved empty tags, at least when tested with xsltproc in linux :
<xsl:template match="B">
<xsl:copy>
<xsl:apply-templates select="@*" />
<xsl:value-of select="."/>
</xsl:copy>
</xsl:template>
For a more generic solution to handle ALL empty tags, try this:
<xsl:variable name="dummyempty" select="''"/>
<xsl:template match="*[. = '']">
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
<xsl:value-of select="$dummyempty"/>
</xsl:copy>
</xsl:template>
Again, depending on how smart your parser is, you may not even need the dummy variable.
