Leave entities as-is when parsing XML with Woodstox

I'm using Woodstox to process an XML document that contains some entities (most notably &gt;) in the value of one of the nodes. To use an extreme example, it's something like this:
<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>
I have tried a lot of different configuration options for both WstxInputFactory (IS_REPLACING_ENTITY_REFERENCES, P_TREAT_CHAR_REFS_AS_ENTS, P_CUSTOM_INTERNAL_ENTITIES...) and WstxOutputFactory, but no matter what I try, the output is always something like this:
<parent>nbsp; &lt; nbsp; > & " ' nbsp;</parent>
(&gt; gets converted to >, &lt; stays the same, &nbsp; loses the &...)
I'm reading the XML with an XMLEventReader created with
XMLEventReader reader = wstxInputFactory.createXMLEventReader(new StringReader(fulltext));
after configuring the WstxInputFactory.
Is there any way to configure Woodstox to just ignore all entities and output the text exactly as it was in the input String?

First of all, you need to include actual code, since "output is always something like this" makes no sense without explaining exactly how you are outputting the parsed content: you may be printing events, using some library, or perhaps using a Woodstox stream or event writer.
Second: XML distinguishes between the small number of predefined entities (lt, gt, apos, quot, amp) and arbitrary user-defined entities, like what nbsp would be here. The former you can use as-is, since they are already defined; the latter only exist if you define them in a DTD.
Handling of the two groups is different, too: the former will always be expanded no matter what, and this is by the XML specification. The latter will be resolved (unless resolution is disabled) and then expanded, or, if they are not defined, an exception will be thrown.
You can also specify a custom resolver, as mentioned in the other answer, but this will only be used for custom entities (here, &nbsp;).
In the end, it is also good to explain not so much what you are doing as what you are trying to achieve. That helps others suggest better approaches than specific "how do I do X" questions, which may not be the right way to go about it.
As for configuring Woodstox, maybe this blog entry:
https://medium.com/@cowtowncoder/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173
will help (as well as the two others in the series); it covers the existing configuration settings.
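To illustrate the point about predefined entities, here is a minimal sketch using only the standard StAX API (javax.xml.stream). It assumes whatever StAX implementation is on the classpath (the JDK's built-in one, or Woodstox if present); the class and method names are made up for the example. Even with IS_REPLACING_ENTITY_REFERENCES set to false, the five predefined entities come back expanded:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of Woodstox or of the question's code.
class PredefinedEntityDemo {

    // Returns the text content of the root element of the given XML string.
    static String rootText(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser not to replace entity references.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
        // Merge adjacent text pieces so the element text arrives in one piece.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                reader.nextTag();               // move to the root START_ELEMENT
                return reader.getElementText(); // text up to the matching END_ELEMENT
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The predefined entities are already expanded in the returned text.
        System.out.println(rootText("<parent>&lt; &gt; &amp; &quot; &apos;</parent>"));
    }
}
```

An undeclared entity such as &nbsp; is a different case: without a DTD declaration or a custom resolver, the parser has nothing to expand it to.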

The five basic XML entities (quot, amp, apos, lt, gt) will always be processed. As far as I know, there is no way to get their source form with SAX.
The other entities you can process manually. You can capture the events until the end of the element and concatenate the values:
XMLInputFactory factory = WstxInputFactory.newInstance();
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, Boolean.FALSE);
XMLEventReader xmlr = factory.createXMLEventReader(
        this.getClass().getResourceAsStream(xmlFileName));
String value = "";
while (xmlr.hasNext()) {
    XMLEvent event = xmlr.nextEvent();
    if (event.isCharacters()) {
        value += event.asCharacters().getData();
    }
    if (event.isEntityReference()) {
        value += "&" + ((EntityReference) event).getName() + ";";
    }
    if (event.isEndElement()) {
        // Assign it to the right variable
        System.out.println(value);
        value = "";
    }
}
For your example input:
<parent>&nbsp; &lt; &nbsp; &gt; &amp; &quot; &apos; &nbsp;</parent>
The output will be:
&nbsp; < &nbsp; > & " ' &nbsp;
Otherwise, if you want to convert all the entities, you could use a custom XMLResolver for undeclared entities:
public class NaiveHtmlEntityResolver implements XMLResolver {
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("nbsp", " ");
        ENTITIES.put("apos", "'");
        ENTITIES.put("quot", "\"");
        // and so on
    }

    @Override
    public Object resolveEntity(String publicID,
                                String systemID,
                                String baseURI,
                                String namespace) throws XMLStreamException {
        if (publicID == null && systemID == null) {
            // The entity name is looked up via the last argument
            return ENTITIES.get(namespace);
        }
        return null;
    }
}
And then tell Woodstox to use it for the undeclared entities:
factory.setProperty(WstxInputProperties.P_UNDECLARED_ENTITY_RESOLVER, new NaiveHtmlEntityResolver());

Related

Write elements of a map to a CSV correctly in a simplified way in Java 8

I have a countries Map with the following design:
England=24
Spain=21
Italy=10
etc
Then, I have a different citiesMap with the following design:
London=10
Manchester=5
Madrid=7
Barcelona=4
Roma=3
etc
Currently, I am printing these results on screen:
System.out.println("\nCountries:");
Map<String, Long> countryMap = countTotalResults(orderDataList, OrderData::getCountry);
writeResultInCsv(countryMap);
countryMap.entrySet().stream().forEach(System.out::println);
System.out.println("\nCities:\n");
Map<String, Long> citiesMap = countTotalResults(orderDataList, OrderData::getCity);
writeResultInCsv(citiesMap);
citiesMap.entrySet().stream().forEach(System.out::println);
I want to write each line of my 2 maps in the same CSV file. I have the following code:
public void writeResultInCsv(Map<String, Long> resultMap) throws Exception {
    File csvOutputFile = new File(RUTA_FICHERO_RESULTADO);
    try (PrintWriter pw = new PrintWriter(csvOutputFile)) {
        resultMap.entrySet().stream()
            .map(this::convertToCSV)
            .forEach(pw::println);
    }
}
public String convertToCSV(String[] data) {
    return Stream.of(data)
        .map(this::escapeSpecialCharacters)
        .collect(Collectors.joining("="));
}
public String escapeSpecialCharacters(String data) {
    String escapedData = data.replaceAll("\\R", " ");
    if (data.contains(",") || data.contains("\"") || data.contains("'")) {
        data = data.replace("\"", "\"\"");
        escapedData = "\"" + data + "\"";
    }
    return escapedData;
}
But I get compilation error in writeResultInCsv method, in the following line:
.map(this::convertToCSV)
This is the compilation error I get:
reason: Incompatible types: Entry is not convertible to String[]
How can I indicate the following result in a CSV file in Java 8 in a simplified way?
This is the result and design that I want my CSV file to have:
Countries:
England=24
Spain=21
Italy=10
etc
Cities:
London=10
Manchester=5
Madrid=7
Barcelona=4
Roma=3
etc

Your resultMap.entrySet() is a Set<Map.Entry<String, Long>>. You then turn that into a Stream<Map.Entry<String, Long>> and run .map on it. Thus, the mapper you provide there needs to map objects of type Map.Entry<String, Long> to whatever you like. But you pass the convertToCSV method to it, which maps string arrays.
Your code tries to join on comma (Collectors.joining(",")), but your desired output contains zero commas.
It feels like one of two things is going on:
You copy/pasted this code from someplace, or it was provided to you, and you have no idea what any of it does. I would advise tearing this code into pieces: take each individual piece, experiment with it until you understand it, then put it back together again, and now you know what you're looking at. At that point you would know that having Collectors.joining(",") in this makes no sense whatsoever, and that you're trying to map an entry of String, Long using a mapping function that maps string arrays, which obviously doesn't work.
You would know all this stuff but you haven't bothered to actually look at your code. That seems a bit surprising, so I don't think this is it. But if it is - the code you have is so unrelated to the job you want to do, that you might as well remove your code entirely and turn this question into: "I have this. I want this. How do I do it?"
NB: A text file listing key=value pairs is not usually called a CSV file.
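For what it's worth, here is a minimal sketch of a mapper that takes a Map.Entry instead of a String[], producing the key=value lines from the question. The class name, the section helper and the result.txt file name are made up for the example, and it assumes both sections should be written to the file in one go:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the names are not from the question's code.
class KeyValueReport {

    // Builds the lines of one section: a "Title:" header, then one key=value line per entry.
    static List<String> section(String title, Map<String, Long> counts) {
        List<String> lines = new ArrayList<>();
        lines.add(title + ":");
        // The mapper receives a Map.Entry<String, Long>, not a String[].
        counts.entrySet().stream()
              .map(entry -> entry.getKey() + "=" + entry.getValue())
              .forEach(lines::add);
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Long> countries = new LinkedHashMap<>();
        countries.put("England", 24L);
        countries.put("Spain", 21L);
        Map<String, Long> cities = new LinkedHashMap<>();
        cities.put("London", 10L);
        cities.put("Madrid", 7L);

        List<String> lines = new ArrayList<>(section("Countries", countries));
        lines.addAll(section("Cities", cities));
        try {
            // Both sections are written to the same file in a single call.
            Files.write(Paths.get("result.txt"), lines);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Collecting all lines first also sidesteps a second problem in the original code: opening a new PrintWriter on the same file for each map would overwrite what the previous call wrote.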

Property file based conditional patterns in java

I have a property file (A.txt) with values like the example below:
test1=10
test2=20
test33=34
test34=35
By reading this file, I need to produce an output like below
value = 35_20_34_10
which means => I have a pattern like test34_test2_test33_test1
Note, If the 'test33' has any value other than 34 then I need to produce the value like below
value = 35_20_10
which means => I have a pattern like test34_test2_test1
Now my problem is that every time the customer changes the logic, I have to change the code. So what I want is to keep the logic (pattern) in another property file, so that I send two inputs to the util: one is the property file (A.txt), the other is 'pattern.txt'.
My util has to compare A.txt against the business logic in 'pattern.txt' and produce output like
value = 35_20_34_10 (or)
value = 35_20_10
Is there an example of the kind of pattern-based logic I am looking for?
Does any predefined util / Java class do this?
Any help would be Great.
thanks,
Harry

First of all, svasa's answer makes a lot of sense, but covers a different level of abstraction. I recommend you read his answer too; that pattern should be useful.
You may wanna look at Apache Velocity and FreeMarker libraries to see how they structure their API.
Those are template engines - they usually have some abstraction of pattern or format, and abstraction of variable/value binding (or namespace, or source). You can render a template by binding it with a binding/namespace, which yields the result.
For example, you may wanna have a pattern "<a> + <b>", and binding that looks like a map: {a: "1", b: "2"}. By binding that binding to that pattern you'll get "1 + 2", when interpreting <...> as variables.
You basically load the pattern from your pattern.txt, then load your data file A.txt (for example, by treating it as properties and using Properties class) and construct binding based on these properties. You'll get your output and possibility to customize the pattern all the time.
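A minimal sketch of that idea, without pulling in a template engine: it assumes the pattern marks variables as <name> and that the binding is a plain map (which could be filled from a Properties object). The class and method names are made up for the example, and it does not cover the conditional part (dropping test33 unless its value is 34), which would need extra rules in the pattern file:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical mini template renderer; not the Velocity or FreeMarker API.
class TemplateDemo {

    // A variable is written as <name> inside the pattern.
    private static final Pattern VARIABLE = Pattern.compile("<([A-Za-z0-9_]+)>");

    // Replaces every <name> with its bound value; unbound names render as an empty string.
    static String render(String pattern, Map<String, String> binding) {
        Matcher matcher = VARIABLE.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = binding.getOrDefault(matcher.group(1), "");
            // quoteReplacement keeps '$' and '\' in values from being treated specially.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> binding = new HashMap<>();
        binding.put("a", "1");
        binding.put("b", "2");
        System.out.println(render("<a> + <b>", binding)); // prints "1 + 2"

        Map<String, String> props = new HashMap<>();
        props.put("test1", "10");
        props.put("test2", "20");
        props.put("test33", "34");
        props.put("test34", "35");
        // prints "35_20_34_10"
        System.out.println(render("<test34>_<test2>_<test33>_<test1>", props));
    }
}
```

With this shape, changing the output order only means editing the pattern string, which is what would live in pattern.txt.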

You may call sequences like test34_test2_test33_test1 a pattern; let me call them constraints used when building something.
To me, this problem best fits the builder pattern.
When building the value you want, you tell the builder that these are your constraints (pattern) and these are your original properties, like below:
new MyPropertiesBuilder().setConstraints(constraints).setProperties(original).buildValue();
Details:
Set some constraints in a separate file where you specify the order of the properties and their values like :
test34=desiredvalue-could-be-empty
test2=desiredvalue-could-be-empty
test33=34
test1=desiredvalue-could-be-empty
The builder goes over the constraints in the order specified, but gets the values from the original properties, and builds the desired string.
One way to achieve your requirement with the builder pattern is to define classes like below:
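Interface:
public interface IMyPropertiesBuilder
{
    // The setters return the builder so that the calls can be chained.
    IMyPropertiesBuilder setConstraints( Properties constraints );
    IMyPropertiesBuilder setProperties( Properties properties );
    String buildValue();
}
Builder
public class MyPropertiesBuilder implements IMyPropertiesBuilder
{
    private Properties constraints;
    private Properties original;

    @Override
    public IMyPropertiesBuilder setConstraints( Properties constraints )
    {
        this.constraints = constraints;
        return this;
    }

    @Override
    public IMyPropertiesBuilder setProperties( Properties properties )
    {
        this.original = properties;
        return this;
    }

    @Override
    public String buildValue()
    {
        StringBuilder value = new StringBuilder();
        // Note: java.util.Properties does not preserve the order of the file,
        // so the constraint keys need an ordered source in a real implementation.
        for ( String key : constraints.stringPropertyNames() )
        {
            String expected = constraints.getProperty( key );
            String actual = original.getProperty( key );
            // An empty constraint accepts any value; otherwise the values must match.
            if ( actual != null && ( expected.isEmpty() || expected.equals( actual ) ) )
            {
                if ( value.length() > 0 )
                {
                    value.append( "_" );
                }
                value.append( actual );
            }
        }
        return value.toString();
    }
}
User
// Needs java.io.FileInputStream, IOException, InputStream, UncheckedIOException and java.util.Properties
public class MyPropertiesBuilderUser
{
    private final Properties original = load( "original.properties" );
    private final Properties constraints = load( "constraints.properties" );

    // Properties.load() returns void, so loading needs its own helper.
    private static Properties load( String fileName )
    {
        Properties properties = new Properties();
        try ( InputStream in = new FileInputStream( fileName ) )
        {
            properties.load( in );
        }
        catch ( IOException e )
        {
            throw new UncheckedIOException( e );
        }
        return properties;
    }

    public String getValue()
    {
        return new MyPropertiesBuilder().setConstraints( constraints ).setProperties( original ).buildValue();
    }
}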

Rest-Assured XSD References Other XSDs

I am programming an XML validator according to schemas using Rest-Assured. However, I am having trouble handling XSDs that reference other XSDs, because I retrieve the original XSD from a URL using GET.
I have been trying to implement my own parsing to consolidate the XSDs(Strings) into one XSD(String), but it is becoming a recursive monster, and extremely inefficient/difficult. To see the algorithm, look at the end of the post.
I have two questions:
1) My problem is that I am using GET to retrieve the XSD, so it's not within the namespace. Is there a way to GET all referenced XSDs and consolidate them using Rest-Assured? I wouldn't have a clue about how to go about this.
2) Is there a better way to handle includes in general? As you can see, my algorithm is very costly and overcomplicated (especially the ref attribute), and I'm sure something will break easily if I change my test cases.
My algorithm(Pseudo-Code to avoid complexity) so far is like the following:
boolean xmlValid(String xmlAddress, String xsdAddress){
    LinkedList XSDList = new LinkedList;
    XSDList.add(xsdAddress);
    xsdString = getExternalXSDStrings(XSDList);
    try{ //No PseudoCode here
        RestAssured.expect().
            statusCode(200).
            body(
                RestAssuredMatchers.matchesXsd(xsdString)).
        when().
            get(xmlAddress);
    }catch Exceptions{...}
}
String getExternalXSDStrings(LinkedList xsdReferences, String prevString){
    LinkedList recursiveXSDReferences = new LinkedList();
    for(xsdRef:xsdReferences){
        xsdAddress = "http://..." + xsdRef;
        Open InputStream From URL;
        while(inputLine != null){
            if(prologFlag) //Do Nothing, this is to avoid multiple prologs ;
            else if(includeFlag){
                if(refFlag) Note Reference;
                else recursiveXSDReferences.add(includeReference);
            }else if(refFlag){
                referenceDefinition = Extract Reference Element Definition;
                xsdString = xsdString + referenceDefinition;
            }else{
                xsdString = xsdString + inputLine;
            }
        }
        Close input stream;
    }
    xsdString = prevString + xsdString;
    if(xsdReferences.length > 0) return getExternalXSDStrings(recursiveXSDReferences , xsdString);
    else return xsdString;
}
Thank you very much in advance!

Perhaps you can make use of the XmlConfig in the detailed configuration. This gives you access to configure features, namespaces, etc. For example, if you want to disable the loading of external DTDs you could do:
given().config(RestAssured.config().xmlConfig(xmlConfig().disableLoadingOfExternalDtd())). ..
So perhaps you could look in the "disableLoadingOfExternalDtd" method to see how it's implemented to get some hints.

JAXB: Get Tag as String

This question may have been answered before in some dark recess of the Interwebs, but I couldn't even figure out how to form a meaningful Google query to search for it.
So: Suppose I have a (simplified) XML document like so:
<root>
<tag1>Value</tag1>
<tag2>Word</tag2>
<tag3>
<something1>Foo</something1>
<something2>Bar</something2>
<something3>Baz</something3>
</tag3>
</root>
I know how to use JAXB to unmarshal this into a Java Object in the standard use cases.
What I don't know how to do is unmarshal tag3's contents wholesale into a String. By which I mean:
<something1>Foo</something1>
<something2>Bar</something2>
<something3>Baz</something3>
as a String, tags and all.

Use the @XmlAnyElement annotation.
I've been looking for the same solution, and I expected to find some annotation that prevents parsing into a DOM and leaves the content as it is, but did not find one.
Detail at:
Using JAXB to extract inner text of XML element
and
http://blog.bdoughan.com/2011/04/xmlanyelement-and-non-dom-properties.html
I added one check in the getElement() method; otherwise we could get an IndexOutOfBoundsException:
if (xml.indexOf(START_TAG) < 0) {
return "";
}
For me, this solution behaves quite strangely: the getElement() method is called for every tag of your XML. The first call is for "Value", the second for "ValueWord", and so on; it appends each tag to the previous ones.
Update:
I noticed that this approach works only for ONE occurrence of the tag that we want to parse to a String. It's impossible to correctly parse the following example:
<root>
<parent1>
<tag1>Value</tag1>
<tag2>Word</tag2>
<tag3>
<something1>Foo</something1>
<something2>Bar</something2>
<something3>Baz</something3>
</tag3>
</parent1>
<parent2>
<tag1>Value</tag1>
<tag2>Word</tag2>
<tag3>
<something1>TheSecondFoo</something1>
<something2>TheSecondBar</something2>
<something3>TheSecondBaz</something3>
</tag3>
</parent2>
"tag3" with parent tag "parent2" will contain parameters from the first tag (Foo, Bar, Baz) instead of (TheSecondFoo, TheSecondBar, TheSecondBaz)
Any suggestions are appreciated.
Thanks.

I have a utility method that might come in handy for you in that case. See if it helps. I made some sample code with your example:
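import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static void main(String[] args) {
    String text = "<root><tag1>Value</tag1><tag2>Word</tag2><tag3><something1>Foo</something1><something2>Bar</something2><something3>Baz</something3></tag3></root>";
    System.out.println(extractTag(text, "<tag3>"));
}

// Returns the text between the first occurrence of the opening tag and its closing tag.
public static String extractTag(String xml, String tag) {
    String value = "";
    String endTag = "</" + tag.substring(1);
    // DOTALL lets the match span line breaks; quote() treats the tags as literal text.
    Pattern p = Pattern.compile(Pattern.quote(tag) + "(.*?)" + Pattern.quote(endTag), Pattern.DOTALL);
    Matcher m = p.matcher(xml);
    if (m.find()) {
        value = m.group(1);
    }
    return value;
}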

Accessing annotations in UIMA

Is there a way in UIMA to access the annotations from the tokens, the same way the CAS debugger GUI does? You can of course access all the annotations from the index repository, but I want to loop over the tokens and get all the annotations associated with each token.
The reason is simply that I want to check some annotations and discard the others, and this way it is much easier. Any help is appreciated :)

I'm a uimaFIT developer.
If you want to find all annotations within the boundaries of another annotation, you may prefer the shorter and faster variant
JCasUtil.selectCovered(referenceAnnotation, <T extends ANNOTATION>);
Mind that it is not a good idea to create a "dummy" annotation with the desired offsets and then search within its boundaries, because this immediately allocates memory in the CAS, which is not garbage-collected unless the complete CAS is collected.

After searching and asking the developers of cTAKES (Apache clinical Text Analysis and Knowledge Extraction System), I found that you can use the "uimafit" library, which can be found at http://code.google.com/p/uimafit/. The following code can be used:
List list = JCasUtil.selectCovered(jcas, <T extends Annotation>, startIndex, endIndex);
This will return all the annotations between the two indices.
Hope that will help

If you don't want to use uimaFIT, you can create a filtered iterator to loop through the annotations of interest.
See the UIMA reference documentation for details.
I recently used this approach in some code to find a sentence annotation which encompassed a regex annotation (this approach was acceptable for our project because all regular expression matches were shorter than the sentences in the document, and there was only one regex match per sentence. Obviously, based on indexing rules, your mileage may vary. If you are afraid of running into another shorterAnnotationType, put the inner code into a while loop):
static ArrayList<annotationsPair> process(Annotation shorterAnnotationType,
        Annotation longerAnnotationType, JCas aJCas){
    ArrayList<annotationsPair> annotationsList = new ArrayList<annotationsPair>();
    FSIterator it = aJCas.getAnnotationIndex().iterator();
    FSTypeConstraint constraint = aJCas.getConstraintFactory().createTypeConstraint();
    constraint.add(shorterAnnotationType.getType());
    constraint.add(longerAnnotationType.getType());
    it = aJCas.createFilteredIterator(it, constraint);
    Annotation a = null;
    int shorterBegin = -1;
    int shorterEnd = -1;
    it.moveTo((shorterAnnotationType));
    while (it.isValid()) {
        a = (Annotation) it.get();
        if (a.getClass() == shorterAnnotationType.getClass()){
            shorterBegin = a.getBegin();
            shorterEnd = a.getEnd();
            System.out.println("Target annotation from " + shorterBegin
                    + " to " + shorterEnd);
            //because assume that sentence type is longer than other type,
            //the sentence gets indexed prior
            it.moveToPrevious();
            if(it.isValid()){
                Annotation prevAnnotation = (Annotation) it.get();
                if (prevAnnotation.getClass() == longerAnnotationType.getClass()){
                    int sentBegin = prevAnnotation.getBegin();
                    int sentEnd = prevAnnotation.getEnd();
                    System.out.println("found annotation [" + prevAnnotation.getCoveredText()
                            + "] location: " + sentBegin + ", " + sentEnd);
                    annotationsPair pair = new annotationsPair(a, prevAnnotation);
                    annotationsList.add(pair);
                }
                //return to where you started
                it.moveToNext(); //will not invalidate iter because just came from next
            }
        }
        it.moveToNext();
    }
    return annotationsList;
}
Hope this helps!
Disclaimer: I am new to UIMA.
