+ sign being dropped from xml when validation occurs

This is a follow-up to a previous question I asked here, "WebResponse posting a null string".
The answer there works, but a new problem has appeared when parsing the XML below:
<?xml version="1.0" encoding="UTF-8"?>
<hml xmlns="http://schemas.nmdp.org/spec/hml/1.0.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://schemas.nmdp.org/spec/hml/1.0.1 http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd"
version="1.0.1" >
<!--
MIRING Element 1.1 requires the inclusion of an hmlid.
hmlid can be reported in the form of an ISO Object Identifier (OID)
"root" represents a unique publically registered organization
"extension" is a unique document id managed by the reporting organization.
-->
<hmlid root="2.34.48.32" extension="HML.3245662"/>
<!--
MIRING Element 1.2 requires the inclusion of a reporting-center.
reporting-center identifies the organization sending the HML message.
"reporting-center-id" is a unique identifier of the sender.
"reporting-center-context" reports the context/naming authority of the identifier.
-->
<reporting-center reporting-center-id="567"/>
<sample id="4555-6677-8">
<typing gene-family="HLA" date="2015-01-13">
<!--
MIRING Element 3 requires the inclusion of Genotyping information.
The Genotype should include all pertinent Loci, as well as a Genotype in a standard format.
GLStrings can be included either as plain text, or as a reference to a publicly
available service, such as GL Service (gl.nmdp.org)
-->
<allele-assignment date="2015-07-28" allele-db="IMGT/HLA" allele-version="3.17.0">
<haploid locus="HLA-A" method="DNA" type="02:20:01"/>
<glstring>
HLA-A*02:20:01
</glstring>
</allele-assignment>
<typing-method>
<!--
MIRING Element 6 requires platform documentation. This could be a peer-reviewed publication,
or an identifier of a procedure on a publicly available resource, such as NCBI GTR
-->
<sbt-ngs locus="HLA-A"
test-id="HLA-A.Test.1234"
test-id-source="AcmeGenLabs">
<raw-reads uri="rawreads/read1.fastq.gz"
availability="public"
format="fastq"
paired="1"
pooled="1"
adapter-trimmed="1"
quality-trimmed="0"/>
</sbt-ngs>
</typing-method>
<consensus-sequence date="2015-01-13">
<!--
MIRING Element 2 requires the inclusion of Reference Context.
The location and identifiers of the reference sequence should be specified.
start and end attributes are 0-based, and refer to positions on the reference sequence.
-->
<reference-database availability="public" curated="true">
<reference-sequence
name="HLA-A reference"
id="Ref111"
start="945000"
end="946000"
accession="GL000123.4"
uri="http://AcmeGenReference/RefDB/GL000123.4"/>
</reference-database>
<!--
MIRING Element 4 requires the inclusion of a consensus sequence.
The start and end positions are 0-based, and refer to positions on the reference sequence (reference-sequence-id)
Multiple consensus-sequence-block elements can be included sequentially.
-->
<consensus-sequence-block reference-sequence-id="Ref111"
start="945532"
end="945832"
strand="+"
phase-set="1"
expected-copy-number="1"
continuity="true"
description="HLA-A Consensus Sequence 4.5.67">
<!--
A sequence can be reported as plain text, or as a pointer to an external reference,
or as variants from a reference sequence.
-->
<sequence>
CCCAGTTCTCACTCCCATTGGGTGTCGGGTTTCCAGAGAAGCCAATCAGTGTCGTCGCGGTCGCTGTTCTAAAGCCCGCACGCACCCACCGGGACTCAGATTCTCCCCAGACGCCGAGGATGGCCGTCATGGCGCCCCGAACCCTCCTCCTGCTACTCTCGGGGGCCCTGGCCCTGACCCAGACCTGGGCGGGTGAGTGCGGGGTCGGGAGGGAAACCGCCTCTGCGGGGAGAAGCAAGGGGCCCTCCTGGCGGGGGCGCAGGACCGGGGGAGCCGCGCCGGGACGAGGGTCGGGCAGGT
</sequence>
<!--
MIRING Element 5 requires the inclusion of any relevant sequence polymorphisms.
These represent variants from the reference sequence.
start and end attributes are 0-based, and refer to positions on the reference sequence.
You can see this variant at positions 10 - 15 on the sequence. (945542 - 945532 = 10)
-->
<variant id="0"
reference-bases="GTCATG"
alternate-bases="ACTCCC"
start="945542"
end="945548"
filter="pass"
quality-score="95">
<!--
The functional effects of variants can be reported using variant-effect.
They should use Sequence Ontology (SO) variant effect terms.
-->
<variant-effect term="missense_variant"/>
</variant>
</consensus-sequence-block>
</consensus-sequence>
</typing>
</sample>
<!--
Multiple samples can be included in a single message.
Each sample should have it's own reference-database(s) even if they are identical to other samples' references.
-->
<sample id="4555-6677-9">
<typing gene-family="HLA" date="2015-01-13">
<allele-assignment date="2015-07-28" allele-db="IMGT/HLA" allele-version="3.17.0">
<haploid locus="HLA-A" method="DNA" type="02:20:01"/>
<glstring>
HLA-A*02:01:01:01
</glstring>
</allele-assignment>
<typing-method>
<sbt-ngs locus="HLA-A"
test-id="HLA-A.Test.1234"
test-id-source="AcmeGenLabs">
<raw-reads uri="rawreads/read2.fastq.gz"
availability="public"
format="fastq"
paired="1"
pooled="1"
adapter-trimmed="1"
quality-trimmed="0"/>
</sbt-ngs>
</typing-method>
<consensus-sequence date="2015-01-13">
<reference-database availability="public" curated="true">
<reference-sequence
name="HLA-A reference"
id="Ref112"
start="945000"
end="946000"
accession="GL000123.4"
uri="http://AcmeGenReference/RefDB/GL000123.4"/>
</reference-database>
<consensus-sequence-block
reference-sequence-id="Ref112"
start="945532"
end="945832"
strand="+"
phase-set="1"
expected-copy-number="1"
continuity="true"
description="HLA-A Consensus Sequence 4.5.89">
<sequence>
CCCAGTTCTCGTCATGATTGGGTGTCGGGTTTCCAGAGAAGCCAATCAGTGTCGTCGCGGTCGCTGTTCTAAAGCCCGCACGCACCCACCGGGACTCAGATTCTCCCCAGACGCCGAGGATGGCCGTCATGGCGCCCCGAACCCTCCTCCTGCTACTCTCGGGGGCCCTGGCCCTGACCCAGACCTGGGCGGGTGAGTGCGGGGTCGGGAGGGAAACCGCCTCTGCGGGGAGAAGCAAGGGGCCCTCCTGGCGGGGGCGCAGGACCGGGGGAGCCGCGCCGGGACGAGGGTCGGGCAGGT
</sequence>
</consensus-sequence-block>
</consensus-sequence>
</typing>
</sample>
</hml>
This is the sample provided for the validator, so I know it is valid. However, when I pass it through my RESTful POST code:
@POST
@Path("/Validate")
@Produces("application/xml")
public String validate(@FormParam("xml") String xml)
{
System.out.println(xml);
try {
Client client = Client.create();
WebResource webResource = client.resource("http://miring.b12x.org/validator/ValidateMiring/");
// POST method
ClientResponse response = webResource.accept("application/xml").post(ClientResponse.class,"xml="+xml);
// check response status code
if (response.getStatus() != 200) {
throw new RuntimeException("Failed : HTTP error code : " + response.getStatus());
}
// display response
String output = response.getEntity(String.class);
System.out.println("Output from Server .... ");
System.out.println(output + "\n");
return output;
} catch (Exception e) {
e.printStackTrace();
}
return "Oops";
}
Everything passes through fine except for strand="+", which for some reason loses the + and produces the error message The value '' of attribute 'strand' on element 'consensus-sequence-block' is not valid with respect to its...'
I tried all of strand's enumeration values (+, -, -1, 1) and all of them work except for +.
Using the web UI (miring.b12x.org), it works perfectly.
Is there something about parsing with SAX that could cause a + to be dropped, or any other reason a particular enumeration value would be dropped?
Thank you
EDIT: Here is the output received:
Output from Server ....
<?xml version="1.0" encoding="UTF-8"?>
<miring-report xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
timestamp="07/19/2016 15:07:31"
xsi:noNamespaceSchemaLocation="http://schemas.nmdp.org/spec/miringreport/1.0/miringreport.xsd">
<hml-compliant>reject</hml-compliant>
<miring-compliant>reject</miring-compliant>
<hmlid extension="HML.3245662" root="2.34.48.32"/>
<samples compliant-sample-count="4"
noncompliant-sample-count="0"
sample-count="2">
<sample hml-compliant="true" id="4555-6677-8" miring-compliant="true"/>
<sample hml-compliant="true" id="4555-6677-9" miring-compliant="true"/>
</samples>
<fatal-validation-errors>
<miring-result miring-rule-id="reject" severity="fatal">
<description>[cvc-attribute.3:, The, value, ', ', of, attribute, 'strand', on, element, 'consensus-sequence-block', is, not, valid, with, respect, to, its, type,, 'null'.]</description>
<solution>Verify that your HML file is well formed, and conforms to http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd</solution>
</miring-result>
<miring-result miring-rule-id="reject" severity="fatal">
<description>[cvc-attribute.3:, The, value, ', ', of, attribute, 'strand', on, element, 'consensus-sequence-block', is, not, valid, with, respect, to, its, type,, 'null'.]</description>
<solution>Verify that your HML file is well formed, and conforms to http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd</solution>
</miring-result>
<miring-result miring-rule-id="reject" severity="fatal">
<description>[cvc-enumeration-valid:, Value, ', ', is, not, facet-valid, with, respect, to, enumeration, '[-1,, 1,, +,, -]'., It, must, be, a, value, from, the, enumeration.]</description>
<solution>Verify that your HML file is well formed, and conforms to http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd</solution>
</miring-result>
<miring-result miring-rule-id="reject" severity="fatal">
<description>[cvc-enumeration-valid:, Value, ', ', is, not, facet-valid, with, respect, to, enumeration, '[-1,, 1,, +,, -]'., It, must, be, a, value, from, the, enumeration.]</description>
<solution>Verify that your HML file is well formed, and conforms to http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd</solution>
</miring-result>
</fatal-validation-errors>
<validation-warnings>
<miring-result miring-rule-id="1.2.b" severity="warning">
<description>The node reporting-center is missing a reporting-center-context attribute.</description>
<solution>Please add a reporting-center-context attribute to the reporting-center node. You can use reporting-center-context to specify the naming authority of the reporting center identifier. Reporting-center-context is not explicitly required.</solution>
<xpath>/hml[1]/reporting-center[1]</xpath>
</miring-result>
</validation-warnings>
</miring-report>

You don’t set the type of your WebResource, and I don’t know what the default Content-Type of the request is, but I suspect it is application/x-www-form-urlencoded, which means + is being treated as a space. If that is the case, changing "xml="+xml to "xml=" + URLEncoder.encode(xml, "UTF-8") may address the problem.
The application/x-www-form-urlencoded format is the default format for HTML form submissions, as described in the HTML 4.01 specification. The documentation for the URLEncoder class also describes this format.
In that format, a + character represents a space, so the strand attribute contains a single space. Except, the Attribute-Value Normalization section of the XML 1.0 specification states:
If the attribute type is not CDATA, then the XML processor MUST further process the normalized attribute value by discarding any leading and trailing space (#x20) characters …
So, that single space is then normalized into the empty string (when all leading and trailing space is removed). The empty string, strand='', does not conform to the XML schema you are referencing, http://schemas.nmdp.org/spec/hml/1.0.1/hml-1.0.1.xsd .
URLEncoder.encode escapes all “reserved” characters, including +, as percent-escapes, and then escapes spaces as +. The server expects this format (almost certainly because a Content-Type: application/x-www-form-urlencoded header is present in the HTTP request), and decodes the + and percent-escapes back to the original XML.
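The effect described above can be reproduced without any HTTP call. The following is a minimal sketch, assuming Java 10 or later (for the URLEncoder.encode(String, Charset) overload); the class name FormEncodingDemo and the helper formBody are hypothetical and not part of the question's code.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only illustrates form encoding, not the Jersey client.
class FormEncodingDemo {

    // Builds an application/x-www-form-urlencoded body with a single "xml" field.
    static String formBody(String xml) {
        return "xml=" + URLEncoder.encode(xml, StandardCharsets.UTF_8);
    }

    // Decodes a value the way a form-decoding server would.
    static String formDecode(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String attribute = "strand=\"+\"";

        // Sent unencoded, the literal + is decoded as a space.
        System.out.println(formDecode(attribute));          // strand=" "

        // Encoded first, the + travels as %2B and survives the round trip.
        String body = formBody(attribute);
        System.out.println(body);                           // xml=strand%3D%22%2B%22
        System.out.println(formDecode(body.substring(4)));  // strand="+"
    }
}
```

In the question's code, the same idea would mean passing the encoded body to post(...) instead of "xml=" + xml.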

Related

The reference to entity "c" must end with the ';' delimiter

I have a maven project and I need to add server details in settings.xml to gain access to a repo that needs authentication.
I have the following, where mypass includes this &c% substring.
<server>
<id>myid-releases</id>
<username>myusername</username>
<password>mypass</password>
</server>

A & is a special character in XML because it marks the start of an entity reference.
If your XML parser (or the one that Maven uses) tries to resolve entities, they need to be valid. Thus, you need to write the entity that resolves to & in your XML: &amp;
TL;DR:
Replace the & in your password with &amp;
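If the password comes from somewhere else and has to be written into settings.xml by a script, the escaping can be done programmatically. This is a minimal sketch with a hand-written helper; the class name XmlEscapeDemo, the method escapeXml, and the sample password are hypothetical and only illustrate the five predefined XML entities.

```java
// Hypothetical helper that escapes the five characters with predefined XML entities.
class XmlEscapeDemo {

    static String escapeXml(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (char c : raw.toCharArray()) {
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up password containing the problematic &c% substring.
        String password = "my&c%pass";
        System.out.println("<password>" + escapeXml(password) + "</password>");
        // <password>my&amp;c%pass</password>
    }
}
```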

Change log entry pattern dynamically on some condition

In my Java app, Logback is used as the logging framework. The appenders are configured with the following pattern (simplified):
[CORR=%X{CORR}] [MSG=%msg]%n
As one can see, CORR value is taken from MDC. Log entry example:
[CORR=12342314] [MSG=Some message]
There are cases when the attribute is not stored in MDC, so log entry looks like:
[CORR=] [MSG=Some message]
But should be:
[MSG=Some message]
Is there any way to totally get rid of this [CORR=] part of pattern if the corresponding value is absent in MDC without creating custom LayoutBase implementations?
I'm trying to configure evaluator:
<evaluator name="DISPLAY_CORR_EVAL">
<expression>((String) mdc.get("CORR")) != null</expression>
</evaluator>
but have no idea how to use it in my case.

The problem was solved with the help of Logback's replace(p){r, t} conversion word:
Replaces occurrences of 'r', a regex, with its replacement 't' in the
string produced by the sub-pattern 'p'. For example,
"%replace(%msg){'\s', ''}" will remove all spaces contained in the
event message.
The pattern 'p' can be arbitrarily complex and in particular can
contain multiple conversion keywords. For instance, "%replace(%logger
%msg){'.', '/'}" will replace all dots in the logger or the message
of the event with a forward slash.
My pattern now looks as follows:
%replace([CORR=%X{CORR}]){'\[CORR=\]', ''}[MSG=%msg]%n
When CORR is empty, [CORR=] matches the regex r and is therefore replaced by an empty string.
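The replacement logic can be checked in plain Java. This sketch does not use Logback at all; it only emulates what the pattern is described as doing, using String.replaceAll with the same regex. The class CorrPatternDemo and the method render are hypothetical names, and a missing MDC value is assumed to render as an empty string, as in the log output shown in the question.

```java
// Hypothetical emulation of the %replace(...) pattern; not Logback's implementation.
class CorrPatternDemo {

    // corr == null stands for "CORR is not present in the MDC".
    static String render(String corr, String msg) {
        // Sub-pattern p: [CORR=%X{CORR}]
        String corrPart = "[CORR=" + (corr == null ? "" : corr) + "]";
        // Regex r: \[CORR=\] (an empty CORR block), replacement t: empty string
        String replaced = corrPart.replaceAll("\\[CORR=\\]", "");
        return replaced + "[MSG=" + msg + "]";
    }

    public static void main(String[] args) {
        System.out.println(render("12342314", "Some message")); // [CORR=12342314][MSG=Some message]
        System.out.println(render(null, "Some message"));       // [MSG=Some message]
    }
}
```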

java.net.URI and percent in query parameter value

System.out.println(
new URI("http", "example.com", "/servlet", "a=x%20y", null));
The result is http://example.com/servlet?a=x%2520y, where the query parameter value differs from the supplied one. Strange, but this does follow the Javadoc:
"The percent character ('%') is always quoted by these constructors."
We can pass the decoded string, a=x y and then we get a reasonable(?) result a=x%20y.
But what if the query parameter value contains an "&" character? This happens, for example, if the value is itself a URL with query parameters. Look at this (wrong) query string:
a=b&c. The ampersand must be escaped here (a=b%26c); otherwise this can be read as a query parameter a=b followed by some garbage (c). If I pass the escaped form to a URI constructor, it encodes it again and returns a wrong URL: ...?a=b%2526c
This issue seems to render java.net.URI useless. Am I missing something here?
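The cases described above can be put side by side in a small program. This is a sketch using only the five-argument java.net.URI constructor from the question; the class name UriQuotingDemo and the helper build are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo of how the multi-argument URI constructor quotes the query.
class UriQuotingDemo {

    static String build(String query) {
        try {
            // Arguments: scheme, authority, path, query, fragment
            return new URI("http", "example.com", "/servlet", query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("a=x%20y")); // http://example.com/servlet?a=x%2520y
        System.out.println(build("a=x y"));   // http://example.com/servlet?a=x%20y
        System.out.println(build("a=b%26c")); // http://example.com/servlet?a=b%2526c
        System.out.println(build("a=b&c"));   // http://example.com/servlet?a=b&c
    }
}
```

With this constructor there is no input that yields a=b%26c: the percent sign is always quoted, and a literal & is left alone.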
Summary of answers
java.net.URI knows that a URI has a query part, but it does not understand the internals of the query part, which can differ for each scheme. For example, java.net.URI does not understand the internal structure of the HTTP query part. This would not be a problem if java.net.URI treated the query as an opaque string and did not alter it. But it tries to apply a generic percent-encoding algorithm, which breaks HTTP URLs.
Therefore I cannot use the URI class to reliably assemble a URL from its parts, even though there are constructors for it. I would also mention that, as of Java 7, the implementation of the relativize operation is quite limited and only works if one URL is a prefix of the other. These two features (and the class's leaner interface for these purposes) were the reason I was interested in java.net.URI, but neither of them works for me.
In the end I used java.net.URL for parsing and wrote my own code to assemble a URL from parts and to relativize two URLs. I also checked the Apache HttpClient URIBuilder class. Although it does understand the internals of an HTTP query string, as of 4.3 it has the same encoding problem as java.net.URI when dealing with the query part as a whole.

The query string
a=b&c
is not wrong in a URI. The RFC on URI Generic Syntax states
The query component is a string of information to be interpreted by
the resource.
query = *uric
Within a query component, the characters ";", "/", "?", ":", "#",
"&", "=", "+", ",", and "$" are reserved.
The character & in the query string is very much valid (uric represents reserved, mark, and alphanumeric characters). The RFC also states
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.
Because the & is valid but reserved, it is up to the user to determine if it is meant to be encoded or not.
What you call a query parameter is not a feature of a URI and therefore the URI class has no reason to (and shouldn't) support it.
Related:
Which characters make a URL invalid?

The only workaround I found was to use the single-argument constructors and methods. Note that you must use URI#getRawQuery() to avoid decoding %26. For example:
URI uri = new URI("http://a/?b=c%26d&e");
// uri.getRawQuery() equals "b=c%26d&e"
uri = new URI(new URI(uri.getScheme(), uri.getAuthority(),
uri.getPath(), null, null) + "?f=g%26h&i");
// uri.getRawQuery() equals "f=g%26h&i"
uri = uri.resolve("?j=k%26l&m");
// uri.getRawQuery() equals "j=k%26l&m"
// uri.toString() equals "http://a/?j=k%26l&m"

The only working solution I know of is reflection (see https://blog.stackhunter.com/2014/03/31/encode-special-characters-java-net-uri/):
URI uri = new URI("http", null, "example.com", -1, "/accounts", null, null);
Field field = URI.class.getDeclaredField("query");
field.setAccessible(true);
field.set(uri, encodedQueryString);
//clear cached string representation
field = URI.class.getDeclaredField("string");
field.setAccessible(true);
field.set(uri, null);

Use the URLEncoder.encode() method. In your case, for example:
URLEncoder.encode("a=x%20y", "ISO-8859-1");

SOAP service using Spring. Escaping special characters

I have a SOAP web service developed using the Spring framework. Whenever the request contains invalid data, I need to display an error message like the one below:
Error occurred. Invalid data for <Field Name>.
So my code for name validation looks like the following. This error is sent as the response whenever no value is passed for the name field.
Assert.notNull(name, "Error occurred. No value passed for the field <name>. ");
What I expected as output is:
Error occurred. No value passed for the field <name>.
But the response in SOAP UI looked like this:
Error occurred. No value passed for the field &lt;name>.
How can I display a proper < symbol in SOAP UI? I tried CDATA, but I am not sure how the receiver would process a message containing CDATA.
With CDATA, the message in SOAP UI looked like this:
Error occurred. No value passed for the field <![CDATA[<]]name>.

The XML Specification states:
The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings " &amp; " and " &lt; " respectively. The right angle bracket (>) may be represented using the string " &gt; ", and must, for compatibility, be escaped using either " &gt; " or a character reference when it appears in the string " ]]> " in content, when that string is not marking the end of a CDATA section.
So, you need to either escape the left angle bracket in your error string:
Error occurred. No value passed for the field &lt;name>.
Or encapsulate the entire error string in a CDATA section:
<![CDATA[Error occurred. No value passed for the field <name>.]]>
For more information see http://www.w3.org/TR/xml/#syntax
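On the receiving side, a conforming XML parser should hand back the same text for both forms. The sketch below checks this with the JDK's built-in DOM parser; the element name message and the class name EscapedVsCdataDemo are made up for illustration and are not taken from the asker's service.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo: parses a tiny document and returns the root element's text.
class EscapedVsCdataDemo {

    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<message>No value passed for the field &lt;name>.</message>";
        String cdata = "<message><![CDATA[No value passed for the field <name>.]]></message>";

        // Both variants are read back as the same plain text.
        System.out.println(textOf(escaped));
        System.out.println(textOf(cdata));
        System.out.println(textOf(escaped).equals(textOf(cdata)));
    }
}
```

A raw viewer such as SOAP UI shows the serialized form, so seeing &lt; there does not by itself mean the receiver gets the wrong text.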

Inconsistent Apache Solr query results

I'm new to Apache Solr and trying to make a query using search terms against a field called "normalizedContents" and of type "text".
All of the search terms must exist in the field. Problem is, I'm getting inconsistent results.
For example, the solr index has only one document with normalizedContents field with value = "EDOUARD SERGE WILFRID EDOS0004 UNE MENTION COMPLEMENTAIRE"
I tried these queries in solr's web interface:
normalizedContents:(edouard AND une) returns the result
normalizedContents:(edouar* AND une) returns the result
normalizedContents:(EDOUAR* AND une) doesn't return anything
normalizedContents:(edouar AND une) doesn't return anything
normalizedContents:(edouar* AND un) returns the result (although there's no "un" word)
normalizedContents:(edouar* AND uned) returns the result (although there's no "uned" word)
Here's the declaration of normalizedContents in schema.xml:
<field name="normalizedContents" type="text" indexed="true" stored="true" multiValued="false"/>
So, wildcards and AND operator do not follow the expected behavior. What am I doing wrong ?
Thanks.

By default, the field type text applies stemming to the content (solr.SnowballPorterFilterFactory), so 'un' and 'uned' match une. You might also not have the solr.LowerCaseFilterFactory filter on both the query and index analyzers, which is why EDOUAR* does not match. The fourth query doesn't match because edouard is not stemmed to edouar. If you want exact matches, copy the data into another field whose type has a more limited set of filters, e.g. only a solr.WhitespaceTokenizerFactory.
Posting the <fieldType name="text"> section from your schema would help in understanding the full picture.
