JAX-WS: Illegal character ((CTRL-CHAR, code 30)) [duplicate] - java

I am getting the following exception from a web service:
com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 15))
I know the reason: the data I want to return contains control characters, and control characters are not allowed in XML.
I searched for a solution, and in many places I found code that simply removes the control characters.
My concern: will I lose data if I strip the control characters out?
I would prefer a clean solution, perhaps encoding the characters instead of removing them.

I would do what OrangeDog suggests. But if you want to solve it in your code, try:
replaceAll("[\\x00-\\x09\\x11\\x12\\x14-\\x1F\\x7F]", "")
where \\x12 is the char.

Thanks, guys, for your inputs. I am sharing my solution; it might be helpful for others.
The requirement was not to wipe out the CONTROL CHARs: they should remain as-is in the DB, and once the web service sends them across the network, the client should be able to get the CONTROL CHARs back. So I implemented it as follows:
Encode strings using URLEncoder in the web-service code.
At the client side, decode them using URLDecoder.
Sharing sample code and output below.
Sample code:
System.out.println("NewSfn");
System.out.println(URLEncoder.encode("NewSfn", "UTF-8"));
System.out.println(URLDecoder.decode("NewSfn", "UTF-8"));
Output:
NewSfn
New%0FSfn
NewSfn
So the client will receive the CONTROL CHARs.
EDIT: Stack Exchange does not display the CONTROL CHAR, so the code above spells it out as the escape \u000F; the first and third output lines are really New(CONTROL CHAR)Sfn.

This error is being thrown by the Woodstox XML parser. The source code from the InputBootstrapper class looks like this:
protected void reportUnexpectedChar(int i, String msg)
    throws WstxException
{
    char c = (char) i;
    String excMsg;
    // WTF? JDK thinks null char is just fine as?!
    if (Character.isISOControl(c)) {
        excMsg = "Unexpected character (CTRL-CHAR, code "+i+")"+msg;
    } else {
        excMsg = "Unexpected character '"+c+"' (code "+i+")"+msg;
    }
    Location loc = getLocation();
    throw new WstxUnexpectedCharException(excMsg, loc, c);
}
Amusing comment aside, Woodstox is performing some additional validation on top of the JDK parser, and is rejecting the ASCII character 15 as invalid.
As to why that character is there, we can't tell you; it's in your data. Similarly, we can't tell you whether removing that character will break anything, since, again, it's your data. You can only establish that for yourself.

If you have control characters in your text data then you need to solve that problem at its source.
The most likely causes are incorrect communication encodings (usually between database and app) or not sanitising user input.
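As a hedged illustration of the first cause: many JDBC drivers let you pin the client-side character encoding on the connection URL, so the driver doesn't mangle characters between the database and the app. The URL parameters below are specific to MySQL Connector/J, and the host and credentials are made up; other drivers use different settings.
import java.sql.Connection;
import java.sql.DriverManager;

public class EncodingAwareConnection {
    public static void main(String[] args) throws Exception {
        // useUnicode/characterEncoding pin the wire encoding for Connector/J;
        // without them, the driver may fall back to a platform default and
        // mangle characters on the way in or out of the database.
        String url = "jdbc:mysql://dbhost:3306/mydb"
                + "?useUnicode=true&characterEncoding=UTF-8";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("connected with an explicit client encoding");
        }
    }
}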

I found the same problem when I was passing null values for some of the parameters. I passed empty values instead and this error went away.

I'm a bit confused by @ssedano's answer: it seems to me he's trying to match all control chars from the ASCII table, 0x00 to 0x1F, except 0x0A (new line) and 0x0D (carriage return), plus 0x7F (DEL). Wouldn't the regex then be
replaceAll("[\\x00-\\x09\\x0B\\x0C\\x0E-\\x1F\\x7F]", "")

Related

flink "equal" symbol changed inside a string

I have a very strange issue with Flink.
I have JSON input with some fields defined inside a POJO.
When I look at the output, the = symbols have been changed:
Original string:
"body": "/opensearch/OpenSearch?searchTerms=productType:OL_2_WFR___%20OR%20OL_2_WRR___%20OR%20SL_2_WST___%20OR%20SR_2_WAT___&count=10"
String produced by Flink:
"body": "/opensearch/OpenSearch?searchTerms\u003dproductType:OL_2_WFR___%20OR%20OL_2_WRR___%20OR%20SL_2_WST___%20OR%20SR_2_WAT___\u0026count\u003d10"
Does anyone know how to resolve this issue?
They aren't, not really.
As per the JSON spec, if the byte 0x09 (ASCII tab character) appears inside a JSON string, or if the byte sequence 0x5C 0x74 (The characters \t) appears, or if the sequence 0x5C 0x75 0x30 0x30 0x30 0x39 appears (the characters \u0009), they all mean the exact same thing: There is one character there in that string, and it is the tab.
If you're having trouble with this, your JSON library is broken. Get a better one.
Most likely your JSON library is not broken. Instead, either [A] you are comparing raw JSON, or attempting to retrieve info from raw JSON using e.g. regular expressions. Stop doing that; it will be an endless parade of such 'weirdness', because you're not supposed to do this, and there are all sorts of ways to write different JSON strings that mean the same thing. Or [B] there is no problem here and you can just continue: you merely saw the difference and understandably assumed that it matters, or that it will cause problems down the line.
Assuming you don't do silly things like attempting to parse JSON with regular expressions or comparing raw JSON and assume that means anything relevant about the content of it, this will not be a problem.
Specifically, \u003d and = are identical as per the JSON spec. Whatever processed this JSON decided to replace one sequence with another sequence that means the same thing, which is an allowed operation.
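To make that concrete, here is a small sketch using Jackson as an illustrative JSON library (the question doesn't say which serializer Flink is using here, though escaping = and & as \u003d and \u0026 is characteristic of Gson's HTML-safe default). Both documents parse to equal values:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonEscapeDemo {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Same logical content, two equally valid escapings on the wire.
        JsonNode plain   = mapper.readTree("{\"body\":\"a=b&c=d\"}");
        JsonNode escaped = mapper.readTree("{\"body\":\"a\\u003db\\u0026c\\u003dd\"}");

        System.out.println(plain.get("body").asText());   // a=b&c=d
        System.out.println(escaped.get("body").asText()); // a=b&c=d
        System.out.println(plain.equals(escaped));        // true: identical once parsed
    }
}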

Encoding a string in 128c barcode symbology

I am having some trouble with encoding this string into barcode symbology - Code 128.
Text to encode:
1021448642241082212700794828592311
I am using the universal encoder from idautomation.com:
https://www.bcgen.com/fontencoder/
I get the following output for the encoded text for Code 128:
Í*5LvJ8*r5;ÂoP<[7+.Î
However, in ";Âo" the character between the semi-colon and o (let us call it special A) - is not part of the extended character set used in Code128. (See the Latin Supplements at https://www.fonts2u.com/code-128.font)
Yet the same string shows a valid barcode at
https://www.bcgen.com/linear-barcode-creator.html
How?
If I use the output containing the special A on a webpage with a barcode font face, the special A character does not render as part of the barcode (which seems correct, since it is not in the font's character set).
What gives? Please help.
I am using the IDAutomation utility to encode the string in 128C symbology. If you can share code to do the encoding (in Java/Python/C/Perl), that would help too.
There are multiple fonts for Code 128 that may use different characters to represent the barcode symbols. Make sure the font and the encoding logic match each other.
I used this one: http://www.jtbarton.com/Barcodes/Code128.aspx (there is also sample code on the site showing how to encode it, but you have to translate it from VB). The font works for all three encodings (A, B and C).
Sorry, this is very late.
When you are dealing with the encoding of code 128, in any subset, it's a good idea to think of that coding in terms of numbers, not characters. At this level, when you have shifts, code-changes, checksums and stuff, intermixed with the data, the whole concept of "character" is lost.
However, this is what is happening:
The semicolon in the output corresponds to "27".
The lowercase o corresponds to "79" and the P to "48".
The "A with macron" corresponds to your "00" sequence. This is why you should be dealing with numbers, not characters, at this level of encoding.
How would you expect it to show a character with a code of 00? That would be a space or a NUL, neither of which is particularly visible.
Your software has simply rendered it the best way it can, which is to make the character 'visible' by adding 0x80 to it. If you look at charmap, you will see that code 0x80 is indeed A with macron.
The rest (indeed all) of your encoded string looks correct for set-C encodation.
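Since the question also asks for code, here is a hedged Java sketch of set-C encodation: start code, one value per digit pair, the weighted modulo-103 checksum, and the stop code. The value-to-glyph mapping at the end is only one common font convention (value + 32 for values 0 to 94, value + 100 for 95 to 106) and is an assumption; check your font's own mapping table.
import java.util.ArrayList;
import java.util.List;

public class Code128C {

    // Builds the Code 128 symbol values for an even-length digit string
    // using subset C: Start C, one value per digit pair, the position-
    // weighted modulo-103 checksum, and Stop.
    public static List<Integer> symbolValues(String digits) {
        if (digits.length() % 2 != 0 || !digits.matches("\\d+")) {
            throw new IllegalArgumentException("subset C needs an even number of digits");
        }
        List<Integer> values = new ArrayList<>();
        values.add(105);                                // Start C
        for (int i = 0; i < digits.length(); i += 2) {  // each digit pair is one value 0-99
            values.add(Integer.parseInt(digits.substring(i, i + 2)));
        }
        int sum = values.get(0);
        for (int i = 1; i < values.size(); i++) {       // data symbols weighted by position
            sum += values.get(i) * i;
        }
        values.add(sum % 103);
        values.add(106);                                // Stop
        return values;
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        for (int v : symbolValues("1021448642241082212700794828592311")) {
            // One common font convention; not every barcode font uses it.
            out.append((char) (v <= 94 ? v + 32 : v + 100));
        }
        System.out.println(out);
    }
}
Run on the question's digits, this should reproduce the Í*5LvJ8*r5;...[7+.Î string, including the checksum character; the one difference is that value 00 comes out as a plain space under this convention, which is exactly the slot the question's font moves to a visible alternate glyph (the "special A").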

why '?' appears as output while Printing unicode characters in java

When printing certain Unicode characters in Java, the output appears as '?'. Why is that, and is there any way to print these characters?
This is my code
String symbol1="\u200d";
StringBuilder strg = new StringBuilder("unicodecharacter");
strg.insert(5,symbol1);
System.out.println("After insertion...");
System.out.println(strg.toString());
Output is
After insertion...
unico?decharacter
Here's a great article, written by Joel Spolsky, on the topic. It won't directly help you solve your problem, but it will help you understand what's going on. It'll also show you how involved the situation really is.
You have a character encoding which doesn't match the character you have, or the screen doesn't support the character.
I would check which encoding you are using throughout, and try to determine whether you are reading, storing, or printing the value correctly.
Are you sure which encoding you need? You may need to explicitly encode your output as UTF-8 or ISO 8859-1 if you are dealing with European characters.
Java's default behaviour when reading an invalid Unicode character is to replace it with the Replacement Character (\uFFFD), which is often rendered as a question mark.
In your case, the text you're reading is not encoded as Unicode; it's encoded as something else (Windows-1252 or ISO-8859-1 are probably the most common alternatives if your text is in English).
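Another very common cause with exactly this code: System.out encodes with the platform default charset, and any character that charset cannot map is printed as '?'. A minimal sketch of forcing UTF-8 output (the Charset overload of the PrintStream constructor needs Java 10 or newer, and the terminal itself must understand UTF-8):
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class UnicodePrintDemo {
    public static void main(String[] args) {
        // System.out encodes with the platform default charset; characters it
        // cannot map (such as \u200d, ZERO WIDTH JOINER) come out as '?'.
        PrintStream utf8Out =
                new PrintStream(System.out, true, StandardCharsets.UTF_8);
        StringBuilder strg = new StringBuilder("unicodecharacter");
        strg.insert(5, "\u200d");
        utf8Out.println(strg); // survives as long as the console is UTF-8
    }
}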
I wrote an open-source library that has a utility to convert any String to a Unicode sequence and vice versa. It helps to diagnose such issues. For instance, to print your String you can use something like this:
String str= StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString("\\u0197" +
StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("Test"));
You can read about the library, where to download it, and how to use it at Open Source Java library with stack trace filtering, Silent String parsing, Unicode converter and Version comparison. See the paragraph "String Unicode converter".

string.equals not working for me

This is the relevant part of the code:
java.util.List<Element> elems = src.getAllElements();
Iterator it = elems.iterator();
Element el;
String key, value, date = "", place = "";
String[] data;
int k = 0;
Segment content;
String contentstr;
String classname;
while (it.hasNext()) {
    el = (Element) it.next();
    if (el.getName().equals("span")) {
        classname = el.getAttributeValue("class");
        if (classname.equals("edit_body")) {
            //java.util.List<Element> elemsinner = el.getChildElements();
            //Iterator itinner = elemsinner.iterator();
            content = el.getContent();
            contentstr = content.toString();
            if (true) {
                System.out.println("Done!");
                System.out.println(classname);
                System.out.println(contentstr);
            }
        }
    }
}
No output. But if I remove the if(classname.equals("edit_body")) condition, it does print (in one of the iterations):
Done!
edit_body
"I honestly think it is better to be a failure at something you love than to be a success at something you hate."
Can't figure out the bug... help!
I am using an external Java library for HTML parsing, BTW.
Also, there are two errors at the start of the output, present in both cases, with or without the if condition:
Dec 20, 2012 11:53:11 AM net.htmlparser.jericho.LoggerProviderJava$JavaLogger error SEVERE: EndTag br at (r1992,c60,p94048) not recognised as type '/normal' because its name and closing delimiter are separated by characters other than white space
Dec 20, 2012 11:53:11 AM net.htmlparser.jericho.LoggerProviderJava$JavaLogger error SEVERE: Encountered possible EndTag at (r1992,c60,p94048) whose content does not match a registered EndTagType
I hope they are not the cause of the error.
OK guys, somebody explain this to me, please: "edit_body".equals(el.getAttributeValue("class")) worked!!
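A likely explanation, assuming some of the parsed span elements carry no class attribute: for those, el.getAttributeValue("class") returns null, so classname.equals("edit_body") throws a NullPointerException before the loop ever reaches the matching element. The reversed comparison is null-safe, as this minimal sketch shows:
public class YodaEquals {
    public static void main(String[] args) {
        String classname = null; // e.g. a span with no class attribute

        // classname.equals("edit_body") would throw NullPointerException here.
        // A String literal is never null, and String.equals(null) returns
        // false, so the reversed ("Yoda") form is safe:
        if ("edit_body".equals(classname)) {
            System.out.println("matched");
        } else {
            System.out.println("no match, and no crash");
        }
    }
}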
I had exactly the same problem just now.
I managed to solve it by using SomeStringVar.replaceAll("\\P{Print}", "").
This removes all non-printable characters from the variable (characters that you can't see: the strings look equal even though they are not).
I used it on each variable involved in the comparison, and it worked for me as well.
Looks like you have leading or trailing whitespace in your classname.
Try using this:
if(classname.trim().equals("edit_body"))
This will trim any whitespace at the ends.
Firstly, String.equals() is NOT broken. It works for millions of other programs / programmers. This is NOT the cause of your problems (unless you or someone has deliberately modified ... and broken your Java installation ...)
So why can two apparently equal strings compare as unequal?
There could be leading or trailing whitespace characters on the String.
There could be embedded non-printing characters.
There could be pairs of Unicode characters that look the same when displayed with a typical font, but are in fact not the same. For instance, the Greek code page contains characters that look like Latin vowels but are in fact different code points, and hence are not equal.
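A minimal illustration of that last point:
public class LookalikeDemo {
    public static void main(String[] args) {
        String latin = "A";      // U+0041 LATIN CAPITAL LETTER A
        String greek = "\u0391"; // U+0391 GREEK CAPITAL LETTER ALPHA
        System.out.println(latin.equals(greek)); // false, though both render as "A"
    }
}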
change the code to:
classname="edit_body"; //<- hardcode
if(classname.equals("edit_body"))
If the code enters the if statement now, then there must obviously be some difference in the string content when you use the original classname=el.getAttributeValue("class");.
In that case, loop over the individual characters and compare those to find the difference.
If the code still doesn't enter the if statement, either your code is not compiling and you are running old code, or your Java installation is broken ;-)
OR,
if Java is anything like .NET (I don't know Java):
is el.getAttributeValue typed as String?
If it is typed as Object, then the if statement would not be entered, since those would be two different instances of the same string.
equals() is a method of the String class, so it works with double-quoted String literals (single quotes denote a char literal, and 'something' will not even compile):
if(someString.equals("something")) ✓
if(someString.equals('something')) ×

Convert XML document from Latin1 to UTF8 using Java

I am trying to create an XML document (an RSS feed) and have worked out all the kinks in it except for one character-encoding issue. The problem is that I declare UTF-8, like so: <?xml version="1.0" encoding="UTF-8"?>, but the document itself is not actually encoded as UTF-8.
I am using the org.apache.ecs.xml package to create all the tags. I then use doc.output(stream) to write the content. This method does not seem to write output using UTF-8, and I don't know how to make that happen. Until I do, some symbols (the British pound sign is what I first noticed) aren't rendered properly in most readers.
--Updated with more information--
I ended up using a bad solution (as explained in the comments) to fix this problem. The correct answer seems to be: don't use the org.apache.ecs.xml library. Thank you all for the help. StackOverflow wins again.
The simplest workaround is probably going to be changing your code like follows:
XMLDocument doc = new XMLDocument(1.0,false,Charset.defaultCharset().toString());
I'm guessing they're just using the default encoding to write characters to the stream. So pass the default encoding to the prologue and you should be fine.
I'll agree with other posters that this is probably the least of your worries. Looking at the source repository for ECS, it doesn't appear to have been updated for four years (the "ECS2" repository likewise).
And some self-promotion: if you're looking to build XML documents using a simple interface, the Practical XML library has a builder. It uses the standard JDK serialization mechanism for output.
Any chance you can write to a Writer rather than an OutputStream? That way you could specify the encoding.
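A minimal sketch of that suggestion, assuming doc.toString() yields the serialized document (the IOUtils answer below relies on the same assumption); the pound sign stands in for the problematic characters:
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8XmlOutput {
    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<rss><title>\u00a3 price</title></rss>";
        try (OutputStream stream = new FileOutputStream("feed.xml");
             Writer out = new OutputStreamWriter(stream, StandardCharsets.UTF_8)) {
            // The Writer, not the XML library, now owns the byte encoding,
            // so the bytes written really are UTF-8 as the prologue declares.
            out.write(xml);
        }
    }
}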
Here is a solution my co-worker came up with that I THINK is the correct way to do it but what do I know. Instead of using doc.output(stream) we used:
try {
IOUtils.write(doc.toString(), stream, "UTF-8");
} catch (IOException e) {
throw new RuntimeException(e);
}
To be honest, I don't completely understand the problem yet, which is why I am having problems in the first place. It seems that @subtenante's solution went through and replaced every character that plain ASCII cannot represent with its Unicode entity, while this solution writes to the stream using the UTF-8 encoding, as I originally wanted doc.output to do. I don't know the exact difference, just that both solved my problem. Any further comments to help me understand it would be appreciated.
I'm not familiar with this package but from the source on the web I suspect it may be broken:
http://kickjava.com/src/org/apache/ecs/xml/XMLDocument.java.htm
contains stuff like
for (int i = 0; i < prolog.size(); i++) {
    ConcreteElement e = (ConcreteElement) prolog.elementAt(i);
    e.output(out);
    // XXX really this should use line separator!
    // XXX should also probably check for pretty print
    // XXX also probably have difficulties with encoding
which suggests problems.
We use XOM (http://www.xom.nu) and that specifically has a setEncoding() on its Serializer so I would suggest changing packages...
Here is a function I wrote to convert all non-ASCII characters in a String to their corresponding entities. It might help you sanitize some PCDATA content before output.
/**
 * Creates XML entities for non-ASCII characters in the given String.
 */
public static String xmlEntitify(String in) {
    StringBuffer b = new StringBuffer();
    for (int i = 0; i < in.length(); i++) {
        char c = in.charAt(i);
        if (c < 128) {
            b.append(c);
        } else if (c == '\ufeff') {
            // BOM character, just remove it
        } else {
            // zero-pad the code point to four hex digits and emit &#xNNNN;
            String cstr = Integer.toHexString(c).toUpperCase();
            while (cstr.length() < 4) {
                cstr = "0" + cstr;
            }
            b.append("&#x");
            b.append(cstr);
            b.append(";");
        }
    }
    return b.toString();
}
Read your input stream into a String content, and write xmlEntitify(content) to the output stream.
Your output is guaranteed to contain only ASCII characters; no more encoding problems.
UPDATE
Given the comments, I'll be even bolder: if you are not sanitizing your data, you are asking for trouble. I guess you are at least already replacing the < and & characters in your PCDATA. If not, you definitely should. I have another version of the above method which, instead of the first if, has:
if (c < 128 && c != '&' && c != '<' && c != '>' && c != '"') {
    b.append(c);
}
so that these characters are also converted to their corresponding entities.
This converts all of my PCDATA to Unicode-safe, ASCII-only strings. I have had no more encoding problems since I started using this technique. I never output XML PCDATA that has not been passed through this method. This is not sweeping the problem under the carpet; it gets rid of it by being as generic as possible.
