Java: Search in a wrong encoded String without modifying it

I have to find a user-defined String in a document (using Java) that is stored in a database as a BLOB. When I search for a String with special characters (umlauts such as ä, ö, ü), the search fails: it does not return any positions at all. I am not allowed to convert the document's content into UTF-8 (which would fix this problem but raise a new, even bigger one).
Some additional information:
The document's content is returned as String in "ISO-8859-1" (Latin1).
Here is an example of what a String could look like:
Die Erkenntnis, daÃ der KÃ¼nstler Schutz braucht, ...
This is how it should look:
Die Erkenntnis, daß der Künstler Schutz braucht, ...
If I search for Künstler, the search fails, because it looks for ü but only finds Ã¼.
Is it possible to convert Künstler into KÃ¼nstler so I can search for the wrongly encoded version instead?
Note:
We are using the Hibernate Framework for database access. The original getter for the document's content returns a byte[]. The String is then returned by calling
new String(getContent(), "ISO-8859-1")
The problem is that I cannot change this to UTF-8, because that would break the rest of our application, which is based on a third-party application that delivers data this way.

Okay, looks like I've found a way to mess up the encoding on purpose.
new String("Künstler".getBytes("UTF-8"), "ISO-8859-1")
By getting the Bytes of the String Künstler in UTF-8 and then creating a new String, telling Java that this is Latin1, it converts to KÃ¼nstler. It's a hell of a hack but seems to work well.
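The hack above can be wrapped in a small helper. This is only a sketch: the class and method names (MisencodedSearch, toMisencoded, indexOfTerm) are made up for illustration, and it assumes the stored bytes really are UTF-8 that was decoded as ISO-8859-1. The non-ASCII letters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

class MisencodedSearch {

    // Re-encodes a correctly typed search term the same wrong way the document
    // was decoded: take its UTF-8 bytes and read them back as ISO-8859-1.
    static String toMisencoded(String term) {
        return new String(term.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Returns the position of the term inside the mis-decoded content, or -1 if absent.
    static int indexOfTerm(String misdecodedContent, String term) {
        return misdecodedContent.indexOf(toMisencoded(term));
    }

    public static void main(String[] args) {
        // Simulate the stored document: UTF-8 bytes decoded as Latin-1.
        // "\u00DF" is the sharp s, "\u00FC" is u-umlaut.
        byte[] blob = "Die Erkenntnis, da\u00DF der K\u00FCnstler Schutz braucht"
                .getBytes(StandardCharsets.UTF_8);
        String content = new String(blob, StandardCharsets.ISO_8859_1);

        System.out.println(indexOfTerm(content, "K\u00FCnstler"));
    }
}
```

Note that the returned position refers to the mis-decoded String, where every umlaut occupies two chars, so it will not line up with positions in a correctly decoded copy of the text.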

You have already answered this yourself.
An altogether different approach:
If you can search the blob, you could search using
"SELECT .. FROM ... WHERE"
+ " ... LIKE '%" + key.replaceAll("\\P{ASCII}+", "%") + "%'"
This replaces non-ASCII sequences with the % wildcard: UTF-8 multibyte sequences are non-ASCII by design.
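A minimal sketch of the pattern-building step on its own, with a hypothetical helper name (toLikePattern). It does not escape % or _ characters that the user may have typed into the key, and whether LIKE can be applied to a BLOB column depends on the database. Binding the result as a PreparedStatement parameter instead of concatenating it into the SQL text would be the safer choice.

```java
class LikePattern {

    // Replaces every run of non-ASCII characters with the SQL wildcard %,
    // then wraps the key in % so it can match anywhere in the column.
    static String toLikePattern(String key) {
        return "%" + key.replaceAll("\\P{ASCII}+", "%") + "%";
    }

    public static void main(String[] args) {
        // "\u00FC" is u-umlaut; prints %K%nstler%
        System.out.println(toLikePattern("K\u00FCnstler"));
    }
}
```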

Related

JSON parser exception is observed for on fly JSON

I created a JSON document on the fly from some runtime data and stored it as a string like the one below:
JSON:
{
"ticketDetails": "kindle tracking ticket: TICKET0900060
Iimpact statement: impacted due to year 2020 format handling issue,
depending on the Gateway,
user can be asked to
try with another instrument.
Timeline: 00: 00 SAP Internal Declines spiked to 300 +
05: 22 AM flintron reported DECLINED errors since 0: 00 PST.
As per TDO,flitron is not seeing clear metrics impact " }
Note: I copied the exact JSON that I get at runtime. It contains \n and spaces exactly as shown above.
I can see that a few JSON fields, such as ticketDetails, hold a huge description, and parsing the above string leads to a parse error.
I tried to eliminate the parse error as follows:
String removeSpecialCharacterFromJson = jsonString.replaceAll("[^A-Za-z0-9]", "");
System.out.println(removeSpecialCharacterFromJson);
Sample Output:
kindletrackingticketTICKET0900060Iimpactstatementimpactedduetoyear2020formathandlingissue....[space between characters are removed and It's hard to read the string]
The above code removes all special characters from the string, and the result parses successfully. But the description no longer has any spaces, so the content is very hard to read.
I also tried excluding \s from the removal in the regular expression, but that keeps the original line breaks and leads to the parse exception again:
String removeSpecialCharacterFromJson = jsonString.replaceAll("[^A-Za-z0-9\\s]", "");
Is there any other way to handle this? I just want ticketDetails to be in a readable format, without any special characters or \n line breaks.
Can someone help me with this?

\s in a regular expression does not match only the space character; it matches any whitespace.
I guess that you have some additional whitespace in your JSON string in a place where it is not allowed.
Take a look at the JSON spec (RFC 7159):
Insignificant whitespace is allowed before or after any of the six structural characters.
ws = *(
%x20 / ; Space
%x09 / ; Horizontal tab
%x0A / ; Line feed or New line
%x0D ) ; Carriage return
Verify your values and look for improper whitespace.
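RFC 7159 also requires control characters (U+0000 through U+001F, which includes line feed and tab) to be escaped inside string values, so a raw line break within ticketDetails is a likely cause of the parse error. The following sketch, with a hypothetical helper name (collapseWhitespace), assumes that is the problem and collapses each run of whitespace into a single space, which keeps the description readable. It changes whitespace inside string values as well, which seems acceptable for this use case.

```java
class JsonWhitespace {

    // Collapses every run of whitespace (spaces, tabs, line breaks) into one space.
    static String collapseWhitespace(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // A shortened version of the question's data with raw line breaks in the value.
        String raw = "{\n\"ticketDetails\": \"kindle tracking ticket: TICKET0900060\n"
                + "Timeline: 00: 00 SAP Internal Declines spiked to 300 + \" }";
        System.out.println(collapseWhitespace(raw));
    }
}
```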

AS400 SQL Script on a parameter file returns

I'm integrating an application to the AS400 using Java/JT400 driver. I'm having an issue when I extract data from a parameter file - the data retrieved seems to be encoded.
SELECT SUBSTR(F00001,1,20) FROM QS36F."FX.PARA" WHERE K00001 LIKE '16FFC%%%%%' FETCH FIRST 5 ROWS ONLY
Output
00001: C6C9D9C540C3D6D4D4C5D9C3C9C1D34040404040, - 1
00001: C6C9D9C5406040C3D6D4D4C5D9C3C9C1D3406040, - 2
How can I convert this to a readable format? Is there a function which I can use to decode this?
On the terminal connection to the AS400 the information is displayed correctly through the same SQL query.
I have no experience working with AS400 before this and could really use some help. This issue is only with the parameter files. The database tables work fine.

What you are seeing is EBCDIC output instead of ASCII. This is due to the CCSID not being specified in the database as mentioned in other answers. The ideal solution is to assign the CCSID to your field in the database. If you don't have the ability to do so and can't convince those responsible to do so, then the following solution should also work:
SELECT CAST(SUBSTR(F00001,1,20) AS CHAR(20) CCSID 37)
FROM QS36F."FX.PARA"
WHERE K00001 LIKE '16FFC%%%%%'
FETCH FIRST 5 ROWS ONLY
Replace the CCSID with whichever one you need. The CCSID definitions can be found here: https://www-01.ibm.com/software/globalization/ccsid/ccsid_registered.html

Since the file is in QS36F, I would guess that the file is a flat file and not externally defined ... so the data in the file would have to be manually interpreted if being accessed via SQL.
You could try casting the field, after you substring it, into a character format.
(I don't have a S/36 file handy, so I really can't try it)

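It is the hex representation of the bytes of a text in EBCDIC, the AS/400 charset.
import java.nio.charset.Charset;

// Decodes a string of hex digits (two per byte) as EBCDIC text.
static String fromEbcdic(String hex) {
    int m = hex.length();
    if (m % 2 != 0) {
        throw new IllegalArgumentException("Must be even length");
    }
    int n = m / 2;
    byte[] bytes = new byte[n];
    for (int i = 0; i < n; ++i) {
        // Parse each pair of hex digits into one byte.
        int b = Integer.parseInt(hex.substring(i * 2, i * 2 + 2), 16);
        bytes[i] = (byte) b;
    }
    // Cp500 is an EBCDIC code page.
    return new String(bytes, Charset.forName("Cp500"));
}
Passing "C6C9D9C540C3D6D4D4C5D9C3C9C1D34040404040" yields "FIRE COMMERCIAL" followed by five spaces.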
To convert a whole file, read it with Cp500 as the charset:
Path path = Paths.get("...");
List<String> lines = Files.readAllLines(path, Charset.forName("Cp500"));
Line endings on the AS/400 are the NEL char, U+0085; to normalize them one can use a regex:
content = content.replaceAll("\\R", "\r\n");
The regex \R will match exactly one line break, whether \r, \n, \r\n, \u0085.

A Big thank you for all the answers provided, they are all correct.
It is a flat parameter file in the AS400 and I have no control over changing anything in the system. So it has to be at runtime of the SQL query or once received.
I had absolutely no clue about what the code page was as I have no prior experience with AS400 and files in it. Hence all your answers have helped resolve and enlighten me on this. :)
So, the best answer is the last one. I have changed the SQL as follows and I get the desired result.
SELECT CAST(F00001 AS CHAR(20) CCSID 37) FROM QS36F."FX.PARA" WHERE K00001 LIKE '16FFC%%%%%' FETCH FIRST 5 ROWS ONLY
00001: FIRE COMMERCIAL , - 1
00001: FIRE - COMMERCIAL - , - 2
Thanks once again.
Dilanke

Filter Special Characters in Spring / Java

I'm using jsoup to get all text from websites.
Document doc = Jsoup.connect("URL").get();
String allText = doc.text().toLowerCase();
Then I'm using Hibernate to persist the object that holds all text to a MySQL DB:
...
@Column(name="all_text")
@Lob
private String allText = null;
...
Everything is good so far. Only that sometimes I get a MySQL error when I try to save the object with allText:
java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x8A s...' for column 'all_text' at row 1
Already looked this up and it's an encoding error. Probably have some special characters on their websites. I found a way to fix this by changing the encoding in the DB.
But my actual question is: what's the best way to filter and remove the special characters from the allText string and not persist them at all?
EDIT: To clarify, by special characters I mean Emoticons and all that stuff. Definitely anything that doesn't fit into UTF-8 encoding. I'm not concerned about ~ ^ etc...
Thanks in advance!

Just use regex:
allText = allText.replaceAll("\\p{C}", "");
If you precompile the pattern instead, don't forget to import java.util.regex.Pattern
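The bytes in the error message, \xF0\x9F\x98\x8A, are the four-byte UTF-8 encoding of U+1F60A, an emoji, which Unicode classifies as a symbol and not as one of the "other" categories that \p{C} covers. Assuming the column uses MySQL's three-byte utf8 character set, another option is to drop every code point above U+FFFF, since those are exactly the ones that need four bytes in UTF-8. A sketch with a hypothetical helper name (stripSupplementary):

```java
class SupplementaryFilter {

    // Removes every code point above U+FFFF, i.e. everything that would
    // need four bytes in UTF-8. Characters in the Basic Multilingual Plane stay.
    static String stripSupplementary(String text) {
        return text.replaceAll("[^\\u0000-\\uFFFF]", "");
    }

    public static void main(String[] args) {
        // "\uD83D\uDE0A" is the surrogate pair for U+1F60A from the error message.
        String sample = "nice \uD83D\uDE0A site";
        System.out.println(stripSupplementary(sample));
    }
}
```

This keeps accented letters, CJK text and other BMP characters intact, so it is narrower than stripping everything non-ASCII.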

Croatian character in Java standard output

I have a database with some Croatian characters in it, like Đ. The character is stored correctly in the database, and when I use a datatable in PrimeFaces, the web page also shows it just fine.
The problem is that when I send it to System.out.println(), the character Đ in the name is missing.
for (People p : people) {
System.out.println(p.getName());
}
I tried using String name2 = p.getName().getBytes("ISO-8859-2"); but it is still not working

I assume you are using UTF-8 as the default encoding for the database and for PrimeFaces.
Also have a look at this:
Display special characters using System.out.println
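One common way to get such characters onto standard output is to wrap System.out in a PrintStream with an explicit encoding. The sketch below uses a hypothetical helper (Utf8Console.utf8) and assumes the terminal or IDE console that receives the output is itself set to UTF-8; if it is not, the character will still be displayed incorrectly.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {

    // Wraps a stream in an auto-flushing PrintStream that encodes text as UTF-8.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8(System.out);
        // "\u0110" is the capital letter D with stroke.
        out.println("\u0110akovo");
    }
}
```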

java.net.URI and percent in query parameter value

System.out.println(
new URI("http", "example.com", "/servlet", "a=x%20y", null));
The result is http://example.com/servlet?a=x%2520y, where the query parameter value differs from the supplied one. Strange, but this does follow the Javadoc:
"The percent character ('%') is always quoted by these constructors."
We can pass the decoded string, a=x y and then we get a reasonable(?) result a=x%20y.
But what if the query parameter value contains an "&" character? This happens, for example, if the value is itself a URL with query parameters. Look at this (wrong) query string:
a=b&c. The ampersand must be escaped here (a=b%26c), otherwise this can be interpreted as a query parameter a=b followed by some garbage (c). If I pass the escaped form to a URI constructor, it encodes it again and returns a wrong URL: ...?a=b%2526c
This issue seems to render java.net.URI useless. Am I missing something here?
Summary of answers
java.net.URI does know about the existence of the query part of a URI, but it does not understand the internals of the query part, which can differ for each scheme. For example, java.net.URI does not understand the internal structure of the HTTP query part. This would not be a problem if java.net.URI treated the query as an opaque string and did not alter it. But it tries to apply a generic percent-encoding algorithm, which breaks HTTP URLs.
Therefore I cannot use the URI class to reliably assemble a URL from its parts, even though there are constructors for it. I would also mention that as of Java 7, the implementation of the relativize operation is quite limited and only works if one URL is the prefix of the other. These two features (and the leaner interface for these purposes) were the reason why I was interested in java.net.URI, but neither of them works for me.
In the end I used java.net.URL for parsing, and wrote code to assemble a URL from parts and to relativize two URLs. I also checked the Apache HttpClient URIBuilder class: although it does understand the internals of an HTTP query string, as of 4.3 it has the same encoding problem as java.net.URI when dealing with the query part as a whole.

The query string
a=b&c
is not wrong in a URI. The RFC on URI Generic Syntax states
The query component is a string of information to be interpreted by
the resource.
query = *uric
Within a query component, the characters ";", "/", "?", ":", "#",
"&", "=", "+", ",", and "$" are reserved.
The character & in the query string is very much valid (uric represents reserved, mark, and alphanumeric characters). The RFC also states
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.
Because the & is valid but reserved, it is up to the user to determine if it is meant to be encoded or not.
What you call a query parameter is not a feature of a URI and therefore the URI class has no reason to (and shouldn't) support it.
Related:
Which characters make a URL invalid?

The only workaround I found was to use the single-argument constructors and methods. Note that you must use URI#getRawQuery() to avoid decoding %26. For example:
URI uri = new URI("http://a/?b=c%26d&e");
// uri.getRawQuery() equals "b=c%26d&e"
uri = new URI(new URI(uri.getScheme(), uri.getAuthority(),
uri.getPath(), null, null) + "?f=g%26h&i");
// uri.getRawQuery() equals "f=g%26h&i"
uri = uri.resolve("?j=k%26l&m");
// uri.getRawQuery() equals "j=k%26l&m"
// uri.toString() equals "http://a/?j=k%26l&m"

The only working solution I know of is reflection (see https://blog.stackhunter.com/2014/03/31/encode-special-characters-java-net-uri/)
URI uri = new URI("http", null, "example.com", -1, "/accounts", null, null);
Field field = URI.class.getDeclaredField("query");
field.setAccessible(true);
field.set(uri, encodedQueryString);
//clear cached string representation
field = URI.class.getDeclaredField("string");
field.setAccessible(true);
field.set(uri, null);

Use the URLEncoder.encode() method, for example in your case:
URLEncoder.encode("a=x%20y", "ISO-8859-1");
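Note that URLEncoder.encode() applied to a whole query string also escapes the "=" and "%" characters. A variation that may fit the question better is to encode each name and value separately and then parse the assembled string with the single-argument URI factory, which leaves existing percent escapes alone. This is only a sketch; QueryBuilder, enc and withParam are hypothetical names, and URLEncoder uses form encoding, so a space becomes "+" and not "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

class QueryBuilder {

    // Percent-encodes a single parameter name or value (form encoding).
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Appends one encoded parameter to a base URL that has no query yet and
    // parses the result with URI.create, which does not re-quote "%".
    static URI withParam(String base, String name, String value) {
        return URI.create(base + "?" + enc(name) + "=" + enc(value));
    }

    public static void main(String[] args) {
        URI uri = withParam("http://example.com/servlet", "a", "b&c");
        System.out.println(uri);               // http://example.com/servlet?a=b%26c
        System.out.println(uri.getRawQuery()); // a=b%26c
    }
}
```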
