How to avoid Junk/garbage characters while reading data from multiple languages? - java

I am parsing RSS news feeds in over 10 different languages.
All the parsing is done in Java, and the data is stored in MySQL before my APIs, written in PHP, serve it to the clients.
I constantly come across garbage characters when I read the data.
What I have tried:
I have configured MySQL to store UTF-8 data. My database, table, and even the column have utf8 as their default charset.
While connecting to the database, I set the character set of the results to UTF-8.
When I run the jar file manually to insert the data, the characters appear fine. But when I run the same jar file from a cron job, I start facing the problem all over again.
In English I particularly face problems like the sample below, and in the other vernacular languages the characters appear to be total garbage and I can't recognize a single character.
Is there anything that I am missing?
Sample garbage characters:
Gujarati :"રેલવે મà«àª¸àª¾àª«àª°à«€àª®àª¾àª‚ સામાન ચોરી થશે તો મળશે વળતર!"
Malayalam : "നേപàµà´ªà´¾à´³à´¿à´²àµ‡à´•àµà´•àµà´³àµà´³ കോളàµâ€ നിരകàµà´•àµ à´•àµà´±à´šàµà´šàµ"
English : Bank Board Bureau’s ambit to widen to financial sector PSUs
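A side note on the manual-run vs. cron difference: if any step decodes bytes without an explicit charset, the result depends on the JVM's default charset, and cron's minimal environment often gives a different default locale than an interactive shell does. Below is a minimal sketch of pinning the charset at the read step; the feed URL and class name are placeholders, and a real RSS parser fed the raw InputStream would instead honour the encoding declared in the XML prolog.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FeedReader {
    public static void main(String[] args) throws Exception {
        // Decode the feed with an explicit charset instead of the JVM default,
        // so behaviour is identical whether launched from a shell or from cron.
        URL feed = new URL("https://example.com/news/rss");   // placeholder feed URL
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(feed.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}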

The Gujarati starts રેલવે, correct? And the Malayalam starts നേപ, correct? And the English should have included Bureau’s.
This is the classic case of:
The bytes you have in the client are correctly encoded in utf8. (Bureau is encoded in the ASCII/latin1 subset of utf8, but ’ is not the ASCII apostrophe.)
You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
The fix for the data is a "2-step ALTER".
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.
Unfortunately, if you have a lot of columns to work with, it will take a lot of ALTERs. You can (should) MODIFY all the necessary columns to VARBINARY for a single table in a pair of ALTERs.
The fix for the code is to establish utf8 as the connection character set; how to do that depends on the API used in PHP. The ALTERs will change the column definition.
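For the Java side that writes the feed data, here is a minimal sketch of establishing a utf8 connection with MySQL Connector/J; the database, table, and credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class FeedWriter {
    public static void main(String[] args) throws Exception {
        // characterEncoding=UTF-8 asks Connector/J to use utf8 for the connection,
        // so the string bytes are not reinterpreted as latin1 on the way in.
        String url = "jdbc:mysql://localhost:3306/newsdb"
                   + "?useUnicode=true&characterEncoding=UTF-8";   // hypothetical database

        try (Connection con = DriverManager.getConnection(url, "feeduser", "secret");
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO headlines (lang, title) VALUES (?, ?)")) {   // hypothetical table
            ps.setString(1, "en");
            // The sample headline from the question, with the real right single quote.
            ps.setString(2, "Bank Board Bureau’s ambit to widen to financial sector PSUs");
            ps.executeUpdate();
        }
    }
}

On the PHP side, the equivalent is calling set_charset('utf8') on the connection, as mentioned above.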
Edit
You have VARCHAR columns with the wrong CHARACTER SET. Hence you see Mojibake like the samples above. Most conversion techniques try to preserve those mojibake characters as characters, but that is not what you need. Instead, the step through VARBINARY preserves the bits while discarding the old claim that those bits represent latin1-encoded characters. The second step again preserves the bits, but now declares that they represent utf8 characters.

Related

Java / SQL Server parameter binding does not work as expected

We notice strange behaviour in our application concerning bind parameters. We use Java with JDBC to connect to a SQL Server database. In a table cell we have the value 'µ', and we compare it with a bind parameter, which is also set to the value 'µ'.
Now, in a SQL statement like "... where value != ?", where 'value' is the column holding 'µ' in the database and ? is the bind variable, also set to 'µ', we notice that we get a record back, though we would expect that 'µ' equals 'µ'.
The method that we use to fill the bind parameter is java.sql.PreparedStatement.setString(int, String).
Some facts:
The character value of µ in different encodings is:
ISO-8859-1 (Latin-1) : 0xB5
UTF-8 : 0xC2B5
UTF-16 (= Java) : 0x00B5
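These byte values can be confirmed from Java itself; a quick sketch (class name arbitrary):

import java.nio.charset.StandardCharsets;

public class MuBytes {
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X", b));
        return sb.toString();
    }

    public static void main(String[] args) {
        String mu = "\u00B5";   // 'µ'
        System.out.println(hex(mu.getBytes(StandardCharsets.ISO_8859_1))); // B5
        System.out.println(hex(mu.getBytes(StandardCharsets.UTF_8)));      // C2B5
        System.out.println(hex(mu.getBytes(StandardCharsets.UTF_16BE)));   // 00B5 (Java's char value)
    }
}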
Now I did some investigation to see which bytes the database actually sees. Therefore I tried a SQL statement like this:
select convert(VARBINARY(MAX), value), -- selects µ from database table
convert(VARBINARY(MAX), N'µ'), -- selects µ from literal
convert(VARBINARY(MAX), ?) -- selects µ from bind parameter
from ...
The result for the three values is:
B500
B500
C200B500 <-- Here is the problem!
So, the internal representation of µ in the database and as NVARCHAR literal is B500.
Now we can't understand what is going on here. We have the value 'µ' in a Java variable (which should internally be 0x00B5). When it is passed as a bind variable, it seems as if it is converted to UTF-8 (giving the byte sequence 0xC2B5), and then the database treats it as two characters, making the character sequence C200B500 from it.
To make things even more confusing:
(1) On another machine with a different database the same code works as expected. The result of the three lines is B500/B500/B500, so the bind variable is converted to a proper B500.
(2) On the same machine and the same database, a different program (using the same JDBC driver library and the same connection parameters) also works as expected, giving the result B500/B500/B500.
Some additional facts, maybe they are important:
The database is SQL Server 2014
Java is Java 7
The application in question is a webapp running in Tomcat 7.
The JDBC library is sqljdbc 4.2
Any help to sort this out is greatly appreciated!
I have now found the solution. It did not have anything to do with SQL Server or binding, but instead...
Tomcat 7 does not run in UTF-8 mode by default (I wasn't aware of that). The µ we are talking about comes from another application that provides this value via webservice calls. However, that application uses UTF-8 by default. So it was sending a UTF-8 µ, but the webservice did not expect UTF-8 and thought it was two characters, and treated them as such, filling the internal String variable with the characters 0xC2 and 0xB5 (which is, for SQL Server, C200B500).
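The effect can be reproduced in plain Java: decode UTF-8 bytes with ISO-8859-1 (the historical default for servlet request parameters, presumably what happened in the webservice here) and the single µ becomes two characters, whose UTF-16LE form matches the observed C200B500. A small sketch:

import java.nio.charset.StandardCharsets;

public class DoubleEncoding {
    public static void main(String[] args) {
        String original = "\u00B5";                               // µ as sent by the UTF-8 caller
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);  // 0xC2 0xB5 on the wire

        // The receiver decodes the bytes as ISO-8859-1, yielding two characters.
        String misread = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(misread.length());                     // 2 -> U+00C2 ('Â') and U+00B5 ('µ')

        // As NVARCHAR (UTF-16LE) on SQL Server this becomes the observed value.
        StringBuilder hex = new StringBuilder();
        for (byte b : misread.getBytes(StandardCharsets.UTF_16LE)) hex.append(String.format("%02X", b));
        System.out.println(hex);                                  // C200B500
    }
}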

Oracle JDBC Unicode for Polish characters is not working properly

Hi: We have a tool that handles reports with Unicode support. It works fine until we encounter this new report with Polish characters.
We are able to retrieve the data and display it correctly; however, when we use the data as input to perform a search, some of the characters do not seem to be converted correctly, and therefore no data is retrieved. Here is a sample.
Table polish has two columns: party, description. One of the values of party is "Bełchatów". I use JDBC to read that value from the database and search with the following SQL statement:
SELECT * from polish where party = N'Bełchatów'
However, this gives me no result. This is with ojdbc6.jar (JDK 8). It does give me a result back with ojdbc7.jar.
What is the reason? And how can we fix this when using ojdbc6.jar?
Thanks!
This is because the Oracle JDBC driver doesn't convert the string into national (Unicode) characters. There is a driver property, oracle.jdbc.defaultNChar=true.
http://docs.oracle.com/cd/B14117_01/java.101/b10979/global.htm
When this property is true, the driver converts a string marked as an NCHAR literal, such as N'Bełchatów', into u'Be\0142chat\00f3w'.
The user can also set this at the data source level. Depending on your persistence API vendor, the way to set it can differ.
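For illustration, a sketch of setting the property on the connection (it can also be set globally with -Doracle.jdbc.defaultNChar=true); the URL, credentials, and the exact behaviour of your driver version should be checked against the Oracle documentation linked above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Properties;

public class PolishLookup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "scott");        // placeholder credentials
        props.setProperty("password", "tiger");
        // Ask the Oracle driver to treat character bind values as NCHAR.
        props.setProperty("oracle.jdbc.defaultNChar", "true");

        try (Connection con = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", props);   // hypothetical URL
             PreparedStatement ps = con.prepareStatement(
                 "SELECT party FROM polish WHERE party = ?")) {
            ps.setString(1, "Bełchatów");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("party"));
                }
            }
        }
    }
}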

NLS_CHARACTERSET WE8ISO8859P1 and UTF8 issues in Oracle

I am currently using an Oracle database which has NLS_CHARACTERSET WE8ISO8859P1. Let's say I store a value in a VARCHAR2 field which is maž (an accented character); in the database it gets stored as maå¾. Now when I try to retrieve it with the query select * from table where fieldValue = 'maž', it returns 0 rows, and then when I try to insert it again it gives me a constraint error saying the value already exists.
How do I overcome such a situation?
I am doing this via Java code.
http://docs.oracle.com/cd/B19306_01/server.102/b14225/ch2charset.htm#g1009784
Oracle Character Set Name: WE8ISO8859P1
Description: Western European 8-bit ISO 8859 Part 1
Region: WE (Western Europe)
Number of Bits Used to Represent a Character: 8
On the other hand, UTF-8 uses several bytes to store a symbol.
If your database uses WE8ISO8859P1 and the column type is from the VARCHAR group (not NVARCHAR) and you're inserting a symbol whose code is > 255, this symbol will be transformed to WE8ISO8859P1 and some information will be lost.
To put it simply, if you're inserting UTF-8 into a db with single-byte character set, your data is lost.
The link above describes different scenarios how to tackle this issue.
You can also try Oracle asciistr/unistr functions, but in general it's not a good way to deal with such problems.
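The loss and the mojibake can both be demonstrated in plain Java, since WE8ISO8859P1 is essentially ISO-8859-1 and ž (U+017E) is not part of that character set; a small sketch:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyEncoding {
    public static void main(String[] args) {
        String value = "maž";   // ž is U+017E, which does not exist in ISO-8859-1

        // Encoding into a single-byte Western European charset loses the character:
        byte[] latin1 = value.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.toString(latin1));                          // [109, 97, 63] -> "ma?"
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));  // ma?

        // Decoding the UTF-8 bytes as Latin-1 instead produces the kind of
        // mojibake the question describes (two characters in place of ž):
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);                 // ž -> 0xC5 0xBE
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1));    // maÅ¾
    }
}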

Performance is slow with Hibernate and MS SQL Server

I'm using Hibernate and the DB is SQL Server.
SQL Server differentiates its data types that support Unicode from the ones that only support ASCII. For example, the character data types that support Unicode are nchar, nvarchar and longnvarchar, whereas their ASCII counterparts are char, varchar and longvarchar respectively. By default, all of Microsoft's JDBC drivers send strings in Unicode format to SQL Server, irrespective of whether the datatype of the corresponding column defined in SQL Server supports Unicode or not. In the case where the data types of the columns support Unicode, everything is smooth. But in cases where the data types of the columns do not support Unicode, serious performance issues arise, especially during data fetches. SQL Server tries to convert the non-Unicode datatypes in the table to Unicode datatypes before doing the comparison. Moreover, if an index exists on the non-Unicode column, it will be ignored. This ultimately leads to a whole table scan during data fetch, thereby slowing down the search queries drastically.
The solution we used: we figured out that there is a property called sendStringParametersAsUnicode which helps in getting rid of this Unicode conversion. This property defaults to 'true', which makes the JDBC driver send every string in Unicode format to the database by default. We switched off this property.
My question is: now we cannot send data in Unicode. If in the future a varchar DB column is changed to nvarchar (only one column, not all varchar columns), we would then need to send the string in Unicode format.
Please suggest me how to handle the scenario.
Thanks.
You need to specify the property sendStringParametersAsUnicode=false in the connection string URL.
jdbc:sqlserver://localhost:1433;databaseName=mydb;sendStringParametersAsUnicode=false
Unicode is the native string representation for communication with SQL Server; if you are converting to MBCS (multi-byte character sets), then you are doing two conversions for every string. If you are concerned with performance, I suggest using all Unicode instead of all MBCS.
ref: http://social.msdn.microsoft.com/Forums/en/sqldataaccess/thread/249c629f-b8f2-4a8a-91e8-aad0d83919ca
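For the mixed scenario in the question (keeping the flag off for the existing varchar columns while one column later becomes nvarchar), one option that may work, assuming the driver follows Microsoft's documented behaviour that the flag only affects the non-national setters, is to use setNString for that one column so it is still sent as Unicode. A sketch with hypothetical table and column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class MixedColumnTypes {
    public static void main(String[] args) throws Exception {
        // Flag switched off: setString(...) parameters go to the server as MBCS,
        // so indexes on varchar columns stay usable.
        String url = "jdbc:sqlserver://localhost:1433;databaseName=mydb;"
                   + "sendStringParametersAsUnicode=false";

        try (Connection con = DriverManager.getConnection(url, "sa", "secret")) {  // placeholder credentials
            // varchar column: plain setString keeps the comparison non-Unicode.
            try (PreparedStatement ps = con.prepareStatement(
                     "SELECT id FROM customers WHERE code = ?")) {        // hypothetical varchar column
                ps.setString(1, "ABC123");
                ps.executeQuery().close();
            }

            // nvarchar column: setNString sends the value as Unicode even though
            // the connection-level flag is false (per the driver documentation).
            try (PreparedStatement ps = con.prepareStatement(
                     "SELECT id FROM customers WHERE display_name = ?")) { // hypothetical nvarchar column
                ps.setNString(1, "Bełchatów");
                ps.executeQuery().close();
            }
        }
    }
}

That keeps the varchar indexes usable while the nvarchar comparison still sees proper Unicode; whether this holds for your exact driver version is worth verifying against the reference above.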

Bulk import from Informix into Oracle

We need to pull some tables from an Informix SE database, truncate tables on Oracle 10g, and then populate them with the Informix data.
Does a bulk import work? Will data types clash?
I'd like to use a simple Java executable that we can schedule daily. Can a Java program call the bulk import? Is there an example you can provide? Thanks.
Interesting scenario!
There are several issues to worry about:
What format does Oracle's bulk import expect the data to be in?
What is the correct format for the DATE and DATETIME values?
Pragmatically (and based on experience with Informix rather than Oracle), rather than truncate the tables before bulk loading, I would bulk load the data into newly created tables (a relatively time-consuming process), then arrange to replace the old tables with the new. Depending on what works quickest, I'd either do a sequence of operations:
Rename old table to junk table
Rename new table to old table
followed by a sequence of 'drop junk table' operations, or I'd do:
Drop old table
Rename new table to old table
If the operations are done this way, the 'down time' for the tables is minimized, compared with 'truncate table' followed by 'load table'.
Oracle is like SE - its DDL statements are non-transactional (unlike IDS where you can have a transaction that drops a table, creates a new one, and then rolls back the whole set of operations).
How to export the data?
This depends on how flexible the Oracle loaders are. If they can adapt to Informix's standard output formats (for example, the UNLOAD format), then the unloading operations are trivial. You might need to set the DBDATE environment variable to ensure that date values are recognized by Oracle. I could believe that 'DBDATE="Y4MD-"' is likely to be accepted; that is the SQL standard 2009-12-02 notation for 2nd December 2009.
The default UNLOAD format can be summarized as 'pipe-delimited fields with backslash escaping embedded newlines, backslash and pipe symbols':
abc|123|2009-12-02|a\|b\\c\
d||
This is one record with a character string, a number, a date, and another character string (containing 'a', '|', 'b', '\', 'c', newline and 'd') and a null field. Trailing blanks are removed from character strings; an empty but non-null character field has a single blank in the unload file.
If Oracle cannot readily be made to handle that, then consider whether Perl + DBI + DBD::Informix + DBD::Oracle might be a toolset to use - this allows you to connect to both the Oracle and the Informix (SE) databases and transfer the data between them.
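If a pure-Java route is preferred (the question asks for a Java executable that can be scheduled daily), the same two-connection transfer can be sketched with JDBC, assuming the Informix and Oracle drivers are on the classpath; the URLs, table, and column names below are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class InformixToOracle {
    public static void main(String[] args) throws Exception {
        try (Connection src = DriverManager.getConnection(
                 "jdbc:informix-sqli://infxhost:9088/stores:INFORMIXSERVER=ol_se", "user", "pass");
             Connection dst = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//orahost:1521/ORCL", "user", "pass")) {

            dst.setAutoCommit(false);

            // Load into a freshly created staging table, then swap the names,
            // as suggested above, instead of truncating the live table.
            try (Statement s = src.createStatement();
                 ResultSet rs = s.executeQuery("SELECT col1, col2, col3 FROM source_tbl");
                 PreparedStatement ins = dst.prepareStatement(
                     "INSERT INTO new_target_tbl (col1, col2, col3) VALUES (?, ?, ?)")) {

                int n = 0;
                while (rs.next()) {
                    ins.setString(1, rs.getString(1));
                    ins.setBigDecimal(2, rs.getBigDecimal(2));
                    ins.setDate(3, rs.getDate(3));     // avoids any DBDATE formatting concerns
                    ins.addBatch();
                    if (++n % 1000 == 0) ins.executeBatch();   // flush in batches of 1000
                }
                ins.executeBatch();
            }
            dst.commit();
        }
    }
}

Using bind parameters this way also sidesteps the quoting problems described below for generated INSERT statements.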
Alternatively, you need to investigate alternative unloaders for SE. One program that may be worth investigating unless you're using Windows is SQLCMD (fair warning: author's bias creeping in). It has a fairly powerful set of output formatting options and can probably create a text format that Oracle would find acceptable (CSV, for example).
A final fallback would be to have a tool generate INSERT statements for the selected data. I think this could be useful as an addition to SQLCMD, but it isn't there yet. So, you would have to use:
SELECT 'INSERT INTO Target(Col1, Col2) VALUES (' ||
Col1 || ', ''' || Col2 || ''');'
FROM Source
This generates a simple INSERT statement. The snag with this is that it is not robust if Col2 (a character string) itself contains quotes (and newlines may cause problems on the receiving end too). You'd have to evaluate whether this is acceptable.
