NLS_CHARACTERSET WE8ISO8859P1 and UTF8 issues in Oracle - java

I am currently using an Oracle database whose NLS_CHARACTERSET is WE8ISO8859P1. Let's say I store a value in a VARCHAR2 field that is maž (with an accented character); in the database it gets stored as maå¾. Now when I try to retrieve it with the query select * from table where fieldValue = 'maž' it returns 0 rows, and when I then try to insert it again I get a constraint error saying the value already exists.
How can I overcome this situation?
I am doing this via Java code.

http://docs.oracle.com/cd/B19306_01/server.102/b14225/ch2charset.htm#g1009784
Oracle Character Set Name: WE8ISO8859P1
Description: Western European 8-bit ISO 8859 Part 1
Region: WE (Western Europe)
Number of Bits Used to Represent a Character: 8
UTF-8, on the other hand, can use several bytes to store a single character.
If your database uses WE8ISO8859P1, the column type is from the VARCHAR group (not NVARCHAR), and you insert a character whose code point is greater than 255, that character is converted to WE8ISO8859P1 and some information is lost.
To put it simply, if you insert UTF-8 data into a database with a single-byte character set, your data is lost.
The link above describes different scenarios for tackling this issue.
You can also try the Oracle ASCIISTR/UNISTR functions, but in general that is not a good way to deal with such problems.
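Since the data is written from Java, one way to catch this early is to check whether a value can be represented in the database character set at all before handing it to the driver. A minimal sketch, assuming the database character set really is WE8ISO8859P1 (i.e. ISO-8859-1); the class name is illustrative:

import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class CharsetCheck {
    public static void main(String[] args) {
        // ž (U+017E) is not part of ISO-8859-1, so a WE8ISO8859P1 VARCHAR2
        // column cannot hold "maž" without a lossy conversion.
        String value = "maž";
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        if (!encoder.canEncode(value)) {
            System.out.println("'" + value + "' cannot be stored losslessly in WE8ISO8859P1");
        }
    }
}

If the check fails, the options above apply: a national character set column (NVARCHAR2) or a character set migration as described in the linked guide.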

Related

Unable to copy data from DB2 to Oracle via Java because of different charsets

I'm reading a column field (char(255)) from a DB2 table via a ResultSet:
String tmp = rs.getString(i);
this string is 255 chars long in the table, but when I try to put this value into the Oracle column (char(255)) via an insert statement
insertStmt.setString(j,tmp);
I get an SQLException:
ORA-12899: value too large for column "[schema_name]"."[table_name]"."[column_name]" (current: 258, max: 255)
I can see that
tmp.getBytes("UTF-8").length
is 258, because of three "è" characters using two bytes instead of one.
How can I successfully convert/insert this string into the Oracle table?
From Oracle db I can read
NLS_CHARACTERSET AL32UTF8
but I'm not able to handle it on the Java side.
Are you able to change NLS_LENGTH_SEMANTICS or widen your target field?
NLS_LENGTH_SEMANTICS allows you to specify the length of a column datatype in terms of CHARacters rather than in terms of BYTEs. Typically this is when using an AL32UTF8 or UTF8 or other varying width NLS_CHARACTERSET database where one character is not always one byte.
It can be set at session level if not defined at database level.
ALTER SESSION SET NLS_LENGTH_SEMANTICS=CHAR
Oracle advises specifying CHAR explicitly when creating the table:
Create table scott.test (Col1 CHAR(20 CHAR), Col2 VARCHAR2(100 CHAR));
This also works when importing or when using ‘alter table modify’.
When accessing columns in PL/SQL, define the variables explicitly e.g.
Col2 VARCHAR2 (10 CHAR);
Limitations:
The instance or session value will only be used when creating NEW columns.
It's good practice to avoid using a mixture of BYTE and CHAR semantics columns in the same table.
As for the maximum width: a typical UTF8 character takes up to 3 bytes, and a VARCHAR2 column is still capped at 4000 bytes even with CHAR semantics, so a VARCHAR2(4000 CHAR) column holds only about 1333 such characters; inserting more will still give “ORA-12899: value too large for column”. In these cases, Oracle recommends converting the VARCHAR2 to a CLOB.
Your column is defined with BYTE semantics; you have to switch to CHAR semantics using DDL such as:
alter table tst modify (txt varchar2(5 CHAR));
Small example:
create table tst
(txt varchar2(5));
insert into tst (txt) values('ééééé');
-- SQL-Fehler: ORA-12899: value too large for column "REPORTER"."TST"."TXT" (actual: 10, maximum: 5)
Note that with BYTE semantics the column can store only 5 bytes, so the insert fails.
alter table tst modify (txt varchar2(5 CHAR));
insert into tst (txt) values('ééééé');
1 row inserted
With CHAR semantics, 5 characters can be stored even though they occupy more than 5 bytes.
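On the Java side, the mismatch from the original question can be spotted before the insert by comparing character length with UTF-8 byte length, mirroring the tmp.getBytes("UTF-8").length check above. A small sketch; the 255/258 numbers match the CHAR(255) column from the question:

import java.nio.charset.StandardCharsets;

public class ByteLengthCheck {
    public static void main(String[] args) {
        // Build a 255-character string containing three "è" (2 bytes each in UTF-8).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 252; i++) {
            sb.append('x');
        }
        sb.append("èèè");
        String tmp = sb.toString();

        int chars = tmp.length();
        int bytes = tmp.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(chars + " characters, " + bytes + " UTF-8 bytes");

        // With BYTE semantics the 255 limit applies to bytes, so 258 bytes
        // raise ORA-12899; with CHAR(255 CHAR) the same value fits.
        if (bytes > 255) {
            System.out.println("Too large for CHAR(255 BYTE), but fits in CHAR(255 CHAR)");
        }
    }
}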

oracle jdbc unicode for polish character is not working properly

Hi: We have a tool that handles reports with Unicode support. It works fine until we encounter this new report with Polish characters.
We are able to retrieve the data and display it correctly; however, when we use the data as input to perform a search, it seems that some of the characters are not converted correctly and we are therefore not able to retrieve the data. Here is a sample.
Table polish has two columns: party, description. One of the values of party is "Bełchatów". I use JDBC to read that value from the database and search with the following SQL statement:
SELECT * from polish where party = N'Bełchatów'
However, this gives me no result. This is with ojdbc6.jar (JDK 8). However, it does give me results back with ojdbc7.jar.
What is the reason? And how can we fix this when using ojdbc6.jar?
Thanks!
This is because the Oracle JDBC driver does not convert the string into the national character set (Unicode) form by default. There is a driver property, oracle.jdbc.defaultNChar=true.
http://docs.oracle.com/cd/B14117_01/java.101/b10979/global.htm
When this property is true, the driver converts a string marked as an NCHAR literal, N'Bełchatów', into u'Be\0142chat\00f3w'.
You can also set it at the data source level. Depending on your persistence API vendor, the way to set it can differ.
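For completeness, a minimal JDBC sketch of setting the property; the polish table and party column come from the question, while the URL and credentials are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Properties;

public class NCharQuery {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "scott");          // placeholder credentials
        props.setProperty("password", "tiger");
        // Bind all String parameters in the national character set (NCHAR/NVARCHAR2).
        props.setProperty("oracle.jdbc.defaultNChar", "true");

        String url = "jdbc:oracle:thin:@//localhost:1521/ORCL"; // placeholder URL
        try (Connection conn = DriverManager.getConnection(url, props);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT party, description FROM polish WHERE party = ?")) {
            ps.setString(1, "Bełchatów");
            // Without the global property, a single parameter can also be bound
            // as NCHAR explicitly: ps.setNString(1, "Bełchatów");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("party"));
                }
            }
        }
    }
}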

How to avoid Junk/garbage characters while reading data from multiple languages?

I am parsing RSS news feeds in over 10 different languages.
All the parsing is done in Java and the data is stored in MySQL before my APIs, written in PHP, respond to the clients.
I constantly come across garbage characters when I read the data.
What I have tried:
I have configured MySQL to store utf-8 data. My db, table, and even the column have UTF8 as their default charset.
While connecting to my db, I set the character set of results to utf-8.
When I run the jar file manually to insert the data, the characters appear fine. But when I set up a cronjob for the same jar file, I start facing the problem all over again.
In English I particularly face problems like this, and in the other vernacular languages the characters appear to be total garbage and I can't recognize a single character.
Is there anything that I am missing?
Sample garbage characters :
Gujarati :"રેલવે મà«àª¸àª¾àª«àª°à«€àª®àª¾àª‚ સામાન ચોરી થશે તો મળશે વળતર!"
Malyalam : "നേപàµà´ªà´¾à´³à´¿à´²àµ‡à´•àµà´•àµà´³àµà´³ കോളàµâ€ നിരകàµà´•àµ à´•àµà´±à´šàµà´šàµ"
English : Bank Board Bureau’s ambit to widen to financial sector PSUs
The Gujarati starts રેલવે, correct? And the Malyalam starts നേപ, correct? And the English should have included Bureau’s.
This is the classic case of Mojibake:
The bytes you have in the client are correctly encoded in utf8. (Bureau is encoded in the Ascii/latin1 subset of utf8; but ’ is not the ascii apostrophe.)
You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
The fix for the data is a "2-step ALTER".
ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;
where the lengths are big enough and the other "..." have whatever else (NOT NULL, etc) was already on the column.
Unfortunately, if you have a lot of columns to work with, it will take a lot of ALTERs. You can (should) MODIFY all the necessary columns to VARBINARY for a single table in a pair of ALTERs.
The fix for the code is to establish utf8 as the connection character set; how to do that depends on the API used in PHP. The ALTERs will change the column definition.
Edit
You have VARCHAR with the wrong CHARACTER SET. Hence, you see Mojibake like રેલ. Most conversion techniques try to preserve રેલ, but that is not what you need. Instead, taking a step to VARBINARY preserves the bits while ignoring the old definition of the bits representing latin1-encoded characters. The second step again preserves the bits, but now claiming they represent utf8 characters.
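Since the inserts here come from Java, the connection-side half of the fix with MySQL Connector/J looks roughly like the sketch below; the database name, table name and credentials are placeholders, and the column must already be CHARACTER SET utf8 after the two-step ALTER above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class Utf8Insert {
    public static void main(String[] args) throws Exception {
        // characterEncoding=UTF-8 makes Connector/J exchange utf8 with the server,
        // independent of the JVM default charset (which may differ under cron).
        String url = "jdbc:mysql://localhost:3306/newsdb"
                   + "?useUnicode=true&characterEncoding=UTF-8";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO feed_items (title) VALUES (?)")) {
            ps.setString(1, "Bank Board Bureau’s ambit to widen to financial sector PSUs");
            ps.executeUpdate();
        }
    }
}

The PHP read side then needs the equivalent setting (e.g. set_charset('utf8') as mentioned above) so the same bytes are interpreted as utf8 on the way out.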

performance is slow with hibernate and MS sql server

I'm using Hibernate and the DB is SQL Server.
SQL Server differentiates its data types that support Unicode from the ones that support only ASCII. For example, the character data types that support Unicode are nchar, nvarchar and longnvarchar, whereas their ASCII counterparts are char, varchar and longvarchar respectively. By default, all of Microsoft's JDBC drivers send strings in Unicode format to SQL Server, irrespective of whether the data type of the corresponding column defined in SQL Server supports Unicode or not. Where the column data types support Unicode, everything is smooth. But where the column data types do not support Unicode, serious performance issues arise, especially during data fetches: SQL Server tries to convert the non-Unicode data in the table to Unicode before doing the comparison, and if an index exists on the non-Unicode column, it is ignored. This ultimately leads to a full table scan during data fetches, slowing down search queries drastically.
The solution we used: we found that there is a property called sendStringParametersAsUnicode which gets rid of this Unicode conversion. This property defaults to ‘true’, which makes the JDBC driver send every string in Unicode format to the database. We switched this property off.
My question: now we cannot send data in Unicode at all. If in the future a varchar column is changed to nvarchar (only one column, not all varchar columns), we would need to send strings in Unicode format for that column.
Please suggest how to handle this scenario.
Thanks.
You need to specify property: sendStringParametersAsUnicode=false in connection string url.
jdbc:sqlserver://localhost:1433;databaseName=mydb;sendStringParametersAsUnicode=false
Unicode is the native string representation for communication with SQL Server; if you are converting to MBCS (multibyte character sets), you are doing two conversions for every string. I suggest that if you are concerned with performance, you use all Unicode instead of all MBCS.
ref: http://social.msdn.microsoft.com/Forums/en/sqldataaccess/thread/249c629f-b8f2-4a8a-91e8-aad0d83919ca
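If only one column later becomes nvarchar, one option is to keep sendStringParametersAsUnicode=false globally and bind just that parameter as a national-character string with setNString, which the Microsoft driver sends as Unicode. This is a hedged sketch; the table and column names are made up, and the setNString behaviour is worth verifying against your driver version:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class MixedColumnUpdate {
    public static void main(String[] args) throws Exception {
        // Keep the fast non-Unicode default for the plain varchar columns.
        String url = "jdbc:sqlserver://localhost:1433;databaseName=mydb;"
                   + "sendStringParametersAsUnicode=false";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "UPDATE customer SET code = ?, display_name = ? WHERE id = ?")) {
            ps.setString(1, "ABC123");        // varchar column: sent as non-Unicode
            ps.setNString(2, "Zażółć");       // nvarchar column: sent as Unicode
            ps.setInt(3, 42);
            ps.executeUpdate();
        }
    }
}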

Hibernate and padding on CHAR primary key column in Oracle

I'm having a little trouble using Hibernate with a char(6) column in Oracle. Here's the structure of the table:
CREATE TABLE ACCEPTANCE
(
USER_ID char(6) PRIMARY KEY NOT NULL,
ACCEPT_DATE date
);
For records whose user id has fewer than 6 characters, I can select them without padding the user id when running queries in SQuirreL. I.e., the following returns a record if there is a record with a user id of "abc":
select * from acceptance where user_id = 'abc'
Unfortunately, when doing the select via Hibernate (JPA), the following returns null:
em.find(Acceptance.class, "abc");
If I pad the value though, it returns the correct record:
em.find(Acceptance.class, "abc ");
The module that I'm working on gets the user id unpadded from other parts of the system. Is there a better way to get Hibernate working other than putting in code to adapt the user id to a certain length before giving it to Hibernate? (which could present maintenance issues down the road if the length ever changes)
That's God's way of telling you to never use CHAR() for primary key :-)
Seriously, however: since your user_id is mapped as a String in your entity, Hibernate's Oracle dialect translates it into varchar. Since Hibernate uses prepared statements for all its queries, that semantics carries over (unlike SQuirreL, where the value is specified as a literal and thus converted differently).
Based on Oracle's type conversion rules, the CHAR column value is then promoted to varchar2 and compared as such; thus you get back no records.
If you can't change the underlying column type, your best option is probably to use HQL query and rtrim() function which is supported by Oracle dialect.
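A hedged sketch of that approach, assuming the existing Acceptance entity maps the USER_ID column to a userId field (entity and field names are assumptions based on the question):

import javax.persistence.EntityManager;

public class AcceptanceLookup {
    private final EntityManager em;

    public AcceptanceLookup(EntityManager em) {
        this.em = em;
    }

    // rtrim() runs on the database side, so the padded CHAR(6) value
    // 'abc   ' matches the unpadded "abc" passed in by the caller.
    public Acceptance findByUserId(String unpaddedUserId) {
        return em.createQuery(
                "select a from Acceptance a where rtrim(a.userId) = :id",
                Acceptance.class)
            .setParameter("id", unpaddedUserId)
            .getResultList()
            .stream()
            .findFirst()
            .orElse(null);
    }
}

Note that wrapping the primary key column in rtrim() can keep Oracle from using the index on that column, so it trades convenience for a possible table scan.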
Why does your module get an unpadded value from other parts of the system?
My understanding is that if the other parts of the system don't alter the PK, they should read 6 chars from the db and pass 6 chars all along the way -- that would be OK. The only exception would be when a PK is generated, in which case it may need to be padded.
You can circumvent the problem (by trimming or padding the value each time it's necessary), but it won't solve the underlying problem that your PK is not handled consistently. To solve the problem upfront you must either
always receive 6 chars from the other parts of the module
use varchar2 to deal with dynamic size correctly
If you can't solve the problem upfront, then you will indeed need to either
add trimming/padding all over the place when necessary
add trimming/padding in the DAO if you have one (see the helper sketch after this list)
add trimming/padding in the user type if this works (suggestion from N. Hughes)
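For the trimming/padding options, a minimal helper sketch; the width of 6 comes from the CHAR(6) column in the question, and where to call it (DAO, user type, or call sites) depends on which option you pick above:

public final class UserIds {
    private static final int USER_ID_LENGTH = 6;   // width of the CHAR(6) column

    private UserIds() {
    }

    // Right-pads an unpadded user id with spaces to the fixed column width,
    // e.g. "abc" -> "abc   ", so em.find(Acceptance.class, UserIds.pad("abc")) matches.
    public static String pad(String rawUserId) {
        return String.format("%-" + USER_ID_LENGTH + "s", rawUserId);
    }

    // Strips the trailing spaces again before handing the id back to callers.
    public static String trim(String paddedUserId) {
        return paddedUserId == null ? null : paddedUserId.replaceAll("\\s+$", "");
    }
}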
