I called LOAD DATA INFILE from java.sql.Statement.executeUpdate(String sql) to load a UTF-8 CSV file into a table.
When I use
LOAD DATA INFILE '/var/lib/mysql-files/upload/utf8table.csv' INTO TABLE temp.utf8table CHARACTER SET utf8 FIELDS TERMINATED BY ';' LINES TERMINATED BY '\r\n' (@vC1, @vC2) SET C1=@vC1, C2=NULLIF(@vC2,'');
, without specifying CHARACTER SET utf8, non-ASCII characters were corrupted. But the same query imported all characters correctly when it was executed in MySQL Workbench. The query with the charset specified works correctly in both cases. What difference between the execution environments could have led to this behavior?
According to the docs:
The server uses the character set indicated by the character_set_database system variable to interpret the information in the file. SET NAMES and the setting of character_set_client do not affect interpretation of input. If the contents of the input file use a character set that differs from the default, it is usually preferable to specify the character set of the file by using the CHARACTER SET clause. A character set of binary specifies “no conversion.”
See also the character_set_client system variable. The default is latin1 if not specified.
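For reference, here is a minimal sketch of running the original LOAD DATA statement from JDBC with the CHARACTER SET clause included (the connection URL and credentials are placeholders, and MySQL Connector/J is assumed to be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadCsvExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/temp", "user", "password");
             Statement stmt = conn.createStatement()) {
            // Specifying CHARACTER SET utf8 explicitly makes the import independent of
            // whatever character_set_database happens to be set to on the server.
            stmt.executeUpdate(
                "LOAD DATA INFILE '/var/lib/mysql-files/upload/utf8table.csv' "
              + "INTO TABLE temp.utf8table CHARACTER SET utf8 "
              + "FIELDS TERMINATED BY ';' LINES TERMINATED BY '\\r\\n' "
              + "(@vC1, @vC2) SET C1 = @vC1, C2 = NULLIF(@vC2, '')");
        }
    }
}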
Related
Emoji characters are messing up a loading system we built and I'm looking for a simple short-term solution.
It's a Java loading program that uses JDBC to execute MySQL commands with this structure:
LOAD DATA
LOCAL INFILE `filepath`
REPLACE INTO TABLE `SOME_TABLE`
CHARACTER SET utf8
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\'' ESCAPED BY ''
LINES TERMINATED BY '\n'
(`col1`,...,`coln`)
SOME_TABLE has ENGINE=InnoDB DEFAULT CHARSET=utf8.
We are running MySQL 5.6.22.
It's been working great for years, but recently the files that we load started having occasional non-BMP characters (which happen to be emojis), and the LOAD DATA LOCAL INFILE ... command throws exceptions like:
java.sql.SQLException: Incorrect string value: '\xF0\x9D\x93\x9C' for column 'fieldm' at row 3004
I understand that the long-term solution is that we need to move the table to CHARSET=utf8mb4. However, the tables are huge at this point and conversion will not be easy. There are also indexed VARCHAR(255) fields, and these would need to be converted to VARCHAR(191) [to fit under the 767-byte maximum key length], or we would need to switch to the DYNAMIC row format and set innodb_large_prefix=true.
We are looking for a short-term solution until we get to a point where we have the time and resources to migrate to utf8mb4.
It would be OK, in the short term, to simply discard the rows with non-BMP (emoji) characters. But LOAD DATA LOCAL INFILE filepath REPLACE ... will not skip the bad rows; it fails the entire file.
At this point, it looks like we will need to write some filtering in Java to remove the non-BMP (emoji) rows before calling LOAD DATA LOCAL INFILE filepath REPLACE .... But I am thinking there must be some way to do this in MySQL without having to introduce that kind of pre-filter.
Does anybody have any ideas for a simple way to get MySQL to simply skip the rows that have non-BMP (emoji) data?
***** UPDATE *****
It looks like using CONVERT might be the short-term solution. Doing this replaces the emoji with '????' in col4.
LOAD DATA
LOCAL INFILE `filepath`
REPLACE INTO TABLE `SOME_TABLE`
CHARACTER SET utf8
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\'' ESCAPED BY ''
LINES TERMINATED BY '\n'
(`col1`,`col2`,`col3`,@q, ..., `coln`)
SET `col4` = CONVERT(CONVERT(@q USING utf8mb4) USING utf8);
Does anybody see a problem with that?
In order to store Emoji, you must use utf8mb4, not utf8 throughout.
A shortcut (perhaps) for the 191 index issue is to upgrade to 5.7. There, you can keep 255 and have indexes.
Only certain columns will be holding Emoji, correct? Convert just those columns. (It is OK for different columns in the same table to have different charset and/or collation.)
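As a sketch of that per-column route (SOME_TABLE and col4 are the placeholders already used above; the length and collation are illustrative assumptions and should be adjusted to the real column definition):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ConvertEmojiColumn {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");
             Statement stmt = conn.createStatement()) {
            // Convert only the column that actually receives emoji; the rest of
            // SOME_TABLE can stay utf8. Length and collation here are illustrative.
            stmt.executeUpdate(
                "ALTER TABLE SOME_TABLE MODIFY col4 VARCHAR(255) "
              + "CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci");
        }
    }
}

If col4 is part of an index, the 191/767-byte prefix limit discussed above still applies to it.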
I'm trying to set the connection charset for my Firebird connection using Pentaho DI, but I still can't read the data in the right encoding.
I have tried several parameters such as encoding, charSet, etc., but no luck.
Can you tell me what I have missed?
You either need to use encoding with the Firebird name of the character set, or charSet with the Java name of the character set(*).
WIN1256 is not a valid Java character set name, so the connection will fail. If you specify charSet, then you need to use the Java name Cp1256 or - with Jaybird 2.2.1 or newer - windows-1256.
If this doesn't work then either Pentaho is not correctly passing connection properties, or your data is stored in a column with character set NONE in a different encoding than WIN1256 (or worse: stored in a column with character set WIN1256, but the data is actually a different encoding).
*: Technically you can combine encoding and charSet, but it is only for special use cases where you want Firebird to read data in one character set, and have Jaybird interpret it in another character set.
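As a plain-JDBC sketch of the two options (host, database path, and credentials are placeholders; Pentaho would need to end up passing the equivalent connection properties):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class FirebirdCharsetExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("user", "SYSDBA");
        props.setProperty("password", "masterkey");

        // Either the Firebird character set name ...
        props.setProperty("encoding", "WIN1256");
        // ... or the Java character set name (Jaybird 2.2.1+ also accepts "windows-1256"):
        // props.setProperty("charSet", "Cp1256");

        // "mydb" is a placeholder database alias/path.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:firebirdsql://localhost:3050/mydb", props)) {
            System.out.println("Connected with an explicit connection character set");
        }
    }
}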
The Oracle Java Documentation states the following boast in its Tutorial introduction to character streams:
A program that uses character streams in place of byte streams automatically adapts to the local character set and is ready for internationalization — all without extra effort by the programmer.
(http://docs.oracle.com/javase/tutorial/essential/io/charstreams.html)
My question is concerned with the meaning of the word 'automatically' in this context. Elsewhere the documentation warns
Data in text files is automatically converted to Unicode when its encoding matches the default file encoding of the Java Virtual Machine.... If the default file encoding differs from the encoding of the text data you want to process, then you must perform the conversion yourself. You might need to do this when processing text from another country or computing platform.
(http://docs.oracle.com/javase/tutorial/i18n/text/convertintro.html)
Is 'the local character set' in the first quote analogous to 'the encoding of the text data you want to process' of the second quote? And if so, is the second quote not exploding the boast of the first - that you don't need to do any conversion unless you need to do a conversion?
In the context of the first tutorial you linked, I read "local character set" as meaning the JVM's default character set.
For example:
inputStream = new FileReader("xanadu.txt");
They are creating a FileReader, which does not allow you to specify a Charset, so the JVM's default charset will be used:
FileReader(String) calls
InputStreamReader(InputStream), which calls
StreamDecoder.forInputStreamReader(InputStream, Object, String), with null as the last parameter
So Charset.defaultCharset() is used as the Charset
If you wanted to use an explicit charset, you would write:
inputStream = new InputStreamReader(new FileInputStream("xanadu.txt"), charset);
No. The local character set is the character set (table of character values and respective codes) that the file uses, but the default text encoding is how the JVM interprets the characters (converts them into their character codes). They are linked and very similar, but not exactly the same.
Also, it says that the conversion happens "automatically" because that is the job of the character stream classes: they convert the characters in the text file being read into the character codes the machine works with, without extra code on your part.
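To make the difference concrete, here is a small sketch (xanadu.txt is the tutorial's placeholder file): the first reader depends on whatever Charset.defaultCharset() is on the machine running the program, while the second always decodes the file as UTF-8.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetReadExample {
    public static void main(String[] args) throws Exception {
        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // Decodes using the JVM default charset - fine only if the file was
        // written in that same encoding.
        try (BufferedReader in = new BufferedReader(new FileReader("xanadu.txt"))) {
            System.out.println(in.readLine());
        }

        // Decodes explicitly as UTF-8, independent of the platform default.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("xanadu.txt"), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}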
I am using jTDS to connect to a Sybase database, and non-ASCII character data is broken. This happens both in my own app and in SQuirreLSQL.
Where can I specify the character set to be used for the connection? And can I find out what that character set should be somewhere in the data dictionary?
You can set the charset property
charset (default - the character set the server was installed with)
Very important setting, determines the byte value to character mapping for CHAR/VARCHAR/TEXT values. Applies for characters from the extended set (codes 128-255). For NCHAR/NVARCHAR/NTEXT values doesn't have any effect since these are stored using Unicode.
Simply append ;<property>=<value> to your JDBC URL.
See the FAQ
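For example (host, port, database, and the charset value are placeholders; the charset must match the character set your Sybase server was actually installed with):

import java.sql.Connection;
import java.sql.DriverManager;

public class JtdsCharsetExample {
    public static void main(String[] args) throws Exception {
        // jTDS URL with the charset property appended; "cp850" is only an example value.
        String url = "jdbc:jtds:sybase://dbhost:5000/mydb;charset=cp850";
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("Connected with an explicit charset");
        }
    }
}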
I am parsing a bunch of XML files and inserting the values obtained from them into a MySQL database. The character set of the MySQL tables is set to utf8. I'm connecting to the database using the following connection URL - jdbc:mysql://localhost:3306/articles_data?useUnicode=false&characterEncoding=utf8
Most of the string values with Unicode characters are entered fine (like Greek letters etc.), except for some that have a math symbol. One example in particular: when I try to insert a string containing MATHEMATICAL SCRIPT CAPITAL G (image at www.ncbi.nlm.nih.gov/corehtml/pmc/pmcents/1D4A2.gif; see also http://graphemica.com/𝒢) while trying to parse and insert this article, I get the following exception -
java.sql.SQLException: Incorrect string value: '\xF0\x9D\x92\xA2 i...' for column 'text' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3515)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3447)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1951)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2101)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2554)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1761)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2046)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1964)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1949)
If I change my connection URL to - jdbc:mysql://localhost:3306/articles_data, then the insert works, but all regular UTF8 characters are replaced with a question mark.
There are two possible ways I'm trying to fix it, and I haven't succeeded with either yet:
When parsing the article, maintain the encoding. I'm using org.apache.xerces.parsers.DOMParser to parse the XML files, but I can't figure out how to prevent it from decoding (relevant XML - <p>𝒢 is a set containing...</p>). I could re-encode it, but that just seems inefficient.
Insert the math symbols into the database.
MySQL up to version 5.1 seems to only support Unicode characters in the Basic Multilingual Plane, which take no more than 3 bytes when encoded as UTF-8. From the manual on Unicode support in version 5.1:
MySQL 5.1 supports two character sets for storing Unicode data:
ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character
utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character
In version 5.5 some new character sets were added:
...
utf8mb4, a UTF-8 encoding of the Unicode character set using one to four bytes per character
ucs2 and utf8 support BMP characters. utf8mb4, utf16, and utf32 support BMP and supplementary characters.
So if you are on MySQL 5.1 you would first have to upgrade. In later versions you have to change the charset to utf8mb4 to work with these supplementary characters.
It seems the jdbc connector also requires some further configuration (From Connector/J Notes and Tips):
To use 4-byte UTF8 with Connector/J configure the MySQL server with character_set_server=utf8mb4. Connector/J will then use that setting as long as characterEncoding has not been set in the connection string. This is equivalent to autodetection of the character set.
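Put together, a hedged sketch of what that looks like (the table name is made up; the column name comes from the error message above; the server is assumed to run with character_set_server=utf8mb4 and the target column already converted to utf8mb4, and characterEncoding is deliberately left out of the URL):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class Utf8mb4InsertExample {
    public static void main(String[] args) throws Exception {
        // No characterEncoding parameter, per the Connector/J note above.
        String url = "jdbc:mysql://localhost:3306/articles_data";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO articles (text) VALUES (?)")) {
            // U+1D4A2 MATHEMATICAL SCRIPT CAPITAL G, written as a surrogate pair.
            ps.setString(1, "\uD835\uDCA2 is a set containing...");
            ps.executeUpdate();
        }
    }
}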