Java code page table

We transfer data from the mainframe to Linux/Windows servers using file-transfer software. To decrease the transfer time, we are using the ZipDataset class from the JZOS API Toolkit. There is an option to convert the data beforehand from EBCDIC to ASCII, and we can use all the encodings available in the Java environment:
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
The Java default is:
public static final String DEFAULT_TARGET_ENCODING = "ISO8859-1";
I need to find the documentation that defines the EBCDIC-to-ASCII character conversion, and to learn how it is possible to create a custom conversion table.
For example, for ISO8859-1, the EBCDIC character 'A' is x'C1' and the related ASCII character is x'41'.
Where can I find the documentation relating the other code pages?
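For reference, the JVM's built-in charset tables already contain exactly these mappings, so the 'A' example above can be reproduced directly. A minimal sketch (assuming the CP037 EBCDIC code page, which standard JDKs ship; a fully custom table would instead go through the java.nio.charset.spi.CharsetProvider SPI):

import java.io.UnsupportedEncodingException;

public class EbcdicDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Decode the EBCDIC byte x'C1' using code page 37 (US EBCDIC)
        String s = new String(new byte[] { (byte) 0xC1 }, "CP037");
        System.out.println(s); // prints: A
        // Re-encode the same character in ISO8859-1: x'41'
        byte[] ascii = s.getBytes("ISO8859-1");
        System.out.printf("x'%02X'%n", ascii[0]); // prints: x'41'
    }
}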

Related

Retaining special characters while reading from HTML in Java?

I am trying to read an HTML source file which contains German characters like ä ö ü ß €.
Reading using JSoup:
citAttr.nextElementSibling().text()
Encoding the string with:
unicodeEscaper.translate(citAttr.nextElementSibling().text())
org.apache.commons.lang3.text.translate.UnicodeEscaper
The issue is that after reading, the characters turn into �.
Whereas when reading a CSV encoded as UTF-8, saving and retrieving the characters with the same unicodeEscaper works fine:
unicodeEscaper.translate(record.get(headerPosition.get(0)))
What is the issue with reading from HTML? I also tried the StringEscapeUtils methods, but the characters still turn into �.
private String getText(Part p) throws MessagingException, IOException {
    if (p.isMimeType("text/*")) {
        String s = (String) p.getContent();
        textIsHtml = p.isMimeType("text/html");
        return s;
    }
    // ... (handling of multipart content omitted)
    return null;
}
This is how I am reading email which has HTML content!
I just answered a similar question today... I guess I can just type what I know about extended character sets (foreign-language characters), since that's one of the major facets of the software I write.
Java's internal Strings all use 16-bit chars (the primitive type char is a 16-bit value). In other words, Java Strings are UTF-16 internally; UTF-8 only comes into play when converting to or from bytes. This means that Java (and Java Strings) have no problem representing the entire range of Unicode foreign-language alphabets.
JSoup, and just about any HTML tool written in Java, will return downloaded web pages as 16-bit characters, as Java Strings, just fine, without any problems. If there are problems viewing these ranges, it is likely not the download process, nor a JSoup or HttpURLConnection setting. When you save a web page to a String in Java, you haven't lost those characters; you essentially get Unicode "for free."
HOWEVER: whenever a programmer saves such a String to a '.txt' file or an '.html' file and then views that content (that file) in a web browser, all you might see is that annoying question mark: �. This is because you need to make sure to let your web browser know that the '.html' file you saved using Java is not intended to be interpreted using the (much older, much shorter) 8-bit ASCII range.
If you view an '.html' File in any web-browser, or upload that file to Google Cloud Platform (or some hosting site), you must do one of two things:
Include the <META> Tag mentioned in the comments: <meta charset="UTF-8"> in the HTML Page's <HEAD> ... </HEAD> section.
Or provide the setting in whatever hosting platform you use to identify the file as 'text/html; charset=UTF-8'. In Google Cloud Platform Storage Buckets there is a popup menu to assign this setting to any file.
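As an illustration of the saving pitfall above, a minimal sketch (file name and content are made up) that writes a String as explicit UTF-8 bytes and embeds the <meta> tag:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SaveHtml {
    public static void main(String[] args) throws Exception {
        String html = "<html><head><meta charset=\"UTF-8\"></head>"
                    + "<body>ä ö ü ß €</body></html>";
        // Encode explicitly as UTF-8; relying on the platform default
        // charset is what usually produces the ? glyphs in the browser
        Files.write(Paths.get("page.html"), html.getBytes(StandardCharsets.UTF_8));
    }
}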

Java unable to read Chinese characters from Db2 database

I am trying to read Chinese characters from a Db2 database in a Java application.
The DB2 database XDSN3T is configured so that with the DB2 CLP the data are displayed correctly, and from another Delphi application the Chinese data are also correct.
To obtain this I set:
- Regional and language options, Advanced, language for non-Unicode programs --> Chinese (PRC)
- environment variable DB2CODEPAGE = 1252
Only Java is unable to display the data correctly --> ÃæÁÏ¡¢¸¨ÁÏ¡¢¸½¼þ
Maybe it is something related to JDBC?
When you open the connection you can define the encoding; I am not sure whether it is available for Chinese, but here is an example (MySQL syntax):
Connection con = DriverManager.getConnection("jdbc:mysql://examplehost:8888/dbname?useUnicode=yes&characterEncoding=UTF-8","user", "pass");
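For Db2 itself, the connection would look roughly like the sketch below (host, port, table, and column names are hypothetical; the IBM JCC type 4 driver is assumed to be on the classpath). Note that the JCC driver exchanges character data as Unicode, so garbled output like the above often points at the console or display encoding rather than the fetch itself:

import java.io.PrintStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Db2ChineseRead {
    public static void main(String[] args) throws Exception {
        // Print through an explicit UTF-8 stream so the console
        // encoding cannot mangle the fetched characters
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://examplehost:50000/XDSN3T", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT descr FROM items")) {
            while (rs.next()) {
                out.println(rs.getString(1)); // a UTF-16 Java String
            }
        }
    }
}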
As has been said, the encoding might be an issue; characters in Java are stored using the UTF-16 encoding, which has some issues of its own regarding the encoding of Chinese characters (and some emoji).
You can find the character list for UTF-16 here: https://www.fileformat.info/info/charset/UTF-16/list.htm
The issue with UTF-16 arises when a character cannot be encoded in a single 16-bit unit; such characters are encoded using two 16-bit units, which is called a surrogate pair. See: https://docs.oracle.com/javase/6/docs/api/java/lang/Character.html#unicode
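To make the surrogate-pair point concrete, a small sketch (the sample characters are arbitrary):

public class SurrogateDemo {
    public static void main(String[] args) {
        String bmp = "中";                  // U+4E2D fits in one 16-bit char
        String supp = "\uD841\uDF0E";       // U+2070E needs a surrogate pair
        System.out.println(bmp.length());                          // 1
        System.out.println(supp.length());                         // 2 chars
        System.out.println(supp.codePointCount(0, supp.length())); // 1 code point
    }
}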
Sorry I cannot provide a complete answer, but I hope this will help

Is 'the local character set' the same as 'the encoding of the text data you want to process'?

The Oracle Java Documentation states the following boast in its Tutorial introduction to character streams:
A program that uses character streams in place of byte streams automatically adapts to the local character set and is ready for internationalization — all without extra effort by the programmer.
(http://docs.oracle.com/javase/tutorial/essential/io/charstreams.html)
My question is concerned with the meaning of the word 'automatically' in this context. Elsewhere the documentation warns
Data in text files is automatically converted to Unicode when its encoding matches the default file encoding of the Java Virtual Machine.... If the default file encoding differs from the encoding of the text data you want to process, then you must perform the conversion yourself. You might need to do this when processing text from another country or computing platform.
(http://docs.oracle.com/javase/tutorial/i18n/text/convertintro.html)
Is 'the local character set' in the first quote analogous to 'the encoding of the text data you want to process' of the second quote? And if so, is the second quote not exploding the boast of the first - that you don't need to do any conversion unless you need to do a conversion?
In the context of the first tutorial you linked, I read "local character set" to mean the JVM's default character set.
For example:
inputStream = new FileReader("xanadu.txt");
They are creating a FileReader, which does not allow you to specify a Charset, so the JVM's default charset will be used:
- FileReader(String) calls
- InputStreamReader(InputStream), which calls
- StreamDecoder.forInputStreamReader(InputStream, Object, String), with null as the last parameter,
so Charset.defaultCharset() is used as the Charset.
If you wanted to use an explicit charset, you would write:
inputStream = new InputStreamReader(new FileInputStream("xanadu.txt"), charset);
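Since Java 7 there is also a convenience method that makes the charset explicit at the call site; a small sketch using the same file from the tutorial:

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadWithCharset {
    public static void main(String[] args) throws Exception {
        // Reads xanadu.txt as UTF-8 regardless of the JVM default charset
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("xanadu.txt"), StandardCharsets.UTF_8)) {
            System.out.println(reader.readLine());
        }
    }
}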
No. The local character set is the character set (the table of characters and their respective codes) that the file itself uses, while the default encoding is the one the JVM uses to interpret characters (to convert them to and from their codes). They are linked and very similar, but not exactly the same.
Also, the tutorial says the conversion happens "automatically" because that is a function of the JVM: it converts between the bytes in a text file and the machine-readable character codes without any explicit code from you.
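A concrete way to see the distinction is to write a file in one encoding and read it back assuming another; a minimal sketch (the file name is made up):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CharsetMismatch {
    public static void main(String[] args) throws Exception {
        Files.write(Paths.get("demo.txt"), "café".getBytes(StandardCharsets.UTF_8));
        byte[] raw = Files.readAllBytes(Paths.get("demo.txt"));
        // Decoding the UTF-8 bytes as ISO-8859-1 yields mojibake: cafÃ©
        System.out.println(new String(raw, StandardCharsets.ISO_8859_1));
    }
}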

Lucene encoding

I have questions about encoding in Lucene (Java).
How does Lucene handle encoding? What is the default, and how can I set it?
Or does the encoding not matter to Lucene, and is it just a question of how I add a string to a document (Java code below) in the indexing phase, and then in searching the index?
In other words, do I have to worry about whether the input text is in UTF-8 and the queries are also in UTF-8?
Document doc = new Document();
doc.add(new TextField(tagName, object.getName(), Field.Store.YES));
Thanks for any help
Lucene stores terms in UTF-8 (see Lucene's BytesRef class).
Java internally stores everything in UTF-16 (Java's String is UTF-16). Lucene's BytesRef therefore provides a constructor that converts UTF-16 to UTF-8, so Java's String can be used without any issues.
For example, TextField what you have used in your code uses String for Field value.
If you have some other type of Field which takes byte[] then you need to make sure they are UTF8 bytes.
While querying, Lucene will always give you UTF-8 bytes, but you can convert them to a Java String with a method provided in the same class. You can always interpret these bytes in other character sets.
You have to take care of character encoding yourself: as long as you get the characters right in Java's String, you should be fine. For example, if the data you are indexing comes from an XML file in a different character set, or is read from a DB in a different character set, you will have to make sure you can read those data sources properly in the JVM used for indexing.
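A small sketch of the String-to-BytesRef round trip described above (the Lucene core jar is assumed on the classpath):

import org.apache.lucene.util.BytesRef;

public class BytesRefDemo {
    public static void main(String[] args) {
        // BytesRef(CharSequence) encodes the UTF-16 String as UTF-8 bytes
        BytesRef ref = new BytesRef("přístup");
        // utf8ToString() decodes the UTF-8 bytes back into a Java String
        System.out.println(ref.utf8ToString());
    }
}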

UTF-8 to EBCDIC in Java

Our requirement is to send EBCDIC text to a mainframe. We have some Chinese characters, hence the UTF-8 format.
So, is there a way to convert the UTF-8 characters to EBCDIC?
Thanks,
Raj Mohan
Assuming your target system is an IBM mainframe or midrange, it has full support for all of the EBCDIC encodings built into its JVM as encodings named CPxxxx, corresponding to the IBM CCSIDs (CP stands for code page). You will need to do the translations on the host side, since the client side will not have the necessary encoding support.
Since Unicode supports every known character while each EBCDIC code page covers only a subset, you will likely be targeting multiple EBCDIC encodings, so you will probably need to configure those encodings in some way. Try to keep your client Unicode-only (UTF-8, UTF-16, etc.), with the translations being done as data arrives on and/or leaves the host system.
Other than needing to do the translations host-side, the mechanics are the same as any Java encoding conversion, e.g. new String(bytes, encoding) and String.getBytes(encoding), and the various NIO and reader/writer classes. There is really no magic; it is no different from translating between, say, ISO 8859-x and Unicode, or any other SBCS (or limited DBCS).
For example:
byte[] ebcdta = "Hello World".getBytes("CP037"); // get bytes for EBCDIC code page 37
You can find more information on IBM's documentation website.
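Putting that together for the original question, a hedged sketch (the Chinese EBCDIC code page name varies by JVM; Cp935 is assumed here and may only be present on IBM JVMs or on JDKs shipping the extended charsets):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8ToEbcdic {
    public static void main(String[] args) {
        // Stand-in for the UTF-8 bytes received from the client
        byte[] utf8 = "中文".getBytes(StandardCharsets.UTF_8);
        // Decode UTF-8 into Java's internal UTF-16 String...
        String text = new String(utf8, StandardCharsets.UTF_8);
        // ...then encode to the target EBCDIC code page, if this JVM has it
        if (Charset.isSupported("Cp935")) {
            byte[] ebcdic = text.getBytes(Charset.forName("Cp935"));
            System.out.println(ebcdic.length + " EBCDIC bytes");
        }
    }
}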
EBCDIC has many 8-bit code pages, and many of them are supported by the VM. Have a look at Charset.availableCharsets().keySet(); the EBCDIC pages are named IBM... (there are aliases such as cp500 for IBM500, as you can see via Charset.forName("IBM500").aliases()).
There are two problems:
- if your text contains characters from different EBCDIC code pages, this will not help;
- I am not sure whether these charsets are available in every VM outside Windows.
For the first, have a look at this approach. For the second, have a try on the desired target runtime ;-)
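To check what a given VM actually ships, as suggested above, a quick sketch that lists the installed IBM/EBCDIC charsets:

import java.nio.charset.Charset;

public class ListEbcdic {
    public static void main(String[] args) {
        // Print every installed charset whose canonical name starts with IBM
        Charset.availableCharsets().keySet().stream()
               .filter(name -> name.startsWith("IBM") || name.startsWith("x-IBM"))
               .forEach(System.out::println);
    }
}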
You can always make use of the IBM Toolbox for Java (JTOpen), specifically the com.ibm.as400.access.AS400Text class in the jt400.jar.
It goes as follows:
int codePageNumber = 420;   // IBM CCSID for Arabic EBCDIC
String codePage = "CP420";  // the matching Java charset name
String sourceUtfText = "أحمد يوسف صالح";
// Convert the Unicode String into EBCDIC bytes for the given CCSID
AS400Text converter = new AS400Text(sourceUtfText.length(), codePageNumber);
byte[] bytesData = converter.toBytes(sourceUtfText);
// Decode the EBCDIC bytes back into a Java String to verify the round trip
String resultedEbcdicText = new String(bytesData, codePage);
I used code page 420 and its corresponding Java encoding name CP420; this code page is used for Arabic text, so you should pick the code page suitable for Chinese text.
For the midrange AS/400 (IBM i these days), the best bet is to use the IBM Toolbox for Java (jt400.jar), which does all these things transparently (perhaps with a few hints).
Please note that inside Java a character is a 16-bit value, not UTF-8 (which is an encoding).
