Is it possible to determine whether data is in English or Chinese?
This is possible, for example, using statistical methods. The English language has a very distinctive distribution of which characters appear at all, and a very distinctive distribution of which characters tend to follow a given character (that would be called a level-1 model).
If 'e' is the most common symbol, it is very unlikely that the language is not something of European origin.
It may also be possible rather trivially (but maybe not 100% reliably) to make such a distinction by looking at Unicode character values (converting between character sets if necessary). If there are characters with a Unicode value greater than 127, English is somewhat unlikely (note that there are symbols like € though).
If there are many characters with Unicode values in the thousands, East Asian languages become more and more likely; code points above 65535 fall in the supplementary planes, which hold the rarer CJK ideograph extensions (alongside emoji and other symbols), so they are a strong hint rather than a guarantee of Chinese.
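For example, a rough sketch of that code-point heuristic in Java (the choice of block and the 10% threshold are purely illustrative, not a rigorous classifier):
// Count code points in the CJK Unified Ideographs block; if a noticeable
// fraction of the text consists of them, the text is probably Chinese.
static boolean looksChinese(String text) {
    long cjk = text.codePoints()
            .filter(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS)
            .count();
    return cjk > 0 && cjk * 10 >= text.codePoints().count();   // arbitrary 10% cutoff
}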
My idea is to calculate the average position of the characters in the Unicode table. Since Chinese characters are located above the ASCII range (i.e. above value 127), you could easily determine whether the text is English or Chinese.
edit: Basically the same thing Damon added. >_>
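For what it's worth, a quick sketch of that averaging idea (the cutoff used here is just an illustrative value, well above anything Latin):
// Average code point of the text: plain English stays far below 128,
// while Chinese text averages in the tens of thousands.
static boolean looksChineseByAverage(String text) {
    double avg = text.codePoints().average().orElse(0);
    return avg > 0x2E80;   // start of the CJK-related blocks; illustrative cutoff
}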
How to display Wingdings 2 symbols in Java? I tried googling but I couldn't find much help.
http://www.alanwood.net/demos/wingdings-2.html
Please refer to the above link. I would like to display the white diamond symbol.
Wingdings is a font-based trick
The old Wingdings was a hack: a font that used alternate designs as its glyphs. For example, where a NUMBER SIGN (pound sign, hash mark) was expected at character position 35, a fountain pen image appeared instead.
For this hack to succeed, (a) the user must have the desired Wingdings font installed, and (b) the text being displayed must use that font specifically.
Unicode
Nowadays, it likely makes sense to use emoji or other image characters defined among the 143,859 characters in Unicode. Each of those characters is assigned a number, a code point, drawn from a range of roughly a million values.
Perhaps this character would work for you: ◇ Unicode Character 'WHITE DIAMOND' (U+25C7) at decimal code point 9,671.
System.out.println( "◇ = WHITE DIAMOND" ) ;
Your user needs a font, any font, that provides a glyph for that particular character. Modern OSes are skilled at automatically finding and using a secondary font with such a glyph if a displayed text block’s primary font is lacking. Understand that no single font provides a glyph for each and every character in Unicode.
There are likely other diamond-related characters too that a search might expose.
As a Java programmer, you can simply paste your desired character into your source code. Be sure to use UTF-8 as the character encoding of your file.
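If pasting the literal character is inconvenient, the same code point can also be written as a Unicode escape or built from its numeric value; for example, these print the same character:
System.out.println( "\u25C7 = WHITE DIAMOND" ) ;  // Unicode escape for ◇
System.out.println( new String( Character.toChars( 0x25C7 ) ) + " = WHITE DIAMOND" ) ;  // from the code point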
Many places on SO lead to the JLS section on Identifiers, but I have a question on what's written there.
The "Java letters" include uppercase and lowercase ASCII Latin letters
A-Z (\u0041-\u005a), and a-z (\u0061-\u007a), and, for historical
reasons, the ASCII underscore (_, or \u005f) and dollar sign ($, or
\u0024). The $ character should be used only in mechanically generated
source code or, rarely, to access pre-existing names on legacy
systems. The "Java digits" include the ASCII digits 0-9
(\u0030-\u0039).
But it goes on to say:
Letters and digits may be drawn from the entire Unicode character set,
which supports most writing scripts in use in the world today,
including the large sets for Chinese, Japanese, and Korean. This
allows programmers to use identifiers in their programs that are
written in their native languages.
I don't understand how these can both be true. The first section seems to dictate exactly which characters are allowed whereas the second section seems to say that the allowance is much more flexible.
I agree that usage of "includes" instead of "includes but is not limited to" shows that it doesn't exactly contradict. But it also first refers specifically to "Java letters"/"Java digits" and then relaxes this to just "letters"/"digits". My main point is lack of clarity and I wanted confirmation on what I assumed it meant.
As per the question Legal identifiers in Java, you can see that many characters are legal in identifiers.
[For languages using the Roman alphabet] only alphanumeric characters and occasionally underscores are used when naming identifiers by convention. However, a vast array of characters can be used.
The first paragraph refers to the coding style, or convention, among Java programmers of using a reasonably consistent and readable naming scheme. The second paragraph you've quoted explains that there is a vast array of other characters which the JVM will accept, although your fellow programmers may disapprove.
The first section is a special case of the second, and the characters mentioned in both sections have to satisfy the criteria in JLS 3.8 that are omitted here:
A "Java letter" is a character for which the method Character.isJavaIdentifierStart(int) returns true.
A "Java letter-or-digit" is a character for which the method
Character.isJavaIdentifierPart(int) returns true.
The above methods accept the code points corresponding to characters from the entire Unicode character set (the second quoted section), which includes the Basic Latin character set (the first quoted section).
In practice, though, you will rarely see anybody going beyond the Basic Latin character set in their Java source files.
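A quick way to see what this means in practice is to ask those two methods directly; a small sketch (the specific characters are just illustrative):
System.out.println( Character.isJavaIdentifierStart( 'π' ) );   // true: Greek letter
System.out.println( Character.isJavaIdentifierStart( '变' ) );  // true: CJK ideograph
System.out.println( Character.isJavaIdentifierPart( '1' ) );    // true: digits may appear after the first character
System.out.println( Character.isJavaIdentifierStart( '1' ) );   // false: digits cannot start an identifier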
I am having some trouble with encoding this string into barcode symbology - Code 128.
Text to encode:
1021448642241082212700794828592311
I am using the universal encoder from idautomation.com:
https://www.bcgen.com/fontencoder/
I get the following output for the encoded text for Code 128:
Í*5LvJ8*r5;ÂoP<[7+.Î
However, in ";Âo" the character between the semicolon and the o (let us call it the "special A") is not part of the extended character set used in Code 128. (See the Latin Supplements at https://www.fonts2u.com/code-128.font)
Yet the same string shows a valid barcode at
https://www.bcgen.com/linear-barcode-creator.html
How?
If I use the output with the Special A on a webpage with a font face for barcodes, the special A character does not show up as the barcode (and that seems correct since the special A is not part of the character set).
What gives? Please help.
I am using the IDAutomation utility to encode the string to 128c symbology. If you can share code to do the encoding (in Java/Python/C/Perl) that would help too.
There are multiple fonts for Code128 that may use different characters to represent the barcode symbols. Make sure the font and the encoding logic match each other.
I used this one: http://www.jtbarton.com/Barcodes/Code128.aspx (there is also sample code on the site showing how to encode it, but you have to translate it from VB). The font works for all three encodings (A, B and C).
Sorry, this is very late.
When you are dealing with the encoding of code 128, in any subset, it's a good idea to think of that coding in terms of numbers, not characters. At this level, when you have shifts, code-changes, checksums and stuff, intermixed with the data, the whole concept of "character" is lost.
However, this is what is happening:
The semicolon in the output corresponds to "27"
The lowercase o corresponds to "79" and the P to "48"
The "Â" (your "special A") corresponds to the "00" pair. This is why you should be dealing with numbers, not characters, at this level of encoding.
How would you expect it to show a character with a code of 00? That would be a space or NUL, neither of which is particularly visible.
Your software has simply rendered it the best way it can, which is to substitute a visible character for the value-0 symbol; in this font's mapping, that substitute is the "Â" you see.
The rest (indeed all) of your encoded string looks correct for a Set C encodation.
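Since the question also asked for code, here is a rough sketch in Java of subset-C encoding for an even-length digit string. The value-to-character mapping at the end follows one common font convention (value + 32 for values 0-94, value + 100 for 95-106); your particular font may differ, and as discussed above, the IDAutomation output substitutes "Â" (194) for the space that value 0 would otherwise produce.
// Rough sketch of Code 128, subset C, for an even-length digit string.
static String encodeCode128C(String digits) {
    java.util.List<Integer> values = new java.util.ArrayList<>();
    values.add(105);                                              // Start C
    for (int i = 0; i < digits.length(); i += 2) {
        values.add(Integer.parseInt(digits.substring(i, i + 2))); // each digit pair is one symbol
    }
    int checksum = values.get(0);
    for (int i = 1; i < values.size(); i++) {
        checksum += i * values.get(i);                            // weighted sum, weights 1..n
    }
    values.add(checksum % 103);                                   // checksum symbol
    values.add(106);                                              // Stop
    StringBuilder out = new StringBuilder();
    for (int v : values) {
        out.append((char) (v < 95 ? v + 32 : v + 100));           // common font mapping; check your font
    }
    return out.toString();
}
With the input from the question, this produces the same sequence as the IDAutomation output, except that value 0 comes out as a plain space rather than "Â".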
Since Java holds characters internally in UTF-16, what if you need to output in a certain encoding that includes characters that are not in Unicode at all?
Java can only handle characters which are present in Unicode, basically. Text outside the BMP (i.e. above U+FFFF) is encoded as surrogate pairs (as each char is a UTF-16 code unit)... but if you want characters which aren't in Unicode at all, you're on your own - you could probably find some area of Unicode which is reserved for private use, and map the characters there... but you may well have "fun" in all kinds of odd ways.
Do you definitely need to handle characters which aren't in Unicode? I thought it covered almost everything these days...
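If you do end up needing such characters, the private-use-area workaround might look roughly like this; the placeholder code point U+E000 and the target byte 0x7F are made-up examples, not taken from any real encoding:
// Represent the character that has no Unicode mapping as a Private Use Area
// code point internally, then translate it by hand on output.
static final char LEGACY_GLYPH = '\uE000';           // hypothetical placeholder

static byte[] encodeWithLegacyGlyph(String s) {
    java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
    for (char c : s.toCharArray()) {
        if (c == LEGACY_GLYPH) {
            out.write(0x7F);                          // whatever byte the target encoding expects
        } else {
            byte[] b = String.valueOf(c).getBytes(java.nio.charset.StandardCharsets.US_ASCII);
            out.write(b, 0, b.length);                // ordinary characters go through a normal charset
        }
    }
    return out.toByteArray();
}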
I have a snippet of code that looks like this:
double Δt = lastPollTime - pollTime;
double α = 1 - Math.exp(-Δt / τ);
average += α * (x - average);
Just how bad an idea is it to use unicode characters in Java identifiers? Or is this perfectly acceptable?
It's a bad idea, for various reasons.
Many people's keyboards do not support these characters. If I were to maintain that code on a qwerty keyboard (or any other without Greek letters), I'd have to copy and paste those characters all the time.
Some people's editors or terminals might not display these characters properly. For example, some editors (unfortunately) still default to some ISO-8859 (Latin) variant. The main reason why ASCII is still so prevalent is that it nearly always works.
Even if the characters can be rendered properly, they may cause confusion. Straight from Sun (emphasis mine):
Identifiers that have the same external appearance may yet be different. For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A (A, \u0041), LATIN SMALL LETTER A (a, \u0061), GREEK CAPITAL LETTER ALPHA (A, \u0391), CYRILLIC SMALL LETTER A (a, \u0430) and MATHEMATICAL BOLD ITALIC SMALL A (a, \ud835\udc82) are all different.
...
Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) when sorting, but these are different in identifiers.
This is in no way an imaginary problem: α (U+03b1 GREEK SMALL LETTER ALPHA) and ⍺ (U+237a APL FUNCTIONAL SYMBOL ALPHA) are different characters!
There is no way to tell which characters are valid. The characters from your code work, but when I use the FUNCTIONAL SYMBOL ALPHA my Java compiler complains about "illegal character: \9082". Even though the functional symbol would be more appropriate in this code. There seems to be no solid rule about which characters are acceptable, except asking Character.isJavaIdentifierPart().
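You can check that particular distinction directly with the method just mentioned; a quick sketch using the code points quoted above:
System.out.println( Character.isJavaIdentifierPart( 0x03B1 ) );  // true:  GREEK SMALL LETTER ALPHA is a letter
System.out.println( Character.isJavaIdentifierPart( 0x237A ) );  // false: APL FUNCTIONAL SYMBOL ALPHA is a symbol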
Even though you may get it to compile, it seems doubtful that all Java virtual machine implementations have been rigorously tested with Unicode identifiers. If these characters are only used for variables in method scope, they should get compiled away, but if they are class members, they will end up in the .class file as well, possibly breaking your program on buggy JVM implementations.
It looks good since it uses the correct symbols, but how many of your team will know the keystrokes for those symbols?
I would use an English representation just to make it easier to type. And others might not have a character set that supports those symbols set up on their PC.
That code is fine to read, but horrible to maintain - I suggest using plain English identifiers like so:
double deltaTime = lastPollTime - pollTime;
double alpha = 1 - Math.exp(-deltaTime / tau);
It is perfectly acceptable if it is acceptable in your working group. A lot of the answers here operate on the arrogant assumption that everybody programs in English. Non-English programmers are by no means rare these days and they're getting less rare at an accelerating rate. Why should they restrict themselves to English versions when they have a perfectly good language at their disposal?
Anglophone arrogance aside, there are other legitimate reasons for using non-English identifiers. If you're writing mathematics packages, for example, using Greek is fine if your target is fellow mathematicians. Why should people type out "delta" in your workgroup when everybody can understand "Δ" and likely type it more quickly? Almost any problem domain will have its own jargon and sometimes that jargon is expressed in something other than the Latin alphabet. Why on Earth would you want to try and jam everything into ASCII?
It's an excellent idea. Honest. It's just not easily practicable at this time. Let's keep a reference to it for the future. I would love to see triangles, circles, squares, etc. as part of program code. But for now, please do try to re-write it the way Crozin suggests.
Why not?
If the people working on that code can type those easily, it's acceptable.
But God help those who can't display Unicode, or who can't type those characters.
In a perfect world, this would be the recommended way.
Unfortunately you run into character encodings when moving outside of plain 7-bit ASCII characters (UTF-8 is different from ISO-Latin-1, which is different from UTF-16, etc.), meaning that you will eventually run into problems. This happened to me when moving from Windows to Linux. Our national Scandinavian characters broke in the process, but fortunately only in strings. We then used \u escapes for all of those.
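For example, the escaped form keeps the source file pure ASCII, so no encoding mismatch can corrupt it (the word here is just an illustrative Scandinavian string):
String berries = "bl\u00e5b\u00e6r";   // "blåbær", written so the file survives any encoding
System.out.println( berries );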
If you can be absolutely certain that you will never, ever run into such a thing - for instance if your files contain a proper BOM - then by all means, do this. It will make your code more readable. If there is even the smallest amount of doubt, then don't.
(Please note that "use non-English languages" is a different matter. I'm just thinking of using symbols instead of letters.)