Japanese Characters and strings read from Scanner/console - java

I was wondering if I can print out a string with Japanese characters. I stopped a mini-project that was, at first, out of my league. But as my skills and curiosity about high-level languages improved, I stumbled across my old project. Even with breaks from coding, I still wondered if it was possible. This isn't my project by any stretch (in fact, if the example given is non-applicable to programming, I'll feel stupid for the mere attempt):
public static void main(String[] args) {
    // TODO code application logic here
    // Example:
    System.out.println("Input English String Here... ");
    Scanner english = new Scanner(System.in);
    String English = english.next();
    System.out.println("今、漢字に入ります。 ");
    Scanner japanese = new Scanner(System.in);
    String Japanese = japanese.next();
    System.out.println("Did it work...? ");
    System.out.println(English);
    System.out.println(Japanese);
}
run:
Input English String Here...
Good
今、漢字に入ります。
いい
Did it work...?
Good
??
I expect to see いい on the last line of output.

The most likely explanation for getting ?? instead of いい is that there is a mismatch between the character encoding that is being delivered by your computer's input system, and the default Java character encoding determined by the JVM.
Assuming that the input is UTF-8 encoded, then a more reliable way to configure the scanner is new Scanner(System.in, "UTF-8").
Also note that it is not necessary to create multiple scanner objects. You can ... and should ... create one and reuse it. It probably will not matter if the input is genuinely interactive, but if there is any possibility that input could be piped to the program, you could find that the first Scanner gobbles up input that should go to the second Scanner.
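Putting both points together, a minimal reworking of the program might look like this (a sketch, assuming the console really delivers UTF-8; the class name is just for illustration):

import java.util.Scanner;

public class JapaneseInput {
    public static void main(String[] args) {
        // One Scanner for the whole program, with an explicit charset
        Scanner in = new Scanner(System.in, "UTF-8");

        System.out.println("Input English String Here... ");
        String english = in.next();

        System.out.println("今、漢字に入ります。 ");
        String japanese = in.next();

        System.out.println("Did it work...? ");
        System.out.println(english);
        System.out.println(japanese);
    }
}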

If you are using Eclipse, you can change the default character encoding under Run -> Run Configurations -> Common.
Also, it would be better to use new Scanner(System.in, StandardCharsets.UTF_8.displayName()) instead of hard-coding a string value.
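For example (a sketch; the class name is just for illustration, and the Charset overload shown in the comment requires Java 10 or later):

import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class Utf8ScannerDemo {
    public static void main(String[] args) {
        // Avoids a hard-coded "UTF-8" string literal
        Scanner in = new Scanner(System.in, StandardCharsets.UTF_8.displayName());
        // On Java 10+ you can pass the Charset object directly:
        // Scanner in = new Scanner(System.in, StandardCharsets.UTF_8);
        System.out.println(in.nextLine());
    }
}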
Here is a link to another topic about changing the default encoding for NetBeans:
How to change file encoding in NetBeans?

Support for Japanese in fonts is spotty, and different between AWT and Swing components. Those funny blobs probably mean you are using a font/component combination that doesn't have Japanese glyphs.
Another possibility: if you've been manipulating the characters of the string, by passing them through byte arrays or integers, it's easy to accidentally lose the high-order bits. There are several deprecated APIs because of this hazard.
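A quick sketch of the kind of loss that second paragraph describes (the class name is illustrative): casting a char to a byte silently keeps only the low 8 bits, whereas encoding and decoding through an explicit charset round-trips correctly.

import java.nio.charset.StandardCharsets;

public class HighBitLossDemo {
    public static void main(String[] args) {
        char kana = 'い';                  // U+3044
        byte lossy = (byte) kana;          // keeps only the low 8 bits: 0x44
        System.out.println((char) lossy);  // prints D, not い

        // The safe round trip: encode and decode with the same charset
        byte[] utf8 = String.valueOf(kana).getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8)); // prints い
    }
}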

Related

(Java) Trying to read a txt file and count the number of occurrences for each word

I am supposed to write a program that reads a file called mobydick.txt. The file contains the entire text of the book Moby Dick. The mobydick.txt file looks like this
I have to read the file, display every unique word in the file and then display the number of occurrences of each unique word.
The output should look like:
WORD Number
the 43
whale 12
boat 93
This is my code so far:
import java.util.*;
import java.io.*;

public class Main
{
    public static void main(String[] args) throws IOException
    {
        // Create input stream & scanner
        FileInputStream fin = new FileInputStream("mobydick.txt");
        Scanner fileInput = new Scanner(fin);
        // Create ArrayLists
        ArrayList<String> words = new ArrayList<String>();
        ArrayList<Integer> count = new ArrayList<Integer>();
        // Read through file and find the words
        while (fileInput.hasNext())
        {
            // Get next word
            String nextWord = fileInput.next();
            // Determine if the word is in the arraylist
            if (words.contains(nextWord))
            {
                int index = words.indexOf(nextWord);
                count.set(index, count.get(index) + 1);
            }
            else
            {
                words.add(nextWord);
                count.add(1);
            }
        }
        // Close
        fileInput.close();
        fin.close();
        System.out.println("WORDS COUNT");
        // Print out the results
        for (int i = 0; i < words.size(); i++)
        {
            System.out.print(words.get(i) + " " + count.get(i) + "\n");
        }
    }
}
However, when I run this code I get a strange looking output.
It's strange because if I run the same code for a smaller and simpler text file like this, the output looks exactly like I want it to.
What am I doing wrong with the mobydick.txt?
Just look at the text input file. It contains, for example, ago-never. Computer tools for programmers tend to be extremely stupid, because we programmers need them to be extremely simple. Scanner splits on whitespace. Period. A - is not whitespace, so Scanner is dutifully giving you ago-never as a single token. If the book contains Cosmic said: "Sheesh, this coding stuff is hard, man!"., then these are the tokens that Scanner is going to give you:
Cosmic
said:
"Sheesh,
this
coding
stuff
is
hard,
man!".
Which is obviously not what you wanted. You wanted for example man. Not man!"..
A second issue is that text files are files, and therefore bags-o-bytes. Bytes aren't characters. So when you turn your file into a scanner, you're implicitly asking the computer to take a wild stab at how to do that, and a wild stab it will take: it will use the 'platform default encoding', which is Java-ese for 'never what you want'. There is no easy answer here. Somebody needs to investigate or tell you what the encoding is. It's probably UTF-8, in which case you gotta tell Java about that:
new Scanner(fin, "UTF-8")
You didn't, so Java picked the 'platform default encoding', which is some arbitrary and generally wrong choice, and thus something like 'Häagen-Dazs' messes up - only the most basic characters tend to survive conversion with the wrong charset encoding.
As to how to solve that first problem, possibly all you really need is to tell Scanner that you want the 'thing that is between tokens' to be 'any amount of non-letters'. The delimiter is a regexp, which is presumably a concept you haven't been taught yet; it's quite complicated. The regexp \W+ represents the notion of "1 or more 'non-word' characters", and using that as the separator would mean that the sequence of exclamation point, quote, dot, newline all disappears as merely the thing that separates tokens. A - is also not a word character, so ago-never in the input file would then give you two tokens: ago and never.
You should still lowercase the inputs; Scanner cannot do this for you.
To set the delimiter:
scanner.useDelimiter("\\W+"); // double backslash. That's not a typo.
EDIT: This answer used [^a-zA-Z]+ before, but as @VGR pointed out in a comment, \\W+ is easier to understand; it's probably more idiomatic in general.
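Putting the fixes together, a sketch of the whole program might look like this (assuming the file really is UTF-8; a single Map replaces the two parallel ArrayLists, which is also far faster than ArrayList.contains on a book-sized input):

import java.io.FileInputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

public class WordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new LinkedHashMap<>();
        try (Scanner fileInput = new Scanner(new FileInputStream("mobydick.txt"), "UTF-8")) {
            fileInput.useDelimiter("\\W+");  // 1+ non-word characters separate tokens
            while (fileInput.hasNext()) {
                String word = fileInput.next().toLowerCase();
                counts.merge(word, 1, Integer::sum);  // add 1, or start at 1
            }
        }
        System.out.println("WORDS COUNT");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}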

Use Get-Content in powershell as java input get extra character

I am practicing using the command line to run a Java program on Windows 10. The program uses Scanner(System.in) to get input from a file and print the strings it reads. The PowerShell command is as follows:
Get-Content source.txt | java test.TestPrint
The content of the source.txt file is as follows:
:
a
2
!
And the TestPrint.java file is as follows:
package test;

import java.util.Scanner;

public class TestPrint {
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        Scanner in = new Scanner(System.in);
        while (in.hasNextLine()) {
            String str = in.nextLine();
            if (str.equals("q")) break;
            System.out.println(str);
        }
    }
}
Then a weird thing happened. The result is
?:
a
2
!
You see, it adds a question mark at the beginning of the first line. Then, when I change the character in the first line of the source.txt file from ":" to "a", the result is
a
a
2
!
It adds a space at the beginning of the first line.
I tested more characters and found a pattern: if the character is larger than "?" (which is 63 in ASCII), it adds a space, as with "A" (65 in ASCII) or "[" (91 in ASCII). If the character is smaller than "?", including "?" itself, it adds a question mark.
Could this be a Unicode issue (see: Java Unicode problems)? I.e. try specifying the charset you want to read in:
Scanner in = new Scanner(System.in, "UTF-8");
EDIT:
Upon further research: in PowerShell 5.1 and earlier, the default code page is Windows-1252. PowerShell 6+ and cross-platform versions have switched to UTF-8. So (from the comments) you may have to specify Windows-1252 encoding:
Scanner in = new Scanner(System.in, "Windows-1252");
To find out what encoding is being used, execute the following in PowerShell:
[System.Text.Encoding]::Default
And you should be able to see what encoding is being used (for me in PowerShell v 5.1 it was Windows-1252, for PowerShell 6 it was UTF-8).
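On the Java side, a quick diagnostic sketch (the class name is illustrative) to see which default the JVM picked, so you can compare it with the PowerShell value above:

import java.nio.charset.Charset;

public class ShowDefaultCharset {
    public static void main(String[] args) {
        // This is the charset new Scanner(System.in) uses when none is given
        System.out.println(Charset.defaultCharset());
    }
}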
There is no text but encoded text.
Every program reading a text file or stream must know and use the same character encoding that the writer used.
An adaptive default character encoding is a 90s solution to a 70s and 80s problem (approx). Today, it's usually best to avoid constructors and methods that use a default, and in PowerShell, add an encoding argument where needed to control input or output.
To prevent data loss, you can use the Unicode character set throughout. UTF-8 is the most common for files and streams. (PowerShell and Java use UTF-16 for text datatypes.)
But you need to start from knowing what the character encoding of the text file is. If you don't know this metadata, that's data loss right there.
Unicode provides that if a file or stream is known to be Unicode, it can start with metadata called a BOM. The BOM indicates which specific Unicode character encoding is being used and what the byte order is (for character encodings with code units longer than a byte). [This provision doesn't solve any problem that I've seen and causes problems of its own.]
(A character encoding, at the abstract level, is a map between codepoints and code units and is therefore independent of byte order. In practice, a character encoding takes the additional step of serializing/deserializing code units to/from byte sequences. So, sometimes using or not using a BOM is included in the encoding's name or description. A BOM might also be referred to as a signature. Ergo, "UTF-8 with signature.")
As metadata, a BOM, if present, should be used if needed and always discarded when putting text into text datatypes. Unfortunately, Java's standard libraries don't discard the BOM. You can use popular libraries or a dozen or so lines of your own code to do this.
Again, start with knowing the character encoding of the text file, and pass that metadata into the processing as an argument.
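For the "dozen or so lines of your own code", a sketch along these lines might do (the class name and structure are illustrative; it handles only the UTF-8 BOM):

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class BomSkippingReader {
    // Returns a UTF-8 reader positioned after the BOM, if one is present.
    static Reader open(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pin.read(head, 0, 3);
        boolean bom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!bom && n > 0) {
            pin.unread(head, 0, n);  // not a BOM: push the bytes back
        }
        return new InputStreamReader(pin, StandardCharsets.UTF_8);
    }
}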

How should I specify Asian char, and String, constants in Java?

I need to tokenize Japanese sentences. What are best practices for representing the char values of kana and kanji? This is what I might normally do:
String s = "a";
String token = sentence.split(s)[0];
But, the following is not good in my opinion:
String s = String.valueOf('あ'); // a Japanese kana character
String token = sentence.split(s)[0];
because people who read my source might not be able to read, or display, Japanese characters. I'd prefer not to insult anyone by writing the actual character. I'd want a "romaji" representation, or something like it. This is an example of the really stupid "solution" I am using:
char YaSmall_hira_char = (char) 12419; // [ゃ] <--- small
char Ya_hira_char = (char) 12420; // [や]
char Toshi_kj_char = (char) 24180; // [年]
char Kiku_kj_char = (char) 32862; // [聞]
That looks absolutely ridiculous. And, it's not sustainable because there are over 2,000 Japanese characters...
My IDE and my java.io.InputStreamReaders are all set to UTF-8, and my code is working fine. But the specter of character-encoding bugs is hanging over my head because I just don't understand how to represent Asian characters as chars.
I need to clean up this garbage I wrote, but I don't know which direction to go. Please help.
because people who read my source might not be able to read, or display, Japanese characters.
Then how could they do anything useful with your code, when dealing with such characters is an integral part of it?
Just make sure your development environment is set up correctly to support these characters in source code, and that you have procedures in place to ensure everyone who works with the code gets the same correct setup. At the very least, document it in your project description.
Then there is nothing wrong with using those characters directly in your source.
I agree that what you are currently doing is unsustainable. It is horribly verbose, and probably a waste of your time anyway.
You need to ask yourself who exactly you expect to read your code:
A native Japanese speaker / writer can read the Kana. They don't need the romaji, and would probably consider it an impediment to readability.
A non-Japanese speaker would not be able to discern the meaning of the characters whether they are written as Kana or as romaji. Your effort would be wasted on them.
The only people who might be helped by romaji would be non-native Japanese speakers who haven't learned to read / write Kana (yet). And I imagine they could easily find a desktop tool / app for mapping Kana to romaji.
So let's step back to your example which you think is "not good".
String s = String.valueOf('あ'); // a Japanese kana character
String token = sentence.split(s)[0];
Even to someone (like me) who can't read (or speak) Japanese, the surface meaning of that code is clear. You are splitting the String using a Japanese character as the separator.
Now, I don't understand the significance of that character. But I wouldn't if it were a constant with a romaji name either. Besides, the chances are that I don't need to know in order to understand what the application is doing. (If I do need to know, I'm probably the wrong person to be reading the code. Decent Japanese language skills are mandatory for your application domain!!)
The issue you raised about not being able to display the Japanese characters is easy to solve. The programmer simply needs to upgrade to software that can display Kana. Any decent Java IDE will be able to cope ... if properly configured. Besides, if this is a real concern, the proper solution (for the programmer!) is to use Java's Unicode escape sequence mechanism to represent the characters; e.g.
String s = String.valueOf('\uxxxx'); // (replace xxxx with hex unicode value)
The Java JDK includes tools that can rewrite Java source code to add or remove Unicode escaping. All the programmer needs to do is to "escape" the code before trying to read it.
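For example, the earlier snippet written with an escape (あ is U+3042, HIRAGANA LETTER A; historically the JDK tool for this kind of rewriting was native2ascii):

String s = String.valueOf('\u3042'); // あ (HIRAGANA LETTER A)
String token = sentence.split(s)[0];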
Aside: You wrote this:
"I'd prefer to not insult anyone by writing the actual character."
What? No Westerner would or should consider Kana an insult! They may not be able to read it, but that's not an insult / insulting. (And if they do feel genuinely insulted, then frankly that's their problem ... not yours.)
The only thing that matters here is whether non-Japanese-reading people can fully understand your code ... and whether that's a problem you ought to be trying to solve. Worrying about solving unsolvable problems is not a fruitful activity.
Michael has the right answer, I think. (Posting this as an Answer rather than a Comment because Comment sizes are limited; apologies to those who are picky about the distinction.)
If anyone is working with your code, it will be because they need to alter how Japanese sentences are tokenized. They had BETTER be able to deal with Japanese characters at least to some degree, or they'll be unable to test any changes they make.
As you've pointed out, the alternatives are certainly no more readable. Maybe less so; even without knowing Japanese I can read your code and know that you are using the 'あ' character as your delimiter, so if I see that character in an input string I know what the output will be. I have no idea what the character means, but for this simple bit of code analysis I don't need to.
If you want to make it a bit easier for those of us who don't know the full alphabet, then when referring to single characters you could give us the Unicode value in a comment. But any Unicode-capable text editor ought to have a function that tells us the numeric value of the character we've pointed at -- Emacs happily tells me that it's #x3042 -- so that would purely be a courtesy to those of us who probably shouldn't be messing with your code anyway.

Not able to decode traditional chinese using java

I want to display a string which is in traditional Chinese in my application GUI.
While debugging, Eclipse showed this string as a mixture of English letters and square boxes.
This is the Java code which I used to decode it. The string 'str' comes from a traditional Chinese .mpg stream.
String TRADITIONAL_CHINESE_ENC = "Big5";
byte[] tmp = str.getBytes();
String decodedString=new String(tmp,TRADITIONAL_CHINESE_ENC);
But the result I am getting in decodedString is also a mixture of letters, square boxes, and some question marks embedded in diamond-shaped boxes, etc.
This is happening only in the case of traditional Chinese. The same code works fine for simplified Chinese, Korean, and other languages.
What could be wrong in my code when dealing with traditional Chinese?
I am using UTF-8 encoding for eclipse.
I can't see anything obviously wrong with that code, though note that str.getBytes() with no argument encodes using the platform default charset, so the round trip is only lossless if str was originally decoded with that same charset.
According to this Wikipedia page, there are 3 common encodings for traditional Chinese characters: Guobiao, UTF-8 and Big5. I suggest you try the two alternatives that you haven't tried, and if that fails try some of the less common alternatives listed.
(It is also possible that the real problem is in the way you are displaying the String ... but the fact that you are displaying Simplified Chinese and Korean correctly suggests that this is not the problem.)
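A quick way to experiment is to decode the same raw bytes with each candidate charset (a sketch; the class name and the sample bytes are illustrative - 0xA4 0xA4 is the Big5 encoding of 中 - and in practice you would substitute the raw bytes from the stream rather than round-tripping through a String):

import java.io.UnsupportedEncodingException;

public class CharsetProbe {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] raw = { (byte) 0xA4, (byte) 0xA4 };  // sample: Big5 bytes for 中
        for (String enc : new String[] { "Big5", "UTF-8", "GB18030" }) {
            System.out.println(enc + ": " + new String(raw, enc));
        }
    }
}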
I am using UTF-8 encoding for eclipse.
I don't think that is relevant. Apart from the default charset used by getBytes(), the code you showed us doesn't depend on the character encoding configured in either the execution platform or the IDE.

Suggested ways of reading a text file with inconsistent formatting

I'm trying to read a text file of numbers as a double array, and after trying various methods (usually resulting in an input mismatch exception) I have come to the conclusion that the text file I am trying to read is inconsistent with its delimiting.
The majority of the text format is in the form "0.000,0.000" so I have been using a Scanner and the useDelimiter(",") to read in each value.
It turns out though (this is a big file of numbers) that some of the formatting is in the form "0.000 0.000" (at the end of a line, I presume), which of course produces an input mismatch exception.
This is an open question really; I'm a pretty basic Java programmer, so I would just like to see if there are any suggestions/ways of performing this. Is Scanner the correct class to use for this?
Thank you for your time!
Read the file as text line by line. Then split each line into parts:
String[] parts = line.split("[ ,]");
Now iterate over the parts and call Double.parseDouble() for each part.
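A sketch of that approach (the file name and the UTF-8 charset are assumptions):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ReadDoubles {
    public static void main(String[] args) throws IOException {
        List<Double> values = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("numbers.txt"), StandardCharsets.UTF_8)) {
            for (String part : line.split("[ ,]")) {
                if (!part.isEmpty()) {  // skip empty parts from adjacent separators
                    values.add(Double.parseDouble(part));
                }
            }
        }
        System.out.println(values);
    }
}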
Scanner allows any Java regex Pattern to function as a delimiter. You should be able to handle both delimiters by doing the following (note the method is useDelimiter, and the + lets a run of separators count as one):
scanner.useDelimiter("[,\\s]+"); // will match commas and whitespace
I'd like to post this as a comment instead of making it a separate answer, but my reputation is too low. Apologies, Alex.
You mentioned having two different delimiter characters used in different instances, not a combination of the two as a single delimiter.
A character class already behaves as a logical OR in a regular expression: [,\s] matches either a comma or a whitespace character, whichever appears. (A vertical bar inside the brackets would be matched literally, so it should be left out.)
scanner.useDelimiter("[,\\s]"); // will match a comma or whitespace, as appropriate
Or line by line:
String[] parts = line.split("[,\\s]");
