How to get all national characters for selected Locale? - java

In my app I need to generate passwords based on all available national characters, like:
private String generatePassword(String charSet, int passwordLength) {
char[] symbols=charSet.toCharArray();
StringBuilder sbPassword=new StringBuilder();
Random wheel = new Random();
for (int i = 0; i < passwordLength; i++) {
int random = wheel.nextInt(symbols.length);
sbPassword.append(symbols[random]);
}
return sbPassword.toString();
}
For Latin we have something like:
charSet="AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz";
How can I get a similar String containing all national characters (the alphabet) for, say, Thai, Arabic or Hebrew?
We all know that Unicode contains the national characters for every locale, so there has to be a way to get them; otherwise I'd be forced to hardcode national alphabets, which is ugly (my app supports more than 10 locales).

Since you're using char[], you aren't going to be able to represent all Unicode code points in all scripts, since some of them will be outside the Basic Multilingual Plane and will not fit in a single char. Unfortunately, there is no easy way to get all the code points for a script without looping through them, like so:
char[] charsForScript(Character.UnicodeScript script) {
StringBuilder sb = new StringBuilder();
for (int cp = 0; cp < Character.MAX_VALUE; ++cp) {
if (Character.isValidCodePoint(cp) && script == Character.UnicodeScript.of(cp)) {
sb.appendCodePoint(cp);
}
}
return sb.toString().toCharArray();
}
This will return all the chars for a given script e.g., LATIN, GREEK, etc.
To get all code points, even outside the BMP, you could use:
// requires: import java.util.ArrayList; import java.util.List;
int[] charsForScript(Character.UnicodeScript script) {
List<Integer> ints = new ArrayList<>();
// MAX_CODE_POINT is itself a valid code point, so the bound is inclusive
for (int cp = 0; cp <= Character.MAX_CODE_POINT; ++cp) {
if (Character.isValidCodePoint(cp) && script == Character.UnicodeScript.of(cp)) {
ints.add(cp);
}
}
return ints.stream().mapToInt(i -> i).toArray();
}
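To connect this back to the password generator in the question, here is a minimal sketch that draws code points rather than chars. The class and method names (ScriptPasswords, lettersForScript, generatePassword) are made up for illustration, and the use of Character.isLetter and SecureRandom is an assumption about what a password alphabet should look like, not something the question specifies. Note that a Unicode script is broader than a locale's everyday alphabet: it can include archaic letters and presentation forms, so a hand-picked list may still be the better choice for some locales.

```java
import java.security.SecureRandom;
import java.util.stream.IntStream;

final class ScriptPasswords {
    // Collects every letter code point of the given script, including those outside the BMP.
    static int[] lettersForScript(Character.UnicodeScript script) {
        return IntStream.rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
                .filter(Character::isLetter)
                .filter(cp -> Character.UnicodeScript.of(cp) == script)
                .toArray();
    }

    // Builds a password of passwordLength code points drawn uniformly from alphabet.
    static String generatePassword(int[] alphabet, int passwordLength) {
        SecureRandom random = new SecureRandom();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < passwordLength; i++) {
            // appendCodePoint writes a surrogate pair when the code point needs one
            sb.appendCodePoint(alphabet[random.nextInt(alphabet.length)]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] hebrew = lettersForScript(Character.UnicodeScript.HEBREW);
        System.out.println(generatePassword(hebrew, 12));
    }
}
```

Because the alphabet is an int[], the password length is counted in code points, so the resulting String may be longer than passwordLength when measured with length().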

Related

How to tokenize Chinese into individual characters in Java? [duplicate]

I need to split a String into an array of single character Strings.
Eg, splitting "cat" would give the array "c", "a", "t"
"cat".split("(?!^)")
This will produce
array ["c", "a", "t"]
"cat".toCharArray()
But if you need strings
"cat".split("")
Edit: which will return an empty first value.
String str = "cat";
char[] cArray = str.toCharArray();
If characters beyond the Basic Multilingual Plane are expected in the input (some CJK characters, new emoji...), approaches such as "a💫b".split("(?!^)") cannot be used, because they break such characters apart (resulting in the array ["a", "?", "?", "b"]); something safer has to be used:
"a💫b".codePoints()
.mapToObj(cp -> new String(Character.toChars(cp)))
.toArray(size -> new String[size]);
split("(?!^)") does not work correctly if the string contains surrogate pairs. You should use split("(?<=.)").
String[] splitted = "花ab🌹🌺🌷".split("(?<=.)");
System.out.println(Arrays.toString(splitted));
output:
[花, a, b, 🌹, 🌺, 🌷]
To sum up the other answers...
This works on all Java versions:
"cat".split("(?!^)")
This only works on Java 8 and up:
"cat".split("")
An efficient way of turning a String into an array of one-character Strings would be to do this:
String[] res = new String[str.length()];
for (int i = 0; i < str.length(); i++) {
res[i] = Character.toString(str.charAt(i));
}
However, this does not take account of the fact that a char in a String could actually represent half of a Unicode code-point. (If the code-point is not in the BMP.) To deal with that you need to iterate through the code points ... which is more complicated.
This approach will be faster than using String.split(/* clever regex*/), and it will probably be faster than using Java 8+ streams. It is probably faster than this:
String[] res = new String[str.length()];
int i = 0;
for (char ch : str.toCharArray()) {
res[i++] = Character.toString(ch);
}
because toCharArray has to copy the characters to a new array.
for(int i=0;i<str.length();i++)
{
System.out.println(str.charAt(i));
}
Maybe you can use a for loop that goes through the String and extracts the characters one by one using the charAt method.
Combined with an ArrayList<String>, for example, you can get your array of individual characters.
If the original string contains supplementary Unicode characters, then split() would not work, as it splits these characters into their surrogates. To handle these characters correctly, code like this works:
String[] chars = new String[stringToSplit.codePointCount(0, stringToSplit.length())];
for (int i = 0, j = 0; i < stringToSplit.length(); j++) {
int cp = stringToSplit.codePointAt(i);
char c[] = Character.toChars(cp);
chars[j] = new String(c);
i += Character.charCount(cp);
}
In my previous answer I mixed up with JavaScript. Here goes an analysis of performance in Java.
I agree with the need for attention on the Unicode Surrogate Pairs in Java String. This breaks the meaning of methods like String.length() or even the functional meaning of Character because it's ultimately a technical object which may not represent one character in human language.
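As a small illustration of that point, the sketch below compares the two counts for the string "a💫b". The class and method names are made up for this example. Keep in mind that even code points are not always what a reader perceives as one character (combining marks are a separate issue), so this only demonstrates the surrogate pair part.

```java
final class LengthDemo {
    // UTF-16 code units: this is what String.length() counts.
    static int codeUnits(String s) {
        return s.length();
    }

    // Unicode code points: a surrogate pair is counted once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDCABb"; // the string "a💫b", emoji written as its surrogate pair
        System.out.println(codeUnits(s));  // prints 4
        System.out.println(codePoints(s)); // prints 3
    }
}
```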
I implemented 4 methods that split a string into list of character-representing strings (Strings corresponding to human meaning of characters). And here's the result of comparison:
A line is a String consisting of 1000 arbitrarily chosen emojis and 1000 ASCII characters (1000 times <emoji><ascii>, 2000 "characters" in total in the human sense).
(discarding 256 and 512 measures)
Implementations:
codePoints (java 11 and above)
public static List<String> toCharacterStringListWithCodePoints(String str) {
if (str == null) {
return Collections.emptyList();
}
return str.codePoints()
.mapToObj(Character::toString)
.collect(Collectors.toList());
}
classic
public static List<String> toCharacterStringListWithIfBlock(String str) {
if (str == null) {
return Collections.emptyList();
}
List<String> strings = new ArrayList<>();
char[] charArray = str.toCharArray();
int delta = 1;
for (int i = 0; i < charArray.length; i += delta) {
delta = 1;
if (i < charArray.length - 1 && Character.isSurrogatePair(charArray[i], charArray[i + 1])) {
delta = 2;
strings.add(String.valueOf(new char[]{ charArray[i], charArray[i + 1] }));
} else {
strings.add(Character.toString(charArray[i]));
}
}
return strings;
}
regex
static final Pattern p = Pattern.compile("(?<=.)");
public static List<String> toCharacterStringListWithRegex(String str) {
if (str == null) {
return Collections.emptyList();
}
return Arrays.asList(p.split(str));
}
Annex (RAW DATA):
codePoints;classic;regex;lines
45;44;84;256
14;20;98;512
29;42;91;1024
52;56;99;2048
87;121;174;4096
175;221;375;8192
345;411;839;16384
667;826;1285;32768
1277;1536;2440;65536
2426;2938;4238;131072
We can do this simply by
const string = 'hello';
console.log([...string]); // -> ['h','e','l','l','o']
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_syntax says
Spread syntax (...) allows an iterable such as an array expression or string to be expanded...
So, strings can be quite simply spread into arrays of characters.

Random seed generator

EDIT: Sorry for wrong posting, I'll check the forum locations better next time. I selected an answer as accepted, I think this considers the question closed. Thanks for the helpful replies and tips!
Original:
I need to upgrade to the new Iota wallet today. It doesn't have a random seed generator, so I built my own and ran it from NetBeans. Can you give me your opinion? It has to be 81 characters long, and contain A through Z and the number 9. Nothing else. Here's the entire code.
Does this leave anything insecure? Could the code have been cleaner from a standpoint of convention?
class SeedGenerator {
/*
This is a program to randomize a seed for Iota wallet
*/
public static void main(String[] args) {
System.out.println("*****");
int seedLength = 81;
String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ9"; //Only characters allowed in Iota seed are A-Z and number 9
char[] charArray = alphabet.toCharArray(); //Turn string into array of characters to be referenced by index
String[] seed = new String[seedLength];
System.out.print("Random wallet seed is: ");
for (int i = 0; i < seedLength; i++) {
Random newRandomNumber = new Random();
int seedIndex = newRandomNumber.nextInt(alphabet.length()); //This is an array of index numbers to pull from charArray
// System.out.print(seedIndex + " "); //Used for testing the random character index range
seed[i] += charArray[seedIndex];
System.out.print(charArray[seedIndex] + "");
}
System.out.println();
System.out.println("*****");
}
}
When asking for code to be reviewed, you should post it here. But regardless of that, there are much more efficient ways to generate a random character.
One such way would be to generate a random character between 65 and 90, the decimal values for A-Z on the ASCII table. And then, just cast the value as a char to get the actual letter corresponding to the number. However, you say that you want the number 9 to also be included, so you can extend this to 91, and if you get 91, which on the ASCII table is [, add the number 9 to your string instead of that.
This code accomplishes that quite easily:
String mySeed = "";
for(int i=0; i<81; i++)
{
int randomNum = (int)(Math.random()*27) + 65;
if(randomNum==91)
mySeed+="9";
else
mySeed+=(char)randomNum;
}
System.out.println(mySeed);
And, as mentioned by @O.O., you can look at generating a secure random number here.
I recommend using the official IOTA Java library named Jota.
Class SeedRandomGenerator has a generateNewSeed implementation:
public static String generateNewSeed() {
char[] chars = Constants.TRYTE_ALPHABET.toCharArray();
StringBuilder builder = new StringBuilder();
SecureRandom random = new SecureRandom();
for (int i = 0; i < Constants.SEED_LENGTH_MAX; i++) {
char c = chars[random.nextInt(chars.length)];
builder.append(c);
}
return builder.toString();
}
The constants TRYTE_ALPHABET and SEED_LENGTH_MAX are defined in the Constants class.

How to find a root word in an ArrayList

I'm working on an NLP project and am trying to match a specific input with a root in an ArrayList.
For example, the user will enter لاعبون and the code should find the word لعب in an ArrayList, but when I run my code it gives me more than one root.
for(String dbData : rootList) {
//System.out.println(dbData);
// if(dbData.contains(x)) {
// System.out.println(dbData);
// }
for (int i = 0; i < dbData.length(); i++) {
c = dbData.charAt(i);
for (int j = 0; i < x.length(); i++) {
d = x.charAt(i);
if (c == d && m != rootList.size()) {
match = true;
//System.out.println(dbData);
} else {
++m;
match = false;
//System.out.println("لا يوجد تطابق");
}
if(match) {
System.out.println(dbData);
container = dbData;
}
}
}
}
This does not seem like the right approach to stemming. Try the steps below, which are a simple way to find stems in Arabic.
First you need a list of stems, and obviously you have that.
Then you need to write down the Arabic grammar rules and word forms that reduce a word to a stem.
Now you just convert your rules to Java regex.
For example, if you want to find لعب from لاعبون, you should remove ون, as it marks person and number, and then check whether لاعب is derived from one of the stems. As you know, لاعب is the فاعل form of لعب, so you should choose لعب.
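The steps above can be sketched as follows. This is a toy illustration, not a real stemmer: it encodes only the two rules from the example (strip the ون ending, then undo the فاعل form), and the names RootFinder, findRoot, FAAIL and PLURAL_SUFFIX are made up for this sketch. A usable rule set would need many more suffixes, prefixes and forms.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RootFinder {
    // Assumed rule: the فاعل form is C1 + ا + C2 + C3, and its root is C1 C2 C3.
    private static final Pattern FAAIL = Pattern.compile("^(.)\u0627(.)(.)$");
    // Assumed rule: the ending ون marks person and number and can be stripped.
    private static final String PLURAL_SUFFIX = "\u0648\u0646";

    // Returns the matching root from roots, or empty if no rule leads to a known root.
    static Optional<String> findRoot(String word, List<String> roots) {
        String stem = word.endsWith(PLURAL_SUFFIX)
                ? word.substring(0, word.length() - PLURAL_SUFFIX.length())
                : word;
        if (roots.contains(stem)) {
            return Optional.of(stem);
        }
        Matcher m = FAAIL.matcher(stem);
        if (m.matches()) {
            String candidate = m.group(1) + m.group(2) + m.group(3);
            if (roots.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Calling findRoot with لاعبون and a root list that contains لعب returns exactly one root, because a candidate is accepted only when the whole reduced word equals a list entry, rather than when single characters happen to match.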

Java make file reader account for specific letters

Well, I'm almost finished with my world editor thanks to this great community. The only thing I still need to know is how to make my file-reading code handle a different line format. When I hit enter on my keyboard, I write the coordinates of a Vector3f to a text file; this Vector3f is the position of my active GameObject. My ProcessText method can read a text file and process the coordinates, but it can only read one format:
public void ProcessText()
{
String file_name = "C:/Users/Server/Desktop/textText.txt";
try
{
ProcessCoords file = new ProcessCoords(file_name);
String[] aryLines = file.OpenFile();
int i;
for (i = 0; i < aryLines.length; i++)
{
System.out.println(aryLines[i]);
if(aryLines[i].startsWith("makeGrass:")) {
String Arguments = aryLines[i].substring(aryLines[i].indexOf(":")+1, aryLines[i].length());
String[] ArgArray = Arguments.split(",");
this.makeGrass(Double.parseDouble(ArgArray[0]),
Double.parseDouble(ArgArray[1]),
Double.parseDouble(ArgArray[2]));
}
}
} catch(IOException e) {
System.out.println(e.getMessage());
}
}
In the above example my ProcessText method can only process the coordinates if they are written like this:
makeGrass:x,y,z //for example makeGrass:5,1,9
But when I press enter and write the coordinates the way my engine provides them, I get a different format:
makeGrass:(x y z) //for example makeGrass:(3 1 4)
Now what I need to know is how to rewrite the code in my ProcessText method so that it also handles the other format, which has brackets at the beginning and end and uses spaces instead of commas to separate x from y and y from z.
I really don't know where else I would find an answer to this question, so I'd appreciate any help and an explanation of how this works.
Thanks a lot in advance!
You want to accept as many formats as possible?
Instead of splitting I would try to match, this is safer and doesn't need any pre- or post-processing of the input or the received substrings:
Pattern pattern = Pattern.compile("([0-9]+)"); // outside of method
long[] ArgArray = new long[3];
Matcher matcher = pattern.matcher(Arguments);
int i = 0;
while (matcher.find() && i < 3) {
ArgArray[i++] = Long.parseLong(matcher.group(1));
}
// there was a mistake if i < 3, otherwise you have 3 numbers in ArgArray
If you want to split, you could maybe try this: split("[^0-9]+")
To only match makeGrass:(x y z)
Pattern pattern = Pattern.compile("^makeGrass:\\(([0-9]+) ([0-9]+) ([0-9]+)\\)$");
This way you can match the whole line directly and get the 3 numbers in groups 1 to 3 (as Strings) after calling find once. The if (matcher.find()) check decides whether it is a valid makeGrass line, and if so the line can be processed inside the if.
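A minimal sketch of the matching approach that accepts both formats from the question might look like this. The class and method names (CoordParser, parseMakeGrass) are made up, and the number pattern is an assumption: it also allows a leading minus sign and a fractional part, since the original code parses doubles.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CoordParser {
    private static final String PREFIX = "makeGrass:";
    // Optional sign, digits, optional fractional part.
    private static final Pattern NUMBER = Pattern.compile("-?\\d+(?:\\.\\d+)?");

    // Returns the three coordinates of a makeGrass line, or null if the line does not qualify.
    static double[] parseMakeGrass(String line) {
        if (!line.startsWith(PREFIX)) {
            return null;
        }
        Matcher m = NUMBER.matcher(line.substring(PREFIX.length()));
        double[] xyz = new double[3];
        int count = 0;
        while (m.find()) {
            if (count == 3) {
                return null; // more than three numbers on the line
            }
            xyz[count++] = Double.parseDouble(m.group());
        }
        return count == 3 ? xyz : null;
    }
}
```

Both makeGrass:5,1,9 and makeGrass:(3 1 4) yield three values, because the separators and brackets are simply skipped between matches.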
If you can guarantee that there will not be any spaces in the makeGrass:x,y,z format and that there will not be any parentheses in it either, then you can use String.replaceAll(). Something like below:
String myString = "makeGrass:(3 1 4)";
myString = myString.replaceAll("\\(", ""); // remove (
myString = myString.replaceAll("\\)", ""); // remove )
myString = myString.replaceAll(" ", ","); // replace spaces with commas
Then you don't need two different methods to handle the two types of input. Just format the line as shown above and pass both inputs to the same method.
Going this way, you will not need to split on certain chars and then rebuild the string to fit your format.
Just split with the regular expression : [\s,]
Splits the String at places where there is either a white space or a ,.
And use this to get rid of any brackets if present :
Arguments = Arguments.replaceAll("\\(", "").replaceAll("\\)", "");
( and ) are part of regex notation, so they need to be escaped with \. Since \ is itself an escape character in Java string literals, it needs to be escaped with another \. Hence the pattern becomes "\\(". Strings are immutable, so replaceAll returns a new string, and the result has to be stored back in the String variable. Both replacements are done on the same line.
The modified code for the method is :
public void ProcessText()
{
String file_name = "C:/Users/Server/Desktop/textText.txt";
try
{
ProcessCoords file = new ProcessCoords(file_name);
String[] aryLines = file.OpenFile();
int i;
for (i = 0; i < aryLines.length; i++)
{
System.out.println(aryLines[i]);
if(aryLines[i].startsWith("makeGrass:")) {
String Arguments = aryLines[i].substring(aryLines[i].indexOf(":")+1, aryLines[i].length());
Arguments = Arguments.replaceAll("\\(", "").replaceAll("\\)", "");
String[] ArgArray = Arguments.split("[\\s,]");
this.makeGrass(Double.parseDouble(ArgArray[0]),
Double.parseDouble(ArgArray[1]),
Double.parseDouble(ArgArray[2]));
}
}
} catch(IOException e) {
System.out.println(e.getMessage());
}
}

Code optimization by choosing another data structure

I have a piece of code that needs to be optimized.
for (int i = 0; i < wordLength; i++) {
for (int c = 0; c < alphabetLength; c++) {
if (alphabet[c] != x.word.charAt(i)) {
String res = WordList.Contains(x.word.substring(0,i) +
alphabet[c] +
x.word.substring(i+1));
if (res != null && WordList.MarkAsUsedIfUnused(res)) {
WordRec wr = new WordRec(res, x);
if (IsGoal(res)) return wr;
q.Put(wr);
}
}
}
}
Words are represented by strings. The problem is that the code on lines 4-6 creates too many String objects, because strings are immutable.
Which data structure should I change my word representation to if I want faster code? I have tried changing it to char[], but then I have trouble getting the following code to work:
x.word.substring(0,i)
How do I get a subarray from a char[]? And how do I concatenate the char and the char[] on lines 4-6?
Is there any other suitable, mutable data structure that I can use? I have thought of StringBuffer but can't find suitable operations on it.
This function generates, for a given word, all the words that differ from it by one character.
WordRec is just a class with a string representing a word, and a pointer to the "father" of that word.
Thanks in advance
You can reduce number of objects by using this approach:
StringBuilder tmp = new StringBuilder(wordLength);
tmp.append(x.word);
for (int i=...) {
for (int c=...) {
if (...) {
char old = tmp.charAt(i);
tmp.setCharAt(i, alphabet[c]);
String res = tmp.toString();
tmp.setCharAt(i, old);
...
}
}
}
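Filled in, the outline above could look like the following self-contained sketch. It is an assumption-laden stand-in for the original code: the names Neighbors and oneLetterNeighbors are made up, and a plain Set<String> replaces the question's WordList, whose Contains and MarkAsUsedIfUnused methods are not shown. One String is still created per candidate by toString(), but the two substring calls and the concatenation are gone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class Neighbors {
    // Returns every word in wordList that differs from word in exactly one position.
    static List<String> oneLetterNeighbors(String word, char[] alphabet, Set<String> wordList) {
        List<String> result = new ArrayList<>();
        StringBuilder tmp = new StringBuilder(word);
        for (int i = 0; i < tmp.length(); i++) {
            char old = tmp.charAt(i);
            for (char c : alphabet) {
                if (c == old) {
                    continue; // same letter would reproduce the original word
                }
                tmp.setCharAt(i, c);
                String candidate = tmp.toString();
                if (wordList.contains(candidate)) {
                    result.add(candidate);
                }
            }
            tmp.setCharAt(i, old); // restore before moving to the next position
        }
        return result;
    }
}
```

Restoring the character once per position, after the inner loop, is enough because every inner iteration overwrites the same index.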
