how to sort non-english strings? - java

I did look up answers, and they are good for the standard alphabet. but I have a different situation than that.
so, I am programming in Java. I am writing a certain program. this program has at some place some list of string items.
I would like to sort those string items according to the alphabet.
if I would sort it by English alphabet, it would be easy since usually all code pages are compatible with American standard code for information interchange (ASCII), and they have all letters of English alphabet already sorted, so, if I would like to sort my list, I would only have to compare the values of chars to determine which letter goes where.
but my problem is, that I do not want to sort a list by using the English alphabet.
my program has the option to display in English or some other languages.
the problem is that some of those languages have different alphabet from the English alphabet, therefore letters are not the same as those in the English alphabet, and thus simple <, and > validation of char values does not work because letters are not sorted correctly in the code page.
for the purposes of this question lets say English alphabet is as follows:
a,
b,
c,
d,
e,
f,
g.
let's say there is a certain country named "ABC" whose alphabet goes like this:
d,
b,
g,
e,
a,
c,
f.
first of all, if a is equal to 97 on code page, b 98, c 99 et cetera, how can I sort my list using the second alphabet in this example, since the second alphabet has its first letter equal to 100, second equal to 98, third to 103 et cetera?
and my second question:
unfortunately, some of the countries I am translating my program for have alphabets where some combinations of letters are treated as one letter.
for my second example, let's say that country "def" has the following alphabet:
d,
g,
be,
e,
fe,
c,
f.
here:
d - the first letter in the alphabet,
g - second letter in the alphabet,
be - third letter in the alphabet (ONE letter, although it is written as two letters, it is considered to be just one letter, and has its position in the alphabet),
e - fourth letter in the alphabet,
fe - fifth letter in the alphabet (also written as two letters, but treated as ONE letter),
c - sixth letter in the alphabet,
f - seventh letter in the alphabet.
as you can see in this imaginary example number 2 of imaginary country "def", this country has really screwed up the alphabet.
and after presenting these two examples of these two alphabets of two imaginary countries, you understand why I cannot use the standard method for sorting strings.
so, can you please help me out with this sorting. I am not sure what I can do to sort according to this screwed up alphabet.
post scriptum:
lines below this are not important for the question, but they are just more info if anyone wants to know where I have found such screwed up alphabet
well, i gave those examples, which consist of 7 randomly ordered letters, just for the purpose of this question - to make it simpler. in case you wonder what my real problem is - i am trying to translate my program to Croatian. the Croatian alphabet is really screwed up because it goes as follows:
1 |a
2 |b
3 |c
4 |č
5 |ć
6 |d
7 |dž
8 |đ
9 |e
10|f
11|g
12|h
13|i
14|j
15|k
16|l
17|lj
18|m
19|n
20|nj
21|o
22|p
23|r
24|s
25|š
26|t
27|u
28|v
29|z
30|ž
as you can see, Croatian alphabet is somewhat similar to the English alphabet, but most of the letters are not at the same location as English ones, and several of them do not exist in English alphabet at all, and several letters are one letter which is written as two letters. so really difficult to sort. so I hope someone knows some method of doing it.
of course, there is the dumbest method for sorting, which will always work and can sort anything: a switch statement. I compare two string items, and for each letter I use a switch statement with 31+default=32 cases, each of which has its own switch with 32 cases. that is 1024 cases in total, and if my average case has 4 lines of code, my sort method for a non-English alphabet would be at least 4096 lines long.
and that is a huge method.
this is the dumbest way of sorting, but only one I can figure out at the moment.
so I am asking here because I hope someone would know any simpler method to do this. the method which is not so big as 4k lines of code just to sort stupid strings.
I have a method for sorting English strings and it takes up only a bit more than 10 lines of code.
I hope someone can suggest me something less than 4k lines of code.
so if anyone knows the simpler solution, I would appreciate it.
thanx.

You use a Collator for that. Collators are Java's way to handle internationalized comparisons.
import java.text.Collator;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
List<String> mylist = ...;
Locale croatian = new Locale("hr", "HR");
// Put whatever Locale you need as the argument to the getInstance method.
Collator collator = Collator.getInstance(croatian);
Collections.sort(mylist, collator);
A Locale is not just a "language" but also many other conventions. The same language can be sorted differently depending on the country, region, or convention within the country. That's why a Locale is identified by up to 3 parts: "language", "country" (region), and "variant".

The concept is called collation. You can look up the concept to know more about it. For example, Oracle/Sun has a tutorial about this concept:
https://docs.oracle.com/javase/tutorial/i18n/text/rule.html
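For an alphabet that no built-in Locale covers, such as the two imaginary ones in the question, java.text.RuleBasedCollator accepts the ordering as a rule string. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation: the class name, the sortWith helper, and both rule strings are made up for this example.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CustomAlphabetSort {
    // Hypothetical rule strings for the two imaginary alphabets in the question.
    static final String ABC_RULES = "< d < b < g < e < a < c < f";
    // "be" and "fe" are written as multi-character entries (contractions).
    static final String DEF_RULES = "< d < g < be < e < fe < c < f";

    // Returns a sorted copy of words, ordered by the given collation rules.
    static List<String> sortWith(String rules, List<String> words) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            List<String> sorted = new ArrayList<>(words);
            sorted.sort(collator);
            return sorted;
        } catch (ParseException e) {
            // The rule string itself was malformed.
            throw new IllegalArgumentException("Bad collation rules: " + rules, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sortWith(ABC_RULES,
                Arrays.asList("a", "b", "c", "d", "e", "f", "g")));
        System.out.println(sortWith(DEF_RULES,
                Arrays.asList("c", "fe", "e", "f", "be", "g", "d")));
    }
}
```

Listing "be" and "fe" as single entries in the rule string is what lets the collator treat each of them as one letter with its own position.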

Related

Making a Hyperwebster Dictionary

I just watched a VSauce video and he mentioned that the Hyperwebster dictionary consists of an infinite number of words, where each entry is followed by the next one in English alphabetical order. Under that logic, every name, joke, phrase, book, and insult has been written in the dictionary. Basically, it lists words like this:
AAAAAA
AAAAAB
AAAAAC
..
ZZZZZZ
and this can be at any length. In my case, I just want a max of 3 characters (because that is 26^3 which is already a huge number, I don't want my compiler to break). I have a basic idea of how to do this, but I don't know how to apply each 'char' variable to be in order (as in ABC, not something random like QLD).
Another scenario I am interested in is making the first letter an "index" so I can have it set to "Series A", "Series B", etc., but that would only add to the complexity. I want to be able to change the number of characters it will try to find. Also, I don't want a GUI obviously. Just output into the console.
I just wanted to know how I would go about doing something like this and how I can set it so it will create System.out.println(char1 + char2 + char3); and each output is a new thing like "aaa" "aab" "aac"
This is a better specification than your original question. Here's a suggestion:
char first = 'a';
first++;
System.out.println(first);
>>>'b'
Given the above behavior, we can write a loop:
for (char first = 'a'; first <= 'z'; first++) {
System.out.print(first);
}
This works because chars have underlying numeric representations, which we can treat like integers. System.out.println, however, prints a char as a character rather than as its integer value.
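The loop above can be extended to any word length by filling one position at a time. The following is only a sketch of that idea: the class name Hyperwebster and the methods words and build are hypothetical, and it assumes lowercase a-z.

```java
import java.util.ArrayList;
import java.util.List;

class Hyperwebster {
    // Returns every lowercase word of exactly `length` letters, in alphabetical order.
    static List<String> words(int length) {
        List<String> result = new ArrayList<>();
        build(new char[length], 0, result);
        return result;
    }

    // Fills buffer[position] with each letter in turn, then recurses to the next position.
    private static void build(char[] buffer, int position, List<String> out) {
        if (position == buffer.length) {
            out.add(new String(buffer));
            return;
        }
        for (char c = 'a'; c <= 'z'; c++) {
            buffer[position] = c;
            build(buffer, position + 1, out);
        }
    }

    public static void main(String[] args) {
        // Prints aaa, aab, aac, ... zzz (26^3 = 17576 lines).
        for (String word : words(3)) {
            System.out.println(word);
        }
    }
}
```

Changing the argument of words changes the number of characters, so no extra nested loop is needed per letter.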

Replacing parts of a string with characters JAVA

I know I was here earlier asking something similar, but I think I have narrowed down what i want to ask.
Ok, so I am making a program that plays the game of hangman on the jedit console. The user will guess one character at a time. At the beginning of the game, the program will display asterisks the same length of the word they are guessing. They have as many guesses as letters in the word. When they get a letter correct, the program will display the letters in place of asterisks. Here is an example of what the console should look like.
if the word is homework ********
they guess the letter e ***e**** (the e only appeared bold because the asterisks triggered formatting; it doesn't need to be bold)
then they guess the letter h h**e****
etc until there are no more asterisks
So I created a method that prints out the number of asterisks based on the number of letters in the word. I don't know how to place the letters in the place of the asterisks. I want to know if I should make a method that replaces the asterisks, or how else I can go about this. Thank you in advance for the help.
p.s I am not asking for anyone to dump code on me, that is not what I want. Just having help, and me having someone to ask questions to about things that I don't understand would be nice. by the way, I am in an intro to computer science class, so my knowledge of java is fairly low.
There are many ways you could approach this. The first that popped into my head is that you could start with a char[] the same length as the answer string. Look up the Arrays class for an easy way to fill it with asterisks. As the user guesses letters, search the answer string for that letter and replace the corresponding indexes of the char[]. Then construct a String from the char[] and display it.
Why not make something more clever, make a list of all the chars guessed so far and each time you want to print the word just go over each letter and replace it with * if not in the set.
Short: make a set of all the guesses so far. You don't have to work on the same data structure as you show the user.
I would use a list of characters instead of String for ****.
List<Character> hiddenWord = new ArrayList<Character>();
Instantiate the list with the number of * you need.
Create a function that will receive the guessed letter.
Check if the word contains that letter (use indexOf(int ch, int fromIndex) repeatedly until you get -1 - read about it here), and for each result you get that is !=-1, set the position in the array to be that letter (something like hiddenWord.set(poz, letter), where poz is the result of indexOf and letter is the guessed letter).
You can use a StringBuilder instead of a String. The StringBuilder class has a setCharAt method.
Briefly, you will have a variable String word for the word being guessed, and a StringBuilder guess for the asterisks and guessed letters. When a letter is guessed, you update guess with setCharAt.
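The indexOf and setCharAt suggestions above can be combined roughly as follows. This is a sketch, not the required solution: the class name HangmanMask and the methods newMask and reveal are invented for illustration.

```java
class HangmanMask {
    // Builds the initial display: one asterisk per letter of the word.
    static StringBuilder newMask(int length) {
        StringBuilder mask = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            mask.append('*');
        }
        return mask;
    }

    // Replaces the asterisk at every position where `word` contains `guess`.
    // Returns true if at least one letter was revealed.
    static boolean reveal(String word, StringBuilder mask, char guess) {
        boolean found = false;
        int index = word.indexOf(guess);
        while (index != -1) {
            mask.setCharAt(index, guess);
            found = true;
            // Continue searching after the position just revealed.
            index = word.indexOf(guess, index + 1);
        }
        return found;
    }

    public static void main(String[] args) {
        String word = "homework";
        StringBuilder mask = newMask(word.length());
        reveal(word, mask, 'e');
        System.out.println(mask);
        reveal(word, mask, 'h');
        System.out.println(mask);
    }
}
```

The game is over once the mask no longer contains an asterisk, which can be checked with mask.indexOf("*") == -1.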

How to make a scanner in Java that checks if the first letter is a character between A-V and if the second character is a number between 1-20?

How to make a scanner that checks if the first letter is a character between A-V and if the second character is a number between 1-20? Some examples are: '.B4', 'H10.', '**V1', 'L19*', 'M12', or 'N14'.
I'm kind a new to Java. Still learning it in school. I've followed the lessons for about half a year now.
Now I've got an assignment for school. It is about creating a text-based minesweeper. I succeeded in printing the board and placing the mines. But now I'm stuck on getting the right input.
If you use '*' in the scanner like * B4 or B4* it should mark a square.
If you use '.' in the scanner like .B4 or B4. it should unmark a square.
And if you enter B4 it should open.
But I can't get this done in a neat way. I've tried to make sub-strings of it to check if every character is the right one but after I did that my code was kind of chaotic and it didn't work as supposed to.
I've tried it like: "Example 3 : Validating vowels in: Validating input using java.util.Scanner" only I used a variable of the length of my board. So if the board was 10 by 10 it wouldn't go further than J10. But that didn't work either for me.
So I was hoping that you could help me solving this problem.
As this is an assignment, I'll just give you a guideline rather than actual code.
First, you need to get the input into some format. Consider reading the input in from the scanner and storing it into a string.
We can then make use of Java's String functions, a list of which can be found here. Try to find a function that could be useful, perhaps one that lets us get the character at a certain index.
We can then do checks on the string. First we check the first character (the character at index 0), we want to know if that is a letter from A-V. To do this we can do a check on the ASCII numbers. Assuming you just want capital letters, if we convert A to an int, then it will have the value 65. V has the value 86. All the numbers in between correspond to the ASCII values of the letters in between.
Thus we can do a check, convert the first character to an integer, let's call it x. If x >= 65 && x <= 86, then it's a letter we can care about.
Next, you need to do the number checking. For this, take a look at the function Integer.parseInt(String s). It takes a String and converts it to an integer. The number can have one digit (< 10) or two (>= 10), so pass everything after the letter to it, then check that the result lies between 1 and 20.
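Put together, the two checks might look like the sketch below. It assumes the '*' and '.' markers have already been removed from the input, and the class name CellInput and method isValidCell are hypothetical.

```java
class CellInput {
    // Returns true for a column letter A-V followed by a row number 1-20, e.g. "B4" or "H10".
    static boolean isValidCell(String cell) {
        if (cell == null || cell.length() < 2 || cell.length() > 3) {
            return false;
        }
        char column = cell.charAt(0);
        if (column < 'A' || column > 'V') {
            return false;
        }
        String digits = cell.substring(1);
        // Reject anything that is not a plain digit, such as the sign in "B+4".
        for (int i = 0; i < digits.length(); i++) {
            char c = digits.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        int row = Integer.parseInt(digits);
        return row >= 1 && row <= 20;
    }

    public static void main(String[] args) {
        System.out.println(isValidCell("B4"));
        System.out.println(isValidCell("W3"));
    }
}
```

Comparing chars directly ('A' and 'V') is the same check as comparing against 65 and 86, just easier to read.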

Algorithm and Data Structure for Checking letters in a word with another set of letters

I have a dictionary of 200,000 words and a set of letters. I need an algorithm to check if all the letters of a word are in that set of letters. It's very slow to check the words one by one. Because there is a huge number of words to process, I need a data structure to do this. Any ideas? Thanks!
For example: I have a set of letters {b,g,e,f,t,u,i,t,g,n,c,m,m,w,c,s} and I want to check the words "big" and "buff". All letters of "big" are in the original set, so "big" is what I want, while "buff" is not what I want because there is only one "f" in the original set.
This is what i wanna do.
This is for something like Scrabble or Boggle, right? Well, what you do is pre-generate your dictionary by sorting the letters in each word. So, word becomes dorw. Then you shove all these into a Trie data structure. So, in your Trie, the sequence dorw would point to the value word.
[Note that because we sorted the words, they lose their uniqueness, so one sorted word can point to multiple different words. ie your Trie needs to store a list or array at its data nodes]
You can save this structure out if you need to load it quickly later without all the sorting steps.
What you then do is take your input letters and you sort them too. You then start walking through your Trie recursively. If the current letter matches an existing path in the Trie, you follow it. Because you can have unused letter, you also allow the current letter to be dropped.
And it's that simple. Any time you encounter a node in your Trie that has a value, that's a word that you can make out of the letters you used to get there. You just add these words to a list as you find them, and when the recursion is done you have found every possible word.
If you have repeated letters in your input, you may need extra logic to prevent multiple instances of the same word being given (unless you want that). That logic can be invoked during the step that 'leaves out' a letter (you just skip past all the repeated letters) to the next letter.
[edit] You seem to want to do the opposite. My solution above finds all possible words that can be made from a set of letters. But you want to test a specific word to see if it's allowed, given the set of letters you have.
This is simple.
Store your available letters as a histogram. That is, for each letter, you store the number that you have. Then, you walk through each letter in your test word, building a new histogram as you go. As soon as one of your histogram buckets exceeds the value in your available-letters, the word cannot be made. If you get all the way to the end, you can successfully make the word.
You can use an array to mark the letter set. Each element in the array stands for a letter. To convert a letter to its element position, just subtract the ASCII code of 'a' (or 'A' for uppercase letters). The first element then stands for 'a', the second for 'b', and so on; if you need uppercase letters too, they can start at the 27th element. The element value is the number of occurrences. For example, the array {2, 0, 1, 0, ...} stands for a set like {a, c, a}. The pseudocode could be:
for each word
    copy the array to a new one
    for each letter in the word
        get the element position of the letter: position = letter - 'a'
        decrease the element value in the new array by one: new_array[position]--
        if the value is negative, return not found: if new_array[position] < 0 {return not found;}
    if no value went negative, the word is found
Sort the set, then sort each word and do a "merge"-like operation.
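The histogram check described in the answers above could be written in Java along these lines. It is a sketch that assumes lowercase a-z only; the class name LetterBag and the methods countLetters and canMake are made up for the example.

```java
class LetterBag {
    // Counts how many times each letter a-z occurs in the available letters.
    static int[] countLetters(String letters) {
        int[] counts = new int[26];
        for (char c : letters.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    // Returns true if every letter of `word` can be taken from `available`,
    // respecting how many copies of each letter exist.
    static boolean canMake(String word, int[] available) {
        int[] used = new int[26];
        for (char c : word.toCharArray()) {
            if (c < 'a' || c > 'z') {
                return false;
            }
            int position = c - 'a';
            used[position]++;
            if (used[position] > available[position]) {
                return false; // more copies needed than the set provides
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] available = countLetters("bgeftuitgncmmwcs");
        System.out.println(canMake("big", available));
        System.out.println(canMake("buff", available));
    }
}
```

The available-letter counts are computed once and reused, so each word costs only one pass over its own letters.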

Need help writing a descrambling method for substitution cipher

I need some help on a Java assignment. We are given a scrambled text file, which was scrambled using a substitution cipher, where every letter in the text is simply swapped out for another letter. My program is almost finished, but I'm having trouble figuring out how to write the final "descramble" method, which takes the scrambled text and replaces each letter with its correct substitute in order to reveal the correct text.
These are the instructions provided in the assignment:
The descrambling is done by using the letter in the scrambled text as the index in the char array. For example, if the scrambled text has a letter B, you replace it with the character at index 2 in the array. All whitespace and punctuation from the original file should also be in the descrambled file; only the letters should have been changed. Additionally, if a letter was capitalized in the original file, it should be capitalized in the descrambled file (similarly, lowercase letters should still be lowercase).
I'm not asking to have the answer given to me, since this is for school. I just can't seem to properly understand these instructions, what exactly is it that I need to do to successfully decode the text? Mostly, I don't understand how I can use a letter as an index for a char array, aren't indexes always integers?
You didn't say what language you're working in, so I'll use C/Java. You'll want to compute an integer index. Assume for the moment that scrambled_char is an upper case letter then it's:
// index into descrambling array:
int index = scrambled_char - 'A' + 1;
This has value 1 for character A, 2 for B, etc. as the problem says. It sounds like you're being given the descrambling array. For example:
char descramble[] = "_ZYX ... ";
This would cause A to be translated to Z, B to Y, C to X, ...
The descrambled character will be
char descrambled_char = descramble[index];
Now you just need to work out how to handle lower case letters, white space, and punctuation.
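One possible way to handle those remaining cases in Java is sketched below. The table is a hypothetical reversed alphabet with an unused placeholder at index 0, matching the A = 1, B = 2 indexing above; the class name Descrambler and method descramble are likewise invented, and the real assignment would supply its own array.

```java
class Descrambler {
    // Hypothetical table: index 0 is a placeholder so that A maps to index 1, B to 2, ...
    static final char[] TABLE = "_ZYXWVUTSRQPONMLKJIHGFEDCBA".toCharArray();

    static String descramble(String scrambled, char[] table) {
        StringBuilder out = new StringBuilder(scrambled.length());
        for (char c : scrambled.toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                // Uppercase letter: look it up directly.
                out.append(table[c - 'A' + 1]);
            } else if (c >= 'a' && c <= 'z') {
                // Lowercase letter: same lookup, then restore the lowercase form.
                out.append(Character.toLowerCase(table[c - 'a' + 1]));
            } else {
                // Whitespace and punctuation pass through unchanged.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(descramble("Svool, Dliow!", TABLE));
    }
}
```

Subtracting 'A' or 'a' from a char yields an int, which is why a letter can serve as an array index.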
