I'm working on an NLP project and trying to match a given input word with a root stored in an ArrayList.
For example, the user will enter لاعبون and the code should find the word لعب in the ArrayList, but when I run my code it gives me more than one root.
for(String dbData : rootList) {
//System.out.println(dbData);
// if(dbData.contains(x)) {
// System.out.println(dbData);
// }
for (int i = 0; i < dbData.length(); i++) {
c = dbData.charAt(i);
for (int j = 0; i < x.length(); i++) {
d = x.charAt(i);
if (c == d && m != rootList.size()) {
match = true;
//System.out.println(dbData);
} else {
++m;
match = false;
//System.out.println("لا يوجد تطابق");
}
if(match) {
System.out.println(dbData);
container = dbData;
}
}
}
}
This does not look like the right approach to stemming. Here is a simple way to find stems in Arabic instead.
First you need a list of stems, which you evidently already have.
Then you need to write down the Arabic morphological rules and forms that map a word to its stem.
Finally, convert those rules to Java regular expressions.
For example, to find لعب from لاعبون, first remove ون, since it marks person and number; then check whether لاعب is derived from one of the stems. Because لاعب is the فاعل form of لعب, you should choose لعب.
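The steps above can be sketched roughly as follows. This is a minimal illustration, not a real stemmer: it knows only one suffix rule (ون) and one form (فاعل), and the names RootFinder, findRoot, PLURAL_SUFFIX and FAEL_FORM are made up for this example.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RootFinder {
    // Assumed rule 1: strip the plural suffix "ون" from the end of the word.
    private static final Pattern PLURAL_SUFFIX = Pattern.compile("ون$");
    // Assumed rule 2: the فاعل form is radical, alif, radical, radical.
    private static final Pattern FAEL_FORM = Pattern.compile("^(.)ا(.)(.)$");

    // Returns the matching root from the list, or empty if no rule applies.
    static Optional<String> findRoot(String word, List<String> roots) {
        String stripped = PLURAL_SUFFIX.matcher(word).replaceFirst("");
        if (roots.contains(stripped)) {
            return Optional.of(stripped);
        }
        Matcher m = FAEL_FORM.matcher(stripped);
        if (m.matches()) {
            // Drop the alif to recover the three radicals.
            String candidate = m.group(1) + m.group(2) + m.group(3);
            if (roots.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Calling RootFinder.findRoot("لاعبون", rootList) with a list that contains لعب yields exactly that one root, because a candidate is accepted only when the whole word fits a rule, rather than when single characters happen to match.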
In my app I need to generate passwords based on all available national characters, like:
private String generatePassword(String charSet, int passwordLength) {
char[] symbols=charSet.toCharArray();
StringBuilder sbPassword=new StringBuilder();
Random wheel = new Random();
for (int i = 0; i < passwordLength; i++) {
int random = wheel.nextInt(symbols.length);
sbPassword.append(symbols[random]);
}
return sbPassword.toString();
}
For Latin we have something like:
charSet="AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz";
How can I get a similar String containing all national characters (the alphabet) for, say, Thai, Arabic or Hebrew?
We all know that Unicode contains the national characters for every Locale, so there has to be a way to get them; otherwise I'd be forced to hardcode national alphabets, which is ugly (my app supports more than 10 locales).
Since you're using char[], you won't be able to represent every Unicode code point in every script, because some of them lie outside the Basic Multilingual Plane and do not fit in a single char. Unfortunately, there is no easy way to get all the code points for a script other than looping over them, like so:
char[] charsForScript(Character.UnicodeScript script) {
    StringBuilder sb = new StringBuilder();
    // Only the BMP is scanned here; <= so that Character.MAX_VALUE itself is included.
    for (int cp = 0; cp <= Character.MAX_VALUE; ++cp) {
        if (Character.isValidCodePoint(cp) && script == Character.UnicodeScript.of(cp)) {
            sb.appendCodePoint(cp);
        }
    }
    return sb.toString().toCharArray();
}
This returns all the BMP chars for a given script, e.g. LATIN or GREEK.
To get all code points, even outside the BMP, you could use:
// Requires: import java.util.ArrayList; import java.util.List;
int[] charsForScript(Character.UnicodeScript script) {
    List<Integer> ints = new ArrayList<>();
    // <= so that Character.MAX_CODE_POINT itself is included.
    for (int cp = 0; cp <= Character.MAX_CODE_POINT; ++cp) {
        if (Character.isValidCodePoint(cp) && script == Character.UnicodeScript.of(cp)) {
            ints.add(cp);
        }
    }
    return ints.stream().mapToInt(i -> i).toArray();
}
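Building on the loop above, here is a hedged sketch of how the result could feed the password generator from the question. It keeps only letters (via Character.isLetter), so the digits, marks and punctuation that belong to the same script are left out, and it appends code points rather than chars so that supplementary characters survive. The names ScriptPasswords and lettersForScript, and the switch from Random to SecureRandom, are assumptions of this example rather than part of the original code.

```java
import java.security.SecureRandom;
import java.util.stream.IntStream;

class ScriptPasswords {
    // All letter code points (BMP and beyond) that belong to the given script.
    static int[] lettersForScript(Character.UnicodeScript script) {
        return IntStream.rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
                .filter(Character::isLetter)
                .filter(cp -> Character.UnicodeScript.of(cp) == script)
                .toArray();
    }

    // Picks `length` random code points from the alphabet.
    static String generatePassword(int[] alphabet, int length) {
        SecureRandom random = new SecureRandom();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            sb.appendCodePoint(alphabet[random.nextInt(alphabet.length)]);
        }
        return sb.toString();
    }
}
```

For example, generatePassword(lettersForScript(Character.UnicodeScript.THAI), 12) produces a 12-character Thai password. Scanning the whole code point range takes a moment, so it may be worth computing each alphabet once and caching it.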
I need to parse a large file that contains more than one JSON document. I couldn't find a way to do it. The file looks like BSON for MongoDB.
File example:
{"column" : value, "column_2" : value}
{"column" : value, "column_2" : value}
....
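You need to determine where one JSON document ends and the next begins within the file. If each document is on its own line, this is easy: read the file line by line. If not, you can loop through the characters, counting opening and closing braces, to locate the boundaries between documents. Note that this simple count assumes no braces appear inside string values.
char[] characters = fileContents.toCharArray(); // fileContents (assumed name): the file's text as a String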
int openBraceCount = 0;
ArrayList<Integer> breakPoints = new ArrayList<>();
for(int i = 0; i < characters.length; i++) {
if(characters[i] == '{') {
openBraceCount++;
} else if(characters[i] == '}') {
openBraceCount--;
if(openBraceCount == 0) {
breakPoints.add(i + 1);
}
}
}
You can then break the file apart at each break point and pass the individual JSON documents to your favorite JSON library.
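As a variation on the brace-counting loop, the sketch below also tracks whether the scanner is inside a string literal, so that a brace in a value such as "}" does not throw the count off. It is only an illustration under the assumption that the whole text fits in memory; the names JsonSplitter and split are made up here, and for a truly large file the same state machine would have to be driven from a Reader instead.

```java
import java.util.ArrayList;
import java.util.List;

class JsonSplitter {
    // Splits text holding several top-level JSON objects into one string per object.
    static List<String> split(String text) {
        List<String> docs = new ArrayList<>();
        int depth = 0;            // current brace nesting level
        int start = -1;           // index where the current object began
        boolean inString = false; // inside a "..." literal?
        boolean escaped = false;  // previous char was a backslash inside a string?
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (ch == '\\') {
                    escaped = true;
                } else if (ch == '"') {
                    inString = false;
                }
            } else if (ch == '"') {
                inString = true;
            } else if (ch == '{') {
                if (depth++ == 0) {
                    start = i;
                }
            } else if (ch == '}' && depth > 0) {
                if (--depth == 0) {
                    docs.add(text.substring(start, i + 1));
                }
            }
        }
        return docs;
    }
}
```

Each element of the returned list can then be handed to the JSON library on its own.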
I have a piece of code that needs to be optimized.
for (int i = 0; i < wordLength; i++) {
for (int c = 0; c < alphabetLength; c++) {
if (alphabet[c] != x.word.charAt(i)) {
String res = WordList.Contains(x.word.substring(0,i) +
alphabet[c] +
x.word.substring(i+1));
if (res != null && WordList.MarkAsUsedIfUnused(res)) {
WordRec wr = new WordRec(res, x);
if (IsGoal(res)) return wr;
q.Put(wr);
}
}
}
}
Words are represented by strings. The problem is that the code on lines 4-6 creates too many String objects, because strings are immutable.
Which data structure should I switch my word representation to if I want faster code? I have tried changing it to char[], but then I can't get the following code to work:
x.word.substring(0,i)
How do I get a subarray from a char[]? And how do I concatenate a char and a char[] as on lines 4-6?
Is there any other suitable, mutable data structure that I can use? I have thought of StringBuffer but can't find suitable operations on it.
Given a specific word, this function generates all the words that differ from it by one character.
WordRec is just a class with a string representing a word, and a pointer to the "father" of that word.
Thanks in advance
You can reduce the number of objects with this approach:
StringBuilder tmp = new StringBuilder(wordLength);
tmp.append(x.word);
for (int i=...) {
for (int c=...) {
if (...) {
char old = tmp.charAt(i);
tmp.setCharAt(i, alphabet[c]);
String res = tmp.toString();
tmp.setCharAt(i, old);
...
}
}
}
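The same swap-and-restore idea also works with a plain char[], which avoids the subarray and concatenation problem from the question altogether. The sketch below is a self-contained illustration; the names Neighbours and oneLetterVariants are assumptions, and it simply collects the variants instead of calling WordList or the queue from the original code.

```java
import java.util.ArrayList;
import java.util.List;

class Neighbours {
    // Returns every word that differs from `word` in exactly one position.
    static List<String> oneLetterVariants(String word, char[] alphabet) {
        List<String> out = new ArrayList<>();
        char[] buf = word.toCharArray(); // one mutable buffer, reused throughout
        for (int i = 0; i < buf.length; i++) {
            char old = buf[i];
            for (char c : alphabet) {
                if (c != old) {
                    buf[i] = c;               // overwrite in place, no substring needed
                    out.add(new String(buf)); // one String per candidate
                }
            }
            buf[i] = old; // restore before moving to the next position
        }
        return out;
    }
}
```

One String per candidate is still created, because the lookup needs a key, but the two substring calls and the intermediate concatenation results per candidate are gone.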
I have a parameter which is obtained as a string
String Dept_ID[] = request.getParameterValues("dept_id");
in JSP. I have to insert the string into the DB, where the column type is numeric
(#DEPT_ID NUMERIC(10,0)).
How to perform the conversion?
Your code is receiving an array of strings. You can convert an entry from the array into a number using Integer.parseInt or Long.parseLong as appropriate.
For example:
String Dept_ID[] = request.getParameterValues("dept_id");
int[] ids = null;
if (Dept_ID != null) {
ids = new int[Dept_ID.length];
for (int index = 0; index < Dept_ID.length; ++index) {
ids[index] = Integer.parseInt(Dept_ID[index]);
}
}
If the number uses a different radix (number base) than 10, you can supply the radix as a second arg (see the links for details).
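As a small illustration of the radix overload, the sketch below parses the same value written in base 10 and in base 16. The class name RadixDemo and the method name parse are made up for this example; the call underneath is the standard Long.parseLong(String, int).

```java
class RadixDemo {
    // Parses text as a long in the given radix (10 for decimal, 16 for hex, and so on).
    static long parse(String text, int radix) {
        return Long.parseLong(text, radix);
    }
}
```

RadixDemo.parse("255", 10) and RadixDemo.parse("ff", 16) both return 255.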
The above answer is correct, but it doesn't take into account what happens when you get input, such as letters, that can't be converted. I'd wrap that part in a try/catch block.
Something like this (assuming you're using the code above):
String Dept_ID[] = request.getParameterValues("dept_id");
int[] ids = null;
if (Dept_ID != null) {
ids = new int[Dept_ID.length];
for (int index = 0; index < Dept_ID.length; index++) {
try {
ids[index] = Integer.parseInt(Dept_ID[index]);
}
catch ( NumberFormatException e ) {
System.out.println("Invalid crap.");
}
}
}
Also notice that I wrote index++ instead of ++index. In the update clause of a for loop the two behave identically, so this is purely a matter of style; neither form skips the first index of the array.
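One more point worth checking: a NUMERIC(10,0) column can hold values up to 9,999,999,999, which is larger than Integer.MAX_VALUE, so Long.parseLong may be the safer choice. The sketch below combines that with the try/catch idea and skips values that cannot be parsed. The names DeptIds and parseIds are assumptions for this example, and silently skipping bad input is just one possible policy.

```java
import java.util.ArrayList;
import java.util.List;

class DeptIds {
    // Converts raw request parameter values to longs, skipping anything unparsable.
    static List<Long> parseIds(String[] raw) {
        List<Long> ids = new ArrayList<>();
        if (raw == null) {
            return ids; // parameter was absent from the request
        }
        for (String value : raw) {
            if (value == null) {
                continue;
            }
            try {
                ids.add(Long.parseLong(value.trim()));
            } catch (NumberFormatException e) {
                // Not a number: ignore it here; a real page would likely report it.
            }
        }
        return ids;
    }
}
```

The resulting values can then be bound to the numeric column, for example with PreparedStatement.setLong.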
I have an array of Strings:
qTrees[0] = "023012311312201123123130110332";
qTrees[1] = "023012311130023103123130110332";
qTrees[2] = "023013200020123103123130110333";
qTrees[3] = "023013200202301123123130110333";
Using this loop, I'm trying to retrieve the part they have in common:
String similarPart = "";
for (int i = 0; i < qTrees[0].length(); i++){
if (qTrees[0].charAt(i) == qTrees[1].charAt(i) &&
qTrees[1].charAt(i) == qTrees[2].charAt(i) &&
qTrees[2].charAt(i) == qTrees[3].charAt(i) ){
similarPart += qTrees[0].charAt(i);
} else {
break;
}
}
But this is not what I want. As you can see, it returns only "02301", although more of the strings match further along.
Please suggest a better way to do it. Thanks.
You need to better define what you are trying to achieve. Do you want to:
find the longest common starting sequence between any two entries in the array;
find the longest common starting sequence across all of the entries in the array;
find the longest common sequence (i.e. same characters in same position) between any two entries;
find the longest common sequence across all entries in the array.
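Each of these calls for a slightly different approach, but they all come down to using break and continue correctly in your loops.
Remove the else part from your code. Then it will check until the end of the string. Note that this collects every matching position, so characters that are not adjacent in the originals end up next to each other in the result.
The code: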
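To make two of the options listed above concrete, here is a minimal sketch of the longest common starting sequence across all entries and the longest run of positions where all entries agree. The class and method names (CommonParts, commonPrefix, longestAlignedRun) are made up for this example, and the code assumes a non-empty array.

```java
class CommonParts {
    // True if every string has the same character at position i.
    static boolean allAgreeAt(String[] s, int i) {
        for (String t : s) {
            if (t.charAt(i) != s[0].charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Length of the shortest string, so charAt never runs past the end.
    static int minLength(String[] s) {
        int min = s[0].length();
        for (String t : s) {
            min = Math.min(min, t.length());
        }
        return min;
    }

    // Longest common starting sequence across all entries.
    static String commonPrefix(String[] s) {
        int n = minLength(s);
        int i = 0;
        while (i < n && allAgreeAt(s, i)) {
            i++;
        }
        return s[0].substring(0, i);
    }

    // Longest contiguous run of positions where all entries agree.
    static String longestAlignedRun(String[] s) {
        int n = minLength(s);
        int bestStart = 0, bestLen = 0, runStart = 0;
        for (int i = 0; i <= n; i++) {
            if (i == n || !allAgreeAt(s, i)) { // the current run ends here
                if (i - runStart > bestLen) {
                    bestStart = runStart;
                    bestLen = i - runStart;
                }
                runStart = i + 1;
            }
        }
        return s[0].substring(bestStart, bestStart + bestLen);
    }
}
```

For the four qTrees strings from the question, commonPrefix returns "02301" and longestAlignedRun returns "312313011033" (positions 17 to 28).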
for (int i = 0; i < qTrees[0].length(); i++){
if (qTrees[0].charAt(i) == qTrees[1].charAt(i) &&
qTrees[1].charAt(i) == qTrees[2].charAt(i) &&
qTrees[2].charAt(i) == qTrees[3].charAt(i) ){
similarPart += qTrees[0].charAt(i);
}
}