Creating word pairs, triplets etc for evaluation in Bleu - java

I need to create a list of word pairs, triplets etc for evaluation in the Bleu metric. Bleu starts with unigrams (a single word) and goes up to N-grams - the N being specified at runtime.
For example, given the sentence
"Israeli officials are responsible for airport security"
For unigrams it would just be a list of the words. For bigrams it would be
Israeli officials
officials are
are responsible
responsible for
for airport
airport security
The relevant trigrams are
Israeli officials are
officials are responsible
are responsible for
responsible for airport
for airport security
I've coded a working Bleu that hard codes the NGrams to 4 and brute forces the calculations of the unigrams etc. It's ugly as hell, and besides, I need to be able to supply the N at run time.
The snippet that's trying to generate the pairs / triplets etc -
String current = "";
int temp = 0;
for (int i = 0; i < goldWords.length - N_GRAM_ORDER; i++) {
current = current + ":" + goldWords[i];
while (temp < N_GRAM_ORDER) {
current = current + ":" + goldWords[temp + i];
temp++;
}
goldNGrams.add(current);
current = "";
temp = 0;
}
}
Edit - so the output from this snippet should be for bigrams -
israeli:officials
officials:are
are:responsible
responsible:for
for:airport
airport:security
Where goldWords is a String array containing the individual words to be made into NGrams.
I've been tinkering with this loop for days, drawing out the relationships etc and it just won't click for me. Can anyone see what I'm doing wrong?

I would change this:
String current = "";
int temp = 0;
for (int i = 0; i < goldWords.length - N_GRAM_ORDER; i++) {
current = current + ":" + goldWords[i];
while (temp < N_GRAM_ORDER) {
current = current + ":" + goldWords[temp + i];
temp++;
}
goldNGrams.add(current);
current = "";
temp = 0;
}
}
to this:
for (int i = 0; i < goldWords.length; i++) {
    String current = "";
    for (int j = 0; j < N_GRAM_ORDER; j++) {
        if (i + j < goldWords.length) {
            // put the separator between words only, not before the first one
            current += (j == 0 ? "" : ":") + goldWords[i + j];
        }
    }
    goldNGrams.add(current);
}
The outer for loop iterates over the first word of each n-gram; the inner loop iterates over all the words to include in it. Note that the if statement is there to prevent an array out of bounds error. It should be moved outside the inner for loop if you only want complete n-grams.
With the if statement where it is you will get:
Israeli:officials
officials:are
are:responsible
responsible:for
for:airport
airport:security
security
If you want:
Israeli:officials
officials:are
are:responsible
responsible:for
for:airport
airport:security
instead, try this code:
for (int i = 0; i < goldWords.length; i++) {
    // only start an n-gram where a complete one still fits
    if (i + N_GRAM_ORDER <= goldWords.length) {
        String current = "";
        for (int j = 0; j < N_GRAM_ORDER; j++) {
            current += (j == 0 ? "" : ":") + goldWords[i + j];
        }
        goldNGrams.add(current);
    }
}
(The above code was written without running it through a compiler, so there might be an off-by-one or minor syntax error in it. Validate it, but it will get you close.)

Here's an alternative that uses a String[] to collect the ngrams instead of a string. I changed the number of iterations on the outer for loop to ensure it captures the last n-gram.
public static List<String[]> ngrams(String[] gold, int n_length) {
List<String[]> list = new ArrayList<String[]>();
for (int i = 0; i < gold.length - (n_length-1); i++) {
String[] ngram = new String[n_length];
for(int j = 0; j < n_length; j++) {
ngram[j] = gold[i+j];
}
list.add(ngram);
}
return list;
}
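Since Bleu needs every order from unigrams up to N, the sliding-window idea from the answers above can be wrapped in a small helper and called once per order. This is only a sketch: the class name NGramSketch and the methods ngrams and allOrders are made up for illustration, and it joins words with ":" as in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NGramSketch {
    /** Returns all complete n-grams of the given order, words joined by ':'. */
    static List<String> ngrams(String[] words, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> result = new ArrayList<>();
        // i is the index of the first word; stop while a full window still fits
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(":", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    /** Returns the n-gram lists for orders 1..maxOrder; index 0 holds the unigrams. */
    static List<List<String>> allOrders(String[] words, int maxOrder) {
        List<List<String>> byOrder = new ArrayList<>();
        for (int n = 1; n <= maxOrder; n++) {
            byOrder.add(ngrams(words, n));
        }
        return byOrder;
    }

    public static void main(String[] args) {
        String[] gold = "Israeli officials are responsible for airport security".split(" ");
        for (List<String> grams : allOrders(gold, 3)) {
            System.out.println(grams);
        }
    }
}
```

If the order is larger than the number of words, the list for that order is simply empty.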

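If you want consecutive, non-overlapping groups of N_GRAM_ORDER words rather than a sliding window, step i by N_GRAM_ORDER:
int N_GRAM_ORDER = 3, temp = 0, i;
String current = "";
for (i = 0; i <= goldWords.length - N_GRAM_ORDER; i += N_GRAM_ORDER) {
    while (temp < N_GRAM_ORDER) {
        // put the separator between words only
        current = current + (temp == 0 ? "" : ":") + goldWords[temp + i];
        temp++;
    }
    goldGrams.add(current);
    current = "";
    temp = 0;
}
// collect any leftover words that do not fill a whole group
if (i < goldWords.length) {
    temp = i;
    while (temp < goldWords.length) {
        current = current + (temp == i ? "" : ":") + goldWords[temp++];
    }
    goldGrams.add(current);
}
Output: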
Israeli:officials:are
responsible:for:airport
security

Related

java - Loop optimization on int array

I would like to know the fastest way to read from and write to an int array.
Here is my Java code. I have three int arrays: two are only read and one is written to.
for(int j = h20 ; j < h21 ; j++){
for(int i = w20 ; i < w21 ; i++){
if( int_color == arr3[j*h31 + i] ) continue; //condition
arr1[(j+decY)*w11 + i+decX] = arr2[j*w21 + i];
}
}
My code is a classic 2D array loop with just one special condition to check.
Is there another way to write this code that decreases processing time?
Thanks.
You can reduce the number of calculations by hoisting them into variables. In your case, any calculation that relies on j alone doesn't have to be inside the inner loop, because its result won't change while the inner loop runs. Instead, calculate the values outside and only use the results in the inner loop.
for(int j = h20 ; j < h21 ; j++){
int tmp1 = j*h31;
int tmp2 = (j+decY)*w11 + decX;
int tmp3 = j*w21;
// j won't change inside here, so you can simply use the precalculated values
for(int i = w20 ; i < w21 ; i++){
if( int_color == arr3[tmp1 + i] ) continue; //condition
arr1[tmp2 + i] = arr2[tmp3 + i];
}
}
Edit: If you want to reduce this even more, you could rewrite the calculation for tmp2:
(j+decY)*w11 + decX ==> j*w11 + decY*w11 + decX
Then, you could extract the decY*w11 + decX into its own variable outside the first loop.
int tmp0 = decY*w11 + decX;
for(int j = h20 ; j < h21 ; j++){
int tmp1 = j*h31;
int tmp2 = j*w11 + tmp0;
int tmp3 = j*w21;
// j won't change inside here, so you can simply use the precalculated values
for(int i = w20 ; i < w21 ; i++){
if( int_color == arr3[tmp1 + i] ) continue; //condition
arr1[tmp2 + i] = arr2[tmp3 + i];
}
}
But this will save you only one addition per iteration, so I don't think it's worth the extra effort.
Removing calculations, especially multiplications, might help.
For arr3 this would be:
final int icolor = int_color;
int ix3 = h20*h31 + w20; // index of the first element read from arr3
final int dx3 = h31 - (w21 - w20); // jump from the end of one row to the start of the next
for (int j = h20; j < h21; ++j) {
    for (int i = w20; i < w21; ++i) {
        assert ix3 == j*h31 + i;
        if (icolor != arr3[ix3]) {
            arr1[(j+decY)*w11 + i+decX] = arr2[j*w21 + i];
        }
        ++ix3;
    }
    ix3 += dx3;
}
Whether this is really worthwhile needs to be tested.
Depending on the frequency of the condition, one might think of using System.arraycopy for consecutive ranges of i.
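The System.arraycopy idea could look roughly like this for a single row. It is a sketch under assumptions: the class MaskedCopy, the method copyRow and its offset parameters are hypothetical names. In terms of the question's variables, one would presumably call it once per j with srcOff = j*w21 + w20, dstOff = (j+decY)*w11 + w20 + decX, maskOff = j*h31 + w20 and len = w21 - w20. Whether it beats the plain loop depends on how long the runs are, so measure it.

```java
import java.util.Arrays;

class MaskedCopy {
    /**
     * Copies src into dst for every position whose mask value differs from
     * skipColor, copying each consecutive run with one System.arraycopy call.
     */
    static void copyRow(int[] src, int srcOff, int[] dst, int dstOff,
                        int[] mask, int maskOff, int len, int skipColor) {
        int i = 0;
        while (i < len) {
            // skip positions that must not be copied
            while (i < len && mask[maskOff + i] == skipColor) {
                i++;
            }
            int runStart = i;
            // extend the run while positions may be copied
            while (i < len && mask[maskOff + i] != skipColor) {
                i++;
            }
            if (i > runStart) {
                System.arraycopy(src, srcOff + runStart, dst, dstOff + runStart, i - runStart);
            }
        }
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4, 5};
        int[] dst = new int[5];
        int[] mask = {9, 0, 0, 9, 0};
        copyRow(src, 0, dst, 0, mask, 0, src.length, 9);
        System.out.println(Arrays.toString(dst));
    }
}
```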

Java - 2D arrays, checking for duplication

The variable "num" is a 2D array. I'm trying to check in that array, if there are any duplicates. "num" is a user-input.
I have looked extensively through the Java documentation and asked my lecturers, and I can't get a working answer. I understand the concept and what I'm meant to do, but I just can't get the coding right.
Here is my code:
for(int i = 0; i < 3; i++){ //3 rows with 5 numbers each
for(int j = 0; j < 5; j++){
num[i][j] = Integer.parseInt(JOptionPane.showInputDialog(null, "Enter value for line: " + i + " and position: "+ j ));
if((num[i][j] == num[i][0]) || (num[i][j] == num[i][1]) ||(num[i][j] == num[i][2]) || (num[i][j] == num[i][3]) || (num[i][j] == num[i][4])){
if(num[i][j] != 0){
num[i][j] = Integer.parseInt(JOptionPane.showInputDialog(null, "ERROR. Enter value for line: " + i + " and position: "+ j ));
}
}
}
}
I have also tried using HashSet, but I think that only works with 1D arrays.
I would like to use something like this, as it is the approach I understand best:
secret = new Random().ints(1, 40).distinct().limit(5).toArray();
But obviously not with Random.
I've tried this:
Set<Integer> check = new HashSet<>();
Random gen = new Random();
for(int i = 0; i < 3; i++){ // 3 rows, 5 numbers
for(int j = 0; j < 5; j++){
num[i][j] = Integer.parseInt(JOptionPane.showInputDialog(null, "Enter value for row " + i + " and position " + j));
check.add(gen.nextInt(num[i][j]));
}
}
This last section of coding (directly above this) compiles and runs, but doesn't check for duplicates.
There are alternative ways of checking for duplicates (e.g. you could loop back through the data you've previously entered into the 2D array), but here's how I'd go about using a Set. I'm assuming you are trying to populate the 2D array with all unique values, where each value comes from the user. (Stating this explicitly in the original post would have been very helpful; thanks to Michael Markidis for pointing it out.)
From a UX point of view, showing the error in its own dialog is helpful to the end user, since an error combined with a re-input prompt is confusing.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.swing.JOptionPane;
public class App {
public static void main(String[] args) {
int[][] num = new int[3][5];
System.out.println("Before:");
for (int i = 0; i < 3; ++i)
System.out.println(Arrays.toString(num[i]));
Set<Integer> data = new HashSet<Integer>();
for (int i = 0; i < 3; i++) { // 3 rows with 5 numbers each
for (int j = 0; j < 5; j++) {
boolean isGoodInput = false;
while (!isGoodInput) {
String input = JOptionPane.showInputDialog(null, "Enter value for line: " + i + " and position: " + j);
Integer n = Integer.parseInt(input);
if (data.contains(n)) {
JOptionPane.showMessageDialog(null, "ERROR: Try again");
} else {
num[i][j] = n;
isGoodInput = data.add(n);
}
}
}
}
System.out.println("After:");
for (int i = 0; i < 3; ++i)
System.out.println(Arrays.toString(num[i]));
}
}
Note: the 2D array is limited to your specification in the original post as a 3x5, so you'd have to change these values in multiple places to make different sized arrays - perhaps making these more dynamic could speed up further development of this application in the future.
Here's one way to accomplish this where you use the hashset to track what has already been inserted into the 2D array:
int[][] num = new int[3][5];
Set<Integer> check = new HashSet<>();
for (int i = 0; i < 3; i++)
{ // 3 rows, 5 numbers
for (int j = 0; j < 5; j++)
{
int n = 0;
do
{
n = Integer.parseInt(JOptionPane.showInputDialog(null, "Enter value for row " + i + " and position " + j));
} while (!check.add(n)); // keep looping if it was in the hashset
// add it to the array since we know n is not a duplicate at this point
num[i][j] = n;
}
}
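If the array is already filled and you only want to know whether it contains a repeated value, the same Set trick works without any dialogs. A minimal sketch; the class DuplicateCheck and the method hasDuplicates are hypothetical names, not part of the answers above:

```java
import java.util.HashSet;
import java.util.Set;

class DuplicateCheck {
    /** Returns true if any value occurs more than once anywhere in the 2D array. */
    static boolean hasDuplicates(int[][] grid) {
        Set<Integer> seen = new HashSet<>();
        for (int[] row : grid) {
            for (int value : row) {
                // Set.add returns false when the value was already present
                if (!seen.add(value)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[][] num = {
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 3, 14, 15}
        };
        System.out.println(hasDuplicates(num));
    }
}
```

A HashSet works here because it does not care about the array's shape; it only sees the values as they are visited.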

Getting 'trigrams' in Java

I am having a bit of an issue getting trigrams in Java. My program can currently get bigrams fine but when I try to implement the same structure of the method and change it to get trigrams it seems to not work as well.
I want the trigrams to get every possible combination of words within the arraylist, e.g.
Original = [eye, test, find, free, nhs]
Trigram = [eye test find, 2, eye test free, 3, eye test nhs, 4, eye find free, 3, eye find nhs, 4, eye free nhs, 4, etc...]
The numbers give the distance between the first word and the last word, and the list should contain every combination of three words in the arraylist. This currently works fine for bigrams...
Original = [eye, test, find, free, nhs]
Bigram = [eye test, 1, eye find, 2, eye free, 3, eye nhs, 4, test find, 1, test free, 2, test nhs, 3, find free, 1, etc..]
Here are the methods
public ArrayList<String> bagOfWords;
public ArrayList<String> bigramList = new ArrayList<String>();
public ArrayList<String> trigramList = new ArrayList<String>();
public void trigram() throws FileNotFoundException{
PrintWriter tg = new PrintWriter(new File(trigramFile));
// CREATES THE TRIGRAM
for (int i = 0; i < bagOfWords.size() - 1; i++) {
for (int j = 1; j < bagOfWords.size() - 1; j++) {
for(int k = j + 1; k < bagOfWords.size(); k++){
int distance = (k - i);
if (distance < 4){
trigramList.add(bagOfWords.get(i) + " " + bagOfWords.get(j) + " " + bagOfWords.get(k) + ", " + distance);
}
}
}
}
public void bigram() throws FileNotFoundException{
// CREATES THE BIGRAM
PrintWriter bg = new PrintWriter(new File(bigramFile));
for (int i = 0; i < bagOfWords.size() - 1; i++) {
for (int j = i + 1; j < bagOfWords.size(); j++) {
int distance = (j - i);
if (distance < 4){
bigramList.add(bagOfWords.get(i) + " " + bagOfWords.get(j) + ", " + distance);
}
}
}
Can anyone help me alter the trigram() method to create an appropriate trigram for what I need?
Thanks for any help.
You want j to start at i+1, don't you? Also, I think you are letting i count too far. It should stop at bagOfWords.size() - 2. I am not sure why you check distance < 4. This will throw out valid groups.
public void trigram() throws FileNotFoundException{
    PrintWriter tg = new PrintWriter(new File(trigramFile));
    // CREATES THE TRIGRAM
    for (int i = 0; i < bagOfWords.size() - 2; i++) {
        for (int j = i + 1; j < bagOfWords.size() - 1; j++) {
            for (int k = j + 1; k < bagOfWords.size(); k++) {
                // distance between the first and the last word
                int distance = (k - i);
                trigramList.add(bagOfWords.get(i) + " " + bagOfWords.get(j) + " " + bagOfWords.get(k) + ", " + distance);
            }
        }
    }
}
The answer from @bradimus is exactly right. I'm just going to show another approach. Did you notice that your methods are very similar? So why not try to merge them into one universal method? Something like the following:
public List<String> anygram(List<String> bagOfWords, int gramCount){
List<String> result = new ArrayList<String>();
for(int i=0;i<=bagOfWords.size()-gramCount; i++){
for(int j=i; j+gramCount<=bagOfWords.size(); j++){
StringBuilder builder = new StringBuilder();
builder.append(bagOfWords.get(i));
int k = j+1;
for(; k<j+gramCount; k++){
builder.append(" ");
builder.append(bagOfWords.get(k));
}
builder.append(", ").append(k-i-1);
result.add(builder.toString());
}
}
return result;
}
My answer is not for rating. I just became interested in this task and came up with this solution.
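For any number of words per group, the nested loops can also be replaced by a small recursive helper that picks increasing indices, which also covers groups whose later words are not adjacent (such as "eye test free"). This is a sketch, not the approach of either answer; the names WordCombinations, combinations and collect are invented for illustration. The distance is taken as the last index minus the first, as described in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordCombinations {
    /** Returns every in-order combination of n words as "w1 w2 ... wn, distance". */
    static List<String> combinations(List<String> words, int n) {
        List<String> result = new ArrayList<>();
        if (n >= 1) {
            collect(words, n, 0, new ArrayList<Integer>(), result);
        }
        return result;
    }

    // picked holds the indices chosen so far, always in increasing order
    private static void collect(List<String> words, int n, int start,
                                List<Integer> picked, List<String> result) {
        if (picked.size() == n) {
            StringBuilder sb = new StringBuilder();
            for (int p = 0; p < n; p++) {
                if (p > 0) {
                    sb.append(' ');
                }
                sb.append(words.get(picked.get(p)));
            }
            int distance = picked.get(n - 1) - picked.get(0);
            result.add(sb + ", " + distance);
            return;
        }
        for (int i = start; i < words.size(); i++) {
            picked.add(i);
            collect(words, n, i + 1, picked, result);
            picked.remove(picked.size() - 1); // undo the choice (removes by index)
        }
    }

    public static void main(String[] args) {
        List<String> bag = Arrays.asList("eye", "test", "find", "free", "nhs");
        System.out.println(combinations(bag, 3));
    }
}
```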

Java code for finding element for attribute

I am writing Java code to find the element (e.g. food) that contains a given attribute (attr, e.g. description) in XML.
What I have done
I take the position of the first char of attr as the start position and walk backwards until I find "<". From that point I loop forward to skip whitespace, tabs etc. and then append valid chars until I find another space or tab. But this is not working. It is supposed to find the first word after "<" and print only that, but it actually runs to the end of the string from the first char after "<".
Note: I cannot use XML parsers like DOM etc.
XML
<breakfast_menu>
<food name="Belgian Waffles" price="$5.95" discount="20"
description="Two" calories="650" />
</breakfast_menu>
for (int k = start_position; k >= 0; k--) {
char testChar = xmlInString.charAt(k);
String temp = Character.toString(testChar);
if ("<".equals(temp)) {
System.out.println(k + "value of k");
for (int j = k + 1; j >= 0; j++) {
char appendChar = xmlInString.charAt(j);
String temp1 = Character.toString(appendChar);
if(!("".equals(temp1))) {
System.out.println(j + " " + appendChar);
}
}
}
}
I have made a few changes and it is now working as expected, i.e. it gives the first word after "<". If you have any suggestions, please let me know.
int k, j, l;
for (k = start_position; k >= 0; k--) {
char testChar = xmlInString.charAt(k);
String temp = Character.toString(testChar);
if ("<".equals(temp)) {
break;
}
}
// skip any whitespace directly after '<'
for (j = k + 1; j <= start_position; j++) {
    if (!Character.isWhitespace(xmlInString.charAt(j))) {
        break;
    }
}
// advance to the first whitespace after the element name
for (l = j; l <= start_position; l++) {
    if (Character.isWhitespace(xmlInString.charAt(l))) {
        break;
    }
}
System.out.println(xmlInString.substring(j, l));
I think this will help
String str = "<food name=\"Belgian Waffles\" price=\"$5.95\" discount=\"20\"\n" ;
String res = str.substring(str.indexOf("<"), str.indexOf(" ", str.indexOf("<") + "<".length()) + " ".length());
System.out.println(res);
output
<food
Your inner for loop, for (int j = k + 1; j >= 0; j++) {, is broken. The condition j >= 0 doesn't make sense: j only increments, so how could it ever be < 0? Instead you should be looking for the next space, tab, etc.
I don't think your solution is going to work, but that is what's wrong with the code you have posted and why it's running off the end.
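The backward search for "<" followed by a forward scan to the next whitespace can be written compactly with lastIndexOf and Character.isWhitespace. A minimal sketch without an XML parser; ElementFinder and elementNameFor are hypothetical names, and it assumes the attribute text followed by "=" appears only as an attribute name, not inside a quoted value or as the tail of a longer name.

```java
class ElementFinder {
    /** Returns the name of the element that carries the attribute, or null if not found. */
    static String elementNameFor(String xml, String attribute) {
        int attrPos = xml.indexOf(attribute + "=");
        if (attrPos < 0) {
            return null;
        }
        // nearest '<' at or before the attribute position
        int open = xml.lastIndexOf('<', attrPos);
        if (open < 0) {
            return null;
        }
        int start = open + 1;
        // skip whitespace between '<' and the element name
        while (start < attrPos && Character.isWhitespace(xml.charAt(start))) {
            start++;
        }
        int end = start;
        // the name ends at the first space, tab or newline
        while (end < attrPos && !Character.isWhitespace(xml.charAt(end))) {
            end++;
        }
        return xml.substring(start, end);
    }

    public static void main(String[] args) {
        String xml = "<breakfast_menu>\n"
                + "<food name=\"Belgian Waffles\" price=\"$5.95\" discount=\"20\"\n"
                + "description=\"Two\" calories=\"650\" />\n"
                + "</breakfast_menu>";
        System.out.println(elementNameFor(xml, "description"));
    }
}
```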

Generating random number of loops

I have this problem that I have observed when generating numbers as a condition in a for loop.
I use this in my android program.
When I do this:
String temp = "";
for (int i = 0; i < new Random().nextInt(1000); i++) {
temp += i + " ";
}
I always get no more than 100
But when I do this:
for (int i = 0; i < 10; i++) {
temp += new Random().nextInt(1000) + " ";
}
I got real random numbers ranging from 0 to 999.
What is actually happening?
I know I could do this:
int x = new Random().nextInt(1000);
for (int i = 0; i < x; i++) {
temp += i + " ";
}
And this does return random numbers from 0-999. But I just want to understand why the first code only returns numbers not more than 100.
for (int i = 0; i < new Random().nextInt(1000); i++) { // here upper limit of i will change time to time.
temp += i + " ";
}
.
for (int i = 0; i < 10; i++) { // here i increase up to 10
temp += new Random().nextInt(1000) + " ";
}
.
int x = new Random().nextInt(1000); // here x is random but this will never change while for loop is running
for (int i = 0; i < x; i++) {
temp += i + " ";
}
In this code
for (int i = 0; i < new Random().nextInt(1000); i++) {
temp += i + " ";
}
the variable i is incremented by one on each iteration of the loop, but at any point while i < 100 there is a chance that a random number smaller than i is generated, and then the loop exits.
You should generate the random bound before the loop:
int max = new Random().nextInt(1000);
String temp = "";
for (int i = 0; i < max; i++)
{ temp += i + " "; }
Java Doc says:
Returns a pseudorandom, uniformly distributed int value between 0
(inclusive) and the specified value (exclusive)
So it is possible that value of i can be larger than the random generated by nextInt() and the loop exits.
Because a new random number is generated with nextInt(1000) on each iteration of the for loop, the bound is not fixed; it keeps changing, and so does your output.
String temp = "";
for (int i = 0; i < new Random().nextInt(1000); i++) { //new random nextInt() called on each iteration
temp += i + " ";
}
Note: My program ran till 64
Your first implementation ...
for (int i = 0; i < new Random().nextInt(1000); i++) {
temp += i + " ";
}
is calling new Random().nextInt(1000) before every iteration to determine whether it's time to end the loop. The same code could be rewritten as follows:
int i = 0;
while (i < new Random().nextInt(1000)) {
temp += i + " ";
i++;
}
which may be better illustrated as ...
int i = 0;
while (true) {
    // a fresh random bound is drawn on every check
    if (!(i < new Random().nextInt(1000))) {
        break;
    }
    temp += i + " ";
    i++;
}
so although the value of i is constantly increasing, the number against which it is being compared, new Random().nextInt(1000), is constantly changing. Your comparisons might look like this ...
if (!(0 < 981)) break;
if (!(1 < 27)) break;
if (!(2 < 523)) break;
if (!(3 < 225)) break;
if (!(4 < 198)) break;
if (!(5 < 4)) break;
In the above example, even though the first call to new Random().nextInt(1000) returns a whopping 981, the loop body only runs 5 times, because at the beginning of the 6th iteration, new Random().nextInt(1000) returned 4, which is less than 5, the current value of i.
Hope this helps!
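To see the difference without relying on luck, the random call can be swapped for a supplier that replays a fixed sequence of bounds, here the same numbers as in the example above. This is only an illustration; LoopBoundDemo and its method names are made up, and IntSupplier stands in for new Random().nextInt(1000).

```java
import java.util.function.IntSupplier;

class LoopBoundDemo {
    /** The bound is asked for again before every iteration, as in the first loop. */
    static int iterationsWithRerolledBound(IntSupplier bound) {
        int count = 0;
        for (int i = 0; i < bound.getAsInt(); i++) {
            count++;
        }
        return count;
    }

    /** The bound is asked for once, before the loop starts. */
    static int iterationsWithFixedBound(IntSupplier bound) {
        int limit = bound.getAsInt();
        int count = 0;
        for (int i = 0; i < limit; i++) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] rolls = {981, 27, 523, 225, 198, 4};
        int[] next = {0};
        // replays the fixed sequence instead of drawing random numbers
        IntSupplier replay = () -> rolls[next[0]++];

        System.out.println(iterationsWithRerolledBound(replay)); // stops when 5 < 4 fails
        next[0] = 0;
        System.out.println(iterationsWithFixedBound(replay));    // uses only the first roll
    }
}
```

With a real Random in place of the replayed sequence, the re-rolled loop ends as soon as any single draw is not greater than i, which is why it rarely gets far.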
Here
String temp = "";
for (int i = 0; i < new Random().nextInt(1000); i++) {
temp += i + " ";
}
i may get bigger than 100, but that is very unlikely.
The first for loop usually stops before i reaches 100, though not always, because its bound is drawn again on every check. The third loop gives the series of numbers from 0 up to the single random bound returned by Random.nextInt(1000).
The second one picks 10 random numbers from 0 to 999, and the result looks like (373 472 7 56 344 423 764 722 554 800)
