Comparing data from 2 files by tokenization - java

I'm reading 2 files: one named myFile, and the other named dictionary.
In dictionary, there are 2 values for each word.
So, I read each sentence in myFile and tokenize it to look up the values for each word.
My code is as below:
while ((text = file.readLine()) != null) { // read myFile content line by line
    ArrayList<String> content = new ArrayList<String>();
    StringTokenizer str = new StringTokenizer(text); // split line content
    while (str.hasMoreTokens()) {
        String token = str.nextToken();
        content.add(token);
    } // create an array to store the content of line
    // define subjective of each line
    boolean subjective = false;
    // compare from file content with SentiWordNet
    for (int i = 0; i < content.size(); i++) {
        String cont = content.get(i);
        while ((line = csv.readLine()) != null) {
            // read line from SentiWordNet
            String[] data = line.split("\t");
            // read data SentiWordnet
            String sentiWord = data[4];
            if (sentiWord.contains(cont)) {
                if (data[2] != "0" || data[3] != "0")
                    subjective = true;
            }
        }
    }
    System.out.println(subjective);
}
file is the reader for myFile (the sentences), and csv is the reader for the dictionary.
The problem now is that only the 1st token in myFile is compared against the dictionary, while the others are not.
Any idea how to solve this?

Looks like you are not closing and reopening the dictionary. This line of code:
while((line = csv.readLine()) != null)
will return null once you get to the end of the dictionary the first time (i.e. for the first word in myFile). For subsequent words, the loop exits immediately, because the reader is still positioned at the end of the file and you haven't closed and reopened it.
EDIT:
In looking at your code, you are trying to determine if a word is subjective by reading each sentence in myFile and looping over each word in the sentence, and for each such word, reading dictionary. If your myFile contains many sentences and words, you will be reading the dictionary (which is likely large) many times, which seems inefficient.
For example, if there are s sentences, each with w words, you will be opening and reading the entire dictionary s*w times.
Alternatively, what you could do is read the entire myFile into an array of sentences of length s, or even into an array of words of length s*w. This will take memory on the order of s*w (your current algorithm takes only w memory, since you create a single array to store the words in a sentence and re-use it for each sentence). Then, read in the dictionary once, and for each word in the dictionary, see if it is in the array of words/sentences.
A better approach which costs more memory (assuming your dictionary is bigger than myFile) might be to read the entire dictionary into memory and sort it. Then, read myFile and locate each word in your in-memory dictionary using an efficient search. Should be much faster, assuming your files are large.
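A rough sketch of the "read the dictionary once" idea, under some assumptions: the class and method names are made up for illustration, the dictionary lines are assumed to be tab-separated with the two scores in columns 2 and 3 and the terms in column 4 written as word#number, and a HashSet is used in place of sorting plus searching. It also matches whole words rather than using contains(), and it parses the scores as numbers, because comparing strings with != in Java compares references, not contents.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper, not part of the original code.
class SubjectivityLookup {

    // Reads the dictionary once and keeps only words with a non-zero score.
    // Assumed layout: tab-separated, scores in columns 2 and 3,
    // terms in column 4 as "word#n" separated by spaces.
    static Set<String> loadSubjectiveWords(BufferedReader csv) throws IOException {
        Set<String> subjective = new HashSet<>();
        String line;
        while ((line = csv.readLine()) != null) {
            String[] data = line.split("\t");
            if (line.startsWith("#") || data.length < 5) {
                continue; // skip comment lines and malformed lines
            }
            double pos;
            double neg;
            try {
                pos = Double.parseDouble(data[2]);
                neg = Double.parseDouble(data[3]);
            } catch (NumberFormatException e) {
                continue; // scores are not numeric, ignore the line
            }
            if (pos == 0.0 && neg == 0.0) {
                continue; // objective entry
            }
            for (String term : data[4].split(" ")) {
                int hash = term.indexOf('#');
                String word = hash >= 0 ? term.substring(0, hash) : term;
                if (!word.isEmpty()) {
                    subjective.add(word.toLowerCase());
                }
            }
        }
        return subjective;
    }

    // A sentence counts as subjective if any of its tokens is in the set.
    static boolean isSubjective(String sentence, Set<String> subjectiveWords) {
        StringTokenizer tokens = new StringTokenizer(sentence);
        while (tokens.hasMoreTokens()) {
            if (subjectiveWords.contains(tokens.nextToken().toLowerCase())) {
                return true;
            }
        }
        return false;
    }
}
```

With this, you would call loadSubjectiveWords(csv) once before the loop over myFile, and isSubjective(text, words) for each line you read.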

Related

Setting two different text files as separate string arrays and finding matches from the two arrays in Java

So basically I'm trying to take two text files (one with many jumbled words and one with many dictionary words). I am supposed to take these two text files and convert them to two separate arrays.
Following that, I need to compare the jumbled strings from the first array and match each dictionary word in the second array up to its jumbled counterpart (e.g. aannab in the first array to banana in the second array).
I know how to set one array from a string, however I don't know how to do two from two separate text files.
Use a HashMap for matching, where the data from the first text file is the key of the Map and the data from the second text file is the value. Then, by using the key, you will get the matching value.
You can read each file into an array like this:
String[] readFile(String filename) throws IOException {
    List<String> stringList = new ArrayList<>();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(new File(filename))))) {
        String line;
        while ((line = br.readLine()) != null) {
            stringList.add(line);
        }
    }
    return stringList.toArray(new String[stringList.size()]);
}
Next, try to do the matching:
String[] jumbles = readFile("jumbles.txt");
String[] dict = readFile("dict.txt");
for (String jumble : jumbles) {
    for (String word : dict) {
        // can only be a match if the same length
        if (jumble.length() == word.length()) {
            // next loop through each letter of jumble and see if it
            // appears in word.
        }
    }
}
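One way to fill in the letter comparison, and to bring in the HashMap idea, is to key the dictionary on each word's sorted letters: a jumble and a word are anagrams exactly when their sorted letters are equal. This is only a sketch; JumbleMatcher and its method names are made up for illustration, and lower-casing the words is an assumption about the data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original answers.
class JumbleMatcher {

    // "banana" and "aannab" both become "aaabnn".
    static String sortedLetters(String word) {
        char[] letters = word.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Maps each sorted-letter key to every dictionary word that has those letters.
    static Map<String, List<String>> indexDictionary(String[] dict) {
        Map<String, List<String>> index = new HashMap<>();
        for (String word : dict) {
            index.computeIfAbsent(sortedLetters(word), k -> new ArrayList<>()).add(word);
        }
        return index;
    }

    // Returns the dictionary words matching the jumble, or an empty list.
    static List<String> unscramble(String jumble, Map<String, List<String>> index) {
        return index.getOrDefault(sortedLetters(jumble), Collections.emptyList());
    }
}
```

You would build the index once from the dict array and then call unscramble for each entry in jumbles, which avoids the nested loop over both arrays.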
I know how to set one array from a string, however I don't know how to do two from two separate text files
I would encourage you to divide your problem into don't-knows and knows.
Search the internet for the don't-knows; you will find lots of ways to do them.
Then search for what you already know, to explore whether it can be done in a better way.
To help you here,
Your don't-knows:
Reading a file in Java.
Processing the content of the file you read.
Your known part:
String to array representation (search whether there are better ways for your use case)
Combine both :-)

Java Shingle Pairs

I'm having trouble with a program I'm working on to create shingle pairs from each sentence in a text file. Right now my code reads in a .txt file in Java and outputs each sentence in order. I want to store each sentence separately then take each sentence and create 2-character shingles of them, which would be stored in an array. An example of this would be taking the sentence “The quick brown fox” and turning it into {th, he, e , q, qu, ui, ic, ck, k , b, br, ro, ow, wn, n , f, fo, ox} so that all of the spaces in between the words would be accounted for. My goal is to simply take each sentence and create an array for each of them that holds the shingle pairs like in the example above. My problem is that I'm not sure how to go about this. I can’t seem to figure out how to take the sentences and store them separately, and I’m not sure how to create shingle pairs. I'm still very new to Java, and any help is very much appreciated. Here is my code so far:
//Takes .txt file as command-line input parameter
File file = new File(args[0]);
Scanner scanner = new Scanner(new FileInputStream(file));
int i=0;
//Reads in and outputs each line from the file
while (scanner.hasNextLine()) {
    System.out.print(++i + " : " + scanner.nextLine() + "\n");
}
Just take pairs of characters from [0,1] to [last-1,last]:
String[] result = new String[sentence.length() - 1];
// the last pair starts at index length() - 2, so loop while i < length() - 1
for (int i = 0; i < sentence.length() - 1; i++) {
    result[i] = sentence.substring(i, i + 2);
}
If you need to, you can remove the spaces from the shingles with trim() after this loop.
To split into sentences you can use pattern matching. Just define what a valid sentence is for your task. Here I assume a sentence always ends with a dot, question mark or exclamation mark, and the next sentence starts after one or more whitespace characters:
final Pattern sentencePattern = Pattern.compile("[\\.\\?!]+\\s+");
sentencePattern.splitAsStream(text).forEach(
System.out::println //your code here
);
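Putting the two answers together, a minimal sketch might look like the following. The class name Shingles and its methods are invented for illustration, the sentence rule is the same assumption as above (ends with '.', '?' or '!' followed by whitespace), and the text is lower-cased because the example in the question is lower-case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answers.
class Shingles {

    // Assumed sentence boundary: end punctuation followed by whitespace.
    private static final Pattern SENTENCE_END = Pattern.compile("[.?!]+\\s+");

    // Splits the text into sentences; the boundary punctuation is dropped.
    static String[] splitSentences(String text) {
        return SENTENCE_END.split(text.trim());
    }

    // Returns every pair of adjacent characters, spaces included.
    static List<String> shingles(String sentence) {
        String s = sentence.toLowerCase();
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }
}
```

To get one list per sentence, you could read the whole file into a String, call splitSentences on it, and then call shingles on each element.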

searching from txt file for a specific characters (Java)

I have a big .txt file (a dictionary) which contains about 100k+ words, laid out like this:
tree trees asderi 12
car cars asdfei 123
mouse mouses dasrkfi 333
plate plates asdegvi 333
......
(P.S. there are no empty rows in between.)
What I want to do is check the 3rd column (asderi in the first row) and, if this word contains the letters "i" and "e", copy the first word in that row (tree in this case) to a new .txt file. I don't need a whole solution, but maybe an example of how to read the 3rd word, check its letters, and, if the check is true, print out the first word in that line.
When it comes to big data files, you want to process them line by line rather than reading everything into memory. You may want to start with this to process the file line by line:
BufferedReader br = new BufferedReader(new FileReader(new File("C:/sample/sample.txt")));
String line;
while ((line = br.readLine()) != null) {
// process the line.
}
br.close();
Once you have the line, I bet you will be able to use the common String methods like .indexOf(...), .substring(...) and .split(...) to acquire the data you want (especially since the source file seems to have well-structured data).
So, assuming your "columns" are always separated by a space, a column never contains a space, and no column is ever missing, you could extract the columns using .split like this:
// this will be the current line of the file
String s = "tree trees asderi 12";
String[] fragments = s.split(" ");
String thirdColumn = fragments[2];
boolean hasI = thirdColumn.contains("i");
String firstColumn = fragments[0];
System.out.println("Fragment: "+thirdColumn+" contains i: "+hasI+" thats why i want the first fragment: "+firstColumn);
But in the end you will have to experiment a bit with the String methods to get it together, especially for all the special cases this file will probably bring up ;)
You may update your "question" with some source you managed to write with these hints and then ask again if you get stuck.
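For reference, the two pieces above (reading line by line and splitting into columns) could be combined roughly like this. ColumnFilter and its method name are made up for illustration, and the sketch assumes whitespace-separated columns and simply skips lines with fewer than three of them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class ColumnFilter {

    // Collects the first column of every line whose third column
    // contains both the letter "i" and the letter "e".
    static List<String> firstWordsWhereThirdHasIAndE(BufferedReader reader) throws IOException {
        List<String> matches = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fragments = line.trim().split("\\s+");
            if (fragments.length < 3) {
                continue; // not enough columns, skip the line
            }
            String thirdColumn = fragments[2];
            if (thirdColumn.contains("i") && thirdColumn.contains("e")) {
                matches.add(fragments[0]);
            }
        }
        return matches;
    }
}
```

The returned list can then be written to the new file, for example with java.nio.file.Files.write, which accepts a list of lines.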

Start reading the file after a specific word

I have a text file with some information in it, which looks something like displayed below.
I'm supposed to read the file after a specific word occurs (Complete Population), and store the vertically aligned values in each line like in an array (could be arraylist too)
What the file looks like -
Tue May 14 08:27:25 EST 2013
mating_pool=80
mutation_dist=3
algo_name=ARMO
Complete Population
8.78792396E8 7.45689508E8 8.37899916E8 9.52778502E8 8.47061622E8
8.80017166E8 7.50224432E8 8.23658404E8 9.51664198E8 8.49145008E8
8.85724416E8 7.48191542E8 7.61295532E8 1.00892758E9 8.52389824E8
8.96069156E8 7.11234404E8 7.68007126E8 9.7238065E8 8.5759227E8
8.96193522E8 7.11177522E8 7.67777526E8 9.72449466E8 8.5763106E8
8.95546766E8 7.1112849E8 7.68311754E8 9.71998374E8 8.57960886E8
8.95480802E8 7.11023308E8 7.68223532E8 9.72097758E8 8.5803376E8
8.9549393E8 7.11015392E8 7.68194136E8 9.72079838E8 8.5804897E8
8.95467666E8 7.11364074E8 7.68318732E8 9.7189094E8 8.58053462E8
8.95574386E8 7.11095656E8 7.68187948E8 9.71985272E8 8.58095624E8
8.95390774E8 7.11052654E8 7.684207E8 9.72098718E8 8.58105648E8
What I have tried
I'm able to read only one line of the numbers and not sure how to add numbers vertically.
Any help is appreciated.
Well, there actually is no issue here. You just need to code it.
There are some nice pieces of code in this thread.
Do something like this:
BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while ((line = br.readLine()) != null) {
    if (line.contains("Complete Population")) {
        // do something
        break; // breaks the while loop
    }
}
// we reached the section with numbers
while ((line = br.readLine()) != null) {
    // use String.split to split the line, then convert
    // the values to double and process them.
}
br.close();
Use a BufferedReader to wrap a FileReader on the file, and then use readLine() to read each line.
Create a Pattern object with regex ".*Complete Population.*", and use a Matcher on that Pattern to check each line (looping while the BufferedReader's readLine() doesn't return null, since null indicates that the end of the file has been reached).
When a line matches, begin processing subsequent lines to form arrays.
I'm not sure what you mean by "the vertically-aligned values", but if you mean the space-separated values on each line as an array, use String.split("\\s+"); on each line to split on whitespace, returning an array of Strings.
If by vertical arrays, you mean the first elements on each of the lines, then the second elements on each of the lines, and so on:
You can store these arrays of Strings retrieved by String.split("\\s+")ing each line together as a 2-d array by placing each array into a main array which will hold them all (an array of arrays of per-line Strings), and then, when the full read-in is done and end of file is reached, go back to this 2-d array and access element [0] of each line to get a list of the first items on each line, element[1] of each line to get a list of the second items on each line, and so on. If you want, you can store these (effectively vertical lists of items on the lines) in another set of arrays.
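A small sketch of that approach, assuming the numbers are whitespace-separated and every row has the same number of values. PopulationReader and its methods are invented names, and the values are parsed to double rather than kept as Strings.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answers.
class PopulationReader {

    // Skips everything up to and including the marker line,
    // then parses each remaining non-empty line into a row of doubles.
    static List<double[]> readRowsAfter(BufferedReader br, String marker) throws IOException {
        String line;
        while ((line = br.readLine()) != null) {
            if (line.contains(marker)) {
                break; // the numbers start on the next line
            }
        }
        List<double[]> rows = new ArrayList<>();
        while ((line = br.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\\s+");
            double[] row = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                row[i] = Double.parseDouble(parts[i]);
            }
            rows.add(row);
        }
        return rows;
    }

    // Builds the "vertical" array: element index of every row, top to bottom.
    static double[] column(List<double[]> rows, int index) {
        double[] col = new double[rows.size()];
        for (int r = 0; r < rows.size(); r++) {
            col[r] = rows.get(r)[index];
        }
        return col;
    }
}
```

For the file in the question you would call readRowsAfter(br, "Complete Population") and then column(rows, 0), column(rows, 1), and so on for each of the five columns.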

Appending text from array list to a String takes a lot of time

I am reading a simple Notepad text file containing a lot of data, about 3 MB in size, so you can imagine the number of words it can have! I read this file into a String and then split the String so that I can hold each single word inside an ArrayList<String>. That works fine for me, but the actual problem is that after processing this ArrayList for some purpose, I have to append (or you can say put) all the words of the ArrayList back into the String!
So the steps are:
1. I read a text file into a String (alltext)
2. Split all words into an ArrayList
3. Process that ArrayList (suppose I removed all the stop words like is, am, are)
4. After processing the ArrayList, I want to put all its words back into the String (alltext)
5. Then I have to work with that String (alltext)
(alltext is the string that must contain the text after all processing)
The problem is that step number 4 takes a lot of time to append all the words back to the string. My code is:
BufferedReader br = new BufferedReader(new FileReader(file));
String line = "";
while ((line = br.readLine()) != null) {
    alltext += line.trim().replaceAll("\\s+", " ") + " ";
}
br.close();
//Adding All elements from all text to temp list
ArrayList<String> tempList = new ArrayList<String>();
String[] array = alltext.split(" ");
for (String a : array) {
    tempList.add(a);
}
//remove stop words here from the temp list
//Adding File Words from List in One String
alltext = "";
for (String removed1 : tempList) {
    System.out.println("appending the text");
    alltext += removed1.toLowerCase() + " ";
    //here it is taking a lot of time suppose 5-10 minutes for a simple text file of even 1.4mb
}
So I just want any idea so that I can reduce the time for an efficient processing and relax the machine! I will be thankful for any suggestions and ideas...
Thanks
Use a StringBuffer instead of a String.
A String is immutable, and thus you create a new object every time you append, which takes more and more time the longer your String becomes. A StringBuffer is mutable and made for cases like yours.
I would recommend StringBuilder.
According to the linked question (stringbuilder-and-stringbuffer-in-java), it's faster than a StringBuffer. Also check whether you need the ArrayList at all, because you can iterate through the array too.
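As a rough illustration of the StringBuilder suggestion, the last loop from the question could be written along these lines. JoinWords is a made-up name, and unlike the original loop this sketch does not leave a trailing space at the end of the result.

```java
import java.util.List;

// Hypothetical helper, not part of the original answers.
class JoinWords {

    // Appends into one mutable buffer instead of creating
    // a new String for every word.
    static String joinLowerCase(List<String> words) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (String word : words) {
            if (!first) {
                sb.append(' ');
            }
            sb.append(word.toLowerCase());
            first = false;
        }
        return sb.toString();
    }
}
```

The same change applies to the first loop that builds alltext from the file, and removing the System.out.println call inside the loop should also help, since printing once per word is slow by itself.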
