Index of words found in document - Java

I am trying to write a program that takes a text file as input, retrieves the words, and outputs each word along with the line number(s) on which it appears. I'm having a lot of trouble with this project, although I've made some progress...
So far I have an ArrayList which holds all of the words found in the document, without punctuation marks. I am able to output this list and see all the words in the text file, but I do not know where to go from here... any ideas?
example:
myList = [A, ACTUALLY, ALMOST,....]
I need to somehow be able to associate each word with the line it came from, so I can populate a data structure that will hold each word with its associated line number(s).
I am a programming novice so I am not very familiar with all the types of data structures and algorithms out there... my instructor suggested I use a dynamic multilinked list but I don't know how I would implement that versus ArrayLists and arrays.
Any ideas would be greatly appreciated. Thanks!

You should use a hash table. A hash table stores key/value pairs: here, each word in the text file is a key, and the value is an ArrayList containing the line numbers where that word appears.
Basically, loop through every word in the text file. If a word is not yet in the table, add it as a key with a new list holding the current line number as its value. If the word is already in the table, append the line number to its ArrayList.
Java has good documentation on its hash table classes, where you can find the methods you need.
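For example, a rough sketch of that idea (this assumes the input is in a file named input.txt and that a "word" is just a run of letters; adjust the file name and the tokenization to your own rules):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class WordIndex {
        public static void main(String[] args) throws IOException {
            // Each word maps to the list of line numbers it appears on.
            Map<String, List<Integer>> index = new HashMap<>();

            try (BufferedReader reader = new BufferedReader(new FileReader("input.txt"))) {
                String line;
                int lineNumber = 0;
                while ((line = reader.readLine()) != null) {
                    lineNumber++;
                    // Treat runs of letters as words; everything else is punctuation or whitespace.
                    for (String word : line.toUpperCase().split("[^A-Z]+")) {
                        if (!word.isEmpty()) {
                            // Create the list the first time a word is seen, then record the line number.
                            index.computeIfAbsent(word, k -> new ArrayList<>()).add(lineNumber);
                        }
                    }
                }
            }

            // Print each word with its line numbers; swap in a TreeMap above for alphabetical order.
            for (Map.Entry<String, List<Integer>> entry : index.entrySet()) {
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }
        }
    }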

Related

script to build word pairs and their frequencies in native java

It's my first time programming ever, and we have an assignment on finding both word frequencies and word pair frequencies in a text file.
I've followed several tutorials online and implemented a rather fast word-count solution. However, I have no clue how to implement a method to get all the word pairs in the text file and sum up the occurrences of duplicate word pairs to find their frequencies before adding them to a hashmap.
I tried asking my instructor, but he is adamant that we figure it out ourselves. Please just point me in the right direction to a paper / article / tutorial (anything) I can read in order to solve this.
Thanks in advance.
Ideally this would be done using a hash map with String keys and Integer values. You can check whether a pair is already in the hash map before adding it as a new key with its frequency.
Here is an example of a previously answered question using this method.
Counting frequency of words from a .txt file in java
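A short sketch of the pair-counting part (this assumes a "word pair" means two consecutive words and that the input file is named input.txt; both are assumptions about the assignment):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class WordPairCounter {
        public static void main(String[] args) throws IOException {
            // Read the whole file and split it into lowercase words (simple tokenization).
            String text = new String(Files.readAllBytes(Paths.get("input.txt")));
            String[] words = text.toLowerCase().split("\\W+");

            Map<String, Integer> pairCounts = new HashMap<>();
            // A "pair" here is two consecutive words; adjust if the assignment defines it differently.
            for (int i = 0; i < words.length - 1; i++) {
                String pair = words[i] + " " + words[i + 1];
                // merge() increments an existing count or starts a new pair at 1.
                pairCounts.merge(pair, 1, Integer::sum);
            }

            pairCounts.forEach((pair, count) -> System.out.println(pair + " : " + count));
        }
    }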

Data Structure choices based on requirements

I'm completely new to programming and to java in particular and I am trying to determine which data structure to use for a specific situation. Since I'm not familiar with Data Structures in general, I have no idea what structure does what and what the limitations are with each.
So I have a CSV file with a bunch of items on it, let's say Characters and matching Numbers. So my list looks like this:
A,1,B,2,B,3,C,4,D,5,E,6,E,7,E,8,E,9,F,10......etc.
I need to be able to read this in, and then:
1)display just the letters or just the numbers sorted alphabetically or numerically
2)search to see if an element is contained in either list.
3)search to see if an element pair (for example A - 1 or B-10) is contained in the matching list.
Think of it as an Excel spreadsheet with two columns. I need to be able to sort by either column while maintaining the relationship, and I need to be able to do something like: if column A equals some value AND the corresponding column B contains some other value, then do such and such.
I need to also be able to insert a pair into the original list at any location. So insert A into list 1 and insert 10 into list 2 but make sure they retain the relationship A-10.
I hope this makes sense, and thank you for any help! I am working on purchasing a Data Structures in Java book to work through and trying to sign up for the class at our local college, but it's only offered every spring...
You could use two sorted Maps such as TreeMap.
One would map Characters to numbers (Map<Character,Number> or something similar). The other would perform the reverse mapping (Map<Number, Character>).
Let's look at your requirements:
1)display just the letters or just the numbers sorted alphabetically or numerically
Just iterate over one of the maps. The iteration will be ordered.
2)search to see if an element is contained in either list.
Just check the corresponding map. Looking for a number? Check the Map whose keys are numbers.
3)search to see if an element pair (for example A - 1 or B-10) is contained in the matching list.
Just get() the value for A from the Character map, and check whether that value is 10. If so, then A-10 exists. If there's no value, or the value is not 10, then A-10 doesn't exist.
When adding or removing elements you'd need to take care to modify both maps to keep them in sync.
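A small sketch of the two-map idea is below. Note that the sample data repeats letters (B and E appear more than once), so with plain maps each key keeps only one value; you would need a map of lists (or a multimap) to retain duplicates:

    import java.util.Map;
    import java.util.TreeMap;

    public class PairedMaps {
        // Letter -> number, iterated in alphabetical order.
        private final Map<Character, Integer> byLetter = new TreeMap<>();
        // Number -> letter, iterated in numeric order.
        private final Map<Integer, Character> byNumber = new TreeMap<>();

        // Insert a pair into both maps so they stay in sync.
        public void add(char letter, int number) {
            byLetter.put(letter, number);
            byNumber.put(number, letter);
        }

        // Requirement 3: does this exact pair exist?
        public boolean containsPair(char letter, int number) {
            Integer value = byLetter.get(letter);
            return value != null && value == number;
        }

        public static void main(String[] args) {
            PairedMaps maps = new PairedMaps();
            maps.add('A', 1);
            maps.add('B', 2);

            // Requirement 1: sorted display of either column.
            System.out.println("Letters: " + maps.byLetter.keySet());
            System.out.println("Numbers: " + maps.byNumber.keySet());

            // Requirement 2: membership in either list.
            System.out.println(maps.byLetter.containsKey('A'));   // true
            System.out.println(maps.byNumber.containsKey(3));     // false

            // Requirement 3: pair lookup.
            System.out.println(maps.containsPair('A', 1));        // true
        }
    }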

How to separate each word in a file and display each word and number of each word in the table?

I don't know anything about this. Could you explain it to me in detail, please?
Load the contents of the file into a string
Call the split operation string.split(" ") and assign the result to an array object
Create a HashMap to store your results
Use a for loop to iterate over the array
If the word is already in the map (map.containsKey("example")), update the value to increment its occurrence count
Otherwise add the new word to the map with a count of 1: map.put("example", 1)
There are many tutorials dealing with the steps outlined above and you should be able to track them down rather easily.
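Putting those steps together, a sketch might look like this (assuming the file is named words.txt and that splitting on whitespace is good enough for the input):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class WordCount {
        public static void main(String[] args) throws IOException {
            // 1. Load the contents of the file into a string.
            String content = new String(Files.readAllBytes(Paths.get("words.txt")));

            // 2. Split the string into words (here on any whitespace run).
            String[] words = content.split("\\s+");

            // 3. A HashMap stores each word and its count.
            Map<String, Integer> counts = new HashMap<>();

            // 4-6. Iterate; increment existing entries, otherwise insert with a count of 1.
            for (String word : words) {
                if (counts.containsKey(word)) {
                    counts.put(word, counts.get(word) + 1);
                } else {
                    counts.put(word, 1);
                }
            }

            // Display the table of words and counts.
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                System.out.println(entry.getKey() + "\t" + entry.getValue());
            }
        }
    }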

Count word frequency of huge text file [duplicate]

This question already has answers here:
Parsing one terabyte of text and efficiently counting the number of occurrences of each word
(16 answers)
Closed 10 years ago.
I have a huge text file (larger than the available RAM memory). I need to count the frequency of all words and output the word and the frequency count into a new file. The result should be sorted in the descending order of frequency count.
My Approach:
Sort the given file - external sort
Count the frequency of each word sequentially, store the count in another file (along with the word)
Sort the output file based on frequency count - external sort.
I want to know if there are better approaches to do it. I have heard of disk based hash tables? or B+ trees, but never tried them before.
Note: I have seen similar questions asked on SO, but none of them address the issue of data larger than memory.
Edit: Based on the comments, I agree that a dictionary in practice should fit in the memory of today's computers. But let's assume a hypothetical dictionary of words that is too large to fit in memory.
I would go with a map-reduce approach:
Distribute your text file across nodes, assuming each node's portion of the text fits into RAM.
Calculate each word's frequency within the node (using hash tables).
Collect each result in a master node and combine them all.
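The combine step on the master node could be as simple as merging the per-node maps (a sketch; the sample data uses Java 9+ collection factories):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MergeCounts {
        // "Reduce" step: fold the per-node word counts into one total map.
        static Map<String, Long> combine(List<Map<String, Long>> partialCounts) {
            Map<String, Long> total = new HashMap<>();
            for (Map<String, Long> partial : partialCounts) {
                partial.forEach((word, count) -> total.merge(word, count, Long::sum));
            }
            return total;
        }

        public static void main(String[] args) {
            // Hypothetical partial results from two nodes.
            Map<String, Long> nodeA = Map.of("the", 10L, "cat", 2L);
            Map<String, Long> nodeB = Map.of("the", 7L, "dog", 3L);
            System.out.println(combine(List.of(nodeA, nodeB)));
        }
    }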
All unique words probably fit in memory so I'd use this approach:
Create a dictionary (HashMap<String, Integer>).
Read the huge text file line by line.
Add new words into the dictionary and set value to 1.
Add 1 to the value of existing words.
After you've parsed the entire huge file:
Sort the dictionary by frequency.
Write, to a new file, the sorted dictionary with words and frequency.
Remember, though, to convert the words to either lowercase or uppercase.
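A sketch of that approach (huge.txt and frequencies.txt are placeholder names; the dictionary of unique words is assumed to fit in memory):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class FrequencySorter {
        public static void main(String[] args) throws IOException {
            Map<String, Integer> counts = new HashMap<>();

            // Read line by line so only the dictionary, not the whole file, is kept in memory.
            try (BufferedReader reader = new BufferedReader(new FileReader("huge.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            counts.merge(word, 1, Integer::sum);
                        }
                    }
                }
            }

            // Sort the entries by descending frequency.
            List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
            sorted.sort((a, b) -> b.getValue().compareTo(a.getValue()));

            // Write each word and its count to the output file.
            try (PrintWriter out = new PrintWriter("frequencies.txt")) {
                for (Map.Entry<String, Integer> entry : sorted) {
                    out.println(entry.getKey() + " " + entry.getValue());
                }
            }
        }
    }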
The best way to achieve this would be to read the file line by line and store the words in a Multimap (e.g. Guava's). If this map exceeds your memory, you could try using a key-value store (e.g. Berkeley DB JE or MapDB). These key-value stores work similarly to a map, but they store their values on disk. I used MapDB for a similar problem and it was blazing fast.
If the list of unique words and their frequencies fits in memory (not the file, just the unique words), you can use a hash table and read the file sequentially (without storing it).
You can then sort the entries of the hash table by the number of occurrences.

Read two files (text) and compare for common values and output the string?

Question: I have two files, one with a list of serial numbers, items, prices, and locations, and the other file has items. I would like to compare the two files and print out the number of times the items are repeated in file1, along with the serial number.
Text1 file will have
Text2 file will have
Output should be
So file1 is not formatted in proper order, and file2 is in order (line by line).
Since you have no apparent code or effort put into this, I'll only hint/guide you to some tools you can use.
For parsing strings: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html
For reading in from a file: http://www.roseindia.net/java/beginners/java-read-file-line-by-line.shtml
And I would recommend reading file #2 first and saving those values to an ArrayList, perhaps, so you can iterate through them later on when you do your searching.
Okay, my approach to this would be:
Read file1 and file2 each into a string
"Split" the string from file1 as well as file2 on "," if that is what is being used
Check for the item in every 3rd element, so the iteration advances by 3 each time (you might need to sort both of these if they are not in order)
If found, store it in an array, ArrayList, etc. Go back to step 3 if more items are present; otherwise stop
Even though your file1 is not well formatted, its content follows some patterns which you can use to read it successfully.
Each item has all the information (i.e. serial number, name, price, location), but not in a fixed order. So you have to pay attention to, and use, the following patterns while you read each item from file1:
Serial number is always a plain integer.
Price has that $ and . character.
Location is 2-character long, all capital.
And the name is a string that does not match any of the above.
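For example, a token classifier along those lines (the exact price format and the sample tokens are guesses, since the real file contents are not shown in the question):

    import java.util.regex.Pattern;

    public class FieldClassifier {
        private static final Pattern SERIAL = Pattern.compile("\\d+");            // plain integer
        private static final Pattern PRICE = Pattern.compile("\\$\\d+\\.\\d{2}"); // has $ and .
        private static final Pattern LOCATION = Pattern.compile("[A-Z]{2}");      // two capital letters

        // Decide which field a token from file1 represents, using the patterns above.
        static String classify(String token) {
            if (SERIAL.matcher(token).matches()) {
                return "serial number";
            } else if (PRICE.matcher(token).matches()) {
                return "price";
            } else if (LOCATION.matcher(token).matches()) {
                return "location";
            } else {
                return "name";
            }
        }

        public static void main(String[] args) {
            // Hypothetical tokens, just to show the classification.
            for (String token : new String[] {"1042", "$3.50", "NY", "stapler"}) {
                System.out.println(token + " -> " + classify(token));
            }
        }
    }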
Such problems are not best solved by monolithic Java code. If you don't have tooling constraints, then the recommended way to solve this is to import the data from file1 into a database table and then run queries from your program to fetch whatever information you like. You can easily select serial numbers based on items and group them for counts by location.
This approach will ensure that you can keep up with changing requirements, and if your files are huge you will have good performance.
I hope you are well versed with SQL and DB tools, so I have not posted any details on them.
Use regex.
Step one: trace and split at [\d,], and store the results in a map.
Step two: read in the word from the second file; say it's "pen".
Step three: do a regex search for "pen" on each string within the map.
Step four: if the above returns true, apply something like ([A-Z][A-Z],) to each string within the map.
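A rough sketch of steps three and four (the record and the item are made up, since the actual file contents are not shown; the location pattern is widened slightly so it also matches at the end of a line):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ItemSearch {
        public static void main(String[] args) {
            // Hypothetical record from file1 and word from file2.
            String record = "1042,pen,$1.25,NY";
            String item = "pen";

            // Step three: does the record mention the item?
            if (Pattern.compile("\\b" + Pattern.quote(item) + "\\b").matcher(record).find()) {
                // Step four: pull out the two-capital-letter location.
                Matcher location = Pattern.compile("([A-Z]{2})(,|$)").matcher(record);
                if (location.find()) {
                    System.out.println(item + " found, location: " + location.group(1));
                }
            }
        }
    }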
