I am working on a program that will create a word cloud from words that are spoken by a candidate in the presidential debate. The way the text file is set up one person can speak for multiple lines and I want to take in all those lines so I can count the frequency of the words they spoke. There is also a list of stop words that will not be counted for the word cloud. Some examples of the stop words are: "is", "a", "the" and so on. So far I have been able to take in all the stop words and the entire transcript for the debate and remove the stop words from the transcript. Now I want to separate the transcript into what each candidate said and I'm having troubles with it since a person speaks for multiple lines. Some help would be greatly appreciated.
Code so far:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;
public class ResendizYonzon {
public static void main(String[] args) throws FileNotFoundException
{
readTextFile("democratic-debate2015Oct13.txt");
}
public static String readTextFile(String text) throws FileNotFoundException {
File f = new File(text);
Scanner out = new Scanner(f);
String word = "";
File f1 = new File("stopwords.txt");
Scanner out1 = new Scanner(f1);
ArrayList<String> stopWords = new ArrayList<String>();
ArrayList<String> words = new ArrayList<String>();
while (out1.hasNext()) {
stopWords.add(out1.next());
}
while (out.hasNext()) {
words.add(out.next());
}
words.removeAll(stopWords);
out.close();
out1.close();
return word;
}
}
Transcript snippet:
CLINTON: No. I think that, like most people that I know, I have a range of views, but they are rooted in my values and my experience. And I don't take a back seat to anyone when it comes to progressive experience and progressive commitment.
You know, when I left law school, my first job was with the Children's Defense Fund, and for all the years since, I have been focused on how we're going to un-stack the deck, and how we're gonna make it possible for more people to have the experience I had.
You know, to be able to come from a grandfather who was a factory worker, a father who was a small business person, and now asking the people of America to elect me president.
COOPER: Just for the record, are you a progressive, or are you a moderate?
CLINTON: I'm a progressive. But I'm a progressive who likes to get things done. And I know...
(APPLAUSE)
...how to find common ground, and I know how to stand my ground, and I have proved that in every position that I've had, even dealing with Republicans who never had a good word to say about me, honestly. But we found ways to work together on everything from...
COOPER: Secretary...
CLINTON: ...reforming foster care and adoption to the Children's Health Insurance Program, which insures...
COOPER: ...thank you...
CLINTON: ...8 million kids. So I have a long history of getting things done, rooted in the same values...
COOPER: ...Senator...
CLINTON: ...I've always had.
COOPER: Senator Sanders. A Gallup poll says half the country would not put a socialist in the White House. You call yourself a democratic socialist. How can any kind of socialist win a general election in the United States?
SANDERS: Well, we're gonna win because first, we're gonna explain what democratic socialism is.
And what democratic socialism is about is saying that it is immoral and wrong that the top one-tenth of 1 percent in this country own almost 90 percent - almost - own almost as much wealth as the bottom 90 percent. That it is wrong, today, in a rigged economy, that 57 percent of all new income is going to the top 1 percent.
That when you look around the world, you see every other major country providing health care to all people as a right, except the United States. You see every other major country saying to moms that, when you have a baby, we're not gonna separate you from your newborn baby, because we are going to have - we are gonna have medical and family paid leave, like every other country on Earth.
Those are some of the principles that I believe in, and I think we should look to countries like Denmark, like Sweden and Norway, and learn from what they have accomplished for their working people.
(APPLAUSE)
Based on the description in your question and in your comment on it, you want to concatenate everything a given speaker said into one string, so that when someone asks for the speech of a speaker (e.g. CLINTON) you can return just that speaker's parts.
That is easy to accomplish: the colon (:) is your ticket to solving this issue. If you look at the input file, whenever a new speaker starts, the line begins with the speaker's name followed by a colon.
What you need to do is the following:
Open the file
Read the file line by line
For each line, check whether it contains a colon (:)
If the line contains a colon, split the line using the colon as the separator
Assuming that you don't have colons elsewhere, splitting the following line
COOPER: Just for the record, are you a progressive, or are you a moderate?
gives you the following tokens:
Token[0] = COOPER
Token[1] = Just for the record, are you a progressive, or are you a moderate?
Now that you have two tokens, check whether you have just started reading the transcript file or already have a speaker
If you have just started reading the file, this is your first speaker, so record the name and initialize the variables
If you already have a speaker, store the previous speaker's speech in the map (appending it to anything already stored for him/her) before reading the speech of the new (or revisited) speaker.
If you repeat the above steps, then every time you come across a speaker line you update the previous speaker's speech in the hash map, until the end of the file is reached.
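Below is sample code that does the above; it is fully commented to help you understand it.
//imports needed by this snippet
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
//public static map that stores each speaker together with his/her combined speech.
public static Map<String, String> speakerSpeech = new HashMap<String, String>();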
public static void main(String[] args) throws FileNotFoundException
{
readTextFile("C:\\test_java\\transcript.txt");
}
public static void readTextFile(String text) throws FileNotFoundException {
File f = new File(text);
String line;
BufferedReader br;
try {
//open input stream to the path passed as text
FileInputStream fstream = new FileInputStream(text);
//open buffered reader using the input stream
br = new BufferedReader(new InputStreamReader(fstream));
//String builder used to append speech and lines (String is immutable)
StringBuilder speech = new StringBuilder();
// currentSpeaker is used for history. when new speaker is found, we should know who was previous one
// so we save all the speech that so far we have read
String currentSpeaker = null;
// while loop keeps looping over file line by line and terminates when line == null
// that is when end of file is reached.
while((line=br.readLine()) != null) {
//if line contains : then it is a line having a speaker, based on structure of your input file
if(line.contains(":")) {
//split the line at the first colon only, which gives us 2 values (speaker and sentence) based
//on structure of your file; the limit of 2 keeps any later colons inside the sentence
String[] chunks = line.split(":", 2);
//store the speaker name (e.g. CLINTON), which is chunks[0] because it is to the left of the colon,
//trimming whitespace (leading and trailing, if any)
String speakerName = chunks[0].trim();
//condition to check if we just started reading transcripts or already read some
if(currentSpeaker == null) {
//just started reading transcript file, this is the first speaker ever
// assign the speaker to currentSpeaker
currentSpeaker = speakerName;
//add the remainder of speech after colon : to the speech StringBuilder
speech.append(chunks[1]);
} else {
//else: because currentSpeaker is not null, we have already read a speaker before.
//currentSpeaker is the old speaker and we are about to switch to the new speaker, so
//check whether the old speaker is already in our map of speakers
if(speakerSpeech.containsKey(currentSpeaker)) {
//yes, the speaker is already in the map, so get his/her previous speeches
String previousSpeech = speakerSpeech.get(currentSpeaker);
//re-add the speaker in map and but this time with updated speech
//concatenating previous speech with current speech
speakerSpeech.put(currentSpeaker, previousSpeech + " >>> " + speech.toString());
} else {
//no speaker is new, then add it to the map with its speech
speakerSpeech.put(currentSpeaker, speech.toString());
}
//after storing previous speaker in list, add current speaker for record
currentSpeaker = speakerName.trim();
//initialize speech variable with new speakers speech after : colon
speech = new StringBuilder(chunks[1]);
}
} else {
//this else is because line did not have colon : hence, its continuation of speech
// of current speaker, just append to the speech
speech.append(line);
}
}
//because last line == null and loop terminates, we have to add the last speaker's speech to
//the list manually.
if(speakerSpeech.containsKey(currentSpeaker)) {
String previousSpeech = speakerSpeech.get(currentSpeaker);
speakerSpeech.put(currentSpeaker, previousSpeech + " >>> " + speech.toString());
} else {
speakerSpeech.put(currentSpeaker, speech.toString());
}
System.out.println("No. of speakers: " + speakerSpeech.size());
} catch(Exception ex) {
//handle error; at the very least report it instead of silently ignoring it
ex.printStackTrace();
}
//all speakers with their speech one giant string.
System.out.println(speakerSpeech.toString());
}
Executing the above gives you the following output:
{COOPER= Just for the record, are you a progressive, or are you a moderate? >>> Secretary... >>> ...thank you... >>> ...Senator... >>> Senator Sanders. A Gallup poll says half the country would not put a socialist in the White House. You call yourself a democratic socialist. How can any kind of socialist win a general election in the United States?, SANDERS= Well, we're gonna win because first, we're gonna explain what democratic socialism is.And what democratic socialism is about is saying that it is immoral and wrong that the top one-tenth of 1 percent in this country own almost 90 percent - almost - own almost as much wealth as the bottom 90 percent. That it is wrong, today, in a rigged economy, that 57 percent of all new income is going to the top 1 percent.That when you look around the world, you see every other major country providing health care to all people as a right, except the United States. You see every other major country saying to moms that, when you have a baby, we're not gonna separate you from your newborn baby, because we are going to have - we are gonna have medical and family paid leave, like every other country on Earth.Those are some of the principles that I believe in, and I think we should look to countries like Denmark, like Sweden and Norway, and learn from what they have accomplished for their working people.(APPLAUSE), CLINTON= No. I think that, like most people that I know, I have a range of views, but they are rooted in my values and my experience. 
And I don't take a back seat to anyone when it comes to progressive experience and progressive commitment.You know, when I left law school, my first job was with the Children's Defense Fund, and for all the years since, I have been focused on how we're going to un-stack the deck, and how we're gonna make it possible for more people to have the experience I had.You know, to be able to come from a grandfather who was a factory worker, a father who was a small business person, and now asking the people of America to elect me president. >>> I'm a progressive. But I'm a progressive who likes to get things done. And I know...(APPLAUSE)...how to find common ground, and I know how to stand my ground, and I have proved that in every position that I've had, even dealing with Republicans who never had a good word to say about me, honestly. But we found ways to work together on everything from... >>> ...reforming foster care and adoption to the Children's Health Insurance Program, which insures... >>> ...8 million kids. So I have a long history of getting things done, rooted in the same values... >>> ...I've always had.}
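The map above only gets you as far as one long string per speaker. For the word cloud described in the question, one possible next step is to turn each string into word counts while skipping stop words. The following is a minimal sketch, not part of the original answer: the class name SpeakerWordCount, the method countWords and the tiny stop-word set are all made up for illustration, and the tokenizing rule (lowercase, split on anything that is not a letter or apostrophe) is an assumption you may want to change. Note that stage directions such as (APPLAUSE) would be counted as ordinary words unless you add them to the stop words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class SpeakerWordCount {
    // Counts how often each non-stop word occurs in one speaker's combined speech.
    static Map<String, Integer> countWords(String speech, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        // Lowercase first, then split on every run of characters that is not a letter or apostrophe,
        // so "ground," and "ground" end up as the same token and the " >>> " separators disappear.
        for (String token : speech.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue; // skip empty leading tokens and stop words
            }
            counts.merge(token, 1, Integer::sum); // add 1, or start at 1 for a new word
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical stop words; in the real program these come from stopwords.txt.
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "but", "who", "to"));
        String speech = "I'm a progressive. But I'm a progressive who likes to get things done.";
        System.out.println(countWords(speech, stopWords));
    }
}
```

You would call countWords once per entry of speakerSpeech, passing speakerSpeech.get("CLINTON") and so on.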
Related
I have a text file like this:
tom
and
jerry
went
to
america
and
england
I want to get the frequency of each word, including partial matches. For example, the word "to" is also present in the word "tom", so my expected count for "to" is 2.
1 america
3 and
1 england
1 jerry
2 to
1 tom
1 went
The text file I have is around 30 GB, hence it's not possible to load all the content into memory.
So what I am doing right now is:
reading the input file using scanner
for each word finding the frequency using this code:
Long wordsCount = Files.lines(Paths.get(allWordsFile))
.filter(s->s.contains(word)).count();
i.e., for each word I am looping over the entire file content. Even though I am using a thread pool executor, the performance of this approach is really poor.
Is there a better way of doing this?
Are there any tools available to find the frequency of the words in a large file?
Assuming there are a lot of repetitions, you could try something like this (written from scratch and untested):
File file = new File("fileLoc");
BufferedReader br = new BufferedReader(new FileReader(file));
Map<String, Integer> hm = new HashMap<>();
String name;
// one word per line; only the distinct words and their counts are kept in memory
while ((name = br.readLine()) != null) {
    if (hm.containsKey(name)) {
        hm.replace(name, hm.get(name) + 1);
    } else {
        hm.put(name, 1);
    }
}
br.close();
EDIT: I didn't notice the partial-matches part, but you should be able to loop back through the map after reading the entire file, so that if there's a partial match you just add the partial match's count to the exact match's count.
The best in terms of performance is to read the lines from the file with a BufferedReader and to store the word counts in a HashMap.
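The two-pass idea from the EDIT could look roughly like the sketch below. It is an illustration under assumptions, not a tested solution for a 30 GB file: it assumes the set of distinct words fits in memory even though the file does not, and the class and method names (PartialMatchCount, exactCounts, partialCounts) are invented here. The second pass compares every distinct word with every other one, so it is quadratic in the number of distinct words.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class PartialMatchCount {
    // Pass 1: stream the input once and count exact occurrences of each distinct line.
    static Map<String, Integer> exactCounts(BufferedReader reader) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String word = line.trim();
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Pass 2: for each distinct word, add up the counts of every distinct word that contains it.
    static Map<String, Integer> partialCounts(Map<String, Integer> exact) {
        Map<String, Integer> result = new TreeMap<>();
        for (String word : exact.keySet()) {
            int total = 0;
            for (Map.Entry<String, Integer> other : exact.entrySet()) {
                if (other.getKey().contains(word)) {
                    total += other.getValue();
                }
            }
            result.put(word, total);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the real file so the example is self-contained.
        String sample = "tom\nand\njerry\nwent\nto\namerica\nand\nengland\n";
        Map<String, Integer> exact = exactCounts(new BufferedReader(new StringReader(sample)));
        System.out.println(partialCounts(exact));
        // prints {america=1, and=3, england=1, jerry=1, to=2, tom=1, went=1}
    }
}
```

Like the s.contains(word) filter in the question, this counts a containing word once per line, not once per occurrence inside the line.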
I am writing a program for a school project in which I read a .csv file and perform calculations based on the data within it.
System.out.println("Please enter a second data set to search:\nFor example, type 'narcotics, poss,' 'criminal damage,' or 'battery.'\nCapitalization does not matter.");
secondDataSet = scan.nextLine().toUpperCase();
System.out.println("Searching for instances of " + secondDataSet);
if (!secondDataSet.equals(""))
{
while ((line = inputStream.readLine()) != null) // read each line exactly once
{
String[] all = line.split(splitBy);
if (all[8].toUpperCase().equals("TRUE"))
{
numArrests++;
}
if (all[8].toUpperCase().equals("FALSE"))
{
numNonArrests++;
}
if (all[5].equals(secondDataSet))
{
secondDataCount++;
}
numberOfLines++;
}
if (secondDataCount == 0)
{
i--;
System.out.println("The data set you entered does not exist.\nPlease try checking your spelling or reformatting your response.");
}
The above is the part of my code that contains my Scanner problem. I am reading a .csv containing the Chicago arrest records for the past several years. Every time I run the program, however, it skips past the scan.nextLine(). It just executes my print statement "Please enter a second data set..." and then prints out "Searching for instances of ".
I am using jGRASP, and my console output looks like:
Please enter a data set to search:
For example, type 'narcotics, poss,' 'criminal damage,' or 'battery.'
Capitalization does not matter.
Searching for instances of
And it loops four times without getting user input. I tried using scan.next() instead, but that did not work when I input a String with two words, because there are some values in the .csv like "CRIMINAL DAMAGE" and scan.next() searched for the words "CRIMINAL" and "DAMAGE" separately. If someone has any suggestions on how to make the computer stop to read the scan.nextLine(), use scan.next() to read two words, or any other solution to this, I would appreciate it.
I am extremely confused by this at the moment so any help would be very nice as I've spent an enormous amount of time on a small part of a large project that I need to complete... Also, I can clarify any questions you may have if my question is unclear or vague.
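The snippet does not show what was read from the Scanner before this prompt, so the following is only a guess at the cause. A very common reason for nextLine() appearing to be skipped is an earlier nextInt(), nextDouble() or next() call on the same Scanner: those read a token but leave the line break behind, and the next nextLine() then returns the empty remainder of that line. The sketch below demonstrates the usual remedy; the class and method names are hypothetical, and a Scanner over a String stands in for the keyboard.

```java
import java.util.Scanner;

class NextLineDemo {
    // Reads a number and then a full line of text from the same Scanner.
    static String readNumberThenLine(Scanner scan) {
        scan.nextInt();   // consumes the digits only, not the line break after them
        scan.nextLine();  // consumes the rest of that line, usually just the line break
        return scan.nextLine().toUpperCase(); // now this really waits for and reads the next line
    }

    public static void main(String[] args) {
        // Simulated user input: a menu choice, then a two-word data set name.
        Scanner scan = new Scanner("2\ncriminal damage\n");
        System.out.println(readNumberThenLine(scan)); // prints CRIMINAL DAMAGE
    }
}
```

Without the extra scan.nextLine() call, the method would return an empty string, which matches the "Searching for instances of " output with nothing after it.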
I posted here a few weeks ago regarding a project I have for work. The project began as creating a simple little program that would take an incoming ACH file and read each line. The program would also ask a user for a "reason code" and "bank" which would affect the next step. The program would then reformat all the data in a certain way and save it to an external file. For those that don't know, an ACH is simply a text based file that is in a very concrete format. (Every character and space has a meaning.)
I have completed that task using a few GUI items (Jcombobox, JFileChooser, etc), string array lists, buffered reader/writer, and lots of if/else statements.
The task has now been expanded into something much more complicated, and I don't know exactly how to begin, so I thought I would seek the community's advice.
When an ACH file comes in it will be in a format that looks something like this:
101 100000000000000000000000000000
522 00000202020382737327372732737237
6272288381237237123712837912738792178
6272392390123018230912830918203810
627232183712636283761231726382168
822233473498327497384798234724273487398
522 83398402830943240924332849832094
62723921380921380921382183092183
6273949384028309432083094820938409832
82283409384083209482094392830404829304
900000000000000000000000000000000
9999999999999999999999999999999999999
9999999999999999999999999999999999999
(I will refer to each line by " " number, for example "1 number" are the lines that begin with 1)
The end result is that the lines of data are manipulated and put into "batches". The output file begins with the "1 number" and then contains batches with the format of
5
6
8
5
6
8
5
6
8
We continue using the same "5 number" until all sixes that were below it in the original file have been written, then we go to the next "5" and work with the "6" below it.
So, my project now is to create a full GUI. After the user inputs the file the GUI will have some type of drop down box or similar list of all the "6" numbers. For each number there should be another drop down box to choose the reason code (there are 7 reason codes).
Basically the ultimate objective is:
Display all the "6" numbers and give the user the ability to choose a reason code for each.
Allow the user to only select a certain amount of the "6" numbers if they wish.
Is it possible for me to do this using Buffered Reader/ Writer? I currently save the values into Array Lists using the following code:
while((sCurrentLine = br.readLine()) !=null)//<---------This loop will continue while there are still lines to be read.
{
if (sCurrentLine.startsWith("5")){//<------------------If the line starts with "5"..
listFive.add(sCurrentLine);//<-------------------------Add the line to the array list "listFive".
countFive++;//<---------------------------------------------Increase the counter "countFive" by one.
}else if (sCurrentLine.startsWith("6") && countFive==1){//<---------If the line starts with "6" and countFive is at a value of 1..
listSix.add(sCurrentLine);//<---------------------------------------Add the line to the array list "listSix".
}else if (sCurrentLine.startsWith("6") && countFive==2){//<-----------------If the line starts with "6" and countFive is at a value of 2..
listSixBatchTwo.add(sCurrentLine);//<--------------------------------------Add the line to the array list "listSixBatchTwo".
}else if (sCurrentLine.startsWith("6") && countFive==3){//<-----------------------If the line starts with "6" and countFive is at a value of 3..
listSixBatchThree.add(sCurrentLine);//<------------------------------------------Add the line to array list "listSixBatchThree".
}else if (sCurrentLine.startsWith("6") && countFive==4){//<------------------------------If the line starts with "6" and countFive is at a value of 4..
listSixBatchFour.add(sCurrentLine); //<--------------------------------------------------Add the line to array list "listSixBatchFour".
}else if (sCurrentLine.startsWith("8")){//<-----------------------------------------------------If the line starts with "8"..
listEight.add(sCurrentLine);//<----------------------------------------------------------------Add the line to array list "listEight".
}else if (sCurrentLine.startsWith("1")){//<-----------------------------------------------------------If the line starts with "1"..
one = sCurrentLine;//<-------------------------------------------------------------------------------Save the line to String "one".
}else if (sCurrentLine.startsWith("9") && count9 == 1){//<---------------------------------------------------If the line starts with "9" and count9 is at a value of 1..
nine = sCurrentLine;//<-------------------------------------------------------------------------------------Save the line to String "nine".
count9 = 0;//<--------------------------------------------------------------------------------------------------Set count9 to a value of 0.
}else if (sCurrentLine.startsWith("999") && count9 == 0){//<-----------------------------------------------------------If the line starts with "999" and count9 is at a value of 0..
listNine.add(sCurrentLine);//<---------------------------------------------------------------------------------------Add the line to array list "listNine".
}else{
}
}
If anyone can point me where I can get started I would be very grateful. If you need more information please let me know.
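One way to avoid a separate list per batch (listSix, listSixBatchTwo, ...) is to group the lines while reading: every "5" line starts a new batch, the "6" lines that follow belong to it, and the "8" line closes it. The sketch below is only an illustration of that grouping. AchBatch and AchBatchParser are invented names, the "1" and "9" lines are simply ignored here, and nothing about the real ACH field layout is assumed beyond the first character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AchBatch {
    final String header;                             // the "5" line that opens the batch
    final List<String> entries = new ArrayList<>();  // the "6" lines below it
    String control;                                  // the "8" line that closes the batch

    AchBatch(String header) {
        this.header = header;
    }
}

class AchBatchParser {
    // Groups lines into batches based on their first character.
    static List<AchBatch> parse(List<String> lines) {
        List<AchBatch> batches = new ArrayList<>();
        AchBatch current = null;
        for (String line : lines) {
            if (line.startsWith("5")) {
                current = new AchBatch(line);        // a new batch begins
                batches.add(current);
            } else if (current != null && line.startsWith("6")) {
                current.entries.add(line);           // entry belongs to the open batch
            } else if (current != null && line.startsWith("8")) {
                current.control = line;              // batch is complete
                current = null;
            }
            // "1" and "9" lines are not handled in this sketch
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "101 1000", "522 0000", "6272288", "6272392", "6272321", "8222334",
                "522 8339", "6272392", "6273949", "8228340", "9000000");
        for (AchBatch batch : parse(lines)) {
            System.out.println(batch.header + " has " + batch.entries.size() + " entries");
        }
    }
}
```

This works for any number of batches, and the list of all "6" lines for a drop-down can be collected by looping over the batches.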
Update:
Here is an example of my JOptionPane with decision making.
String[] choices = {"Wells Fargo", "Bank of America", "CitiBank", "Wells Fargo Legacy", "JPMC"};
String input = (String) JOptionPane.showInputDialog(null, "Bank Selection", "Please choose a bank: ", JOptionPane.QUESTION_MESSAGE, null, choices, choices[0]);
if (input.equals("Wells Fargo"))
{
bank = "WELLS FARGO";
}else if (input.equals("Bank of America")){
bank = "BANK OF AMERICA";
}else if (input.equals("CitiBank")){
bank = "CITI BANK";
}else if (input.equals("Wells Fargo Legacy")){
bank = "WELLS FARGO LEGACY";
}else if (input.equals("JPMC")){
bank = "JPMC";
}
}else{
}
Let's assume I wanted to use the Buffered Writer to save all of the "6" numbers into a String array, then put them into a drop down box in the GUI. How could I accomplish this?
Can you use the input from Buffered Writer in a GUI..
Well, a BufferedWriter is not used to get input but rather to output information, but assuming that you meant a BufferedReader, then the answer is yes, definitely. Understand that a GUI and getting data with a BufferedReader are orthogonal concepts -- they both can work just fine independently of the other. The main issues involving reading in data with a Bufferedhaving a GUI
Say a JOptionPane for example.
I'm not sure what you mean here or how this relates.
If yes, then how could I go about doing that? In all the examples I have seen and tutorials about JOptionPane everything is done BEFORE the main method. What if I need if statements included in my JOptionPane input? How can I accomplish this?
I'm not sure what you mean by "everything is done before the main method", but it sounds like you may be getting ahead of yourself. Before worrying about nuts and bolts and specific location of code, think about what classes/objects your program will have, and how they'll interact -- i.e., what methods they will have.
I believe I just had an idea of how I need to proceed first. Can someone please verify? 1. Create static variables representing the lines that will be read. (Such as a static ArrayList.)
No, don't think about static anything off the bat, since once you do that, you leave the OOP realm and go into procedural programming. The main data should be held in instance variables within a class or two.
Create the actual GUI, outside the main Method.
I'm not sure what you mean "outside the main method", but the GUI will consist of multiple classes, probably one main one, and an instance of the main class is not infrequently created in the main method, or in a method called by the main method, but queued onto the Swing event thread.
Create the Buffered Reader which will write to the variables mentioned in #1, inside the main method.
Again, I wouldn't do this. The main method should be short, very short, and its reason for existence is only to start your key objects and set them running, and that's it. Nothing critical (other than just what I stated) should be done in them. You're thinking small toy programs, and that's not what you're writing. The reader should be an instance variable inside of its own class. It might be started indirectly by the GUI via a control class which is a class that responds to GUI events. If you need the data prior to creation of the GUI, then you will have your main method create the class that reads in the data, ask it to get the data, and then create your GUI class, passing the data into it.
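To make that last suggestion concrete, here is a rough sketch of the shape it describes: a reader class that keeps its data in an instance variable, a GUI class that receives the data through its constructor, and a main method that only wires the two together. All names (EntryReader, EntryChooserGui, AchApp) are invented for this example, and a StringReader stands in for the real file. It is one possible layout, not the only reasonable one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.DefaultComboBoxModel;
import javax.swing.JComboBox;
import javax.swing.JFrame;
import javax.swing.SwingUtilities;

class EntryReader {
    private final List<String> entries = new ArrayList<>(); // instance state, nothing static

    // Collects every line that starts with "6".
    void read(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("6")) {
                entries.add(line);
            }
        }
    }

    List<String> getEntries() {
        return entries;
    }
}

class EntryChooserGui {
    private final DefaultComboBoxModel<String> model = new DefaultComboBoxModel<>();

    // The data is passed in; the GUI does not know where it came from.
    EntryChooserGui(List<String> entries) {
        for (String entry : entries) {
            model.addElement(entry);
        }
    }

    DefaultComboBoxModel<String> getModel() {
        return model;
    }

    void show() {
        JFrame frame = new JFrame("Entries");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        JComboBox<String> combo = new JComboBox<>(model);
        frame.add(combo);
        frame.pack();
        frame.setVisible(true);
    }
}

class AchApp {
    // main stays short: create the objects, hand over the data, start the GUI.
    public static void main(String[] args) throws IOException {
        EntryReader reader = new EntryReader();
        reader.read(new BufferedReader(new StringReader("522 1\n6271\n6272\n8221\n")));
        SwingUtilities.invokeLater(() -> new EntryChooserGui(reader.getEntries()).show());
    }
}
```

The per-entry reason-code drop-downs from the question would be added inside the GUI class; the reader would not need to change.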
I am using Google Guava APIs to calculate word count.
public static void main(String args[])
{
String txt = "Lemurs of Madagascar is a reference work and field guide giving descriptions and biogeographic data for all the known lemur species in Madagascar (ring-tailed lemur pictured). It also provides general information about lemurs and their history and helps travelers identify species they may encounter. The primary contributor is Russell Mittermeier, president of Conservation International. The first edition in 1994 received favorable reviews for its meticulous coverage, numerous high-quality illustrations, and engaging discussion of lemur topics, including conservation, evolution, and the recently extinct subfossil lemurs. The American Journal of Primatology praised the second edition's updates and enhancements. Lemur News appreciated the expanded content of the third edition (2010), but was concerned that it was not as portable as before. The first edition identified 50 lemur species and subspecies, compared to 71 in the second edition and 101 in the third. The taxonomy promoted by these books has been questioned by some researchers who view these growing numbers of lemur species as insufficiently justified inflation of species numbers.";
Iterable<String> result = Splitter.on(" ").trimResults(CharMatcher.DIGIT)
.omitEmptyStrings().split(txt);
Multiset<String> words = HashMultiset.create(result);
for(Multiset.Entry<String> entry : words.entrySet())
{
String word = entry.getElement();
int count = words.count(word);
System.out.printf("%S %d", word, count);
System.out.println();
}
}
The output should be
Lemurs 3
However I am getting like this:
Lemurs 1
Lemurs 1
Lemurs 1
What am I doing wrong?
Multiset works fine. Take a close look at your results - switching the printf to e.g. "|%s| %d" (lowercase s, with delimiters around the word) will help:
|lemurs.| 1
|lemurs| 1
|Lemurs| 1
It is immediately apparent that those are all 3 different strings. The solution in this case is to simply strip all non-alphabetical chars, and lowercase all words.
Using printf("%S %d", word, count) with a capital S uppercases the printed word, which hides the fact that the different capitalizations of the word "lemurs" are being counted separately. When I run that program, I see
one occurrence of "lemurs." with a period not being trimmed
one occurrence of "lemurs" all lowercase
one occurrence of "Lemurs" with the first letter capitalized
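The normalization suggested above (strip non-alphabetical characters, lowercase everything) can be sketched as follows. This version deliberately uses only the JDK instead of Guava so that it stands alone; a HashMultiset could be filled with the same normalized words. The class name NormalizedWordCount is made up, and the rule of deleting every non-letter is an assumption: it also turns "ring-tailed" into "ringtailed" and "edition's" into "editions".

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class NormalizedWordCount {
    // Counts words after removing non-letters and lowercasing.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("\\s+")) {
            // "lemurs." and "Lemurs" both become "lemurs"
            String word = token.replaceAll("[^A-Za-z]", "").toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String txt = "Lemurs of Madagascar. It also provides information about lemurs "
                + "and the recently extinct subfossil lemurs.";
        System.out.println("lemurs " + count(txt).get("lemurs")); // prints lemurs 3
    }
}
```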
I'm having a bit of difficulty getting my code to work. One of my assignments requires me to use this data from an external file (basically a passage/poem):
Good morning life and all
Things glad and beautiful
My pockets nothing hold
But he that owns the gold
The sun is my great friend
His spending has no end
Hail to the morning sky
Which bright clouds measure high
Hail to you birds whose throats
Would number leaves by notes
Hail to you shady bowers
And you green fields of flowers
Hail to you women fair
That make a show so rare
In cloth as white as milk
Be it calico or silk
Good morning life and all
Things glad and beautiful
We are trying to find the total number of words, the number of words that have only three letters, and the percentage of occurrence of the three-letter words. I think I can handle the assignment, but something went wrong in my code while I was working it out:
import java.io.*;
import java.util.*;
public class Prog739h
{
public static void main(String[] args) throws IOException
{
Scanner kbReader = new Scanner(new File("C:\\Users\\Guest\\Documents\\Java programs\\Prog739h\\Prog739h.in"));
int totalWords = 0;
while(kbReader.hasNextLine())
{
String data = kbReader.nextLine();
String[] words = data.split(" ");
totalWords+=words.length();
System.out.println(totalWords);
}
}
}
When I tried to compile to test the code at the moment to see if everything I had done was working properly, I was given an error that said it can't find symbol method length(). I checked my line with the "totalWords+=words.length()", but I don't know what I can do to fix the problem. Could someone please explain to me why this happened and provide some direction on how to fix this error? Thanks!
The answer is that the length of an array is given by the length field, not the length method. In other words, change
totalWords+=words.length();
to
totalWords+=words.length;
length is a public field of an array object; the code is attempting to invoke it as a method using ().
Remove the () after length:
totalWords+=words.length;
length is a property of an array; access it without ().
Please change:
totalWords+=words.length();
to
totalWords+=words.length;
Array properties shouldn't be followed by parentheses:
totalWords += words.length;
^
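Because the three spellings are easy to mix up, here is a small self-contained comparison: arrays expose a length field, String has a length() method, and collections use size(). The example also counts three-letter words in one line of the poem, as the assignment asks; the class and method names are hypothetical and this is a sketch, not the full assignment.

```java
import java.util.Arrays;
import java.util.List;

class LengthDemo {
    // Counts the words in a line that have exactly three letters.
    static int threeLetterWords(String line) {
        int count = 0;
        for (String word : line.split(" ")) {
            if (word.length() == 3) {   // String: length() is a method
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String line = "Good morning life and all";
        String[] words = line.split(" ");
        List<String> list = Arrays.asList(words);

        System.out.println(words.length);           // array: field, no parentheses, prints 5
        System.out.println(line.length());          // String: method, prints 25
        System.out.println(list.size());            // collection: size(), prints 5
        System.out.println(threeLetterWords(line)); // "and" and "all", prints 2
    }
}
```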