Determining whether string is a proper noun in text - java

I'm trying to parse a text (http://pastebin.com/raw.php?i=0wD91r2i) and retrieve the words and the number of their occurrences. However, I must not include proper nouns within the final output. I'm not quite sure how to accomplish this task.
My attempt at this
public class TextAnalysis
{
public static void main(String[] args)
{
ArrayList<Word> words = new ArrayList<Word>(); //instantiate array list of object Word
try
{
int lineCount = 0;
int wordCount = 0;
int specialWord = 0;
URL reader = new URL("http://pastebin.com/raw.php?i=0wD91r2i");
Scanner in = new Scanner(reader.openStream());
while(in.hasNextLine()) //while to parse text
{
lineCount++;
String textInfo[] = in.nextLine().replaceAll("[^a-zA-Z ]", "").split("\\s+"); //use regex to replace all punctuation with empty char and split words with white space chars in between
wordCount += textInfo.length;
for(int i=0; i<textInfo.length; i++)
{
if(textInfo[i].toLowerCase().matches("the|a|an|and|but|or|by|to|for|of|with|without|chapter|[0-9]+")) //if word matches any special word case, add count of special words then continue to next word
{
specialWord++;
continue;
}
if(!textInfo[i].matches(".*\\w.*")) continue; //also if text matches white space then continue
boolean found = false;
for(Word word: words) //check whether word already exists in list -- if so add count
{
if(word.getWord().equals(textInfo[i]))
{
word.addOccurence(1);
word.addLine(lineCount);
found = true;
}
}
if(!found) //else add new entry
{
words.add(new Word(textInfo[i], lineCount, 1));
}
}
}
//adds data from capital word to lowercase word ATTEMPT AT PROPER NOUNS HERE
for(Word word: words)
{
for(int i=0; i<words.size(); i++)
{
if(Character.isUpperCase(word.getWord().charAt(0)) && word.getWord().toLowerCase().equals(words.get(i).getWord()))
{
words.get(i).addOccurence(word.getOccurence());
words.get(i).addLine(word.getLine());
}
}
}
Comparator<Word> occurenceComparator = new Comparator<Word>() //compares words by number of occurrences, highest first
{
public int compare(Word n1, Word n2)
{
if(n1.getOccurence() < n2.getOccurence()) return 1;
else if (n1.getOccurence() == n2.getOccurence()) return 0;
else return -1;
}
};
Collections.sort(words);
// Collections.sort(words, occurenceComparator);
// ArrayList<Word> top_words = new ArrayList<Word>(words.subList(0,100));
// Collections.sort(top_words);
System.out.printf("%-15s%-15s%s\n", "Word", "Occurences", "Word Distribution Index");
for(Word word: words)
{
word.setTotalLine(lineCount);
System.out.println(word);
}
System.out.println(wordCount);
System.out.printf("%s%.3f\n","The connecting word index is ",specialWord*100.0/wordCount);
}
catch(IOException ex)
{
System.out.println("WEB URL NOT FOUND");
}
}
}
The formatting is a bit off; I'm not sure how to fix it.
The loop marked above checks whether a word is capitalized and, if a lowercase version of the word also exists, adds its data to the lowercase word. However, this does not account for words whose lowercase version never appears in the text, such as "Four" or "Now". How might I go about this without cross-referencing a dictionary?
EDIT: I HAVE SOLVED THE PROBLEM MYSELF.
Thank you, however, to Wes for attempting to answer.

It seems like your algorithm is to assume any word that appears capitalized but does not appear uncapitalized is a proper noun. So if that's the case, then you can use the following algorithm to get the proper nouns.
//Assume you have tokenized your whole file into a Collection called allWords.
HashSet<String> lowercaseWords = new HashSet<>();
HashMap<String,String> lowerToCap = new HashMap<>();
for(String word: allWords) {
if (Character.isUpperCase(word.charAt(0))){
lowerToCap.put(word.toLowerCase(),word);
}
else {
lowercaseWords.add(word.toLowerCase());
}
}
//keep only the capitalized words that never appear in lowercase; these are treated as proper nouns
HashSet<String> properNounKeys = new HashSet<>(lowerToCap.keySet());
properNounKeys.removeAll(lowercaseWords);
for(String properNounLower:properNounKeys) {
System.out.println("Proper Noun: "+ lowerToCap.get(properNounLower));
}
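The heuristic described in this answer can be sketched as a small self-contained program. The class name `ProperNounHeuristic` and the method `capitalizedOnly` are made up for illustration, and the input is assumed to be already tokenized. This is only the "capitalized but never lowercase" rule, not real part-of-speech tagging, so a word such as "Now" that only ever starts a sentence would still be reported.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ProperNounHeuristic {
    // Returns the words that occur capitalized but never in lowercase form.
    static Set<String> capitalizedOnly(List<String> allWords) {
        Set<String> lowercaseSeen = new HashSet<>();
        Set<String> capitalizedSeen = new HashSet<>();
        for (String word : allWords) {
            if (word.isEmpty()) {
                continue; // skip empty tokens produced by splitting
            }
            if (Character.isUpperCase(word.charAt(0))) {
                capitalizedSeen.add(word);
            } else {
                lowercaseSeen.add(word);
            }
        }
        // TreeSet keeps the result sorted so the output is deterministic.
        Set<String> result = new TreeSet<>();
        for (String word : capitalizedSeen) {
            if (!lowercaseSeen.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("The", "dog", "saw", "Alice", "and", "the", "cat");
        System.out.println(capitalizedOnly(words)); // prints [Alice]
    }
}
```

"The" is dropped because "the" also occurs, while "Alice" is kept because "alice" never does.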

Related

Find words in String consisting of all distinct characters without using Java Collection Framework

I need your help. I have been stuck on one problem for several hours.
/**
 * Find the word consisting only of distinct characters. Return the first such word if there are several.
 * @param words Input array of words
 * @return First word consisting only of distinct characters
 */
public String findWordConsistingOfVariousCharacters(String[] words) {
throw new UnsupportedOperationException("You need to implement this method");
}
@Test
public void testFindWordConsistingOfVariousCharacters() {
String[] input = new String[] {"aaaaaaawe", "qwer", "128883", "4321"};
String expectedResult = "qwer";
StringProcessor stringProcessor = new StringProcessor();
String result = stringProcessor.findWordConsistingOfVariousCharacters(input);
assertThat(String.format("Wrong result of method findWordConsistingOfVariousCharacters (input is %s)", Arrays.toString(input)), result, is(expectedResult));
}
Thank you in advance
Just go through the data and check whether each string is made up of only distinct characters:
public static boolean repeat(String str) {
char[] chars = str.toCharArray();
Arrays.sort(chars);//The same character will only appear in groups
for(int i = 1;i<chars.length;i++) {
if(chars[i] == chars[i - 1]) {
return false;//Same character appeared twice
}
}
return true;//There is no repeating character
}
The method above checks whether a string is made up of distinct characters (despite its name, it returns true when nothing repeats). Now loop through the data:
for(int i = 0;i<input.length;i++){
if(repeat(input[i])){
System.out.println("The answer is " + input[i] + " at index " + i);
break;//you find it! Now break the loop
}
}
Assuming the strings are all ASCII characters, use a boolean[] to mark if you have encountered that character in the word already:
boolean [] encountered = new boolean[256];
for (char c : word.toCharArray()) {
if (encountered[(int)c]) {
// not unique
} else {
encountered[(int)c] = true;
}
}
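Putting the `boolean[]` idea together with the loop over the input gives a version that avoids the Collections Framework entirely. This is a minimal sketch: the class name `DistinctWordFinder`, the static methods, and the choice to return `null` when no word qualifies are assumptions, and characters with a code of 256 or above are simply treated as disqualifying.

```java
class DistinctWordFinder {
    // True if no character occurs twice in the word.
    // Assumes characters below 256; any other character makes the word fail.
    static boolean hasOnlyDistinctChars(String word) {
        boolean[] encountered = new boolean[256];
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c >= 256 || encountered[c]) {
                return false;
            }
            encountered[c] = true;
        }
        return true;
    }

    // Returns the first word made up of distinct characters, or null if there is none.
    static String findWordConsistingOfVariousCharacters(String[] words) {
        for (String word : words) {
            if (hasOnlyDistinctChars(word)) {
                return word;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] input = {"aaaaaaawe", "qwer", "128883", "4321"};
        System.out.println(findWordConsistingOfVariousCharacters(input)); // prints qwer
    }
}
```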

Compare and sort Word entities (starting with vowel symbols) by consonants Java

I have a hierarchy of classes Symbol, Word, Sentence, Text. Each class just contains a field with setters, getters, overridden equals and toString. These classes are kind of nested in each other by lists. It looks like this:
public class Word {
private List<Symbol> symbols;
public Word(List<Symbol> symbols) {
this.symbols = symbols;
}
//getters, setters, toString
}
I need to find the words in the text which start from the vowel letter and sort them according to the consonant letter which goes after the vowel. I have been trying to get sentences and words from the text and then defined the words I need to sort. However, I do not know how to change compare in my helper method sort so it could actually compare consonant letters.
public void sortWords(Text text) {
List<Sentence> sentences = text.getSentences();
List<Word> words = new ArrayList<>();
for (Sentence sentence : sentences) {
words.addAll(sentence.getWords());
}
List<Symbol> symbols = new ArrayList<>();
for (Word word : words) {
symbols = word.getSymbols();
Symbol first = symbols.get(0);
if (first.isVowel()) {
sort(words);
}
}
}
private void sort(List<Word> words) {
Collections.sort(words, new Comparator<Word>() {
@Override
public int compare(Word word1, Word word2) {
List<Symbol> symbols = word1.getSymbols();
for (Symbol s : symbols) {
if (s.isConsonant()){
//smth should be here
}
}
}
});
}
I would be grateful for any advice!
Input:
Let's imagine that there is some kind of text here. Although I am not sure that you will find some sense in these words but there is text. It should be about programming but I did not figure out what to write exactly.
Expected output (words starting from vowels are sorted by first occurrence of consonants):
I, about, of, although, imagine, am, in, is, is, it, exactly
My output: there is not output yet because I have not finished the method
I don't know the details of your Symbol class, so I have left part of the code for you to fill in. If the comparison results in 0, you may want to still sort the words (Comparator<String> will help).
public void sortWords(Text text) {
List<Sentence> sentences = text.getSentences();
List<Word> words = new ArrayList<>();
for (Sentence sentence : sentences) {
words.addAll(sentence.getWords());
}
List<Word> wordsStartingWithAVowel = new ArrayList<>();
for (Word word : words) {
if (word.getSymbols().get(0).isVowel()) {
wordsStartingWithAVowel.add(word);
}
}
sort(wordsStartingWithAVowel);
}
private void sort(List<Word> words) {
Collections.sort(words, new Comparator<Word>() {
@Override
public int compare(Word word1, Word word2) {
int result;
Symbol firstConsonantWord1 = null; // get the first consonant of the first word
List<Symbol> symbols = word1.getSymbols();
for (Symbol s : symbols) {
if (s.isConsonant()){
firstConsonantWord1 = s;
break;
}
}
Symbol firstConsonantWord2 = null; // get the first consonant of the second word
symbols = word2.getSymbols(); // reuse the variable declared above
for (Symbol s : symbols) {
if (s.isConsonant()){
firstConsonantWord2 = s;
break;
}
}
if(firstConsonantWord1 == null && firstConsonantWord2 == null) // both words don’t contain any consonants
result = 0;
else if(firstConsonantWord1 == null)
result = -1;
else if(firstConsonantWord2 == null)
result = 1;
else { // both words contain at least one consonant
result = new Comparator<Symbol>(){
@Override
public int compare(Symbol symbol1, Symbol symbol2) {
// insert comparison of symbols here, depends on your Symbol class
}
}.compare(firstConsonantWord1, firstConsonantWord2);
}
// if result is 0 you may want to do further comparisons (like a standard String comparison)
return result;
}
});
}
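Since the Symbol class is not shown, the same idea can be tried out on plain strings first. The sketch below is an assumption-laden stand-in: `VowelWordSorter` and its helpers are made-up names, only a, e, i, o, u count as vowels, every other letter counts as a consonant, and words without any consonant sort first, as in the answer above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class VowelWordSorter {
    static boolean isVowel(char c) {
        return "aeiou".indexOf(Character.toLowerCase(c)) >= 0;
    }

    // Returns the first consonant in lowercase, or 0 if the word has none,
    // so that words without consonants sort before all others.
    static char firstConsonant(String word) {
        for (char c : word.toCharArray()) {
            if (Character.isLetter(c) && !isVowel(c)) {
                return Character.toLowerCase(c);
            }
        }
        return 0;
    }

    // Keeps the words starting with a vowel and sorts them by their first consonant.
    static List<String> sortVowelWords(List<String> words) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            if (!word.isEmpty() && isVowel(word.charAt(0))) {
                result.add(word);
            }
        }
        // List.sort is stable, so ties keep their order of occurrence.
        result.sort(Comparator.comparingInt(VowelWordSorter::firstConsonant));
        return result;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("it", "about", "I", "of", "there", "exactly", "am");
        System.out.println(sortVowelWords(words)); // prints [I, about, of, am, it, exactly]
    }
}
```

With a real Symbol class, `firstConsonant` would return a Symbol and the comparator would compare whatever character that Symbol wraps.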

parse a document with million words

I have implemented some code to find the anagrams in the text file sample.txt and print them to the console. The file contains one string (word) per line.
Is this the right approach if I want to find the anagrams in a text file with a million or even 20 billion words? If not, which technology should I use in that case?
I appreciate any help.
Sample
abac
aabc
hddgfs
fjhfhr
abca
rtup
iptu
xyz
oifj
zyx
toeiut
yxz
jrgtoi
oupt
Output
abac aabc abca
xyz zyx yxz
Code
package org.reader;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class Test {
// To store the anagram words
static List<String> match = new ArrayList<String>();
// Flag to check whether the checkWorld1InMatch() was invoked.
static boolean flagCheckWord1InMatch;
public static void main(String[] args) {
String fileName = "G:\\test\\sample2.txt";
StringBuilder sb = new StringBuilder();
// In case of matching, this flag is used to append the first word to
// the StringBuilder once.
boolean flag = true;
BufferedReader br = null;
try {
// convert the data in the sample.txt file to list
List<String> list = Files.readAllLines(Paths.get(fileName));
for (int i = 0; i < list.size(); i++) {
flagCheckWord1InMatch = true;
String word1 = list.get(i);
for (int j = i + 1; j < list.size(); j++) {
String word2 = list.get(j);
boolean isExist = false;
if (match != null && !match.isEmpty() && flagCheckWord1InMatch) {
isExist = checkWord1InMatch(word1);
}
if (isExist) {
// A word with the same characters was checked before
// and there is no need to check it again. Therefore, we
// jump to the next word in the list.
// flagCheckWord1InMatch = true;
break;
} else {
boolean result = isAnagram(word1, word2);
if (result) {
if (flag) {
sb.append(word1 + " ");
flag = false;
}
sb.append(word2 + " ");
}
if (j == list.size() - 1 && sb != null && !sb.toString().isEmpty()) {
match.add(sb.toString().trim());
sb.setLength(0);
flag = true;
}
}
}
}
} catch (
IOException e) {
e.printStackTrace();
} finally {
try {
if (br != null) {
br.close();
}
} catch (IOException ex) {
ex.printStackTrace();
}
}
for (String item : match) {
System.out.println(item);
}
// System.out.println("Sihwail");
}
private static boolean checkWord1InMatch(String word1) {
flagCheckWord1InMatch = false;
boolean isAvailable = false;
for (String item : match) {
String[] content = item.split(" ");
for (String word : content) {
if (word1.equals(word)) {
isAvailable = true;
break;
}
}
}
return isAvailable;
}
public static boolean isAnagram(String firstWord, String secondWord) {
char[] word1 = firstWord.toCharArray();
char[] word2 = secondWord.toCharArray();
Arrays.sort(word1);
Arrays.sort(word2);
return Arrays.equals(word1, word2);
}
}
For 20 billion words you will not be able to hold all of them in RAM, so you need an approach that processes them in chunks.
20,000,000,000 words. Java needs quite a lot of memory to store strings so you can count 2 bytes per character and at least 38 bytes overhead.
This means 20,000,000,000 words of one character would need 800,000,000,000 bytes or 800 GB, which is more than any computer I know has.
Your file will contain much less than 20,000,000,000 different words, so you might avoid the memory problem if you store every word only once (e.g. in a Set).
First for a smaller number.
Rather than reading all lines into memory at once, read the file line by line and collect the words in a more suitable data structure.
Map<String, Set<String>> mapSortedToWords = new HashMap<>();
Path path = Paths.get(fileName);
try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
for (;;) {
String word = in.readLine();
if (word == null) {
break;
}
String key = sorted(word);
Set<String> words = mapSortedToWords.get(key);
if (words == null) {
words = new TreeSet<String>();
mapSortedToWords.put(key, words);
}
words.add(word);
}
}
for (Set<String> anagrams : mapSortedToWords.values()) {
if (anagrams.size() > 1) {
... anagrams
}
}
static String sorted(String word) {
char[] letters = word.toCharArray();
Arrays.sort(letters);
return new String(letters);
}
For each sorted-letter key, the map stores the set of words that are anagrams of one another, comparable to your output line abac aabc abca.
For a large number a database where you store (sortedLetters, word) would be better. An embedded database like Derby or H2 poses no installation problems.
For the kind of file size that you specify ( 20 billion words), obviously there are two main problems with your code,
List<String> list = Files.readAllLines(Paths.get(fileName));
AND
for (int i = 0; i < list.size(); i++)
These two lines in your program raise two questions:
Do you have enough memory to read the full file in one go?
Is it OK to iterate 20 billion times?
For most systems, the answer to both questions is no.
So your target is to cut down the memory footprint and reduce the number of iterations.
So you need to read your file chunk by chunk and use some kind of search data structure (like a trie) to store your words.
You will find numerous questions on SO for both of above topics like,
Fastest way to incrementally read a large file
Finding anagrams for a given word
Above algorithm says that you have to first create a dictionary for your words.
Anyway, I believe there is no ready made answer for you. Take a file with one billion words ( that is a very difficult task in itself ) and see what works and what doesn't but your current code will obviously not work.
Hope it helps !!
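Use a stream to read the file. That way you are only storing one word at a time.
FileReader file = new FileReader("file.txt"); // character stream over the file
StringBuilder word = new StringBuilder();
int c;
while((c = file.read()) != -1) // read() returns the next character as an int, or -1 at the end of the stream
{
if(c != '\n')
{
word.append((char) c);
}
else {
process(word.toString()); // do whatever you want
word.setLength(0);
}
}
if(word.length() > 0) process(word.toString()); // last word if the file has no trailing newline
file.close();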
Update
You can use a map for finding the anagrams like below. For each word you have you may sort its chars and obtain a sorted String. So, this would be the key of your anagrams map. And values of this key will be the other anagram words.
public void findAnagrams(String[] yourWords) {
Map<String, List<String>> anagrams = new HashMap<String, List<String>>();
for (String word : yourWords) {
String sortedWord = sortedString(word);
List<String> values = anagrams.get(sortedWord);
if (values == null)
values = new LinkedList<>();
values.add(word);
anagrams.put(sortedWord, values);
}
System.out.println(anagrams);
}
private static String sortedString(String originalWord) {
char[] chars = originalWord.toCharArray();
Arrays.sort(chars);
String sorted = new String(chars);
return sorted;
}
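The sorted-letters map and line-by-line reading can be combined into one small program. This is a sketch for inputs whose distinct words fit in memory, not a solution for 20 billion words; the class name `AnagramGroups` and the in-memory sample read through a `StringReader` are assumptions made so that the example runs without a file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class AnagramGroups {
    // The sorted letters of a word serve as the key shared by all its anagrams.
    static String sortedLetters(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Reads one word per line and returns the groups with at least two distinct words.
    static List<SortedSet<String>> findAnagrams(BufferedReader in) throws IOException {
        Map<String, SortedSet<String>> groups = new TreeMap<>();
        String word;
        while ((word = in.readLine()) != null) {
            if (word.isEmpty()) {
                continue; // ignore blank lines
            }
            groups.computeIfAbsent(sortedLetters(word), key -> new TreeSet<>()).add(word);
        }
        List<SortedSet<String>> result = new ArrayList<>();
        for (SortedSet<String> group : groups.values()) {
            if (group.size() > 1) {
                result.add(group);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        String sample = "abac\naabc\nhddgfs\nabca\nxyz\nzyx\nyxz\n";
        System.out.println(findAnagrams(new BufferedReader(new StringReader(sample))));
        // prints [[aabc, abac, abca], [xyz, yxz, zyx]]
    }
}
```

For a real file, the reader would come from `Files.newBufferedReader` instead of the `StringReader`; only one line is held in memory at a time, plus one entry per distinct word.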

How do I find the letters of a certain word within a Character array in Java?

I need to compare the characters from two different Character Arrays to find the Hidden word inputted by the user.
The goal is to input 2 Strings and to find a word scrambled within the other.
Ex. The word "tot" is scrambled in the word "tomato"
With the help of some people of the forums, I have implemented character arrays to store the user Strings, but I do not know a way to check each array for the characters needed. I have tried the code below but it always results in the program not being able to find the word. If anyone could provide a better method or solution I'd highly appreciate it.
public static void main(String[] args) {
input = new Scanner(System.in);
System.out.println("Please enter a word");
String word = input.next();
char[] charOfWrds = word.toCharArray();
System.out.println("Please enter a hidden word you would like to search for");
String search = input.next();
char[] charOfSrch = search.toCharArray();
if (isContains(charOfWrds, charOfSrch))
{
System.out.print("The word " + search + " is found in the word " + word);
}
else
{
System.out.print("The word was not found in " + word);
}
}
public static Boolean isContains(char[] charOfWrds, char[] charOfSrch) {
int count = 0;
for (char cha : charOfWrds)
{
for (char chaaa : charOfSrch)
{
if (cha == chaaa)
count++;
}
}
if (count == charOfSrch.length)
{
return true;
}
return false;
}
public static Boolean isContains(char[] charOfWords, char[] charOfSrch) {
List<Character> searchFor = new ArrayList<>(); //letters of the hidden word
List<Character> searchMe=new ArrayList<>(); //letters of the word being searched
for(char c:charOfSrch)
searchFor.add(c);
for(char c:charOfWords)
searchMe.add(c);
for(int x=searchFor.size()-1;x>=0;x--){
if(searchMe.contains(searchFor.get(x))){
searchMe.remove(searchFor.get(x));//line A
searchFor.remove(x);
}
}
return searchFor.size()==0;
}
A quick overview of what this does: I converted both character arrays into Lists so that I could use the List methods. Then I iterated through every character of the word you need to find, and whenever I could find it in the other word, I removed it from both. If all the letters have been removed from the word to be found, the word is hidden in the other word; otherwise it is not.
I assumed that you could not reuse letters in the second word, but if you can, then just remove line A.
You should try regular expression rather than trying to write an algo.
Build the expression using user input and then match with the desired word.
take
tomato as charOfWrds
tot as charOfSrch
isContains will count 6 because you don't quit when you find a letter of the first word in the second.
t : two times
o : two times
t : two times
try this :
if (cha == chaaa){
count++;
break;
}
but to make this work you need to remove the letter once found from the second string, because if the word you're looking for is "tttt", then this code will give you true even if it's not, but if you remove t when you found it then it should do the trick.
I don't know if that's clear enough for you.
here is the code :
public static Boolean isContains(char[] charOfWrds, char[] charOfSrch) {
int count = 0;
for (int i=0; i<charOfWrds.length;i++){
for (int j=0;j<charOfSrch.length;j++){
if (charOfWrds[i] == charOfSrch[j]){
count++;
charOfSrch[j]=' ';
break;
}
}
}
if (count == charOfSrch.length)
{
return true;
}
return false;
}
It's working, i tried it with this :
String word = "tomato";
char[] charOfWrds = word.toCharArray();
String search = "tot";
char[] charOfSrch = search.toCharArray();
But this is ugly, you should try to use the java Api, unless you really have to do it with arrays.
My idea was similar to what @kirbyquerby provided except that it has a few optimizations.
Instead of using a linear search, after converting each word (the needle and the haystack) to a list, we sort those lists. This allows us to use binary search which changes the search complexity from O(n^2) to O(n log n).
Additionally, there is no need to remove characters from the needle list as we can just keep track of how many needle characters have been found. Once we are done searching, we simply compare the number of found needle characters to the total number of needle characters. If they are equal, we have found our needle. Lastly, if a needle character is not found, we can stop searching immediately as we know that the entire needle does not exist within the haystack.
package hiddenword;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public class HiddenWord {
public static void main(final String args[]) {
final String haystack = "Tomato";
final String needle = "tot";
if (contains(haystack, needle)) {
System.out.println(haystack + " contains " + needle);
} else {
System.out.println(haystack + " does not contain " + needle);
}
}
private static boolean contains(final String haystack, final String needle) {
// Convert each word to lowercase and into a List.
final List<Character> haystackChars = toSortedCharacterList(haystack.toLowerCase());
final List<Character> needleChars = toSortedCharacterList(needle.toLowerCase());
int foundNeedleChars = 0;
for (final Character c : needleChars) {
// Using sorted lists, our search time is O(n log n) by using
// binary search instead of O(n^2) using linear search.
int index = Collections.binarySearch(haystackChars, c);
// A needle character has been found, remove it from the haystack
// and increment the count
if (index >= 0) {
haystackChars.remove(index);
++foundNeedleChars;
}
// A needle character was not found. This means that the haystack
// doesn't contain every character of the needle and we
// can quit early
else {
return false;
}
}
// If we've found all the needle characters, the needle word exists in
// the haystack word.
return foundNeedleChars == needleChars.size();
}
private static List<Character> toSortedCharacterList(final String input) {
final List<Character> list = new ArrayList<Character>();
// Convert primitive array to List
for (final char c : input.toCharArray()) {
list.add(c);
}
// Sort that thang
Collections.sort(list);
return list;
}
}
You can do this without having to convert the strings to an array with something like:
static boolean containsScrambled(String word, String search){
//loop through each character of the word whose characters we are searching for
for(int x = 0; x<search.length(); x++){
//find the current character to check
String c = search.substring(x, x+1);
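int index = word.indexOf(c);
if(index >= 0){
//if the character is in the word, remove the first instance of it
//as we cannot use that character again; strings are immutable,
//so the shortened string must be assigned back to word
word = word.substring(0, index) + word.substring(index + 1);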
}else{
//if the character is not in the word, fail and return false
return false;
}
}
//if we made it here, all of the characters existed, return true
return true;
}
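Another way to check for a scrambled word is to count letters instead of removing them. The following is a minimal sketch of that alternative, not the method of any answer above; the class name `ScrambledWord` is made up, and the comparison is case-sensitive.

```java
class ScrambledWord {
    // True if every character of search can be taken from word,
    // using each character of word at most once.
    static boolean containsScrambled(String word, String search) {
        // One counter per possible char value.
        int[] available = new int[Character.MAX_VALUE + 1];
        for (char c : word.toCharArray()) {
            available[c]++;
        }
        for (char c : search.toCharArray()) {
            if (available[c] == 0) {
                return false; // this letter is missing or already used up
            }
            available[c]--;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(containsScrambled("tomato", "tot"));  // prints true
        System.out.println(containsScrambled("tomato", "tttt")); // prints false
    }
}
```

Each string is scanned once, so the check takes time linear in the combined length of the two words.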

Problem implementing classifier algorithm for whitespace separated words

I have a text that I split into words separated by whitespace.
I'm classifying units, and this works if the value and the unit occur in the same word (e.g. '100m', '90kg', '140°F', 'US$500'), but I'm having problems when they appear separately, each part in its own word (e.g. '100 °C', 'US$ 450', '150 km').
The classifier can tell when it has found a unit whose value is missing, and whether that value belongs on the left or the right side.
My question is: how can I iterate over all the words in a list while providing the correct words to the classifier?
This is only an example of the code. I have tried it in many ways.
for(String word: words){
String category = classifier.classify(word);
if(classifier.needPreviousWord()){
// ?
}
if(classifier.needNextWord()){
// ?
}
}
In other words, I need to iterate over the list classifying all the words. If the previous word is needed for the test, I have to provide the previous word and the unit; if the next word is needed, the unit and the next word. It appears to be simple, but I don't know how to do it.
Don't use an implicit iterator in your for loop, but an explicit one. A ListIterator can go back and forth as you like (a plain Iterator has no previous()).
ListIterator<String> i = words.listIterator();
while (i.hasNext()) {
String category = classifier.classify(i.next());
if(classifier.needPreviousWord()){
i.previous();
}
if(classifier.needNextWord()){
i.next();
}
}
This is not complete, because I don't know what your classifier does exactly, but it should give you an idea on how to proceed.
This could help.
public static void main(String [] args)
{
List<String> words = new ArrayList<String>();
String previousWord = "";
String nextWord = "";
for(int i=0; i < words.size(); i++) {
if(i > 0) {
previousWord = words.get(i-1);
}
String currentWord = words.get(i);
if(i < words.size() - 1) {
nextWord = words.get(i+1);
} else {
nextWord = "";
}
String category = classifier.classify(currentWord);
if(classifier.needPreviousWord()){
if(previousWord.length() == 0) {
System.out.println("ERROR: missing previous unit");
} else {
System.out.println(previousWord + currentWord);
}
}
if(classifier.needNextWord()){
if(nextWord.length() == 0) {
System.out.println("ERROR: missing next unit");
} else {
System.out.println(currentWord + nextWord);
}
}
}
}
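One way to avoid going back and forth is to merge split tokens before classifying them, looking one word ahead. The sketch below is a guess at how that could look, since the classifier is not shown: `UnitTokenMerger`, `isNumber`, `isSuffixUnit` and `isPrefixUnit` are hypothetical stand-ins with a hard-coded unit list, and a real classifier would take their place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnitTokenMerger {
    // Hypothetical stand-ins for the checks the real classifier would make.
    static boolean isNumber(String s) {
        return s.matches("\\d+(\\.\\d+)?");
    }

    static boolean isSuffixUnit(String s) {
        return s.equals("km") || s.equals("kg") || s.equals("m");
    }

    static boolean isPrefixUnit(String s) {
        return s.equals("US$");
    }

    // Joins "150" + "km" and "US$" + "450" into single tokens; other words pass through.
    static List<String> mergeUnits(List<String> words) {
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String current = words.get(i);
            String next = i + 1 < words.size() ? words.get(i + 1) : null;
            boolean numberThenUnit = next != null && isNumber(current) && isSuffixUnit(next);
            boolean unitThenNumber = next != null && isPrefixUnit(current) && isNumber(next);
            if (numberThenUnit || unitThenNumber) {
                merged.add(current + next);
                i++; // the next word has been consumed
            } else {
                merged.add(current);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("ran", "150", "km", "for", "US$", "450");
        System.out.println(mergeUnits(words)); // prints [ran, 150km, for, US$450]
    }
}
```

Each merged token can then be passed to `classifier.classify` exactly like a word that was written without a space.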
