Problem implementing a classifier algorithm for whitespace-separated words - Java

I have a text that I split into words separated by whitespace.
I'm classifying units, and it works when the value and the unit occur in the same word (e.g. '100m', '90kg', '140°F', 'US$500'), but I'm having problems when they appear separately, each part in its own word (e.g. '100 °C', 'US$ 450', '150 km').
The classifier can recognize a unit on its own and tell whether the missing value belongs on its left or right side.
My question is: how can I iterate over all the words in the list while providing the correct words to the classifier?
This is only example code; I have tried many approaches.
for (String word : words) {
    String category = classifier.classify(word);
    if (classifier.needPreviousWord()) {
        // ?
    }
    if (classifier.needNextWord()) {
        // ?
    }
}
In other words, I need to iterate over the list classifying all the words; if the previous word is needed, provide that previous word together with the unit, and if the next word is needed, provide the unit together with the next word. It seems simple, but I don't know how to do it.

Don't use the implicit iterator of the for-each loop; use an explicit ListIterator instead (a plain Iterator cannot go backwards). Then you can go back and forth as you like.
ListIterator<String> i = words.listIterator();
while (i.hasNext()) {
    String category = classifier.classify(i.next());
    if (classifier.needPreviousWord()) {
        i.previous(); // moves the cursor back; note the first previous() returns the word just visited
    }
    if (classifier.needNextWord()) {
        i.next(); // advances the cursor to the next word
    }
}
This is not complete, because I don't know what your classifier does exactly, but it should give you an idea of how to proceed.

This could help.
public static void main(String[] args) {
    List<String> words = new ArrayList<String>(); // fill with your whitespace-separated words
    String previousWord = "";
    String nextWord = "";
    for (int i = 0; i < words.size(); i++) {
        if (i > 0) {
            previousWord = words.get(i - 1);
        }
        String currentWord = words.get(i);
        if (i < words.size() - 1) {
            nextWord = words.get(i + 1);
        } else {
            nextWord = "";
        }
        String category = classifier.classify(currentWord);
        if (classifier.needPreviousWord()) {
            if (previousWord.length() == 0) {
                System.out.println("ERROR: missing previous word");
            } else {
                System.out.println(previousWord + currentWord);
            }
        }
        if (classifier.needNextWord()) {
            if (nextWord.length() == 0) {
                System.out.println("ERROR: missing next word");
            } else {
                System.out.println(currentWord + nextWord);
            }
        }
    }
}

Related

Turning the Nth character (input from user) into uppercase and the rest into lowercase

I will ask this again. I have this problem: create a program that reads a string input from the user (a sentence or a word). The Nth character (N given by the user) of each word should be turned into uppercase and the rest into lowercase.
Example:
string = "good morning everyone"
n = 2
Output = gOod mOrning eVeryone
String s = "good morning everyone";
int n = 2;
String temp = "";
for (int x = 0; x < s.length(); x++) {
    if (x == n - 1) {
        temp += ("" + s.charAt(x)).toUpperCase();
    } else {
        temp += ("" + s.charAt(x)).toLowerCase();
    }
}
s = temp;
System.out.println(s);
Output: gOod morning everyone
I know what you want to happen - but you didn't phrase your question very well. The only part you're missing is iterating through every word in the sentence. If you had asked "how do I apply a function to every word in a String" you would likely have gotten a better response.
This is a bit sloppy since it adds a trailing " " to the end - but you could fix that easily (one way is shown after the code).
public class Test {
    static String test = "This is a test.";

    public static void main(String[] args) {
        String[] words = test.split(" ");
        String result = "";
        for (String word : words) {
            result += nthToUpperCase(word, 2);
            result += " ";
        }
        System.out.println(result);
    }

    public static String nthToUpperCase(String s, int n) {
        String temp = "";
        for (int i = 0; i < s.length(); i++) {
            if (i == (n - 1)) {
                temp += Character.toString(s.charAt(i)).toUpperCase();
            } else {
                temp += Character.toString(s.charAt(i));
            }
        }
        return temp;
    }
}
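As an aside (not part of the original answer), one simple way to drop the trailing space is to collect the transformed words into a list and join them. This fragment plugs into the main method above, reuses nthToUpperCase, and additionally needs java.util.List and java.util.ArrayList imports:
    List<String> transformed = new ArrayList<>();
    for (String word : words) {
        transformed.add(nthToUpperCase(word, 2));
    }
    // String.join inserts the separator only between elements, so there is no trailing space
    String result = String.join(" ", transformed);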
You can do this with two for loops. Iterate over each word and within the iteration iterate over each character.
toUpperCase(2, "good morning everyone");

private static void toUpperCase(int nth, String sentence) {
    StringBuilder result = new StringBuilder();
    for (String word : sentence.split(" ")) {
        for (int i = 0; i < word.length(); i++) {
            if (i > 0 && i % nth - 1 == 0) {
                result.append(Character.toString(word.charAt(i)).toUpperCase());
            } else {
                result.append(word.charAt(i));
            }
        }
        result.append(" ");
    }
    System.out.println(result);
}
gOoD mOrNiNg eVeRyOnE

How do I exclude capitalizing specific words in a String?

I'm new to programming, and I'm required to capitalise the user's input while excluding certain words from capitalisation.
For example, if the input is
THIS IS A TEST, I get This Is A Test.
However, I want the This is a Test format.
String s = in.nextLine();
StringBuilder sb = new StringBuilder(s.length());
String wordSplit[] = s.trim().toLowerCase().split("\\s");
String[] t = {"is", "but", "a"};
for (int i = 0; i < wordSplit.length; i++) {
    if (wordSplit[i].equals(t))
        sb.append(wordSplit[i]).append(" ");
    else
        sb.append(Character.toUpperCase(wordSplit[i].charAt(0))).append(wordSplit[i].substring(1)).append(" ");
}
System.out.println(sb);
This is the closest I have gotten so far but I seem to be unable to exclude capitalising the specific words.
The problem is that you are comparing each word to the entire array. Java does not disallow this, but it does not really make a lot of sense. Instead, you could loop over each word in the array and compare those, but that's a bit lengthy in code, and also not very fast if the array of words gets bigger.
Instead, I'd suggest creating a Set from the array and checking whether it contains the word:
String[] t = {"is","but","a"};
Set<String> t_set = new HashSet<>(Arrays.asList(t));
...
if (t_set.contains(wordSplit[i]) {
...
Your problem (as pointed out by #sleepToken) is that
if(wordSplit[i].equals(t))
is checking to see if the current word is equal to the array containing your keywords.
Instead what you want to do is to check whether the array contains a given input word, like so:
if (Arrays.asList(t).contains(wordSplit[i].toLowerCase()))
Note that there is no case-insensitive contains() method, so it's important to convert the word in question to lower case before searching for it.
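As a small variation (not part of the original suggestion), with Java 8 you could also keep the comparison case-insensitive without lowercasing anything, using equalsIgnoreCase; this fragment assumes the t and wordSplit arrays from the question:
    String current = wordSplit[i]; // lambdas may only capture effectively final locals
    boolean excluded = Arrays.stream(t).anyMatch(w -> w.equalsIgnoreCase(current));
    // 'excluded' is true when the current word matches one of the excluded words, ignoring case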
You're already doing the iteration once. Just do it again; iterate through every String in t for each String in wordSplit:
for (int i = 0; i < wordSplit.length; i++) {
    boolean found = false;
    for (int j = 0; j < t.length; j++) {
        if (wordSplit[i].equals(t[j])) {
            found = true;
        }
    }
    if (found) { /* do your stuff */ }
    else { }
}
First of all, write a method that checks whether a word is contained in the array.
static boolean contains(String word, String[] arr) {
    for (int i = 0; i < arr.length; i++) {
        if (word.equals(arr[i])) {
            return true;
        }
    }
    return false;
}
Then change your condition wordSplit[i].equals(t) to contains(wordSplit[i], t).
In the line if(wordSplit[i].equals(t)) you are not comparing against each word to ignore.
You can do something like this as below:
import java.util.Arrays;
import java.util.List;

public class Sample {
    public static void main(String[] args) {
        String s = "THIS IS A TEST";
        String[] ignore = {"is", "but", "a"};
        List<String> toIgnoreList = Arrays.asList(ignore);
        StringBuilder result = new StringBuilder();
        for (String s1 : s.split(" ")) {
            if (!toIgnoreList.contains(s1.toLowerCase())) {
                result.append(s1.substring(0, 1).toUpperCase())
                      .append(s1.substring(1).toLowerCase())
                      .append(" ");
            } else {
                result.append(s1.toLowerCase())
                      .append(" ");
            }
        }
        System.out.println("Result: " + result);
    }
}
Output is:
Result: This is a Test
To check the words to exclude, the java.util.List contains() method would be a better choice.
The below expression checks if the exclude list contains the word and if not capitalises the first letter:
tlist.contains(x) ? x : (x = x.substring(0,1).toUpperCase() + x.substring(1))
The expression corresponds to:
if (tlist.contains(x)) {   // ?
    x = x;                 // do nothing
} else {                   // :
    x = x.substring(0, 1).toUpperCase() + x.substring(1);
}
or:
if (!tlist.contains(x)) {
    x = x.substring(0, 1).toUpperCase() + x.substring(1);
}
If you're allowed to use Java 8:
String s = in.nextLine();
String wordSplit[] = s.trim().toLowerCase().split("\\s");
List<String> tlist = Arrays.asList("is", "but", "a");
String result = Stream.of(wordSplit)
        .map(x -> tlist.contains(x) ? x : (x = x.substring(0, 1).toUpperCase() + x.substring(1)))
        .collect(Collectors.joining(" "));
System.out.println(result);
Output:
This is a Test

Parse a document with millions of words

I have implemented some code to find the anagram words in the sample.txt file and output them to the console. The txt document contains one String (word) per line.
Is this the right approach if I want to find the anagram words in a txt file with millions or 20 billion words? If not, which technology should I use in this case?
I appreciate any help.
Sample
abac
aabc
hddgfs
fjhfhr
abca
rtup
iptu
xyz
oifj
zyx
toeiut
yxz
jrgtoi
oupt
abac aabc abca
xyz zyx yxz
Code
package org.reader;
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
public class Test {
    // To store the anagram words
    static List<String> match = new ArrayList<String>();
    // Flag to check whether checkWord1InMatch() was invoked.
    static boolean flagCheckWord1InMatch;

    public static void main(String[] args) {
        String fileName = "G:\\test\\sample2.txt";
        StringBuilder sb = new StringBuilder();
        // In case of matching, this flag is used to append the first word to
        // the StringBuilder once.
        boolean flag = true;
        BufferedReader br = null;
        try {
            // convert the data in the sample.txt file to a list
            List<String> list = Files.readAllLines(Paths.get(fileName));
            for (int i = 0; i < list.size(); i++) {
                flagCheckWord1InMatch = true;
                String word1 = list.get(i);
                for (int j = i + 1; j < list.size(); j++) {
                    String word2 = list.get(j);
                    boolean isExist = false;
                    if (match != null && !match.isEmpty() && flagCheckWord1InMatch) {
                        isExist = checkWord1InMatch(word1);
                    }
                    if (isExist) {
                        // A word with the same characters was checked before
                        // and there is no need to check it again. Therefore, we
                        // jump to the next word in the list.
                        // flagCheckWord1InMatch = true;
                        break;
                    } else {
                        boolean result = isAnagram(word1, word2);
                        if (result) {
                            if (flag) {
                                sb.append(word1 + " ");
                                flag = false;
                            }
                            sb.append(word2 + " ");
                        }
                        if (j == list.size() - 1 && sb != null && !sb.toString().isEmpty()) {
                            match.add(sb.toString().trim());
                            sb.setLength(0);
                            flag = true;
                        }
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) {
                    br.close();
                }
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
        for (String item : match) {
            System.out.println(item);
        }
        // System.out.println("Sihwail");
    }

    private static boolean checkWord1InMatch(String word1) {
        flagCheckWord1InMatch = false;
        boolean isAvailable = false;
        for (String item : match) {
            String[] content = item.split(" ");
            for (String word : content) {
                if (word1.equals(word)) {
                    isAvailable = true;
                    break;
                }
            }
        }
        return isAvailable;
    }

    public static boolean isAnagram(String firstWord, String secondWord) {
        char[] word1 = firstWord.toCharArray();
        char[] word2 = secondWord.toCharArray();
        Arrays.sort(word1);
        Arrays.sort(word2);
        return Arrays.equals(word1, word2);
    }
}
For 20 billion words you will not be able to hold all of them in RAM, so you need an approach that processes them in chunks.
20,000,000,000 words: Java needs quite a lot of memory to store strings, so count about 2 bytes per character plus at least 38 bytes of object overhead.
This means 20,000,000,000 words of one character would need 800,000,000,000 bytes, or 800 GB, which is more than any computer I know of has.
Your file will contain far fewer than 20,000,000,000 different words, so you might avoid the memory problem if you store every word only once (e.g. in a Set).
First, a solution for a smaller number of words.
As it is better to use a more suitable data structure anyway, do not read all lines into memory, but read line by line.
Map<String, Set<String>> mapSortedToWords = new HashMap<>();
Path path = Paths.get(fileName);
try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
    for (;;) {
        String word = in.readLine();
        if (word == null) {
            break;
        }
        String key = sorted(word);
        Set<String> words = mapSortedToWords.get(key);
        if (words == null) {
            words = new TreeSet<String>();
            mapSortedToWords.put(key, words);
        }
        words.add(word);
    }
}

for (Set<String> anagrams : mapSortedToWords.values()) {
    if (anagrams.size() > 1) {
        ... anagrams
    }
}

static String sorted(String word) {
    char[] letters = word.toCharArray();
    Arrays.sort(letters);
    return new String(letters);
}
This stores in the map a set of words per sorted-letters key, comparable to the groups abac aabc abca.
For a large number of words, a database where you store (sortedLetters, word) would be better. An embedded database like Derby or H2 poses no installation problems.
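As a rough illustration of that (sortedLetters, word) idea with embedded H2 (assuming the H2 driver is on the classpath; the JDBC URL, table name, and file name are placeholders, and sorted() is the helper defined above):
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Arrays;

public class AnagramDb {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:./anagrams")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS words(sorted_letters VARCHAR(255), word VARCHAR(255))");
            }
            // Stream the input file once and store a (sortedLetters, word) row per word.
            try (PreparedStatement ps = con.prepareStatement("INSERT INTO words VALUES (?, ?)");
                 BufferedReader in = Files.newBufferedReader(Paths.get("sample.txt"), StandardCharsets.UTF_8)) {
                String word;
                while ((word = in.readLine()) != null) {
                    ps.setString(1, sorted(word));
                    ps.setString(2, word);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            // Rows that share the same sorted_letters value are anagrams of each other.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT sorted_letters, word FROM words ORDER BY sorted_letters")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getString(2));
                }
            }
        }
    }

    static String sorted(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }
}
An index on sorted_letters would speed up the grouping query once the table gets large.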
For the kind of file size that you specify (20 billion words), there are obviously two main problems with your code:
List<String> list = Files.readAllLines(Paths.get(fileName));
AND
for (int i = 0; i < list.size(); i++)
These two lines in your program basically raise two questions:
Do you have enough memory to read the full file in one go?
Is it OK to iterate 20 billion times?
For most systems, the answer to both questions would be NO.
So your target is to cut down the memory footprint and reduce the number of iterations.
So you need to read your file chunk by chunk and use some kind of search data structure (like a Trie) to store your words.
You will find numerous questions on SO about both of the above topics, like:
Fastest way to incrementally read a large file
Finding anagrams for a given word
The above algorithm says that you have to first create a dictionary for your words.
Anyway, I believe there is no ready-made answer for you. Take a file with one billion words (that is a very difficult task in itself) and see what works and what doesn't, but your current code will obviously not work.
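Instead of a Trie, one rough way to keep memory flat (a sketch, not a complete solution) is to stream the file once, write a "sortedLetters<TAB>word" line per input word, and let an external sort (for example the Unix sort command) bring anagrams onto adjacent lines; the file names here are placeholders:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class KeyedDump {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("sample.txt");  // one word per line
        Path out = Paths.get("keyed.txt");  // one "sortedLetters<TAB>word" line per word
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String word;
            while ((word = reader.readLine()) != null) {
                char[] letters = word.toCharArray();
                Arrays.sort(letters);
                writer.write(new String(letters) + "\t" + word);
                writer.newLine();
            }
        }
        // After sorting keyed.txt externally, anagram groups can be collected with a
        // single sequential pass over adjacent lines that share the same key.
    }
}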
Hope it helps !!
Use a stream to read the file. That way you are only storing one word at a time.
FileReader file = new FileReader("file.txt"); // character stream over the file
String word = "";
while (file.ready()) { // true while there are characters left to read
    char c = (char) file.read(); // reads one character
    if (c != '\n') {
        word += c;
    } else {
        process(word); // do whatever you want with the word
        word = "";
    }
}
file.close();
Update
You can use a map to find the anagrams, like below. For each word, sort its chars to obtain a sorted String; this becomes the key of your anagrams map, and the values for that key are the words that are anagrams of each other.
public void findAnagrams(String[] yourWords) {
    Map<String, List<String>> anagrams = new HashMap<String, List<String>>();
    for (String word : yourWords) {
        String sortedWord = sortedString(word);
        List<String> values = anagrams.get(sortedWord);
        if (values == null)
            values = new LinkedList<>();
        values.add(word);
        anagrams.put(sortedWord, values);
    }
    System.out.println(anagrams);
}

private static String sortedString(String originalWord) {
    char[] chars = originalWord.toCharArray();
    Arrays.sort(chars);
    String sorted = new String(chars);
    return sorted;
}

Determining whether string is a proper noun in text

I'm trying to parse a text (http://pastebin.com/raw.php?i=0wD91r2i) and retrieve the words and the number of their occurrences. However, I must not include proper nouns within the final output. I'm not quite sure how to accomplish this task.
My attempt at this
public class TextAnalysis {
    public static void main(String[] args) {
        ArrayList<Word> words = new ArrayList<Word>(); // instantiate ArrayList of Word objects
        try {
            int lineCount = 0;
            int wordCount = 0;
            int specialWord = 0;
            URL reader = new URL("http://pastebin.com/raw.php?i=0wD91r2i");
            Scanner in = new Scanner(reader.openStream());
            while (in.hasNextLine()) { // parse the text line by line
                lineCount++;
                // use regex to replace all punctuation with the empty string and split on the whitespace between words
                String textInfo[] = in.nextLine().replaceAll("[^a-zA-Z ]", "").split("\\s+");
                wordCount += textInfo.length;
                for (int i = 0; i < textInfo.length; i++) {
                    // if the word matches any special word, add to the count of special words, then continue to the next word
                    if (textInfo[i].toLowerCase().matches("the|a|an|and|but|or|by|to|for|of|with|without|chapter|[0-9]+")) {
                        specialWord++;
                        continue;
                    }
                    // also continue if the text is only whitespace
                    if (!textInfo[i].matches(".*\\w.*")) continue;
                    boolean found = false;
                    for (Word word : words) { // check whether the word already exists in the list -- if so, add to its count
                        if (word.getWord().equals(textInfo[i])) {
                            word.addOccurence(1);
                            word.addLine(lineCount);
                            found = true;
                        }
                    }
                    if (!found) { // else add a new entry
                        words.add(new Word(textInfo[i], lineCount, 1));
                    }
                }
            }
            // adds data from the capitalized word to the lowercase word -- ATTEMPT AT PROPER NOUNS HERE
            for (Word word : words) {
                for (int i = 0; i < words.size(); i++) {
                    if (Character.isUpperCase(word.getWord().charAt(0))
                            && word.getWord().toLowerCase().equals(words.get(i).getWord())) {
                        words.get(i).addOccurence(word.getOccurence());
                        words.get(i).addLine(word.getLine());
                    }
                }
            }
            // compares the list based on the number of occurrences
            Comparator<Word> occurenceComparator = new Comparator<Word>() {
                public int compare(Word n1, Word n2) {
                    if (n1.getOccurence() < n2.getOccurence()) return 1;
                    else if (n1.getOccurence() == n2.getOccurence()) return 0;
                    else return -1;
                }
            };
            Collections.sort(words);
            // Collections.sort(words, occurenceComparator);
            // ArrayList<Word> top_words = new ArrayList<Word>(words.subList(0, 100));
            // Collections.sort(top_words);
            System.out.printf("%-15s%-15s%s\n", "Word", "Occurences", "Word Distribution Index");
            for (Word word : words) {
                word.setTotalLine(lineCount);
                System.out.println(word);
            }
            System.out.println(wordCount);
            System.out.printf("%s%.3f\n", "The connecting word index is ", specialWord * 100.0 / wordCount);
        } catch (IOException ex) {
            System.out.println("WEB URL NOT FOUND");
        }
    }
}
formatting kind of off, not sure how to do it correctly.
The code determines whether a word is capitalized and, if a lowercase version of the word exists, adds its data to the lowercase word. However, this does not account for words where a lowercase version never appears, such as "Four" or "Now" in the text. How might I go about this without cross-referencing a dictionary?
EDIT: I HAVE SOLVED THE PROBLEM MYSELF.
Thank you, however, to Wes for attempting to answer.
It seems like your algorithm is to assume any word that appears capitalized but does not appear uncapitalized is a proper noun. So if that's the case, then you can use the following algorithm to get the proper nouns.
// Assume you have tokenized your whole file into a Collection called allWords.
HashSet<String> lowercaseWords = new HashSet<>();
HashMap<String, String> lowerToCap = new HashMap<>();
for (String word : allWords) {
    if (Character.isUpperCase(word.charAt(0))) {
        lowerToCap.put(word.toLowerCase(), word);
    } else {
        lowercaseWords.add(word.toLowerCase());
    }
}
// Remove every capitalized word that was also seen in lowercase; only proper nouns are left.
lowerToCap.keySet().removeAll(lowercaseWords);
for (String properNoun : lowerToCap.values()) {
    System.out.println("Proper Noun: " + properNoun);
}

How to Check for Deleted Words Between 2 Sentences in Java

What's the best approach in Java if you want to check for words that were deleted from sentence A in sentence B? For example:
Sentence A: I want to delete unnecessary words on this simple sentence.
Sentence B: I want to delete words on this sentence.
Output: I want to delete (unnecessary) words on this (simple) sentence.
where the words inside the parenthesis are the ones that were deleted from sentence A.
Assuming order doesn't matter: use commons-collections.
Use String.split() to split both sentences into arrays of words.
Use commons-collections' CollectionUtils.addAll to add each array into an empty Set.
Use commons-collections' CollectionUtils.subtract method to get A-B. A short sketch of these steps is shown below.
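A minimal sketch of those steps using commons-collections4 (my choice of version; set difference, so duplicates and word order are ignored, and the class and variable names are just for illustration):
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import org.apache.commons.collections4.CollectionUtils;

public class DeletedWords {
    public static void main(String[] args) {
        String a = "I want to delete unnecessary words on this simple sentence.";
        String b = "I want to delete words on this sentence.";
        Set<String> wordsA = new HashSet<>();
        Set<String> wordsB = new HashSet<>();
        CollectionUtils.addAll(wordsA, a.split("\\s+"));
        CollectionUtils.addAll(wordsB, b.split("\\s+"));
        // A - B: the words present in sentence A but not in sentence B
        Collection<String> deleted = CollectionUtils.subtract(wordsA, wordsB);
        System.out.println(deleted); // e.g. [unnecessary, simple]
    }
}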
Assuming order and position matter, this looks like a variation of the Longest Common Subsequence problem, which has a dynamic programming solution.
Wikipedia has a great page on the topic; there's really too much for me to outline here:
http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
Everyone else is using really heavyweight algorithms for what is actually a very simple problem. It could be solved using longest common subsequence, but this is a very constrained version of that: it's not a full diff, it only includes deletes. No need for dynamic programming or anything like that. Here's a 20-line implementation:
private static String deletedWords(String s1, String s2) {
    StringBuilder sb = new StringBuilder();
    String[] words1 = s1.split("\\s+");
    String[] words2 = s2.split("\\s+");
    int i1, i2;
    i1 = i2 = 0;
    while (i1 < words1.length) {
        // the bounds check guards against running past the end of the shorter sentence
        if (i2 < words2.length && words1[i1].equals(words2[i2])) {
            sb.append(words1[i1]);
            i2++;
        } else {
            sb.append("(" + words1[i1] + ")");
        }
        if (i1 < words1.length - 1) {
            sb.append(" ");
        }
        i1++;
    }
    return sb.toString();
}
When the inputs are the ones in the question, the output matches exactly.
Granted, I understand that for some inputs there are multiple solutions. For example:
a b a
a
could be either a (b) (a) or (a) (b) a and maybe for some versions of this problem, one of these solutions is more likely to be the "actual" solution than the other, and for those you need some recursive or dynamic programming approach... but let's not make it too much more complicated than what Israel Sato originally asked for!
String a = "I want to delete unnecessary words on this simple sentence.";
String b = "I want to delete words on this sentence.";
String[] aWords = a.split(" ");
String[] bWords = b.split(" ");
List<String> missingWords = new ArrayList<String> ();
int x = 0;
for(int i = 0 ; i < aWords.length; i++) {
String aWord = aWords[i];
if(x < bWords.length) {
String bWord = bWords[x];
if(aWord.equals(bWord)) {
x++;
} else {
missingWords.add(aWord);
}
} else {
missingWords.add(aWord);
}
}
This also works well for updated strings; updated strings are enclosed in square brackets.
import java.util.*;

class Sample {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        String str1 = sc.nextLine();
        String str2 = sc.nextLine();
        List<String> flist = Arrays.asList(str1.split("\\s+"));
        List<String> slist = Arrays.asList(str2.split("\\s+"));
        List<String> completedString = new ArrayList<String>();
        String result = "";
        String updatedString = "";
        String deletedString = "";
        int i = 0;
        int startIndex = 0;
        int endIndex = 0;
        for (String word : slist) {
            if (flist.contains(word)) {
                endIndex = flist.indexOf(word);
                if (!completedString.contains(word)) {
                    if (deletedString.isEmpty()) {
                        for (int j = startIndex; j < endIndex; j++) {
                            deletedString += flist.get(j) + " ";
                        }
                    }
                }
                startIndex = endIndex + 1;
                if (!deletedString.isEmpty()) {
                    result += "(" + deletedString.substring(0, deletedString.length() - 1) + ") ";
                    deletedString = "";
                }
                if (!updatedString.isEmpty()) {
                    result += "[" + updatedString.substring(0, updatedString.length() - 1) + "] ";
                    updatedString = "";
                }
                result += word + " ";
                completedString.add(word);
                if (i == slist.size() - 1) {
                    endIndex = flist.size();
                    for (int j = startIndex; j < endIndex; j++) {
                        deletedString += flist.get(j) + " ";
                    }
                    startIndex = endIndex + 1;
                }
            } else {
                if (i == 0) {
                    boolean boundaryCheck = false;
                    for (int j = i + 1; j < slist.size(); j++) {
                        if (flist.contains(slist.get(j))) {
                            endIndex = flist.indexOf(slist.get(j));
                            boundaryCheck = true;
                            break;
                        }
                    }
                    if (!boundaryCheck) {
                        endIndex = flist.size();
                    }
                    if (!completedString.contains(word)) {
                        for (int j = startIndex; j < endIndex; j++) {
                            deletedString += flist.get(j) + " ";
                        }
                    }
                    startIndex = endIndex + 1;
                } else if (i == slist.size() - 1) {
                    endIndex = flist.size();
                    if (!completedString.contains(word)) {
                        for (int j = startIndex; j < endIndex; j++) {
                            deletedString += flist.get(j) + " ";
                        }
                    }
                    startIndex = endIndex + 1;
                }
                updatedString += word + " ";
                completedString.add(word);
            }
            i++;
        }
        if (!deletedString.isEmpty()) {
            result += "(" + deletedString.substring(0, deletedString.length() - 1) + ") ";
        }
        if (!updatedString.isEmpty()) {
            result += "[" + updatedString.substring(0, updatedString.length() - 1) + "] ";
        }
        System.out.println(result);
    }
}
This is basically a diff; take a look at this:
diff
and the root algorithm:
Longest common subsequence problem
Here's a sample Java implementation:
http://introcs.cs.princeton.edu/java/96optimization/Diff.java.html
which compares lines. The only thing you need to do is split by word instead of by line, or alternatively put each word of both sentences on a separate line.
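If you go the latter route, a tiny (hypothetical) helper can put each word on its own line before the sentences are fed to a line-based diff:
// Hypothetical helper, not part of the linked Diff.java: one word per line.
static String oneWordPerLine(String sentence) {
    return String.join("\n", sentence.split("\\s+"));
}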
If you are, e.g., on Linux, you can actually see the result of the latter option using the diff program itself before you even write any code. Try this:
$ echo "I want to delete unnecessary words on this simple sentence."|tr " " "\n" > 1
$ echo "I want to delete words on this sentence."|tr " " "\n" > 2
$ diff -uN 1 2
--- 1 2012-10-01 19:40:51.998853057 -0400
+++ 2 2012-10-01 19:40:51.998853057 -0400
@@ -2,9 +2,7 @@
want
to
delete
-unnecessary
words
on
this
-simple
sentence.
The lines with - in front were removed (alternatively, it would show + for lines added in sentence B that were not in sentence A). Try it out to see if that fits your problem.
Hope this helps.
