Finding 'n' most frequent words from a file using Java?

I want to read a file and collect the top n words by frequency.
I have tried the following code to count every word in a string.
public static void main(String[] args) throws FileNotFoundException, IOException {
    FileReader fr = new FileReader("txtFile.txt");
    BufferedReader br = new BufferedReader(fr);
    String text = "";
    String sz = null;
    while ((sz = br.readLine()) != null) {
        text = text.concat(sz).concat(" "); // keep a separator so words on adjacent lines don't merge
    }
    String[] words = text.split(" ");
    String[] uniqueLabels;
    int count = 0;
    System.out.println(text);
    uniqueLabels = getLabels(words);
    for (String l : uniqueLabels) {
        if ("".equals(l) || null == l) {
            break;
        }
        for (String s : words) {
            if (l.equals(s)) {
                count++;
            }
        }
        System.out.println("Word :: " + l + " Count :: " + count);
        count = 0;
    }
}
And I used the following code to collect the unique labels (words); I got it from this link:
private static String[] getLabels(String[] keys) {
    String[] uniqueKeys = new String[keys.length];
    uniqueKeys[0] = keys[0];
    int uniqueKeyIndex = 1;
    boolean keyAlreadyExists = false;
    for (int i = 1; i < keys.length; i++) {
        for (int j = 0; j <= uniqueKeyIndex; j++) {
            if (keys[i].equals(uniqueKeys[j])) {
                keyAlreadyExists = true;
            }
        }
        if (!keyAlreadyExists) {
            uniqueKeys[uniqueKeyIndex] = keys[i];
            uniqueKeyIndex++;
        }
        keyAlreadyExists = false;
    }
    return uniqueKeys;
}
This works fine. Now I want to collect the top 10 words ranked by their frequency in the file.

First of all, if you want it to run moderately fast, don't loop through all the Strings in an array; use a HashMap, or even a map for primitives.
Then go through the words. If a word is already in the map, increment its value; otherwise, put a 1.
In the end, sort the map entries by value and fetch the first 10.
Not a total duplicate, but this answer pretty much shows how to get the counting done: Calculating frequency of each word in a sentence in java
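A minimal sketch of that counting step with a HashMap (splitting on whitespace is a simplifying assumption, not from the answer):

import java.util.HashMap;
import java.util.Map;

public static Map<String, Integer> countWords(String[] words) {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (String word : words) {
        if (!word.isEmpty()) {
            // increment if present, otherwise start at 1; merge() avoids the null check
            counts.merge(word, 1, Integer::sum);
        }
    }
    return counts;
}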

I recommend using a HashMap<String, Integer> to count the word frequency. A hash map stores key-value pairs: the key (your word) is unique, and the value is variable. If you perform a put operation with an already existing key, the value is updated.
HashMap
Something like this should work:
hashmap.put(key, hashmap.get(key) + 1); // note: get(key) returns null for a word not yet in the map, so put a 1 on first sight
To get the top ten words, I would sort the map entries by value and retrieve the first ten.
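A sketch of that last step (it assumes a counts map built as above and Java 8 for the comparator helpers):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public static List<Map.Entry<String, Integer>> topTen(Map<String, Integer> counts) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
    // highest count first
    entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
    return entries.subList(0, Math.min(10, entries.size()));
}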

I solved it like this:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Scanner;

public class wordFreq {
    private static String[] w = null;
    private static int[] r = null;

    public static void main(String[] args) {
        try {
            System.out.println("Enter 'n' value :: ");
            Scanner in = new Scanner(System.in);
            int n = in.nextInt();
            w = new String[n];
            r = new int[n];
            FileReader fr = new FileReader("acq.txt");
            BufferedReader br = new BufferedReader(fr);
            String text = "";
            String sz = null;
            while ((sz = br.readLine()) != null) {
                text = text.concat(sz);
            }
            String[] words = text.split(" ");
            String[] uniqueLabels;
            int count = 0;
            uniqueLabels = getUniqLabels(words);
            for (int j = 0; j < n; j++) {
                r[j] = 0;
            }
            for (String l : uniqueLabels) {
                if ("".equals(l) || null == l) {
                    break;
                }
                // count the occurrences of l
                for (String s : words) {
                    if (l.equals(s)) {
                        count++;
                    }
                }
                // place l in the first slot it beats (note: this overwrites the slot
                // without shifting, so an earlier entry can be lost)
                for (int i = 0; i < n; i++) {
                    if (count > r[i]) {
                        r[i] = count;
                        w[i] = l;
                        break;
                    }
                }
                count = 0;
            }
            display(n);
        } catch (Exception e) {
            System.err.println("ERR " + e.getMessage());
        }
    }

    public static void display(int n) {
        for (int k = 0; k < n; k++) {
            System.out.println("Label :: " + w[k] + "\tCount :: " + r[k]);
        }
    }

    private static String[] getUniqLabels(String[] keys) {
        String[] uniqueKeys = new String[keys.length];
        uniqueKeys[0] = keys[0];
        int uniqueKeyIndex = 1;
        boolean keyAlreadyExists = false;
        for (int i = 1; i < keys.length; i++) {
            for (int j = 0; j <= uniqueKeyIndex; j++) {
                if (keys[i].equals(uniqueKeys[j])) {
                    keyAlreadyExists = true;
                }
            }
            if (!keyAlreadyExists) {
                uniqueKeys[uniqueKeyIndex] = keys[i];
                uniqueKeyIndex++;
            }
            keyAlreadyExists = false;
        }
        return uniqueKeys;
    }
}
And the sample output is,
Enter 'n' value ::
5
Label :: computer Count :: 30
Label :: company Count :: 22
Label :: express Count :: 20
Label :: offer Count :: 16
Label :: shearson Count :: 16

Related

Finding most and least frequent words from a text file in Java

This was a question I came across while practicing some Java questions online. I concentrated on finding the most frequent words, since I thought coding the least frequent part would be easy. I finally managed to code the most frequent part, but I'm unable to code the least frequent part. Expecting some help from you guys.
Thanks in advance.
This is the code for the most frequent part:
import java.io.*;
import java.util.Scanner;

public class wordFreq {
    private static String[] w = null;
    private static int[] r = null;

    public static void main(String[] args) {
        try {
            System.out.println("Enter 'n' value :: ");
            Scanner in = new Scanner(System.in);
            int n = in.nextInt();
            w = new String[n];
            r = new int[n];
            FileReader fr = new FileReader("acq.txt");
            BufferedReader br = new BufferedReader(fr);
            String text = "";
            String sz = null;
            while ((sz = br.readLine()) != null) {
                text = text.concat(sz);
            }
            String[] words = text.split(" ");
            String[] uniqueLabels;
            int count = 0;
            uniqueLabels = getUniqLabels(words);
            for (int j = 0; j < n; j++) {
                r[j] = 0;
            }
            for (String l : uniqueLabels) {
                if ("".equals(l) || null == l) {
                    break;
                }
                for (String s : words) {
                    if (l.equals(s)) {
                        count++;
                    }
                }
                for (int i = 0; i < n; i++) {
                    if (count > r[i]) {
                        r[i] = count;
                        w[i] = l;
                        break;
                    }
                    /* else if (count == 1) {
                        System.out.println("least frequent");
                        System.out.println("(" + w[i] + ":" + r[i] + "),");
                    } */
                }
                count = 0;
            }
            display(n);
        } catch (Exception e) {
            System.err.println("ERR " + e.getMessage());
        }
    }

    public static void display(int n) {
        System.out.println("Most Frequent");
        for (int k = 0; k < n; k++) {
            System.out.print("(" + w[k] + ":" + r[k] + "),");
        }
    }

    private static String[] getUniqLabels(String[] keys) {
        String[] uniqueKeys = new String[keys.length];
        uniqueKeys[0] = keys[0];
        int uniqueKeyIndex = 1;
        boolean keyAlreadyExists = false;
        for (int i = 1; i < keys.length; i++) {
            for (int j = 0; j <= uniqueKeyIndex; j++) {
                if (keys[i].equals(uniqueKeys[j])) {
                    keyAlreadyExists = true;
                }
            }
            if (!keyAlreadyExists) {
                uniqueKeys[uniqueKeyIndex] = keys[i];
                uniqueKeyIndex++;
            }
            keyAlreadyExists = false;
        }
        return uniqueKeys;
    }
}
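For the least frequent part, here is a minimal sketch of one possible approach, built on a word-count map rather than the parallel arrays above (this is an illustration, not code from the thread):

import java.util.Collections;
import java.util.Map;

public static void printLeastFrequent(Map<String, Integer> counts) {
    if (counts.isEmpty()) {
        return;
    }
    int min = Collections.min(counts.values());
    System.out.println("Least Frequent");
    // every word tied for the minimum count is printed
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (e.getValue() == min) {
            System.out.print("(" + e.getKey() + ":" + e.getValue() + "),");
        }
    }
}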

How can I extract trend words from a given dataset (Java)? [duplicate]

See "N-gram generation from a sentence" below; the question and answers were identical.

Count occurrences of each letter in a loop and display the letter(s) with the most occurrences

I'm having trouble using this code I found on the net. My goal is to count the number of times each letter shows up and display the letter with the most occurrences; if two or more letters occur the same number of times, they should all show up.
This is my current output:
[screenshot of the current output]
Here is the code I found on the net and am working with:
public void fcount(String str) {
    int[] occurence = new int[255];
    // Scanner scanner = new Scanner(System.in);
    // str = scanner.nextLine();
    // optional: put everything in uppercase
    str = str.toUpperCase();
    // convert to char
    char[] digit = str.toCharArray();
    // count
    for (int i = 0; i < digit.length; i++)
        occurence[digit[i]]++;
    // find max
    int max = 0;       // max value
    char maxValue = 0; // max index
    for (int i = 0; i < occurence.length; i++) {
        // new max?
        if (occurence[i] > max) {
            max = occurence[i];
            maxValue = (char) i;
        }
    }
    // result
    System.out.println("Character used " + max + " times is: " + (char) maxValue);
    // return "";
}
And here is the loop where I'm using it:
public void calpha() {
    char startUpper = 'A';
    String cones = null;
    for (int i = 0; i < 12; i++) {
        cones = Character.toString(startUpper);
        System.out.println(startUpper);
    }
    fcount(cones);
}
There is an error in your loop:
cones = Character.toString(startUpper);
You are just re-assigning the value of cones, so fcount receives a string containing only the last character.
A solution is:
cones += Character.toString(startUpper);
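Applied to the loop, that gives something like this (initializing cones to "" rather than null so the first append doesn't produce "null..."; incrementing startUpper is an assumption about the intent, so the twelve characters are not all 'A'):

char startUpper = 'A';
String cones = "";
for (int i = 0; i < 12; i++) {
    cones += Character.toString(startUpper); // append instead of overwrite
    System.out.println(startUpper);
    startUpper++; // assumed intent: move on to the next letter
}
fcount(cones);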
You have an issue in your int[] occurence = new int[255]; statement and its usage: occurence[digit[i]]++ may throw an ArrayIndexOutOfBoundsException, since a char value can be as large as 2^16 - 1.
Your code cannot deal with non-ASCII characters. Mine does:
import java.util.*;

class Problem {
    public static void main(String args[]) {
        final String input = "I see trees outside of my window.".replace(" ", "");
        final List<Character> chars = new ArrayList<>(input.length());
        for (final char c : input.toCharArray()) {
            chars.add(c);
        }
        int maxFreq = 0;
        final Set<Character> mostFrequentChars = new HashSet<>();
        for (final char c : chars) {
            final int freq = Collections.frequency(chars, c);
            if (freq > maxFreq) {
                mostFrequentChars.clear();
                mostFrequentChars.add(c);
                maxFreq = freq;
            } else if (freq == maxFreq) {
                mostFrequentChars.add(c);
            }
        }
        for (Character c : mostFrequentChars) {
            System.out.println(c);
        }
    }
}
Try this code:
public static void main(String[] args) throws IOException {
    char startUpper = 'A';
    String cones = "";
    for (int i = 0; i < 12; i++) {
        cones += Character.toString(startUpper);
        System.out.println(startUpper);
    }
    fcount(cones);
}

public static void fcount(String str) {
    HashMap<Character, Integer> hashMap = new HashMap<Character, Integer>();
    HashSet<Character> letters = new HashSet<Character>();
    str = str.toUpperCase();
    // assume that string str has at least 1 char
    int max = 1;
    for (int i = 0; i < str.length(); i++) {
        int newValue = 1;
        if (hashMap.containsKey(str.charAt(i))) {
            newValue = hashMap.get(str.charAt(i)) + 1;
            hashMap.put(str.charAt(i), newValue);
            if (newValue > max) {
                // a new maximum invalidates the letters collected so far
                max = newValue;
                letters.clear();
                letters.add(str.charAt(i));
            } else if (newValue == max) {
                letters.add(str.charAt(i));
            }
        } else {
            hashMap.put(str.charAt(i), newValue);
        }
    }
    System.out.println("Character used " + max + " times is: " + Arrays.toString(letters.toArray()));
}

How can I avoid repetition of the same number?

This is what I want:
Let the user enter as many numbers as they want until a non-number is entered (you may assume there will be less than 100 numbers). Find the most frequently entered number. (If there is more than one, print all of them.)
Example output:
Input: 5
Input: 4
Input: 9
Input: 9
Input: 4
Input: 1
Input: a
Most common: 4, 9
I have got to the point in my code where I have managed to find out which are the most common numbers. However, I don't want to print the same number over and over again; with the example above it would print: Most common: 4, 9, 9, 4.
What needs to be done?
public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String[] input = new String[100];
    System.out.print("Input: ");
    input[0] = in.readLine();
    int size = 0;
    for (int i = 1; i < 100 && isNumeric(input[i - 1]); i++) {
        System.out.print("Input: ");
        input[i] = in.readLine();
        size = size + 1;
    }
    /* for (int i = 0; i < size; i++) { // testing
        System.out.println(input[i]);
    } */
    int numOccur;
    int[] occur = new int[size];
    for (int i = 0; i < size; i++) {
        numOccur = 0;
        for (int j = 0; j < size; j++) {
            if (input[i].equals(input[j])) {
                numOccur = numOccur + 1;
            }
        }
        occur[i] = numOccur;
        // System.out.println(numOccur); // testing
    }
    int maxOccur = 0;
    for (int i = 0; i < size; i++) {
        if (occur[i] > maxOccur) {
            maxOccur = occur[i];
        }
    }
    // System.out.println(maxOccur); // testing
    // ("&& !numFound" dropped from this loop: that flag is only declared in the solution below)
    for (int i = 0; i < size; i++) {
        if (occur[i] == maxOccur) {
            System.out.println(input[i]);
        }
    }
}

// checks if s is an int, true if it is an int
public static boolean isNumeric(String s) {
    try {
        Integer.parseInt(s);
        return true; // parse was successful
    } catch (NumberFormatException nfe) {
        return false;
    }
}
Found the solution!
String[] mostCommon = new String[size];
int numMostCommon = 0;
boolean numFound = false;
for (int i = 0; i < size; i++) {
    int isDifferent = 0;
    if (occur[i] == maxOccur) {
        for (int j = 0; j < size; j++) {
            if (!(input[i].equals(mostCommon[j]))) {
                isDifferent = isDifferent + 1;
            }
        }
        if (isDifferent == size) {
            mostCommon[numMostCommon] = input[i];
            numMostCommon = numMostCommon + 1;
        }
    }
}
System.out.print("Most common: "); // print the prefix once, not once per number
for (int i = 0; i < numMostCommon - 1; i++) {
    System.out.print(mostCommon[i] + ", ");
}
System.out.println(mostCommon[numMostCommon - 1]);
You could use a hash table to store the frequencies, since the limit is very small, i.e. less than 100.
The pseudocode would be like:
vector<int> hash(101)
cin >> input
if (isnumeric(input))
    hash[input]++
else {
    max = max_element(hash.begin(), hash.end())
    for (int i = 0; i <= 100; i++)
        if (hash[i] == max)
            print i
}
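A Java sketch of the same idea (it assumes, as the pseudocode does, that all inputs are integers between 0 and 100):

import java.util.Scanner;

public class MostCommon {
    public static void main(String[] args) {
        int[] freq = new int[101];
        Scanner in = new Scanner(System.in);
        // tally until something that is not an integer is entered
        while (in.hasNextInt()) {
            freq[in.nextInt()]++;
        }
        int max = 0;
        for (int f : freq) {
            max = Math.max(max, f);
        }
        for (int i = 0; i <= 100; i++) {
            if (max > 0 && freq[i] == max) {
                System.out.println(i);
            }
        }
    }
}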
Set<String> uniqueMaxOccur = new HashSet<String>(); // input[] holds Strings, so use Set<String>
for (int i = 0; i < size; i++) {
    if (occur[i] == maxOccur) {
        // System.out.println(input[i]);
        uniqueMaxOccur.add(input[i]);
    }
}
and display the values in the set.
You can use a Set and store the values already printed.
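Applied to the question's variables, a minimal sketch:

import java.util.HashSet;
import java.util.Set;

Set<String> printed = new HashSet<String>();
for (int i = 0; i < size; i++) {
    // add() returns false when the value was already printed
    if (occur[i] == maxOccur && printed.add(input[i])) {
        System.out.println(input[i]);
    }
}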
What about something like this?
public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    Map<String, Integer> numberLookup = new HashMap<String, Integer>();
    boolean doContinue = true;
    while (doContinue) {
        System.out.print("Input: ");
        String input = in.readLine();
        if (isNumeric(input)) {
            if (!numberLookup.containsKey(input))
                numberLookup.put(input, 1);
            else
                numberLookup.put(input, numberLookup.get(input) + 1);
        } else {
            doContinue = false;
        }
    }
    int maxOccur = Collections.max(numberLookup.values());
    System.out.println("These numbers were all entered " + maxOccur + " times:");
    for (Map.Entry<String, Integer> pairs : numberLookup.entrySet()) {
        if (pairs.getValue() == maxOccur) { // print only the most frequent entries
            System.out.println(pairs.getKey());
        }
    }
}
Sorry, I'm a C# person and don't have a Java compiler on me, so this might need some tweaking.

N-gram generation from a sentence

How to generate an n-gram of a string like:
String Input="This is my car."
I want to generate n-gram with this input:
Input Ngram size = 3
Output should be:
This
is
my
car
This is
is my
my car
This is my
is my car
Give me some idea in Java of how to implement that, or whether any library is available for it.
I am trying to use this NGramTokenizer, but it's giving n-grams of character sequences and I want n-grams of word sequences.
I believe this would do what you want:
import java.util.*;

public class Test {

    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i + n));
        return ngrams;
    }

    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}
Output:
This
is
my
car.
This is
is my
my car.
This is my
is my car.
An "on-demand" solution implemented as an Iterator:
class NgramIterator implements Iterator<String> {
String[] words;
int pos = 0, n;
public NgramIterator(int n, String str) {
this.n = n;
words = str.split(" ");
}
public boolean hasNext() {
return pos < words.length - n + 1;
}
public String next() {
StringBuilder sb = new StringBuilder();
for (int i = pos; i < pos + n; i++)
sb.append((i > pos ? " " : "") + words[i]);
pos++;
return sb.toString();
}
public void remove() {
throw new UnsupportedOperationException();
}
}
You are looking for ShingleFilter.
Update: The link points to version 3.0.2. This class may be in a different package in newer versions of Lucene.
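A rough usage sketch against a Lucene 5-era API (package names and constructors vary between versions, so treat this as a starting point rather than a drop-in):

import java.io.StringReader;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// emits word shingles of size 2..3; unigrams are included by default
WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
tokenizer.setReader(new StringReader("This is my car."));
ShingleFilter shingles = new ShingleFilter(tokenizer, 2, 3);
CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
shingles.reset();
while (shingles.incrementToken()) {
    System.out.println(term.toString());
}
shingles.end();
shingles.close();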
This code returns an array of all Strings of the given length:
public static String[] ngrams(String s, int len) {
    String[] parts = s.split(" ");
    String[] result = new String[parts.length - len + 1];
    for (int i = 0; i < parts.length - len + 1; i++) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < len; k++) {
            if (k > 0) sb.append(' ');
            sb.append(parts[i + k]);
        }
        result[i] = sb.toString();
    }
    return result;
}
E.g.
System.out.println(Arrays.toString(ngrams("This is my car", 2)));
//--> [This is, is my, my car]
System.out.println(Arrays.toString(ngrams("This is my car", 3)));
//--> [This is my, is my car]
/**
 * @param str should have at least one word
 * @param maxGramSize should be at least 1
 * @return list of continuous word n-grams up to maxGramSize from the sentence
 */
public static List<String> generateNgramsUpto(String str, int maxGramSize) {
    List<String> sentence = Arrays.asList(str.split("[\\W+]"));
    List<String> ngrams = new ArrayList<String>();
    int ngramSize = 0;
    StringBuilder sb = null;
    // sentence becomes ngrams
    for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
        String word = (String) it.next();
        // 1 - add the word itself
        sb = new StringBuilder(word);
        ngrams.add(word);
        ngramSize = 1;
        it.previous();
        // 2 - prepend the predecessors of the word and add those n-grams too
        while (it.hasPrevious() && ngramSize < maxGramSize) {
            sb.insert(0, ' ');
            sb.insert(0, it.previous());
            ngrams.add(sb.toString());
            ngramSize++;
        }
        // go back to the initial position
        while (ngramSize > 0) {
            ngramSize--;
            it.next();
        }
    }
    return ngrams;
}
Call:
long startTime = System.currentTimeMillis();
ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
long stopTime = System.currentTimeMillis();
System.out.println("My time = " + (stopTime - startTime) + " ms with ngramsize = " + ngrams.size());
System.out.println(ngrams.toString());
Output:
My time = 1 ms with ngramsize = 9
[This, is, This is, my, is my, This is my, car, my car, is my car]
public static void CreateNgram(ArrayList<String> list, int cutoff) {
    try {
        NGramModel ngramModel = new NGramModel();
        POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);
        perfMon.start();
        for (int i = 0; i < list.size(); i++) {
            String inputString = list.get(i);
            ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(inputString));
            String line;
            while ((line = lineStream.read()) != null) {
                String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
                String[] tags = tagger.tag(whitespaceTokenizerLine);
                POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
                perfMon.incrementCounter();
                String words[] = sample.getSentence();
                if (words.length > 0) {
                    // collect bigrams and trigrams (n = 2, 3)
                    for (int k = 2; k < 4; k++) {
                        ngramModel.add(new StringList(words), k, k);
                    }
                }
            }
        }
        // drop n-grams seen fewer than 'cutoff' times
        ngramModel.cutoff(cutoff, Integer.MAX_VALUE);
        Iterator<StringList> it = ngramModel.iterator();
        while (it.hasNext()) {
            StringList strList = it.next();
            System.out.println(strList.toString());
        }
        perfMon.stopAndPrintFinalResult();
    } catch (Exception e) {
        System.out.println(e.toString());
    }
}
Here is my code to create n-grams, in this case for n = 2 and 3. Word n-grams whose count is smaller than the cutoff value are dropped from the result set. The input is a list of sentences, which are parsed and POS-tagged using OpenNLP.
public static void main(String[] args) {
    String[] words = "This is my car.".split(" ");
    for (int n = 0; n < 3; n++) {
        List<String> list = ngrams(n, words);
        for (String ngram : list) {
            System.out.println(ngram);
        }
        System.out.println();
    }
}

// stepSize is n - 1: stepSize 0 yields unigrams, 1 yields bigrams, and so on
public static List<String> ngrams(int stepSize, String[] words) {
    List<String> ngrams = new ArrayList<String>();
    for (int i = 0; i < words.length - stepSize; i++) {
        String initialWord = "";
        int internalCount = i;
        int internalStepSize = i + stepSize;
        while (internalCount <= internalStepSize && internalCount < words.length) {
            initialWord = initialWord + " " + words[internalCount];
            ++internalCount;
        }
        ngrams.add(initialWord);
    }
    return ngrams;
}
Check this out:
public static void main(String[] args) {
    NGram nGram = new NGram();
    String[] tokens = "this is my car".split(" ");
    int i = tokens.length;
    List<String> ngrams = new ArrayList<>();
    while (i >= 1) {
        ngrams.addAll(nGram.getNGram(tokens, i, new ArrayList<>()));
        i--;
    }
    System.out.println(ngrams);
}

private List<String> getNGram(String[] tokens, int n, List<String> ngrams) {
    StringBuilder strbldr = new StringBuilder();
    if (tokens.length < n) {
        return ngrams;
    } else {
        for (int i = 0; i < n; i++) {
            strbldr.append(tokens[i]).append(" ");
        }
        ngrams.add(strbldr.toString().trim());
        String[] newTokens = Arrays.copyOfRange(tokens, 1, tokens.length);
        return getNGram(newTokens, n, ngrams);
    }
}
A simple recursive function, with better running time.
