I have a text file of 50 string lines of varying length and content. I need to read the file and sort the lines in ascending order. Sorting condition: the number of words in a line that start with the letter "a".
public static void main(String[] args) throws FileNotFoundException {
String token1 = "";
Scanner inFile1 = new Scanner(new File("E:\\text.txt"));
List<String> temps = new LinkedList<String>();
// read the file line by line; a custom delimiter is not needed with nextLine()
while (inFile1.hasNextLine()) {
token1 = inFile1.nextLine();
temps.add(token1);
}
inFile1.close();
String[] tempsArray = temps.toArray(new String[0]);
for (int i = 0; i < tempsArray.length; i++) {
System.out.println(tempsArray[i]);
}
int cnt = 0; //number of words in the string line
for (int i=0; i<tempsArray.length; i++) {
int k=0; //number of words that start with the letter "a"
System.out.println("Line № = " + i);
StringTokenizer st = new StringTokenizer(tempsArray[i]);
while (st.hasMoreTokens()) {
cnt++;
String s= st.nextToken();
if (s.charAt(0)=='a') {
k++;
}
}
System.out.println("Number of words = " + cnt);
cnt=0;
System.out.println("Number of words starting with 'a' = " + k);
}
}
I used a Map as Kau advised me. But a Map uses unique keys, while my k can have repeated values, so the Map can't find the appropriate string element. What other Collection can I use?
I am assuming you already have the algorithm for shell sort to sort an array of integers. Let the method be shellSort(int[] a).
What you can do is create a map whose key is k and whose value is the list of lines with that count (a list handles duplicate k values). At the same time, build an array of integers that holds all the k values. Then call shellSort on that array. Then read back from the sorted array, look in the map using the array elements as keys, fetch the corresponding map values (which are the lines), and put them one by one into a result list, which should finally contain all the lines sorted in the desired way.
Below is the code (untested) just to give an idea.
public static void main(String[] args) throws FileNotFoundException {
String token1 = "";
Scanner inFile1 = new Scanner(new File("E:\\text.txt"));
List<String> temps = new LinkedList<String>();
// read the file line by line; a custom delimiter is not needed with nextLine()
while (inFile1.hasNextLine()) {
token1 = inFile1.nextLine();
temps.add(token1);
}
inFile1.close();
String[] tempsArray = temps.toArray(new String[0]);
for (int i = 0; i < tempsArray.length; i++) {
System.out.println(tempsArray[i]);
}
int cnt = 0; //number of words in the string line
Map<Integer, List<String>> myMap = new HashMap<Integer, List<String>>();
int[] countArr = new int[tempsArray.length];
for (int i=0; i<tempsArray.length; i++) {
int k=0; //number of words that start with the letter "a"
System.out.println("Line № = " + i);
StringTokenizer st = new StringTokenizer(tempsArray[i]);
while (st.hasMoreTokens()) {
cnt++;
String s= st.nextToken();
if (s.charAt(0)=='a') {
k++;
}
}
countArr[i] = k;
List<String> listOfLines = myMap.get(k);
if(listOfLines == null){
listOfLines = new ArrayList<String>();
listOfLines.add(tempsArray[i]);
myMap.put(k, listOfLines);
} else{
listOfLines.add(tempsArray[i]);
}
System.out.println("Number of words = " + cnt);
cnt=0;
System.out.println("Number of words starting with 'a' = " + k);
}
//Call shellsort here on the array of k values
shellSort(countArr);
List<String> sortedListOfLines = new ArrayList<String>();
for(int i=0; i<countArr.length; i++){
List<String> lineList = myMap.get(countArr[i]);
if(lineList != null){
sortedListOfLines.addAll(lineList);
// null out the entry so the same count appearing again in the sorted array isn't added twice
myMap.put(countArr[i], null);
}
}
}
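As an aside, if the hand-written shell sort is not a hard requirement, the same ordering can be expressed directly with a Comparator, which sidesteps the duplicate-key problem entirely. A minimal sketch (the sample lines and the countAWords helper are illustrative, not from the question):

```java
import java.util.*;

public class SortByACount {
    // Counts the words in a line that start with the letter 'a'
    static int countAWords(String line) {
        int k = 0;
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty() && word.charAt(0) == 'a') {
                k++;
            }
        }
        return k;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(Arrays.asList(
                "an apple and an apricot",
                "no matches here",
                "a single one"));
        // Sort ascending by the count; the sort is stable, so ties keep their file order
        Collections.sort(lines, new Comparator<String>() {
            public int compare(String a, String b) {
                return Integer.compare(countAWords(a), countAWords(b));
            }
        });
        for (String line : lines) {
            System.out.println(countAWords(line) + ": " + line);
        }
    }
}
```

Because the count is recomputed inside the comparator, no map is needed at all; for 50 lines the extra recomputation cost is negligible.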
How to generate an n-gram of a string like:
String Input="This is my car."
I want to generate n-gram with this input:
Input Ngram size = 3
Output should be:
This
is
my
car
This is
is my
my car
This is my
is my car
Give me some idea in Java of how to implement that, or point me to a library if one is available.
I am trying to use this NGramTokenizer but it's giving n-grams of character sequences and I want n-grams of word sequences.
I believe this would do what you want:
import java.util.*;
public class Test {
public static List<String> ngrams(int n, String str) {
List<String> ngrams = new ArrayList<String>();
String[] words = str.split(" ");
for (int i = 0; i < words.length - n + 1; i++)
ngrams.add(concat(words, i, i+n));
return ngrams;
}
public static String concat(String[] words, int start, int end) {
StringBuilder sb = new StringBuilder();
for (int i = start; i < end; i++)
sb.append((i > start ? " " : "") + words[i]);
return sb.toString();
}
public static void main(String[] args) {
for (int n = 1; n <= 3; n++) {
for (String ngram : ngrams(n, "This is my car."))
System.out.println(ngram);
System.out.println();
}
}
}
Output:
This
is
my
car.
This is
is my
my car.
This is my
is my car.
An "on-demand" solution implemented as an Iterator:
class NgramIterator implements Iterator<String> {
String[] words;
int pos = 0, n;
public NgramIterator(int n, String str) {
this.n = n;
words = str.split(" ");
}
public boolean hasNext() {
return pos < words.length - n + 1;
}
public String next() {
StringBuilder sb = new StringBuilder();
for (int i = pos; i < pos + n; i++)
sb.append((i > pos ? " " : "") + words[i]);
pos++;
return sb.toString();
}
public void remove() {
throw new UnsupportedOperationException();
}
}
You are looking for ShingleFilter.
Update: the link points to version 3.0.2. This class may be in a different package in newer versions of Lucene.
This code returns an array of all Strings of the given length:
public static String[] ngrams(String s, int len) {
String[] parts = s.split(" ");
String[] result = new String[parts.length - len + 1];
for(int i = 0; i < parts.length - len + 1; i++) {
StringBuilder sb = new StringBuilder();
for(int k = 0; k < len; k++) {
if(k > 0) sb.append(' ');
sb.append(parts[i+k]);
}
result[i] = sb.toString();
}
return result;
}
E.g.
System.out.println(Arrays.toString(ngrams("This is my car", 2)));
//--> [This is, is my, my car]
System.out.println(Arrays.toString(ngrams("This is my car", 3)));
//--> [This is my, is my car]
/**
*
* @param str the sentence; should have at least one word
* @param maxGramSize should be at least 1
* @return list of contiguous word n-grams up to maxGramSize from the sentence
*/
public static List<String> generateNgramsUpto(String str, int maxGramSize) {
List<String> sentence = Arrays.asList(str.split("[\\W+]"));
List<String> ngrams = new ArrayList<String>();
int ngramSize = 0;
StringBuilder sb = null;
//sentence becomes ngrams
for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
String word = (String) it.next();
//1- add the word itself
sb = new StringBuilder(word);
ngrams.add(word);
ngramSize=1;
it.previous();
//2- insert prevs of the word and add those too
while(it.hasPrevious() && ngramSize<maxGramSize){
sb.insert(0,' ');
sb.insert(0,it.previous());
ngrams.add(sb.toString());
ngramSize++;
}
//go back to initial position
while(ngramSize>0){
ngramSize--;
it.next();
}
}
return ngrams;
}
Call:
long startTime = System.currentTimeMillis();
ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
long stopTime = System.currentTimeMillis();
System.out.println("My time = "+(stopTime-startTime)+" ms with ngramsize = "+ngrams.size());
System.out.println(ngrams.toString());
Output:
My time = 1 ms with ngramsize = 9 [This, is, This is, my, is my, This
is my, car, my car, is my car]
public static void CreateNgram(ArrayList<String> list, int cutoff) {
try
{
NGramModel ngramModel = new NGramModel();
POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
POSTaggerME tagger = new POSTaggerME(model);
perfMon.start();
for(int i = 0; i<list.size(); i++)
{
String inputString = list.get(i);
ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(inputString));
String line;
while ((line = lineStream.read()) != null)
{
String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
String[] tags = tagger.tag(whitespaceTokenizerLine);
POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
perfMon.incrementCounter();
String words[] = sample.getSentence();
if(words.length > 0)
{
for(int k = 2; k< 4; k++)
{
ngramModel.add(new StringList(words), k, k);
}
}
}
}
ngramModel.cutoff(cutoff, Integer.MAX_VALUE);
Iterator<StringList> it = ngramModel.iterator();
while(it.hasNext())
{
StringList strList = it.next();
System.out.println(strList.toString());
}
perfMon.stopAndPrintFinalResult();
}catch(Exception e)
{
System.out.println(e.toString());
}
}
Here is my code to create n-grams. In this case, n = 2, 3. Word n-grams whose count is smaller than the cutoff value are ignored in the result set. The input is a list of sentences, which is then parsed using an OpenNLP tool.
public static void main(String[] args) {
String[] words = "This is my car.".split(" ");
for (int n = 0; n < 3; n++) {
List<String> list = ngrams(n, words);
for (String ngram : list) {
System.out.println(ngram);
}
System.out.println();
}
}
public static List<String> ngrams(int stepSize, String[] words) {
List<String> ngrams = new ArrayList<String>();
for (int i = 0; i < words.length-stepSize; i++) {
String initialWord = "";
int internalCount = i;
int internalStepSize = i + stepSize;
while (internalCount <= internalStepSize
&& internalCount < words.length) {
initialWord = initialWord+" " + words[internalCount];
++internalCount;
}
ngrams.add(initialWord.trim()); // trim the leading space added by the loop
}
return ngrams;
}
Check this out:
public static void main(String[] args) {
NGram nGram = new NGram();
String[] tokens = "this is my car".split(" ");
int i = tokens.length;
List<String> ngrams = new ArrayList<>();
while (i >= 1){
ngrams.addAll(nGram.getNGram(tokens, i, new ArrayList<>()));
i--;
}
System.out.println(ngrams);
}
private List<String> getNGram(String[] tokens, int n, List<String> ngrams) {
StringBuilder strbldr = new StringBuilder();
if (tokens.length < n) {
return ngrams;
}else {
for (int i=0; i<n; i++){
strbldr.append(tokens[i]).append(" ");
}
ngrams.add(strbldr.toString().trim());
String[] newTokens = Arrays.copyOfRange(tokens, 1, tokens.length);
return getNGram(newTokens, n, ngrams);
}
}
Simple recursive function, better running time.
This is my program to remove duplicate words in a string using a Set. The program works fine removing duplicate elements, but the output is not in the correct order.
public class Remove_DuplicateIN_String {
public static void main(String a[]) throws IOException {
String a1;//=new String[200];
int i;
InputStreamReader reader=new InputStreamReader(System.in);
BufferedReader in =new BufferedReader(reader);
System.out.println("Enter the String ");
a1=(in.readLine());
System.out.print(a1);
System.out.println("\n");
String words[]=a1.split(" ");
System.out.println(words.length);
Set<String> uniq=new HashSet<String>();
for(i=0;i<words.length;i++)
{
uniq.add(words[i]);
}
Iterator<String> it=uniq.iterator();
while(it.hasNext())
{
System.out.print(it.next()+" ");
}
}
}
Enter the String
hi hi world hello a
hi hi world hello a
5
hi a world hello
I want output as hi world hello a
Use LinkedHashSet
It maintains insertion order and avoids duplicates.
Set<String> wordSet = new LinkedHashSet<String>();
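For completeness, a self-contained sketch of this approach applied to the question's sample input (the class name is mine, just for illustration):

```java
import java.util.*;

public class DedupWords {
    public static void main(String[] args) {
        String a1 = "hi hi world hello a";
        // LinkedHashSet keeps first-insertion order while dropping later duplicates
        Set<String> uniq = new LinkedHashSet<String>(Arrays.asList(a1.split(" ")));
        System.out.println(String.join(" ", uniq)); // prints: hi world hello a
    }
}
```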
Use LinkedHashSet.
It maintains insertion order and also avoids duplicate elements.
Set<String> linkedHashSet = new LinkedHashSet<String>();
If you have already stored elements in array of strings, you can use collection api to addAll into set.
String words[]=a1.split(" ");
Set<String> linkedHashSet=new LinkedHashSet<String>();
linkedHashSet.addAll(Arrays.asList(words));
package StringPrograms;
import java.util.Scanner;
public class RemoveDuplicateWords {
public static void main(String[] args) {
boolean flag;
Scanner sc = new Scanner(System.in);
String input = sc.nextLine();
String[] str = input.split(" ");
int count = 0;
String[] out = new String[str.length];
for (int i = 0; i < str.length; i++) {
flag = true;
for (int j = 0; j <count; j++) {
if (str[i].equalsIgnoreCase(out[j])) {
flag = false;
break;
}
}
if (flag) {
out[count] = str[i];
count++;
}
}
for (int k = 0; k < out.length; k++) {
if (out[k] != null)
System.out.print(out[k] + " ");
}
}
}
String noDuplicates = Arrays.stream(startingString.split(" "))
.distinct()
.collect(Collectors.joining(" "));
This approach doesn't handle commas and special characters though.
I want to read a file and collect the top n words by word frequency.
I have tried the following code to count every word in a string.
public static void main(String[] args) throws FileNotFoundException, IOException {
FileReader fr = new FileReader("txtFile.txt");
BufferedReader br = new BufferedReader(fr);
String text = "";
String sz = null;
while ((sz = br.readLine()) != null) {
text = text.concat(sz);
}
String[] words = text.split(" ");
String[] uniqueLabels;
int count = 0;
System.out.println(text);
uniqueLabels = getLabels(words);
for (String l: uniqueLabels) {
if ("".equals(l) || null == l) {
break;
}
for (String s: words) {
if (l.equals(s)) {
count++;
}
}
System.out.println("Word :: " + l + " Count :: " + count);
count = 0;
}
}
And I used the following code, adapted from a link, to collect the unique labels (words):
private static String[] getLabels(String[] keys) {
String[] uniqueKeys = new String[keys.length];
uniqueKeys[0] = keys[0];
int uniqueKeyIndex = 1;
boolean keyAlreadyExists = false;
for (int i = 1; i < keys.length; i++) {
for (int j = 0; j <= uniqueKeyIndex; j++) {
if (keys[i].equals(uniqueKeys[j])) {
keyAlreadyExists = true;
}
}
if (!keyAlreadyExists) {
uniqueKeys[uniqueKeyIndex] = keys[i];
uniqueKeyIndex++;
}
keyAlreadyExists = false;
}
return uniqueKeys;
}
And this works fine. Now I want to collect the top 10 ranked words depending on their frequency in the file.
First of all, if you want it to run moderately fast, don't loop through all the Strings in an array; use a HashMap, or even find some map for primitives.
Then go through the words. If the word is already in the map, increment its value, otherwise put a 1.
In the end, sort the map entries and fetch the first 10.
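A compact sketch of that recipe (the class and method names are mine, just for illustration):

```java
import java.util.*;

public class TopWords {
    // Returns the n most frequent words (ties broken arbitrarily)
    static List<String> topN(String text, int n) {
        // Count each word: increment if present, otherwise start at 1
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.split("\\s+")) {
            counts.merge(w, 1, Integer::sum);
        }
        // Sort the entries by count, descending
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        // Take the first n keys
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        System.out.println(topN("a b a c a b", 2)); // prints: [a, b]
    }
}
```

Each word is visited once for counting, so the whole thing is O(w + u log u) for w words and u unique words, instead of the O(w * u) nested loops in the question.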
Not a total duplicate, but this answer pretty much shows how to get the counting done: Calculating frequency of each word in a sentence in java
I recommend using a HashMap<String, Integer>() to count the word frequency. A hash map uses key-value pairs: the key is unique (your word) and the value varies. If you perform a put operation with an already existing key, the value will be updated.
Hashmap
Something like this should work:
hashmap.put(key, hashmap.get(key) + 1);
To get the top ten words, I would sort the hashmap entries and retrieve the first ten.
I solved it as,
public class wordFreq {
private static String[] w = null;
private static int[] r = null;
public static void main(String[] args){
try {
System.out.println("Enter 'n' value :: ");
Scanner in = new Scanner(System.in);
int n = in.nextInt();
w = new String[n];
r = new int[n];
FileReader fr = new FileReader("acq.txt");
BufferedReader br = new BufferedReader(fr);
String text = "";
String sz = null;
while((sz=br.readLine())!=null){
text = text.concat(sz);
}
String[] words = text.split(" ");
String[] uniqueLabels;
int count = 0;
uniqueLabels = getUniqLabels(words);
for(int j=0; j<n; j++){
r[j] = 0;
}
for(String l: uniqueLabels)
{
if("".equals(l) || null == l)
{
break;
}
for(String s : words)
{
if(l.equals(s))
{
count++;
}
}
for(int i=0; i<n; i++){
if(count>r[i]){
r[i] = count;
w[i] = l;
break;
}
}
count=0;
}
display(n);
} catch (Exception e) {
System.err.println("ERR "+e.getMessage());
}
}
public static void display(int n){
for(int k=0; k<n; k++){
System.out.println("Label :: "+w[k]+"\tCount :: "+r[k]);
}
}
private static String[] getUniqLabels(String[] keys)
{
String[] uniqueKeys = new String[keys.length];
uniqueKeys[0] = keys[0];
int uniqueKeyIndex = 1;
boolean keyAlreadyExists = false;
for(int i=1; i<keys.length ; i++)
{
for(int j=0; j<=uniqueKeyIndex; j++)
{
if(keys[i].equals(uniqueKeys[j]))
{
keyAlreadyExists = true;
}
}
if(!keyAlreadyExists)
{
uniqueKeys[uniqueKeyIndex] = keys[i];
uniqueKeyIndex++;
}
keyAlreadyExists = false;
}
return uniqueKeys;
}
}
And the sample output is,
Enter 'n' value ::
5
Label :: computer Count :: 30
Label :: company Count :: 22
Label :: express Count :: 20
Label :: offer Count :: 16
Label :: shearson Count :: 16
After searching hard I still haven't found the proper answer to my question, so here it is:
I have to write a Java program that reads an array of strings and finds the largest sequence of equal elements in it. If several sequences have the same longest length, the program should print the leftmost of them. The input strings are given as a single line, separated by spaces.
For example:
if the input is: "hi yes yes yes bye",
the output should be: "yes yes yes".
And there is my source code:
public static void main(String[] args) {
System.out.println("Please enter a sequence of strings separated by spaces:");
Scanner inputStringScanner = new Scanner(System.in);
String[] strings = inputStringScanner.nextLine().split(" ");
System.out.println(String.join(" ", strings));
ArrayList<ArrayList<String>> stringsSequencesCollection = new ArrayList<ArrayList<String>>();
ArrayList<String> stringsSequences = new ArrayList<String>();
stringsSequences.add(strings[0]);
for (int i = 1; i < strings.length; i++) {
if(strings[i].equals(strings[i - 1])) {
stringsSequences.add(strings[i]);
} else {
System.out.println(stringsSequences + " " + stringsSequences.size());
stringsSequencesCollection.add(stringsSequences);
stringsSequences.clear();
stringsSequences.add(strings[i]);
//ystem.out.println("\n" + stringsSequences);
}
if(i == strings.length - 1) {
stringsSequencesCollection.add(stringsSequences);
stringsSequences.clear();
System.out.println(stringsSequences + " " + stringsSequences.size());
}
}
System.out.println(stringsSequencesCollection.size());
System.out.println(stringsSequencesCollection.get(2).size());
System.out.println();
int maximalStringSequence = Integer.MIN_VALUE;
int index = 0;
ArrayList<String> currentStringSequence = new ArrayList<String>();
for (int i = 0; i < stringsSequencesCollection.size(); i++) {
currentStringSequence = stringsSequencesCollection.get(i);
System.out.println(stringsSequencesCollection.get(i).size());
if (stringsSequencesCollection.get(i).size() > maximalStringSequence) {
maximalStringSequence = stringsSequencesCollection.get(i).size();
index = i;
//System.out.println("\n" + index);
}
}
System.out.println(String.join(" ", stringsSequencesCollection.get(index)));
I think it should work correctly, but there is a problem: the sub-lists' sizes aren't correct. Every sub-ArrayList's size is 1, and for this reason the output is not correct. I don't understand the reason for this. If anybody can help me fix the code, I will be grateful!
I think it is fairly straightforward: just keep track of a max sequence length as you go through the array, building sequences.
String input = "hi yes yes yes bye";
String sa[] = input.split(" ");
int maxseqlen = 1;
String last_sample = sa[0];
String longest_seq = last_sample;
int seqlen = 1;
String seq = last_sample;
for (int i = 1; i < sa.length; i++) {
String sample = sa[i];
if (sample.equals(last_sample)) {
seqlen++;
seq += " " + sample;
if (seqlen > maxseqlen) {
longest_seq = seq;
maxseqlen = seqlen;
}
} else {
seqlen = 1;
seq = sample;
}
last_sample = sample;
}
System.out.println("longest_seq = " + longest_seq);
Lots of issues.
First of all, when dealing with the last string of the list, you are not actually printing it before clearing it. It should be:
if (i == strings.length - 1) {
//...
System.out.println(stringsSequences + " " + stringsSequences.size());
stringsSequences.clear();
}
This is the error in the output.
Secondly, and most importantly, when you do stringsSequencesCollection.add you are adding an OBJECT, i.e. a reference to the collection. Then, when you do stringsSequences.clear(), you also empty the collection you just added (add stores a reference, not a copy!). You can verify this by printing stringsSequencesCollection after the first loop finishes: it will contain 3 empty lists.
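A tiny demo of that reference behavior (variable names are mine):

```java
import java.util.*;

public class ReferenceDemo {
    public static void main(String[] args) {
        List<List<String>> outer = new ArrayList<>();
        List<String> inner = new ArrayList<>();
        inner.add("yes");
        outer.add(inner);                  // stores a reference, not a copy
        inner.clear();                     // this also empties the element inside 'outer'
        System.out.println(outer);         // prints: [[]]

        inner.add("yes");
        outer.add(new ArrayList<>(inner)); // store a defensive copy instead
        inner.clear();
        System.out.println(outer);         // prints: [[], [yes]]
    }
}
```

So in the question's code, replacing stringsSequencesCollection.add(stringsSequences) with an add of new ArrayList<>(stringsSequences) would also fix the all-sizes-are-1 symptom.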
So how do we do this? First, we need a more appropriate data structure. We are going to use a Map that, for each string, contains the length of its longest sequence. Since we want to handle ties too, we'll also keep another map that stores, for each string, the leftmost ending position of its longest sequence:
Map<String, Integer> lengths= new HashMap<>();
Map<String, Integer> indexes= new HashMap<>();
String[] split = input.split(" ");
lengths.put(split[0], 1);
indexes.put(split[0], 0);
int currentLength = 1;
int maxLength = 1;
for (int i = 1; i<split.length; i++) {
String s = split[i];
if (s.equals(split[i-1])) {
currentLength++;
}
else {
currentLength = 1;
}
int oldLength = lengths.getOrDefault(s, 0);
if (currentLength > oldLength) {
lengths.put(s, currentLength);
indexes.put(s, i);
}
maxLength = Math.max(maxLength, currentLength);
}
//At this point, you'll have in lengths a map from string -> maxSeqLength, and in indexes a map from string -> leftmost ending index of that string's longest sequence. Now we need to reason on those!
Now we can just scan for the strings with the longest sequences:
//Find all strings with equal maximal length sequences
Set<String> longestStrings = new HashSet<>();
for (Map.Entry<String, Integer> e: lengths.entrySet()) {
if (e.getValue() == maxLength) {
longestStrings.add(e.getKey());
}
}
//Of those, search the one with minimal index
int minIndex = input.length();
String bestString = null;
for (String s: longestStrings) {
int index = indexes.get(s);
if (index < minIndex) {
minIndex = index; // remember the smallest index seen so far
bestString = s;
}
}
System.out.println(bestString);
The code below produces the expected output for the sample input:
public static void main(String[] args) {
System.out.println("Please enter a sequence of strings separated by spaces:");
Scanner inputStringScanner = new Scanner(System.in);
String[] strings = inputStringScanner.nextLine().split(" ");
System.out.println(String.join(" ", strings));
List <ArrayList<String>> stringsSequencesCollection = new ArrayList<ArrayList<String>>();
List <String> stringsSequences = new ArrayList<String>();
//stringsSequences.add(strings[0]);
boolean flag = false;
for (int i = 1; i < strings.length; i++) {
if(strings[i].equals(strings[i - 1])) {
if(flag == false){
stringsSequences.add(strings[i]);
flag= true;
}
stringsSequences.add(strings[i]);
}
}
int maximalStringSequence = Integer.MIN_VALUE;
int index = 0;
List <String> currentStringSequence = new ArrayList<String>();
for (int i = 0; i < stringsSequencesCollection.size(); i++) {
currentStringSequence = stringsSequencesCollection.get(i);
System.out.println(stringsSequencesCollection.get(i).size());
if (stringsSequencesCollection.get(i).size() > maximalStringSequence) {
maximalStringSequence = stringsSequencesCollection.get(i).size();
index = i;
//System.out.println("\n" + index);
}
}
System.out.println(stringsSequences.toString());