Removing a comma at the end of String - java

I am trying to convert an HashSet into comma seperated delimited String
But the problem is that , i am getting an extra comma at the end of the String also as shown .
Please tell me how can i remove that extra comma at the end of the String .
import java.util.HashSet;
import java.util.Set;
public class Test {
private static Set<String> symbolsSet = new HashSet<String>();
static {
symbolsSet.add("Q1!GO1");
symbolsSet.add("Q2!GO2");
symbolsSet.add("Q3!GO3");
}
public static void main(String args[]) {
String[] a = symbolsSet.toArray(new String[0]);
StringBuffer sb = new StringBuffer();
for (int i = 0; i < a.length; i++) {
sb.append(a[i] + ",");
}
System.out.println(sb.toString());
}
}
Output :
Q3!GOO3,Q2!GO2,Q1!GO1,

Do like this
for (int i = 0; i < a.length - 1; i++) {
sb.append(a[i] + ",");
}
sb.append(a[a.length - 1]);

Try with:
for (int i = 0; i < a.length; i++) {
if ( i > 0 ) {
sb.append(",");
}
sb.append(a[i]);
}

There are many ways you can do this, below is a regex solution using String#replaceAll()
String s= "abc,";
s = s.replaceAll(",$", "");

StringBuffer sb = new StringBuffer();
String separator = "";
for (int i = 0; i < a.length; i++) {
sb.append(separator).append(a[i]);
separator = ",";
}

In your given scenario String.lastIndexOf method is pretty useful.
String withComma= sb.toString();
String strWithoutLastComma = withComma.substring(0,withComma.lastIndexOf(","));
System.out.println(strWithoutLastComma);

String str = sb.toString().substring(0, sb.toString().length()-1);
or append only if not last element
for (int i = 0; i < a.length; i++) {
if ( i > 0 ) {
sb.append(",");
}
sb.append(a[i]);
}

public static void main(String args[]) {
String[] a = symbolsSet.toArray(new String[0]);
StringBuffer sb = new StringBuffer();
for (int i = 0; i < a.length-1; i++) {
//append until the last with comma
sb.append(a[i] + ",");
}
//append the last without comma
sb.append(a[a.length-1]);
System.out.println(sb.toString());
}

Try this :
for (int i = 0; i < a.length; i++) {
if(i == a.length - 1) {
sb.append(a[i]); // Last element. Dont append comma to it
} else {
sb.append(a[i] + ","); // Append comma to it. Not a last element
}
}
System.out.println(sb.toString());

Try this,
for (int i = 0; i < a.length-1; i++) {
if(a.length-1!=i)
sb.append(a[i] + ",");
else
{
sb.append(a[i]);
break;
}
}

public static void main(String args[])
{
String[] a = symbolsSet.toArray(new String[0]);
StringBuffer sb = new StringBuffer();
for (int i = 0; i < a.length; i++)
{
sb.append(a[i] + ",");
}
System.out.println(sb.toString().substring(0, sb.toString().length - 1));
}
OR
public static void main(String args[])
{
String[] a = symbolsSet.toArray(new String[0]);
StringBuffer sb = new StringBuffer();
for (int i = 0; i < a.length; i++)
{
sb.append(a[i]);
if(a != a.length - 1)
{
sb.append(",");
}
}
System.out.println(sb.toString());
}

You could grow your own Separator class:
public class SimpleSeparator<T> {
private final String sepString;
boolean first = true;
public SimpleSeparator(final String sep) {
this.sepString = sep;
}
public String sep() {
// Return empty string first and then the separator on every subsequent invocation.
if (first) {
first = false;
return "";
}
return sepString;
}
public static void main(String[] args) {
SimpleSeparator sep = new SimpleSeparator(",");
System.out.print("[");
for ( int i = 0; i < 10; i++ ) {
System.out.print(sep.sep()+i);
}
System.out.print("]");
}
}
There's loads more you can do with this class with static methods that separate arrays, collections, iterators, iterables etc.

Try with this
for (int i = 0; i < a.length; i++) {
if(i == a.length - 1 )
{
sb.append(a[i]);
} else {
sb.append(a[i] + ",");
}
}

Related

How can i extract trend words from given dataset (Java)? [duplicate]

How to generate an n-gram of a string like:
String Input="This is my car."
I want to generate n-gram with this input:
Input Ngram size = 3
Output should be:
This
is
my
car
This is
is my
my car
This is my
is my car
Give some idea in Java, how to implement that or if any library is available for it.
I am trying to use this NGramTokenizer but its giving n-gram's of character sequence and I want n-grams of word sequence.
I believe this would do what you want:
import java.util.*;
public class Test {
public static List<String> ngrams(int n, String str) {
List<String> ngrams = new ArrayList<String>();
String[] words = str.split(" ");
for (int i = 0; i < words.length - n + 1; i++)
ngrams.add(concat(words, i, i+n));
return ngrams;
}
public static String concat(String[] words, int start, int end) {
StringBuilder sb = new StringBuilder();
for (int i = start; i < end; i++)
sb.append((i > start ? " " : "") + words[i]);
return sb.toString();
}
public static void main(String[] args) {
for (int n = 1; n <= 3; n++) {
for (String ngram : ngrams(n, "This is my car."))
System.out.println(ngram);
System.out.println();
}
}
}
Output:
This
is
my
car.
This is
is my
my car.
This is my
is my car.
An "on-demand" solution implemented as an Iterator:
class NgramIterator implements Iterator<String> {
String[] words;
int pos = 0, n;
public NgramIterator(int n, String str) {
this.n = n;
words = str.split(" ");
}
public boolean hasNext() {
return pos < words.length - n + 1;
}
public String next() {
StringBuilder sb = new StringBuilder();
for (int i = pos; i < pos + n; i++)
sb.append((i > pos ? " " : "") + words[i]);
pos++;
return sb.toString();
}
public void remove() {
throw new UnsupportedOperationException();
}
}
You are looking for ShingleFilter.
Update: The link points to version 3.0.2. This class may be in different package in newer version of Lucene.
This code returns an array of all Strings of the given length:
public static String[] ngrams(String s, int len) {
String[] parts = s.split(" ");
String[] result = new String[parts.length - len + 1];
for(int i = 0; i < parts.length - len + 1; i++) {
StringBuilder sb = new StringBuilder();
for(int k = 0; k < len; k++) {
if(k > 0) sb.append(' ');
sb.append(parts[i+k]);
}
result[i] = sb.toString();
}
return result;
}
E.g.
System.out.println(Arrays.toString(ngrams("This is my car", 2)));
//--> [This is, is my, my car]
System.out.println(Arrays.toString(ngrams("This is my car", 3)));
//--> [This is my, is my car]
/**
*
* #param sentence should has at least one string
* #param maxGramSize should be 1 at least
* #return set of continuous word n-grams up to maxGramSize from the sentence
*/
public static List<String> generateNgramsUpto(String str, int maxGramSize) {
List<String> sentence = Arrays.asList(str.split("[\\W+]"));
List<String> ngrams = new ArrayList<String>();
int ngramSize = 0;
StringBuilder sb = null;
//sentence becomes ngrams
for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
String word = (String) it.next();
//1- add the word itself
sb = new StringBuilder(word);
ngrams.add(word);
ngramSize=1;
it.previous();
//2- insert prevs of the word and add those too
while(it.hasPrevious() && ngramSize<maxGramSize){
sb.insert(0,' ');
sb.insert(0,it.previous());
ngrams.add(sb.toString());
ngramSize++;
}
//go back to initial position
while(ngramSize>0){
ngramSize--;
it.next();
}
}
return ngrams;
}
Call:
long startTime = System.currentTimeMillis();
ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
long stopTime = System.currentTimeMillis();
System.out.println("My time = "+(stopTime-startTime)+" ms with ngramsize = "+ngrams.size());
System.out.println(ngrams.toString());
Output:
My time = 1 ms with ngramsize = 9 [This, is, This is, my, is my, This
is my, car, my car, is my car]
public static void CreateNgram(ArrayList<String> list, int cutoff) {
try
{
NGramModel ngramModel = new NGramModel();
POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
POSTaggerME tagger = new POSTaggerME(model);
perfMon.start();
for(int i = 0; i<list.size(); i++)
{
String inputString = list.get(i);
ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(inputString));
String line;
while ((line = lineStream.read()) != null)
{
String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
String[] tags = tagger.tag(whitespaceTokenizerLine);
POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
perfMon.incrementCounter();
String words[] = sample.getSentence();
if(words.length > 0)
{
for(int k = 2; k< 4; k++)
{
ngramModel.add(new StringList(words), k, k);
}
}
}
}
ngramModel.cutoff(cutoff, Integer.MAX_VALUE);
Iterator<StringList> it = ngramModel.iterator();
while(it.hasNext())
{
StringList strList = it.next();
System.out.println(strList.toString());
}
perfMon.stopAndPrintFinalResult();
}catch(Exception e)
{
System.out.println(e.toString());
}
}
Here is my codes to create n-gram. In this case, n = 2, 3. n-gram of words sequence which smaller than cutoff value will ignore from result set. Input is list of sentences, then it parse using a tool of OpenNLP
public static void main(String[] args) {
String[] words = "This is my car.".split(" ");
for (int n = 0; n < 3; n++) {
List<String> list = ngrams(n, words);
for (String ngram : list) {
System.out.println(ngram);
}
System.out.println();
}
}
public static List<String> ngrams(int stepSize, String[] words) {
List<String> ngrams = new ArrayList<String>();
for (int i = 0; i < words.length-stepSize; i++) {
String initialWord = "";
int internalCount = i;
int internalStepSize = i + stepSize;
while (internalCount <= internalStepSize
&& internalCount < words.length) {
initialWord = initialWord+" " + words[internalCount];
++internalCount;
}
ngrams.add(initialWord);
}
return ngrams;
}
Check this out:
public static void main(String[] args) {
NGram nGram = new NGram();
String[] tokens = "this is my car".split(" ");
int i = tokens.length;
List<String> ngrams = new ArrayList<>();
while (i >= 1){
ngrams.addAll(nGram.getNGram(tokens, i, new ArrayList<>()));
i--;
}
System.out.println(ngrams);
}
private List<String> getNGram(String[] tokens, int n, List<String> ngrams) {
StringBuilder strbldr = new StringBuilder();
if (tokens.length < n) {
return ngrams;
}else {
for (int i=0; i<n; i++){
strbldr.append(tokens[i]).append(" ");
}
ngrams.add(strbldr.toString().trim());
String[] newTokens = Arrays.copyOfRange(tokens, 1, tokens.length);
return getNGram(newTokens, n, ngrams);
}
}
Simple recursive function, better running time.

How to move each "i" in a string to the next position in java

I want to shift each i in a given string one index to the right. How can I do that? For example:
"Chit Nyein Oo is nothing.";
becomes
"Chti Nyeni Oo si nothnig.";
If i occurs in the last index, it need not change its position.
Use string.replaceAll
string.replaceAll("i(.)", "$1i");
DEMO
EDIT: NOW it works for all conditions. Last letter in the String is 'i' or not, it works.
public class t4 {
public static void main(String[] args) {
String input = "Chit Nyein Oo is nothing.";
char o = 'i';
int indexes = 0;
if(input.charAt(input.length()-1) != 'i'){ //Test if last letter is not 'i'
for (int i = 0; i < input.length(); i++) {
if(input.charAt(i) == o){
indexes++;
}
}
int []positions = new int[indexes];
for (int i = 0; i < input.length(); i++) {
if(input.charAt(i) == o){
positions[indexes-1] = i;
indexes--;
}
}
char[] characters = input.toCharArray();
for (int i = 0; i < positions.length; i++) {
if(characters[input.length()-1] != 'i'){
char temp = characters[positions[i]];
characters[positions[i]] = characters[positions[i]+1];
characters[positions[i]+1] = temp;
} else {
continue;
}
}
String swappedString = new String(characters);
System.out.println(input);
System.out.println(swappedString);
} else { //so last letter is i
char t = input.charAt(input.length()-1);
String ha = input.substring(0, input.length()-1);
input = ha;
for (int i = 0; i < input.length(); i++) {
if(input.charAt(i) == o){
indexes++;
}
}
int []positions = new int[indexes];
for (int i = 0; i < input.length(); i++) {
if(input.charAt(i) == o){
positions[indexes-1] = i;
indexes--;
}
}
char[] characters = input.toCharArray();
for (int i = 0; i < positions.length; i++) {
if(characters[input.length()-1] != 'i'){
char temp = characters[positions[i]];
characters[positions[i]] = characters[positions[i]+1];
characters[positions[i]+1] = temp;
} else {
continue;
}
}
String swappedString = new String(characters);
swappedString = swappedString + Character.toString(t);
System.out.println(input);
System.out.println(swappedString);
}
}
}
You can do this using a StringBuilder.
class Test {
public static void main(String[] args) {
String input = "Chit Nyein Oo is nothingi";
int len = input.length();
StringBuilder sb = new StringBuilder();
// System.out.println(sb);
for(int i=0; i<len; i++) {
char charAti = input.charAt(i);
if(charAti == 'i' && i<len-1) {
sb.append(input.charAt(i+1));
sb.append(charAti);
i++;
}
else {
sb.append(charAti);
}
}
System.out.println(sb);
}
}

Remove repeated characters in a string

I need to write a static method that takes a String as a parameter and returns a new String obtained by replacing every instance of repeated adjacent letters with a single instance of that letter without using regular expressions. For example if I enter "maaaakkee" as a String, it returns "make".
I already tried the following code, but it doesn't seem to display the last character.
Here's my code:
import java.util.Scanner;
public class undouble {
public static void main(String [] args){
Scanner console = new Scanner(System.in);
System.out.println("enter String: ");
String str = console.nextLine();
System.out.println(removeSpaces(str));
}
public static String removeSpaces(String str){
String ourString="";
int j = 0;
for (int i=0; i<str.length()-1 ; i++){
j = i+1;
if(str.charAt(i)!=str.charAt(j)){
ourString+=str.charAt(i);
}
}
return ourString;
}
}
You could use regular expressions for that.
For instance:
String input = "ddooooonnneeeeee";
System.out.println(input.replaceAll("(.)\\1{1,}", "$1"));
Output:
done
Pattern explanation:
"(.)\\1{1,}" means any character (added to group 1) followed by itself at least once
"$1" references contents of group 1
maybe:
for (int i=1; i<str.length() ; i++){
j = i+1;
if(str.charAt(i)!=str.charAt(j)){
ourString+=str.charAt(i);
}
}
The problem is with your condition. You say compare i and i+1 in each iteration and in last iteration you have both i and j pointing to same location so it will never print the last character. Try this unleass you want to use regex to achive this:
EDIT:
public void removeSpaces(String str){
String ourString="";
for (int i=0; i<str.length()-1 ; i++){
if(i==0){
ourString = ""+str.charAt(i);
}else{
if(str.charAt(i-1) != str.charAt(i)){
ourString = ourString +str.charAt(i);
}
}
}
System.out.println(ourString);
}
if you cannot use replace or replaceAll, here is an alternative. O(2n), O(N) for stockage and O(N) for creating the string. It removes all repeated chars in the string put them in a stringbuilder.
input : abcdef , output : abcdef
input : aabbcdeef, output : cdf
private static String remove_repeated_char(String str)
{
StringBuilder result = new StringBuilder();
HashMap<Character, Integer> items = new HashMap<>();
for (int i = 0; i < str.length(); i++)
{
Character current = str.charAt(i);
Integer ocurrence = items.get(current);
if (ocurrence == null)
items.put(current, 1);
else
items.put(current, ocurrence + 1);
}
for (int i = 0; i < str.length(); i++)
{
Character current = str.charAt(i);
Integer ocurrence = items.get(current);
if (ocurrence == 1)
result.append(current);
}
return result.toString();
}
import java.util.*;
public class string2 {
public static void main(String[] args) {
//removes repeat character from array
Scanner sc=new Scanner(System.in);
StringBuffer sf=new StringBuffer();
System.out.println("enter a string");
sf.append(sc.nextLine());
System.out.println("string="+sf);
int i=0;
while( i<sf.length())
{
int j=1+i;
while(j<sf.length())
{
if(sf.charAt(i)==sf.charAt(j))
{
sf.deleteCharAt(j);
}
else
{
j=j+1;
}
}
i=i+1;
}
System.out.println("string="+sf);
}
}
Input AABBBccDDD, Output BD
Input ABBCDDA, Outout C
private String reducedString(String s){
char[] arr = s.toCharArray();
String newString = "";
Map<Character,Integer> map = new HashMap<Character,Integer>();
map.put(arr[0],1);
for(int index=1;index<s.length();index++)
{
Character key = arr[index];
int value;
if(map.get(key) ==null)
{
value =0;
}
else
{
value = map.get(key);
}
value = value+1;
map.put(key,value);
}
Set<Character> keyset = map.keySet();
for(Character c: keyset)
{
int value = map.get(c);
if(value%2 !=0)
{
newString+=c;
}
}
newString = newString.equals("")?"Empty String":newString;
return newString;
}
public class RemoveDuplicateCharecterInString {
static String input = new String("abbbbbbbbbbbbbbbbccccd");
static String output = "";
public static void main(String[] args)
{
// TODO Auto-generated method stub
for (int i = 0; i < input.length(); i++) {
char temp = input.charAt(i);
boolean check = false;
for (int j = 0; j < output.length(); j++) {
if (output.charAt(j) == input.charAt(i)) {
check = true;
}
}
if (!check) {
output = output + input.charAt(i);
}
}
System.out.println(" " + output);
}
}
Answer : abcd
public class RepeatedChar {
public static void main(String[] args) {
String rS = "maaaakkee";
String outCome= rS.charAt(0)+"";
int count =0;
char [] cA =rS.toCharArray();
for(int i =0; i+1<cA.length; ++i) {
if(rS.charAt(i) != rS.charAt(i+1)) {
outCome += rS.charAt(i+1);
}
}
System.out.println(outCome);
}
}
TO WRITE JAVA PROGRAM TO REMOVE REPEATED CHARACTERS:
package replace;
public class removingrepeatedcharacters
{
public static void main(String...args){
int i,j=0,count=0;
String str="noordeen";
String str2="noordeen";
char[] ch=str.toCharArray();
for(i=0;i<=5;i++)
{
count=0;
for(j=0;j<str2.length();j++)
{
if(ch[i]==str2.charAt(j))
{
count++;
System.out.println("at the index "+j +"position "+ch[i]+ "+ count is"+count);
if(count>=2){
str=str2;
str2=str.replaceFirst(Character.toString(ch[j]),Character.toString(' '));
}
System.out.println("after replacing " +str2);
}
}
}
}
}
String outstr = "";
String outstring = "";
for(int i = 0; i < str.length() - 1; i++) {
if(str.charAt(i) != str.charAt(i + 1)) {
outstr = outstr + str.charAt(i);
}
outstring = outstr + str.charAt(i);
}
System.out.println(outstring);
public static void remove_duplicates(String str){
String outstr="";
String outstring="";
for(int i=0;i<str.length()-1;i++) {
if(str.charAt(i)!=str.charAt(i+1)) {
outstr=outstr+str.charAt(i);
}
outstring=outstr+str.charAt(i);
}
System.out.println(outstring);
}
More fun with java 7:
System.out.println("11223344445555".replaceAll("(?<nums>.+)\\k<nums>+","${nums}"));
No more cryptic numbers in regexes.
public static String removeDuplicates(String str) {
String str2 = "" + str.charAt(0);
for (int i = 1; i < str.length(); i++) {
if (str.charAt(i - 1) == str.charAt(i) && i != 0) {
continue;
}
str2 = str2 + str.charAt(i);
}
return str2;
}

String permutation using for loops

I have to print all the possible permutations of the given input string.
Using the code below I get aaaa bbb ccc now in next iteration I want to print aaa aab aac. aba aca and so on. Please guide me about it.
String s = "abc";
char ch;
ArrayList<Character> input = new ArrayList<Character>();
public static void main (String [] args)
{
String s= "abc";
int count ;
char ch;
ArrayList<Character> input = new ArrayList<Character>();
for (int i=0; i < s.length(); i++)
{
ch = s.charAt(i);
input.add(ch);
}
for (int i=0; i <= input.size(); i++)
{
for(int j=0; j < input.size(); j++)
{
System.out.print(input.get(i));
}
System.out.println();
}
}
You can use recursive function. Example
private static String text = "abcd";
public static void main(String[] args) {
loopPattern(text, 0, "");
}
private static void loopPattern(String source, int index, String res) {
if (source == null || source.length() == 0) {
return;
}
if (index == source.length()) {
System.out.println(res);
return;
}
for (int i = 0; i < source.length(); i++) {
loopPattern(source, index + 1, res + source.charAt(i));
}
}
In current implementation, you should use:
for (int i=0; i <= input.size(); i++) {
for(int j=0; j < input.size(); j++) {
for(int k=0; k < input.size(); k++) {
System.out.print(input.get(i));
System.out.print(input.get(j));
System.out.print(input.get(k));
System.out.println();
}
}
}
But IMHO it is better to use s.charAt(i) instead of input.get(i).
A recursive version which is not dependent of the number of characters:
class Test
{
public static void main(String[] args)
{
if(args.length != 1)
{
System.out.println("Usage: java Test <string>");
System.exit(1);
}
String input = args[0];
iterate(input, "", input.length());
}
public static void iterate(String input, String current, int level)
{
if(level == 0)
{
System.out.println(current);
return;
}
for(int i=0; i < input.length(); i++)
{
iterate(input, current + input.charAt(i), level-1);
}
}

N-gram generation from a sentence

How to generate an n-gram of a string like:
String Input="This is my car."
I want to generate n-gram with this input:
Input Ngram size = 3
Output should be:
This
is
my
car
This is
is my
my car
This is my
is my car
Give some idea in Java, how to implement that or if any library is available for it.
I am trying to use this NGramTokenizer but its giving n-gram's of character sequence and I want n-grams of word sequence.
I believe this would do what you want:
import java.util.*;
public class Test {
public static List<String> ngrams(int n, String str) {
List<String> ngrams = new ArrayList<String>();
String[] words = str.split(" ");
for (int i = 0; i < words.length - n + 1; i++)
ngrams.add(concat(words, i, i+n));
return ngrams;
}
public static String concat(String[] words, int start, int end) {
StringBuilder sb = new StringBuilder();
for (int i = start; i < end; i++)
sb.append((i > start ? " " : "") + words[i]);
return sb.toString();
}
public static void main(String[] args) {
for (int n = 1; n <= 3; n++) {
for (String ngram : ngrams(n, "This is my car."))
System.out.println(ngram);
System.out.println();
}
}
}
Output:
This
is
my
car.
This is
is my
my car.
This is my
is my car.
An "on-demand" solution implemented as an Iterator:
class NgramIterator implements Iterator<String> {
String[] words;
int pos = 0, n;
public NgramIterator(int n, String str) {
this.n = n;
words = str.split(" ");
}
public boolean hasNext() {
return pos < words.length - n + 1;
}
public String next() {
StringBuilder sb = new StringBuilder();
for (int i = pos; i < pos + n; i++)
sb.append((i > pos ? " " : "") + words[i]);
pos++;
return sb.toString();
}
public void remove() {
throw new UnsupportedOperationException();
}
}
You are looking for ShingleFilter.
Update: The link points to version 3.0.2. This class may be in different package in newer version of Lucene.
This code returns an array of all Strings of the given length:
public static String[] ngrams(String s, int len) {
String[] parts = s.split(" ");
String[] result = new String[parts.length - len + 1];
for(int i = 0; i < parts.length - len + 1; i++) {
StringBuilder sb = new StringBuilder();
for(int k = 0; k < len; k++) {
if(k > 0) sb.append(' ');
sb.append(parts[i+k]);
}
result[i] = sb.toString();
}
return result;
}
E.g.
System.out.println(Arrays.toString(ngrams("This is my car", 2)));
//--> [This is, is my, my car]
System.out.println(Arrays.toString(ngrams("This is my car", 3)));
//--> [This is my, is my car]
/**
*
* #param sentence should has at least one string
* #param maxGramSize should be 1 at least
* #return set of continuous word n-grams up to maxGramSize from the sentence
*/
public static List<String> generateNgramsUpto(String str, int maxGramSize) {
List<String> sentence = Arrays.asList(str.split("[\\W+]"));
List<String> ngrams = new ArrayList<String>();
int ngramSize = 0;
StringBuilder sb = null;
//sentence becomes ngrams
for (ListIterator<String> it = sentence.listIterator(); it.hasNext();) {
String word = (String) it.next();
//1- add the word itself
sb = new StringBuilder(word);
ngrams.add(word);
ngramSize=1;
it.previous();
//2- insert prevs of the word and add those too
while(it.hasPrevious() && ngramSize<maxGramSize){
sb.insert(0,' ');
sb.insert(0,it.previous());
ngrams.add(sb.toString());
ngramSize++;
}
//go back to initial position
while(ngramSize>0){
ngramSize--;
it.next();
}
}
return ngrams;
}
Call:
long startTime = System.currentTimeMillis();
ngrams = ToolSet.generateNgramsUpto("This is my car.", 3);
long stopTime = System.currentTimeMillis();
System.out.println("My time = "+(stopTime-startTime)+" ms with ngramsize = "+ngrams.size());
System.out.println(ngrams.toString());
Output:
My time = 1 ms with ngramsize = 9 [This, is, This is, my, is my, This
is my, car, my car, is my car]
public static void CreateNgram(ArrayList<String> list, int cutoff) {
try
{
NGramModel ngramModel = new NGramModel();
POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
POSTaggerME tagger = new POSTaggerME(model);
perfMon.start();
for(int i = 0; i<list.size(); i++)
{
String inputString = list.get(i);
ObjectStream<String> lineStream = new PlainTextByLineStream(new StringReader(inputString));
String line;
while ((line = lineStream.read()) != null)
{
String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
String[] tags = tagger.tag(whitespaceTokenizerLine);
POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
perfMon.incrementCounter();
String words[] = sample.getSentence();
if(words.length > 0)
{
for(int k = 2; k< 4; k++)
{
ngramModel.add(new StringList(words), k, k);
}
}
}
}
ngramModel.cutoff(cutoff, Integer.MAX_VALUE);
Iterator<StringList> it = ngramModel.iterator();
while(it.hasNext())
{
StringList strList = it.next();
System.out.println(strList.toString());
}
perfMon.stopAndPrintFinalResult();
}catch(Exception e)
{
System.out.println(e.toString());
}
}
Here is my codes to create n-gram. In this case, n = 2, 3. n-gram of words sequence which smaller than cutoff value will ignore from result set. Input is list of sentences, then it parse using a tool of OpenNLP
public static void main(String[] args) {
String[] words = "This is my car.".split(" ");
for (int n = 0; n < 3; n++) {
List<String> list = ngrams(n, words);
for (String ngram : list) {
System.out.println(ngram);
}
System.out.println();
}
}
public static List<String> ngrams(int stepSize, String[] words) {
List<String> ngrams = new ArrayList<String>();
for (int i = 0; i < words.length-stepSize; i++) {
String initialWord = "";
int internalCount = i;
int internalStepSize = i + stepSize;
while (internalCount <= internalStepSize
&& internalCount < words.length) {
initialWord = initialWord+" " + words[internalCount];
++internalCount;
}
ngrams.add(initialWord);
}
return ngrams;
}
Check this out:
public static void main(String[] args) {
NGram nGram = new NGram();
String[] tokens = "this is my car".split(" ");
int i = tokens.length;
List<String> ngrams = new ArrayList<>();
while (i >= 1){
ngrams.addAll(nGram.getNGram(tokens, i, new ArrayList<>()));
i--;
}
System.out.println(ngrams);
}
private List<String> getNGram(String[] tokens, int n, List<String> ngrams) {
StringBuilder strbldr = new StringBuilder();
if (tokens.length < n) {
return ngrams;
}else {
for (int i=0; i<n; i++){
strbldr.append(tokens[i]).append(" ");
}
ngrams.add(strbldr.toString().trim());
String[] newTokens = Arrays.copyOfRange(tokens, 1, tokens.length);
return getNGram(newTokens, n, ngrams);
}
}
Simple recursive function, better running time.

Categories