Ngrams with a dataset java eclipse

Hi, I am trying to use n-grams on this data set (number of attacks), but I am confused about how to combine the two classes so that the n-gram code can count how many times each n-gram appears. I am only trying to preprocess the data. Any help would be appreciated, thank you.
This is what I have done so far. As you can see, the main method reads the data set, but how can I combine everything so that the Ngrams class runs on the data?
public class MainProcess {
public static void main(String args[]) throws IOException
{
FileReader readhandle = new FileReader("/Users/muhammad/Desktop/ADFA-LD/Attack_Data_Master/Adduser_1/UAD-Adduser-1-=1.txt");
BufferedReader br = new BufferedReader(readhandle);
String line = null;
while((line = br.readLine()) != null)
{
System.out.println(line);
}
br.close();
readhandle.close();
}
public class Ngrams {
ArrayList<String> nGrams = new ArrayList<String>();
public void generateNGrams(String str, int n) {
if (str.length() == n ) {
int counter = 0;
while (counter < n) {
nGrams.add(str.substring(counter));
counter++;
}
return;
}
int counter = 0;
String gram = "";
while (counter < n) {
gram += str.charAt(counter);
counter++;
}
nGrams.add(gram);
generateNGrams(str.substring(1), n);
}
public void printNGrams() {
for (String str : nGrams) {
System.out.println(str);
}
}}
}
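One way to connect the two pieces is to let the reading loop hand each line to the n-gram code and keep the counts in a map. The sketch below assumes that each line of the file is a whitespace-separated sequence of tokens and that an n-gram means n consecutive tokens (not characters, as in the question's code). NGramCounter and its method names are made up for illustration, and the sample string stands in for the file contents.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: counts how often each n-gram of tokens occurs.
class NGramCounter {
    private final int n;
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    NGramCounter(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
    }

    // Adds every run of n consecutive tokens in the line to the counts.
    void addLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            counts.merge(gram, 1, Integer::sum);
        }
    }

    // Reads all lines from the reader; for a real file, wrap a FileReader instead.
    void addAll(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            addLine(line);
        }
    }

    Map<String, Integer> getCounts() {
        return counts;
    }

    public static void main(String[] args) throws IOException {
        NGramCounter counter = new NGramCounter(2);
        // Sample data (an assumption); replace the StringReader with the FileReader from the question.
        try (BufferedReader br = new BufferedReader(new StringReader("6 11 45 6 11\n45 6 11"))) {
            counter.addAll(br);
        }
        counter.getCounts().forEach((gram, count) -> System.out.println(gram + " -> " + count));
    }
}
```

N-grams here do not span line boundaries; if the whole file should be treated as one sequence, the tokens would have to be collected first and counted afterwards.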

Related

Is there any simple way to convert camel-case to snake-case correctly?

I tried the code snippet below, but TradeID is printed as Trade_I_D when it should be Trade_ID.
input: getCurrency, getAccountName, getTradeID
expected output: Currency, Account_Name, Trade_ID
public class RemoveGet {
public static void main(String args[]) {
for (String a : args) {
String b = a.replace("get", "");
//System.out.println(b);
StringBuffer sb = new StringBuffer();
for (int i = 0; i < b.length(); i++) {
if (Character.isUpperCase(b.charAt(i))) {
sb.append("_");
sb.append(b.charAt(i));
} else {
sb.append(b.charAt(i));
}
}
//System.out.println(sb.toString());
String c = sb.toString();
if (c.startsWith("_")) {
System.out.println(c.substring(1));
}
}
}
}

Try this:
str = str.replace("get", "")
.replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2")
.replaceAll("([a-z])([A-Z])", "$1_$2");
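A small runnable sketch of the two-regex approach, wrapped in a hypothetical CamelToSnake class. It differs from the one-liner in one respect, which is an assumption on my part: it strips only a leading "get", so a name such as getTarget keeps its inner "get".

```java
// Hypothetical wrapper around the regex chain; the class and method names are illustrative.
class CamelToSnake {
    static String convert(String getterName) {
        // Strip only a leading "get" rather than every occurrence of "get".
        String name = getterName.startsWith("get") ? getterName.substring(3) : getterName;
        return name
                // Separate an acronym from a following capitalised word: "HTTPResponse" -> "HTTP_Response".
                .replaceAll("([A-Z]+)([A-Z][a-z])", "$1_$2")
                // Separate a lowercase letter from a following uppercase letter: "TradeID" -> "Trade_ID".
                .replaceAll("([a-z])([A-Z])", "$1_$2");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"getCurrency", "getAccountName", "getTradeID"}) {
            System.out.println(s + " -> " + convert(s));
        }
    }
}
```

Because the second pattern only fires between a lowercase and an uppercase letter, a trailing run of capitals such as ID stays together.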

Use a boolean first-time switch so that an underscore is inserted only before the first uppercase letter after the initial character.
Here are some test results.
getTradeID
Trade_ID
Here's the complete runnable code.
public class RemoveGet {
public static void main(String args[]) {
args = new String[1];
args[0] = "getTradeID";
for (String a : args) {
System.out.println(a);
String b = a.replace("get", "");
boolean firstTimeSwitch = true;
// System.out.println(b);
StringBuffer sb = new StringBuffer();
sb.append(b.charAt(0));
for (int i = 1; i < b.length(); i++) {
if (firstTimeSwitch && Character.isUpperCase(b.charAt(i))) {
sb.append("_");
sb.append(b.charAt(i));
firstTimeSwitch = false;
} else {
sb.append(b.charAt(i));
}
}
System.out.println(sb.toString());
}
}
}

Instead of writing all the logic in the main function, write small functions that each do one task and call them from main. This makes the code more readable and easier to debug. One possible solution:
public class RemoveGet {
public static String addUnderScoreAppropriately(String input) {
String result = "";
String underScore = "_";
for(int i=0; i<input.length();i++) {
if((Character.isUpperCase(input.charAt(i))) && (i != 0)) {
result = result + underScore + input.charAt(i);
}else{
result = result + input.charAt(i);
}
}
result = result.replace("_I_D","_ID");
return result;
}
public static void main(String args[]) {
for (String a : args) {
System.out.println(addUnderScoreAppropriately(a.replace("get","")));
}
}
}

Confused with why I am getting Index out of bounds error?

So I am trying to create a program which takes a text file, creates an index (by line numbers) for all the words in the file and writes the index into the output file. Here is the main class:
import java.util.Scanner;
import java.io.*;
public class IndexMaker
{
public static void main(String[] args) throws IOException
{
Scanner keyboard = new Scanner(System.in);
String fileName;
// Open input file:
if (args.length > 0)
fileName = args[0];
else
{
System.out.print("\nEnter input file name: ");
fileName = keyboard.nextLine().trim();
}
BufferedReader inputFile =
new BufferedReader(new FileReader(fileName), 1024);
// Create output file:
if (args.length > 1)
fileName = args[1];
else
{
System.out.print("\nEnter output file name: ");
fileName = keyboard.nextLine().trim();
}
PrintWriter outputFile =
new PrintWriter(new FileWriter(fileName));
// Create index:
DocumentIndex index = new DocumentIndex();
String line;
int lineNum = 0;
while ((line = inputFile.readLine()) != null)
{
lineNum++;
index.addAllWords(line, lineNum);
}
// Save index:
for (IndexEntry entry : index)
outputFile.println(entry);
// Finish:
inputFile.close();
outputFile.close();
keyboard.close();
System.out.println("Done.");
}
}
The program contains two more classes: IndexEntry, which represents one index entry, and DocumentIndex, which represents the entire index for a document, that is, the list of all its index entries. The index entries should always be kept in alphabetical order. The implementations of these two classes are shown below.
import java.util.ArrayList;
public class IndexEntry {
private String word;
private ArrayList<Integer> numsList;
public IndexEntry(String w) {
word = w.toUpperCase();
numsList = new ArrayList<Integer>();
}
public void add(int num) {
if (!numsList.contains(num)) {
numsList.add(num);
}
}
public String getWord() {
return word;
}
public String toString() {
String result = word + " ";
for (int i=0; i<numsList.size(); i++) {
if (i == 0) {
result += numsList.get(i);
} else {
result += ", " + numsList.get(i);
}
}
return result;
}
}
import java.util.ArrayList;
public class DocumentIndex extends ArrayList<IndexEntry> {
public DocumentIndex() {
super();
}
public DocumentIndex(int c) {
super(c);
}
public void addWord(String word, int num) {
super.get(foundOrInserted(word)).add(num);
}
private int foundOrInserted(String word) {
int result = 0;
for (int i=0; i<super.size(); i++) {
String w = super.get(i).getWord();
if (word.equalsIgnoreCase(w)) {
result = i;
} else if (w.compareTo(word) > 0) {
super.add(i, new IndexEntry(w));
result = i;
}
}
return result;
}
public void addAllWords(String str, int num) {
String[] arr = str.split("[^A-Za-z]+");
for (int i=0; i<arr.length; i++) {
if (arr[i].length() > 0 ) {
addWord(arr[i], num);
}
}
}
}
When I run this program I'm getting an error and I'm not sure where the error came from.
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
at java.base/java.util.Objects.checkIndex(Objects.java:372)
at java.base/java.util.ArrayList.get(ArrayList.java:459)
at DocumentIndex.addWord(DocumentIndex.java:14)
at DocumentIndex.addAllWords(DocumentIndex.java:35)
at Main.main(Main.java:53)

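The main loop is fine. The stack trace points at the call in DocumentIndex.addWord:
super.get(foundOrInserted(word)).add(num);
When the index is still empty, the for loop in foundOrInserted never runs, so the method returns 0 without inserting anything, and super.get(0) on an empty list throws IndexOutOfBoundsException: Index 0 out of bounds for length 0. The same happens whenever the new word sorts after every existing entry: nothing is inserted and index 0 is returned.
foundOrInserted has three more bugs. It inserts new IndexEntry(w), a copy of the existing entry's word, instead of new IndexEntry(word). It does not return after finding or inserting the word, so the loop carries on and keeps inserting entries. And it compares the stored upper-case word with the original-case word using compareTo, which is case-sensitive; use compareToIgnoreCase or upper-case the word first.
To fix it, return i as soon as the word is found or inserted, and if the loop finishes without a match, append a new IndexEntry(word) and return its index. lineNum is fine as it is: incrementing it before the call gives 1-based line numbers.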
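The stack trace shows the exception coming from ArrayList.get inside DocumentIndex.addWord, which is what happens when foundOrInserted returns 0 for an empty list. Below is a minimal sketch of a version that handles the empty list and returns as soon as the word is found or inserted. It reuses the class names from the question, but the method bodies are one possible rewrite, not the asker's code.

```java
import java.util.ArrayList;
import java.util.List;

class IndexEntry {
    private final String word;
    private final List<Integer> lineNumbers = new ArrayList<>();

    IndexEntry(String w) {
        word = w.toUpperCase();
    }

    void add(int num) {
        if (!lineNumbers.contains(num)) {
            lineNumbers.add(num);
        }
    }

    String getWord() {
        return word;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(word).append(' ');
        for (int i = 0; i < lineNumbers.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(lineNumbers.get(i));
        }
        return sb.toString();
    }
}

class DocumentIndex extends ArrayList<IndexEntry> {
    void addWord(String word, int num) {
        get(foundOrInserted(word)).add(num);
    }

    // Returns the index of the entry for word, inserting it in sorted order if needed.
    private int foundOrInserted(String word) {
        String key = word.toUpperCase(); // entries store upper-case words
        for (int i = 0; i < size(); i++) {
            int cmp = get(i).getWord().compareTo(key);
            if (cmp == 0) {
                return i;                     // already present
            }
            if (cmp > 0) {
                add(i, new IndexEntry(word)); // insert before the first larger word
                return i;
            }
        }
        add(new IndexEntry(word));            // empty list, or word sorts last
        return size() - 1;
    }

    void addAllWords(String str, int num) {
        for (String w : str.split("[^A-Za-z]+")) {
            if (!w.isEmpty()) {
                addWord(w, num);
            }
        }
    }

    public static void main(String[] args) {
        DocumentIndex index = new DocumentIndex();
        index.addAllWords("the cat sat", 1);
        index.addAllWords("The dog sat", 2);
        for (IndexEntry entry : index) {
            System.out.println(entry);
        }
    }
}
```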

How can I find the most frequent word in a text?

I have a problem. It seems that if I have an input like this:
"Thanks Thanks Thanks car car"
The output will be "thanks". If a word starts with an uppercase letter, it is printed in lowercase.
What can I add to my solution to fix that?
public class Main {
public static void main(String[] args) throws IOException {
String line;
String[] words = new String[100];
Map < String, Integer > frequency = new HashMap < > ();
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
while ((line = reader.readLine()) != null) {
line = line.trim();
if (!line.isEmpty()) {
words = line.split("\\W+");
for (String word: words) {
String processed = word.toLowerCase();
processed = processed.replace(",", "");
if (frequency.containsKey(processed)) {
frequency.put(processed,
frequency.get(processed) + 1);
} else {
frequency.put(processed, 1);
}
}
}
}
int mostFrequentlyUsed = 0;
String theWord = null;
for (String word: frequency.keySet()) {
Integer theVal = frequency.get(word);
if (theVal > mostFrequentlyUsed) {
mostFrequentlyUsed = theVal;
theWord = word;
} else if (theVal == mostFrequentlyUsed && word.length() <
theWord.length()) {
theWord = word;
mostFrequentlyUsed = theVal;
}
}
System.out.println(theWord);
}
}

To print the most frequent word as it was entered rather than in lowercase, change this line:
String processed = word.toLowerCase();
to:
String processed = word;
Be aware, though, that the containsKey() method is case-sensitive, so "Thanks" and "thanks" will then be counted as different words.
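If the count should stay case-insensitive while the output keeps the original spelling, one option is to count under a lowercased key and remember the first spelling seen for each key. This is only a sketch under that assumption: the class and method names are made up, and ties go to the word seen first rather than to the shorter word as in the question.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: case-insensitive counting that reports the first spelling seen.
class MostFrequentWord {
    static String mostFrequent(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();       // lowercased word -> count
        Map<String, String> firstSpelling = new LinkedHashMap<>(); // lowercased word -> original
        for (String word : text.trim().split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            String key = word.toLowerCase(Locale.ROOT);
            counts.merge(key, 1, Integer::sum);
            firstSpelling.putIfAbsent(key, word);
        }
        String bestKey = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) { // strict comparison: the first word wins ties
                bestCount = e.getValue();
                bestKey = e.getKey();
            }
        }
        return bestKey == null ? null : firstSpelling.get(bestKey);
    }

    public static void main(String[] args) {
        System.out.println(mostFrequent("Thanks Thanks Thanks car car"));
    }
}
```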

Please find below a program that prints the result in upper or lower case depending on the input.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
public class Main {
public static void main(String[] args) throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
String[] strArr=reader.readLine().split(" ");
String result=null;
int maxCount=0;
Map<String, Integer> strMap=new HashMap<String, Integer>();
int count=0;
for(String s:strArr){
count=0;
if(strMap.containsKey(s)){
count=strMap.get(s);
strMap.put(s,++count);
}else{
strMap.put(s, ++count);
}
}
//find Maximum
for(Map.Entry<String, Integer> itr: strMap.entrySet()){
if(maxCount==0){
maxCount=itr.getValue();
result=itr.getKey();
}else{
if(maxCount < itr.getValue()){
maxCount=itr.getValue();
result=itr.getKey();
}
}
}
// Most frequent word and its number of occurrences
System.out.println("word " + result + " count " + maxCount);
printInLowerOrUpperCase(result);
}
public static void printInLowerOrUpperCase(String result){
if(result.charAt(0) >= 'a' && result.charAt(0) <= 'z'){
System.out.println(result.toUpperCase());
}else{
System.out.println(result.toLowerCase());
}
}
}

BitSet stored in TreeMap always exceed its cardinality

The file that my program reads contains space-separated numbers such as "59 23 2 84 83". I am sure that the number "84" occurs only 36 times, but bitset.cardinality() reports 293. Please help.
static int line_counter = 0;
static TreeMap<String, BitSet> ItemsArray = new TreeMap<String, BitSet>();
public static void main(String[] args) throws IOException {
String[] line;
BufferedReader br = new BufferedReader(new FileReader("abc.txt"));
while (br.ready()) {
line = br.readLine().split(" ");
Arrays.sort(line);
ItemsArray(line);
line_counter++;
}
System.out.println("ItemsArray cardinality = " + ItemsArray.get("84").cardinality() + "\n");
}
private static void ItemsArray(String[] line) {
BitSet temp_bitset = new BitSet();
for (String item : line) {
temp_bitset.clear();
if (ItemsArray.get(item) == null) {
temp_bitset.set(line_counter);
ItemsArray.put(item, temp_bitset);
} else {
temp_bitset = (BitSet) ItemsArray.get(item).clone();
temp_bitset.set(line_counter);
ItemsArray.put(item, temp_bitset);
}
}
}

Your problem is that there is only one BitSet per call, shared by every item on the line. You then confuse matters by replacing it with a clone of the one from the map when the number has already appeared on an earlier line, and by clearing it at the top of each iteration, which also clears whichever BitSet you stored in the map on the previous iteration. As a result, entries in the map end up sharing or losing bits, and clone does not fix that.
Here's an idea:
static int line_counter = 0;
static TreeMap<String, BitSet> allBits = new TreeMap<String, BitSet>();
public static void main(String[] args) throws IOException {
String[] line;
BufferedReader br = new BufferedReader(new FileReader("abc.txt"));
while (br.ready()) {
line = br.readLine().split(" ");
Arrays.sort(line);
consumeItems(line);
line_counter++;
}
System.out.println("ItemsArray cardinality = " + allBits.get("84").cardinality() + "\n");
}
private static void consumeItems(String[] line) {
for (String item : line) {
BitSet temp = allBits.get(item);
if (temp == null) {
temp = new BitSet();
allBits.put(item, temp);
}
// Use a bit in the BitSet to indicate that this number appeared in that line.
temp.set(line_counter);
}
}
Not sure it's what you need but it demonstrates the normal technique for creating/updating map entries.
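On Java 8 or later, the get/null-check/put sequence can be collapsed with Map.computeIfAbsent. The sketch below shows that variant; ItemLineIndex and its methods are illustrative names, not part of the question. Note that cardinality() then gives the number of lines containing the item, so an item repeated within one line is still counted once.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: one BitSet per item, one bit per line in which the item appears.
class ItemLineIndex {
    private final Map<String, BitSet> linesByItem = new TreeMap<>();
    private int lineCounter = 0;

    void consumeLine(String line) {
        for (String item : line.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue;
            }
            // Create the BitSet on first sight of the item, then mark the current line.
            linesByItem.computeIfAbsent(item, k -> new BitSet()).set(lineCounter);
        }
        lineCounter++;
    }

    // Number of lines in which the item appeared at least once.
    int lineCount(String item) {
        BitSet bits = linesByItem.get(item);
        return bits == null ? 0 : bits.cardinality();
    }

    public static void main(String[] args) {
        ItemLineIndex index = new ItemLineIndex();
        index.consumeLine("59 23 2 84 83");
        index.consumeLine("84 84 1");
        index.consumeLine("7 8");
        System.out.println("lines containing 84 = " + index.lineCount("84"));
    }
}
```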

wordCount frequency returns repeated Set in java

I have a method that returns a single word as a String. I need to count all the words returned by this method as it reads a chunk of text. The problem is that the counts are right but the output is wrong: it is repeated. I am not sure where things are going wrong.
private int totalWords = 0;
private static Map<String, Integer> wordFrequency = new HashMap<String, Integer>();
public static void findResult(CharacterReader characterReader)
{
boolean x = true;
CharBuffer buffer = CharBuffer.allocate(100);
String str = "";
try
{
while(x)
{
char cha = characterReader.getNextChar();
Set<Character> charSet = new HashSet<Character>();
charSet.add(',');
charSet.add('.');
charSet.add(';');
charSet.add(':');
charSet.add('\'');
charSet.add('~');
charSet.add('?');
charSet.add('!');
charSet.add('%');
while(cha != ' ' && !charSet.contains(cha))
{
buffer.put(cha);
cha = characterReader.getNextChar();
}
buffer.flip();
str = buffer.toString();
buffer.clear();
countWords(str);
System.out.println(wordFrequency);
}
}catch(EOFException e)
{
x = false;
}
}
private static void countWords(String word)
{
if (wordFrequency.containsKey(word))
{
Integer count = wordFrequency.get(word);
count++;
wordFrequency.put(word, count);
} else {
wordFrequency.put(word, 1);
}
}
public static void main (String args[])
{
CharacterReader cr = new SimpleCharacterReader();
findResult(cr);
}

Move
System.out.println(wordFrequency);
to after the try/catch statement. You are printing the whole map after each word.

It's all in where you've placed your System.out.println. You've got it inside of the loop!
while(x) {
// .....
countWords(str);
System.out.println(wordFrequency);
}
Solution: do it after the loop.
while(x) {
// .....
countWords(str);
}
System.out.println(wordFrequency);

Try moving
System.out.println(wordFrequency);
out of the while loop braces...
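Putting this together, here is a self-contained sketch that counts first and prints once. CharacterReader from the question is not available here, so the sketch assumes a plain java.io.Reader as a stand-in and treats any character that is not a letter or digit as a separator; it also skips the empty words that consecutive separators would otherwise produce.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the question's code, reading from a java.io.Reader.
class WordFrequency {
    static Map<String, Integer> count(Reader reader) throws IOException {
        Map<String, Integer> frequency = new TreeMap<>();
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isLetterOrDigit(c)) {
                word.append((char) c);
            } else if (word.length() > 0) {
                // A separator ends the current word; count it and start a new one.
                frequency.merge(word.toString(), 1, Integer::sum);
                word.setLength(0);
            }
        }
        if (word.length() > 0) {
            frequency.merge(word.toString(), 1, Integer::sum); // last word before end of input
        }
        return frequency;
    }

    public static void main(String[] args) throws IOException {
        // Print once, after all words have been counted.
        System.out.println(count(new StringReader("the cat, the dog.")));
        // prints {cat=1, dog=1, the=2}
    }
}
```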
