I'm trying to compute TF-IDF in Java, with a database as the corpus of documents. I have calculated the term frequencies and stored them in a HashMap, but I have a problem: how can I calculate the document frequency of each term? E.g., the term "president" occurs in which document IDs, and in how many documents does it occur? I have 3 documents in the DB for training, and I'm stuck here. Any advice? Thanks for helping. Here is my code:
try {
    Map<String, Integer> kamus = new HashMap<String, Integer>();
    k.Koneksi();
    String sql = "select * from data_berita";
    Statement n = k.koneksi.createStatement();
    ResultSet res = n.executeQuery(sql);
    while (res.next()) {
        int id = res.getInt("id");
        String konten = res.getString("konten");
        // strip punctuation and lower-case the content
        String f = konten.replaceAll("[-?/<>_+=!##%&*.“”‘’()$·,';:{}|\"]", "").toLowerCase();
        String[] array = f.split("\\s+");
        for (String s : array) {
            int fl = 1;
            // NOTE: this re-queries the StopWord table for every single token
            k.Koneksi();
            String sq = "Select * from StopWord";
            Statement stat = k.koneksi.createStatement();
            ResultSet rs = stat.executeQuery(sq);
            while (rs.next()) {
                if (s.equals(rs.getString("Stopword"))) {
                    fl = 0;
                }
            }
            if (fl != 0) {
                if (kamus.containsKey(s)) {
                    kamus.put(s, kamus.get(s) + 1);
                } else {
                    kamus.put(s, 1);
                }
            }
        }
    }
    for (Map.Entry<String, Integer> en : kamus.entrySet()) {
        String d = en.getKey();
        Integer s = en.getValue();
        System.out.println(d + " " + s);
    }
} catch (Exception e) {
    e.printStackTrace();
}
And this is the term-frequency result (not all of it):
alliance=1, president=3, suzhou=3, unloading=1, attended=1, liquefied=1, bright=1
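One way to get the document frequency, sketched against the same tables as above (untested; computeIfAbsent needs Java 8): track, for each term, the set of document IDs it occurs in, so the document frequency of a term is simply the size of its set. The stopwords are loaded once into a Set instead of being re-queried per token.

// load stopwords once into a Set instead of re-querying per token
Set<String> stopwords = new HashSet<>();
ResultSet rs = k.koneksi.createStatement().executeQuery("select Stopword from StopWord");
while (rs.next()) {
    stopwords.add(rs.getString("Stopword"));
}

// term -> IDs of the documents that contain it
Map<String, Set<Integer>> postings = new HashMap<>();
ResultSet res = k.koneksi.createStatement().executeQuery("select * from data_berita");
while (res.next()) {
    int id = res.getInt("id");
    String[] tokens = res.getString("konten")
            .replaceAll("[-?/<>_+=!##%&*.“”‘’()$·,';:{}|\"]", "")
            .toLowerCase().split("\\s+");
    for (String s : tokens) {
        if (!stopwords.contains(s)) {
            postings.computeIfAbsent(s, t -> new HashSet<Integer>()).add(id);
        }
    }
}

// document frequency: which documents contain the term, and how many
Set<Integer> docsWithPresident = postings.get("president");   // the document IDs
int df = docsWithPresident == null ? 0 : docsWithPresident.size();
// idf is commonly computed as Math.log((double) N / df), with N = 3 documents here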
I only get the first row from the table in a JDBC call. Below is my code:
import java.util.*;
import java.sql.*;

public class TrainManagementSystem {
    public ArrayList<Train> viewTrain(String coachType, String source, String destination) {
        Connection myCon = null;
        PreparedStatement myStat = null;
        ResultSet myRes = null;
        ArrayList<Train> result = new ArrayList<>();
        try {
            myCon = DB.getConnection();
            myStat = myCon.prepareStatement("select * from train where source = ? and destination = ? and ? > 0");
            myStat.setString(1, source);
            myStat.setString(2, destination);
            myStat.setString(3, coachType);
            myRes = myStat.executeQuery();
            while (myRes.next()) {
                int train_number = myRes.getInt("train_number");
                String train_name = myRes.getString("train_name");
                String source1 = myRes.getString("source");
                String destination1 = myRes.getString("destination");
                int ac1 = myRes.getInt("ac1");
                int ac2 = myRes.getInt("ac2");
                int ac3 = myRes.getInt("ac3");
                int sleeper = myRes.getInt("sleeper");
                int seater = myRes.getInt("seater");
                System.out.println(myRes.getString("train_name"));
                Train train = new Train(train_number, train_name, source1, destination1, ac1, ac2, ac3, sleeper, seater);
                result.add(train);
                return result;
            }
        } catch (Exception e) {
            System.out.println(e);
        }
        return result;
    }
}
I only get the first row of the result set in JDBC; the query is not retrieving the second row. I have attached my table entries and the input I am giving. Why is it not going to the next row? When I try removing coachType and only search for trains between Howrah and Dehradun, it gives me Dehradun Mail, whereas the result should include Doon Express as well.
Because you have a return result; inside your while loop. Move it down a line, outside of the loop.
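Applied to the method above, a minimal sketch of the fixed loop (same columns as in the question):

while (myRes.next()) {
    Train train = new Train(
            myRes.getInt("train_number"), myRes.getString("train_name"),
            myRes.getString("source"), myRes.getString("destination"),
            myRes.getInt("ac1"), myRes.getInt("ac2"), myRes.getInt("ac3"),
            myRes.getInt("sleeper"), myRes.getInt("seater"));
    result.add(train);   // keep collecting rows
}
return result;           // moved outside the loop: now every matching row is returned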
I have to get 'tags' from the database and store them in an array so I can check whether my document contains them. Due to the number of tag categories (customers, system dependencies, keywords) I have multiple arrays to compare my document with. Is there an easy way to simplify this and make my code look nicer? This is my approach, but it looks terrible with all the repetitive for loops.
ArrayList<String> KEYWORDS2 = new ArrayList<String>();
ArrayList<String> CUSTOMERS = new ArrayList<String>();
ArrayList<String> SYSTEM_DEPS = new ArrayList<String>();
ArrayList<String> MODULES = new ArrayList<String>();
ArrayList<String> DRIVE_DEFS = new ArrayList<String>();
ArrayList<String> PROCESS_IDS = new ArrayList<String>();

while (resultSet2.next()) {
    CUSTOMERS.add(resultSet2.getString(1));
}

sql = "SELECT da_tag_name FROM da_tags WHERE da_tag_type_id = 6";
stmt = conn.prepareStatement(sql);
resultSet2 = stmt.executeQuery();
while (resultSet2.next()) {
    SYSTEM_DEPS.add(resultSet2.getString(1));
}

while (resultSet.next()) {
    String da_document_id = resultSet.getString(1);
    String file_name = resultSet.getString(2);
    try {
        if (file_name.endsWith(".docx") || file_name.endsWith(".docm")) {
            System.out.println(file_name);
            XWPFDocument document = new XWPFDocument(resultSet.getBinaryStream(3));
            XWPFWordExtractor wordExtractor = new XWPFWordExtractor(document);
            // Report what's inside the document
            System.out.println("Keywords found in the document:");
            for (String keyword : KEYWORDS2) {
                if (wordExtractor.getText().contains(keyword)) {
                    System.out.println(keyword);
                }
            }
            System.out.println("\nCustomers found in the document:");
            for (String customer : CUSTOMERS) {
                if (wordExtractor.getText().contains(customer)) {
                    System.out.println(customer);
                }
            }
            System.out.println("\nSystem dependencies found in the document:");
            for (String systemDeps : SYSTEM_DEPS) {
                if (wordExtractor.getText().contains(systemDeps)) {
                    System.out.println(systemDeps);
                }
            }
            System.out.println("Log number: " + findLogNumber(wordExtractor));
            System.out.println("------------------------------------------");
            wordExtractor.close();
        }
As you can see, there are 3 more to come, and this doesn't look good already. Maybe there's a way to compare all of them at the same time. I have made another attempt at this, creating this method:
public void genericForEachLoop(ArrayList<String> al, POITextExtractor te) {
    for (String item : al) {
        if (te.getText().contains(item)) {
            System.out.println(item);
        }
    }
}
Then calling it like so: genericForEachLoop(MODULES, wordExtractor);
Any better solutions?
I've got two ideas to shorten this. First, you can write a general for-loop in a separate method that takes an ArrayList as a parameter, then pass it each of your ArrayLists in turn; that way you at least don't have to repeat the for-loops. Second, you can create an ArrayList of ArrayLists, store your ArrayLists inside it, and iterate over the whole thing. The only apparent disadvantage of both ideas (or a combination of them) is that you would need to name the variable for your query string alike for the search of each ArrayList.
What you could do is use a Map and an enum like this:
enum TagType {
    KEYWORDS2(2), // or whatever its da_tag_type_id is
    CUSTOMERS(4),
    SYSTEM_DEPS(6),
    MODULES(8),
    DRIVE_DEFS(10),
    PROCESS_IDS(12);

    public final int daTagTypeId; // this will be used in queries

    TagType(int daTagTypeId) {
        this.daTagTypeId = daTagTypeId;
    }
}
Map<TagType, List<String>> tags = new HashMap<>();
XWPFDocument document = new XWPFDocument(resultSet.getBinaryStream(3));
XWPFWordExtractor wordExtractor = new XWPFWordExtractor(document);
for (TagType tagType : TagType.values()) {
    tags.put(tagType, new ArrayList<>()); // initialize
    String sql = String.format("SELECT da_tag_name FROM da_tags WHERE da_tag_type_id = %d", tagType.daTagTypeId); // build query
    stmt = conn.prepareStatement(sql);
    resultSet2 = stmt.executeQuery();
    while (resultSet2.next()) { // fill from DB
        tags.get(tagType).add(resultSet2.getString(1));
    }
    System.out.println(String.format("%s found in the document:", tagType.name()));
    for (String tag : tags.get(tagType)) { // search in text
        if (wordExtractor.getText().contains(tag)) {
            System.out.println(tag);
        }
    }
}
But at this point I'm not sure you need those lists at all:
enum TagType {
    KEYWORDS2(2), // or whatever its da_tag_type_id is
    CUSTOMERS(4),
    SYSTEM_DEPS(6),
    MODULES(8),
    DRIVE_DEFS(10),
    PROCESS_IDS(12);

    public final int daTagTypeId; // this will be used in queries

    TagType(int daTagTypeId) {
        this.daTagTypeId = daTagTypeId;
    }
}
XWPFDocument document = new XWPFDocument(resultSet.getBinaryStream(3));
XWPFWordExtractor wordExtractor = new XWPFWordExtractor(document);
for (TagType tagType : TagType.values()) {
    String sql = String.format("SELECT da_tag_name FROM da_tags WHERE da_tag_type_id = %d", tagType.daTagTypeId); // build query
    stmt = conn.prepareStatement(sql);
    resultSet2 = stmt.executeQuery();
    System.out.println(String.format("%s found in the document:", tagType.name()));
    while (resultSet2.next()) {
        String tag = resultSet2.getString(1);
        if (wordExtractor.getText().contains(tag)) {
            System.out.println(tag);
        }
    }
}
This is written without knowing where resultSet is declared and initialised, nor where resultSet2 is initialised. Basically, you just fetch the tags for each type from the DB and then search for them directly in the text, without storing them first and re-iterating over the stored ones; I mean, that's what the DB is there for.
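As a side note, since the statement is already a PreparedStatement, the tag type id could be bound as a parameter rather than formatted into the SQL string; a small sketch of the same query:

stmt = conn.prepareStatement("SELECT da_tag_name FROM da_tags WHERE da_tag_type_id = ?");
stmt.setInt(1, tagType.daTagTypeId); // bind the enum's id instead of using String.format
resultSet2 = stmt.executeQuery();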
I am having a lot of trouble iterating through all my records. Perhaps someone could help by reading my code.
private String saveData(Handle handle, String username, String name, String prof, String table) {
    String sqlCommand;
    Map<String, Object> userResults;
    for (Integer tableNum = 1; tableNum < 5; tableNum++) {
        // query all tables
        sqlCommand = String.format("SELECT varname FROM s" + tableNum.toString());
        userResults = handle.createQuery(sqlCommand)
                .bind("username", username)
                .first();
        // attempt to iterate all records
        for (Map.Entry<String, Object> entry : userResults.entrySet()) {
            Object obj = entry.getValue(); // doesn't have .get(string) as below
        }
        // get the desired field
        logger.debug("Results: " + userResults.toString());
        String varname = (String) userResults.get("varname");
        if ((varname.toLowerCase()).matches(name.toLowerCase()))
            return "";
    }
    // save data
    return name;
}
How do I iterate through each record of the table?
You say this works for row 1. You cannot go to the next row because there is .first(); in your handle, so you never try to fetch the next record. Modify your query; the documentation here says you can use .list(maxRows);
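A minimal sketch of the loop rewritten with .list() (assuming JDBI's fluent API, where each row comes back as a Map<String, Object>):

List<Map<String, Object>> rows = handle.createQuery(sqlCommand)
        .bind("username", username)
        .list();                      // fetch every row, not just the first
for (Map<String, Object> row : rows) {
    String varname = (String) row.get("varname");
    if (varname.toLowerCase().matches(name.toLowerCase()))
        return "";
}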
I'm trying to create a Term-Document matrix for a small corpus to experiment further with LSI. However, I couldn't find a way to do it with Lucene 4.4.
I know how to get TermVector for each document as following:
//create boolean query to search for a specific document (not shown)
TopDocs hits = searcher.search(query, 1);
Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
System.out.println(termVector.size()); //just testing
I thought I could just union all the term vectors together as columns in a matrix to get the matrix. However, the term vectors of different documents have different sizes, and we don't know how to pad 0s into a term vector, so this method certainly does not work.
Hence, I wonder if someone can show me how to create a Term-Document matrix with Lucene 4.4 (if possible, please show sample code). If Lucene does not support this function, what other way would you recommend? Many thanks.
I found the solution to my problem here. A very detailed example is given by Mr. Sujit, although the code is written for an older version of Lucene, so many things will have to be changed. I'll update the details when I finish my code.
Here is my solution that works on Lucene 4.4
public class BuildTermDocumentMatrix {

    private final IndexReader reader;
    private final IndexSearcher searcher;
    private final File corpus;
    private final Map<String, Integer> termIdMap;

    public BuildTermDocumentMatrix(File index, File corpus) throws IOException {
        reader = DirectoryReader.open(FSDirectory.open(index));
        searcher = new IndexSearcher(reader);
        this.corpus = corpus;
        termIdMap = computeTermIdMap(reader);
    }

    /**
     * Map each term to a fixed integer so that we can build the document matrix later.
     * It's used to assign each term to a specific row in the Term-Document matrix.
     */
    private Map<String, Integer> computeTermIdMap(IndexReader reader) throws IOException {
        Map<String, Integer> termIdMap = new HashMap<String, Integer>();
        int id = 0;
        Fields fields = MultiFields.getFields(reader);
        Terms terms = fields.terms("contents");
        TermsEnum itr = terms.iterator(null);
        BytesRef term = null;
        while ((term = itr.next()) != null) {
            String termText = term.utf8ToString();
            if (termIdMap.containsKey(termText))
                continue;
            termIdMap.put(termText, id++);
        }
        return termIdMap;
    }

    /**
     * Build the term-document matrix for the given directory.
     */
    public RealMatrix buildTermDocumentMatrix() throws IOException {
        // iterate through the directory to work with each doc
        int col = 0;
        int numDocs = countDocs(corpus);   // get the number of documents here
        int numTerms = termIdMap.size();   // total number of terms
        RealMatrix tdMatrix = new Array2DRowRealMatrix(numTerms, numDocs);

        for (File f : corpus.listFiles()) {
            if (!f.isHidden() && f.canRead()) {
                // I build the term-document matrix for a subset of the corpus, so
                // I need to look up each document by path name.
                // If you build it for the whole corpus, just iterate through all documents.
                String path = f.getPath();
                BooleanQuery pathQuery = new BooleanQuery();
                pathQuery.add(new TermQuery(new Term("path", path)), BooleanClause.Occur.SHOULD);
                TopDocs hits = searcher.search(pathQuery, 1);

                // get the term vector
                Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
                TermsEnum itr = termVector.iterator(null);
                BytesRef term = null;

                // compute each term's weight
                while ((term = itr.next()) != null) {
                    String termText = term.utf8ToString();
                    int row = termIdMap.get(termText);
                    long termFreq = itr.totalTermFreq();
                    long docCount = itr.docFreq();
                    double weight = computeTfIdfWeight(termFreq, docCount, numDocs);
                    tdMatrix.setEntry(row, col, weight);
                }
                col++;
            }
        }
        return tdMatrix;
    }
}
One can refer to the code below as well; with the latest Lucene version it will be quite easy.
public void testSparseFreqDoubleArrayConversion() throws Exception {
    Terms fieldTerms = MultiFields.getTerms(index, "text");
    if (fieldTerms != null && fieldTerms.size() != -1) {
        IndexSearcher indexSearcher = new IndexSearcher(index);
        for (ScoreDoc scoreDoc : indexSearcher.search(new MatchAllDocsQuery(), Integer.MAX_VALUE).scoreDocs) {
            Terms docTerms = index.getTermVector(scoreDoc.doc, "text");
            Double[] vector = DocToDoubleVectorUtils.toSparseLocalFreqDoubleArray(docTerms, fieldTerms);
            assertNotNull(vector);
            assertTrue(vector.length > 0);
        }
    }
}
I have a collection of raw text in a table in a database, and I need to replace some words in this collection using a set of substitute words. I put all the terms to be replaced, together with their substitutes, in a text file as below:
min=admin
lelet=lambat
lemot=lambat
nii=nih
ntu=itu
and so on.
I have successfully initialised a File and a Scanner to read the collection of terms and their substitutes. I loop over the whole dataset and save each raw text in a string. In the same loop, I loop over the term collection, save each row in a string named 'pattern', and split the pattern into two strings named 'term' and 'replacer'. Inside this loop I initialise a new string whose value is the dataset string modified by replaceAll(term, replacer). Then the loop over the term collection ends, I insert the new string into another table in the database, and the loop over the dataset ends.
I can do it manually, as below, and it works:
replaceAll("min", "admin")
But it is really something to code it manually for almost 2000 terms to be replaced. Has anyone ever faced this kind of thing? I really need help now; I'm desperate :(
package sentimenrepo;

import javax.swing.*;
import java.sql.*;
import java.io.*;
//import java.util.HashMap;
import java.util.Scanner;
//import java.util.Map;

/**
 * @author herman
 */
public class synonimReplaceV2 extends SwingWorker {

    protected Object doInBackground() throws Exception {
        new skripsisentimen.sentimenttwitter().setVisible(true);
        Integer row = 0;
        File synonimV2 = new File("synV2/catatan_kata_sinonim.txt");
        String newTweet = "";
        DB db = new DB();
        Connection conn = db.dbConnect("jdbc:mysql://localhost:3306/tweet", "root", "");
        try {
            Statement select = conn.createStatement();
            select.executeQuery("select * from synonimtweet");
            ResultSet RS = select.getResultSet();
            Scanner scSynV2 = new Scanner(synonimV2);
            while (RS.next()) {
                row++;
                String no = RS.getString("no");
                String tweet = " " + RS.getString("tweet");
                String published = RS.getString("published");
                String label = RS.getString("label");
                clean2 cleanv2 = new clean2();
                newTweet = cleanv2.cleanTweet(tweet);
                try {
                    Statement insert = conn.createStatement();
                    insert.executeUpdate("INSERT INTO synonimtweet_v2(no,tweet,published,label) values('"
                            + no + "','" + newTweet + "','" + published + "','" + label + "')");
                    String current = skripsisentimen.sentimenttwitter.txtAreaResult.getText();
                    skripsisentimen.sentimenttwitter.txtAreaResult.setText(current + "\n" + row + "original : " + tweet + "\n" + newTweet + "\n______________________\n");
                    skripsisentimen.sentimenttwitter.lblStat.setText(row + " tweet read");
                    skripsisentimen.sentimenttwitter.txtAreaResult.setCaretPosition(skripsisentimen.sentimenttwitter.txtAreaResult.getText().length() - 1);
                } catch (Exception e) {
                    skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
                }
            }
        } catch (Exception e) {
            skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
        }
        return row;
    }

    class clean2 {
        public clean2() {}

        public String cleanTweet(String tweet) {
            File synonimV2 = new File("synV2/catatan_kata_sinonim.txt");
            String pattern = "";
            String term = "";
            String replacer = "";
            String newTweet = "";
            try {
                Scanner scSynV2 = new Scanner(synonimV2);
                while (scSynV2.hasNext()) {
                    pattern = scSynV2.next();
                    term = pattern.split("=")[0];
                    replacer = pattern.split("=")[1];
                    newTweet = tweet.replace(term, replacer);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.out.println(newTweet + "\n" + tweet);
            return newTweet;
        }
    }
}
Update: I've just realised that the code actually works, but only for the first row in the database; the second row and so on stand still. Here is the newest code I've built:
public class synonimReplaceV2 extends SwingWorker {

    protected Object doInBackground() throws Exception {
        new skripsisentimen.sentimenttwitter().setVisible(true);
        Integer row = 0;
        String newTweet = "";
        DB db = new DB();
        Connection conn = db.dbConnect("jdbc:mysql://localhost:3306/tweet", "root", "");
        try {
            Statement select = conn.createStatement();
            select.executeQuery("select * from synonimtweet limit 2,10");
            ResultSet RS = select.getResultSet();
            FileReader readSyn = new FileReader("synV2/catatan_kata_sinonim.txt");
            BufferedReader buffSyn = new BufferedReader(readSyn);
            while (RS.next()) {
                row++;
                String no = RS.getString("no");
                String tweet = " " + RS.getString("tweet");
                String published = RS.getString("published");
                String label = RS.getString("label");
                String pattern = "";
                while ((pattern = buffSyn.readLine()) != null) {
                    String patternTerm = pattern.split("=")[0];
                    String patternSubs = pattern.split("=")[1];
                    tweet = tweet.replaceAll("\\s" + patternTerm, patternSubs);
                }
                try {
                    Statement insert = conn.createStatement();
                    insert.executeUpdate("INSERT INTO synonimtweet_v2(no,tweet,published,label) values('"
                            + no + "','" + tweet + "','" + published + "','" + label + "')");
                    String current = skripsisentimen.sentimenttwitter.txtAreaResult.getText();
                    skripsisentimen.sentimenttwitter.txtAreaResult.setText(current + "\n" + row + "original : " + tweet + "\n" + newTweet + "\n______________________\n");
                    skripsisentimen.sentimenttwitter.lblStat.setText(row + " tweet read");
                    skripsisentimen.sentimenttwitter.txtAreaResult.setCaretPosition(skripsisentimen.sentimenttwitter.txtAreaResult.getText().length() - 1);
                } catch (Exception e) {
                    skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
                }
            }
        } catch (Exception e) {
            skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
            // System.out.println(e.getMessage());
        }
        Thread.sleep(100);
        return row;
    }
}
Opening the synonym file and iterating over 2,000 lines for every row in your ResultSet is wasteful. Load your synonyms into an in-memory Map once, keyed by the misspelt term, then do a lookup on the map for every row in your result set and replace as necessary.
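A minimal sketch of that loading step, assuming the same file format as above (term=replacement, one pair per line):

Map<String, String> synonyms = new HashMap<>();
try (BufferedReader br = new BufferedReader(new FileReader("synV2/catatan_kata_sinonim.txt"))) {
    String line;
    while ((line = br.readLine()) != null) {
        String[] parts = line.split("=", 2);   // split into term and replacement
        if (parts.length == 2) {
            synonyms.put(parts[0], parts[1]);
        }
    }
}

The map is built once, before the ResultSet loop, and reused for every row.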
Let us combine both solutions into a single one for you. First, you create a HashMap with all your keys:
public static HashMap<String, String> getMap() {
    // your version would read from the file
    HashMap<String, String> myMap = new HashMap<String, String>();
    myMap.put("min", "admin");
    myMap.put("lelet", "lambat");
    myMap.put("lemot", "lambat");
    myMap.put("nii", "nih");
    myMap.put("ntu", "itu");
    return myMap;
}
Second, you create a pattern that contains all the keys in your hashmap:
public static String getPattern(HashMap<String, String> mapReplacement) {
    String pattern = "";
    for (String s : mapReplacement.keySet()) {
        if (!pattern.isEmpty()) {
            pattern = pattern + "|";
        }
        pattern = pattern + s;
    }
    return pattern;
}
Next, you can create a cleanTweet method that uses both structures you created:
public static String cleanTweet(String tweet, Pattern pattern, HashMap<String, String> myMap) {
    String newTweet = tweet;
    Matcher matcher = pattern.matcher(newTweet);
    while (matcher.find()) {
        String key = matcher.group();
        String replacement = myMap.get(key);
        if (replacement != null) {
            newTweet = newTweet.replace(key, replacement);
        }
    }
    return newTweet;
}
This might require some tweaking to perfect (I only tested a few cases), but the point is that you iterate a single time over your keys and then iterate only over your tweets. I hope it helps.
I didn't try it, but it seems to me that you've almost got it. Just replace this line:
newTweet = tweet.replace(term, replacer);
with this:
tweet = tweet.replaceAll(term, replacer);
As you're not using newTweet any more, return tweet:
return tweet;
You should also delete the newTweet declaration. Also, you shouldn't use a Scanner to read lines; use a FileReader instead.
Thanks, folks. I've found the answer to why the code was not working: the reader over the txt file containing the terms and their substitutes has to be re-initialised each time the program reads a row from the database. The code looks like this:
public class synonimReplaceV2 extends SwingWorker {

    protected Object doInBackground() throws Exception {
        new skripsisentimen.sentimenttwitter().setVisible(true);
        Integer row = 0;
        String newTweet = "";
        DB db = new DB();
        Connection conn = db.dbConnect("jdbc:mysql://localhost:3306/tweet", "root", "");
        try {
            Statement select = conn.createStatement();
            select.executeQuery("select * from synonimtweet limit 2,10");
            ResultSet RS = select.getResultSet();
            while (RS.next()) {
                row++;
                // re-open the synonym file for every row so the reader starts from the top
                FileReader readSyn = new FileReader("synV2/catatan_kata_sinonim.txt");
                BufferedReader buffSyn = new BufferedReader(readSyn);
                String no = RS.getString("no");
                String tweet = " " + RS.getString("tweet");
                String published = RS.getString("published");
                String label = RS.getString("label");
                String pattern = "";
                while ((pattern = buffSyn.readLine()) != null) {
                    String patternTerm = pattern.split("=")[0];
                    String patternSubs = pattern.split("=")[1];
                    tweet = tweet.replaceAll("\\s" + patternTerm, patternSubs);
                }
                try {
                    Statement insert = conn.createStatement();
                    insert.executeUpdate("INSERT INTO synonimtweet_v2(no,tweet,published,label) values('"
                            + no + "','" + tweet + "','" + published + "','" + label + "')");
                    String current = skripsisentimen.sentimenttwitter.txtAreaResult.getText();
                    skripsisentimen.sentimenttwitter.txtAreaResult.setText(current + "\n" + row + "original : " + tweet + "\n" + newTweet + "\n______________________\n");
                    skripsisentimen.sentimenttwitter.lblStat.setText(row + " tweet read");
                    skripsisentimen.sentimenttwitter.txtAreaResult.setCaretPosition(skripsisentimen.sentimenttwitter.txtAreaResult.getText().length() - 1);
                } catch (Exception e) {
                    skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
                }
            }
        } catch (Exception e) {
            skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
            // System.out.println(e.getMessage());
        }
        Thread.sleep(100);
        return row;
    }
}
But I actually want to apply the code that rlinden made above, and I can't figure out how to call the cleanTweet function.
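For reference, a minimal sketch of how rlinden's pieces could be wired together (assuming the methods live in the same class and java.util.regex.Pattern/Matcher are imported):

HashMap<String, String> myMap = getMap();                // or a map loaded from the file
Pattern pattern = Pattern.compile(getPattern(myMap));    // compile the key alternation once
String cleaned = cleanTweet(tweet, pattern, myMap);      // call this for each tweet in the ResultSet loop

The Pattern should be compiled once, outside the ResultSet loop, and then reused for every tweet.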