I'm writing a simple Java program that does a simple task: it takes as input a folder of text files, and it returns as output the 5 words with the highest frequency per document.
At first, I tried to do it without any database support, but when I started having memory problems, I decided to change my approach and configured the program to run with SQLite.
Everything works just fine now, but it takes a long time just to add the words to the database (67 seconds for 801 words).
Here is how I initialize the database:
this.Execute(
    "CREATE TABLE words (" +
        "word VARCHAR(20)" +
    ");"
);
this.Execute(
    "CREATE UNIQUE INDEX wordindex ON words (word);"
);
Then, once the program has counted the documents in the folder (let's say N), I add N counter columns and N frequency columns to the table:
for (int i = 0; i < fileList.size(); i++)
{
    db.Execute("ALTER TABLE words ADD doc" + i + " INTEGER");
    db.Execute("ALTER TABLE words ADD freq" + i + " DOUBLE");
}
Finally, I add words using the following function:
public void AddWord(String word, int docid)
{
    String query = "UPDATE words SET doc" + docid + "=doc" + docid + "+1 WHERE word='" + word + "'";
    int rows = this.ExecuteUpdate(query);
    if (rows <= 0)
    {
        query = "INSERT INTO words (word,doc" + docid + ") VALUES ('" + word + "',1)";
        this.ExecuteUpdate(query);
    }
}
Am I doing something wrong, or is it normal for an update query to take this long to execute?
Wrap all commands inside one transaction; otherwise you get one transaction (with the associated storage synchronization) per command.
12 per second is slow but not unreasonable. With a database like MySQL I would expect it to be closer to 100/second on HDD storage.
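For illustration, here is a minimal sketch of the transaction-wrapped version, assuming a plain JDBC Connection to the SQLite file (conn, docid, and words are placeholders for whatever the program actually uses, not names from it). As a side benefit, the prepared statements also remove the SQL injection risk of concatenating word into the query:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch: add all words of one document inside a single transaction.
static void addWords(Connection conn, int docid, Iterable<String> words) throws SQLException {
    conn.setAutoCommit(false);                 // one transaction for the whole batch
    try (PreparedStatement update = conn.prepareStatement(
             "UPDATE words SET doc" + docid + " = doc" + docid + " + 1 WHERE word = ?");
         PreparedStatement insert = conn.prepareStatement(
             "INSERT INTO words (word, doc" + docid + ") VALUES (?, 1)")) {
        for (String word : words) {
            update.setString(1, word);
            if (update.executeUpdate() == 0) { // no row updated: word is new
                insert.setString(1, word);
                insert.executeUpdate();
            }
        }
        conn.commit();                         // one storage sync instead of one per word
    } catch (SQLException e) {
        conn.rollback();
        throw e;
    }
}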
I am about to generate an Excel file based on the user's request.
Input:
DateRange - 2022/02/01-2022/02/07
Scenario
The system will retrieve the logs from the database based on the DateRange. Each log contains a person's name and the date when it was added. The system will also retrieve the list of people from the database. After retrieving the logs and the people, I want to get the number of occurrences of each person on each date.
Database Info:
logs table - 10k rows or more
person table - at least 1,500 people.
Expected output:
Problem
From the given data above, there can be 10,000 (logs) * 1,500 (persons) = 15M or more iterations to get the total occurrences per person. This results in a heavy response that takes almost 60 seconds or more.
Here is my code:
// initialize days
List<Date> days = getDaysFromRequest(); // get the range from request
for (Person person : getPersonList()) {
    // .... code here to display Persons
    for (Date day : days) {
        // .... code here to display day
        int total = 0;
        for (UserLog log : getUserLog()) {
            // use equals(): == compares object references for Date/String
            if (day.equals(log.dateAdded) && log.personName.equals(person.Name)) {
                total++;
            }
        }
        System.out.println(total); // write total here in excel sheet, e.g. at cell B2
    }
}
How should I optimize this?
If I get it right, all the information you want seems to be in the logs, or if not, it defaults to zero. Therefore I would do something like:
Map<String, Map<LocalDate, Long>> occurrenceByNameAndDate = // Map<Name, Map<Date, Count>>
        userLogs.stream().collect(Collectors.groupingBy(UserLog::personName,
                Collectors.groupingBy(UserLog::dateAdded,
                        Collectors.counting())));
and use the above map somehow like:
personList.forEach(person -> dateRange.forEach(day -> {
    long count = occurrenceByNameAndDate
            .getOrDefault(person.Name, Collections.emptyMap())
            .getOrDefault(day, 0L);
    writeToExcel(person, day, count);
}));
Or do it on the DB side
SELECT personName, dateAdded, COUNT(*)
FROM UserLog
WHERE dateAdded BETWEEN ... AND ...
GROUP BY personName, dateAdded
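For completeness, a rough sketch of consuming that query from Java and filling the same map (assuming a JDBC connection conn and LocalDate variables rangeStart/rangeEnd; table and column names are taken from the query above):

import java.sql.*;
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

// Sketch: let the database aggregate, then load the counts into the
// same Map<Name, Map<Date, Count>> structure used above.
String sql = "SELECT personName, dateAdded, COUNT(*) AS cnt "
           + "FROM UserLog WHERE dateAdded BETWEEN ? AND ? "
           + "GROUP BY personName, dateAdded";
Map<String, Map<LocalDate, Long>> occurrenceByNameAndDate = new HashMap<>();
try (PreparedStatement ps = conn.prepareStatement(sql)) {
    ps.setDate(1, java.sql.Date.valueOf(rangeStart));
    ps.setDate(2, java.sql.Date.valueOf(rangeEnd));
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            occurrenceByNameAndDate
                    .computeIfAbsent(rs.getString("personName"), k -> new HashMap<>())
                    .put(rs.getDate("dateAdded").toLocalDate(), rs.getLong("cnt"));
        }
    }
}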
I'm trying to write an SQL query which runs over a set and checks whether the id is in the set, but it gives the error that only 1000 items can be in the list. I'm trying to solve it but I got stuck here:
for (int i = 0; i < e.getEmployeeSet().size(); i += 1000) {
    sqlQuery.append("AND employee.id ");
    if (!e.includeEmployee()) {
        sqlQuery.append("NOT ");
    }
    sqlQuery.append("IN (");
    for (Employee employee : e.getEmployeeSet()) {
        sqlQuery.append(employee.getEmployeeId())
                .append(",");
    }
    sqlQuery.deleteCharAt(sqlQuery.length() - 1)
            .append(") ");
}
I still have to figure out that the first time it has to be AND id ..., while the other times it has to be OR ..., and I have to go over the set so that the first pass only covers the first 1000 employees, and so on. Is there any clean way to fix this?
SQL allows up to 1000 list values in an IN clause, and a long inline IN list is not an efficient approach anyway.
It is better to store the data in a temporary table and join it in your query.
Temporary table creation:
create table temp_emprecords as
select * from a, b, c
where clause...;
Now add the temp_emprecords table to your query and join on the employee id:
select *
from employee emp,
temp_emprecords tmp
where emp.id = tmp.id;
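If the ids only exist in application memory, here is a hedged sketch of filling such a table from Java before running the join (it assumes a temp_emprecords(id) table already exists and conn is the JDBC connection; both names are placeholders):

// Sketch: batch-insert the in-memory ids into the temporary table,
// then run the join query above.
try (PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO temp_emprecords (id) VALUES (?)")) {
    for (Long id : e.getEmployeeSet()) {
        ps.setLong(1, id);
        ps.addBatch();
    }
    ps.executeBatch(); // one round trip for all ids
}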
You can modify your SQL to be like:
SELECT /* or UPDATE (whatever you do) */
...
WHERE
employee.id IN (... first thousand elements ...)
OR
employee.id IN (... next thousand elements ...)
OR
... and so on ...
Your Java code will be slightly different, producing an "OR employee.id IN" block for each thousand ids.
UPD: to make it, just introduce another counter, like this (pseudocode):
counter = 0;
for each employeeId {
    if counter equals 1000 {
        complete current IN block;
        counter = 0;
        if not first thousand {
            start new OR block;
        }
        start new IN block;
    }
    add employeeId into IN block;
    counter++;
}
But important: I do not recommend going the way you do, either with or without OR blocks,
because constructing SQL the way you do is a direct path to SQL injection.
To avoid it, just follow a simple rule:
No actual data should ever be inlined in the SQL string; all data must be passed to the query as parameters.
You have to use a prepared SQL statement with parameters for the employee.id values.
Also: a simple way is to run a separate query for each 1000 ids in a loop.
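A hedged sketch of that combination, one parameterized IN query per chunk of at most 1000 ids (employee.id and e.getEmployeeSet() follow the question; conn and the row-processing code are placeholders):

import java.sql.*;
import java.util.*;

// Sketch: run a separate PreparedStatement per chunk of <= 1000 ids,
// so no id value is ever concatenated into the SQL string.
List<Long> ids = new ArrayList<>(e.getEmployeeSet());
for (int from = 0; from < ids.size(); from += 1000) {
    List<Long> chunk = ids.subList(from, Math.min(from + 1000, ids.size()));
    // one "?" placeholder per id in this chunk
    String placeholders = String.join(",", Collections.nCopies(chunk.size(), "?"));
    String sql = "SELECT * FROM employee WHERE employee.id IN (" + placeholders + ")";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        for (int i = 0; i < chunk.size(); i++) {
            ps.setLong(i + 1, chunk.get(i));
        }
        try (ResultSet rs = ps.executeQuery()) {
            // ... process this chunk's rows ...
        }
    }
}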
So the solution that works is like this:
int counter = 1;
StringBuilder queryString = new StringBuilder("WHERE (employeeId IN ( ");
Iterator<Long> it = getEmployeeSet().iterator();
while (it.hasNext()) {
    if (counter % 999 == 0) {
        // close the current IN list (dropping the trailing comma) and open a new one
        queryString.setLength(queryString.length() - 1);
        queryString.append(" ) or employeeId IN ( '").append(it.next()).append("',");
    } else {
        queryString.append("'").append(it.next()).append("',");
    }
    counter++;
}
// drop the last trailing comma and close both parentheses
queryString.setLength(queryString.length() - 1);
sqlQuery.append(queryString).append(" )) ");
I have a Cassandra server that is queried by another service, and I need to reduce the number of queries.
My first thought was to create a bloom filter of the whole database every couple of minutes and send it to the service.
But as I have a couple of hundred gigabytes in the database (expected to grow to a couple of terabytes), overloading the database every few minutes doesn't seem like a good idea.
After a while of searching for a better solution, I remembered that Cassandra maintains its own bloom filter.
Is it possible to copy the *-Filter.db files and use them in my code instead of creating my own bloom filter?
I created a table test:
CREATE TABLE test (
a int PRIMARY KEY,
b int
);
and inserted one row:
INSERT INTO test(a,b) VALUES(1, 10);
After flushing the data to disk, we can use the *-Filter.db file. In my case it was la-2-big-Filter.db.
Here is sample code to check whether a partition key exists:
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.cassandra.db.DecoratedKey;
import org.apache.cassandra.db.marshal.Int32Type;
import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.utils.FilterFactory;
import org.apache.cassandra.utils.IFilter;

Murmur3Partitioner partitioner = new Murmur3Partitioner();
try (DataInputStream in = new DataInputStream(new FileInputStream(new File("la-2-big-Filter.db")));
     IFilter filter = FilterFactory.deserialize(in, true)) {
    for (int i = 1; i <= 10; i++) {
        // decorate the key the same way Cassandra does before probing the filter
        DecoratedKey decoratedKey = partitioner.decorateKey(Int32Type.instance.decompose(i));
        if (filter.isPresent(decoratedKey)) {
            System.out.println(i + " is present ");
        } else {
            System.out.println(i + " is not present ");
        }
    }
}
Output:
1 is present
2 is not present
3 is not present
4 is not present
5 is not present
6 is not present
7 is not present
8 is not present
9 is not present
10 is not present
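Note that these are Cassandra-internal classes (shipped in the cassandra-all artifact), so the exact signatures may differ between versions. Also keep the bloom filter semantics in mind: a negative answer is definite, but "is present" only means "possibly present", since bloom filters can return false positives.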
I have a case where I need to scan a table with about 50 columns, every column containing about 100 versions. Nothing special (this.table is just the appropriate HTable and processor is intended to handle the resulting rows):
final Scan scan = new Scan();
scan.setCaching(1000);
scan.setMaxVersions(Integer.MAX_VALUE);
final ResultScanner rs = this.table.getScanner(scan);
try {
    for (Result r = rs.next(); r != null; r = rs.next()) {
        processor.processRow(r);
    }
} finally {
    rs.close();
}
When I scan a table of about 20 x 10^6 rows with this approach, I get only about 50 x 10^3 rows back. No special configuration is applied to the scanner, and HBase is 0.98.1 (CDH5.1). What am I missing? Is it some HBase drawback, or am I doing something seriously wrong? What can I check? I have checked the result size limit (not the case here), and as you can see, maxVersions is configured. What can limit such scans?
UPDATE
I checked the returned Result instances, and the number of Cell instances inside differs significantly from the expected results. Once again: the table has about 20 x 10^6 rows (which can be counted by the same code without the max-versions configuration), yet the number of rows returned WITH versions is about 50 x 10^3.
I am not sure what you have in processRow, but the key-value pairs are inside the Result object. For one row key there can be many key-value pairs, you know. Maybe this is the missing point:
for (Result result : resultScanner) {
for (KeyValue kv : result.raw()) {
Bytes.toString(kv.getQualifier());
Bytes.toString(kv.getValue());
Bytes.toString(result.getRow());
}
}
I'm having a problem with a Java OutOfMemoryError. The program basically looks at MySQL tables (which I work with in MySQL Workbench) and queries them to get certain information, which it then puts into CSV files.
The program works just fine with a smaller data set, but once I use a larger one (hours of logging information as opposed to perhaps 40 minutes), I get this error, which to me says that the problem comes from having a huge data set that the program does not handle well, or from it not being possible to handle this amount of data the way I have.
Setting the Java VM argument -Xmx1024m worked for a slightly larger data set, but I need it to handle even bigger ones, and then the error comes back.
Here is the method which I am quite sure contains the cause of the problem somewhere:
// CSV is csvwriter (external lib), sment are Statements, rs is a ResultSet
public void pidsforlog() throws IOException
{
    String[] procs;
    int count = 0;
    String temp = "";
    System.out.println("Commence getting PID's out of Log");
    try {
        sment = con.createStatement();
        sment2 = con.createStatement();
        String query1a = "SELECT * FROM log, cpuinfo, memoryinfo";
        rs = sment.executeQuery(query1a);
        procs = new String[countThrough(rs)];
        // SIMPLY GETS UNIQUE PROCESSES OUT OF TABLES AND STORES IN ARRAY
        while (rs.next()) {
            temp = rs.getString("Process");
            if (!Arrays.asList(procs).contains(temp)) {
                procs[count] = temp;
                count++;
            }
        }
        // BELIEVE THE PROBLEM LIES BELOW HERE. SIZE OF THE RESULTSET TOO BIG?
        for (int i = 0; i < procs.length; i++) {
            if (procs[i] != null) {
                String query = "SELECT DISTINCT * FROM log, cpuinfo, memoryinfo"
                        + " WHERE log.Process = '" + procs[i] + "'"
                        + " AND cpuinfo.Process = '" + procs[i] + "'"
                        + " AND memoryinfo.Process = '" + procs[i] + "'"
                        + " AND log.Timestamp = cpuinfo.Timestamp = memoryinfo.Timestamp";
                System.out.println(query);
                rs = sment.executeQuery(query);
                writer = new CSVWriter(new FileWriter(procs[i] + ".csv"), ',');
                writer.writeAll(rs, true);
                writer.flush();
            }
        }
        writer.close();
    } catch (SQLException e) {
        notify("Error pidslog", e);
    }
} // end of method
Please feel free to ask if you want source code or more information as I'm desperate to get this fixed!
Thanks.
SELECT * FROM log, cpuinfo, memoryinfo is sure to give a huge result set: it produces the Cartesian product of all rows in all 3 tables.
Without seeing the table structure (or knowing the desired result) it's hard to pinpoint a solution, but I suspect that you either want some kind of join conditions to limit the result set, or to use a UNION à la:
SELECT Process FROM log
UNION
SELECT Process FROM cpuinfo
UNION
SELECT Process FROM memoryinfo
...which will just give you all distinct values for Process in all 3 tables.
Your second SQL statement also looks a bit strange:
SELECT DISTINCT *
FROM log, cpuinfo, memoryinfo
WHERE log.Process = #param1
AND cpuinfo.Process = #param1
AND memoryinfo.Process = #param1
AND log.Timestamp = cpuinfo.Timestamp = memoryinfo.Timestamp
Looks like you're trying to select from all 3 logs simultaneously, but ending up with another cartesian product. Are you sure you're getting the result set you're expecting?
You could limit the results returned by your SQL queries with the LIMIT statement.
For example:
SELECT * FROM `your_table` LIMIT 100
This will return the first 100 results
SELECT * FROM `your_table` LIMIT 100, 200
This will skip the first 100 rows and return the next 200.
Obviously you can iterate with those values to reach all the elements in the database, no matter how many there are.
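A minimal sketch of such an iteration, assuming MySQL's LIMIT offset, count syntax and reusing the sment Statement from the question (the page size is arbitrary):

// Sketch: fetch and process the table one page at a time instead of all at once.
int pageSize = 1000;
for (int offset = 0; ; offset += pageSize) {
    try (ResultSet page = sment.executeQuery(
            "SELECT * FROM `your_table` LIMIT " + offset + ", " + pageSize)) {
        if (!page.next()) {
            break; // no more rows
        }
        do {
            // ... process / write one row to the CSV here ...
        } while (page.next());
    }
}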
I think you are loading too much data into memory at the same time. Try using OFFSET and LIMIT in your SQL statements so that you can avoid this problem.
Your Java code is doing things that the database could do more efficiently. From query1a, it looks like all you really want is the unique processes. select distinct Process from ... should be sufficient to do that.
Then, think carefully about what table or tables are needed in that query. Do you really need log, cpuinfo, and memoryinfo? As Joachim Isaksson mentioned, this is going to return the Cartesian product of those three tables, giving you x * y * z rows (where x, y, and z are the row counts in each of those three tables) and a + b + c columns (where a, b, and c are the column counts in each of the tables). I doubt that's what you want or need. I assume you could get those unique processes from one table, or a union (rather than join) of the three tables.
Lastly, your second loop and query are essentially doing a join, something again better and more efficiently left to the database.
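Putting those points together, a hedged sketch of what the first half of the method could shrink to (the UNION deduplicates across the three tables, so the array-scanning loop disappears; con is the connection from the question):

// Sketch: let MySQL produce the distinct process names directly,
// instead of deduplicating a cartesian product in Java.
List<String> procs = new ArrayList<>();
try (Statement st = con.createStatement();
     ResultSet rs = st.executeQuery(
             "SELECT Process FROM log"
           + " UNION SELECT Process FROM cpuinfo"
           + " UNION SELECT Process FROM memoryinfo")) {
    while (rs.next()) {
        procs.add(rs.getString("Process"));
    }
}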
Like others said, fetching the data in smaller chunks might resolve the issue.
Here is another Stack Overflow thread that talks about this issue:
How to read all rows from huge table?