Slow when querying Cassandra with Apache Spark in Java

I'm still new to NoSQL solutions; I only started learning NoSQL a few months ago.
I have a project built with the Spring Boot framework that has a DAO layer. My database is Cassandra, and I'm using the DataStax Java driver to communicate with it. I found that Cassandra (and maybe all NoSQL key/value stores) doesn't support case-insensitive matching or "LIKE %" style queries. After doing some research on Stack Overflow and other forums, I figured out that you have to use a tool like Apache Spark, Elasticsearch, or Apache Lucene to dig through the data in Cassandra. So I chose Apache Spark, and I'm not sure whether my code should be done this way (in terms of best practice).
Here's my code to query data:
@Override
public Login getLoginByEmail(String shopId, String email) throws InterruptedException, ExecutionException {
    JavaFutureAction<List<Login>> loginRDDFuture = javaFunctions(getSparkContext())
            .cassandraTable("shop_abc", "app_login", loginRowReader)
            .filter(new Function<Login, Boolean>() {
                private static final long serialVersionUID = 1L;

                @Override
                public Boolean call(Login login) throws Exception {
                    return login.getEmail().equalsIgnoreCase(email.trim());
                }
            }).collectAsync();
    List<Login> lgnList = loginRDDFuture.get();
    if (lgnList.size() > 0) {
        return lgnList.get(0);
    }
    return null;
}
It took 9 seconds to get the result, and the database has only one table with 3 records. I wonder what would happen if the database had more than a million records.
I'm not sure whether this is good practice, or whether there is a better way or a better tool to do this. I hope someone can give me some guidance.
Much appreciated.

I think this kind of query will be rather slow because it has to retrieve all of the data from your C* database, breaking the reads up by token ranges and mapping them into RDDs, and then filter through them with a Spark job. That is going to have some overhead even when your data set is small, although 9 seconds does seem like quite a while; it's hard to say why without knowing more about your environment.
Alternatively, have you considered using SSTable Attached Secondary Indexes (SASI)? SASI was introduced in C* 3.4 and allows you to do LIKE % queries in Cassandra, with or without case sensitivity, i.e.:
CREATE CUSTOM INDEX fn_suffix_allcase ON cyclist_name (firstname)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'CONTAINS',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};
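With an index like that in place, the DataStax driver you already use can run the LIKE query directly, with no Spark round trip. Here is a minimal sketch against the cyclist_name example above (driver 3.x style; the contact point and the "cycling" keyspace are assumptions taken from the DataStax docs, not from your schema):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SasiLikeExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("cycling")) {
            // CONTAINS mode + case_sensitive:false matches the pattern anywhere in the value, ignoring case.
            ResultSet rs = session.execute("SELECT * FROM cyclist_name WHERE firstname LIKE '%mar%'");
            for (Row row : rs) {
                System.out.println(row.getString("firstname"));
            }
        }
    }
}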
A good talk for reference on SASI is SASI: Cassandra on the Full Text Search Ride.

Related

jOOQ batch insertion inconsistency

While working with batch insertion in jOOQ (v3.14.4) I noticed some inconsistency when looking into PostgreSQL (v12.6) logs.
When doing context.batch(<query>).bind(<1st record>).bind(<2nd record>)...bind(<nth record>).execute() the logs show that the records are actually inserted one by one instead of all in one go.
Whereas doing context.insert(<fields>).values(<1st record>).values(<2nd record>)...values(<nth record>) actually inserts everything in one go, judging by the Postgres logs.
Is it a bug in jOOQ itself, or am I using the batch(...) functionality incorrectly?
Here are 2 code snippets that are supposed to do the same but in reality, the first one inserts records one by one while the second one actually does the batch insertion.
public void batchInsertEdges(List<EdgesRecord> edges) {
    Query batchQuery = context.insertInto(Edges.EDGES,
            Edges.EDGES.SOURCE_ID, Edges.EDGES.TARGET_ID, Edges.EDGES.CALL_SITES,
            Edges.EDGES.METADATA)
            .values((Long) null, (Long) null, (CallSiteRecord[]) null, (JSONB) null)
            .onConflictOnConstraint(Keys.UNIQUE_SOURCE_TARGET).doUpdate()
            .set(Edges.EDGES.CALL_SITES, Edges.EDGES.as("excluded").CALL_SITES)
            .set(Edges.EDGES.METADATA, field("coalesce(edges.metadata, '{}'::jsonb) || excluded.metadata", JSONB.class));
    var batchBind = context.batch(batchQuery);
    for (var edge : edges) {
        batchBind = batchBind.bind(edge.getSourceId(), edge.getTargetId(),
                edge.getCallSites(), edge.getMetadata());
    }
    batchBind.execute();
}
public void batchInsertEdges(List<EdgesRecord> edges) {
    var insert = context.insertInto(Edges.EDGES,
            Edges.EDGES.SOURCE_ID, Edges.EDGES.TARGET_ID, Edges.EDGES.CALL_SITES, Edges.EDGES.METADATA);
    for (var edge : edges) {
        insert = insert.values(edge.getSourceId(), edge.getTargetId(), edge.getCallSites(), edge.getMetadata());
    }
    insert.onConflictOnConstraint(Keys.UNIQUE_SOURCE_TARGET).doUpdate()
            .set(Edges.EDGES.CALL_SITES, Edges.EDGES.as("excluded").CALL_SITES)
            .set(Edges.EDGES.METADATA, field("coalesce(edges.metadata, '{}'::jsonb) || excluded.metadata", JSONB.class))
            .execute();
}
I would appreciate some help to figure out why the first code snippet does not work as intended and second one does. Thank you!
There's a difference between "batch processing" (as in JDBC batch) and "bulk processing" (as in what many RDBMS call "bulk updates").
This page of the manual about data import explains the difference.
Bulk size: The number of rows that are sent to the server in one SQL statement.
Batch size: The number of statements that are sent to the server in one JDBC statement batch.
These are fundamentally different things. Both help improve performance. Bulk data processing does so by helping the RDBMS optimise resource allocation algorithms as it knows it is about to insert 10 records. Batch data processing does so by reducing the number of round trips between client and server. Whether either approach has a big impact on any given RDBMS is obviously vendor specific.
In other words, both of your approaches work as intended.
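To make the distinction concrete outside of jOOQ, here is a minimal plain-JDBC sketch (table and column names are invented for illustration): your first approach corresponds to one single-row prepared statement executed with many bind sets, your second to one statement carrying all rows at once.

import java.sql.Connection;
import java.sql.PreparedStatement;

public class BatchVsBulk {

    // "Batch": one single-row statement, many parameter sets, sent in one JDBC batch.
    static void batchInsert(Connection c, long[][] rows) throws Exception {
        try (PreparedStatement ps = c.prepareStatement(
                "INSERT INTO edges (source_id, target_id) VALUES (?, ?)")) {
            for (long[] row : rows) {
                ps.setLong(1, row[0]);
                ps.setLong(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    // "Bulk": a single multi-row INSERT statement.
    static void bulkInsert(Connection c, long[][] rows) throws Exception {
        StringBuilder sql = new StringBuilder("INSERT INTO edges (source_id, target_id) VALUES ");
        for (int i = 0; i < rows.length; i++) {
            sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
        }
        try (PreparedStatement ps = c.prepareStatement(sql.toString())) {
            int idx = 1;
            for (long[] row : rows) {
                ps.setLong(idx++, row[0]);
                ps.setLong(idx++, row[1]);
            }
            ps.executeUpdate();
        }
    }
}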

How to persist data to a MySql database from a 32k row Excel using JPA Pagination?

I have a large Excel file with 32k rows and Java/Spring code that persists the Excel data to a MySQL database. My code works for about 6k rows, but not for the entire Excel file, due to a JPA limitation. I read that this can be done with JPA pagination, but so far I have only found information about collecting data from a DB (already populated) and rendering it to a UI. The Excel file contains 32k medicines, and these rows need to be persisted into the DB.
I have this Controller layer with the following method:
public ResponseEntity<ResponseMessage> uploadFile(@RequestParam("file") MultipartFile file,
                                                  @RequestParam(defaultValue = "0") int page,
                                                  @RequestParam(defaultValue = "6000") int size) {
    String message = "";
    if (ExcelHelper.hasExcelFormat(file)) {
        try {
            // the following lines are my (admittedly feeble) attempt to resolve this with pagination
            List<Medicine> medicines = new ArrayList<>();
            Pageable paging = PageRequest.of(page, size);
            Page<Medicine> pageMedicamente = medicineRepositoryDao.save(paging);
            medicines = pageMedicamente.getContent();

            medicineService.save(file);
            message = "Uploaded the file successfully: " + file.getOriginalFilename();
            return ResponseEntity.status(HttpStatus.OK).body(new ResponseMessage(message));
        } catch (Exception e) {
            message = "Could not upload the file: " + file.getOriginalFilename() + "!";
            return ResponseEntity.status(HttpStatus.EXPECTATION_FAILED).body(new ResponseMessage(message));
        }
    }
And the Repository layer:
@Repository
public interface MedicineRepositoryDao extends JpaRepository<Medicine, Long> {
    Page<Medicine> save(Pageable pageable);
}
And also the Service layer:
try {
    List<Medicine> medicines = ExcelHelper.excelToMedicine(file.getInputStream());
    medicineRepositoryDao.saveAll(medicines);
} catch (IOException e) {
    throw new RuntimeException("fail to store excel data: " + e.getMessage());
}
I think you have a couple of things mixed up here.
I don't think Spring has any relevant limitation on the number of rows you may persist here, but JPA does: JPA keeps a reference to every entity that you save in its first-level cache. For a large number of rows/entities this hogs memory and also makes some operations slower, since entities get looked up or processed one by one.
Pagination is for reading entities, not for saving.
You have a couple of options in this situation.
Don't use JPA. For simply reading data from a file and writing it into a database, JPA hardly offers any benefit. This can almost trivially be performed using just a JdbcTemplate or NamedParameterJdbcTemplate, and it will be much faster, since the overhead of JPA (which you don't benefit from anyway in this scenario) is skipped. If you want to use an ORM, you might want to take a look at Spring Data JDBC, which is conceptually simpler, doesn't keep references to entities, and therefore should show better characteristics in this scenario. I recommend not using an ORM here, since you don't seem to benefit from having entities; creating them and then having the ORM extract the data from them again is really a waste of time.
Break your import into chunks. This means you persist e.g. 1000 rows at a time, write them to the database, and commit the transaction before you continue with the next 1000 rows. For JPA this is pretty much a necessity, for the reasons laid out above. With JDBC (i.e. JdbcTemplate & Co.) this probably isn't necessary for 32k rows, but it might improve performance and might be useful for recoverability if an insert fails. Spring Batch will help you implement that.
While the previous point talks about batching in the sense of breaking your import into chunks, you should also look into batching on the JDBC side, where you send multiple statements, or a single statement with multiple sets of parameters, to the database in one go; this again should improve performance (a sketch combining these ideas follows below).
Finally, there are often alternatives outside of the Javaverse that might be more suitable for the job. Some databases have tools to load flat files extremely efficiently.
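For illustration, a minimal sketch combining the JdbcTemplate, chunking, and JDBC-batching suggestions (the medicine table, its columns, and the Medicine getters are assumptions, not taken from your code):

import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

public class MedicineJdbcWriter {

    private static final int CHUNK_SIZE = 1000;

    private final JdbcTemplate jdbcTemplate;

    public MedicineJdbcWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void insertAll(List<Medicine> medicines) {
        for (int from = 0; from < medicines.size(); from += CHUNK_SIZE) {
            List<Medicine> chunk = medicines.subList(from, Math.min(from + CHUNK_SIZE, medicines.size()));
            // One JDBC batch per chunk: many INSERT statements travel to the server in a single round trip.
            jdbcTemplate.batchUpdate(
                    "INSERT INTO medicine (name, code) VALUES (?, ?)",
                    chunk,
                    chunk.size(),
                    (ps, medicine) -> {
                        ps.setString(1, medicine.getName());
                        ps.setString(2, medicine.getCode());
                    });
        }
    }
}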

Faster way of updating database table using Hibernate (Java 8 reduction?)

I am working on a monitoring tool developed in Spring Boot using Hibernate as ORM.
I need to compare each row (already-persisted rows of sent messages) in my table and see whether a MailId (unique) has received feedback (status: OPENED, BOUNCED, DELIVERED...) or not.
I get the feedback by reading CSV files from a network folder. The CSV parsing and file reading go very fast, but the update of my database is very slow. My algorithm is not very efficient, because I loop through a list that can have hundreds of thousands of objects and look each one up in my table.
This is the method that makes the update in my table by updating the "target" object (a row in the database table):
@Override
public void updateTargetObjectFoo() throws CSVProcessingException, FileNotFoundException {
    // Here I call performProcessing, which reads the files in a folder, parses them into Java objects,
    // and maps them into a feedBackList of type Foo
    List<Foo> feedBackList = performProcessing(env.getProperty("foo_in"), EXPECTED_HEADER_FIELDS_STATUS, Foo.class, ".LETTERS.STATUS.");
    for (Foo foo : feedBackList) {
        // findByKey does a simple SELECT in MySQL where MailId = foo.getMailId()
        Foo persistedFoo = fooDao.findByKey(foo.getMailId());
        if (persistedFoo != null) {
            persistedFoo.setStatus(foo.getStatus());
            persistedFoo.setDnsCode(foo.getDnsCode());
            persistedFoo.setReturnDate(foo.getReturnDate());
            persistedFoo.setReturnTime(foo.getReturnTime());
            // saveAccount here does a MySQL UPDATE on the table
            fooDao.saveAccount(foo);
        }
    }
}
What if I did this selection/comparison and update on the Java side, and then re-updated the whole list in the database?
Would that be faster?
Thanks to all for your help.
Hibernate is not particularly well-suited for batch processing.
You may be better off using Spring's JdbcTemplate to do JDBC batch processing.
However, if you must do this via Hibernate, this may help: https://docs.jboss.org/hibernate/orm/5.2/userguide/html_single/chapters/batch/Batching.html
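If you do stay on Hibernate/JPA, the pattern from that guide boils down to enabling JDBC batching (e.g. hibernate.jdbc.batch_size=50) and periodically flushing and clearing the persistence context. A rough sketch of your loop reworked that way (the JPQL lookup by mailId is an assumption standing in for your fooDao.findByKey):

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.transaction.annotation.Transactional;

public class FooBatchUpdater {

    private static final int BATCH_SIZE = 50; // keep in sync with hibernate.jdbc.batch_size

    @PersistenceContext
    private EntityManager entityManager;

    @Transactional
    public void applyFeedback(List<Foo> feedBackList) {
        int processed = 0;
        for (Foo foo : feedBackList) {
            // Look up the managed row by its unique MailId (assumed JPQL; adapt to your mapping).
            List<Foo> found = entityManager
                    .createQuery("select f from Foo f where f.mailId = :mailId", Foo.class)
                    .setParameter("mailId", foo.getMailId())
                    .getResultList();
            if (!found.isEmpty()) {
                Foo persistedFoo = found.get(0);
                // Dirty checking turns these setters into a single UPDATE at flush time.
                persistedFoo.setStatus(foo.getStatus());
                persistedFoo.setDnsCode(foo.getDnsCode());
                persistedFoo.setReturnDate(foo.getReturnDate());
                persistedFoo.setReturnTime(foo.getReturnTime());
            }
            if (++processed % BATCH_SIZE == 0) {
                // Push the pending UPDATEs as one JDBC batch, then clear the
                // persistence context so it does not grow without bound.
                entityManager.flush();
                entityManager.clear();
            }
        }
    }
}

Note that this still issues one SELECT per feedback row; loading the existing rows in bulk (e.g. WHERE mailId IN (...)) before the loop usually helps even more.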

Transform Cassandra query result to POJO with Astyanax

I am working on a Spring web application that uses Cassandra with the Astyanax client. I want to transform the result data retrieved from Cassandra queries into a POJO, but I do not know which library or Astyanax API supports this.
For example, I have a User column family (CF) with some basic properties (username, password, email), and other related additional information can be added to this CF. Then I fetch one User row from that CF using an OperationResult<ColumnList<String>> to hold the returned data, like this:
OperationResult<ColumnList<String>> columns = getKeyspace().prepareQuery(getColumnFamily()).getRow(rowKey).execute();
What I want to do next is populate "columns" into my User object. Here, I have 2 problems, and could you please help me solve them:
1/ What is the best structure of User class to hold the corresponding data retrieved from User CF? My suggestion is:
public class User {
    String userName, password, email; // Basic properties
    Map<String, Object> additionalInfo;
}
2/ How can I transform the Cassandra data to this POJO using a generic method (so that it can be applied to every single CF which has a mapped POJO)?
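The only approach I can come up with so far is copying the columns by hand, roughly like the sketch below (I'm not certain I have the Astyanax column accessors exactly right, so please treat it as an assumption to verify):

import java.util.HashMap;
import com.netflix.astyanax.model.Column;
import com.netflix.astyanax.model.ColumnList;

public class UserMapper {

    // Copy the well-known columns into fields and everything else into additionalInfo.
    public static User toUser(ColumnList<String> columns) {
        User user = new User();
        user.additionalInfo = new HashMap<>();
        for (Column<String> column : columns) {
            String name = column.getName();
            if ("username".equals(name)) {
                user.userName = column.getStringValue();
            } else if ("password".equals(name)) {
                user.password = column.getStringValue();
            } else if ("email".equals(name)) {
                user.email = column.getStringValue();
            } else {
                user.additionalInfo.put(name, column.getStringValue());
            }
        }
        return user;
    }
}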
I am sorry if there is anything silly in my questions; I have only been working with NoSQL concepts, Cassandra, and Astyanax for 2 weeks.
Thank you so much for your help.
You can try Achilles: https://github.com/doanduyhai/achilles, a JPA-compliant entity manager for Cassandra.
Right now there is a complete implementation using the Thrift API via Hector.
The CQL3 implementation using the DataStax Java Driver is in progress. A beta version will be available in a few months (July-August 2013).
CQL3 is great, but it's still too low level, because you need to extract the data yourself from the ResultSet. It's like going back to the time when only JDBC Template was available.
Achilles is there to fill the gap.
I would suggest using a library like PlayORM, with which you can easily perform CRUD operations on your entities. See this example of how you can create a User object; you can then get the POJO easily with
User user1 = mgr.find(User.class, email);
assuming that email is your NoSqlId (the primary key, or row key, in Cassandra).
I use com.netflix.astyanax.mapping.Mapping and com.netflix.astyanax.mapping.MappingCache for exactly this purpose.

Cleanest way to build an SQL string in Java

I want to build an SQL string to do database manipulation (updates, deletes, inserts, selects, that sort of thing). Instead of the awful string-concatenation method using millions of "+"s and quotes, which is unreadable at best, there must be a better way.
I did think of using MessageFormat, but it's supposed to be used for user messages, although I think it would do a reasonable job. But I guess there should be something more aligned to SQL-type operations in the Java SQL libraries.
Would Groovy be any good?
First of all consider using query parameters in prepared statements:
PreparedStatement stm = c.prepareStatement("UPDATE user_table SET name=? WHERE id=?");
stm.setString(1, "the name");
stm.setInt(2, 345);
stm.executeUpdate();
The other thing that can be done is to keep all queries in a properties file. For example,
a queries.properties file can contain the above query:
update_query=UPDATE user_table SET name=? WHERE id=?
Then with the help of a simple utility class:
public class Queries {
private static final String propFileName = "queries.properties";
private static Properties props;
public static Properties getQueries() throws SQLException {
InputStream is =
Queries.class.getResourceAsStream("/" + propFileName);
if (is == null){
throw new SQLException("Unable to load property file: " + propFileName);
}
//singleton
if(props == null){
props = new Properties();
try {
props.load(is);
} catch (IOException e) {
throw new SQLException("Unable to load property file: " + propFileName + "\n" + e.getMessage());
}
}
return props;
}
public static String getQuery(String query) throws SQLException{
return getQueries().getProperty(query);
}
}
you might use your queries as follows:
PreparedStatement stm = c.prepareStatement(Queries.getQuery("update_query"));
This is a rather simple solution, but works well.
For arbitrary SQL, use jOOQ. jOOQ currently supports SELECT, INSERT, UPDATE, DELETE, TRUNCATE, and MERGE. You can create SQL like this:
String sql1 = DSL.using(SQLDialect.MYSQL)
        .select(A, B, C)
        .from(MY_TABLE)
        .where(A.equal(5))
        .and(B.greaterThan(8))
        .getSQL();

String sql2 = DSL.using(SQLDialect.MYSQL)
        .insertInto(MY_TABLE)
        .values(A, 1)
        .values(B, 2)
        .getSQL();

String sql3 = DSL.using(SQLDialect.MYSQL)
        .update(MY_TABLE)
        .set(A, 1)
        .set(B, 2)
        .where(C.greaterThan(5))
        .getSQL();
Instead of obtaining the SQL string, you could also just execute it, using jOOQ. See
http://www.jooq.org
(Disclaimer: I work for the company behind jOOQ)
One technology you should consider is SQLJ - a way to embed SQL statements directly in Java. As a simple example, you might have the following in a file called TestQueries.sqlj:
public class TestQueries
{
    public String getUsername(int id)
    {
        String username;
        #sql
        {
            select username into :username
            from users
            where pkey = :id
        };
        return username;
    }
}
There is an additional precompile step which takes your .sqlj files and translates them into pure Java - in short, it looks for the special blocks delimited with
#sql
{
...
}
and turns them into JDBC calls. There are several key benefits to using SQLJ:
completely abstracts away the JDBC layer - programmers only need to think about Java and SQL
the translator can be made to check your queries for syntax etc. against the database at compile time
ability to directly bind Java variables in queries using the ":" prefix
There are implementations of the translator around for most of the major database vendors, so you should be able to find everything you need easily.
I am wondering if you are after something like Squiggle (GitHub). Something else very useful is jDBI, though it won't help you with the queries.
I would have a look at Spring JDBC. I use it whenever I need to execute SQL programmatically. Example:
int countOfActorsNamedJoe
= jdbcTemplate.queryForInt("select count(0) from t_actors where first_name = ?", new Object[]{"Joe"});
It's really great for any kind of sql execution, especially querying; it will help you map resultsets to objects, without adding the complexity of a complete ORM.
I tend to use Spring's Named JDBC Parameters so I can write a standard string like "select * from blah where colX=':someValue'"; I think that's pretty readable.
An alternative would be to supply the string in a separate .sql file and read the contents in using a utility method.
Oh, it's also worth having a look at Squill: https://squill.dev.java.net/docs/tutorial.html
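For the named-parameter style, here is a minimal sketch with NamedParameterJdbcTemplate (the t_actors table is borrowed from the Spring JDBC example above, purely for illustration):

import java.util.Collections;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;

public class ActorQueries {

    private final NamedParameterJdbcTemplate jdbcTemplate;

    public ActorQueries(NamedParameterJdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // The named placeholder reads almost like the final SQL and is bound safely rather than concatenated.
    public int countActorsByFirstName(String firstName) {
        String sql = "select count(*) from t_actors where first_name = :firstName";
        return jdbcTemplate.queryForObject(sql, Collections.singletonMap("firstName", firstName), Integer.class);
    }
}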
I second the recommendations for using an ORM like Hibernate. However, there are certainly situations where that doesn't work, so I'll take this opportunity to tout some stuff that I've helped to write: SqlBuilder is a Java library for dynamically building SQL statements using the "builder" style. It's fairly powerful and fairly flexible.
I have been working on a Java servlet application that needs to construct very dynamic SQL statements for ad hoc reporting purposes. The basic function of the app is to feed a bunch of named HTTP request parameters into a pre-coded query and generate a nicely formatted table of output. I used Spring MVC and its dependency injection framework to store all of my SQL queries in XML files and load them into the reporting application, along with the table formatting information. Eventually, the reporting requirements became more complicated than the capabilities of the existing parameter-mapping frameworks, and I had to write my own. It was an interesting exercise in development and produced a framework for parameter mapping much more robust than anything else I could find.
The new parameter mappings looked as such:
select app.name as "App",
${optional(" app.owner as "Owner", "):showOwner}
sv.name as "Server", sum(act.trans_ct) as "Trans"
from activity_records act, servers sv, applications app
where act.server_id = sv.id
and act.app_id = app.id
and sv.id = ${integer(0,50):serverId}
and app.id in ${integerList(50):appId}
group by app.name, ${optional(" app.owner, "):showOwner} sv.name
order by app.name, sv.name
The beauty of the resulting framework was that it could process HTTP request parameters directly into the query, with proper type checking and limit checking. No extra mappings are required for input validation. In the example query above, the parameter named serverId
would be checked to make sure it can be cast to an integer and is in the range 0-50. The parameter appId would be processed as an array of integers, with a length limit of 50. If the field showOwner is present and set to "true", the bits of SQL in the quotes will be added to the generated query for the optional field mappings. Several more parameter type mappings are available, including optional segments of SQL with further parameter mappings. It allows for as complex a query mapping as the developer can come up with. It even has controls in the report configuration to determine whether a given query will have the final mappings applied via a PreparedStatement or simply be run as a pre-built query.
For the sample HTTP request values:
showOwner: true
serverId: 20
appId: 1,2,3,5,7,11,13
It would produce the following SQL:
select app.name as "App",
app.owner as "Owner",
sv.name as "Server", sum(act.trans_ct) as "Trans"
from activity_records act, servers sv, applications app
where act.server_id = sv.id
and act.app_id = app.id
and sv.id = 20
and app.id in (1,2,3,5,7,11,13)
group by app.name, app.owner, sv.name
order by app.name, sv.name
I really think that Spring or Hibernate or one of those frameworks should offer a more robust mapping mechanism that verifies types, allows for complex data types like arrays, and other such features. I wrote my engine for my own purposes only; it isn't quite ready for general release. It only works with Oracle queries at the moment, and all of the code belongs to a big corporation. Someday I may take my ideas and build a new open-source framework, but I'm hoping one of the existing big players will take up the challenge.
Why do you want to generate all the SQL by hand? Have you looked at an ORM like Hibernate? Depending on your project, it will probably do at least 95% of what you need, do it in a cleaner way than raw SQL, and if you need to get the last bit of performance you can hand-tune the few SQL queries that need it.
You can also have a look at MyBatis (www.mybatis.org). It helps you write SQL statements outside your Java code and maps the results into your Java objects, among other things.
Google provides a library called the Room Persistence Library which provides a very clean way of writing SQL for Android apps, basically an abstraction layer over the underlying SQLite database. Below is a short code snippet from the official website:
@Dao
public interface UserDao {
    @Query("SELECT * FROM user")
    List<User> getAll();

    @Query("SELECT * FROM user WHERE uid IN (:userIds)")
    List<User> loadAllByIds(int[] userIds);

    @Query("SELECT * FROM user WHERE first_name LIKE :first AND "
           + "last_name LIKE :last LIMIT 1")
    User findByName(String first, String last);

    @Insert
    void insertAll(User... users);

    @Delete
    void delete(User user);
}
There are more examples and better documentation in the official docs for the library.
There is also one called MentaBean, which is a Java ORM. It has nice features and seems to be a pretty simple way of writing SQL.
Read an XML file.
You can read your queries from an XML file. It's easy to maintain and work with.
There are standard StAX, DOM, and SAX parsers available to make this just a few lines of code in Java.
Do more with attributes.
You can attach semantic information via attributes on the tag to help you do more with the SQL. This can be the method name, the query type, or anything that helps you write less code.
Maintenance.
You can put the XML outside the jar and maintain it easily. Same benefits as a properties file.
Conversion.
XML is extensible and easily convertible to other formats.
Use case.
Metamug uses XML to configure REST resource files with SQL.
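As a rough illustration of that idea, one might load queries with nothing more than the built-in DOM and XPath APIs (the queries.xml layout and names here are hypothetical):

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XmlQueries {

    private final Document doc;
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    public XmlQueries(String classpathResource) throws Exception {
        // e.g. "/queries.xml" containing:
        // <queries>
        //   <query name="update_query" type="update">UPDATE user_table SET name=? WHERE id=?</query>
        // </queries>
        doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(XmlQueries.class.getResourceAsStream(classpathResource));
    }

    // Look a query up by its name attribute and return its text content.
    public String getQuery(String name) throws Exception {
        return xpath.evaluate("/queries/query[@name='" + name + "']", doc);
    }
}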
If you put the SQL strings in a properties file and then read that in, you can keep the SQL strings in a plain text file.
That doesn't solve the SQL-type issues, but at least it makes copying and pasting from TOAD or sqlplus much easier.
How do you get string concatenation, aside from long SQL strings in PreparedStatements (that you could easily provide in a text file and load as a resource anyway) that you break over several lines?
You aren't creating SQL strings directly, are you? That's the biggest no-no in programming. Please use PreparedStatements and supply the data as parameters. It vastly reduces the chance of SQL injection.
