My flink program should do a Cassandra look up for each input record and based on the results, should do some further processing.
But I'm currently stuck at reading data from Cassandra. This is the code snippet I've come up with so far.
ClusterBuilder secureCassandraSinkClusterBuilder = new ClusterBuilder() {
    @Override
    protected Cluster buildCluster(Cluster.Builder builder) {
        return builder.addContactPoints(props.getCassandraClusterUrlAll().split(","))
                .withPort(props.getCassandraPort())
                .withAuthProvider(new DseGSSAPIAuthProvider("HTTP"))
                .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM))
                .build();
    }
};
for (int i = 1; i < 5; i++) {
    CassandraInputFormat<Tuple2<String, String>> cassandraInputFormat =
            new CassandraInputFormat<>("select * from test where id=hello" + i, secureCassandraSinkClusterBuilder);
    cassandraInputFormat.configure(null);
    cassandraInputFormat.open(null);
    Tuple2<String, String> out = new Tuple2<>();
    cassandraInputFormat.nextRecord(out);
    System.out.println(out);
}
The issue with this is that each lookup takes nearly 10 seconds; in other words, this for loop takes on the order of 50 seconds to execute.
How do I speed up this operation? Alternatively, is there any other way of looking up Cassandra in Flink?
I came up with a solution that is fairly fast at querying Cassandra with streaming data. It may be of use to someone with the same issue.
Firstly, Cassandra can be queried with as little code as:
Session session = secureCassandraSinkClusterBuilder.getCluster().connect();
ResultSet resultSet = session.execute("SELECT * FROM TABLE");
But the problem is that creating a Session is a very expensive operation and something that should be done only once per keyspace. Create the Session once and reuse it for all read queries.
Now, since Session is not Java-serializable, it cannot be passed as an argument to Flink operators like a MapFunction or ProcessFunction. There are a few ways of solving this: you can use a RichFunction and initialize the session in its open() method, or use a singleton. I will use the second solution.
Make a Singleton Class as follows where we create the Session.
public class CassandraSessionSingleton {
    private static CassandraSessionSingleton cassandraSessionSingleton = null;
    public Session session;

    private CassandraSessionSingleton(ClusterBuilder clusterBuilder) {
        Cluster cluster = clusterBuilder.getCluster();
        session = cluster.connect();
    }

    // synchronized so that two operator threads cannot race and create two sessions
    public static synchronized CassandraSessionSingleton getInstance(ClusterBuilder clusterBuilder) {
        if (cassandraSessionSingleton == null)
            cassandraSessionSingleton = new CassandraSessionSingleton(clusterBuilder);
        return cassandraSessionSingleton;
    }
}
You can then make use of this session for all future queries. Here I'm using the ProcessFunction to make queries as an example.
public class SomeProcessFunction extends ProcessFunction<Object, ResultSet> {
    ClusterBuilder secureCassandraSinkClusterBuilder;

    // Constructor
    public SomeProcessFunction(ClusterBuilder secureCassandraSinkClusterBuilder) {
        this.secureCassandraSinkClusterBuilder = secureCassandraSinkClusterBuilder;
    }

    @Override
    public void processElement(Object obj, Context ctx, Collector<ResultSet> out) throws Exception {
        ResultSet resultSet = CassandraLookUp.cassandraLookUp("SELECT * FROM TEST", secureCassandraSinkClusterBuilder);
        out.collect(resultSet);
    }
}
Note that you can pass ClusterBuilder to ProcessFunction as it is Serializable. Now for the cassandraLookUp method where we execute the query.
public class CassandraLookUp {
    public static ResultSet cassandraLookUp(String query, ClusterBuilder clusterBuilder) {
        CassandraSessionSingleton cassandraSessionSingleton = CassandraSessionSingleton.getInstance(clusterBuilder);
        Session session = cassandraSessionSingleton.session;
        return session.execute(query);
    }
}
The singleton object is created only the first time a query is run; after that, the same object is reused, so there is no delay in the lookup.
I am developing a program that, based on a configuration file, allows different types of databases (e.g., YAML, MySQL, SQLite, and others to be added in the future) to be used to store data.
Currently it is all running on the main thread but I would like to start delegating to secondary threads so as not to block the execution of the program.
For supported databases that use a connection, I use HikariCP so that the process is not slowed down too much by opening a new connection every time.
The main problem is the multitude of available databases. For example, for some databases it might be sufficient to store the query string in a queue and have an executor check it every X seconds; if it is not empty it executes all the queries. For others, however, it is not, because perhaps they require other operations (e.g., YAML files that use a key-value system with a map).
What I can't work out is something "universal" that preserves the order of queries (I cannot just spawn a thread per query, because a fetch thread might then run before an earlier insertion thread and return stale data) and that can return data on completion (in the case of get functions).
I currently have an abstract Database class that contains all the get() and set(...) methods for the various data to be stored. Some methods need to be executed synchronously (must be blocking) while others can and should be executed asynchronously.
Example:
public abstract class Database {
    public abstract boolean hasPlayedBefore(@Nonnull final UUID uuid);
}

public final class YAMLDatabase extends Database {
    @Override
    public boolean hasPlayedBefore(@Nonnull final UUID uuid) { return getFile(uuid).exists(); }
}
public final class MySQLDatabase extends Database {
    @Override
    public boolean hasPlayedBefore(@Nonnull final UUID uuid) {
        try (
            final Connection conn = getConnection(); // Get a connection from the pool
            final PreparedStatement statement = conn.prepareStatement("SELECT * FROM " + TABLE_NAME + " WHERE UUID = ?")
        ) {
            statement.setString(1, uuid.toString()); // bind the value instead of concatenating it into the SQL
            try (final ResultSet result = statement.executeQuery()) {
                return result.isBeforeFirst();
            }
        } catch (final SQLException e) {
            // Notifies the error
            Util.sendMessage("Database error: " + e.getMessage() + ".");
            writeLog(e, uuid, "attempt to check whether the user is new or has played before");
        }
        return true;
    }
}
// Simple example class that uses the database
public final class Usage {
    private final Database db;

    public Usage(@Nonnull final Database db) { this.db = db; }

    public User getUser(@Nonnull final UUID uuid) {
        if (db.hasPlayedBefore(uuid))
            return db.getUser(uuid); // Sync query
        else {
            // Set default starting balance
            final User user = new User(uuid, startingBalance);
            db.setBalance(uuid, startingBalance); // Example of a sync query that I would like to be async
            return user;
        }
    }
}
Any advice? I am already somewhat familiar with Future, CompletableFuture and Callback.
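Given the ordering requirement above, one common approach is to funnel all operations for a given database through a single-threaded executor: tasks run strictly in submission order, so a read queued after a write always observes it, and CompletableFuture lets callers choose between blocking (join) and non-blocking use. This is only a minimal sketch under stated assumptions: AsyncDatabase and the in-memory map are illustrative stand-ins for the real Database subclasses, not your actual API.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: serialize every operation for one database through a single worker
// thread so query order is preserved, exposing results as CompletableFuture.
public final class AsyncDatabase {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final Map<UUID, Double> balances = new ConcurrentHashMap<>(); // stand-in for the real backend

    // Async write: queued behind every earlier task, so a later read sees it.
    public CompletableFuture<Void> setBalance(UUID uuid, double balance) {
        return CompletableFuture.runAsync(() -> balances.put(uuid, balance), worker);
    }

    // Async read: returns a future instead of blocking the caller.
    public CompletableFuture<Double> getBalance(UUID uuid) {
        return CompletableFuture.supplyAsync(() -> balances.getOrDefault(uuid, 0.0), worker);
    }

    // Blocking variant for the calls that must stay synchronous.
    public double getBalanceNow(UUID uuid) {
        return getBalance(uuid).join();
    }

    public void shutdown() { worker.shutdown(); }
}
```

Because all tasks share one queue, a getBalanceNow call issued after setBalance is guaranteed to see the new value even without waiting on the write's future.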
I'm currently trying to store encrypted data in some of the columns of a Postgres DB. After receiving helpful feedback from this question: client-side-encryption-with-java-and-postgres-database I am using converters/bindings to implement transparent encryption in the JDBC layer.
Right now I'm trying to insert a BigDecimal[][][] into a Postgres DB column of type bytea.
The insertion works but the problem is that the encryption code I've added in the converters/binding doesn't seem to run. Unfortunately, when I check the database I'm seeing an unencrypted 3D matrix. (FYI my encryption utility code is tested and does work)
To test, I put my encryption code in the DAO layer, and the BigDecimal[][][] matrix does get encrypted on DB inserts. Although I could do this, it defeats the purpose of using converters/bindings for encryption.
So my question:
With the code I provided below, am I doing anything wrong that prevents the encryption code in my converter/binding from running? I thought the converter was the next step after a PreparedStatement is executed, but maybe not. I lack knowledge of exactly when the converter/binding code gets called in the jOOQ flow, so any insight is much appreciated. Thanks :D
First, I'm using a PreparedStatement in a DAO to execute the insert query.
I can't show the full code but basically for the stmt I'm setting the BigDecimal[][][] as an object parameter:
private Result executeInsert(BigDecimal[][][] valueToEncrypt, String insertSql) {
    try (Connection conn = config.connectionProvider().acquire();
         PreparedStatement stmt = conn.prepareStatement(insertSql)) {
        // Get a human-readable version of the 3d matrix to insert into the db.
        PostgresReadableArray humanReadableMatrix = getPostgresReadableArray(valueToEncrypt);
        stmt.setObject(parameterIndex++, humanReadableMatrix, Types.OTHER);
        ResultSet res = stmt.executeQuery();
    }
    ...
}
I am currently attaching the binding to a codegen xml file here:
<forcedType>
    <userType>java.math.BigDecimal[][][]</userType>
    <binding>com.myapp.EncryptionBinding</binding>
    <includeExpression>matrix_column</includeExpression>
    <includeTypes>bytea</includeTypes>
</forcedType>
Here is my binding class EncryptionBinding:
public class EncryptionBinding implements Binding<byte[], BigDecimal[][][]> {
    @Override
    public Converter<byte[], BigDecimal[][][]> converter() {
        return new MatrixConverter();
    }

    // Rendering a bind variable for the binding context's value and casting it to the json type
    @Override
    public void sql(BindingSQLContext<BigDecimal[][][]> ctx) throws SQLException {
    }

    // Registering VARCHAR types for JDBC CallableStatement OUT parameters
    @Override
    public void register(BindingRegisterContext<BigDecimal[][][]> ctx) throws SQLException {
        ctx.statement().registerOutParameter(ctx.index(), Types.VARCHAR);
    }

    // Converting the BigDecimal[][][] to an encrypted value and setting that on a JDBC PreparedStatement
    @Override
    public void set(BindingSetStatementContext<BigDecimal[][][]> ctx) throws SQLException {
        ctx.statement().setBytes(ctx.index(), ctx.convert(converter()).value());
    }
    ...
...
Here is my converter class MatrixConverter used in the above EncryptionBinding class:
public class MatrixConverter extends AbstractConverter<byte[], BigDecimal[][][]> {
    private static final Logger logger = LoggerFactory.getLogger(MatrixConverter.class);

    public MatrixConverter() {
        super(byte[].class, BigDecimal[][][].class);
    }

    @Override
    public BigDecimal[][][] from(byte[] databaseObject) {
        return EncryptionUtils.decrypt(databaseObject);
    }

    @Override
    public byte[] to(BigDecimal[][][] userObject) {
        return EncryptionUtils.encrypt(JsonUtils.toJson(userObject));
    }
}
My requirement is to create multiple threads, execute the queries, and return the combined output as a Map<String, List<Object>>.
The map key is the table name, and the List<Object> value holds that query's result records.
The requirement:
I have one table that contains fields like TableName and Query.
E.g.
employ | select * from employ; — this query returns more than 100,000 records
employ_detail | select * from employ_detail; — this query returns more than 300,000 records
employ_salary | select * from employ_salary; — this query returns more than 600,000 records
The table above may contain 10,000 queries.
I want to create one API for the above using Spring Boot + Hibernate.
My problem:
I want to build a multithreaded solution for this using Java 8.
@RestController
public class ApiQueries {
    @RequestMapping(value = "/getAllQueries", method = RequestMethod.GET)
    public CommonDTO getAllQuery() {
        list = apiQueryService.findAll();
        if (null != list) {
            objectMap = apiQueryService.executeQueryData(list); // apiQueryService has a method named executeQueryData()
        }
    }
}
I wrote the below logic in that method.
@Override
public Map<String, List<Object>> executeQueryData(List<ApiQueries> apiQuerylist, String fromDate, String toDate) {
    addExecutor = new ThreadPoolExecutor(3, 5, 10, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
    List<Object> obj = null;
    Map<String, List<Object>> returnMap = new HashMap<String, List<Object>>();
    try {
        if (session == null) {
            session = sessionFactory.openSession();
        }
        apiQuerylist.forEach(list -> addExecutor.execute(new Runnable() {
            @Override
            public void run() {
                apiQueryObject = session.createSQLQuery(list.getQuery()).list();
                returnMap.put(list.getTableName(), apiQueryObject);
            }
        }));
    } catch (Exception ex) {
        System.out.println("Inside [B] Exception " + ex);
        ex.printStackTrace();
    } finally {
        if (session != null) {
            session.close();
        }
    }
    return returnMap;
}
The issue is that when I call the API, the code below runs in the background and the method returns before the results are ready (an empty/null result), while in the background I can see the queries executing one by one:
apiQuerylist.forEach(list -> addExecutor.execute(new Runnable() {
    @Override
    public void run() {
        apiQueryObject = session.createSQLQuery(list.getQuery()).list();
        returnMap.put(list.getTableName(), apiQueryObject);
    }
}));
You need to wait for thread pool completion. Something like below after apiQuerylist.forEach should work:
addExecutor.shutdown();
// waiting for executors to finish their jobs
while (!addExecutor.awaitTermination(50, TimeUnit.MILLISECONDS));
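The same idea can be written with CompletableFuture so the method blocks until every query has finished before returning. A sketch under stated assumptions: results go into a ConcurrentHashMap (a plain HashMap is not safe for concurrent put calls), and a plain function stands in for session.createSQLQuery(...).list(), since a Hibernate Session is itself not thread-safe and each task should really obtain its own. ParallelQueryRunner and runAll are illustrative names, not from the post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Sketch: run each query on a fixed pool, collect results into a thread-safe
// map, and block until all futures complete before returning the map.
public final class ParallelQueryRunner {
    public static Map<String, List<Object>> runAll(
            Map<String, String> queriesByTable,          // table name -> SQL text
            Function<String, List<Object>> queryExecutor // runs one query, returns its rows
    ) {
        ExecutorService pool = Executors.newFixedThreadPool(5);
        Map<String, List<Object>> results = new ConcurrentHashMap<>(); // HashMap would race here
        try {
            List<CompletableFuture<Void>> futures = new ArrayList<>();
            queriesByTable.forEach((table, sql) ->
                futures.add(CompletableFuture.runAsync(
                    () -> results.put(table, queryExecutor.apply(sql)), pool)));
            // Wait for every query before returning, so the map is complete.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
        return results;
    }
}
```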
I have designed and implemented a simple webstore based on traditional MVC Model 1 architecture using pure JSP and JavaBeans (Yes, I still use that legacy technology in my pet projects ;)).
I am using DAO design pattern to implement my persistence layer for a webstore. But I am not sure if I have implemented the classes correctly in my DAO layer. I am specifically concerned about the QueryExecutor.java and DataPopulator.java classes (mentioned below). All the methods in both these classes are defined as static which makes me think if this is the correct approach in multithreaded environment. Hence, I have following questions regarding the static methods.
Will there be synchronization issues when multiple users are trying to do a checkout with different products? If answer to the above question is yes, then how can I actually reproduce this synchronization issue?
Are there any testing/tracing tools available which will actually show that a specific piece of code will/might create synchronization issues in a multithreaded environment? Can I see that a User1 was trying to access Product-101 but was displayed Product-202 because of non thread-safe code?
Assuming there are synchronization issues: should these methods be made non-static and the classes instantiable, so that we can create an instance using the new operator, OR should a synchronized block be placed around the non-thread-safe code?
Please guide.
MasterDao.java
public interface MasterDao {
    Product getProduct(int productId) throws SQLException;
}
BaseDao.java
public abstract class BaseDao {
    protected DataSource dataSource;

    public BaseDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }
}
MasterDaoImpl.java
public class MasterDaoImpl extends BaseDao implements MasterDao {
    private static final Logger LOG = Logger.getLogger(MasterDaoImpl.class);

    public MasterDaoImpl(DataSource dataSource) {
        super(dataSource);
    }

    @Override
    public Product getProduct(int productId) throws SQLException {
        Product product = null;
        String sql = "select * from products where product_id= " + productId;
        // STATIC METHOD CALL HERE, COULD THIS POSE A SYNCHRONIZATION ISSUE ??????
        List<Product> products = QueryExecutor.executeProductsQuery(dataSource.getConnection(), sql);
        if (!GenericUtils.isListEmpty(products)) {
            product = products.get(0);
        }
        return product;
    }
}
QueryExecutor.java
public final class QueryExecutor {
    private static final Logger LOG = Logger.getLogger(QueryExecutor.class);

    // SO CANNOT NEW AN INSTANCE
    private QueryExecutor() {
    }

    static List<Product> executeProductsQuery(Connection cn, String sql) {
        Statement stmt = null;
        ResultSet rs = null;
        List<Product> al = new ArrayList<>();
        LOG.debug(sql);
        try {
            stmt = cn.createStatement();
            rs = stmt.executeQuery(sql);
            while (rs != null && rs.next()) {
                // STATIC METHOD CALL HERE, COULD THIS POSE A SYNCHRONIZATION ISSUE ???????
                Product p = DataPopulator.populateProduct(rs);
                al.add(p);
            }
            LOG.debug("al.size() = " + al.size());
            return al;
        } catch (Exception ex) {
            LOG.error("Exception while executing products query....", ex);
            return null;
        } finally {
            try {
                if (rs != null) {
                    rs.close();
                }
                if (stmt != null) {
                    stmt.close();
                }
                if (cn != null) {
                    cn.close();
                }
            } catch (Exception ex) {
                LOG.error("Exception while closing DB resources rs, stmt or cn.......", ex);
            }
        }
    }
}
DataPopulator.java
public class DataPopulator {
    private static final Logger LOG = Logger.getLogger(DataPopulator.class);

    // SO CANNOT NEW AN INSTANCE
    private DataPopulator() {
    }

    // STATIC METHOD DEFINED HERE, COULD THIS POSE A SYNCHRONIZATION ISSUE FOR THE CALLING METHODS ???????
    public static Product populateProduct(ResultSet rs) throws SQLException {
        String productId = GenericUtils.nullToEmptyString(rs.getString("PRODUCT_ID"));
        String name = GenericUtils.nullToEmptyString(rs.getString("NAME"));
        String image = GenericUtils.nullToEmptyString(rs.getString("IMAGE"));
        String listPrice = GenericUtils.nullToEmptyString(rs.getString("LIST_PRICE"));
        Product product = new Product(Integer.valueOf(productId), name, image, new BigDecimal(listPrice));
        LOG.debug("product = " + product);
        return product;
    }
}
Your code is thread-safe.
The reason, and the key to thread safety, is that your (static) methods do not maintain state; i.e., they use only local variables, not fields.
It doesn't matter if the methods are static or not.
Assuming there are synchronization issues; Should these methods be made non-static and classes instantitable so that we can create an instance using new operator
This won't help. Multiple threads can do as they please with a single object just as they can with a static method, and you would run into the same synchronization issues.
OR Should a synchronized block be placed around the non thread-safe code?
Yes this is the safe way. Any code inside of a synchronized block is guaranteed to have at most one thread in it for any given time.
Looking through your code, I don't see many data structures that could possibly be shared amongst threads. Assuming you had something like
public final class QueryExecutor {
    int numQueries = 0;

    public void doQuery() {
        numQueries++;
    }
}
Then you would run into trouble, because four threads could execute doQuery() at the same moment, all modifying the value of numQueries concurrently: a big problem.
However, with your code, the only shared class field is the logger, which has its own thread-safe synchronization built in; therefore the code you have provided looks good.
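For illustration, a racy counter like numQueries above can be made safe either with a synchronized block or with an AtomicInteger. A minimal sketch (SafeQueryCounter and its method names are mine, not from the post):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the two standard fixes for the numQueries race described above.
public final class SafeQueryCounter {
    private int numQueries = 0;                               // guarded by 'this'
    private final AtomicInteger atomicCount = new AtomicInteger();

    // Fix 1: at most one thread at a time may run the synchronized block.
    public void doQuerySynchronized() {
        synchronized (this) {
            numQueries++;
        }
    }

    // Fix 2: a lock-free atomic increment.
    public void doQueryAtomic() {
        atomicCount.incrementAndGet();
    }

    public synchronized int count() { return numQueries; }
    public int atomicCount() { return atomicCount.get(); }
}
```

With either fix, four threads incrementing concurrently always produce the exact total; with the unguarded `numQueries++` they frequently lose updates.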
There is no state within your code (no mutable member variables or fields, for example), so Java synchronisation is irrelevant.
Also as far as I can tell there are no database creates, updates, or deletes, so there's no issue there either.
There's some questionable practice, for sure (e.g. the non-management of the database Connection object, the wide scope of some variables, not to mention the statics), but nothing wrong as such.
As for how you would test, or determine thread-safety, you could do worse than operate your site manually using two different browsers side-by-side. Or create a shell script that performs automated HTTP requests using curl. Or create a WebDriver test that runs multiple sessions across a variety of real browsers and checks that the expected products are visible under all scenarios...
For an MORPG hack'n'slash game I am currently using Neo4j with a pattern like this:
I have a Neo4j connector class that creates the connection and implements the Singleton pattern; this instance is used by every xxxMapper class, which calls Neo4jConnector.getInstance().query(String query) and gets back an iterator over the query result.
The game will issue a ton of queries per second (around 5 per player per second), so I don't know which pattern performs best: keep my singleton system, use a pool of Neo4jConnectors, or something else I haven't considered yet.
Here is the connector class :
public class Neo4jConnector {
    private String urlRest;
    private String url = "http://localhost:7474";
    protected QueryEngine<?> engine;
    protected static Neo4jConnector INSTANCE = new Neo4jConnector();

    private Neo4jConnector() {
        urlRest = url + "/db/data";
        final RestAPI graphDb = new RestAPIFacade(urlRest);
        engine = new RestCypherQueryEngine(graphDb);
    }

    public static Neo4jConnector getInstance() {
        if (INSTANCE == null) {
            INSTANCE = new Neo4jConnector();
        }
        return INSTANCE;
    }

    @SuppressWarnings("unchecked")
    public Iterator<Map<String, Object>> query(String query) {
        QueryResult<Map<String, Object>> row = (QueryResult<Map<String, Object>>) engine.query(query, Collections.EMPTY_MAP);
        return row.iterator();
    }
}
And an example call of this class:
Iterator<Map<String, Object>> iterator = Neo4jConnector.getInstance().query("optional Match(u:User{username:'"+username+"'}) return u.password as password, u.id as id");
Neo4j's embedded GraphDatabaseService is not pooled; it is thread-safe, so a single instance can be shared.
I would not recommend RestGraphDatabase and friends, because it is slow and outdated.
Just use parameters instead of literal strings and don't use optional match to start a query.
If you are looking for faster access, look into the JDBC driver (which will be updated soonish).
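Following that advice, the username lookup shown earlier can pass its value as a parameter instead of concatenating it into the Cypher string. A sketch of the query/parameter pair: the connector would then forward the map via engine.query(query, params) rather than Collections.EMPTY_MAP (the source already calls that two-argument form). The `{username}` placeholder is the parameter syntax of that era's Cypher; the class and method names here are illustrative, not from the post.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: keep the Cypher text constant and ship user input separately as a
// parameter map, so values are never spliced into the query string.
public final class CypherQueries {
    // The query text never changes, which also lets the server cache its plan.
    public static String userByNameQuery() {
        return "MATCH (u:User {username: {username}}) RETURN u.password AS password, u.id AS id";
    }

    // The value travels in the parameter map, immune to quote/injection issues.
    public static Map<String, Object> userByNameParams(String username) {
        Map<String, Object> params = new HashMap<>();
        params.put("username", username);
        return params;
    }
}
```

The mapper call would then become something like `Neo4jConnector.getInstance().query(CypherQueries.userByNameQuery(), CypherQueries.userByNameParams(username))`, given a query overload that accepts the map.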