cassandra read timeout in java butruns well on cqlsh - java

I'm runnig a select query on cassandra in java.It return read timeout exception.The problem is that is doen not produce an error in cqlsh.So the problem is in the code i guess.I'm using flink to connect to cassandra.Also i think that maybe the problem is that in cqlsh i need to hit enter the get more results.Maybe that's the cause for stopping the drive to read through all the result.Also when i use limit 100 it runs well.I have changed the cassandra.yaml read configuration but nothing changed.
ClusterBuilder cb = new ClusterBuilder() {
#Override
public Cluster buildCluster(Cluster.Builder builder) {
return builder.withPort(9042).addContactPoint("127.0.0.1")
.withSocketOptions(new SocketOptions().setReadTimeoutMillis(60000).setKeepAlive(true).setReuseAddress(true))
.withLoadBalancingPolicy(new RoundRobinPolicy())
.withReconnectionPolicy(new ConstantReconnectionPolicy(500L))
.withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.ONE))
.build();
}
};
CassandraInputFormat<Tuple1<ProfileAlternative>> cassandraInputFormat = new CassandraInputFormat("SELECT toJson(profilealternative) FROM profiles.profile where skills contains 'Financial Sector' ", cb);
cassandraInputFormat.configure(null);
cassandraInputFormat.open(null);
Tuple1<ProfileAlternative> testOutputTuple = new Tuple1();
while (!cassandraInputFormat.reachedEnd()) {
cassandraInputFormat.nextRecord(testOutputTuple);
System.out.println(testOutputTuple.f0);
}

Related

Kafka Streams application strange behavior in docker container

I am running Kafka Streams application in a docker container with docker-compose. However, the streams application is behaving strangely. So, I have a source topic (topicSource) and multiple destination topics (topicDestination1 , topicDestination2 ... topicDestination10) that I am branching to based on certain predicates.
topicSoure and topicDestination1 have a direct mapping i.e all the records are simply going into the destination topic without any filtering.
Now all this works perfectly fine when I am run the application locally or on a server without containers.
On the other hand, when I run streams app in container (using docker-compose and using kubernetes) then it doesn't forward all logs from topicSoure to topicDestination1. In fact, only a few number of records are forwarded. For Example some 3000 + records on source topic and only 6 records in destination topic. And all this is really strange.
This is my Dockerfile:
#FROM openjdk:8u151-jdk-alpine3.7
FROM openjdk:8-jdk
COPY /target/streams-examples-0.1.jar /streamsApp/
COPY /target/libs /streamsApp/libs
COPY log4j.properties /
CMD ["java", "-jar", "/streamsApp/streams-examples-0.1.jar"]
NOTE: I am building a jar before creating the image so that I always have an updated code. I have made sure that both the codes, the one running without container and the one with container are same.
Main.java:
Creating Source Stream from Source Topic:
KStream<String, String> source_stream = builder.stream("topicSource");
Branching based on predicates:
KStream<String, String>[] branches_source_topic = source_stream.branch(
(key, value) -> (value.contains("Operation\":\"SharingSet") && value.contains("ItemType\":\"File")), // Sharing Set by Date
(key, value) -> (value.contains("Operation\":\"AddedToSecureLink") && value.contains("ItemType\":\"File")), // Added to secure link
(key, value) -> (value.contains("Operation\":\"AddedToGroup")), // Added to group
(key, value) -> (value.contains("Operation\":\"Add member to role.") || value.contains("Operation\":\"Remove member from role.")),//Role update by date
(key, value) -> (value.contains("Operation\":\"FileUploaded") || value.contains("Operation\":\"FileDeleted")
|| value.contains("Operation\":\"FileRenamed") || value.contains("Operation\":\"FileMoved")), // Upload file by date
(key, value) -> (value.contains("Operation\":\"UserLoggedIn")), // User logged in by date
(key, value) -> (value.contains("Operation\":\"Delete user.") || value.contains("Operation\":\"Add user.")
&& value.contains("ResultStatus\":\"success")), // Manage user by date
(key, value) -> (value.contains("Operation\":\"DLPRuleMatch") && value.contains("Workload\":\"OneDrive")) // MS DLP
);
Sending logs to destination topics:
This is the direct mapping topic i.e. all the records are simply going into the destination topic without any filtering.
AppUtil.pushToTopic(source_stream, Constant.USER_ACTIVITY_BY_DATE, "topicDestination1");
Sending logs from branches to destination topics:
AppUtil.pushToTopic(branches_source_topic[0], Constant.SHARING_SET_BY_DATE, "topicDestination2");
AppUtil.pushToTopic(branches_source_topic[1], Constant.ADDED_TO_SECURE_LINK_BY_DATE, "topicDestination3");
AppUtil.pushToTopic(branches_source_topic[2], Constant.ADDED_TO_GROUP_BY_DATE, "topicDestination4");
AppUtil.pushToTopic(branches_source_topic[3], Constant.ROLE_UPDATE_BY_DATE, "topicDestination5");
AppUtil.pushToTopic(branches_source_topic[4], Constant.UPLOAD_FILE_BY_DATE, "topicDestination6");
AppUtil.pushToTopic(branches_source_topic[5], Constant.USER_LOGGED_IN_BY_DATE, "topicDestination7");
AppUtil.pushToTopic(branches_source_topic[6], Constant.MANAGE_USER_BY_DATE, "topicDestination8");
AppUtli.java:
public static void pushToTopic(KStream<String, String> sourceTopic, HashMap<String, String> hmap, String destTopicName) {
sourceTopic.flatMapValues(new ValueMapper<String, Iterable<String>>() {
#Override
public Iterable<String> apply(String value) {
ArrayList<String> keywords = new ArrayList<String>();
try {
JSONObject send = new JSONObject();
JSONObject received = processJSON(new JSONObject(value), destTopicName);
boolean valid_json = true;
for(String key: hmap.keySet()) {
if (received.has(hmap.get(key))) {
send.put(key, received.get(hmap.get(key)));
}
else {
valid_json = false;
}
}
if (valid_json) {
keywords.add(send.toString());
}
} catch (Exception e) {
System.err.println("Unable to convert to json");
e.printStackTrace();
}
return keywords;
}
}).to(destTopicName);
}
Where are the logs coming from:
So the logs are coming from an online continuous stream. A python job gets the logs which are basically URLs and sends them to a pre-source-topic. Then in streams app I am creating a streams from that topic and hitting those URLs which then return json logs that I am pushing to topicSource.
I have spent a lot of time trying to resolve this. I have no idea what is going wrong or why is it not processing all logs. Kindly help me figure this out.
So after a lot of debugging I came to know that I was exploring in the wrong direction, it was a simple case of consumer being slow than the producer. The producer kept on writing new records on topic and since the messages were being consumed after stream processing the consumer obviously was slow. Simply increasing topic partitions and launching multiple application instances with the same application id did the trick.

Flink Kafka - how to make App run in Parallel?

I am creating a app in Flink to
Read Messages from a topic
Do some simple process on it
Write Result to a different topic
My code does work, however it does not run in parallel
How do I do that?
It seems my code runs only on one thread/block?
On the Flink Web Dashboard:
App goes to running status
But, there is only one block shown in the overview subtasks
And Bytes Received / Sent, Records Received / Sent is always zero ( no Update )
Here is my code, please assist me in learning how to split my app to be able to run in parallel, and am I writing the app correctly?
public class SimpleApp {
public static void main(String[] args) throws Exception {
// create execution environment INPUT
StreamExecutionEnvironment env_in =
StreamExecutionEnvironment.getExecutionEnvironment();
// event time characteristic
env_in.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
// production Ready (Does NOT Work if greater than 1)
env_in.setParallelism(Integer.parseInt(args[0].toString()));
// configure kafka consumer
Properties properties = new Properties();
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("auto.offset.reset", "earliest");
// create a kafka consumer
final DataStream<String> consumer = env_in
.addSource(new FlinkKafkaConsumer09<>("test", new
SimpleStringSchema(), properties));
// filter data
SingleOutputStreamOperator<String> result = consumer.filter(new
FilterFunction<String>(){
#Override
public boolean filter(String s) throws Exception {
return s.substring(0, 2).contentEquals("PS");
}
});
// Process Data
// Transform String Records to JSON Objects
SingleOutputStreamOperator<JSONObject> data = result.map(new
MapFunction<String, JSONObject>()
{
#Override
public JSONObject map(String value) throws Exception
{
JSONObject jsnobj = new JSONObject();
if(value.substring(0, 2).contentEquals("PS"))
{
// 1. Raw Data
jsnobj.put("Raw_Data", value.substring(0, value.length()-6));
// 2. Comment
int first_index_comment = value.indexOf("$");
int last_index_comment = value.lastIndexOf("$") + 1;
// - set comment
String comment =
value.substring(first_index_comment, last_index_comment);
comment = comment.substring(0, comment.length()-6);
jsnobj.put("Comment", comment);
}
else {
jsnobj.put("INVALID", value);
}
return jsnobj;
}
});
// Write JSON to Kafka Topic
data.addSink(new FlinkKafkaProducer09<JSONObject>("localhost:9092",
"FilteredData",
new SimpleJsonSchema()));
env_in.execute();
}
}
My code does work, but it seems to run only on a single thread
( One block shown ) in web interface ( No passing of data, hence the bytes sent / received are not updated ).
How do I make it run in parallel ?
To run your job in parallel you can do 2 things:
Increase the parallelism of your job at the env level - i.e. do something like
StreamExecutionEnvironment env_in =
StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(4);
But this would only increase parallelism at flink end after it reads the data, so if the source is producing data faster it might not be fully utilized.
To fully parallelize your job, setup multiple partitions for your kafka topic, ideally the amount of parallelism you would want with your flink job. So, you might want to do something like below when you are creating your kafka topic:
bin/kafka-topics.sh --create --zookeeper localhost:2181
--replication-factor 3 --partitions 4 --topic test

Vert.x performance drop when starting with -cluster option

I'm wondering if any one experienced the same problem.
We have a Vert.x application and in the end it's purpose is to insert 600 million rows into a Cassandra cluster. We are testing the speed of Vert.x in combination with Cassandra by doing tests in smaller amounts.
If we run the fat jar (build with Shade plugin) without the -cluster option, we are able to insert 10 million records in about a minute. When we add the -cluster option (eventually we will run the Vert.x application in cluster) it takes about 5 minutes for 10 million records to insert.
Does anyone know why?
We know that the Hazelcast config will create some overhead, but never thought it would be 5 times slower. This implies we will need 5 EC2 instances in cluster to get the same result when using 1 EC2 without the cluster option.
As mentioned, everything runs on EC2 instances:
2 Cassandra servers on t2.small
1 Vert.x server on t2.2xlarge
You are actually running into corner cases of the Vert.x Hazelcast Cluster manager.
First of all you are using a worker Verticle to send your messages (30000001). Under the hood Hazelcast is blocking and thus when you send a message from a worker the version 3.3.3 does not take that in account. Recently we added this fix https://github.com/vert-x3/issues/issues/75 (not present in 3.4.0.Beta1 but present in 3.4.0-SNAPSHOTS) that will improve this case.
Second when you send all your messages at the same time, it runs into another corner case that prevents the Hazelcast cluster manager to use a cache of the cluster topology. This topology cache is usually updated after the first message has been sent and sending all the messages in one shot prevents the usage of the ache (short explanation HazelcastAsyncMultiMap#getInProgressCount will be > 0 and prevents the cache to be used), hence paying the penalty of an expensive lookup (hence the cache).
If I use Bertjan's reproducer with 3.4.0-SNAPSHOT + Hazelcast and the following change: send message to destination, wait for reply. Upon reply send all messages then I get a lot of improvements.
Without clustering : 5852 ms
With clustering with HZ 3.3.3 :16745 ms
With clustering with HZ 3.4.0-SNAPSHOT + initial message : 8609 ms
I believe also you should not use a worker verticle to send that many messages and instead send them using an event loop verticle via batches. Perhaps you should explain your use case and we can think about the best way to solve it.
When you're you enable clustering (of any kind) to an application you are making your application more resilient to failures but you're also adding a performance penalty.
For example your current flow (without clustering) is something like:
client ->
vert.x app ->
in memory same process eventbus (negletible) ->
handler -> cassandra
<- vert.x app
<- client
Once you enable clustering:
client ->
vert.x app ->
serialize request ->
network request cluster member ->
deserialize request ->
handler -> cassandra
<- serialize response
<- network reply
<- deserialize response
<- vert.x app
<- client
As you can see there are many encode decode operations required plus several network calls and this all gets added to your total request time.
In order to achive best performance you need to take advantage of locality the closer you are of your data store usually the fastest.
Just to add the code of the project. I guess that would help.
Sender verticle:
public class ProviderVerticle extends AbstractVerticle {
#Override
public void start() throws Exception {
IntStream.range(1, 30000001).parallel().forEach(i -> {
vertx.eventBus().send("clustertest1", Json.encode(new TestCluster1(i, "abc", LocalDateTime.now())));
});
}
#Override
public void stop() throws Exception {
super.stop();
}
}
And the inserter verticle
public class ReceiverVerticle extends AbstractVerticle {
private int messagesReceived = 1;
private Session cassandraSession;
#Override
public void start() throws Exception {
PoolingOptions poolingOptions = new PoolingOptions()
.setCoreConnectionsPerHost(HostDistance.LOCAL, 2)
.setMaxConnectionsPerHost(HostDistance.LOCAL, 3)
.setCoreConnectionsPerHost(HostDistance.REMOTE, 1)
.setMaxConnectionsPerHost(HostDistance.REMOTE, 3)
.setMaxRequestsPerConnection(HostDistance.LOCAL, 20)
.setMaxQueueSize(32768)
.setMaxRequestsPerConnection(HostDistance.REMOTE, 20);
Cluster cluster = Cluster.builder()
.withPoolingOptions(poolingOptions)
.addContactPoints(ClusterSetup.SEEDS)
.build();
System.out.println("Connecting session");
cassandraSession = cluster.connect("kiespees");
System.out.println("Session connected:\n\tcluster [" + cassandraSession.getCluster().getClusterName() + "]");
System.out.println("Connected hosts: ");
cassandraSession.getState().getConnectedHosts().forEach(host -> System.out.println(host.getAddress()));
PreparedStatement prepared = cassandraSession.prepare(
"insert into clustertest1 (id, value, created) " +
"values (:id, :value, :created)");
PreparedStatement preparedTimer = cassandraSession.prepare(
"insert into timer (name, created_on, amount) " +
"values (:name, :createdOn, :amount)");
BoundStatement timerStart = preparedTimer.bind()
.setString("name", "clusterteststart")
.setInt("amount", 0)
.setTimestamp("createdOn", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(timerStart);
EventBus bus = vertx.eventBus();
System.out.println("Bus info: " + bus.toString());
MessageConsumer<String> cons = bus.consumer("clustertest1");
System.out.println("Consumer info: " + cons.address());
System.out.println("Waiting for messages");
cons.handler(message -> {
TestCluster1 tc = Json.decodeValue(message.body(), TestCluster1.class);
if (messagesReceived % 100000 == 0)
System.out.println("Message received: " + messagesReceived);
BoundStatement boundRecord = prepared.bind()
.setInt("id", tc.getId())
.setString("value", tc.getValue())
.setTimestamp("created", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(boundRecord);
if (messagesReceived % 100000 == 0) {
BoundStatement timerStop = preparedTimer.bind()
.setString("name", "clusterteststop")
.setInt("amount", messagesReceived)
.setTimestamp("createdOn", new Timestamp(new Date().getTime()));
cassandraSession.executeAsync(timerStop);
}
messagesReceived++;
//message.reply("OK");
});
}
#Override
public void stop() throws Exception {
super.stop();
cassandraSession.close();
}
}

Spark Checkpoint doesn't remember state (Java HDFS)

ALready Looked at Spark streaming not remembering previous state
but doesn't help.
Also looked at http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing but cant find JavaStreamingContextFactory although I am using spark streaming 2.11 v 2.0.1
My code works fine but when I restart it... it won't remember the last checkpoint...
Function0<JavaStreamingContext> scFunction = new Function0<JavaStreamingContext>() {
#Override
public JavaStreamingContext call() throws Exception {
//Spark Streaming needs to checkpoint enough information to a fault- tolerant storage system such
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.milliseconds(SPARK_DURATION));
//checkpointDir = "hdfs://user:pw#192.168.1.50:54310/spark/checkpoint";
ssc.sparkContext().setCheckpointDir(checkpointDir);
StorageLevel.MEMORY_AND_DISK();
return ssc;
}
};
JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkpointDir, scFunction);
Currently data is streaming from kafka and I am performing some transformation and action.
JavaPairDStream<Integer, Long> responseCodeCountDStream = logObject.transformToPair
(MainApplication::responseCodeCount);
JavaPairDStream<Integer, Long> cumulativeResponseCodeCountDStream = responseCodeCountDStream.updateStateByKey
(COMPUTE_RUNNING_SUM);
cumulativeResponseCodeCountDStream.foreachRDD(rdd -> {
rdd.checkpoint();
LOG.warn("Response code counts: " + rdd.take(100));
});
Could somebody point me to right direction, if I am missing something?
Also, I can see that checkpoint is being saved in hdfs. But why wont it read from it?

Cassandra updates not working consistently

I run the following code on my local (mac) machine and on a remote unix server.:
public void deleteValue(final String id, final String value) {
log.info("Removing value " + value);
final Collection<String> valuesBeforeRemoval = getValues(id);
final MutationBatch m = keyspace.prepareMutationBatch();
m.withRow(VALUES_CF, id).deleteColumn(value);
try {
m.execute();
} catch (final ConnectionException e) {
log.error("Unable to delete location " + value, e);
}
final Collection<String> valuesAfterRemoval = getValues(id);
if (valuesAfterRemoval.size()!=(valuesBeforeRemoval.size()-1)) {
log.error("value " + value + " was supposed to be removed from list " + valuesBeforeRemoval + " but it wasn't: " + valuesAfterRemoval);
}
...
}
protected Collection<String> getValues(final String id) {
try {
final OperationResult<ColumnList<String>> operationResult = keyspace
.prepareQuery(VALUES_CF).getKey(id).execute();
final ColumnList<String> result = operationResult.getResult();
if (result.isEmpty()) {
log.info("No value found for id: " + id);
return new ArrayList<String>();
}
return result.getColumnNames();
} catch (final ConnectionException e) {
log.error("Unable to retrieve session " + id, e);
}
return new ArrayList<String>();
}
Locally, that line is never executed, which makes sense:
log.error("value " + value + " was supposed to be removed from list " + valuesBeforeRemoval + " but it wasn't: " + valuesAfterRemoval);
but that line is executed on my dev server:
[ERROR] [main] [n.o.w.s.d.SessionDaoCassandraImpl] [2013-03-08 13:12:24,801]
[] - value 3 was supposed to be removed from list [3, 2, 1, 0, 7, 6, 5, 4, 9, 8] but it wasn't: [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
I am using com.netflix.astyanax
Both my local machine and the remote dev server connect to the very
same cassandra instance.
Both my local machine and the remote dev server run the very same test
creating a new row family, and adding 10 records before one is deleted.
When the error occurs on dev, log.error("Unable to delete
location " + value, e); was not executed (i.e. running the deletion
command didn't produce any exception).
I am 100% positive that no other code is affecting the content of the
database while I am running the test on dev so this isn't some
strange concurrency issue.
What could possibly explain that the deleteColumn(value) request runs without producing any error but still does not remove the column from the database?
ADDITIONAL INFO
Here is how I created the keyspace:
create keyspace sessiondata
with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
and strategy_options = {replication_factor:1};
Here is how I created the column family values, referenced as VALUES_CF in the code above:
create column family values
with comparator = UTF8Type
;
Here is how the keyspace referenced in the java code above is defined:
final AstyanaxContext.Builder contextBuilder = getBuilder();
final AstyanaxContext<Keyspace> keyspaceContext = contextBuilder
.forKeyspace(keyspaceName).buildKeyspace(
ThriftFamilyFactory.getInstance());
keyspaceContext.start();
keyspace = keyspaceContext.getEntity();
where getBuilder is:
private Builder getBuilder() {
final AstyanaxConfigurationImpl conf = new AstyanaxConfigurationImpl()
.setDiscoveryType(NodeDiscoveryType.NONE)
.setRetryPolicy(new RunOnce());
final ConnectionPoolConfigurationImpl poolConf = new ConnectionPoolConfigurationImpl("MyPool")
.setPort(port)
.setMaxConnsPerHost(1)
.setSeeds(value);
return new AstyanaxContext.Builder()
.forCluster(cluster)
.withAstyanaxConfiguration(conf)
.withConnectionPoolConfiguration(poolConf)
.withConnectionPoolMonitor(new CountingConnectionPoolMonitor());
}
SECOND UPDATE
First, the issues are not solely related to deletes. I observe similar problems when updating records in the database, reading them, and not being able to read the updates I just wrote
Second, I created a test that does 100 times the following operations:
write a row into cassandra
update that row in cassandra
read back that row from cassandra and check whether the row was indeed updated, and checking again regularly after delays if it wasn't
What I observe from that test is that:
again, when I run that code locally, all 100 iterations pass right away (no retry ever needed)
when I run that code on the remote server, some of the iterations pass, some fail. When they fail, no matter how large the delay (I wait up to 10 seconds), the test always fail.
At this point, I am really not sure how any cassandra setup could explain this behavior since I connect to the very same server for my tests and since the delays I insert are much larger than any additional latency I may need to run the test when connecting from my local machine.
The only relevant difference seems to be which machine the code is running on.
THIRD UPDATE
If in the test mentioned in the previous update, I insert a delay between the 2 writes, the code starts passing if the delay is >= 1,000 ms. A delay of, say, 100 ms doesn't help. I also modified the builder to set the default read and write consistencies to the most demanding: ALL, and that had no impact on the results of the test (still failing about half of the time unless delay between writes >1s):
final AstyanaxConfigurationImpl conf = new AstyanaxConfigurationImpl()
.setDiscoveryType(NodeDiscoveryType.NONE)
.setRetryPolicy(new RunOnce()).setDefaultReadConsistencyLevel(ConsistencyLevel.CL_ALL).setDefaultWriteConsistencyLevel(ConsistencyLevel.CL_ALL);
To debug, try printing the full row instead of just the column names. When I say the full row I mean the column name, column value and the time stamp. A long shot is clocks are wrong on one of your test machines and this is throwing out your tests on the other.
Another thing to double check is that ip is indeed what you think it is, in both your application and cassandra. When you retrieve it print it between something, like println("-" + ip "-"). Before and after your try block for the execute in deleteSecureLocation do a get for only that column, not the entire row. I'm not too sure how to do that in astynax, on the cli it would be get[id][ip].
Something to keep in mind is that a delete won't fail even if there's nothing to delete. To cassandra it's a write, the only thing that will make it a delete is if on read it's the latest timestamped entry against that row/column name.

Categories