Trouble with Cassandra and ConsistencyLevel (Redundancy) - java

So, I have been playing with Cassandra, and have set up a cluster with three nodes. I am trying to figure out how redundancy works with ConsistencyLevels. Currently, I am writing data with ConsistencyLevel.ALL and reading data with ConsistencyLevel.ONE. From what I have been reading, this seems to make sense. I have three Cassandra nodes, and I want to write to all three of them. I only care about reading from one of them, so I will take the first response. To test this, I have written a bunch of data (again, with ConsistencyLevel.ALL). I then kill one of my nodes (not the "seed" or "listen_address" machine).
When I then try to read, I expect, maybe after some delay, to get my data back. Initially, I get a TimeoutException... which I expect. This is what one gets when Cassandra is trying to deal with an unexpected node loss, right? After about 20 seconds, I try again, and now am getting an UnavailableException, which is described as "Not all the replicas required could be created and/or read".
Well, I don't care about all the replicas... just one (as in ConsistencyLevel.ONE on my get statement), right?
Am I missing the ConsistencyLevel point here? How can I configure this to still get my information if a node dies?
Thanks

It sounds like you have Replication Factor (RF) set to 1, meaning only one node holds any given row. Thus, when you take a node down, no matter what consistency level you use, you won't be able to read or write 1/3 of your data. Your expectations match what should happen with RF = 3.
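With RF = 3, every node holds a replica of each row, so a read at ConsistencyLevel.ONE can be served by whichever replica is still up. As a rough sketch (this uses the DataStax Java driver rather than the Thrift-era client implied by the question, and "demo" is an illustrative keyspace name):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Hedged sketch: replication_factor = 3 puts a copy of every row on all
// three nodes, so ConsistencyLevel.ONE reads survive a node loss.
Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect();
session.execute("CREATE KEYSPACE demo WITH replication = "
        + "{'class': 'SimpleStrategy', 'replication_factor': 3}");

Note that even with RF = 3, writes at ConsistencyLevel.ALL will still fail while a node is down (ALL requires every replica to acknowledge), but your ConsistencyLevel.ONE reads will keep working.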

Related

Kafka Streams: Should we advance stream time per key to test Windowed suppression?

I learnt from this blog and this tutorial that in order to test suppression with event-time semantics, one should send dummy records to advance stream time.
I've tried to advance time by doing just that. But this does not seem to work unless time is advanced for a particular key.
I have a custom TimestampExtractor which associates my preferred "stream-time" with the records.
My stream topology pseudocode is as follows (I use the Kafka Streams DSL API):
source.mapValues(someProcessingLambda)
      .flatMap(flattenRecordsLambda)
      .groupByKey(Grouped.with(Serdes.ByteArray(), Serdes.ByteArray()))
      .windowedBy(TimeWindows.of(Duration.ofMinutes(10)).grace(Duration.ZERO))
      .aggregate(() -> null, aggregationLambda)
      .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
My input is of the following format:
1 - {"stream_time":"2019-04-09T11:08:36.000-04:00", id:"1", data:"..."}
2 - {"stream_time":"2019-04-09T11:09:36.000-04:00", id:"1", data:"..."}
3 - {"stream_time":"2019-04-09T11:18:36.000-04:00", id:"2", data:"..."}
4 - {"stream_time":"2019-04-09T11:19:36.000-04:00", id:"2", data:"..."}
.
.
Now records 1 and 2 belong to one 10-minute window according to stream_time, and 3 and 4 belong to another.
Within that window, records are aggregated as per id.
I expected that record 3 would signal that stream time has advanced and cause suppress to emit the data for the first window.
However, the data is not emitted until I send a dummy record with id:1 to advance the stream time for that key.
Have I understood the testing instruction incorrectly? Is this expected behavior? Does the key of the dummy record matter?
I’m sorry for the trouble. This is indeed a tricky problem. I have some ideas for adding some operations to support this kind of integration testing, but it’s hard to do without breaking basic stream processing time semantics.
It sounds like you’re testing a “real” KafkaStreams application, as opposed to testing with TopologyTestDriver. My first suggestion is that you’ll have a much better time validating your application semantics with TopologyTestDriver, if it meets your needs.
It sounds to me like you might have more than one partition in your input topic (and therefore your application). In the event that key 1 goes to one partition, and key 3 goes to another, you would see what you’ve observed. Each partition of your application tracks stream time independently.
TopologyTestDriver works nicely because it only uses one partition, and also because it processes data synchronously. Otherwise, you’ll have to craft your “dummy” time advancement messages to go to the same partition as the key you’re trying to flush out.
This is going to be especially tricky because your “flatMap().groupByKey()” is going to repartition the data. You’ll have to craft the dummy message so that it goes into the right partition after the repartition. Or you could experiment with writing your dummy messages directly into the repartition topic.
If you do need to test with KafkaStreams instead of TopologyTestDriver, I guess the easiest thing is just to write a “time advancement” message per key, as you were suggesting in your question. Not because it’s strictly necessary, but because it’s the easiest way to meet all these caveats.
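For completeness, here is roughly what the TopologyTestDriver route looks like (a sketch, assuming Kafka 2.4+ for TestInputTopic; the topic name, props, and String serdes are placeholders, and plain record timestamps stand in for your custom TimestampExtractor):

import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.TestInputTopic;
import org.apache.kafka.streams.TopologyTestDriver;
import java.time.Instant;
import java.util.Properties;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "suppress-test");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

// "topology" is the Topology built from the DSL code in the question.
try (TopologyTestDriver driver = new TopologyTestDriver(topology, props)) {
    TestInputTopic<String, String> input = driver.createInputTopic(
            "input-topic", new StringSerializer(), new StringSerializer());

    // Two records for key "1" inside one 10-minute window...
    input.pipeInput("1", "data-a", Instant.parse("2019-04-09T15:08:36Z"));
    input.pipeInput("1", "data-b", Instant.parse("2019-04-09T15:09:36Z"));

    // ...then any record past window end + grace. The test driver has a
    // single partition and processes synchronously, so this advances stream
    // time for all keys and suppress() emits the closed window for key "1".
    input.pipeInput("2", "data-c", Instant.parse("2019-04-09T15:18:36Z"));
}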
I’ll also mention that we are working on some general improvements to stream time handling in Kafka Streams that should simplify the situation significantly, but that doesn’t help you right now, of course.

How to count input and output rows on the Spark SQL API from Java?

I am trying to count the number of rows that a Java process reads and writes. The process uses the SQL API, dealing with Datasets of Row. Adding .count() at various points seems to slow it down a lot, even if I do a .persist() prior to those points.
I have also seen code that does a
.map(row -> {
    accumulator.add(1);
    return row;
}, SomeEncoder)
which works well enough, but the deserialization and re-serialization of the whole row seems unnecessary, and it isn't mechanical since one has to come up with the correct SomeEncoder at each point.
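Spelled out, that pattern looks something like this (a sketch; the SparkSession "spark", the input Dataset, and the accumulator name are assumptions, and RowEncoder plays the SomeEncoder role as in Spark 2.x/3.x):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.util.LongAccumulator;

// Assumed to exist already: a SparkSession "spark" and a Dataset<Row> "input".
LongAccumulator rowsRead = spark.sparkContext().longAccumulator("rowsRead");

Dataset<Row> counted = input.map((MapFunction<Row, Row>) row -> {
    rowsRead.add(1);   // side effect: tally the row
    return row;        // pass the row through unchanged
}, RowEncoder.apply(input.schema()));   // re-serialize with the same schema

(The usual accumulator caveat applies: updates from re-executed tasks can be counted more than once, so the total is best treated as approximate bookkeeping.)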
A third option is maybe to call a UDF0 that does the counting and then drop the dummy object it would return but I'm not sure if Spark would be allowed to optimize the whole code away if it can tell the UDF0 isn't changing the output.
Is there a good way of counting without deserializing the rows? Or alternatively, is there a method that does the equivalent of Java's streams' .peek() where the returned data isn't important?
EDIT: to clarify, the job isn't just counting. The counting is just for record-keeping purposes. The job is doing other things. In fact, this is a pretty generic problem, I've got lots of jobs that are doing some transformations on data and saving them somewhere, I just want to keep a running record of how many rows these jobs read and wrote.
Thank you

Java - Writing huge data to CSV

I am trying to write a huge amount of data, fetched from a MySQL db, to CSV using Super CSV. How can I manage the performance issue? Does Super CSV impose any limits when writing?
Since you included almost no detail in your question about how you are approaching the problem, it's hard to make concrete recommendations. So, here's a general one:
Unless you are writing your file to a really slow medium (some old USB stick or something), the slowest step in your process should be reading the data from the database.
There are two general ways to structure your program:
The bad way: reading all the data from the database into your application's memory first and then, in a second step, writing it all in one shot to the csv file.
The right way: "streaming" the data from the db into the csv file, i.e. writing the data to the csv file as it comes into your application (record by record or batch by batch).
The idea is to set up what is usually referred to as a "pipeline". Think of it like a conveyor belt in a factory: you have multiple stations in your process of assembling some widget. What you don't want is for station 1 to process all widgets while stations 2 and 3 sit idle, and then pass the whole container of widgets to station 2 to begin work, while stations 1 and 3 sit idle, and so forth. Instead, station 1 should send small batches (1 at a time, or 10 at a time or so) of finished widgets to station 2 immediately, so that it can start working on them as soon as possible. The goal is to keep all stations as busy as possible at all times.
In your example, station 1 is mysql retrieving the records, station 2 is your application that forwards (and processes?) them, and station 3 is supercsv. So, simply make sure that supercsv can start working as soon as possible, rather than having to wait for mysql to finish the entire request.
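In code, the streaming version could look roughly like this (a sketch; the query, column names, and file name are made up, and setFetchSize(Integer.MIN_VALUE) is the MySQL Connector/J hint to stream rows instead of buffering the entire result set in the driver):

import org.supercsv.io.CsvListWriter;
import org.supercsv.io.ICsvListWriter;
import org.supercsv.prefs.CsvPreference;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// SQLException/IOException propagate to the caller in this sketch.
try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://localhost/test", "user", "password");
     Statement stmt = conn.createStatement()) {

    // Ask the mysql driver to stream rows one at a time rather than
    // loading the whole result set into memory first.
    stmt.setFetchSize(Integer.MIN_VALUE);

    try (ResultSet rs = stmt.executeQuery("SELECT id, name, amount FROM orders");
         ICsvListWriter csv = new CsvListWriter(new FileWriter("orders.csv"),
                 CsvPreference.STANDARD_PREFERENCE)) {
        csv.writeHeader("id", "name", "amount");
        while (rs.next()) {
            // Each record goes straight from the ResultSet to the file.
            csv.write(rs.getString("id"), rs.getString("name"),
                    rs.getString("amount"));
        }
    }
}

The while loop is the conveyor belt: each record leaves station 1 (mysql), passes through station 2 (your application), and reaches station 3 (supercsv) before the next one arrives.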
If you do this right, you should be able to generate the csv file as quickly as mysql can throw records at you*, and then, if it's still too slow, you need to rethink your database backend.
*I haven't used supercsv yet, so I don't know how well it performs, but given how trivial its job is and how popular it is, I would find it hard to believe that it would end up performing worse (as measured in processing time per record) than mysql in this task. But this might be something that is worth verifying...

measuring statistics in java simulation

I have a group of nodes that send measurements to a bootstrap server. In the end I want the bootstrap server to sum all the measurements and write the result to a file. One way to do that is to overwrite the data in the file each time a measurement message is received (after summing up the current measurements). But this would be very inefficient. I want to store the measurement data and write it to the file only once, after the simulation is completed.
But the problem is that the simulator code I am using is not under my control; it's a library. So I can't tell when exactly the simulation is going to end (and hence I can't tell which measurement message will be the last one).
I naively tried to store the measurement data in a static class, but this data is not accessible when the simulation terminates. Is there any other way I can do this?
Thanks,
I would find the last message using a timeout: write to disk when you have new data but haven't received anything for a while, e.g. a second.
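A sketch of that idea (the file name and the one-second threshold are arbitrary choices): a small watchdog thread flushes the running sum whenever nothing new has arrived for a second, so the flush after the final message effectively captures the end of the simulation.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class MeasurementSink {
    private final AtomicLong sum = new AtomicLong();
    private volatile long lastArrival = System.currentTimeMillis();
    private volatile boolean dirty = false;

    // Called from the simulator's message handler for each measurement.
    public void onMeasurement(long value) {
        sum.addAndGet(value);
        lastArrival = System.currentTimeMillis();
        dirty = true;
    }

    // Overwrites the file with the running total once no message has
    // arrived for about a second.
    public void startWatchdog() {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> {
            if (dirty && System.currentTimeMillis() - lastArrival > 1000) {
                try (PrintWriter out = new PrintWriter(new FileWriter("sum.txt"))) {
                    out.println(sum.get());   // overwrite with the latest total
                    dirty = false;
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }, 1, 1, TimeUnit.SECONDS);
    }
}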
If you cannot store the data you need in the process (which it seems you can't, since the static class failed), you need to persist the data some other way. To an on-disk file is one option, and another common one would be to a database.

User matching with current data

I have a database full of two different types of users (Mentors and Mentees), whereby I want the second group (Mentees) to be able to "search" for people in the first group (Mentors) who match their profile. Mentors and Mentees can both go in and change items in their profile at any point in time.
Currently, I am using Apache Mahout for the user matching (recommender.mostSimilarIDs()). The problem I'm running into is that I have to reload the user data every single time anyone searches. By itself, this doesn't take that long, but when Mahout processes the data it seems to take a very long time (14 minutes for 3000 Mentors and 3000 Mentees). After processing, matching takes mere seconds. I also get the same INFO message over and over again while it's processing ("Processed 2248 users"), even though the code suggests that the message should only be output every 10,000 users.
I'm using the GenericUserBasedRecommender and the GenericDataModel, along with the NearestNUserNeighborhood, AveragingPreferenceInferrer and PearsonCorrelationSimilarity. I load mentors from the database, add the mentee to the list of POJOs and convert them to a FastByIDMap to give to the DataModel.
Is there a better way to be doing this? The product owner needs the data to be current for every search.
(I'm the author.)
You shouldn't need to ask it to reload the data every time; why are you doing that?
14 minutes sounds way, way too long to load such a small amount of data; something's wrong. You might follow up with more info at user@mahout.apache.org.
You are seeing log messages from a DataModel, which you can disable in your logging system of choice. It prints one final count. This is nothing to worry about.
I would advise you against using a PreferenceInferrer unless you absolutely know you want it. Do you actually have ratings here? I might suggest LogLikelihoodSimilarity if not.
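If it helps, a sketch of that setup without the inferrer (Mahout's Taste API; userData, menteeId, and the neighborhood size of 10 are placeholders). LogLikelihoodSimilarity ignores preference values, which suits profile-match data where there are no real ratings:

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// userData: FastByIDMap<PreferenceArray> built from your mentor/mentee POJOs.
// (These calls throw TasteException; wrap them in a method that declares it.)
DataModel model = new GenericDataModel(userData);
UserSimilarity similarity = new LogLikelihoodSimilarity(model);   // no inferrer
UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
GenericUserBasedRecommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

// Mentors most similar to a given mentee.
long[] matches = recommender.mostSimilarUserIDs(menteeId, 5);

The key point for the 14-minute problem is to build the DataModel and recommender once and reuse them across searches, refreshing only when profiles actually change.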
