The implementation of the AbstractCassandraTupleSink is not serializable - java

I created a program to count words in Wikipedia edits. It works without any errors. Then I created a Cassandra table with two columns, "word" (text) and "count" (bigint). The problem appears when I try to write the words and counts into the Cassandra table. My program is the following:
public class WordCount_in_cassandra {

    public static void main(String[] args) throws Exception {
        // Checking input parameters
        final ParameterTool params = ParameterTool.fromArgs(args);

        // set up the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // make parameters available in the web interface
        env.getConfig().setGlobalJobParameters(params);

        DataStream<String> text = env.addSource(new WikipediaEditsSource()).map(WikipediaEditEvent::getTitle);

        DataStream<Tuple2<String, Integer>> counts =
            // split up the lines in pairs (2-tuples) containing: (word,1)
            text.flatMap(new Tokenizer())
                // group by the tuple field "0" and sum up tuple field "1"
                .keyBy(0).sum(1);

        // emit result
        if (params.has("output")) {
            counts.writeAsText(params.get("output"));
        } else {
            System.out.println("Printing result to stdout. Use --output to specify output path.");
            counts.print();

            CassandraSink.addSink(counts)
                .setQuery("INSERT INTO mar1.examplewordcount(word, count) values (?, ?);")
                .setHost("127.0.0.1")
                .build();
        }

        // execute program
        env.execute("Streaming WordCount");
    } // main

    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            // normalize and split the line
            String[] tokens = value.toLowerCase().split("\\W+");

            // emit the pairs
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}
After running this code I got this error:
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: The implementation of the AbstractCassandraTupleSink is not serializable. The object probably contains or references non serializable fields.
I searched a lot but could not find any solution for it. Would you please tell me how I can solve this issue?
Thank you in advance.

I tried to replicate your problem, but I didn't get the serialization issue. Because I don't have a Cassandra cluster running, it fails in the open() call instead, but that happens after serialization, as open() is called when the operator is started by the TaskManager. So it feels like something may be wrong with your dependencies, such that the wrong class is being used for the actual Cassandra sink.
BTW, it's always helpful to include context for your error - e.g. what version of Flink, are you running this from an IDE or on a cluster, etc.
Just FYI, here are the Flink jars on my classpath...
flink-java/1.7.0/flink-java-1.7.0.jar
flink-core/1.7.0/flink-core-1.7.0.jar
flink-annotations/1.7.0/flink-annotations-1.7.0.jar
force-shading/1.7.0/force-shading-1.7.0.jar
flink-metrics-core/1.7.0/flink-metrics-core-1.7.0.jar
flink-shaded-asm/5.0.4-5.0/flink-shaded-asm-5.0.4-5.0.jar
flink-streaming-java_2.12/1.7.0/flink-streaming-java_2.12-1.7.0.jar
flink-runtime_2.12/1.7.0/flink-runtime_2.12-1.7.0.jar
flink-queryable-state-client-java_2.12/1.7.0/flink-queryable-state-client-java_2.12-1.7.0.jar
flink-shaded-netty/4.1.24.Final-5.0/flink-shaded-netty-4.1.24.Final-5.0.jar
flink-shaded-guava/18.0-5.0/flink-shaded-guava-18.0-5.0.jar
flink-hadoop-fs/1.7.0/flink-hadoop-fs-1.7.0.jar
flink-shaded-jackson/2.7.9-5.0/flink-shaded-jackson-2.7.9-5.0.jar
flink-clients_2.12/1.7.0/flink-clients_2.12-1.7.0.jar
flink-optimizer_2.12/1.7.0/flink-optimizer_2.12-1.7.0.jar
flink-streaming-scala_2.12/1.7.0/flink-streaming-scala_2.12-1.7.0.jar
flink-scala_2.12/1.7.0/flink-scala_2.12-1.7.0.jar
flink-shaded-asm-6/6.2.1-5.0/flink-shaded-asm-6-6.2.1-5.0.jar
flink-test-utils_2.12/1.7.0/flink-test-utils_2.12-1.7.0.jar
flink-test-utils-junit/1.7.0/flink-test-utils-junit-1.7.0.jar
flink-runtime_2.12/1.7.0/flink-runtime_2.12-1.7.0-tests.jar
flink-queryable-state-runtime_2.12/1.7.0/flink-queryable-state-runtime_2.12-1.7.0.jar
flink-connector-cassandra_2.12/1.7.0/flink-connector-cassandra_2.12-1.7.0.jar
flink-connector-wikiedits_2.12/1.7.0/flink-connector-wikiedits_2.12-1.7.0.jar

"How to debug a serialization exception in Flink?" might help here. The exception happens when a non-serializable object is assigned to a field of a serializable class.
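For illustration, here is a minimal sketch of that pattern, with a made-up NonSerializableClient class and function name: a field initialized at construction time gets serialized together with the function and trips Flink's serializability check, while a transient field created in open() does not.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class EnrichingMap extends RichMapFunction<String, String> {

    // Made-up stand-in that does NOT implement java.io.Serializable
    static class NonSerializableClient {
        String lookup(String s) { return s; }
    }

    // BAD: initialized when the job graph is built, so Flink tries to serialize
    // it with the function and fails with "... is not serializable"
    // private final NonSerializableClient client = new NonSerializableClient();

    // GOOD: transient field, created in open(), which runs on the TaskManager
    private transient NonSerializableClient client;

    @Override
    public void open(Configuration parameters) {
        client = new NonSerializableClient();
    }

    @Override
    public String map(String value) {
        return client.lookup(value);
    }
}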

Related

Write elements of a map to a CSV correctly in a simplified way in Java 8

I have a countries Map with the following design:
England=24
Spain=21
Italy=10
etc
Then, I have a different citiesMap with the following design:
London=10
Manchester=5
Madrid=7
Barcelona=4
Roma=3
etc
Currently, I am printing these results on screen:
System.out.println("\nCountries:");
Map<String, Long> countryMap = countTotalResults(orderDataList, OrderData::getCountry);
writeResultInCsv(countryMap);
countryMap.entrySet().stream().forEach(System.out::println);
System.out.println("\nCities:\n");
Map<String, Long> citiesMap = countTotalResults(orderDataList, OrderData::getCity);
writeResultInCsv(citiesMap);
citiesMap.entrySet().stream().forEach(System.out::println);
I want to write each line of my 2 maps in the same CSV file. I have the following code:
public void writeResultInCsv(Map<String, Long> resultMap) throws Exception {
    File csvOutputFile = new File(RUTA_FICHERO_RESULTADO);
    try (PrintWriter pw = new PrintWriter(csvOutputFile)) {
        resultMap.entrySet().stream()
                .map(this::convertToCSV)
                .forEach(pw::println);
    }
}

public String convertToCSV(String[] data) {
    return Stream.of(data)
            .map(this::escapeSpecialCharacters)
            .collect(Collectors.joining("="));
}

public String escapeSpecialCharacters(String data) {
    String escapedData = data.replaceAll("\\R", " ");
    if (data.contains(",") || data.contains("\"") || data.contains("'")) {
        data = data.replace("\"", "\"\"");
        escapedData = "\"" + data + "\"";
    }
    return escapedData;
}
But I get compilation error in writeResultInCsv method, in the following line:
.map(this::convertToCSV)
This is the compilation error I get:
reason: Incompatible types: Entry is not convertible to String[]
How can I indicate the following result in a CSV file in Java 8 in a simplified way?
This is the result and design that I want my CSV file to have:
Countries:
England=24
Spain=21
Italy=10
etc
Cities:
London=10
Manchester=5
Madrid=7
Barcelona=4
Roma=3
etc
Your resultMap.entrySet() is a Set<Map.Entry<String, Long>>. You then turn that into a Stream<Map.Entry<String, Long>> and run .map on it. Thus, the mapper you provide there needs to map objects of type Map.Entry<String, Long> to whatever you like. But you pass the convertToCSV method to it, which maps String arrays.
Your code tries to join on comma (Collectors.joining(",")), but your desired output contains zero commas.
It feels like one of two things is going on:
1. You copy/pasted this code from someplace, or it was provided to you, and you have no idea what any of it does. I would advise tearing this code into pieces: take each individual piece, experiment with it until you understand it, then put it back together again so you know what you're looking at. At that point you would know that having Collectors.joining(",") here makes no sense whatsoever, and that you're trying to map an entry of String, Long using a mapping function that maps String arrays, which obviously doesn't work.
2. You know all this but haven't actually looked at your code. That seems a bit surprising, so I don't think this is it. But if it is, the code you have is so unrelated to the job you want to do that you might as well remove it entirely and turn this question into: "I have this. I want this. How do I do it?"
NB: A text file listing key=value pairs is not usually called a CSV file.
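For what it's worth, here is a minimal sketch of the kind of mapping the asker seems to want (the class, method name, and file handling are made up for illustration): map each Map.Entry<String, Long> to a "key=value" String instead of expecting a String[], and append each section to the same file.

import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.util.Map;

public class ResultWriter {

    // Hypothetical helper: appends a header plus one "key=value" line per entry.
    public static void appendResults(File outputFile, String header, Map<String, Long> resultMap)
            throws Exception {
        try (PrintWriter pw = new PrintWriter(new FileWriter(outputFile, true))) {
            pw.println(header);
            resultMap.entrySet().stream()
                    .map(e -> e.getKey() + "=" + e.getValue())   // Map.Entry -> String, not String[]
                    .forEach(pw::println);
        }
    }
}

Calling it once with "Countries:" and the country map, and once with "Cities:" and the city map, against the same file would produce the layout shown in the question.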

java.lang.ClassCastException: com.google.gson.internal.LinkedTreeMap cannot be cast to java.util.LinkedHashMap

I apologize for opening another question about this general issue, but none of the questions I've found on SO seem to relate closely to my issue.
I've got an existing, working dataflow pipeline that accepts objects of KV<Long, Iterable<TableRow>> and outputs TableRow objects. This code is in our production environment, running without issue. I am now trying to implement a unit test with the direct runner to test this pipeline, but the unit test fails when it hits the line
LinkedHashMap<String, Object> evt = (LinkedHashMap<String, Object>) row.get(Schema.EVT);
in the pipeline, throwing the error message:
java.lang.ClassCastException: com.google.gson.internal.LinkedTreeMap
cannot be cast to java.util.LinkedHashMap
A simplified version of the existing dataflow code looks like this:
public static class Process extends DoFn<KV<Long, Iterable<TableRow>>, TableRow> {

    /* private variables */
    /* constructor */
    /* private functions */

    @ProcessElement
    public void processElement(ProcessContext c) throws InterruptedException, ParseException {
        EventProcessor eventProc = new EventProcessor();
        Processor.WorkItem workItem = new Processor.WorkItem();
        Iterator<TableRow> it = c.element().getValue().iterator();

        // process all TableRows having the same id
        while (it.hasNext()) {
            TableRow item = it.next();
            if (item.containsKey(Schema.EVT)) {
                eventProc.process(item, workItem);
            } else {
                /* process by a different Processor class */
            }
        }

        /* do additional logic */
        /* c.output() is somewhere far below */
    }
}

public class EventProcessor extends Processor {

    // Extract data from an event into the WorkItem
    @SuppressWarnings("unchecked")
    @Override
    public void process(TableRow row, WorkItem item) {
        try {
            LinkedHashMap<String, Object> evt = (LinkedHashMap<String, Object>) row.get(Schema.EVT);
            LinkedHashMap<String, Object> profile = (LinkedHashMap<String, Object>) row.get(Schema.PROFILE);
            /* if no exception, process further business logic */
            /* business logic */
        } catch (ParseException e) {
            System.err.println("Bad row");
        }
    }
}
The relevant portion of the unit test, which prepares the main input to the Process() DoFn, looks like this:
Map<Long, List<TableRow>> groups = new HashMap<Long, List<TableRow>>();
List<KV<Long, Iterable<TableRow>>> collections = new ArrayList<KV<Long, Iterable<TableRow>>>();
Gson gson = new Gson();

// populate the map with events grouped by id
for (int i = 0; i < EVENTS.length; i++) {
    TableRow row = gson.fromJson(EVENTS[i], TableRow.class);
    Long id = EVENT_IDS[i];
    if (groups.containsKey(id))
        groups.get(id).add(row);
    else
        groups.put(id, new ArrayList<TableRow>(Arrays.asList(row)));
}

// prepare main input for pipeline
for (Long key : groups.keySet())
    collections.add(KV.of(key, groups.get(key)));
The line causing the issue is gson.fromJson(EVENTS[i], TableRow.class);, which appears to encode the internal representation of the TableRow as a LinkedTreeMap rather than a LinkedHashMap.
The encoded type of the TableRow appears to be com.google.gson.internal.LinkedTreeMap instead of the expected java.util.LinkedHashMap. Is there a way I can cast the TableRow being created in my unit test to the correct type of java.util.LinkedHashMap, so that the unit test succeeds without making any changes to the existing dataflow code that already works in production?
Reposting the solution as an answer.
It is not recommended to cast to concrete classes if you do not use their specific features. In this case, it is better to cast to Map instead of LinkedHashMap. Gson's LinkedTreeMap is a Map too, so no problem should arise.
I would consider (not only that particular) cast a code smell. Every time a cast is coded, a risk is taken that a ClassCastException happens.
As the others already said, the Map interface could be used, like Map<String, Object> evt = (Map<String, Object>) row.get(Schema.EVT);.
Alternatively, a new LinkedHashMap could be constructed with new LinkedHashMap<String, Object>((Map<String, Object>) row.get(Schema.EVT));.
The second approach has the advantage of keeping the LinkedHashMap type, which might or might not be important, that depends on your scenario.
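A small sketch of both options (the class, method, and parameter names are made up; any Map<String, Object>, such as a TableRow, can be passed as row):

import java.util.LinkedHashMap;
import java.util.Map;

public class EvtExtraction {

    // Option 1: depend only on the Map interface; works whether the value is a
    // Gson LinkedTreeMap or a java.util.LinkedHashMap.
    @SuppressWarnings("unchecked")
    static Map<String, Object> asMap(Map<String, Object> row, String evtKey) {
        return (Map<String, Object>) row.get(evtKey);
    }

    // Option 2: copy into a LinkedHashMap when that concrete type is really needed.
    static LinkedHashMap<String, Object> asLinkedHashMap(Map<String, Object> row, String evtKey) {
        return new LinkedHashMap<>(asMap(row, evtKey));
    }
}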
It's because LinkedHashMap is not a supertype of LinkedTreeMap, so the two are not guaranteed to have the same methods. Casting that way would let evt promise methods that the object actually returned by row.get(Schema.EVT) may not have, so the cast fails at runtime.
However, you can cast a LinkedTreeMap to AbstractMap, Map, or Object, since those are all supertypes of it.
So (as many comments point out) to fix it, just use
Map<String, Object> evt = (Map<String, Object>) row.get(Schema.EVT);
and you should be fine.

Return debug information from Hadoop

I'm writing a Java application to run a MapReduce job on Hadoop. I've set up some local variables in my mapper/reducer classes but I'm not able to return the information to the main Java application. For example, if I set up a variable inside my Mapper class:
private static int nErrors = 0;
Each time I process a line from the input file, I increment the error count if the data is not formatted correctly. Finally, I define a get function for the errors and call this after my job is complete:
public static int GetErrors()
{
    return nErrors;
}
But when I print out the errors at the end:
System.out.println("Errors = " + UPMapper.GetErrors());
This always returns "0" no matter what I do! If I start with nErrors = 12;, then the final value is 12. Is it possible to get information from the MapReduce functions like this?
UPDATE
Based on the suggestion from Binary Nerd, I implemented some Hadoop counters:
// Define this enumeration in your main class
public static enum MyStats
{
MAP_GOOD_RECORD,
MAP_BAD_RECORD
}
Then inside the mapper:
if (SomeCheckOnTheInputLine())
{
// This record is good
context.getCounter(MyStats.MAP_GOOD_RECORD).increment(1);
}
else
{
// This record has failed in some way...
context.getCounter(MyStats.MAP_BAD_RECORD).increment(1);
}
Then in the output stream from Hadoop I see:
MAP_BAD_RECORD=11557
MAP_GOOD_RECORD=8676
Great! But the question still stands, how do I get those counter values back into the main Java application?
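For reference, counter values can normally be read back in the driver from the Job object after the job finishes; here is a rough sketch, assuming the job is submitted with waitForCompletion() and MyStats is the enum from the update above (the class and job names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class Driver {

    public static enum MyStats { MAP_GOOD_RECORD, MAP_BAD_RECORD }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "UP job");
        // ... setJarByClass, mapper/reducer classes, input and output paths ...
        job.waitForCompletion(true);

        // The counters live on the finished Job, not in a static mapper field
        Counters counters = job.getCounters();
        long bad = counters.findCounter(MyStats.MAP_BAD_RECORD).getValue();
        long good = counters.findCounter(MyStats.MAP_GOOD_RECORD).getValue();
        System.out.println("Errors = " + bad + ", good records = " + good);
    }
}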

Hadoop: Implement a nested for loop in MapReduce [Java]

I am trying to implement a statistical formula that requires comparing a datapoint with all other possible datapoints. For example my dataset is something like:
10.22
15.77
16.55
9.88
I need to go through this file like:
for (i = 0; i < data.length(); i++)
    for (j = 0; j < data.length(); j++)
        Sum += (data[i] + data[j])
Basically, when I get each line through my map function, I need to execute some instructions on the rest of the file in the reducer, like in a nested for loop.
Now I have tried using the DistributedCache and some form of ChainMapper, but to no avail. Any idea of how I can go about doing this would be really appreciated. Even an out-of-the-box approach would be helpful.
You need to override the run method of the Reducer class:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKey()) {
        // this key and its values correspond to the i of your first loop
        Text currentKey = context.getCurrentKey();
        Iterable<VALUEIN> currentValues = context.getValues();
        if (context.nextKey()) {
            // here you can get the next key and its values, corresponding to the j of your second loop
        }
    }
    cleanup(context);
}
Or, if you don't have a reducer, you can do the same in the Mapper by overriding:
public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
        /* context.nextKeyValue(), if invoked again, gives you the next key/value pair,
           which is what you are looking for in the second loop */
    }
    cleanup(context);
}
Let me know if this helps.

MapReduce - WritableComparables

I’m new to both Java and Hadoop. I’m trying a very simple program to get frequent pairs.
e.g.
Input: My name is Foo. Foo is student.
Intermediate Output:
Map:
(my, name): 1
(name ,is): 1
(is, Foo): 2 // (is, Foo) = (Foo, is)
(is, student)
So finally it should give that the frequent pair is (is, Foo).
Pseudo code looks like this:
Map(Key: line_num, value: line)
    words = split_words(line)
    for each w in words:
        for each neighbor x:
            emit((w, x), 1)
Here my key is not a single word, it's a pair. While going through the documentation, I read that for each new key we have to implement WritableComparable.
So I'm confused about that. If someone could explain this class, that would be great; I'm not sure it's really true. Then I can figure out on my own how to do it!
I don't want any code, neither a mapper nor anything else... I just want to understand what WritableComparable does. Which method of WritableComparable actually compares keys? I can see equals and compareTo, but I cannot find any explanation of them. Please, no code! Thanks
EDIT 1:
In compareTo I return 0 for the pair (a, b) = (b, a), but it's still not going to the same reducer. Is there any way, in the compareTo method, to reset the key (b, a) to (a, b), or to generate a totally new key?
EDIT 2:
I don't know about generating a new key, but after changing the logic in compareTo, it worked fine! Thanks everyone!
WritableComparable is an interface that makes the class that implements it be two things: Writable, meaning it can be written to and read from your network via serialization, etc. This is necessary if you're going to use it as a key or value so that it can be sent between Hadoop nodes. And Comparable, which means that methods must be provided that show how one object of the given class can be compared to another. This is used when the Reducer organizes by key.
This interface is necessary when you want to create your own object to be a key. And you'd need to create your own InputFormat as opposed to using one of the ones that come with Hadoop. This can be rather difficult (from my experience), especially if you're new to both Java and Hadoop.
So if I were you, I wouldn't bother with that, as there's a much simpler way. I would use TextInputFormat, which is conveniently both the default InputFormat and pretty easy to use and understand. You could simply emit each key as a Text object, which is pretty similar to a string. There is a caveat though; like you mentioned, "is Foo" and "Foo is" need to be evaluated as the same key. So with every pair of words you pull out, sort them alphabetically (using String.compareTo) before passing them as a key. That way you're guaranteed to have no repeats.
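As a tiny sketch of that last suggestion (the helper name and the comma delimiter are arbitrary choices):

import org.apache.hadoop.io.Text;

public class PairKeys {

    // Order the two words alphabetically so ("Foo", "is") and ("is", "Foo")
    // become the same Text key and therefore reach the same reducer.
    static Text pairKey(String w, String x) {
        return (w.compareTo(x) <= 0) ? new Text(w + "," + x)
                                     : new Text(x + "," + w);
    }
}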
Here is a mapper class for your problem; the frequent-pair-of-words logic is not implemented. I guess you were not looking for that.
public class MR {

    public static class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, LongWritable> {

        public static int check(String keyCheck) {
            // logic to check whether the key is frequent or not
            return 0;
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> keyMap = new HashMap<String, Integer>();
            String line = value.toString();
            String[] words = line.split(" ");
            for (int i = 0; i < (words.length - 1); i++) {
                String mapkeyString = words[i] + "," + words[i + 1];
                // logic to check whether mapkeyString is frequent or not
                int count = check(mapkeyString);
                keyMap.put(mapkeyString, count);
            }
            Set<Entry<String, Integer>> entries = keyMap.entrySet();
            for (Entry<String, Integer> entry : entries) {
                context.write(new Text(entry.getKey()), new LongWritable(entry.getValue()));
            }
        }
    }

    public static class Reduce extends Reducer<Text, LongWritable, Text, Text> {

        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
        }
    }

    public static void main(String[] args) {
        Configuration configuration = new Configuration();
        try {
            Job job = new Job(configuration, "Word Job");
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (InterruptedException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
