I'm writing a Java application to run a MapReduce job on Hadoop. I've set up some local variables in my mapper/reducer classes but I'm not able to return the information to the main Java application. For example, if I set up a variable inside my Mapper class:
private static int nErrors = 0;
Each time I process a line from the input file, I increment the error count if the data is not formatted correctly. Finally, I define a get function for the errors and call this after my job is complete:
public static int GetErrors()
{
    return nErrors;
}
But when I print out the errors at the end:
System.out.println("Errors = " + UPMapper.GetErrors());
This always returns "0" no matter what I do! If I start with nErrors = 12;, then the final value is 12. Is it possible to get information from the MapReduce functions like this?
UPDATE
Based on the suggestion from Binary Nerd, I implemented some Hadoop counters:
// Define this enumeration in your main class
public static enum MyStats
{
    MAP_GOOD_RECORD,
    MAP_BAD_RECORD
}
Then inside the mapper:
if (SomeCheckOnTheInputLine())
{
    // This record is good
    context.getCounter(MyStats.MAP_GOOD_RECORD).increment(1);
}
else
{
    // This record has failed in some way...
    context.getCounter(MyStats.MAP_BAD_RECORD).increment(1);
}
Then in the output stream from Hadoop I see:
MAP_BAD_RECORD=11557
MAP_GOOD_RECORD=8676
Great! But the question still stands, how do I get those counter values back into the main Java application?
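For reference, here is a minimal driver-side sketch of reading the counter values back once the job has finished, assuming the job is submitted through the org.apache.hadoop.mapreduce.Job API (the job name below is illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// ... in the driver ...
Job job = Job.getInstance(new Configuration(), "UP job"); // set mapper, reducer, paths, etc. as usual
job.waitForCompletion(true);

// After completion, the counter values are available from the Job object:
long good = job.getCounters().findCounter(MyStats.MAP_GOOD_RECORD).getValue();
long bad = job.getCounters().findCounter(MyStats.MAP_BAD_RECORD).getValue();
System.out.println("MAP_GOOD_RECORD = " + good + ", MAP_BAD_RECORD = " + bad);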
Writing my first Spark program and finally have some basic stuff working, but now the output isn't matching the data from the Java program I am migrating.
The original Java program would go to a database and populate an array with several years of data, and I would then pass that to another object which would do several analyses and spit out the results, like averages. Some datapoints are more complex, though, so I can't just convert everything to Scala.
I have Spark loading the dataset and I can confirm that it loads correctly into my array, for example:
class MyJavaOb {
    int sales;
    int year;

    int getSales() {
        return sales;
    }
}
And I have an array of these:
MyJavaOb[] myArr;
which is correctly loaded from Spark:
2016 , 500
2017 , 900
2018 , 700
But now I want to pass this to my analysis object that is a bit more complex. I'm only writing out one metric but I have a lot of them.
import java.util.function.Function;

class MyAnalysisObj {
    final MyJavaOb[] arr;

    MyAnalysisObj(MyJavaOb[] arr) {
        this.arr = arr;
    }

    int getTotSales() {
        return getAvg(MyJavaOb::getSales);
    }

    int getAvg(Function<? super MyJavaOb, Integer> x) {
        int tot = 0;
        for (MyJavaOb o : arr) {
            tot = tot + x.apply(o);
        }
        return tot;
    }
}
But getTotSales() returns 0.
So for some reason, in the Spark environment the passed function doesn't work. I can confirm the array values are present, and it works in Java but not in Scala.
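For illustration, the plain-Java check of the same call path looks something like this (a (year, sales) constructor is assumed here, since the snippet above only shows the fields):
// Hypothetical plain-Java usage; MyJavaOb is assumed to have a (year, sales) constructor.
MyJavaOb[] data = {
    new MyJavaOb(2016, 500),
    new MyJavaOb(2017, 900),
    new MyJavaOb(2018, 700)
};
System.out.println(new MyAnalysisObj(data).getTotSales()); // prints 2100 in plain Java, but the Scala call below gives 0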
Edit #1 Scala looks something like this:
scala> myArr
Array[your.java.object.MyJavaOb] = Array(your.java.object.MyJavaOb@123kghj, your.java.object.MyJavaOb@343jhhsa, your.java.object.MyJavaOb@834dsfd)
scala> val test = new MyAnalysisObj(myArr)
scala> test.getTotSales
Int = 0
Note: one subtle difference is that the object in the array is returned by a factory and what we operate on is an interface. Not sure why this would be an issue, though.
I'm trying to create my custom metric variable according to this tutorial.
With the sample code it provides, I can get the events counter and the histogram.
I'm confused about how the identifier is used by Prometheus and Grafana. I also tried to modify the sample code a little bit, but then the metric no longer works.
Also, I'm only able to access the system metrics but not my own.
My questions are:
1. How can I access the counter I created (for example counter1)?
2. What is the metricGroup exactly?
3. If, for example, I'd like to detect a pattern from an input stream, is it more reasonable to do that in a metric, or to just output the result to a time series database like InfluxDB?
Thanks in advance.
Here is the map function:
class FlinkMetricsExposingMapFunction extends RichMapFunction<SensorReading, SensorReading> {
    private static final long serialVersionUID = 1L;

    private transient Counter eventCounter;
    private transient Counter customCounter1;
    private transient Counter customCounter2;

    @Override
    public void open(Configuration parameters) {
        eventCounter = getRuntimeContext()
                .getMetricGroup().counter("events");
        customCounter1 = getRuntimeContext()
                .getMetricGroup()
                .addGroup("customCounterKey", "mod2")
                .counter("counter1");
        customCounter2 = getRuntimeContext()
                .getMetricGroup()
                .addGroup("customCounterKey", "mod5")
                .counter("counter2");
        // meter = getRuntimeContext().getMetricGroup().meter("eventMeter", new DropwizardMeterWrapper(dropwizardMeter));
    }

    @Override
    public SensorReading map(SensorReading value) {
        eventCounter.inc();
        if (value.getCurrTimestamp() % 2 == 0)
            customCounter1.inc();
        if (value.getCurrTimestamp() % 5 == 0)
            customCounter2.inc();
        if (value.getCurrTimestamp() % 2 == 0 && value.getCurrTimestamp() % 5 == 0)
            customCounter1.dec();
        return value;
    }
}
Example Job:
env
    .addSource(new SimpleSensorReadingGenerator())
    .name(SimpleSensorReadingGenerator.class.getSimpleName())
    .map(new FlinkMetricsExposingMapFunction())
    .name(FlinkMetricsExposingMapFunction.class.getSimpleName())
    .print()
    .name(DataStreamSink.class.getSimpleName());
Update
Screenshot for accessing Flink metrics from Grafana:
Dockerfile (appending the metrics configuration to flink-conf.yaml):
FROM flink:1.9.0
RUN echo "metrics.reporters: prom" >> "$FLINK_HOME/conf/flink-conf.yaml"; \
echo "metrics.latency.interval: 1000" >> "$FLINK_HOME/conf/flink-conf.yaml"; \
echo "metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter" >> "$FLINK_HOME/conf/flink-conf.yaml"; \
mv $FLINK_HOME/opt/flink-metrics-prometheus-*.jar $FLINK_HOME/lib
COPY --from=builder /home/gradle/build/libs/*.jar $FLINK_HOME/lib/
Default map function from the tutorial:
@Override
public void open(Configuration parameters) {
    eventCounter = getRuntimeContext().getMetricGroup().counter("events");
    valueHistogram = getRuntimeContext()
            .getMetricGroup()
            .histogram("value_histogram", new DescriptiveStatisticsHistogram(10_000_000));
}
The counter you created is accessible via <system-scope>.customCounterKey.mod2.counter1. <system-scope> is defined in your flink-conf.yaml. If you did not define it there, the default is <host>.taskmanager.<tm_id>.<job_name>.<operator_name>.<subtask_index>.
A metric group basically defines a hierarchy of metric names. According to the documentation, a metric group is a named container for metrics. It consists of three parts (scopes): the system scope (defined in flink-conf.yaml), a user scope (whatever you define in addGroup()), and a metric name.
That depends on what you want to measure. For everything you can capture with counters, gauges or meters, I would go for the metrics. When it comes to histograms, you should take a closer look at what you get from Flink when you use the Prometheus reporter. Flink abstracts over the different metric frameworks, and histograms are implemented differently in Prometheus than in, e.g., Graphite. The definition of the buckets is given by Flink and can't be changed as far as I know (short of some reflection magic).
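As a small illustration of the registration pattern for something other than a counter, here is a gauge sketch; the class, group and metric names are placeholders rather than anything from the original job:
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;

public class GaugeExposingMapFunction extends RichMapFunction<Long, Long> {
    private transient long lastSeen; // the value the gauge reports

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext()
            .getMetricGroup()
            .addGroup("customCounterKey", "mod2")  // user scope, same style as in the question
            .gauge("lastSeen", new Gauge<Long>() { // metric name
                @Override
                public Long getValue() {
                    return lastSeen;
                }
            });
    }

    @Override
    public Long map(Long value) {
        lastSeen = value; // the gauge reports the most recently seen value
        return value;
    }
}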
All this is described in more detail here: https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#registering-metrics
Hope that helps.
I created a program to count words in Wikipedia. It works without any errors. Then I created a Cassandra table with two columns, "word" (text) and "count" (bigint). The problem arises when I want to insert the words and counts into the Cassandra table. My program is as follows:
public class WordCount_in_cassandra {

    public static void main(String[] args) throws Exception {
        // Checking input parameters
        final ParameterTool params = ParameterTool.fromArgs(args);

        // set up the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // make parameters available in the web interface
        env.getConfig().setGlobalJobParameters(params);

        DataStream<String> text = env.addSource(new WikipediaEditsSource()).map(WikipediaEditEvent::getTitle);

        DataStream<Tuple2<String, Integer>> counts =
            // split up the lines in pairs (2-tuples) containing: (word,1)
            text.flatMap(new Tokenizer())
                // group by the tuple field "0" and sum up tuple field "1"
                .keyBy(0).sum(1);

        // emit result
        if (params.has("output")) {
            counts.writeAsText(params.get("output"));
        } else {
            System.out.println("Printing result to stdout. Use --output to specify output path.");
            counts.print();

            CassandraSink.addSink(counts)
                .setQuery("INSERT INTO mar1.examplewordcount(word, count) values (?, ?);")
                .setHost("127.0.0.1")
                .build();
        }

        // execute program
        env.execute("Streaming WordCount");
    } // main

    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {

        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            // normalize and split the line
            String[] tokens = value.toLowerCase().split("\\W+");

            // emit the pairs
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}
After running this code I got this error:
Exception in thread "main" org.apache.flink.api.common.InvalidProgramException: The implementation of the AbstractCassandraTupleSink is not serializable. The object probably contains or references non serializable fields.
I searched a lot but I could not find any solution for it. Would you please tell me how I can solve the issue?
Thank you in advance.
I tried to replicate your problem, but I didn't get the serialization issue. Though because I don't have a Cassandra cluster running, it fails in the open() call. But this happens after serialization, as open() is called when the operator is started by the TaskManager. So it feels like something may be wrong with your dependencies, such that it's somehow using the wrong class for the actual Cassandra sink.
BTW, it's always helpful to include context for your error - e.g. what version of Flink, are you running this from an IDE or on a cluster, etc.
Just FYI, here are the Flink jars on my classpath...
flink-java/1.7.0/flink-java-1.7.0.jar
flink-core/1.7.0/flink-core-1.7.0.jar
flink-annotations/1.7.0/flink-annotations-1.7.0.jar
force-shading/1.7.0/force-shading-1.7.0.jar
flink-metrics-core/1.7.0/flink-metrics-core-1.7.0.jar
flink-shaded-asm/5.0.4-5.0/flink-shaded-asm-5.0.4-5.0.jar
flink-streaming-java_2.12/1.7.0/flink-streaming-java_2.12-1.7.0.jar
flink-runtime_2.12/1.7.0/flink-runtime_2.12-1.7.0.jar
flink-queryable-state-client-java_2.12/1.7.0/flink-queryable-state-client-java_2.12-1.7.0.jar
flink-shaded-netty/4.1.24.Final-5.0/flink-shaded-netty-4.1.24.Final-5.0.jar
flink-shaded-guava/18.0-5.0/flink-shaded-guava-18.0-5.0.jar
flink-hadoop-fs/1.7.0/flink-hadoop-fs-1.7.0.jar
flink-shaded-jackson/2.7.9-5.0/flink-shaded-jackson-2.7.9-5.0.jar
flink-clients_2.12/1.7.0/flink-clients_2.12-1.7.0.jar
flink-optimizer_2.12/1.7.0/flink-optimizer_2.12-1.7.0.jar
flink-streaming-scala_2.12/1.7.0/flink-streaming-scala_2.12-1.7.0.jar
flink-scala_2.12/1.7.0/flink-scala_2.12-1.7.0.jar
flink-shaded-asm-6/6.2.1-5.0/flink-shaded-asm-6-6.2.1-5.0.jar
flink-test-utils_2.12/1.7.0/flink-test-utils_2.12-1.7.0.jar
flink-test-utils-junit/1.7.0/flink-test-utils-junit-1.7.0.jar
flink-runtime_2.12/1.7.0/flink-runtime_2.12-1.7.0-tests.jar
flink-queryable-state-runtime_2.12/1.7.0/flink-queryable-state-runtime_2.12-1.7.0.jar
flink-connector-cassandra_2.12/1.7.0/flink-connector-cassandra_2.12-1.7.0.jar
flink-connector-wikiedits_2.12/1.7.0/flink-connector-wikiedits_2.12-1.7.0.jar
The question "How to debug serializable exception in Flink?" might help. It's happening because you are assigning a non-serializable object to a field of a serializable class.
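As a general illustration of that point (a sketch only, not specific to the Cassandra sink), the usual fix in Flink is to keep the non-serializable object out of the serialized function: mark the field transient and create it in open(). HeavyClient below is just a stand-in for any non-serializable helper:
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class ClientUsingMapFunction extends RichMapFunction<String, String> {
    private transient HeavyClient client; // transient: never shipped with the function

    @Override
    public void open(Configuration parameters) {
        client = new HeavyClient(); // created on the TaskManager, after deserialization
    }

    @Override
    public String map(String value) {
        return client.process(value);
    }

    // Stand-in for a non-serializable dependency (e.g. a driver or connection pool).
    static class HeavyClient {
        String process(String s) {
            return s;
        }
    }
}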
I am generating an auto-increment number as in the code below.
public class CoachRegForm extends javax.swing.JFrame {

    /**
     * Creates new form CoachRegForm
     */
    private static int counter = 10000;
    final private int coachId;

    public CoachRegForm() {
        initComponents();
        coachId = counter++;
        String staffIDInString = new Integer(coachId).toString();
        CoachRegFormIDShowLab.setText(staffIDInString);
    }
It works within the system, but when I close the program and run it again, it goes back to the default, which is 10000.
Is there any method to carry the last number saved by the program to the next time I open the program?
Your static counter field exists in memory only until your JVM shuts down (i.e., the program ends, for standalone main() applications), so if you want to save and retrieve the counter value across runs, you need a persistent store such as a database (preferable) or the file system.
You can use the JDBC API (java.sql.*) to connect to a database, or the file I/O API (java.io.*) to write to the file system.
A good way to store a simple field value is the Preferences API: http://www.vogella.com/tutorials/JavaPreferences/article.html
You might also consider using an AtomicInteger to prevent problems with concurrency (within ONE multithreaded instance of your program) in future.
package preferences;

import java.util.concurrent.atomic.AtomicInteger;
import java.util.prefs.Preferences;

public class PrefsAutoinc {

    private Preferences prefs;
    private static final String AI_SETTING_ID = "autoincrement";

    public void testPrefs() {
        prefs = Preferences.userRoot().node(this.getClass().getName());
        AtomicInteger autoinc = new AtomicInteger(prefs.getInt(AI_SETTING_ID, 0));
        System.out.println("last saved value: " + autoinc);
        System.out.println("next: " + autoinc.incrementAndGet());
        System.out.println("next: " + autoinc.incrementAndGet());
        prefs.putInt(AI_SETTING_ID, autoinc.get());
    }

    public static void main(String[] args) {
        PrefsAutoinc test = new PrefsAutoinc();
        test.testPrefs();
    }
}
Of course, if your program uses a database in any other way, then storing the auto-increment value in the database would be a better choice, as @javaguy suggests.
However, if you plan to run multiple instances of your application at the same time and the uniqueness of the id is a must, then using centralized db storage is the only choice, because each instance would increment its own copy of the variable and then save it in preferences, so some values would be lost.
You can store the value in an ini file. You will need to update the value each time it changes.
If you do not want the users to easily detect the value, then you can store it in a database.
In any case, you should only fall back to 10000 when no value has been saved yet, rather than always initializing it to 10000.
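A minimal sketch of the file-based approach (the counter.properties file name and the lastId key are illustrative; the 10000 default is only used when nothing has been saved yet):
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class CounterStore {
    private static final File FILE = new File("counter.properties");

    // Load the last saved id, falling back to 10000 only if no file exists yet.
    public static int load() throws IOException {
        Properties p = new Properties();
        if (FILE.exists()) {
            try (FileInputStream in = new FileInputStream(FILE)) {
                p.load(in);
            }
        }
        return Integer.parseInt(p.getProperty("lastId", "10000"));
    }

    // Persist the last assigned id so the next run continues from it.
    public static void save(int lastId) throws IOException {
        Properties p = new Properties();
        p.setProperty("lastId", Integer.toString(lastId));
        try (FileOutputStream out = new FileOutputStream(FILE)) {
            p.store(out, "last assigned coach id");
        }
    }
}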
Trying to get as many reducers as the number of keys.
public class CustomPartitioner extends Partitioner<Text, Text>
{
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        System.out.println("In CustomP");
        return (key.toString().hashCode()) % numReduceTasks;
    }
}
Driver class
job6.setMapOutputKeyClass(Text.class);
job6.setMapOutputValueClass(Text.class);
job6.setOutputKeyClass(NullWritable.class);
job6.setOutputValueClass(Text.class);
job6.setMapperClass(LastMapper.class);
job6.setReducerClass(LastReducer.class);
job6.setPartitionerClass(CustomPartitioner.class);
job6.setInputFormatClass(TextInputFormat.class);
job6.setOutputFormatClass(TextOutputFormat.class);
But I am getting output in a single file.
Am I doing anything wrong?
You cannot control the number of reducers without specifying it :-). But even then there is no guarantee of getting all the keys on different reducers, because you are not sure how many distinct keys there will be in the input data, and your hash partition function may return the same number for two distinct keys. If you want to achieve your solution, then you'll have to know the number of distinct keys in advance and modify your partition function accordingly.
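As a side note (general Hadoop practice rather than part of the answer above), String.hashCode() can be negative, so the partitioner in the question can also produce a negative partition number; the usual guard, which Hadoop's own HashPartitioner uses, is to mask the sign bit:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CustomPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // Masking the sign bit keeps the result in [0, numReduceTasks).
        return (key.toString().hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}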
You need to specify a number of reduce tasks equal to the number of keys, and you also need to return the partitions based on your keys in the partitioner class. For example, if your input has 4 keys (here it is wood, Masonry, Reinforced Concrete, etc.), then your getPartition method would look like this:
public int getPartition(Text key, PairWritable value, int numReduceTasks) {
    String s = value.getone();

    if (numReduceTasks == 0) {
        return 0;
    }
    if (s.equalsIgnoreCase("wood")) {
        return 0;
    }
    if (s.equalsIgnoreCase("Masonry")) {
        return 1 % numReduceTasks;
    }
    if (s.equalsIgnoreCase("Reinforced Concrete")) {
        return 2 % numReduceTasks;
    }
    if (s.equalsIgnoreCase("Reinforced Masonry")) {
        return 3 % numReduceTasks;
    } else {
        return 4 % numReduceTasks;
    }
}
The corresponding output will be collected in the respective reducers. Try running from the CLI instead of Eclipse.
You haven't configured the number of reducers to run.
You can configure it using the API below:
job.setNumReduceTasks(10); // change the number according to your cluster
Also, you can set it while executing from the command line:
-D mapred.reduce.tasks=10
Hope this helps.
Veni, you need to chain the tasks as below:
Mapper1 --> Reducer --> Mapper2 (a post-processing mapper which creates a file for each key)
Mapper 2's InputFormat should be NLineInputFormat, so that each line of the reducer output (i.e. each key) gets its own mapper, and each mapper's output will be a separate file for each key.
Mapper 1 and the Reducer are your existing MR job; a rough driver sketch of this chaining follows.
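A rough driver-side sketch of that chaining (LastMapper and LastReducer are the classes from the question's driver; PerKeyFileMapper and the paths are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: the existing Mapper1 --> Reducer job
        Job job1 = Job.getInstance(conf, "existing MR job");
        job1.setJarByClass(ChainedDriver.class);
        job1.setMapperClass(LastMapper.class);
        job1.setReducerClass(LastReducer.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, new Path("intermediate"));
        if (!job1.waitForCompletion(true)) System.exit(1);

        // Job 2: map-only post-processing, one reducer output line (one key) per mapper
        Job job2 = Job.getInstance(conf, "per-key post-processing");
        job2.setJarByClass(ChainedDriver.class);
        job2.setMapperClass(PerKeyFileMapper.class);
        job2.setNumReduceTasks(0);                     // map-only: each mapper writes its own output file
        job2.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job2, 1); // one line per split => one mapper per key
        FileInputFormat.addInputPath(job2, new Path("intermediate"));
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}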
Hope this helps.
Cheers
Nag