How to improve performance in analysing log files using MapReduce - Java

We have to analyze log files using Hadoop, as it can handle large data easily. So I wrote a MapReduce program, but even my MapReduce program is taking a lot of time to get the data.
// Inside map(): split the log line on spaces and, when the fifth-from-last
// token is numeric, emit the selected fields keyed by a constant 1.
String[] keys = value.toString().split(" ");
int keysLength = keys.length;
if (keysLength > 4 && StringUtils.isNumeric(keys[keysLength - 5])) {
    this.keyWords.set(keys[0] + "-" + keys[1] + " " + keys[2] + " " + keys[keysLength - 5] + " " + keys[keysLength - 2]);
    context.write(new IntWritable(1), keyWords);
}
The requirement is that we will mostly have 10 to 15 .gz files, and every .gz file has one log file inside. We have to pull the data from those log files to analyze them.
Sample input in the log file:
2015-09-12 03:39:45.201 [service_client] [anhgv-63ac7ca63ac] [[ACTIVE]
ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)']
INFO TempServerImplementation - || Server: loclhost 121.1.0.0 |
Service Category: Testing | Service Method: add | Application Id: Test
| Status Code: 200 | Duration: 594ms ||
So could someone help me with how I can tune the performance?
Thanks
Sai

You can try using Spark (think of it as in-memory MapReduce); it is 10x to 100x faster than traditional MapReduce. Please check the trade-offs between Hadoop MapReduce and Spark before using it.
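For illustration only, a rough sketch of what the same extraction could look like in Spark's Java API (this is not the poster's code; the input/output paths and the use of commons-lang3 StringUtils are assumptions):

import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LogAnalysis {
    public static void main(String[] args) {
        // Submit with spark-submit, or add .setMaster("local[*]") for local testing.
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("log-analysis"));
        // Spark reads .gz files transparently, but each .gz file still becomes a single partition.
        JavaRDD<String> lines = sc.textFile("hdfs:///logs/*.gz");
        JavaRDD<String> keyWords = lines
                .map(line -> line.split(" "))
                .filter(keys -> keys.length > 4 && StringUtils.isNumeric(keys[keys.length - 5]))
                .map(keys -> keys[0] + "-" + keys[1] + " " + keys[2] + " "
                        + keys[keys.length - 5] + " " + keys[keys.length - 2]);
        keyWords.saveAsTextFile("hdfs:///logs/output");
        sc.stop();
    }
}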

There are two main ways you can speed up your job, input size and variable initialisation.
Input Size
gz is not a splittable format. That means that if you have 15 input gz files, you will only have 15 mappers. I can see from the comments that each gz file is 50MB, so at a generous 10:1 compression ratio, each mapper would be processing 500MB. This can take time, and unless your cluster has fewer than 15 nodes, you'll have nodes that are doing nothing. By uncompressing the data before the MR job you could have more mappers, which would reduce the runtime; a sketch of one way to do that follows.
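For example, a minimal sketch (not from the original post; the HDFS paths and class name are made up) of pre-expanding the .gz files so the job gets splittable plain-text input:

import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class DecompressLogs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical input/output directories on HDFS.
        for (FileStatus status : fs.listStatus(new Path("/logs/compressed"))) {
            if (!status.getPath().getName().endsWith(".gz")) {
                continue;
            }
            String name = status.getPath().getName().replace(".gz", "");
            try (InputStream in = new GZIPInputStream(fs.open(status.getPath()));
                 OutputStream out = fs.create(new Path("/logs/uncompressed/" + name))) {
                // Stream-copy so the whole file is never held in memory.
                IOUtils.copyBytes(in, out, 4096, false);
            }
        }
    }
}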
Variable Initialisation
In the line below:
context.write(new IntWritable(1), keyWords);
you're generating a big overhead by allocating a brand new IntWritable for each output. Instead, why not allocate it once at the top of the class? It doesn't change, so it doesn't need allocating each time.
For example:
private static final IntWritable ONE_WRITABLE = new IntWritable(1);
...
context.write(ONE_WRITABLE, keyWords);
The same applies to the strings you use, " " and "-": assign them as static constants too, and again avoid creating fresh ones each time.
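Putting both suggestions together, the mapper could look roughly like this (a sketch, not the original code; the class name and generic types are assumptions based on the snippet in the question):

import java.io.IOException;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    // Allocated once per mapper instead of once per record.
    private static final IntWritable ONE_WRITABLE = new IntWritable(1);
    private static final String SPACE = " ";
    private static final String DASH = "-";
    private final Text keyWords = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] keys = value.toString().split(SPACE);
        int keysLength = keys.length;
        if (keysLength > 4 && StringUtils.isNumeric(keys[keysLength - 5])) {
            keyWords.set(keys[0] + DASH + keys[1] + SPACE + keys[2] + SPACE
                    + keys[keysLength - 5] + SPACE + keys[keysLength - 2]);
            context.write(ONE_WRITABLE, keyWords);
        }
    }
}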

Related

readFile causes "Could not fulfill resource requirements of job"

I have S3 with terabytes of data, separated into small files of less than 5 MB each.
I am trying to use Flink to process them.
I create the source with the following code.
var inputFormat = new TextInputFormat(null);
inputFormat.setNestedFileEnumeration(true);
return streamExecutionEnvironment.readFile(inputFormat, "s3://name/");
But used memory grows up to the limit, the job is killed, and it is not scheduled again, with the error:
Could not fulfill resource requirements of job
No data reaches the sink.
On a small set of data it works fine.
How can I read the files without using too much memory?
Thanks.
The same behaviour occurs with:
env.fromSource(
    FileSource.forRecordStreamFormat(
            new TextLineFormat(),
            new Path("s3://name/"))
        .monitorContinuously(Duration.ofMillis(10000L))
        .build(),
    WatermarkStrategy.noWatermarks(),
    "MySourceName"
)
The FileSource is the preferred way to ingest data from files. It should be able to handle the sort of scale you are talking about.
See the docs and javadocs for FileSource.
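For reference, a bounded read with FileSource might look roughly like this (a sketch, not the poster's code; in recent Flink releases the line format class is TextLineInputFormat, in older ones it is the TextLineFormat used in the question):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReadS3Files {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Without monitorContinuously() the source is bounded: it enumerates the files
        // once, streams through them, and finishes instead of watching the bucket forever.
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://name/"))
                .build();

        DataStream<String> lines =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-file-source");

        lines.print(); // placeholder sink; replace with your real processing

        env.execute("read-s3-files");
    }
}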
setQueueLimit on the Kinesis producer solved my problem: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/kinesis/#backpressure

How to generate GCS files one after another with Google Cloud Dataflow and Java?

I have a pipeline with one GCS file as input that generates two GCS output files.
One output file contains error info and the other contains normal info.
And I have a Cloud Function with a GCS trigger on the two output files.
I want to do something with the normal info file only when the error info file is 0 bytes.
So the error info file must be generated before the normal info file, so that I can check the size of the error info file.
Now I use two TextIO.Write transforms to generate the two files.
But I cannot control which one is generated first.
In the Cloud Function, I check the size of the error info file with a retry before handling the normal info file.
But Cloud Functions has a timeout limit of 540s, so I cannot keep retrying until the error info file is generated.
How can I handle this in Cloud Dataflow?
Can I generate the error info file before the normal info file programmatically?
You can accomplish sequencing like this by using side inputs. For example,
error_pcoll = ...
good_data_pcoll = ...

error_write_result = error_pcoll | beam.io.WriteToText(...)

(good_data_pcoll
 | beam.Map(
     # This lambda simply emits what it was given.
     lambda element, blocking_side: element,
     # This side input isn't used,
     # but will force error_write_result to be computed first.
     blocking_side=beam.pvalue.AsIterable(error_write_result))
 | beam.io.WriteToText(...))
The Wait PTransform encapsulates this pattern.
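Since the question is about Java, the same pattern in the Java SDK might look roughly like this (a sketch assuming Beam's Wait transform and TextIO's withOutputFilenames(); the bucket paths and class name are made up):

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;

public class OrderedWrites {
    // errorInfo and normalInfo are the two PCollections the pipeline already produces.
    static void writeInOrder(PCollection<String> errorInfo, PCollection<String> normalInfo) {
        // withOutputFilenames() makes the write return a result that can act as a "done" signal.
        WriteFilesResult<Void> errorWrite = errorInfo.apply("WriteErrors",
                TextIO.write().to("gs://my-bucket/error-info").withOutputFilenames());

        // Wait.on(...) holds back the normal-info write until the error-info files exist.
        normalInfo
                .apply("WaitForErrorFile", Wait.on(errorWrite.getPerDestinationOutputFilenames()))
                .apply("WriteNormal", TextIO.write().to("gs://my-bucket/normal-info"));
    }
}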

How to read an output file for collecting stats (post-processing)

Summary
I need to build a set of statistics during a Camel server in-modify-out process, and emit those statistics as one object (a single json log line).
Those statistics need to include:
input file metrics (size/chars/bytes and other, file-section specific measures)
processing time statistics (start/end/duration of processing time, start/end/duration of metrics gathering time)
output file metrics (same as input file metrics, and will be different numbers, output file being changed)
The output file metrics are the problem, as I can't access the file until it's written to disk, and it's not written to disk until process()ing finishes.
Background
A log4j implementation is being used for service logging, but after some tinkering we realised it really doesn't suit the requirement here, as it would output multi-line JSON and embed the JSON into a single top-level field. We need varying top-level fields, depending on the file processed.
The server is expected to deal with multiple file operations asynchronously, and the files vary in size (from tiny to fairly immense - which is one reason we need to iterate stats and measures before we start to tune or review)
Current State
Input file and even processing time stats are working OK, and I'm using the following technique to get them:
Inside the 'process' override method of "MyProcessor" I create a new instance of my JsonLogWriter class. (shortened pseudo code with ellipsis)
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
...

@Component
public class MyProcessor implements Processor {
    ...
    @Override
    public void process(Exchange exchange) throws Exception {
        ...
        JsonLogWriter jlw = new JsonLogWriter();
        jlw.logfilePath = jsonLogFilePath;
        jlw.inputFilePath = inFilePath;
        jlw.outputfilePath = outFilePath;
        ...
        jlw.metricsInputFile(); // gathers metrics using inputFilePath - OK
        ...
        // the input file is processed/changed and returned as an InputStream:
        InputStream result = myEngine.readAndUpdate(inFilePath);
        // ... get timings
        jlw.write();
    }
From this you can see that JsonLogWriter has:
properties for file paths (input file, output file, log output),
a set of methods to populate data,
a method to emit the data to a file (once ready).
Once I have populated all the json objects in the class, I call the write() method and the class pulls all the json objects together,
and the stats all arrive in a log file (in a single line of json) - OK.
Error - no output file (yet)
If I use the metricsOutputFile method however:
InputStream result = myEngine.readAndUpdate(inFilePath);
// ... get timings
jlw.metricsOutputFile(); // using outputfilePath
jlw.write();
}
... the JsonLogWriter fails as the file doesn't exist yet.
java.nio.file.NoSuchFileException: aroute\output\a_long_guid_filename
When debugging, I can't see any part of the exchange or result objects which I might pipe into a file read/statistics gathering process.
Will this require more camel routes to solve? What might be an alternative approach where I can get all the stats from input and output files and keep them in one object / line of json?
(very happy to receive constructive criticism - as in why is your Java so heavy-handed - and yes it may well be, I am prototyping solutions at this stage, so this isn't production code, nor do I profess deep understanding of Java internals - I can usually get stuff working though)
Use one route and two processors: one for writing the file and the next for reading the file, so one finishes writing before the other starts reading.
Or you can use two routes: one that writes the file (to:file) and another that listens for and reads that file (from:file); see the sketch below.
You can check the common EIP patterns that will solve most of these questions here:
https://www.enterpriseintegrationpatterns.com/patterns/messaging/
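As an illustration of the two-route option, a rough sketch in the Java DSL (the directory names and OutputMetricsProcessor are made up; MyProcessor is the processor from the question):

import org.apache.camel.builder.RouteBuilder;

public class StatsRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Route 1: gather input metrics, transform the payload, write the output file.
        from("file:input")
            .process(new MyProcessor())
            .to("file:output");

        // Route 2: triggered only once the output file exists on disk, so it can
        // safely read it, gather the output-file metrics and emit the json log line.
        from("file:output?noop=true")
            .process(new OutputMetricsProcessor()); // hypothetical second processor
    }
}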

Why does processing a file with one very long single line as input give different numbers of records?

I use Spark 1.2.1 (in local mode) to extract and process log information from a file.
The size of the file could be more than 100 MB. The file contains one very long single line, so I'm using a regular expression to split this file into log data rows.
MyApp.java
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> txtFileRdd = sc.textFile(filename);
JavaRDD<MyLog> logRDD = txtFileRdd.flatMap(LogParser::parseFromLogLine).cache();
LogParser.java
public static Iterable<MyLog> parseFromLogLine(String logline) {
    List<MyLog> logs = new LinkedList<MyLog>();
    Matcher m = PATTERN.matcher(logline);
    while (m.find()) {
        logs.add(new MyLog(m.group(0)));
    }
    System.out.println("Logs detected " + logs.size());
    return logs;
}
The actual size of the processed file is about 100 MB and it actually contains 323863 log items.
When I use Spark to extract my log items from the file I get 455651 [logRDD.count()] log items, which is not correct.
I think it happens because of file partitions, checking the output I see the following:
Logs detected 18694
Logs detected 113104
Logs detected 323863
And the total sum is 455651!
So I see that my partitions overlap each other, producing duplicate items, and I'd like to prevent that behaviour.
The workaround is using repartition(1) as follows:
txtFileRdd.repartition(1).flatMap(LogParser::parseFromLogLine).cache();
That does give me the desired result 323863, but I doubt that it's good for performance.
How can I do the processing better performance-wise?
The partitioning is line-based by default. This fails in an interesting way when there is a single very long line, it seems. You could consider filing a bug for this (maybe there is one already).
The splitting is performed by the Hadoop file API, specifically the TextInputFormat class. One option is to specify your own InputFormat (which could include your entire parser) and use sc.hadoopFile.
Another option is to set a different delimiter via textinputformat.record.delimiter:
// Use space instead of newline as the delimiter.
sc.hadoopConfiguration.set("textinputformat.record.delimiter", " ")
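The snippet above is Scala; with the Java API used in the question the equivalent would be roughly the following (hadoopConfiguration() is a method on JavaSparkContext; depending on the Hadoop version behind textFile you may need to go through newAPIHadoopFile instead):

// Use a space instead of newline as the record delimiter.
sc.hadoopConfiguration().set("textinputformat.record.delimiter", " ");
JavaRDD<String> txtFileRdd = sc.textFile(filename);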

Sorting a 100MB XML file with Java?

How long does sorting a 100MB XML file with Java take?
The file has items with the following structure and I need to sort them by event
<doc>
<id>84141123</id>
<title>kk+ at Hippie Camp</title>
<description>photo by SFP</description>
<time>18945840</time>
<tags>elphinstone tribalharmonix vancouver intention intention7 newyears hippiecamp bc sunshinecoast woowoo kk kriskrug sunglasses smoking unibomber møtleykrüg </tags>
<geo></geo>
<event>47409</event>
</doc>
I'm on an Intel Dual Duo Core with 4GB RAM.
Minutes ? Hours ?
thanks
Here are the timings for a similar task executed using Saxon XQuery on a 100Mb input file.
Saxon-EE 9.3.0.4J from Saxonica
Java version 1.6.0_20
Analyzing query from {for $i in //item order by location return $i}
Analysis time: 195 milliseconds
Processing file:/e:/javalib/xmark/xmark100.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/e:/javalib/xmark/xmark100.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 6158 milliseconds
Tree size: 4787932 nodes, 79425460 characters, 381878 attributes
Execution time: 3.466s (3466ms)
Memory used: 471679816
So: about 6 seconds for parsing the input file and building a tree, 3.5 seconds for sorting it. That's invoked from the command line, but invoking it from Java will get very similar performance. Don't try to code the sort yourself - it's only a one-line query, and you are very unlikely to match the performance of an optimized XQuery engine.
I would say minutes - you should be able to do that completely in memory, so with a SAX parser it would be reading-sorting-writing; that should not be a problem for your hardware.
I think a problem like this would be better sorted using serialisation.
Deserialise the XML file into an ArrayList of 'doc'.
Using straight Java code, apply a sort on the event element and store the sorted ArrayList in another variable.
Serialise the sorted 'doc' ArrayList out to a file.
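A minimal sketch of that approach using JAXB (assuming the <doc> items are wrapped in a single root element, here called <docs>, and that javax.xml.bind is available; class and file names are made up):

import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement(name = "docs")
@XmlAccessorType(XmlAccessType.FIELD)
class Docs {
    @XmlElement(name = "doc")
    List<Doc> docs = new ArrayList<Doc>();
}

@XmlAccessorType(XmlAccessType.FIELD)
class Doc {
    long id;
    String title;
    String description;
    long time;
    String tags;
    String geo;
    long event;
}

public class SortDocs {
    public static void main(String[] args) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(Docs.class);
        // 1. Deserialise the XML file into a list of doc items.
        Docs all = (Docs) ctx.createUnmarshaller().unmarshal(new File("input.xml"));
        // 2. Sort by the event element.
        all.docs.sort(Comparator.comparingLong(d -> d.event));
        // 3. Serialise the sorted list back out.
        Marshaller m = ctx.createMarshaller();
        m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        m.marshal(all, new File("sorted.xml"));
    }
}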
If you do it in memory, you should be able to do this in under 10 seconds. You would be pushed to do it in under 2 seconds, because it will spend that much time reading/writing to disk.
This program should use no more than 4-5x the original file size, about 500 MB in your case.
// Rough approach using Apache Commons IO; sorts the <doc> blocks by their numeric <id>
// (swap "<id>"/"</id>" for "<event>"/"</event>" to sort by event as the question asks).
String[] records = FileUtils.readFileToString(new File("my-file.xml")).split("</?doc>");
Map<Long, String> recordMap = new TreeMap<Long, String>(); // TreeMap keeps keys sorted
for (int i = 1; i < records.length; i += 2) {
    String record = records[i];
    int pos1 = record.indexOf("<id>");
    int pos2 = record.indexOf("</id>", pos1 + 4);
    long num = Long.parseLong(record.substring(pos1 + 4, pos2));
    recordMap.put(num, record);
}
StringBuilder sb = new StringBuilder(records[0]);
for (String s : recordMap.values()) {
    sb.append("<doc>").append(s).append("</doc>");
}
sb.append(records[records.length - 1]);
FileUtils.writeStringToFile(new File("my-output-file.xml"), sb.toString());
