Hadoop MapReduce with input size ~2 MB slow - Java

I tried to distribute a calculation using Hadoop.
I am using SequenceFile input and output, and custom Writables.
The input is a list of triangles, at most about 2 MB, though it can also be as small as around 50 KB.
The intermediate values and the output are a map(int, double) held in the custom Writable.
Is this the bottleneck?
The problem is that the calculation is much slower than the version without Hadoop. Also, increasing the number of nodes from 2 to 10 doesn't speed up the process.
One possibility is that I don't get enough mappers because of the small input size.
I ran tests changing mapreduce.input.fileinputformat.split.maxsize, but it only got worse, not better.
I am using Hadoop 2.2.0 locally, and on Amazon Elastic MapReduce.
Did I overlook something? Or is this just the kind of task that should be done without Hadoop? (It's my first time using MapReduce.)
Would you like to see code parts?
Thank you.
public void map(IntWritable triangleIndex, TriangleWritable triangle, Context context) throws IOException, InterruptedException {
    StationWritable[] stations = kernel.newton(triangle.getPoints());
    if (stations != null) {
        for (StationWritable station : stations) {
            context.write(new IntWritable(station.getId()), station);
        }
    }
}
class TriangleWritable implements Writable {

    private final float[] points = new float[9];

    // getter used by the mapper shown above
    public float[] getPoints() {
        return points;
    }

    @Override
    public void write(DataOutput d) throws IOException {
        for (int i = 0; i < 9; i++) {
            d.writeFloat(points[i]);
        }
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        for (int i = 0; i < 9; i++) {
            points[i] = di.readFloat();
        }
    }
}
public class StationWritable implements Writable {

    private int id;
    private final TIntDoubleHashMap values = new TIntDoubleHashMap();

    // no-arg constructor required by Hadoop when deserializing the Writable
    StationWritable() {
    }

    StationWritable(int iz) {
        this.id = iz;
    }

    // getter used by the mapper shown above
    public int getId() {
        return id;
    }

    @Override
    public void write(DataOutput d) throws IOException {
        d.writeInt(id);
        d.writeInt(values.size());
        TIntDoubleIterator iterator = values.iterator();
        while (iterator.hasNext()) {
            iterator.advance();
            d.writeInt(iterator.key());
            d.writeDouble(iterator.value());
        }
    }

    @Override
    public void readFields(DataInput di) throws IOException {
        id = di.readInt();
        int count = di.readInt();
        for (int i = 0; i < count; i++) {
            values.put(di.readInt(), di.readDouble());
        }
    }
}

You won't get any benefit from Hadoop with only 2 MB of data. Hadoop is all about big data. Distributing the 2 MB to your 10 nodes costs more time than just doing the job on a single node. The real benefit starts with a high number of nodes and huge amounts of data.

If the processing is really that complex, you should be able to realize a benefit from using Hadoop.
The common issue with small files is that Hadoop will run a single Java process per file, which creates overhead from having to start many processes and slows things down. In your case this does not sound like it applies. More likely you have the opposite problem: only one mapper is trying to process your input, and it doesn't matter how big your cluster is at that point. Using the input split sounds like the right approach, but because your use case is specialized and deviates significantly from the norm, you may need to tweak a number of components to get the best performance.
So you should be able to get the benefits you are seeking from Hadoop MapReduce, but it will probably take significant tuning and custom input handling.
That said, seldom (never?) will MapReduce be faster than a purpose-built solution. It is a generic tool that is useful in that it can be used to distribute and solve many diverse problems without the need to write a purpose-built solution for each.
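As an illustration of the split tuning mentioned above, here is a minimal driver sketch that caps the split size so even a ~2 MB SequenceFile is broken into several splits (and mappers). The class name, job name, input path and the 64 KB value are illustrative placeholders, not recommendations:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class TriangleJobDriver { // hypothetical driver class
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap the split size so the tiny input still yields several mappers.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024);
        Job job = Job.getInstance(conf, "triangle-calculation");
        job.setJarByClass(TriangleJobDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper, reducer, output format and output path configured as usual
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}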

So in the end I figured out a way to avoid storing intermediate values in Writables and keep them only in memory. This way it is faster.
But still, a non-Hadoop solution is the best in this use case.
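One common way to keep intermediate values in memory inside the mapper is in-mapper aggregation: buffer results per station and emit them once in cleanup(). A rough sketch using the question's types follows; the merge rule is left open since it depends on the calculation, and getValues() is a hypothetical accessor not shown in the original class:
private final Map<Integer, StationWritable> buffer = new HashMap<>();

@Override
public void map(IntWritable triangleIndex, TriangleWritable triangle, Context context) {
    StationWritable[] stations = kernel.newton(triangle.getPoints());
    if (stations == null) {
        return;
    }
    for (StationWritable station : stations) {
        StationWritable existing = buffer.get(station.getId());
        if (existing == null) {
            buffer.put(station.getId(), station);
        } else {
            // combine the two stations' value maps; the exact merge rule depends on the calculation
            existing.getValues().putAll(station.getValues()); // getValues() is hypothetical
        }
    }
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    // emit the aggregated results once per mapper instead of once per triangle
    for (Map.Entry<Integer, StationWritable> entry : buffer.entrySet()) {
        context.write(new IntWritable(entry.getKey()), entry.getValue());
    }
}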

Related

Which is the best approach to read a slowly changing lookup and enrich a streaming input collection?

I'm using Apache Beam, with a streaming collection of 1.5 GB.
My lookup table is a JdbcIO MySQL response.
When I run the pipeline without the side input, my job finishes in about 2 minutes. When I run it with the side input, my job never finishes; it gets stuck and dies.
Here is the code I use to store my lookup (~1M records):
PCollectionView<Map<String, String>> sideData = pipeline.apply(JdbcIO.<KV<String, String>>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.jdbc.Driver", "jdbc:mysql://ip")
                .withUsername("username")
                .withPassword("password"))
        .withQuery("select a_number from cell")
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
        .withRowMapper(new JdbcIO.RowMapper<KV<String, String>>() {
            public KV<String, String> mapRow(ResultSet resultSet) throws Exception {
                return KV.of(resultSet.getString(1), resultSet.getString(1));
            }
        })).apply(View.asMap());
Here is the code for my streaming collection:
pipeline
    .apply("ReadMyFile", TextIO.read().from("/home/data/**")
            .watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.<String>never()))
    .apply(Window.<String>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
            .accumulatingFiredPanes()
            .withAllowedLateness(ONE_DAY))
Here is the code of my ParDo, which iterates over each event row (of 10M records):
    .apply(ParDo.of(new DoFn<KV<String, Integer>, KV<String, Integer>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            KV<String, Integer> i = c.element();
            String sideInputData = c.sideInput(sideData).get(i.getKey());
            if (sideInputData == null) {
                c.output(i);
            }
        }
    }).withSideInputs(sideData));
I'm using a Flink cluster, but the direct runner gives the same result.
Cluster:
2 CPUs
6 cores
24 GB RAM
What am I doing wrong?
I've followed this
The solution was to create a "cache" map.
The side input only triggers once, and then I cache it into a map-equivalent structure.
That way I avoid doing a sideInput lookup for every processElement call.
    .apply(ParDo.of(new DoFn<KV<String, Integer>, KV<String, Integer>>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            if (isFirstTime) {
                myList = c.sideInput(sideData);
                isFirstTime = false;
            }
            if (!myList.containsKey(c.element().getKey())) {
                c.output(c.element());
            }
        }
    }).withSideInputs(sideData));
Given that it runs with much less data, I suspect the program is using up all the memory of the Java process. You can monitor that through JVisualVM or JConsole. There are many articles that cover the problem; I just stumbled upon this one with a quick Google search.
If memory runs out, your Java process is mostly busy cleaning up memory and you see a huge performance decline. At some point, Java gives up and fails.
To solve the issue, it should be enough to increase the Java heap size. How you increase that depends on how and where you execute the pipeline. Look at Java's -Xmx parameter or a heap option for your runner.
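For example, with the direct runner the heap of the launching JVM is what matters (the jar name, main class and the 8g value below are placeholders):
java -Xmx8g -cp my-pipeline.jar com.example.MyPipeline --runner=DirectRunner
On a Flink cluster, the task manager heap is configured in flink-conf.yaml instead; the exact key depends on the Flink version (e.g. taskmanager.heap.size in older releases).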

SOLR 7+ / Lucene 7+ and performance issues with DelegatingCollector and PostFilter

Here is the situation I am facing.
I am migrating from SOLR 4 to SOLR 7.
SOLR 4 is running on Tomcat 8, SOLR 7 runs with built in Jetty 9.
The largest core contains about 1,800,000 documents (about 3 GB).
The migration went through smoothly. But something's bothering me.
I have a PostFilter to collect only some documents according to a pre-selected list.
Here is the code for the org.apache.solr.search.DelegatingCollector:
@Override
protected void doSetNextReader(LeafReaderContext context) throws IOException {
    this.reader = context.reader();
    super.doSetNextReader(context);
}

@Override
public void collect(int docNumber) throws IOException {
    if (null != this.reader && isValid(this.reader.document(docNumber).get("customid"))) {
        super.collect(docNumber);
    }
}
private boolean isValid(String customId) {
    boolean valid = false;
    if (null != customMap) { // HashMap<String, String>, contains the custom IDs to keep; about 2k items on average
        valid = customMap.get(customId) != null;
    }
    return valid;
}
And here is an example of a query sent to SOLR:
/select?fq=%7B!MyPostFilter%20sessionid%3DWST0DEV-QS-5BEEB1CC28B45580F92CCCEA32727083&q=system%20upgrade
So, the problem is:
It runs pretty fast on SOLR 4, with average QTime equals to 30.
But now on SOLR 7, it is awfully slow with average QTime around 25000!
And I am wondering what could be the source of such bad performance...
With a very simplified (or should I say transparent) collect function (see below), there is no degradation. This test is just to exclude the server/platform from the equation.
@Override
public void collect(int docNumber) throws IOException {
    super.collect(docNumber);
}
My guess is that since Lucene 7 there have been drastic changes in the way the API accesses documents, but I am not sure I have understood everything.
I got that from this post: How to get DocValue by document ID in Lucene 7+?
I suppose this has something to do with the issues I am facing.
But I have no idea how to upgrade/change my PostFilter and/or DelegatingCollector to get back to good performance.
If any Lucene/SOLR experts could provide some hints or leads, it would be much appreciated.
Thanks in advance.
PS:
In the core schema:
<field name="customid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
This field is string-type as it can be something like "100034_001".
In the solrconfig.xml:
<queryParser name="MyPostFilter" class="solrpostfilter.MyQueryPaser"/>
I can share the full schema and solrconfig.xml files if needed but so far, there is no other particular configuration in there.
EDIT
After some digging in the API, I changed the collect function to the following:
@Override
public void collect(int docNumber) throws IOException {
    if (null != reader) {
        SortedDocValues sortedDocValues = reader.getSortedDocValues("customid");
        if (sortedDocValues.advanceExact(docNumber) && isValid(sortedDocValues.binaryValue().utf8ToString())) {
            super.collect(docNumber);
        }
    }
}
Now QTime is down to an average of 1100, which is much, much better but still far from the 30 I had with SOLR 4.
Not sure it is possible to improve this even more, but any other advice/comment is still very welcome.
/cheers
Use a filter query instead of a post filter.
This answer does not attempt to increase the performance of the post filter, but uses a different approach. Nevertheless, I got much better results (a factor of 10) than with any improvements made to the post filter.
Check out my code here: https://github.com/cheffe/solr-postfilter-sample
increase maxBooleanClauses
Visit your solrconfig.xml. There, add or adjust the <query> ... </query> element to contain a child element maxBooleanClauses with a value of 10024.
<query>
  <!-- other content left out -->
  <maxBooleanClauses>10024</maxBooleanClauses>
</query>
This will allow you to add a large filter query instead of a post filter.
add all customids as filter query
This query got huge, but the performance was just way better.
fq=customid:(0_001 1_001 2_001 3_001 4_001 5_001 ... 4999_001 5000_001)
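If the ids are available in the client as a collection, such a filter query can be assembled with plain string handling, for example with SolrJ (a sketch; the query text and the customIds list are placeholders):
import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;

List<String> customIds = Arrays.asList("0_001", "1_001", "2_001");
SolrQuery query = new SolrQuery("system upgrade");
query.addFilterQuery("customid:(" + String.join(" ", customIds) + ")");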
comparison of execution time in contrast to post filter
For 5,000 ids, the post filter took 320 ms; the filter query, in contrast, took 22 ms for the same number of ids.
Following the advice of Toke Eskildsen on Solr's user mailing list, in a thread that is quite similar to your question, I got the response time down from 300 ms to 100 ms. Feel free to link my GitHub repository on the mailing list. Maybe they have further advice.
These measures were the most effective
store the reference to the SortedDocValues during doSetNextReader
use org.apache.lucene.index.DocValues to get the above
preprocess the given String objects to org.apache.lucene.util.BytesRef during the parsing of the Query
public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
        private SortedDocValues sortedDocValues;

        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException {
            super.doSetNextReader(context);
            // store the reference to the SortedDocValues
            // use org.apache.lucene.index.DocValues to do so
            sortedDocValues = DocValues.getSorted(context.reader(), "customid");
        }

        @Override
        public void collect(int docNumber) throws IOException {
            if (sortedDocValues.advanceExact(docNumber) && isValid(sortedDocValues.binaryValue())) {
                super.collect(docNumber);
            }
        }

        private boolean isValid(BytesRef customId) {
            return customSet.contains(customId);
        }
    };
}
Within the extension of the QParserPlugin I convert the given String to org.apache.lucene.util.BytesRef.
@Override
public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
            int idCount = localParams.getInt("count", 2000);
            HashSet<BytesRef> customSet = new HashSet<>(idCount);
            for (int id = 0; id < idCount; id++) {
                String customid = id % 200000 + "_" + String.format("%03d", 1 + id / 200000);
                customSet.add(new BytesRef(customid));
            }
            return new IdFilter(customSet);
        }
    };
}

Dataflow Splittable ReadFn not using multiple workers

I have a particularly simple Dataflow pipeline where I want to read a file and output its parsed records to Avro. This works in most cases, except when the source file is particularly large (20+ GB), which causes me to OOM even on machines with particularly large memory. I am pretty sure this happens because the non-splittable source is read in its entirety by Beam, so I implemented a splittable DoFn<FileIO.ReadableFile, GenericRecord>.
This functionally works in that the pipeline now succeeds, which seems to validate my assumption that the single large batch from a non-splittable file is the cause. However, it does not seem to spread the work across multiple workers. I tried the following:
Disabled autoscaling (autoscalingAlgorithm=NONE) and set numWorkers to 10. This had the same throughput as numWorkers 1
Left autoscaling on with a high maxWorkers. This went briefly up to 2, and then came back down to 1
Added a shuffle (Reshuffle.viaRandomKey) after the DoFn, but before the Avro write
Any ideas? The exact code is difficult to share because of company policy, but overall it is pretty simple. I implemented the following:
public class SplittableReadFn extends DoFn<FileIO.ReadableFile, GenericRecord> {
    // ...

    @ProcessElement
    public void process(final ProcessContext c, final OffsetRangeTracker tracker) {
        final FileIO.ReadableFile file = c.element();
        // Followed by something like
        ReadableByteStream in = file.open();
        in.seek(tracker.from());
        Parser parser = new Parser(in);
        while (parser.next()) {
            if (parser.getOffset() > tracker.to()) {
                break;
            }
            tracker.tryClaim(parser.getOffset());
            c.output(parser.item());
        }
        tracker.markDone();
    }

    @GetInitialRestriction
    public OffsetRange getInitialRestriction(final FileIO.ReadableFile file) {
        return new OffsetRange(0, getSize(file) - 1);
    }

    @SplitRestriction
    public void splitRestriction(final FileIO.ReadableFile file, final OffsetRange restriction, final DoFn.OutputReceiver<OffsetRange> receiver) {
        // chunkRange for test purposes just breaks into at most 500 MB chunks
        for (final OffsetRange chunk : chunkRange(restriction)) {
            receiver.output(chunk);
        }
    }
}
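For context, a sketch of how such a splittable DoFn might be wired into the pipeline described above (the file pattern, schema variable and output path are placeholders; the Reshuffle is the one mentioned earlier):
pipeline
    .apply(FileIO.match().filepattern("gs://my-bucket/input/big-file"))
    .apply(FileIO.readMatches())
    .apply(ParDo.of(new SplittableReadFn()))
    .apply(Reshuffle.viaRandomKey())
    .apply(AvroIO.writeGenericRecords(schema).to("gs://my-bucket/output/records"));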

Is it possible to reinitialize static mutable fields in a class?

I'm trying to automate the testing process for custom-written programs designed to solve competitive programming challenges. Below is a dummy sample implementation of Solution:
public class Solution {

    private static String dummyField = "initial";

    public static int initialize(InputStream in) {
        // competitive programmer custom code
        System.out.println(dummyField);
        dummyField = "changed";
        int subCaseCount = 0; // in a real solution this would be read from the input
        return subCaseCount;
    }

    public void processSingleSubCase(InputStream in) {
        // competitive programmer custom code
    }
}
Prewritten test code for the solution, regardless of its implementation:
public void testSolution() throws FileNotFoundException {
    for (File testResource : testResources) {
        InputStream in = new FileInputStream(testResource);
        int subCaseCount = Foo.initialize(in);
        for (int subCase = 0; subCase < subCaseCount; subCase++) {
            new Foo().processSingleSubCase(in);
        }
        // magic call to re-init all static fields without knowing their number/names in advance goes here
    }
    // console current output:
    // initial
    // changed
    // changed
    // ...
    // desired:
    // initial
    // initial
    // initial
    // ....
}
The static fields can be mutable, so caching the initial values and mapping them to field names using reflection as a first step, then reassigning them between iterations, won't do on its own.
I did manage to come up with a working solution that basically reloads the class using a different class loader between iterations. It did work, but it was slow: it took about 50 seconds just to reload classes 300 times (test resources are auto-generated, and I'd like the flexibility to generate as many as tolerable).
Is there a faster alternative?
My two thoughts for how to do this are:
Use instances rather than statics, since that way the new instance for each test is fresh.
If you need (or want) to stick with statics: prior to the first test, cache the static values, then reassign them from the cache between tests. If the static values are object references referring to mutable objects, you'll need to make deep copies (a rough sketch follows below).
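A minimal sketch of that caching approach, placed in the test class and assuming the fields of interest are primitives or immutable values (mutable values would need deep copies where noted):
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashMap;
import java.util.Map;

static Map<Field, Object> snapshotStatics(Class<?> clazz) throws IllegalAccessException {
    Map<Field, Object> snapshot = new HashMap<>();
    for (Field field : clazz.getDeclaredFields()) {
        if (Modifier.isStatic(field.getModifiers()) && !Modifier.isFinal(field.getModifiers())) {
            field.setAccessible(true);
            snapshot.put(field, field.get(null)); // deep-copy the value here if it is mutable
        }
    }
    return snapshot;
}

static void restoreStatics(Map<Field, Object> snapshot) throws IllegalAccessException {
    for (Map.Entry<Field, Object> entry : snapshot.entrySet()) {
        entry.getKey().set(null, entry.getValue());
    }
}
In the test loop, snapshotStatics(Foo.class) would be called once before the first resource and restoreStatics(snapshot) between iterations.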

When using HBase as a source for MapReduce, can I extend TableInputFormatBase to create multiple splits and multiple mappers for each region?

I'm thinking about using HBase as a source for one of my MapReduce jobs. I know that TableInputFormat specifies one input split (and thus one mapper) per Region. However, this seems inefficient. I'd really like to have multiple mappers working on a given Region at once. Can I achieve this by extending TableInputFormatBase? Can you please point me to an example? Furthermore, is this even a good idea?
Thanks for the help.
You need a custom input format that extends InputFormat. You can get an idea of how to do this from the answer to the question "I want to scan lots of data (range-based queries); what optimizations can I do while writing the data so that the scan becomes faster?". This is a good idea if the data processing time is much greater than the data retrieval time.
Not sure if you can specify multiple mappers for a given region, but consider the following:
If you think one mapper per region is inefficient (maybe your data nodes don't have enough resources, e.g. #cpus), you can perhaps specify a smaller region size in hbase-site.xml (see the sketch below).
Here is a page listing the default config options if you want to look into changing that:
http://hbase.apache.org/configuration.html#hbase_default_configurations
Please note that by making the region size smaller, you will be increasing the number of files in your DFS, and this can limit the capacity of your Hadoop DFS depending on the memory of your namenode. Remember, the namenode's memory usage is directly related to the number of files in your DFS. This may or may not be relevant to your situation, as I do not know how your cluster is being used. There is never a silver-bullet answer to these questions!
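As a sketch, the region size is controlled by hbase.hregion.max.filesize in hbase-site.xml (the 1 GB value below is purely illustrative; check the defaults page linked above before changing it):
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value> <!-- 1 GB: smaller regions mean more regions, and therefore more mappers -->
</property>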
1. It's absolutely fine; just make sure the key sets are mutually exclusive between the mappers.
2. Make sure you aren't creating too many clients, as this may lead to a lot of GC, since HBase block cache churning happens during HBase reads.
Using this MultipleScanTableInputFormat, you can use the MultipleScanTableInputFormat.PARTITIONS_PER_REGION_SERVER configuration to control how many mappers should execute against a single region server. The class groups all the input splits by their location (region server), and the RecordReader properly iterates through all the aggregated splits for the mapper.
Here is the example:
https://gist.github.com/bbeaudreault/9788499#file-multiplescantableinputformat-java-L90
That way you have created multiple aggregated splits for a single mapper:
private List<InputSplit> getAggregatedSplits(JobContext context) throws IOException {
    final List<InputSplit> aggregatedSplits = new ArrayList<InputSplit>();
    final Scan scan = getScan();
    for (int i = 0; i < startRows.size(); i++) {
        scan.setStartRow(startRows.get(i));
        scan.setStopRow(stopRows.get(i));
        setScan(scan);
        aggregatedSplits.addAll(super.getSplits(context));
    }
    // set the state back to where it was..
    scan.setStopRow(null);
    scan.setStartRow(null);
    setScan(scan);
    return aggregatedSplits;
}
Create partitions by region server:
@Override
public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> source = getAggregatedSplits(context);
    if (!partitionByRegionServer) {
        return source;
    }
    // Partition by regionserver
    Multimap<String, TableSplit> partitioned = ArrayListMultimap.<String, TableSplit>create();
    for (InputSplit split : source) {
        TableSplit cast = (TableSplit) split;
        String rs = cast.getRegionLocation();
        partitioned.put(rs, cast);
    }
    // (remainder of the method, which builds one aggregated split per region server, omitted)
This would be useful if you want to scan large regions (hundreds of millions of rows) with a conditioned scan that finds only a few records. It will prevent ScannerTimeoutException.
package org.apache.hadoop.hbase.mapreduce;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class RegionSplitTableInputFormat extends TableInputFormat {

    public static final String REGION_SPLIT = "region.split";

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        int regionSplitCount = conf.getInt(REGION_SPLIT, 0);
        List<InputSplit> superSplits = super.getSplits(context);
        if (regionSplitCount <= 0) {
            return superSplits;
        }
        List<InputSplit> splits = new ArrayList<InputSplit>(superSplits.size() * regionSplitCount);
        for (InputSplit inputSplit : superSplits) {
            TableSplit tableSplit = (TableSplit) inputSplit;
            System.out.println("splitting by " + regionSplitCount + " " + tableSplit);
            byte[] startRow0 = tableSplit.getStartRow();
            byte[] endRow0 = tableSplit.getEndRow();
            boolean discardLastSplit = false;
            if (endRow0.length == 0) {
                endRow0 = new byte[startRow0.length];
                Arrays.fill(endRow0, (byte) 255);
                discardLastSplit = true;
            }
            byte[][] split = Bytes.split(startRow0, endRow0, regionSplitCount);
            if (discardLastSplit) {
                split[split.length - 1] = new byte[0];
            }
            for (int regionSplit = 0; regionSplit < split.length - 1; regionSplit++) {
                byte[] startRow = split[regionSplit];
                byte[] endRow = split[regionSplit + 1];
                TableSplit newSplit = new TableSplit(tableSplit.getTableName(), startRow, endRow,
                        tableSplit.getLocations()[0]);
                splits.add(newSplit);
            }
        }
        return splits;
    }
}
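A possible job setup for the class above (the table name, mapper and key/value classes are placeholders; the value 4 asks for four splits per region):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = HBaseConfiguration.create();
conf.setInt(RegionSplitTableInputFormat.REGION_SPLIT, 4);
Job job = Job.getInstance(conf, "region-split-scan");
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false); // recommended for MapReduce scans
TableMapReduceUtil.initTableMapperJob("my_table", scan, MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
job.setInputFormatClass(RegionSplitTableInputFormat.class); // replace the default TableInputFormat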
