How does apache spark works? - java

I am testing this framework specifically the SQL module with the java API. Running it as standalone in java, I have better results (even running in the same machine) then pure sql (running over pgadmin).
Executing exactly the same query in the same bd (postgres 9.4 and 9.5), sparks is almost 5 x faster, below is my code:
private static final JavaSparkContext sc =
new JavaSparkContext(new SparkConf()
.setAppName("SparkJdbc").setMaster("local[*]"));
public static void main(String[] args) {
StringBuilder sql = new StringBuilder();
sql.append(" my query ");
long time = System.currentTimeMillis();
DbConnection dbConnection = new DbConnection("org.postgresql.Driver", "jdbc:postgresql://localhost:5432/mydb", "user", "pass");
JdbcRDD<Object[]> jdbcRDD = new JdbcRDD<>(sc.sc(), dbConnection, sql.toString(),
1, 1000000000, 20, new MapResult(), ClassManifestFactory$.MODULE$.fromClass(Object[].class));
JavaRDD<Object[]> javaRDD = JavaRDD.fromRDD(jdbcRDD, ClassManifestFactory$.MODULE$.fromClass(Object[].class));
// javaRDD.map((final Object[] record) -> {
// StringBuilder line = new StringBuilder();
// for (Object o : record) {
// line.append(o != null ? o.toString() : "").append(",");
// }
// return line.toString();
// }).collect().forEach(rec -> System.out.println(rec));
System.out.println("Total: " + javaRDD.count());
System.out.println("Total time: " + (System.currentTimeMillis() - time));
}
Even if I uncomment the code to print the results to the console, it still running faster.
I am wondering how can it be faster, if the source (bd) is the same, can anyone explain it? Maybe using parallel query?
EDIT: There is a relevant information about the postgres process, in standby, the postgres uses 12 process and running the query over pgadmin, the number does not change, over spark, it up to 20, that means probably spark uses some parallel mechanism to do the job.

Related

VMWare java SDK: When doesn't an available PerfMetricID report Data?

I'm trying to use vmware sdk for java to collect the perfomance data of each entity (cluster/datastore/Host/VM) in the vmware environment.
The idea is to get the available PerfMetricIds for the target entity with queryAvailablePerfMetric, query those and report the details of the counter, the timestamp and the value.
However when I get the PerfMetricIds for an entity, not every detected (returned) PerfMetricId is reporting data. For example for each Datastore I get at least 4 ids which do not return data when queried, these IDs represent the counters associated with the average number of read and write operations, and for a cluster I'm missing the cpu usage, and so on ...
so I was wondering when does this happen? Shouldn't every metric returned by queryAvailablePerfMetric report data? what am I missing here?
Minimal code snippet:
// VMWare credentials
String vmwareUrl = args[0];
String vmwareUsername = args[1];
String vmwarePassword = args[2];
// connect to vCenter
ServiceInstance si = new ServiceInstance(new URL(vmwareUrl), vmwareUsername, vmwarePassword, true);
// get performance manager
PerformanceManager perfMgr = si.getPerformanceManager();
// define the time window (the last one hour)
Calendar calTo = Calendar.getInstance();
Calendar calFrom = Calendar.getInstance();
calFrom.setTime(calTo.getTime());
calFrom.add(Calendar.HOUR, -1);
// get any datastore for testing purposes
Folder rootFolder = si.getRootFolder();
ManagedEntity[] datastores = new InventoryNavigator(rootFolder).searchManagedEntities("Datastore");
ManagedEntity me = datastores[1];
// query all available metrics for the entity
PerfMetricId[] availablePmis = perfMgr.queryAvailablePerfMetric(me, calFrom, calTo, perfMgr.getHistoricalInterval()[0].getSamplingPeriod());
// create PerfQuerySpec
PerfQuerySpec qSpec = new PerfQuerySpec();
qSpec.setEntity(me.getMOR());
qSpec.setMetricId(availablePmis);
qSpec.setFormat("csv");
qSpec.setStartTime(calFrom);
qSpec.setEndTime(calTo);
// query perf
PerfEntityMetricBase[] perfValues = perfMgr.queryPerf(new PerfQuerySpec[]{qSpec});
// Printing
System.out.println("Found pmis (CounterIDs only): ");
for (PerfMetricId pmi : availablePmis){
System.out.print(pmi.getCounterId() + ", ");
}
System.out.print("\nPmis with values:");
int pmisCount=0;
for (PerfEntityMetricBase value : perfValues) {
PerfMetricSeriesCSV[] csvValues = ((PerfEntityMetricCSV) value).getValue();
pmisCount += csvValues.length;;
for (PerfMetricSeriesCSV csv : csvValues) {
System.out.println("Counter ID: " + csv.getId().getCounterId() + " ---- Metric instance: " + csv.getId().getInstance());
System.out.println("\tInfo: " + ((PerfEntityMetricCSV) value).getSampleInfoCSV());
System.out.println("\tValues: " + csv.getValue());
}
}
System.out.println("---------------");
System.out.println("Detected PMIs: " + availablePmis.length);
System.out.println("PMIs with values: " + pmisCount);
Any help (or discussions) would be appreciated

How to get optimal bulk insertion rate in DynamoDb through Executor Framework in Java?

I'm doing a POC on Bulk write (around 5.5k items) in local Dynamo DB using DynamoDB SDK for Java. I'm aware that each bulk write cannot have more than 25 write operations, so I am dividing the whole dataset into chunks of 25 items each. Then I'm passing these chunks as callable actions in Executor framework. Still, I'm not having a satisfactory result as the 5.5k records are getting inserted in more than 100 seconds.
I'm not sure how else can I optimize this. While creating the table I provisioned the WriteCapacityUnit as 400(not sure what's the maximum value I can give) and experimented with it a bit, but it never made any difference. I have also tried changing the number of threads in executor.
This is the main code to perform the bulk write operation:
public static void main(String[] args) throws Exception {
AmazonDynamoDBClient client = new AmazonDynamoDBClient().withEndpoint("http://localhost:8000");
final AmazonDynamoDB aws = new AmazonDynamoDBClient(new BasicAWSCredentials("x", "y"));
aws.setEndpoint("http://localhost:8000");
JSONArray employees = readFromFile();
Iterator<JSONObject> iterator = employees.iterator();
List<WriteRequest> batchList = new ArrayList<WriteRequest>();
ExecutorService service = Executors.newFixedThreadPool(20);
List<BatchWriteItemRequest> listOfBatchItemsRequest = new ArrayList<>();
while(iterator.hasNext()) {
if (batchList.size() == 25) {
Map<String, List<WriteRequest>> batchTableRequests = new HashMap<String, List<WriteRequest>>();
batchTableRequests.put("Employee", batchList);
BatchWriteItemRequest batchWriteItemRequest = new BatchWriteItemRequest();
batchWriteItemRequest.setRequestItems(batchTableRequests);
listOfBatchItemsRequest.add(batchWriteItemRequest);
batchList = new ArrayList<WriteRequest>();
}
PutRequest putRequest = new PutRequest();
putRequest.setItem(ItemUtils.fromSimpleMap((Map) iterator.next()));
WriteRequest writeRequest = new WriteRequest();
writeRequest.setPutRequest(putRequest);
batchList.add(writeRequest);
}
StopWatch watch = new StopWatch();
watch.start();
List<Future<BatchWriteItemResult>> futureListOfResults = listOfBatchItemsRequest.stream().
map(batchItemsRequest -> service.submit(() -> aws.batchWriteItem(batchItemsRequest))).collect(Collectors.toList());
service.shutdown();
while(!service.isTerminated());
watch.stop();
System.out.println("Total time taken : " + watch.getTotalTimeSeconds());
}
}
This is the code used to create the dynamoDB table:
public static void main(String[] args) throws Exception {
AmazonDynamoDBClient client = new AmazonDynamoDBClient().withEndpoint("http://localhost:8000");
DynamoDB dynamoDB = new DynamoDB(client);
String tableName = "Employee";
try {
System.out.println("Creating the table, wait...");
Table table = dynamoDB.createTable(tableName, Arrays.asList(new KeySchemaElement("ID", KeyType.HASH)
), Arrays.asList(new AttributeDefinition("ID", ScalarAttributeType.S)),
new ProvisionedThroughput(1000L, 1000L));
table.waitForActive();
System.out.println("Table created successfully. Status: " + table.getDescription().getTableStatus());
} catch (Exception e) {
System.err.println("Cannot create the table: ");
System.err.println(e.getMessage());
}
}
DynamoDB Local is provided as a tool for developers who need to develop offline for DynamoDB and is not designed for scale or performance. As such it is not intended for scale testing, and if you need to test bulk loads or other high velocity workloads it is best to use a real table. The actual cost incurred from dev testing on a live table is usually quite minimal as the tables only need to be provisioned for high capacity during the test runs.

DSE Cassandra - Why is Astyanax faster than DataStax java driver

I'm switching a Java application from using com.netflix.astyanax:astyanax-core:1.56.44 to com.datastax.cassandra:cassandra-driver-core:3.1.0.
Right off the bat, with a simple test that inserts a row with a randomly-generated key and then reads the row 1000 times, I'm seeing terrible performance compared to the code using Astyanax. Just using a single-node Cassandra instance running locally. The table I'm testing is simple -- just a blob primary key uuid column, and an int date column.
Here's the basic code with the DataStax driver:
class DataStaxCassandra
{
final Session session;
final PreparedStatement preparedIDWriteCmd;
final PreparedStatement preparedIDReadCmd;
void DataStaxCassandra()
{
final PoolingOptions poolingOptions = new PoolingOptions()
.setConnectionsPerHost(HostDistance.LOCAL, 1, 2)
.setConnectionsPerHost(HostDistance.REMOTE, 1, 1)
.setMaxRequestsPerConnection(HostDistance.LOCAL, 128)
.setMaxRequestsPerConnection(HostDistance.REMOTE, 128)
.setPoolTimeoutMillis(0); // Don't ever wait for a connection to one host.
final QueryOptions queryOptions = new QueryOptions()
.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE)
.setPrepareOnAllHosts(true)
.setReprepareOnUp(true);
final LoadBalancingPolicy dcAwareRRPolicy = DCAwareRoundRobinPolicy.builder()
.withLocalDc("my_laptop")
.withUsedHostsPerRemoteDc(0)
.build();
final LoadBalancingPolicy loadBalancingPolicy = new TokenAwarePolicy(dcAwareRRPolicy);
final SocketOptions socketOptions = new SocketOptions()
.setConnectTimeoutMillis(1000)
.setReadTimeoutMillis(1000);
final RetryPolicy retryPolicy = new LoggingRetryPolicy(DefaultRetryPolicy.INSTANCE);
Cluster.Builder clusterBuilder = Cluster.builder()
.withClusterName("test cluster")
.withPort(9042)
.addContactPoints("127.0.0.1")
.withPoolingOptions(poolingOptions)
.withQueryOptions(queryOptions)
.withLoadBalancingPolicy(loadBalancingPolicy)
.withSocketOptions(socketOptions)
.withRetryPolicy(retryPolicy);
// I've tried both V3 and V2, with lower connections/host and higher reqs/connection settings
// with V3, and it doesn't noticably affect the test performance. Leaving it at V2 because the
// Astyanax version is using V2.
clusterBuilder.withProtocolVersion(ProtocolVersion.V2);
final Cluster cluster = clusterBuilder.build();
session = cluster.connect();
preparedIDWriteCmd = session.prepare(
"INSERT INTO \"mykeyspace\".\"mytable\" (\"uuid\", \"date\") VALUES (?, ?) USING TTL 38880000");
preparedIDReadCmd = session.prepare(
"SELECT \"date\" from \"mykeyspace\".\"mytable\" WHERE \"uuid\"=?");
}
public List<Row> execute(final Statement statement, final int timeout)
throws InterruptedException, ExecutionException, TimeoutException
{
final ResultSetFuture future = session.executeAsync(statement);
try
{
final ResultSet readRows = future.get(timeout, TimeUnit.MILLISECONDS);
final List<Row> resultRows = new ArrayList<>();
// How far we can go without triggering the blocking fetch:
int remainingInPage = readRows.getAvailableWithoutFetching();
for (final Row row : readRows)
{
resultRows.add(row);
if (--remainingInPage == 0) break;
}
return resultRows;
}
catch (final TimeoutException e)
{
future.cancel(true);
throw e;
}
}
private void insertRow(final byte[] id, final int date)
throws InterruptedException, ExecutionException, TimeoutException
{
final ByteBuffer idKey = ByteBuffer.wrap(id);
final BoundStatement writeCmd = preparedIDWriteCmd.bind(idKey, date);
writeCmd.setRoutingKey(idKey);
execute(writeCmd, 1000);
}
public int readRow(final byte[] id)
throws InterruptedException, ExecutionException, TimeoutException
{
final ByteBuffer idKey = ByteBuffer.wrap(id);
final BoundStatement readCmd = preparedIDReadCmd.bind(idKey);
readCmd.setRoutingKey(idKey);
final List<Row> idRows = execute(readCmd, 1000);
if (idRows.isEmpty()) return 0;
final Row idRow = idRows.get(0);
return idRow.getInt("date");
}
}
void perfTest()
{
final DataStaxCassandra ds = new DataStaxCassandra();
final int perfTestCount = 10000;
final long startTime = System.nanoTime();
for (int i = 0; i < perfTestCount; ++i)
{
final String id = UUIDUtils.generateRandomUUIDString();
final byte[] idBytes = Utils.hexStringToByteArray(id);
final int date = (int)(System.currentTimeMillis() / 1000);
try
{
ds.insertRow(idBytes, date);
final int dateRead = ds.readRow(idBytes);
assert(dateRead == date) : "Inserted ID with date " +date +" but date read is " +dateRead;
}
catch (final InterruptedException | ExecutionException | TimeoutException e)
{
System.err.println("ERROR reading ID (test " +(i+1) +") - " +e.toString());
}
}
System.out.println(
perfTestCount +" insert+reads took " +
TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime) +" ms");
}
Is there something I'm doing wrong that would yield poor performance? I was hoping it'd be a decent speed improvement, given that I'm using a pretty old version of Astyanax.
I've tried not wrapping the load balancing policy with TokenAwarePolicy, and getting rid of the "setRoutingKey" lines, just because I know these things definitely shouldn't help when just using a single node as I'm currently doing.
My local Cassandra version is 2.1.15 (which supports native protocol V3), but the machines in our production environment are running Cassandra 2.0.12.156 (which only supports V2).
Keep in mind that this is targeted for an environment with a bunch of nodes and several data centers, which is why I've got the settings the way I do (which the actual values being set from a config file), even though I know for this test I could skip using things like DCAwareRoundRobinPolicy.
Any help would be greatly appreciated! I can also post the code that's using Astyanax, I just figured first it'd be good to make sure nothing's blatantly wrong with my new code.
Thanks!
Tests of 10,000 writes+reads are taking around 30 seconds with the DataStax driver, while with Astyanax, they are taking in the 15-20 second range.
I upped the test count to 100,000 to see if maybe there's some overhead with the DataStax driver that just consumes ~10 seconds at startup, after which they might perform more similarly. But even with 100,000 read/writes:
AstyanaxCassandra 100,000 insert+reads took 156593 ms
DataStaxCassandra 100,000 insert+reads took 294340 ms

Jsoup behaves different from my test PC and the server

I'm testing a web crawler with JSoup. The issue comes when I test the crawler on a regular PC, and works as expected, then I export this web crawler as a jar to work in a server in a cron job. This where the things go wrong.
The code is the same, no changes. The data I'm trying to extract is different comments from the users of how they rate a service, the problem is that the web crawler behaves differently when it's executed in the server, for example: the comments are duplicated, something that doesn't happened when I'm testing the program locally.
Also the web crawler differentiates what language the comments are written (I take that info from the URL, .de for German, .es for Spanish, etc). This info get mixed for example, a comment in Spanish is classified as Portuguese one.
Again I repeat the logic behind the crawler is correct, I tested many times with different input.
What could be the problem behind these issues?
Additional notes:
No exceptions/crashes.
I'm using jsoup 1.9.2.
This is how I get the data from the website:
Document doc = Jsoup.connect(link).userAgent(FakeAgentBooking.getAgent()).timeout(60 * 4000).get();
I already tried to use a proxy just in case the server was banned.
System.getProperties().put("https.proxyHost", "PROXY");
System.getProperties().put("https.proxyPort", "PORT");
System.getProperties().put("https.proxyUser", "USER");
System.getProperties().put("https.proxyPassword", "PASSWORD");
This is the code of the cron job:
#Crawler(name = "Booking comments", nameType = "BOOKING_COMMENTS", sentimetal = true, cron = "${cron.booking.comments}")
public class BookingCommentsJob extends MotherCrawler {
private static final org.slf4j.Logger logger = LoggerFactory.getLogger(BookingCommentsJob.class);
#Value("${full.booking.comments}")
private String full;
#Autowired
private ComentariosCDMXRepository comentariosCDMXRepository;
#Override
#PostConstruct
public void init() {
setInfo(this.getClass().getAnnotation(Crawler.class));
}
#Override
public void exec(int num) {
// <DEBUG>
String startTime = time.format(new Date());
// </DEBUG>
Set<CrawQuery> li = queryManager.getMeQueries(type, num, threadnum);
Gson gson = new GsonBuilder().create();
for (CrawQuery s : li) {
String query = s.getQuery().get("query");
try {
//the crawling begins here-->
String result = BookingComentarios.crawlBookingComentarios(query, Boolean.parseBoolean(full));
//get the result from json to a standarized class
ComentarioCDMX[] myComments = gson.fromJson(result, ComentarioCDMX[].class);
for (ComentarioCDMX myComment : myComments) {
//evaluates if the comment is positive, neutral or negative.
Integer sentiment = sentimentAnalysis.classifyVector(myComment.getComment());
myComment.setSentiment(sentiment);
myComment.setQuery(query);
/* <Analisis de sentimiento /> */
comentariosCDMXRepository.save(myComment);
}
s.setStatus(true);
} catch (Exception e) {
logger.error(query, e);
s.setStatus(false);
mailSend.add(e);
} finally {
s.setLastUse(new Date());
//Saves data to Solr
crawQueryDao.save(s);
}
}
update();
// <DEBUG>
String endTime = time.format(new Date());
logger.info(name + " " + num + " > Inicio: " + startTime + ", Fin: " + endTime);
// </DEBUG>
}
#Scheduled(cron = "${cron.booking.comments}")
public void curro0() throws InterruptedException {
exec(0);
}
}
and this is when the code should be executed:
cron.booking.comments=00 30 02 * * *
Additional notes:
The test PC OS is Windows 7 and the server OS is linux Debian 3.16.7. and tghe java version in the test PC is 1.7 oracle JDK and on the server is 1.8.0 JRE.

What is the fastest way to bulk load data into HBase programmatically?

I have a Plain text file with possibly millions of lines which needs custom parsing and I want to load it into an HBase table as fast as possible (using Hadoop or HBase Java client).
My current solution is based on a MapReduce job without the Reduce part. I use FileInputFormat to read the text file so that each line is passed to the map method of my Mapper class. At this point the line is parsed to form a Put object which is written to the context. Then, TableOutputFormat takes the Put object and inserts it to table.
This solution yields an average insertion rate of 1,000 rows per second, which is less than what I expected. My HBase setup is in pseudo distributed mode on a single server.
One interesting thing is that during insertion of 1,000,000 rows, 25 Mappers (tasks) are spawned but they run serially (one after another); is this normal?
Here is the code for my current solution:
public static class CustomMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
protected void map(LongWritable key, Text value, Context context) throws IOException {
Map<String, String> parsedLine = parseLine(value.toString());
Put row = new Put(Bytes.toBytes(parsedLine.get(keys[1])));
for (String currentKey : parsedLine.keySet()) {
row.add(Bytes.toBytes(currentKey),Bytes.toBytes(currentKey),Bytes.toBytes(parsedLine.get(currentKey)));
}
try {
context.write(new ImmutableBytesWritable(Bytes.toBytes(parsedLine.get(keys[1]))), row);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public int run(String[] args) throws Exception {
if (args.length != 2) {
return -1;
}
conf.set("hbase.mapred.outputtable", args[1]);
// I got these conf parameters from a presentation about Bulk Load
conf.set("hbase.hstore.blockingStoreFiles", "25");
conf.set("hbase.hregion.memstore.block.multiplier", "8");
conf.set("hbase.regionserver.handler.count", "30");
conf.set("hbase.regions.percheckin", "30");
conf.set("hbase.regionserver.globalMemcache.upperLimit", "0.3");
conf.set("hbase.regionserver.globalMemcache.lowerLimit", "0.15");
Job job = new Job(conf);
job.setJarByClass(BulkLoadMapReduce.class);
job.setJobName(NAME);
TextInputFormat.setInputPaths(job, new Path(args[0]));
job.setInputFormatClass(TextInputFormat.class);
job.setMapperClass(CustomMap.class);
job.setOutputKeyClass(ImmutableBytesWritable.class);
job.setOutputValueClass(Put.class);
job.setNumReduceTasks(0);
job.setOutputFormatClass(TableOutputFormat.class);
job.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
Long startTime = Calendar.getInstance().getTimeInMillis();
System.out.println("Start time : " + startTime);
int errCode = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadMapReduce(), args);
Long endTime = Calendar.getInstance().getTimeInMillis();
System.out.println("End time : " + endTime);
System.out.println("Duration milliseconds: " + (endTime-startTime));
System.exit(errCode);
}
I've gone through a process that is probably very similar to yours of attempting to find an efficient way to load data from an MR into HBase. What I found to work is using HFileOutputFormat as the OutputFormatClass of the MR.
Below is the basis of my code that I have to generate the job and the Mapper map function which writes out the data. This was fast. We don't use it anymore, so I don't have numbers on hand, but it was around 2.5 million records in under a minute.
Here is the (stripped down) function I wrote to generate the job for my MapReduce process to put data into HBase
private Job createCubeJob(...) {
//Build and Configure Job
Job job = new Job(conf);
job.setJobName(jobName);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
job.setMapperClass(HiveToHBaseMapper.class);//Custom Mapper
job.setJarByClass(CubeBuilderDriver.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(HFileOutputFormat.class);
TextInputFormat.setInputPaths(job, hiveOutputDir);
HFileOutputFormat.setOutputPath(job, cubeOutputPath);
Configuration hConf = HBaseConfiguration.create(conf);
hConf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum);
hConf.set("hbase.zookeeper.property.clientPort", hbaseZookeeperClientPort);
HTable hTable = new HTable(hConf, tableName);
HFileOutputFormat.configureIncrementalLoad(job, hTable);
return job;
}
This is my map function from the HiveToHBaseMapper class (slightly edited ).
public void map(WritableComparable key, Writable val, Context context)
throws IOException, InterruptedException {
try{
Configuration config = context.getConfiguration();
String[] strs = val.toString().split(Constants.HIVE_RECORD_COLUMN_SEPARATOR);
String family = config.get(Constants.CUBEBUILDER_CONFIGURATION_FAMILY);
String column = strs[COLUMN_INDEX];
String Value = strs[VALUE_INDEX];
String sKey = generateKey(strs, config);
byte[] bKey = Bytes.toBytes(sKey);
Put put = new Put(bKey);
put.add(Bytes.toBytes(family), Bytes.toBytes(column), (value <= 0)
? Bytes.toBytes(Double.MIN_VALUE)
: Bytes.toBytes(value));
ImmutableBytesWritable ibKey = new ImmutableBytesWritable(bKey);
context.write(ibKey, put);
context.getCounter(CubeBuilderContextCounters.CompletedMapExecutions).increment(1);
}
catch(Exception e){
context.getCounter(CubeBuilderContextCounters.FailedMapExecutions).increment(1);
}
}
I pretty sure this isn't going to be a Copy&Paste solution for you. Obviously the data I was working with here didn't need any custom processing (that was done in a MR job before this one). The main thing I want to provide out of this is the HFileOutputFormat. The rest is just an example of how I used it. :)
I hope it gets you onto a solid path to a good solution. :
One interesting thing is that during insertion of 1,000,000 rows, 25 Mappers (tasks) are spawned but they run serially (one after another); is this normal?
mapreduce.tasktracker.map.tasks.maximum parameter which is defaulted to 2 determines the maximum number of tasks that can run in parallel on a node. Unless changed, you should see 2 map tasks running simultaneously on each node.

Categories