How do I determine an offset in Apache Spark?

I'm searching through some data files (~20GB). I'd like to find some specific terms in that data and mark the offset for the matches. Is there a way to have Spark identify the offset for the chunk of data I'm operating on?
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
public class Grep {
    public static void main( String args[] ) {
        SparkConf conf = new SparkConf().setMaster( "spark://ourip:7077" );
        JavaSparkContext jsc = new JavaSparkContext( conf );
        JavaRDD<String> data = jsc.textFile( "hdfs://ourip/test/testdata.txt" ); // load the data from HDFS
        JavaRDD<String> filterData = data.filter( new Function<String, Boolean>() {
            // I'd like to do something here to get the offset in the original file of the string "babe ruth"
            public Boolean call( String s ) { return s.toLowerCase().contains( "babe ruth" ); } // case-insensitive matching
        } );
        // count the hits; this action is what actually executes the RDD filter
        long matches = filterData.count();
        System.out.println( "Lines with search terms: " + matches );
    } // end main
} // end class Grep
I'd like to do something in the "filter" operation to compute the offset of "babe ruth" in the original file. I can get the offset of "babe ruth" in the current line, but what's the process or function that tells me the offset of the line within the file?

Spark can read files through the standard Hadoop input formats. To get the byte offset of each line, use Hadoop's TextInputFormat class (org.apache.hadoop.mapreduce.lib.input), which is already bundled with Spark.
It reads the file as pairs of key (byte offset) and value (text line). From its documentation:
An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.
In Spark, you use it by calling newAPIHadoopFile():
SparkConf conf = new SparkConf().setMaster("");
JavaSparkContext jsc = new JavaSparkContext(conf);
// read the content of the file using Hadoop format
JavaPairRDD<LongWritable, Text> data = jsc.newAPIHadoopFile(
    "file_path", // input path
    TextInputFormat.class, // input format class to use
    LongWritable.class, // class of the key (byte offset)
    Text.class, // class of the value (line text)
    new Configuration());
JavaRDD<String> mapped = data.map(new Function<Tuple2<LongWritable, Text>, String>() {
    public String call(Tuple2<LongWritable, Text> tuple) throws Exception {
        // each line arrives as a tuple (offset, text)
        long pos = tuple._1().get(); // extract offset
        String line = tuple._2().toString(); // extract text
        return pos + " " + line;
    }
});
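Once each record carries the byte offset of its line, the offset of a match is that key plus the position of the match inside the line. The following is a minimal plain-Java sketch of that arithmetic, not Spark code: the class and method names are hypothetical, and it assumes the file is UTF-8 so that the in-line character index can be converted to a byte count.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class LineOffsetSketch {
    /**
     * Absolute byte offsets of every case-insensitive match of term in line,
     * where lineOffset is the byte offset of the line's first character
     * (the LongWritable key produced by TextInputFormat).
     */
    static List<Long> matchOffsets(long lineOffset, String line, String term) {
        List<Long> offsets = new ArrayList<>();
        for (int i = 0; i + term.length() <= line.length(); i++) {
            if (line.regionMatches(true, i, term, 0, term.length())) {
                // Assumption: the file is UTF-8, so the character index is
                // converted to a byte count before it is added to the key.
                int prefixBytes = line.substring(0, i).getBytes(StandardCharsets.UTF_8).length;
                offsets.add(lineOffset + prefixBytes);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Hypothetical record: a line that starts at byte 1024 of the file
        System.out.println(matchOffsets(1024L, "In 1927 Babe Ruth hit 60 home runs", "babe ruth"));
    }
}
```

Inside the call() method above, the same computation would take tuple._1().get() as lineOffset and tuple._2().toString() as line.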

You could use the wholeTextFiles(String path, int minPartitions) method from JavaSparkContext to return a JavaPairRDD<String,String> where the key is filename and the value is a string containing the entire content of a file (thus, each record in this RDD represents a file). From here, simply run a map() that will call indexOf(String searchString) on each value. This should return the first index in each file with the occurrence of the string in question.
Finding the offset in a distributed fashion for a single file (your use case, per the comments below) is also possible. Here is an example that works in Scala:
val searchString = *search string*
val rdd1 = sc.textFile(*input file*, *num partitions*)
// Zip RDD lines with their indices
val zrdd1 = rdd1.zipWithIndex()
// Find the first RDD line that contains the string in question
val firstFind = zrdd1.filter { case (line, index) => line.contains(searchString) }.first()
// Grab all lines before the line containing the search string and sum up all of their lengths (and then add the inline offset)
val filterLines = zrdd1.filter { case (line, index) => index < firstFind._2 }
val offset = filterLines.map { case (line, index) => line.length.toLong }.fold(0L)(_ + _) + firstFind._1.indexOf(searchString)
Note that you would additionally need to add any new line characters manually on top of this since they are not accounted for (the input format uses new lines as demarcations between records). The number of new lines is simply the number of lines before the line containing the search string so this is trivial to add.
I'm not entirely familiar with the Java API unfortunately and it's not exactly easy to test so I'm not sure if the code below works but have at it (Also, I used Java 1.7 but 1.8 compresses a lot of this code with lambda expressions.):
final String searchString = *search string*;
JavaRDD<String> data = jsc.textFile("hdfs://ourip/test/testdata.txt");
// zipWithIndex() returns a pair RDD of (line, line index)
JavaPairRDD<String, Long> zrdd1 = data.zipWithIndex();
final Tuple2<String, Long> firstFind = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
    public Boolean call(Tuple2<String, Long> input) { return input._1().contains(searchString); }
}).first();
JavaPairRDD<String, Long> filterLines = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
    public Boolean call(Tuple2<String, Long> input) { return input._2() < firstFind._2(); }
});
// sum the lengths as longs so that a multi-gigabyte file does not overflow an int
long offset = filterLines.map(new Function<Tuple2<String, Long>, Long>() {
    public Long call(Tuple2<String, Long> input) { return (long) input._1().length(); }
}).fold(0L, new Function2<Long, Long, Long>() {
    public Long call(Long a, Long b) { return a + b; }
}) + firstFind._1().indexOf(searchString);
This can only be done when your input is one file (since otherwise, zipWithIndex() wouldn't guarantee offsets within a file) but this method works for an RDD of any number of partitions so feel free to partition your file up into any number of chunks.
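To make the arithmetic concrete, including the newline correction mentioned above, here is a small plain-Java sketch that works on an in-memory list of lines instead of an RDD. The class and method names are hypothetical, it counts characters rather than bytes, and it assumes every line is terminated by a single '\n'.

```java
import java.util.Arrays;
import java.util.List;

final class CumulativeOffsetSketch {
    /** Character offset of the first occurrence of term, or -1 if it never occurs. */
    static long firstOffset(List<String> lines, String term) {
        long preceding = 0L; // characters (including newlines) before the current line
        for (String line : lines) {
            int inline = line.indexOf(term);
            if (inline >= 0) {
                return preceding + inline;
            }
            // +1 for the '\n' that was removed when the text was split into lines
            preceding += line.length() + 1L;
        }
        return -1L;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("abc", "de babe ruth", "x");
        System.out.println(firstOffset(lines, "babe ruth"));
    }
}
```

The result agrees with calling indexOf on the original text "abc\nde babe ruth\nx", which is a convenient way to sanity-check the distributed version on a small file.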


Multiple Apache Flink windows validations

I'm just getting started with stream processing in Apache Flink. I'm receiving a stream of JSON messages that look like this:
"token_id": "tok_afgtryuo",
"ip_address": "",
"device_fingerprint": "abcghift",
"card_hash": "hgtyuigash",
"bin_number": "424242",
"last4": "4242",
"name": "Seu Jorge"
I was asked whether I could implement the following business rules:
Decline if number of tokens > 5 for this IP in last 10 seconds
Decline if number of tokens > 15 for this IP in last minute
Decline if number of tokens > 60 for this IP in last hour
I made two classes. In the main class I create a RuleMaker instance and call its window function with different parameters to avoid duplicating code:
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//This DataStream Would be Converting the Json to a Token Object
DataStream<Token> baseStream =
env.addSource(new SocketTextStreamFunction("localhost",
.map(new MapTokens());
// 1- First rule Decline if number of tokens > 5 for this IP in last 10 seconds
DataStreamSink<String> response1 = new RuleMaker().getStreamKeyCount(baseStream, "ip", Time.seconds(10),
5, "seconds").print();
//2 -Decline if number of tokens > 15 for this IP in last minute
DataStreamSink<String> response2 = new RuleMaker().getStreamKeyCount(baseStream, "ip", Time.minutes(1),
62, "minutes").print();
//3- Decline if number of tokens > 60 for this IP in last hour
DataStreamSink<String> response3 = new RuleMaker().getStreamKeyCount(baseStream, "ip", Time.hours(1),
60, "Hours").print();
The other class holds the rule logic: it counts how many times an IP address appears and, if the count exceeds the allowed number within the time window, returns a message with some information:
public class RuleMaker {
    public DataStream<String> getStreamKeyCount(DataStream<Token> stream,
                                                String tokenProp,
                                                Time time,
                                                Integer maxPetitions,
                                                String ruleType) {
        return stream
            .flatMap(new FlatMapFunction<Token, Tuple3<String, Integer, String>>() {
                public void flatMap(Token token, Collector<Tuple3<String, Integer, String>> collector) throws Exception {
                    String tokenSelection = "";
                    switch (tokenProp) {
                        case "ip":
                            tokenSelection = token.getIpAddress();
                            break;
                        case "device":
                            tokenSelection = token.getDeviceFingerprint();
                            break;
                        case "cardHash":
                            tokenSelection = token.getCardHash();
                            break;
                    }
                    collector.collect(new Tuple3<>(tokenSelection, 1, token.get_tokenId()));
                }
            })
            .keyBy(0) // key by the selected property (field 0 of the tuple)
            .timeWindow(time)
            .process(new MyProcessWindowFunction(maxPetitions, ruleType));
    }
    // Class to process the elements from the window
    private class MyProcessWindowFunction extends ProcessWindowFunction<
            Tuple3<String, Integer, String>, // input
            String,                          // output
            Tuple,                           // key
            TimeWindow> {                    // window
        private Integer _maxPetitions;
        private String _ruleType;
        public MyProcessWindowFunction(Integer maxPetitions, String ruleType) {
            this._maxPetitions = maxPetitions;
            this._ruleType = ruleType;
        }
        public void process(Tuple tuple, Context context, Iterable<Tuple3<String, Integer, String>> iterable, Collector<String> out) throws Exception {
            Integer counter = 0;
            for (Tuple3<String, Integer, String> element : iterable) {
                counter += element.f1;
                if (counter > _maxPetitions) {
                    out.collect("The element has been declined: " + element.f2 + " Num elements: " + counter + " rule type: " + _ruleType + " token: " + element.f0);
                    counter = 0;
                }
            }
        }
    }
}
So far I think this code is working, but I'm a beginner with Apache Flink, and I'd appreciate it a lot if you could tell me whether there is anything wrong with my approach and point me in the right direction.
Thanks a lot.
The general approach looks very good, although I would have thought that the Table API, which supports JSON out of the box, would be powerful enough here and more concise.
If you want to stick with the DataStream API, replace the switch on tokenProp in getStreamKeyCount by passing a key extractor to getStreamKeyCount, so that there is only one place to touch when adding new rules.
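public DataStream<String> getStreamKeyCount(DataStream<Token> stream,
                                            KeySelector<Token, String> keyExtractor,
                                            Time time,
                                            Integer maxPetitions,
                                            String ruleType) {
    return stream
        .map(token -> new Tuple3<>(keyExtractor.getKey(token), 1, token.get_tokenId()))
        // type hint, needed because the lambda's generic types are erased
        .returns(Types.TUPLE(Types.STRING, Types.INT, Types.STRING))
        .keyBy(0)
        .timeWindow(time)
        .process(new MyProcessWindowFunction(maxPetitions, ruleType));
}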
Then the invocation becomes:
DataStreamSink<String> response2 = ruleMaker.getStreamKeyCount(baseStream,
    Token::getIpAddress, Time.minutes(1), 62, "minutes").print();
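For experimenting with the rule logic outside a Flink job, a "more than N events for this key in the last T" check can be sketched in plain Java. This is not Flink code and not the windowing used above: it is a sliding check over remembered timestamps, the class and method names are hypothetical, and it assumes timestamps arrive in non-decreasing order for each key. The key extractor is passed in as a function, mirroring the KeySelector suggestion.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class RateRuleSketch<T> {
    private final Function<T, String> keyExtractor; // plays the role of the KeySelector
    private final long windowMillis;
    private final int maxEvents;
    private final Map<String, Deque<Long>> seenByKey = new HashMap<>();

    RateRuleSketch(Function<T, String> keyExtractor, long windowMillis, int maxEvents) {
        this.keyExtractor = keyExtractor;
        this.windowMillis = windowMillis;
        this.maxEvents = maxEvents;
    }

    /** Records the event and returns true when it pushes its key over the limit. */
    boolean shouldDecline(T event, long timestampMillis) {
        Deque<Long> times = seenByKey.computeIfAbsent(keyExtractor.apply(event), k -> new ArrayDeque<>());
        // Forget timestamps that are no longer within the last windowMillis
        while (!times.isEmpty() && times.peekFirst() <= timestampMillis - windowMillis) {
            times.pollFirst();
        }
        times.addLast(timestampMillis);
        return times.size() > maxEvents;
    }

    public static void main(String[] args) {
        // Rule 1 from the question: more than 5 tokens per IP in the last 10 seconds
        RateRuleSketch<String> tenSeconds = new RateRuleSketch<>(ip -> ip, 10_000L, 5);
        for (int i = 1; i <= 7; i++) {
            System.out.println("event " + i + " declined: " + tenSeconds.shouldDecline("10.0.0.1", i * 1_000L));
        }
    }
}
```

Each of the three business rules would be one instance with its own window length and limit, which is the same "one place to add new rules" idea as the key-extractor refactoring.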

Problem accessing array variable inside rdd operation in yarn-cluster mode

My input is a CSV/TSV (or other delimiter-separated) file with a header. I want to map each row to a pair with any chosen column as the key and the whole row as the value. The code below ran fine on my machine but failed when tested in yarn-cluster mode.
public class SparkController implements {
String[] header;
String path;
    public static void main(String[] args) {
        // some parse function
        // say the input file is a CSV like: (id,timestamp,ip)
        // header = [ "id", "timestamp", "ip" ]
        // DELIMITER = ","
        SparkController sparkController = new SparkController();
        JavaPairRDD<String, String> pairRdd = sparkController.map2PairRdd("ip");
    }
    private JavaPairRDD<String, String> map2PairRdd(String column) {
        JavaRDD<String> rawFile = sc.textFile(path);
        JavaPairRDD<String, String> pairRdd = rawFile.mapToPair((s) -> {
            // DELIMITER can be accessed normally
            String[] fields = s.split(DELIMITER);
            // turns out header is empty when this runs on YARN,
            // but it works fine in standalone mode
            return new Tuple2<>(fields[Arrays.asList(header).indexOf(column)], s);
        });
        // other operations continue
    }
I understand that variables like DELIMITER and header are serialized and shipped to the workers in cluster mode. But how can the header array be empty inside the RDD operation?
I modified the code to declare a final int variable index outside mapToPair and to use index inside it, and the error went away.
But I'm still confused about why header is empty when accessed inside mapToPair. Can anybody provide some insight?
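The workaround described above (resolve the column index first, then use only that index inside the function) can be sketched in plain Java. This is an illustration of the pattern only: java.util.function.Function stands in for the function passed to mapToPair, Map.Entry stands in for Tuple2, and the class and method names are hypothetical. It does not by itself explain why header arrives empty on YARN.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;

final class ColumnKeySketch {
    /**
     * Builds a row -> (key, row) function. The returned lambda captures only
     * an int and a String, not the header array or the enclosing object.
     */
    static Function<String, Map.Entry<String, String>> keyBy(String[] header, String column, String delimiter) {
        final int index = Arrays.asList(header).indexOf(column); // resolved once, up front
        if (index < 0) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        final String quoted = Pattern.quote(delimiter); // treat the delimiter literally
        return row -> new SimpleEntry<>(row.split(quoted, -1)[index], row);
    }

    public static void main(String[] args) {
        String[] header = { "id", "timestamp", "ip" };
        Function<String, Map.Entry<String, String>> byIp = keyBy(header, "ip", ",");
        System.out.println(byIp.apply("42,1438387200,10.0.0.1"));
    }
}
```

Failing fast on an unknown column also surfaces an empty or wrong header where the function is built, instead of as an array index error inside a task.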

Mapreduce java program to search QuadTree index and also run GeometryEngine.contains to confirm point in polygon using wkt file

This post is a map reduce implementation suggested for my previous question: "How to optimize scan of 1 huge file / table in Hive to confirm/check if lat long point is contained in a wkt geometry shape"
I am not well-versed in writing Java programs for MapReduce; I mainly use Hive, Pig, or Spark to develop in the Hadoop ecosystem. Some background on the task at hand: I am trying to associate every latitude/longitude ping with the corresponding ZIP postal code. I have a WKT multi-polygon shape file (500 MB) with all the ZIP information. I have loaded it into Hive and can do a join using ST_Contains(polygon, point). However, it takes very long to complete. To overcome this bottleneck I am trying to leverage the ESRI example by building a quadtree index for looking up a point, derived from lat-long, in the polygons.
I have managed to write the code, but it exhausts the Java heap memory of the cluster. Any suggestions on improving the code or on a different approach will be greatly appreciated:
Error message:
Error: Java heap space
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
My code:
public class MapperClass extends Mapper<LongWritable, Text, Text, IntWritable> {
// column indices for values in the text file
int longitudeIndex;
int latitudeIndex;
int wktZip;
int wktGeom;
int wktLineCount;
int wktStateID;
// in boundaries.wkt, the label for the polygon is "wkt"
//creating ArrayList to hold details of the file
ArrayList<ZipPolyClass> nodes = new ArrayList<ZipPolyClass>();
String labelAttribute;
EsriFeatureClass featureClass;
SpatialReference spatialReference;
QuadTree quadTree;
QuadTreeIterator quadTreeIter;
BufferedReader csvWkt;
// class to store all the values from wkt file and calculate geometryFromWKT
public class ZipPolyClass {
    public String zipCode;
    public String wktPoly;
    public String stateID;
    public int indexJkey;
    public Geometry wktGeomObj;
    public ZipPolyClass(int ijk, String z, String w, String s) {
        zipCode = z;
        wktPoly = w;
        stateID = s;
        indexJkey = ijk;
        wktGeomObj = GeometryEngine.geometryFromWkt(wktPoly, 0, Geometry.Type.Unknown);
    }
}
// building the quadtree index from the WKT multipolygons and creating an iterator
private void buildQuadTree() {
    quadTree = new QuadTree(new Envelope2D(-180, -90, 180, 90), 8);
    Envelope envelope = new Envelope();
    int j = 0;
    for (ZipPolyClass node : nodes) {
        // index each polygon's bounding envelope under its position in nodes
        node.wktGeomObj.queryEnvelope(envelope);
        quadTree.insert(j, new Envelope2D(envelope.getXMin(), envelope.getYMin(), envelope.getXMax(), envelope.getYMax()));
        j++;
    }
    quadTreeIter = quadTree.getIterator();
}
/**
 * Query the quadtree for the feature containing the given point
 * @param pt point as longitude, latitude
 * @return index to feature in nodes or -1 if not found
 */
private int queryQuadTree(Point pt) {
    // reset iterator to the quadrant envelope that contains the point passed
    quadTreeIter.resetIterator(pt, 0);
    int elmHandle = quadTreeIter.next();
    while (elmHandle >= 0) {
        int featureIndex = quadTree.getElement(elmHandle);
        // we know the point and this feature are in the same quadrant, but we need to make sure the feature
        // actually contains the point
        if (GeometryEngine.contains(nodes.get(featureIndex).wktGeomObj, pt, spatialReference)) {
            return featureIndex;
        }
        elmHandle = quadTreeIter.next();
    }
    // feature not found
    return -1;
}
/**
 * Sets up mapper with filter geometry provided as argument[0] to the jar
 */
public void setup(Context context) {
    Configuration config = context.getConfiguration();
    spatialReference = SpatialReference.create(4326);
    // first pull values from the configuration
    String featuresPath = config.get("sample.features.input");
    // get column references from the driver class
    wktZip = config.getInt("", 0);
    wktGeom = config.getInt("sample.features.col.geometry", 18);
    wktStateID = config.getInt("sample.features.col.stateID", 3);
    latitudeIndex = config.getInt("", 5);
    longitudeIndex = config.getInt("samples.csvdata.columns.long", 6);
    FSDataInputStream iStream = null;
    try {
        // load the text WKT file provided as argument 0
        FileSystem hdfs = FileSystem.get(config);
        iStream = hdfs.open(new Path(featuresPath));
        BufferedReader br = new BufferedReader(new InputStreamReader(iStream));
        String wktLine;
        int i = 0;
        while ((wktLine = br.readLine()) != null) {
            String[] val = wktLine.split("\\|");
            String qtZip = val[wktZip];
            String poly = val[wktGeom];
            String stID = val[wktStateID];
            ZipPolyClass zpc = new ZipPolyClass(i, qtZip, poly, stID);
            nodes.add(zpc);
            i++; // increment in the loop before end
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (iStream != null) {
            try {
                iStream.close();
            } catch (IOException e) { }
        }
    }
    // build a quadtree of our features for fast queries
    if (!nodes.isEmpty()) {
        buildQuadTree();
    }
}
public void map(LongWritable key, Text val, Context context)
        throws IOException, InterruptedException {
    /*
     * The TextInputFormat we set in the configuration, by default, splits a text file line by line.
     * The key is the byte offset to the first character in the line. The value is the text of the line.
     */
    String line = val.toString();
    String[] values = line.split(",");
    // get lat long from file and convert to float
    float latitude = Float.parseFloat(values[latitudeIndex]);
    float longitude = Float.parseFloat(values[longitudeIndex]);
    // Create our Point directly from longitude and latitude
    Point point = new Point(longitude, latitude);
    int featureIndex = queryQuadTree(point);
    // Each map only processes one record at a time, so we start out with our count
    // as 1. Since we have a distinct record file we will not run a reducer
    IntWritable one = new IntWritable(1);
    if (featureIndex >= 0) {
        String zipTxt = nodes.get(featureIndex).zipCode;
        String stateIDTxt = nodes.get(featureIndex).stateID;
        String latTxt = values[latitudeIndex];
        String longTxt = values[longitudeIndex];
        String pointTxt = point.toString();
        String name;
        name = zipTxt + "\t" + stateIDTxt + "\t" + latTxt + "\t" + longTxt + "\t" + pointTxt;
        context.write(new Text(name), one);
    } else {
        context.write(new Text("*Outside Feature Set"), one);
    }
}
}
I was able to resolve the out-of-memory issue by changing the ArrayList<ZipPolyClass> to an ArrayList<Geometry> that holds only the geometries.
Creating a class object (around 50k) to hold each row of the text file consumed all the Java heap memory. After this change the code ran fine even in a 1-node virtual sandbox, and I was able to crunch around 40 million rows in around 6 minutes.
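The overall pattern (a cheap envelope test to find candidates, then an exact containment test, while keeping only the geometry per feature) can be tried out with nothing but the JDK. The sketch below is a stand-in, not the ESRI API: java.awt.geom.Path2D replaces the ESRI Geometry, a linear scan over bounding boxes replaces the QuadTree, and the class and method names are hypothetical.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class PointInPolygonSketch {
    // Only the geometry and its envelope are kept per feature, no raw WKT text
    private final List<Path2D> polygons = new ArrayList<>();
    private final List<Rectangle2D> envelopes = new ArrayList<>();

    /** Adds a polygon given as vertex coordinate arrays and returns its index. */
    int add(double[] xs, double[] ys) {
        Path2D.Double path = new Path2D.Double();
        path.moveTo(xs[0], ys[0]);
        for (int i = 1; i < xs.length; i++) {
            path.lineTo(xs[i], ys[i]);
        }
        path.closePath();
        polygons.add(path);
        envelopes.add(path.getBounds2D());
        return polygons.size() - 1;
    }

    /** Index of the first polygon containing the point, or -1 if none does. */
    int find(double x, double y) {
        for (int i = 0; i < polygons.size(); i++) {
            // cheap bounding-box test first, exact test only for candidates
            if (envelopes.get(i).contains(x, y) && polygons.get(i).contains(x, y)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        PointInPolygonSketch index = new PointInPolygonSketch();
        index.add(new double[] { 0, 10, 10, 0 }, new double[] { 0, 0, 10, 10 }); // square
        index.add(new double[] { 20, 30, 20 }, new double[] { 0, 0, 10 });       // triangle
        System.out.println(index.find(5, 5));
        System.out.println(index.find(29, 9)); // inside the triangle's envelope only
    }
}
```

A real quadtree avoids scanning every envelope, but the memory lesson is the same: the index only needs envelopes and geometries, so anything else stored per row is pure overhead.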

LDA in Spark 1.3.1. Converting raw data into Term Document Matrix?

I'm trying out LDA with Spark 1.3.1 in Java and got this error:
Error: application failed with exception
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: "��"
My .txt file looks like this:
put weight find difficult pull ups push ups now
blindness diseases everything eyes work perfectly except ability take light use light form images
role model kid
Dear recall saddest memory childhood
This is the code:
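import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;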
public class JavaLDA {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("LDA Example");
JavaSparkContext sc = new JavaSparkContext(conf);
// Load and parse the data
String path = "/tutorial/input/askreddit20150801.txt";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Vector> parsedData = data.map(
    new Function<String, Vector>() {
        public Vector call(String s) {
            String[] sarray = s.trim().split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++)
                values[i] = Double.parseDouble(sarray[i]);
            return Vectors.dense(values);
        }
    }
);
// Index documents with unique IDs
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
    new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
        public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
            return doc_id.swap();
        }
    }
));
// Cluster the documents into 100 topics using LDA
LDAModel ldaModel = new LDA().setK(100).run(corpus);
// Output topics. Each is a distribution over words (matching word count vectors)
System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
+ " words):");
Matrix topics = ldaModel.topicsMatrix();
for (int topic = 0; topic < 100; topic++) {
System.out.print("Topic " + topic + ":");
for (int word = 0; word < ldaModel.vocabSize(); word++) {
System.out.print(" " + topics.apply(word, topic));
}, "myLDAModel");
Anyone know why this happened? I'm just trying LDA Spark for the first time. Thanks.
values[i] = Double.parseDouble(sarray[i]);
Why are you trying to convert each word of your text file into a Double?
That is the cause of your error: your code expects the input file to consist of lines of whitespace-separated text that looks like numbers. Assuming your text consists of words instead:
Get a list of every word that appears in your corpus:
JavaRDD<String> words =
    data.flatMap((FlatMapFunction<String, String>) s -> {
        s = s.replaceAll("[^a-zA-Z ]", "");
        s = s.toLowerCase();
        return Arrays.asList(s.split(" "));
    }).distinct(); // one entry per word, so the indices run from 0 to vocab.size() - 1
Make a map giving each word an integer associated with it:
Map<String, Long> vocab = words.zipWithIndex().collectAsMap();
Then instead of your parsedData doing what it's doing up there, make it look up each word, find the associated number, go to that location in an array, and add 1 to the count for that word.
JavaRDD<Vector> tokens = data.map(
    (Function<String, Vector>) s -> {
        // normalize the line the same way as when building the vocabulary
        String[] vals = s.replaceAll("[^a-zA-Z ]", "").toLowerCase().split(" ");
        double[] idx = new double[vocab.size()];
        for (String val : vals) {
            Long pos = vocab.get(val);
            if (pos != null) {
                idx[pos.intValue()] += 1.0;
            }
        }
        return Vectors.dense(idx);
    }
);
This results in an RDD of vectors, where each vector is vocab.size() long, and each spot in the vector is the count of how many times that vocab word appeared in the line.
I modified this code slightly from what I'm currently using and didn't test it, so there could be errors in it. Good luck!
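The vocabulary-then-count idea is easy to check on a couple of lines without a cluster. Here is a minimal plain-Java sketch of it: the class and method names are hypothetical, it runs on an in-memory list rather than an RDD, and it drops empty tokens, which the Spark snippet above does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TermCountSketch {
    /** Lowercases, strips everything except letters and spaces, splits on whitespace. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.toLowerCase(Locale.ROOT).replaceAll("[^a-z ]", "").split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Assigns each distinct word the next free index, in order of first appearance. */
    static Map<String, Integer> buildVocab(List<String> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String doc : docs) {
            for (String t : tokenize(doc)) {
                vocab.putIfAbsent(t, vocab.size());
            }
        }
        return vocab;
    }

    /** One slot per vocabulary word, holding how often that word occurs in doc. */
    static double[] countVector(String doc, Map<String, Integer> vocab) {
        double[] counts = new double[vocab.size()];
        for (String t : tokenize(doc)) {
            Integer idx = vocab.get(t);
            if (idx != null) {
                counts[idx] += 1.0;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("role model kid", "Dear recall saddest memory childhood");
        Map<String, Integer> vocab = buildVocab(docs);
        System.out.println(vocab);
        System.out.println(Arrays.toString(countVector("kid role kid", vocab)));
    }
}
```

Each resulting array corresponds to one row of the term-document matrix that LDA expects as a document's word-count vector.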

How to write avro output in hadoop map reduce?

I wrote a Hadoop word count program which takes TextInputFormat input and is supposed to output the word counts in Avro format.
The MapReduce job runs fine, but its output is readable with Unix commands such as more or vi. I was expecting the output to be unreadable, since Avro output is in a binary format.
I have used a mapper only; there is no reducer. I just want to experiment with Avro, so I am not worried about memory or stack overflow. The following is the mapper code:
public class WordCountMapper extends Mapper<LongWritable, Text, AvroKey<String>, AvroValue<Integer>> {
    private Map<String, Integer> wordCountMap = new HashMap<String, Integer>();
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] keys = value.toString().split("[\\s-*,\":]");
        for (String currentKey : keys) {
            int currentCount = 1;
            String currentToken = currentKey.trim().toLowerCase();
            if (wordCountMap.containsKey(currentToken)) {
                // word seen before: add one to its running count
                currentCount = wordCountMap.get(currentToken) + 1;
            }
            wordCountMap.put(currentToken, currentCount);
        }
        System.out.println("DEBUG : total number of unique words = " + wordCountMap.size());
    }
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit the accumulated counts once, after all input has been mapped
        for (Map.Entry<String, Integer> currentKeyValue : wordCountMap.entrySet()) {
            AvroKey<String> currentKey = new AvroKey<String>(currentKeyValue.getKey());
            AvroValue<Integer> currentValue = new AvroValue<Integer>(currentKeyValue.getValue());
            context.write(currentKey, currentValue);
        }
    }
}
The driver code is as follows:
public int run(String[] args) throws Exception {
Job avroJob = new Job(getConf());
avroJob.setJobName("Avro word count");
AvroJob.setInputKeySchema(avroJob, Schema.create(Type.INT));
AvroJob.setInputValueSchema(avroJob, Schema.create(Type.STRING));
AvroJob.setMapOutputKeySchema(avroJob, Schema.create(Type.STRING));
AvroJob.setMapOutputValueSchema(avroJob, Schema.create(Type.INT));
AvroJob.setOutputKeySchema(avroJob, Schema.create(Type.STRING));
AvroJob.setOutputValueSchema(avroJob, Schema.create(Type.INT));
FileInputFormat.addInputPath(avroJob, new Path(args[0]));
FileOutputFormat.setOutputPath(avroJob, new Path(args[1]));
    return avroJob.waitForCompletion(true) ? 0 : 1;
}
I would like to know what Avro output looks like and what I am doing wrong in this program.
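The latest release of the Avro library includes an updated version of its ColorCount example, adapted for MRv2. I suggest you look at it and either use the same pattern as in its Reduce class or just extend AvroMapper. Please note that using the Pair class instead of AvroKey + AvroValue is also essential for running Avro on Hadoop. Also, if the driver really never sets an output format class, as in the code shown, the job falls back to Hadoop's default TextOutputFormat, which would explain the human-readable output.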
