I am new to Apache Spark with Java.
I have a text file with comma-delimited values, as below:
3,45.25,23.45
5,22.15,19.35
4,33.24,12.45
2,15.67,21.22
Here the columns mean:
1st column: index value
2nd column: latitude values
3rd column: longitude values
I am trying to parse this data into two or three RDDs (or pair RDDs). This is my code so far:
JavaRDD<String> data = sc.textFile("hdfs://data.txt");
JavaRDD<Double> data1 = data.flatMap(
    new FlatMapFunction<String, Double>() {
        public Iterable<Double> call(String data) {
            return Arrays.asList(data.split(","));
        }
    });
Something like this (using Java 8 lambdas for better readability)?
JavaRDD<String> data = sc.textFile("hdfs://data.txt");
JavaRDD<Tuple3<Integer, Float, Float>> parsedData = data
        .map((line) -> line.split(","))
        .map((line) -> new Tuple3<>(parseInt(line[0]), parseFloat(line[1]), parseFloat(line[2]))) // assumes static imports of Integer.parseInt and Float.parseFloat
        .cache(); // Cache the parsed RDD to avoid recomputation in the subsequent .mapToPair calls
JavaPairRDD<Integer, Float> latitudeByIndex = parsedData.mapToPair((line) -> new Tuple2<>(line._1(), line._2()));
JavaPairRDD<Integer, Float> longitudeByIndex = parsedData.mapToPair((line) -> new Tuple2<>(line._1(), line._3()));
JavaPairRDD<Integer, Tuple2<Float, Float>> pointByIndex = parsedData.mapToPair((line) -> new Tuple2<>(line._1(), new Tuple2<>(line._2(), line._3())));
I currently have a TreeMap of the form TreeMap<String, List<List<String>>>.
I'm trying to write my TreeMap to an output file so that the values of each inner list are separated by colons.
Do I need a second for loop to iterate over each inner list and format it using String.join(":", elements)?
Or is there a more concise way to keep it all in a single loop?
I've tried a few things and my current code is:
new File(outFolder).mkdir();
File dir = new File(outFolder);
// get the file we're writing to
File outFile = new File(dir, "javaoutput.txt");
// create a writer
try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outFile), "utf-8"))) {
    for (Map.Entry<String, String[]> entry : allResults.entrySet()) {
        writer.write(entry.getKey() + ", " + Arrays.toString(entry.getValue()).replace("null", ""));
        writer.newLine();
    }
}
Current output:
ANY, [[469, 470], [206, 1013, 1014], [2607, 2608]]
Desired output:
ANY, 469:470, 206:1013:1014, 2607:2608
Any suggestions would be greatly appreciated.
String.join(":", arr) can be used to take a String array and return a colon-separated String. This can then be combined with a stream and a Collector to join those strings with a comma separator, so:
TreeMap<String, String[]> allResults = new TreeMap<>();
allResults.put("a", new String[]{"469", "470"});
allResults.put("b", new String[]{"206", "1013", "1014"});
allResults.put("c", new String[]{"2607", "2608"});
String result = allResults.entrySet().stream()
    .map(e -> String.join(":", e.getValue()))
    .collect(Collectors.joining(", "));
System.out.println(result);
produces:
469:470, 206:1013:1014, 2607:2608
With a List<List<String>>, you need a stream within a stream, so:
TreeMap<String, List<List<String>>> allResults = new TreeMap<>();
allResults.put("a", Arrays.asList(Arrays.asList("469", "470"), Arrays.asList("206", "1013", "1014"), Arrays.asList("2607", "2608")));
allResults.put("b", Arrays.asList(Arrays.asList("169", "470")));
allResults.put("c", Arrays.asList(Arrays.asList("269", "470")));
String result = allResults.entrySet().stream()
    .map(i -> i.getKey() + "," + i.getValue().stream()
        .map(elements -> String.join(":", elements))
        .collect(Collectors.joining(", ")))
    .collect(Collectors.joining("\n"));
System.out.println(result);
which produces:
a,469:470, 206:1013:1014, 2607:2608
b,169:470
c,269:470
I'm trying to query the state store to get the data in a 5-minute window. For that I'm using a tumbling window, and I have added a REST endpoint to query the data.
I have stream A, which consumes data from topic1, performs some transformations, and outputs a key-value pair to topic2.
Now in stream B I'm doing the tumbling window operation on the topic2 data. When I run the code and query it through REST, I see empty data in my browser, even though I can see data flowing into the state store.
What I've observed is that when I use a producer class to inject data into topic2 instead of feeding it from stream A, I am able to query the data from the browser. But when topic2 gets its data from stream A, I get empty results.
Here is my stream A code:
public static void main(String[] args) {
    final StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> source = builder.stream("topic1");
    KStream<String, String> output = source
        .map((k, v) -> {
            Map<String, Object> Fields = new LinkedHashMap<>();
            Fields.put("FNAME", "ABC");
            Fields.put("LNAME", "XYZ");
            Map<String, Object> nFields = new LinkedHashMap<>();
            nFields.put("ADDRESS1", "HY");
            nFields.put("ADDRESS2", "BA");
            nFields.put("addF", Fields);
            Map<String, Object> eve = new LinkedHashMap<>();
            eve.put("nFields", nFields);
            Map<String, Object> fevent = new LinkedHashMap<>();
            fevent.put("eve", eve);
            LinkedHashMap<String, Object> newMap = new LinkedHashMap<>(fevent);
            return new KeyValue<>("JAY1234", newMap.toString());
        });
    output.to("topic2");
}
Here is my stream B code (where the tumbling window operation happens):
public static void main(String[] args) {
    final StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> eventStream = builder.stream("topic2");
    eventStream.groupByKey()
        .windowedBy(TimeWindows.of(300000))
        .reduce((v1, v2) -> v1 + ";" + v2, Materialized.as("TumblingWindowPoc"));
    final Topology topology = builder.build();
    KafkaStreams streams = new KafkaStreams(topology, props);
    streams.start();
}
REST code:
@GET
@Path("/{storeName}/{key}")
@Produces(MediaType.APPLICATION_JSON)
public List<KeyValue<String, String>> windowedByKey(@PathParam("storeName") final String storeName,
                                                    @PathParam("key") final String key) {
    final ReadOnlyWindowStore<String, String> store = streams.store(storeName,
            QueryableStoreTypes.<String, String>windowStore());
    if (store == null) {
        throw new NotFoundException();
    }
    long timeTo = System.currentTimeMillis();
    long timeFrom = timeTo - 30000;
    final WindowStoreIterator<String> results = store.fetch(key, timeFrom, timeTo);
    final List<KeyValue<String, String>> windowResults = new ArrayList<>();
    while (results.hasNext()) {
        final KeyValue<Long, String> next = results.next();
        windowResults.add(new KeyValue<String, String>(key + "#" + next.key, next.value));
    }
    return windowResults;
}
And this is what my key-value data looks like:
JAY1234 {eve = {nFields = {ADDRESS1 = HY,ADDRESS2 = BA,Fields = {FNAME = ABC,LNAME = XYZ,}}}}
I should be able to get the data when querying using REST. Any help is greatly appreciated.
Thanks!
To fetch from the window store, timeFrom should be before the window start. So if you want the data for the last 30 seconds, you can subtract the window duration when fetching, e.g. timeTo - 30000 - 300000, and then filter the required events out of the whole window's data.
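As a rough sketch (reusing the store, key, and 5-minute window size from the code above; the filter condition is only an illustration of "filter the required events"):
long windowSizeMs = 300000; // same size as TimeWindows.of(300000) in stream B
long timeTo = System.currentTimeMillis();
long timeFrom = timeTo - 30000 - windowSizeMs; // reach back far enough to cover windows that started before the last 30 seconds
final WindowStoreIterator<String> results = store.fetch(key, timeFrom, timeTo);
final List<KeyValue<String, String>> windowResults = new ArrayList<>();
while (results.hasNext()) {
    final KeyValue<Long, String> next = results.next(); // next.key is the window start timestamp
    if (next.key + windowSizeMs >= timeTo - 30000) { // keep only windows that overlap the last 30 seconds
        windowResults.add(new KeyValue<>(key + "#" + next.key, next.value));
    }
}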
Below is my code.
directKafkaStream.foreachRDD(rdd -> {
    rdd.foreach(record -> {
        messages1.add(record._2);
    });
    JavaRDD<String> lines = sc.parallelize(messages1);
    JavaPairRDD<Integer, String> data = lines.mapToPair(new PairFunction<String, Integer, String>() {
        @Override
        public Tuple2<Integer, String> call(String a) {
            String[] tokens = StringUtil.split(a, '%');
            return new Tuple2<Integer, String>(Integer.getInteger(tokens[3]), tokens[2]);
        }
    }); // map to get year and name of the movie
    Function2<String, String, String> reduceSumFunc = (accum, n) -> (accum.concat(n)); // function for reduce
    JavaPairRDD<Integer, String> yearCount = data.reduceByKey(reduceSumFunc); // reduceByKey to count
    javaFunctions(yearCount)
        .writerBuilder("movie_keyspace", "movie_count", mapTupleToRow(Integer.class, String.class))
        .withColumnSelector(someColumns("year", "list_of_movies"))
        .saveToCassandra(); // this is the error line
});
Here is the error I am getting.
com.datastax.spark.connector.writer.NullKeyColumnException: Invalid null value for key column year
at com.datastax.spark.connector.writer.RoutingKeyGenerator$$anonfun$fillRoutingKey$1.apply$mcVI$sp(RoutingKeyGenerator.scala:49)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at com.datastax.spark.connector.writer.RoutingKeyGenerator.fillRoutingKey(RoutingKeyGenerator.scala:47)
at com.datastax.spark.connector.writer.RoutingKeyGenerator.apply(RoutingKeyGenerator.scala:56)
at com.datastax.spark.connector.writer.TableWriter.batchRoutingKey(TableWriter.scala:126)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1$$anonfun$19.apply(TableWriter.scala:151)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1$$anonfun$19.apply(TableWriter.scala:151)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:107)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:158)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:135)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:140)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:110)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:135)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Description:
1) I am trying to connect Kafka and Cassandra using Spark.
2) I am able to store a JavaRDD, but not a JavaPairRDD, in Cassandra.
3) I have marked the line where the error occurs with a comment.
One of your values for year is null, which is not allowed. Check your data and look for whatever is generating a null integer.
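For what it's worth, Integer.getInteger(tokens[3]) in the posted code does not parse the token; it looks up a JVM system property with that name and returns null when it is not set, which would produce exactly this null key. A minimal sketch of the mapping using Integer.parseInt instead (keeping the question's field layout, with a plain String.split and a guard against rows missing the year field, purely as an illustration):
JavaPairRDD<Integer, String> data = lines
    .filter(a -> a.split("%").length > 3 && !a.split("%")[3].trim().isEmpty()) // drop rows without a year field
    .mapToPair(a -> {
        String[] tokens = a.split("%");
        // Integer.parseInt parses the text itself; Integer.getInteger reads a system property and returns null
        return new Tuple2<Integer, String>(Integer.parseInt(tokens[3].trim()), tokens[2]);
    });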
I have started porting my PySpark application to a Java implementation. I am using Java 8, and I have just started running some basic Spark programs in Java. I used the following word count example.
SparkConf conf = new SparkConf().setMaster("local").setAppName("Work Count App");
// Create a Java version of the Spark Context from the configuration
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> lines = sc.textFile(filename);
JavaPairRDD<String, Integer> counts = lines.flatMap(line -> Arrays.asList(line.split(" ")))
.mapToPair(word -> new Tuple2(word, 1))
.reduceByKey((x, y) -> (Integer) x + (Integer) y)
.sortByKey();
I am getting a "Type mismatch: cannot convert from JavaRDD<Object> to JavaRDD<String>" error on lines.flatMap(line -> Arrays.asList(line.split(" "))).
When I googled, all the Java 8 based Spark examples I saw used the same implementation. What went wrong in my environment or in the program?
Can someone help me?
Use this code. The actual issue is that flatMap expects an Iterator<String> (as of Spark 2.x), while your code produces a List<String>. Calling iterator() on the list fixes the problem.
JavaPairRDD<String, Integer> counts = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<String, Integer>(word, 1))
    .reduceByKey((x, y) -> x + y)
    .sortByKey();
counts.foreach(data -> {
    System.out.println(data._1() + "-" + data._2());
});
Try this code:
JavaRDD<String> words =
    lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
JavaPairRDD<String, Integer> counts =
    words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
         .reduceByKey((x, y) -> x + y);
JavaRDD<String> obj = jsc.textFile("<Text File Path>");
JavaRDD<String> obj1 = obj.flatMap(l -> {
    ArrayList<String> al = new ArrayList<>();
    String[] str = l.split(" ");
    for (int i = 0; i < str.length; i++) {
        al.add(str[i]);
    }
    return al.iterator();
});
Try this:
JavaRDD<String> words = input.flatMap(
    new FlatMapFunction<String, String>() {
        public Iterator<String> call(String s) {
            return Arrays.asList(s.split(" ")).iterator();
        }
    });
I have a big CSV file, thousands of rows, and I want to aggregate some columns using Java code.
The file is in the form:
1,2012,T1
2,2015,T2
3,2013,T1
4,2012,T1
The results should be:
T, Year, Count
T1,2012, 2
T1,2013, 1
T2,2015, 1
Put your data into a Map-like structure, and each time a key (in your case "" + T + year) is found, add 1 to the stored value.
You can use a map like this (the key concatenates T and the year):
Map<String, Integer> rowMap = new HashMap<>();
rowMap.put("T1" + "2012", 1); // key "T12012"
rowMap.put("T2" + "2015", 1);
rowMap.put("T1" + "2013", 1);
Or you can define your own class with T and year fields, overriding hashCode and equals. Then you can use:
Map<YourClass, Integer> map = new HashMap<>();
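A minimal sketch of such a key class (the class name and field names are just illustrative):
class TypeYearKey {
    final String type; // e.g. "T1"
    final int year;    // e.g. 2012
    TypeYearKey(String type, int year) {
        this.type = type;
        this.year = year;
    }
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof TypeYearKey)) return false;
        TypeYearKey other = (TypeYearKey) o;
        return year == other.year && type.equals(other.type);
    }
    @Override
    public int hashCode() {
        return 31 * type.hashCode() + year;
    }
}
With a Map<TypeYearKey, Integer> map in place of YourClass above, counting each parsed row becomes map.merge(new TypeYearKey(fields[2], Integer.parseInt(fields[1])), 1, Integer::sum).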
String csv =
"1,2012,T1\n"
+ "2,2015,T2\n"
+ "3,2013,T1\n"
+ "4,2012,T1\n";
Map<String, Integer> map = new TreeMap<>();
BufferedReader reader = new BufferedReader(new StringReader(csv));
String line;
while ((line = reader.readLine()) != null) {
    String[] fields = line.split(",");
    String key = fields[2] + "," + fields[1];
    Integer value = map.get(key);
    if (value == null)
        value = 0;
    map.put(key, value + 1);
}
System.out.println(map);
// -> {T1,2012=2, T1,2013=1, T2,2015=1}
Use uniVocity-parsers for the best performance. It should take 1 second to process 1 million rows.
CsvParserSettings settings = new CsvParserSettings();
settings.selectIndexes(1, 2); // select the columns we are going to read
final Map<List<String>, Integer> results = new LinkedHashMap<List<String>, Integer>(); // stores the results here
// Use a custom implementation of RowProcessor
settings.setRowProcessor(new AbstractRowProcessor() {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        List<String> key = Arrays.asList(row); // converts the input array to a List - lists implement hashCode and equals based on their values so they can be used as keys on your map
        Integer count = results.get(key);
        if (count == null) {
            count = 0;
        }
        results.put(key, count + 1);
    }
});
// creates a parser with the above configuration and RowProcessor
CsvParser parser = new CsvParser(settings);
String input = "1,2012,T1"
    + "\n2,2015,T2"
    + "\n3,2013,T1"
    + "\n4,2012,T1";
// the parse() method will parse and submit all rows to your RowProcessor - use a FileReader to read a file instead of the String I'm using as an example
parser.parse(new StringReader(input));
// Here are the results:
for (Entry<List<String>, Integer> entry : results.entrySet()) {
    System.out.println(entry.getKey() + " -> " + entry.getValue());
}
Output:
[2012, T1] -> 2
[2015, T2] -> 1
[2013, T1] -> 1
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).