I have a list of lists which I am trying to process with Akka, and I want to perform an operation once all of the child lists are done processing. But "Complete" is printed before all the children have completed.
Basically, I am trying to read all the sheets in an Excel workbook and then read each row from each sheet. For this I am looking to use Akka to process each sheet separately, and within each sheet to process each row separately.
Sample Code:
List<List<String>> workbook = new ArrayList<List<String>>();
List<String> Sheet1 = new ArrayList<String>();
Sheet1.add("S");
Sheet1.add("a");
Sheet1.add("d");
List<String> Sheet2 = new ArrayList<String>();
Sheet2.add("S");
Sheet2.add("a1");
Sheet2.add("d");
workbook.add(Sheet1);
workbook.add(Sheet2);
final ActorSystem system = ActorSystem.create("Sys");
final ActorMaterializer materializer = ActorMaterializer.create(system);
Source.from(workbook).map(sheet -> {
    return Source.from(sheet).runWith(Sink.foreach(data -> {
        System.out.println(data);
        Thread.sleep(1000);
    }), materializer).toCompletableFuture();
}).runWith(Sink.ignore(), materializer).whenComplete((a, b) -> {
    System.out.println("Complete");
});
system.terminate();
The current output is:
S
S
Complete
a
a1
d
d
The expected output is:
S
S
a
a1
d
d
Complete
Could anyone please help?
Your use of a "stream within a stream" may be overcomplicating the process.
You could instead use Flow.flatMapConcat. I can only provide an example in Scala, but hopefully it translates easily to Java:
val flattenFlow: Flow[List[String], String, NotUsed] =
  Flow[List[String]].flatMapConcat(sheet => Source(sheet))

val flattenedSource: Source[String, NotUsed] = Source(workbook).via(flattenFlow)
There is a blog post with an example of using flatMapConcat in Java, but I don't know if my guessed type Flow.of(List<String>.class) is valid code.
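For reference, here is a rough Java sketch of the same idea (untested; it reuses the workbook, system and materializer variables from the question). flatMapConcat flattens each inner list into the outer stream, so the single materialized CompletionStage completes only after every row of every sheet has been processed, and the system is terminated inside the callback instead of right after the stream is started:
// Untested sketch: flatten all sheets into one stream, then react once it completes.
Source.from(workbook)
    .flatMapConcat(sheet -> Source.from(sheet))
    .runWith(Sink.foreach(data -> {
        System.out.println(data);
        Thread.sleep(1000);
    }), materializer)
    .whenComplete((done, failure) -> {
        System.out.println("Complete");
        system.terminate(); // terminate only after the whole stream is done
    });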
I created a simple Dataflow pipeline which consists of these steps:
Fetch/read data from BigQuery
Convert the output to CSV format
Create a CSV file on Google Cloud Storage
//TODO send CSV file to third party
pipeline.apply("ReadFromBigQuery",
BigQueryIO.read(new MyCustomObject1(input))
.fromQuery(myCustomQuery)
.usingStandardSql()
).apply("ConvertToCsv",
ParDo.of(new myCustomObject2())
).apply("WriteToCSV",
TextIO.write().to(fileLocation)
.withSuffix(".csv")
.withoutSharding()
.withDelimiter(new char[] {'\r', '\n'})
.withHeader(csvHeader)
);
But after step 3 (writing to Google Cloud Storage), I can't add another step to the Dataflow pipeline.
How can I achieve this?
That is because TextIO.write() returns a PDone instead of a PCollection, so no further PTransforms can be applied after it.
One possible solution: in your step 2, use a multi-output ParDo with tags to write to different locations.
final TupleTag<String> csvOutTag= new TupleTag<String>(){};
final TupleTag<String> furtherProcessingTag= new TupleTag<String>(){};
PCollectionTuple mixedCollection =
    bigQueryReadCollection.apply(ParDo
        .of(new DoFn<TableRow, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            // Emit to the main output (csvOutTag)
            c.output(c.element().toString());
            // Emit to the additional output tagged furtherProcessingTag
            c.output(furtherProcessingTag, c.element());
          }
        })
        .withOutputTags(csvOutTag, TupleTagList.of(furtherProcessingTag)));
// Get output with tag csvOutTag.
mixedCollection.get(csvOutTag).apply("WriteToCSV",
TextIO.write().to(fileLocation)
.withSuffix(".csv")
.withoutSharding()
.withDelimiter(new char[] {'\r', '\n'})
.withHeader(csvHeader));
// Get output with tag furtherProcessingTag.
mixedCollection.get(furtherProcessingTag).apply(...);
Please use appropriate data types in the TupleTag declarations, based on your output for further processing.
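Purely as an illustration (assuming the tag stays a TupleTag<String>; the ThirdPartySender call is hypothetical, not part of Beam), the further-processing branch could then be another ParDo, for example:
mixedCollection.get(furtherProcessingTag)
    .apply("SendToThirdParty", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // e.g. ThirdPartySender.send(c.element());  // hypothetical third-party client
        c.output(c.element());
      }
    }));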
I'm working on a heatmap project for my university. We have to get some data (212 GB) from a txt file (coordinates, height), then put it in HBase to retrieve it from a web client with Express.
I practiced using a 144 MB file, and this works:
SparkConf conf = new SparkConf().setAppName("PLE");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> data = context.textFile(args[0]);
Connection co = ConnectionFactory.createConnection(getConf());
createTable(co);
Table table = co.getTable(TableName.valueOf(TABLE_NAME));
Put put = new Put(Bytes.toBytes("KEY"));
for (String s : data.collect()) {
String[] tmp = s.split(",");
put.addImmutable(FAMILY,
Bytes.toBytes(tmp[2]),
Bytes.toBytes(tmp[0]+","+tmp[1]));
}
table.put(put);
But now that I use the 212 GB file, I get memory errors. I guess the collect method gathers all the data in memory, so 212 GB is too much.
So now I'm trying this :
SparkConf conf = new SparkConf().setAppName("PLE");
JavaSparkContext context = new JavaSparkContext(conf);
JavaRDD<String> data = context.textFile(args[0]);
Connection co = ConnectionFactory.createConnection(getConf());
createTable(co);
Table table = co.getTable(TableName.valueOf(TABLE_NAME));
Put put = new Put(Bytes.toBytes("KEY"));
data.foreach(line ->{
String[] tmp = line.split(",");
put.addImmutable(FAMILY,
Bytes.toBytes(tmp[2]),
Bytes.toBytes(tmp[0]+","+tmp[1]));
});
table.put(put);
And I'm getting "org.apache.spark.SparkException: Task not serializable". I searched for it and tried some fixes, without success, based on what I read here: Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects
Actually I don't understand everything in that topic; I'm just a student. Maybe the answer to my problem is obvious, maybe not. Anyway, thanks in advance!
As a rule of thumb, serializing database connections (of any type) doesn't make sense. They are not designed to be serialized and deserialized, Spark or not.
Create a connection for each partition:
data.foreachPartition(partition -> {
    Connection co = ConnectionFactory.createConnection(getConf());
    ... // All required setup
    Table table = co.getTable(TableName.valueOf(TABLE_NAME));
    Put put = new Put(Bytes.toBytes("KEY"));
    while (partition.hasNext()) {
        String line = partition.next();
        String[] tmp = line.split(",");
        put.addImmutable(FAMILY,
            Bytes.toBytes(tmp[2]),
            Bytes.toBytes(tmp[0] + "," + tmp[1]));
    }
    ... // Clean connections
});
I also recommend reading Design Patterns for using foreachRDD from the official Spark Streaming programming guide.
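Just to make the shape of that pattern concrete, a fuller (untested) sketch could look like the following; the write, the close() calls, and the use of HBaseConfiguration.create() instead of getConf() are my additions to the outline above, not something from your code:
data.foreachPartition(partition -> {
    // One connection per partition, created on the executor side.
    Connection co = ConnectionFactory.createConnection(HBaseConfiguration.create());
    Table table = co.getTable(TableName.valueOf(TABLE_NAME));
    Put put = new Put(Bytes.toBytes("KEY"));
    while (partition.hasNext()) {
        String[] tmp = partition.next().split(",");
        put.addImmutable(FAMILY,
            Bytes.toBytes(tmp[2]),
            Bytes.toBytes(tmp[0] + "," + tmp[1]));
    }
    table.put(put);  // write once per partition
    table.close();   // clean up per-partition resources
    co.close();
});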
So I have two CSV files I wish to compare.
Each file could be as much as 20 MB.
Each line has a key followed by the data, so key,data.
But the data is then separated by commas as well.
csv1.csv
KEY , DATA
AB45,12,15,65,NN
AB46,12,15,64,YY
AB47,45,85,95,YN
csv2.csv
AB45,12,15,65,NN
AB46,15,15,65,YY
AB48,65,45,60,YY
What I want to do is read both files and compare the data for each key.
I was thinking of reading each file line by line and adding the rows to a TreeMap. I could then compare the data for a given key and, if there is a difference, write it to another file.
Any advice?
I am unsure how to read the files and extract just the keys and data in an efficient way.
Use a dedicated CSV parsing library to speed things up. With uniVocity-parsers you can parse these 20 MB files in 100 ms or less. The following solution is a bit involved, to prevent loading too much data into memory. Check the tutorial I linked above; there are many ways to accomplish what you need with this library.
First we read one of the CSV files and generate a Map:
public static void main(String... args) {
//First we parse one file (ideally the smaller one)
CsvParserSettings settings = new CsvParserSettings();
//here we tell the parser to read the CSV headers
settings.setHeaderExtractionEnabled(true);
CsvParser parser = new CsvParser(settings);
//Parse all data into a list.
List<String[]> records = parser.parseAll(new File("/path/to/csv1.csv"));
//Convert that list into a map. The first column of this input will produce the keys.
Map<String, String[]> mapOfRecords = toMap(records);
//this is where the magic happens.
processFile(new File("/path/to/csv2.csv"), new File("/path/to/diff.csv"), mapOfRecords);
}
This is the code to generate a Map from the list of records:
/* Converts a list of records to a map. Uses element at index 0 as the key */
private static Map<String, String[]> toMap(List<String[]> records) {
HashMap<String, String[]> map = new HashMap<String, String[]>();
for (String[] row : records) {
//column 0 will always have an ID.
map.put(row[0], row);
}
return map;
}
With the map of records, we can process your second file and generate another with any updates found:
private static void processFile(final File input, final File output, final Map<String, String[]> mapOfExistingRecords) {
//configures a new parser again
CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true);
//All parsed rows will be submitted to the following Processor. This way you won't have to store all rows in memory.
settings.setProcessor(new RowProcessor() {
//will write the changed rows to another file
CsvWriter writer;
@Override
public void processStarted(ParsingContext context) {
CsvWriterSettings settings = new CsvWriterSettings(); //configure as needed
writer = new CsvWriter(output, settings);
}
@Override
public void rowProcessed(String[] row, ParsingContext context) {
// Incoming rows will have the ID at index 0.
// If the map contains the ID, we'll get a row
String[] existingRow = mapOfExistingRecords.get(row[0]);
if (!Arrays.equals(row, existingRow)) {
writer.writeRow(row);
}
}
@Override
public void processEnded(ParsingContext context) {
writer.close();
}
});
CsvParser parser = new CsvParser(settings);
//the parse() method will submit all rows to the RowProcessor defined above. All differences will be
//written to the output file.
parser.parse(input);
}
This should work just fine. I hope it helps you.
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
I work with a lot of CSV file comparisons for my job. I didn't know Python before I started working, but I picked it up really quickly. If you want to compare CSV files quickly, Python is a wonderful way to go, and it's fairly easy to pick up if you know Java.
I modified a script I use to fit your basic use case (you'll need to modify it a bit more to do exactly what you want). It runs in a few seconds when I use it to compare CSV files with millions of rows. If you need to do this in Java, you can pretty much transfer this to some Java methods. There are similar CSV libraries you can use that will replace all the csv functions below.
import csv, sys, itertools

def getKeyPosition(header_row, key_value):
    counter = 0
    for header in header_row:
        if header == key_value:
            return counter
        counter += 1

# This will create a dictionary of your rows by their key.
# (key_position is the column location of the key)
def getKeyDict(csv_reader, key_position):
    key_dict = {}
    row_counter = 0
    unique_records = 0
    for row in csv_reader:
        row_counter += 1
        if row[key_position] not in key_dict:
            key_dict.update({row[key_position]: row})
            unique_records += 1
    # My use case requires a lot of checking for duplicates
    if unique_records != row_counter:
        print "Duplicate Keys in File"
    return key_dict

def main():
    f1 = open(sys.argv[1])
    f2 = open(sys.argv[2])

    f1_csv = csv.reader(f1)
    f2_csv = csv.reader(f2)

    f1_header = next(f1_csv)
    f2_header = next(f2_csv)

    f1_header_key_position = getKeyPosition(f1_header, "KEY")
    f2_header_key_position = getKeyPosition(f2_header, "KEY")

    f1_row_dict = getKeyDict(f1_csv, f1_header_key_position)
    f2_row_dict = getKeyDict(f2_csv, f2_header_key_position)

    outputFile = open("KeyDifferenceFile.csv", 'w')
    writer = csv.writer(outputFile)
    writer.writerow(f1_header)

    # Here's the logic for comparing rows
    for key, row_1 in f1_row_dict.iteritems():
        # Do whatever comparisons you need here.
        if key not in f2_row_dict:
            print "Oh no, this key doesn't exist in file 2"
        if key in f2_row_dict:
            row_2 = f2_row_dict.get(key)
            if row_1 != row_2:
                print "oh no, the two rows don't match!"

            # You can get more header keys to compare by if you want.
            data_position = getKeyPosition(f2_header, "DATA")
            row_1_data = row_1[data_position]
            row_2_data = row_2[data_position]
            if row_1_data != row_2_data:
                print "oh no, the data doesn't match!"

            # Here's how you'd write the rows.
            # (My original script subtracted numeric columns; for text data,
            #  flag the columns that differ instead.)
            row_to_write = []
            for row_1_column, row_2_column in itertools.izip(row_1, row_2):
                if row_1_column == row_2_column:
                    row_to_write.append(row_1_column)
                else:
                    row_to_write.append(row_1_column + " -> " + row_2_column)
            writer.writerow(row_to_write)

    # Make sure to close those files!
    f1.close()
    f2.close()
    outputFile.close()

main()
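If you'd rather stay in Java, a rough (untested) equivalent of the idea above could look like this. It splits naively on the first comma, which is fine for the sample data shown, but a real CSV library would handle quoting properly:
import java.io.*;
import java.util.*;

// Untested sketch: compares csv2.csv against csv1.csv by the key in column 0
// and writes rows that are missing or different to diff.csv.
public class CsvDiff {
    public static void main(String[] args) throws IOException {
        Map<String, String> csv1 = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader("csv1.csv"))) {
            r.readLine(); // skip the "KEY , DATA" header
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split(",", 2); // key, rest of line
                if (parts.length < 2) continue;
                csv1.put(parts[0], parts[1]);
            }
        }
        try (BufferedReader r = new BufferedReader(new FileReader("csv2.csv"));
             PrintWriter diff = new PrintWriter(new FileWriter("diff.csv"))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts.length < 2) continue;
                String existing = csv1.get(parts[0]);
                if (existing == null || !existing.equals(parts[1])) {
                    diff.println(line); // key missing in csv1 or data differs
                }
            }
        }
    }
}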
I have a big table in HBase named UserAction, and it has three column families (song, album, singer). I need to fetch all of the data from the 'song' column family as a JavaRDD object. I tried this code, but it's not efficient. Is there a better way to do this?
static SparkConf sparkConf = new SparkConf().setAppName("test").setMaster(
"local[4]");
static JavaSparkContext jsc = new JavaSparkContext(sparkConf);
static void getRatings() {
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "UserAction");
conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "song");
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = jsc
.newAPIHadoopRDD(
conf,
TableInputFormat.class,
org.apache.hadoop.hbase.io.ImmutableBytesWritable.class,
org.apache.hadoop.hbase.client.Result.class);
JavaRDD<Rating> count = hBaseRDD
.map(new Function<Tuple2<ImmutableBytesWritable, Result>, JavaRDD<Rating>>() {
@Override
public JavaRDD<Rating> call(
Tuple2<ImmutableBytesWritable, Result> t)
throws Exception {
Result r = t._2;
int user = Integer.parseInt(Bytes.toString(r.getRow()));
ArrayList<Rating> ra = new ArrayList<>();
for (Cell c : r.rawCells()) {
int product = Integer.parseInt(Bytes
.toString(CellUtil.cloneQualifier(c)));
double rating = Double.parseDouble(Bytes
.toString(CellUtil.cloneValue(c)));
ra.add(new Rating(user, product, rating));
}
return jsc.parallelize(ra);
}
})
.reduce(new Function2<JavaRDD<Rating>, JavaRDD<Rating>, JavaRDD<Rating>>() {
@Override
public JavaRDD<Rating> call(JavaRDD<Rating> r1,
JavaRDD<Rating> r2) throws Exception {
return r1.union(r2);
}
});
jsc.stop();
}
The song column family schema design is:
RowKey = userID, columnQualifier = songID, and value = rating.
UPDATE: OK, I see your problem now: for some crazy reason you're turning your arrays into RDDs with return jsc.parallelize(ra);. Why are you doing that?? Why are you creating an RDD of RDDs?? Why not leave them as arrays? When you do the reduce you can then concatenate the arrays. An RDD is a Resilient Distributed Dataset - it does not make logical sense to have a Distributed Dataset of Distributed Datasets. I'm surprised your job even runs and doesn't crash! Anyway, that's why your job is so slow.
Anyway, in Scala after your map, you would just do a flatMap(identity) and that would concatenate all your lists together.
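In Java that would roughly mean replacing the map + jsc.parallelize + reduce with a single flatMap that returns the per-row list of Ratings directly (an untested sketch reusing the names from the question; note that FlatMapFunction returns an Iterable in Spark 1.x but an Iterator in 2.x):
JavaRDD<Rating> ratings = hBaseRDD.flatMap(t -> {
    Result r = t._2;
    int user = Integer.parseInt(Bytes.toString(r.getRow()));
    List<Rating> ra = new ArrayList<>();
    for (Cell c : r.rawCells()) {
        int product = Integer.parseInt(Bytes.toString(CellUtil.cloneQualifier(c)));
        double rating = Double.parseDouble(Bytes.toString(CellUtil.cloneValue(c)));
        ra.add(new Rating(user, product, rating));
    }
    return ra.iterator(); // Spark 2.x; in Spark 1.x return ra (an Iterable) instead
});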
I don't really understand why you need to do a reduce; maybe that is where you have something inefficient going on. Here is my code to read HBase tables (it's generalized, i.e. it works for any schema). One thing to be careful of: when you read the HBase table, make sure the number of partitions is suitable (usually you want a lot).
type HBaseRow = java.util.NavigableMap[Array[Byte],
java.util.NavigableMap[Array[Byte], java.util.NavigableMap[java.lang.Long, Array[Byte]]]]
// Map(CF -> Map(column qualifier -> Map(timestamp -> value)))
type CFTimeseriesRow = Map[Array[Byte], Map[Array[Byte], Map[Long, Array[Byte]]]]
def navMapToMap(navMap: HBaseRow): CFTimeseriesRow =
navMap.asScala.toMap.map(cf =>
(cf._1, cf._2.asScala.toMap.map(col =>
(col._1, col._2.asScala.toMap.map(elem => (elem._1.toLong, elem._2))))))
def readTableAll(table: String): RDD[(Array[Byte], CFTimeseriesRow)] = {
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, table)
sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
.map(kv => (kv._1.get(), navMapToMap(kv._2.getMap)))
}
As you can see, I have no need for a reduce in my code. The methods are pretty self-explanatory. I could dig further into your code, but I lack the patience to read Java as it's so epically verbose.
I have some more code specifically for fetching the most recent elements from the row (rather than the entire history). Let me know if you want to see that.
Finally, I recommend you look into using Cassandra over HBase, as DataStax is partnering with Databricks.
I am running a Pig script using the Java API (PigServer.registerScript).
I need to find out the number of records processed and the number of output records using the Java API.
How can I implement this?
I have a simple Pig script as follows:
A = LOAD '$input_file_path' USING PigStorage('$') as (id:double,name:chararray,code:int);
Dump A;
B = FOREACH A GENERATE PigUdf2(id),name;
Dump B;
and the Java code is:
PigStats ps;
HashMap<String, String> m = new HashMap();
Path p = new Path("/home/shweta/Desktop/debugging_pig_udf/pig_in");
m.put("input_file_path", p.toString());
PigServer pigServer = new PigServer(ExecType.LOCAL);
pigServer.registerScript("/home/shweta/Desktop/debugging_pig_udf/pig_script.pig",m);
I need to find the input and output record counts using the Java API.
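Not a full answer, but one direction that may be worth trying (untested, and the stats classes can differ between Pig versions): instead of relying on the Dump statements, store the relation through PigServer and read the record counts from the returned job's PigStats. The output path below is hypothetical.
// Untested sketch; InputStats/OutputStats/PigStats come from org.apache.pig.tools.pigstats,
// ExecJob from org.apache.pig.backend.executionengine. "/tmp/pig_out" is a hypothetical path.
ExecJob job = pigServer.store("B", "/tmp/pig_out");
PigStats stats = job.getStatistics();

long inputRecords = 0;
for (InputStats is : stats.getInputStats()) {
    inputRecords += is.getNumberRecords();   // records read per input
}
long outputRecords = 0;
for (OutputStats os : stats.getOutputStats()) {
    outputRecords += os.getNumberRecords();  // records written per output
}
System.out.println("input records: " + inputRecords + ", output records: " + outputRecords);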