Adding a header to processed RDDs in Spark Java

My question is almost the same as Add a header before text file on save in Spark. The difference is that my header RDD is
String headerSTR = "inc_id,po_id,ass,inci_type,cat,sub_cat";
JavaRDD<String> PMheader = jsc.parallelize(Arrays.asList(headerSTR));
And my lines RDD is of PMTable type:
JavaRDD<PMTable> rdd_records = noheader.map(new Function<String, PMTable>() {
    public PMTable call(String line) {
        // --- (parsing logic elided)
        PMTable sd = new PMTable(/* ---- */);
        return sd;
    }
});
rdd_records.saveAsTextFile();
mergeAllFiles();
I have merged all the result files into a single CSV file, which does not contain the header. Now I need the union of the header RDD and the lines RDD, but the method union(JavaRDD) in the type JavaRDD is not applicable for an argument of type JavaRDD<PMTable>. So how can I get the union of the header and the lines using the Spark Java API?
Thanks in advance.
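One way to resolve the type mismatch (a rough sketch, not from the original thread) is to turn the PMTable records back into CSV strings before the union, so both sides are JavaRDD<String>. The toCsvLine() method and the output path below are made up; substitute whatever produces the comma-separated line for a PMTable:
JavaRDD<String> lineStrings = rdd_records.map(new Function<PMTable, String>() {
    public String call(PMTable record) {
        // toCsvLine() is assumed; join the PMTable fields with commas here.
        return record.toCsvLine();
    }
});
// Both RDDs are now JavaRDD<String>, so union compiles.
JavaRDD<String> withHeader = PMheader.union(lineStrings);
// union keeps the header RDD's partitions first, so writing a single
// partition generally puts the header line at the top; verify on your data.
withHeader.coalesce(1).saveAsTextFile("output_with_header");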

Related

Karate: In my CSV file, columns do not have the same row count. While reading the data, empty values are added for columns with fewer rows

My CSV file data (one column is HeaderText with 6 rows, the other is accountBtn with 4 rows):
accountBtn,HeaderText
New Case,Type
New Note,Phone
New Contact,Website
,Account Owner
,Account Site
,Industry
When I read the file with the code below:
* def csvData = read('../TestData/Button.csv')
* def expectedButton = karate.jsonPath(csvData,"$..accountBtn")
* def eHeaderTest = karate.jsonPath(csvData,"$..HeaderText")
The data set generated by the code is: ["New Case","New Note","New Contact","","",""]
My expected data set is: ["New Case","New Note","New Contact"]
Any idea how this can be handled?
That's how it is in Karate and it shouldn't be a concern since you are just using it as data to drive a test. You can run a transform to convert empty strings to null if required: https://stackoverflow.com/a/56581365/143475
Else please consider contributing code to make Karate better!
The other option is to use JSON as a data-source instead of CSV: https://stackoverflow.com/a/47272108/143475
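For example, a minimal sketch of such a transform using karate.filter to drop the padded empty strings (the arrow-function syntax assumes Karate 1.0+; adjust for older versions):
* def csvData = read('../TestData/Button.csv')
* def allButtons = karate.jsonPath(csvData, "$..accountBtn")
* def expectedButton = karate.filter(allButtons, x => x != '')
* match expectedButton == ['New Case', 'New Note', 'New Contact']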

Weird error while parsing JSON in Apache Spark

Trying to parse a JSON document and Spark gives me an error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.buildReader(JsonFileFormat.scala:120)
...
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
at org.apache.spark.sql.Dataset.show(Dataset.scala:746)
at org.apache.spark.sql.Dataset.show(Dataset.scala:705)
at xxx.MyClass.xxx(MyClass.java:25)
I already tried to open the JSON doc in several online editors and it's valid.
This is my code:
Dataset<Row> df = spark.read()
.format("json")
.load("file.json");
df.show(3); // this is line 25
I am using Java 8 and Spark 2.4.
The _corrupt_record column is where Spark stores malformed records when it tries to ingest them. That could be a hint.
Spark also processes two types of JSON documents, JSON Lines and normal JSON (in earlier versions, Spark could only handle JSON Lines). You can find more in this Manning article.
You can try the multiline option, as in:
Dataset<Row> df = spark.read()
.format("json")
.option("multiline", true)
.load("file.json");
to see if it helps. If not, share your JSON doc (if you can).
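To make the distinction concrete: a JSON Lines file (Spark's default expectation) holds one complete object per line, for example
{"id": 1, "name": "a"}
{"id": 2, "name": "b"}
while a "normal" JSON document spans several lines and needs the multiline option, for example
[
  {"id": 1, "name": "a"},
  {"id": 2, "name": "b"}
]
(the field names here are made up for illustration).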

JavaPairRDD to Dataset<Row> in SPARK

I have data in JavaPairRDD in format
JavaPairRDD<Tuple2<String, Tuple2<String,String>>>
I tried using below code
Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");
But how do I generate a Dataset with 3 columns? The output of the above code gives me data in 2 columns. Any pointers/suggestions?
Try running printSchema - you will see that value2 is a complex type.
Having such information, you can write:
Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3")
value2._1 means the first element of the tuple inside the current "value2" field. We overwrite the value2 field so that it holds only one value.
Note that this will only work after https://issues.apache.org/jira/browse/SPARK-24548 is merged to the master branch. Currently there is a bug in Spark where a tuple is converted to a struct with two fields, both named value.
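Putting the question's encoder and the selectExpr step together, a rough end-to-end sketch (MY_RDD and the column names are taken from the question; the SPARK-24548 caveat above still applies):
Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 =
    Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(), Encoders.STRING()));

Dataset<Row> userViolationsDetails = spark
    .createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2)
    .toDF("value1", "value2");

// printSchema shows value2 as a struct of two string fields
userViolationsDetails.printSchema();

Dataset<Row> threeColumns = userViolationsDetails
    .selectExpr("value1", "value2._1 as value2", "value2._2 as value3");
threeColumns.show();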

Splitting data in CSV file

Below is the data format in my CSV file
userid,group,username,status
In my Java code I split the data using , as the delimiter.
E.g., a normal scenario in which my code works fine:
1001,admin,ram,active
In this scenario (a user with a first name and last name), when I take the status of user 1002 it comes out as kumar, since the 4th column is treated as the status:
1002,User,ravi,kumar,active
Kindly help me change the code logic so that it works for both scenarios.
You can use the OpenCSV library.
CSVReader csvReader = new CSVReader(new FileReader(fileName)); // the default separator is a comma
List<String[]> rows = csvReader.readAll();
Then you can test the first column, e.g. if ("1002".equals(rows.get(0)[0])) ...
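A rough, self-contained sketch of that suggestion (the class name and file name are made up; assumes a recent com.opencsv artifact). Because the extra comma only affects the middle columns, reading the status from the last column works for both row shapes:
import com.opencsv.CSVReader;
import java.io.FileReader;
import java.util.List;

public class StatusFromCsv {
    public static void main(String[] args) throws Exception {
        try (CSVReader csvReader = new CSVReader(new FileReader("users.csv"))) {
            List<String[]> rows = csvReader.readAll();
            for (String[] row : rows) {
                // The status is always the final column, even when the name
                // spills into an extra field (e.g. 1002,User,ravi,kumar,active).
                String status = row[row.length - 1];
                System.out.println(row[0] + " -> " + status);
            }
        }
    }
}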

How to store grouped records into multiple files with Pig?

After loading and grouping records, how can I store those grouped records into several files, one per group (=userid)?
records = LOAD 'input' AS (userid:int, ...);
grouped_records = GROUP records BY userid;
I'm using Apache Pig version 0.8.1-cdh3u3 (rexported)
Indeed, there is a MultiStorage class at Piggybank which does exactly what I want - it splits the records by a specified attribute (at index '0' in my example):
STORE records INTO 'output' USING org.apache.pig.piggybank.storage.MultiStorage('output', '0', 'none', ',');
From the MultiStorage Javadoc:
A = LOAD 'mydata' USING PigStorage() as (a, b, c);
STORE A INTO '/my/home/output' USING MultiStorage('/my/home/output','0', 'bz2', '\\t');
Parameters:
parentPathStr - Parent output dir path
splitFieldIndex - key field index
compression - 'bz2', 'bz', 'gz' or 'none'
fieldDel - Output record field delimiter.
Reference: GrepCode
