Forming DataFrames from CSV files with different headers in Spark - Java

I am trying to read a folder of gzipped CSVs (without extension), each with a list of variables, e.g.:
CSV file 1: TIMESTAMP | VAR1 | VAR2 | VAR3
CSV file 2: TIMESTAMP | VAR1 | VAR3
Each file represents a day. The order of the columns can differ (or columns can be missing from one file).
The first option, reading the whole folder in one shot using spark.read, is discarded because combining the files takes the column order into account rather than the column names.
My next option is to read file by file:
for (String key : pathArray) {
    Dataset<Row> rawData = spark.read().option("header", true).csv(key);
    allDatasets.add(rawData);
}
And then do a full outer join on the column names:
Dataset<Row> data = allDatasets.get(0);
for (int i = 1; i < allDatasets.size(); i++) {
    ArrayList<String> columns = new ArrayList<>(Arrays.asList(data.columns()));
    columns.retainAll(new ArrayList<>(Arrays.asList(allDatasets.get(i).columns())));
    data = data.join(allDatasets.get(i), JavaConversions.asScalaBuffer(columns), "outer");
}
But this process is very slow, as it loads one file at a time.
The next approach is to use sc.binaryFiles, because with sc.readFiles it is not possible to work around adding custom Hadoop codecs (needed in order to read gzipped files without the .gz extension).
Using the latter approach and translating this code to Java, I have the following:
A JavaPairRDD<String, Iterable<Tuple2<String, String>>> containing the name of the variable (VAR1) and an iterable of TIMESTAMP,VALUE tuples for that variable.
From this I would like to form a DataFrame representing all the files together; however, I am completely lost on how to transform this final PairRDD into a DataFrame. An example of the final DataFrame I would like to have is the following:
TIMESTAMP | VAR1 | VAR2 | VAR3
01        | 32   | 12   | 32    ==> start of contents of file 1
02        | 10   | 5    | 7     ==> end of contents of file 1
03        | 1    |      | 5     ==> start of contents of file 2
04        | 4    |      | 8     ==> end of contents of file 2
Any suggestions or ideas?

Finally I got it with very good performance:
Reading by month in the "background" (using a Java Executor to read the other folders of CSVs in parallel); with this approach the time the driver spends scanning each folder is reduced, because it is done in parallel.
Next, the process extracts on the one hand the headers and on the other hand their contents (tuples of varname, timestamp, value).
Finally, union the contents using the RDD API and build the DataFrame with the headers. A rough sketch of that last step is shown below.
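For reference, here is a rough Java sketch of one way to turn the (VAR, (TIMESTAMP, VALUE)) pairs described above into the wide DataFrame. It reshapes the data via groupBy/pivot rather than the exact RDD unions used in the final solution, and the names (spark, varTuples, WideDataFrameSketch) are placeholders, not code from the question:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.Tuple2;

import static org.apache.spark.sql.functions.first;

public class WideDataFrameSketch {

    // varTuples: variable name -> iterable of (TIMESTAMP, VALUE) pairs, as described above.
    public static Dataset<Row> toWideDataFrame(
            SparkSession spark,
            JavaPairRDD<String, Iterable<Tuple2<String, String>>> varTuples) {

        // 1. Flatten into long format: one Row per (TIMESTAMP, VARNAME, VALUE).
        JavaRDD<Row> longRows = varTuples.flatMap(pair -> {
            List<Row> rows = new ArrayList<>();
            for (Tuple2<String, String> tsValue : pair._2()) {
                rows.add(RowFactory.create(tsValue._1(), pair._1(), tsValue._2()));
            }
            return rows.iterator();
        });

        StructType longSchema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("TIMESTAMP", DataTypes.StringType, false),
            DataTypes.createStructField("VARNAME", DataTypes.StringType, false),
            DataTypes.createStructField("VALUE", DataTypes.StringType, true)
        });
        Dataset<Row> longDf = spark.createDataFrame(longRows, longSchema);

        // 2. Pivot to wide format: one column per variable. Timestamps for which a file
        //    has no value for a variable simply end up as null, matching the example above.
        return longDf.groupBy("TIMESTAMP")
                     .pivot("VARNAME")
                     .agg(first("VALUE"));
    }
}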

Related

I want to convert a report that is in text format into an xlsx document, but the problem is that the data in the text file has some missing column values.

Typical report data looks like this (posted as an image).
A simple approach I wanted to follow was to use whitespace as a delimiter, but the data is not well structured.
Read the first line of the file and split it into columns by checking if there is more than one whitespace character; in addition, record where each column starts.
After that you can simply go through the other rows containing data and extract the information by slicing each line at those column positions.
(And please don't put images of text into Stack Overflow; actual text is better.)
EDIT:
Python implementation:
import pandas as pd
import re

file = "path/to/file.txt"

with open(file, "r") as f:
    # Header line: split on runs of spaces to get the column names,
    # and record the character offset where each column starts.
    line = f.readline()
    columns = re.split(" +", line)
    column_sizes = [re.finditer(column, line).__next__().start() for column in columns]
    column_sizes.append(-1)

    # Discard the next line (e.g. a separator row under the header).
    f.readline()

    rows = []
    while True:
        line = f.readline()
        if len(line) == 0:
            break
        elif line[-1] != "\n":
            line += "\n"
        # Slice each data line at the recorded column offsets.
        row = []
        for i in range(len(column_sizes) - 1):
            value = line[column_sizes[i]:column_sizes[i + 1]]
            row.append(value)
        rows.append(row)

columns = [column.strip() for column in columns]
df = pd.DataFrame(data=rows, columns=columns)
print(df)
df.to_excel(file.split(".")[0] + ".xlsx")
You are correct that export from text to CSV is not a practical start; however, it would be good for import. So here is your 100% well-structured source text, saved as plain text.
And here is the import into Excel.
You can use Google Lens to get your data out of this picture, then copy and paste it into an Excel file; that is the easiest way.
Or first convert it into a PDF and then use Google Lens: go to File and scroll to the Print option; in the print settings there is a "Microsoft Print to PDF" option. Select that, press Print, and it will ask you for a location; choose one and use the resulting PDF.

Karate: In my CSV file, the columns do not have the same row count. While reading the data, empty values are added for the columns with fewer rows

My CSV file data: one column is HeaderText (6 rows) and the other is accountBtn (4 rows):
accountBtn,HeaderText
New Case,Type
New Note,Phone
New Contact,Website
,Account Owner
,Account Site
,Industry
When I read the file with the code below:
* def csvData = read('../TestData/Button.csv')
* def expectedButton = karate.jsonPath(csvData,"$..accountBtn")
* def eHeaderTest = karate.jsonPath(csvData,"$..HeaderText")
the data set generated by this code is: ["New Case","New Note","New Contact","","",""]
My expected data set is: ["New Case","New Note","New Contact"]
Any idea how can this be handled?
That's how it is in Karate, and it shouldn't be a concern since you are just using it as data to drive a test. You can run a transform to convert empty strings to null if required: https://stackoverflow.com/a/56581365/143475
Else, please consider contributing code to make Karate better!
The other option is to use JSON as a data-source instead of CSV: https://stackoverflow.com/a/47272108/143475

How to Convert DataSet<Row> to DataSet of JSON messages to write to Kafka?

I use Spark 2.1.1.
I have the following DataSet<Row> ds1;
name | ratio | count // column names
"hello" | 1.56 | 34
(ds1.isStreaming gives true)
and I am trying to generate DataSet<String> ds2. In other words, when I write to a Kafka sink I want to write something like this:
{"name": "hello", "ratio": 1.56, "count": 34}
I have tried something like df2.toJSON().writeStream().foreach(new KafkaSink()).start(), but then it gives the following error:
Queries with streaming sources must be executed with writeStream.start()
There are to_json and json_tuple; however, I am not sure how to leverage them here.
I tried the following using the json_tuple() function:
Dataset<String> df4 = df3.select(json_tuple(new Column("result"), " name", "ratio", "count")).as(Encoders.STRING());
and I get the following error:
cannot resolve 'result' given input columns: [name, ratio, count];;
tl;dr Use the struct function followed by to_json (as toJSON was broken for streaming datasets due to SPARK-17029, which was fixed just 20 days ago).
Quoting the scaladoc of struct:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
Given you use Java API you have 4 different variants of struct function, too:
public static Column struct(Column... cols) Creates a new struct column.
With to_json function your case is covered:
public static Column to_json(Column e) Converts a column containing a StructType into a JSON string with the specified schema.
The following is Scala code (translating it to Java is your home exercise; a rough Java sketch is included after the example):
val ds1 = Seq(("hello", 1.56, 34)).toDF("name", "ratio", "count")
val recordCol = to_json(struct("name", "ratio", "count")) as "record"
scala> ds1.select(recordCol).show(truncate = false)
+----------------------------------------+
|record |
+----------------------------------------+
|{"name":"hello","ratio":1.56,"count":34}|
+----------------------------------------+
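Since the question uses the Java API, here is a hedged Java equivalent of the batch example above (ds1 is the question's DataSet<Row>; the "value" alias is chosen because it is the column name a Kafka sink expects, and the class/method names are illustrative):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.struct;
import static org.apache.spark.sql.functions.to_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

public class ToJsonExample {

    // ds1 has the columns name, ratio and count, as in the question.
    public static Dataset<String> toJsonRecords(Dataset<Row> ds1) {
        // struct() bundles the three columns into one struct column,
        // to_json() serializes that struct into a JSON string per row.
        return ds1
            .select(to_json(struct(col("name"), col("ratio"), col("count"))).as("value"))
            .as(Encoders.STRING());
    }
}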
I've also given your solution a try (with Spark 2.3.0-SNAPSHOT built today) and it seems to work perfectly:
val fromKafka = spark.
readStream.
format("kafka").
option("subscribe", "topic1").
option("kafka.bootstrap.servers", "localhost:9092").
load.
select('value cast "string")
fromKafka.
toJSON. // <-- JSON conversion
writeStream.
format("console"). // using console sink
start
format("kafka") was added in SPARK-19719 and is not available in 2.1.0.

Talend: tMap NullPointerException while merging two CSV files

I want to merge two CSV files. The problem that I am facing is that one of the two CSV files has dynamic columns.
e.g.
The first CSV file has two columns, A and G. Column G has comma-separated values:
A | G |<-Column Names
--|---------|
A1| G1,G2,G3| <-Row
A2| G2,G5,G6|<-Row
The second CSV file has dynamic columns, but it will always have the column A (uid), e.g.:
A | C1 |C2 |Othercolumns|<-Column Names
--|-------|---------|------------|
A1|C1Value|C2Value | |<-Row
A2|C1Value| C2Value | |<-Row
I want to merge these two files So the output will be:
A |G | C1 |C2 |Othercolumns|<-Column Names
--|-----------|-------|---------|------------|
A1| G1,G2,G3 |C1Value|C2Value | |<-Row
A2| G2,G5,G6 |C1Value| C2Value | |<-Row
Here is the job.
I didn't check the "Include Header" option in tFileOutputDelimited_1.
This merges the CSV files correctly, but does not bring in the column information of the 2nd CSV file (the one with dynamic columns). The output is as shown below:
A |G | | | |
--|-----------|-------|---------|------------|
A1| G1,G2,G3 |C1Value|C2Value | |<-Row
A2| G2,G5,G6 |C1Value| C2Value | |<-Row
To get the column names, I checked the "Include Header" option on the output file, but then I get the exception below:
java.lang.NullPointerException
at routines.system.DynamicUtils.writeHeaderToDelimitedFile(DynamicUtils.java:72)
at content.csvmergetest_0_1.CSVMergeTest.tFileInputDelimited_2Process(CSVMergeTest.java:2696)
at content.csvmergetest_0_1.CSVMergeTest.runJobInTOS(CSVMergeTest.java:3109)
at content.csvmergetest_0_1.CSVMergeTest.main(CSVMergeTest.java:2975)
As shown below, in this case only one row is fetched from tFileInputDelimited_2. I guess that row is the header row, and that is why the NullPointerException occurs.
Why is this happening? How can I get the headers?
Please let me know how I can achieve this.
Read in the file with the "othercolumns" as one column of type Dynamic.
Before joining in tMap you need to extract the A column from it (a rough code sketch of reading the Dynamic column follows the sample output below):
Then take care to have only one Dynamic type in the output schema, because Talend cannot handle two.
The result file, including one header line and one "othercolumns" column Z, looks as follows:
A;G;C1;C2;Z
A1;G1,G2,G3;C1Value;C2Value;Z1
A2;G2,G5,G6;C1Value;C2Value;Z2
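If you ever need to get at the Dynamic column in code instead of through tMap (for example in a tJavaRow), a rough, assumption-heavy sketch could look like the following; the routines.system.Dynamic / DynamicMetadata API and the input_row/output_row names are recalled from Talend's generated code and are not part of the answer above:

// Inside a tJavaRow whose input schema has the Dynamic column "othercolumns":
routines.system.Dynamic dyn = input_row.othercolumns;
for (int i = 0; i < dyn.getColumnCount(); i++) {
    routines.system.DynamicMetadata meta = dyn.getColumnMetadata(i);
    if ("A".equals(meta.getName())) {
        // Copy the uid out of the dynamic bundle so it can be used as the join key.
        output_row.A = String.valueOf(dyn.getColumnValue(i));
    }
}
// Pass the dynamic columns through unchanged.
output_row.othercolumns = dyn;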

How to break a string from an Excel file into substrings and load it?

I'm currently working on a Talend job. I need to load from an Excel file into an Oracle 11g database.
I can't figure out how to split a field of my Excel input file within Talend and load the resulting pieces into the database.
For example I've got a field like this:
toto:12;tata:1;titi:15
And I need to load it into a table, for example grade:
| name | grade |
|------|-------|
| toto |12 |
| titi |15 |
| tata |1 |
|--------------|
Thanks in advance.
In a Talend job, you can use tFileInputExcel to read your Excel file, and then tNormalize to split your special column into individual rows with a separator of ";". After that, use tExtractDelimitedFields with a separator of ":" to split the normalized column into name and grade columns. Then you can use a tOracleOutput component to write the result to the database.
While this solution is more verbose than the Java snippet suggested by AlexR, it has the advantage that it stays within Talend's graphical programming model.
for (String pair : str.split(";")) {
    String[] kv = pair.split(":");
    // at this point you have the separated values
    String name = kv[0];
    String grade = kv[1];
    dbInsert(name, grade);
}
Now you have to implement dbInsert(). Do it either using JDBC or using any higher-level tool (e.g. Hibernate, iBatis, JDO, JPA, etc.); a minimal JDBC sketch follows.
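As a hedged sketch of the JDBC route (the connection URL, credentials and the grade(name, grade) table are assumptions based on the question, not given code):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class GradeLoader {

    static void dbInsert(Connection conn, String name, String grade) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO grade (name, grade) VALUES (?, ?)")) {
            ps.setString(1, name);
            ps.setInt(2, Integer.parseInt(grade.trim()));
            ps.executeUpdate();
        }
    }

    public static void main(String[] args) throws Exception {
        String str = "toto:12;tata:1;titi:15";
        // Placeholder connection details; adjust to your Oracle 11g instance.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/ORCL", "user", "password")) {
            for (String pair : str.split(";")) {
                String[] kv = pair.split(":");
                dbInsert(conn, kv[0], kv[1]);
            }
        }
    }
}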
