How to not print the escape character in Spark CSV output - Java

My Dataset list has a row in the format:
1,abc,null,"
On printing this to a CSV file using the following code:
writeErrorDF
.drop("FILE_NAME", "IDENTIFIER")
.repartition(1)
.write()
.mode(SaveMode.Overwrite)
.option("header", "true")
.option("encoding", "UTF-8")
.csv(writeErrorDFFilePath);
I get an output that looks like: 1,abc,"","\""
I want the output to look like 1,abc,"",""". How do I achieve this?
I also tried adding .option("escape", "") to the above code, but that just appends a NUL or space character to the output.
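If the goal is RFC 4180-style output, one option worth trying (a sketch, not verified against your Spark version) is to set the escape character to the quote character itself, so that Spark doubles embedded quotes instead of backslash-escaping them:
writeErrorDF
    .drop("FILE_NAME", "IDENTIFIER")
    .repartition(1)
    .write()
    .mode(SaveMode.Overwrite)
    .option("header", "true")
    .option("encoding", "UTF-8")
    .option("escape", "\"")   // escape quotes by doubling them instead of \"
    .csv(writeErrorDFFilePath);
With this option the quote-only field should be written as a doubled quote inside its quoted field rather than as \".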

Convert a text report with missing column values into an xlsx document

I want to convert a report which is in text format into an xlsx document, but the problem is that the data in the text file has some missing column values.

Typical report data looks like this (it was posted as an image in the original question).
A simple approach that I wanted to follow was to use space as a delimiter, but the data is not well structured.
Read the first line of the file and split it into columns by checking for more than one whitespace character; in addition, record where each column starts and how wide it is.
After that you can simply go through the other rows containing data and extract the information by checking the offsets of the column you are at.
(And please don't put images of text into Stack Overflow; actual text is better.)
EDIT:
Python implementation:
import pandas as pd
import re

file = "path/to/file.txt"
with open(file, "r") as f:
    # read the header line and split it on runs of spaces
    line = f.readline()
    columns = re.split(" +", line)
    # record the offset at which each column header starts
    column_sizes = [line.index(column) for column in columns]
    column_sizes.append(-1)
    # skip the separator line ("------")
    f.readline()
    rows = []
    while True:
        line = f.readline()
        if len(line) == 0:
            break
        elif line[-1] != "\n":
            line += "\n"
        # slice the row at the header offsets
        row = []
        for i in range(len(column_sizes) - 1):
            value = line[column_sizes[i]:column_sizes[i + 1]]
            row.append(value)
        rows.append(row)

columns = [column.strip() for column in columns]
df = pd.DataFrame(data=rows, columns=columns)
print(df)
df.to_excel(file.split(".")[0] + ".xlsx")
You are correct that exporting from text to CSV is not a practical start; however, it would be good for import. So here is your 100% well-structured source text, saved as plain text, and here is the import into Excel (both were shown as screenshots in the original answer).
You can use Google Lens to get your data out of this picture and then copy and paste it into an Excel file; that is the easiest way. Or first convert it into a PDF and then use Google Lens: go to File, scroll to the Print option, and in the print settings there is an option called Microsoft Print to PDF. Select it and press Print; it will ask you for a location, so choose one and use the resulting file.

Remove special character from a column in dataframe

I am trying to remove a special character (å) from a column in a dataframe.
My data looks like:
ClientID,PatientID
AR0001å,DH_HL704221157198295_91
AR00022,DH_HL704221157198295_92
My original data is approximately 8 TB in size, and I need to remove this special character from it.
Code to load data:
reader.option("header", true)
.option("sep", ",")
.option("inferSchema", false)
.option("charset", "ISO-8859-1")
.schema(schema)
.csv(path)
After loading it into a dataframe, df.show() displays:
+--------+--------------------+
|ClientID| PatientID|
+--------+--------------------+
|AR0001Ã¥|DH_HL704221157198...|
|AR00022 |DH_HL704221157198...|
+--------+--------------------+
Code I used to try to replace this character:
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "\å", ""));
But this didn't work. If I change the charset to "UTF-8" while loading the data into the dataframe, it works, but I am not able to find a solution with the current charset (ISO-8859-1).
Some things to note:
Make sure to assign the result to a new variable and use that afterwards.
You do not need to escape "å" with \.
The column name in the command should be ClientID or PatientID.
If you did all these things, then I would suggest to, instead of matching on "å", try matching on the characters you want to keep. For example, for the ClientID column,
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "[^A-Z0-9_]", ""));
Another approach would be to convert the UTF-8 character "å" to its ISO-8859-1 equivalent and replace using the resulting string:
String escapeChar = new String("å".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
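A minimal sketch of how that could be wired up, assuming the dataframe df and the ClientID column from the question (the variable names here are illustrative):
import java.nio.charset.StandardCharsets;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// The UTF-8 bytes of "å", decoded as ISO-8859-1, become the two characters
// "Ã¥" that actually ended up in the dataframe (see the df.show() output above).
String mojibake = new String("å".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
Dataset<Row> cleaned = df.withColumn("ClientID",
        functions.regexp_replace(df.col("ClientID"), mojibake, ""));
cleaned.show();
Note that, per the first point above, the result is assigned to a new variable (cleaned) and used from there.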
The command below will remove all the special characters and keep all the lower/upper-case letters and all the digits in the string:
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "[^a-zA-Z0-9]", ""));

Replace or remove the new line "\n" character from a Spark dataset column value

I have the below code to read XML:
Dataset<Row> dataset1 = SparkConfigXMLProcessor.sparkSession.read().format("com.databricks.spark.xml")
.option("rowTag", properties.get(EventHubConsumerConstants.IG_ORDER_TAG).toString())
.load(properties.get("C:\\inputOrders.xml").toString());
One of the column values contains a newline character. I want to replace it with some other character, or just remove it. Please help.
dataset1.withColumn("menuitemname_clean", regexp_replace(col("menuitemname"), "[\n\r]", " "))
The code above will work.
This is what I used. I usually add a tab (\t), too. Having both \r and \n will find Unix (\n), Windows (\r\n), and classic Mac OS (\r) newlines.
Dataset<Row> newDF = dataset1.withColumn("menuitemname", regexp_replace(col("menuitemname"), "\n|\r", ""));
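For completeness, a variant of the same call that also strips tabs, as mentioned above (a sketch; the column name menuitemname comes from the question):
Dataset<Row> newDF = dataset1.withColumn("menuitemname",
        regexp_replace(col("menuitemname"), "[\\n\\r\\t]", ""));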
The code below resolved my issue:
Dataset<Row> newDF = dataset1.withColumn("menuitemname", regexp_replace(col("menuitemname"), "[\\n]", ""));

Splitting data in CSV file

Below is the data format in my CSV file
userid,group,username,status
In my Java code I split the data using , as the delimiter.
E.g., the normal scenario, in which my code works fine:
1001,admin,ram,active
In this scenario (a user with firstname,lastname), the status I read for user 1002 comes out as KUMAR, since the 4th column is taken as the status:
1002,User,ravi,kumar,active
Kindly help me change the code logic so that it works for both scenarios.
You can use the OpenCSV library.
CSVReader csvReader = new CSVReader(new FileReader(fileName), ',');
List<String[]> rows = csvReader.readAll();
Then you can test the first column: if ("1002".equals(rows.get(0)[0])) ...
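OpenCSV alone will not fix the unquoted extra comma, though. One way to cope with the variable column count (a sketch, assuming userid is always the first column and status always the last, however many parts the name has; depending on your OpenCSV version, readAll() may also declare CsvException):
import com.opencsv.CSVReader;
import java.io.FileReader;
import java.util.List;

try (CSVReader csvReader = new CSVReader(new FileReader(fileName))) {
    List<String[]> rows = csvReader.readAll();
    for (String[] row : rows) {
        String userId = row[0];
        String status = row[row.length - 1]; // the status is always the last column
        System.out.println(userId + " -> " + status);
    }
}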

Apache Pig process CSV with fields wrapped in quotes

How can I process a CSV file where some fields are wrapped in quotes?
An example line to process (the field delimiter is ','):
I am column1, I am column2, "yes, I'm am column3"
The example has three columns, but the following script will say that I have four:
A = load '/path/to/file' using PigStorage(',');
Any suggestions, or a link to a resource?
Try loading the data, then do a FOREACH ... GENERATE to regenerate the data into whatever format you need. For the fields where you need to remove the quotes, use REPLACE($3, '"', '') (note that Pig's REPLACE takes the string, a pattern, and a replacement).
data = LOAD 'testdata' USING PigStorage(',');
data = FOREACH data GENERATE
    (chararray) $0 AS col1:chararray,
    (chararray) $1 AS col2:chararray,
    (chararray) REPLACE($3, '"', '') AS col3:chararray;
