Remove special character from a column in dataframe - java

I am trying to remove a special character (å) from a column in a dataframe.
My data looks like:
ClientID,PatientID
AR0001å,DH_HL704221157198295_91
AR00022,DH_HL704221157198295_92
My original data is approx 8TB in size from which I need to get rid of this special character.
Code to load data:
reader.option("header", true)
.option("sep", ",")
.option("inferSchema", false)
.option("charset", "ISO-8859-1")
.schema(schema)
.csv(path)
After loading into dataframe when I do df.show() it shows:
+--------+--------------------+
|ClientID| PatientID|
+--------+--------------------+
|AR0001Ã¥|DH_HL704221157198...|
|AR00022 |DH_HL704221157198...|
+--------+--------------------+
Code I used to try to replace this character:
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "\å", ""));
But this didn't work. While loading the data in dataframe if I change the charset to "UTF-8" it works.
I am not able to find a solution with the current charset (ISO-8859-1).

Some things to note:
Make sure to assign the result to a new variable and use that variable afterwards.
You do not need to escape "å" with \.
colName in the command should be ClientID or PatientID.
If you have done all of this and it still fails, then instead of matching on "å", try matching on the characters you want to keep. For example, for the ClientID column:
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "[^A-Z0-9_]", ""));
Another approach is to work out what "å" looks like once its UTF-8 bytes have been decoded as ISO-8859-1 (the "Ã¥" shown by df.show()) and replace that string instead:
String escapeChar = new String("å".getBytes("UTF-8"), "ISO-8859-1");
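To see why this helps, here is a minimal plain-Java sketch (no Spark involved) of what happens to "å" when its UTF-8 bytes are decoded as ISO-8859-1, and how the resulting two-character string can be stripped. The class and method names are made up for illustration; the same string could be passed as the pattern to regexp_replace, since it contains no regex metacharacters.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Returns what s looks like when its UTF-8 bytes are decoded as ISO-8859-1.
    static String asLatin1Mojibake(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Removes the mis-decoded form of ch from value (literal replace, no regex).
    static String stripMisdecoded(String value, String ch) {
        return value.replace(asLatin1Mojibake(ch), "");
    }

    public static void main(String[] args) {
        // \u00e5 is "å"; unicode escapes avoid depending on the source file encoding.
        String loaded = asLatin1Mojibake("AR0001\u00e5");
        System.out.println(loaded);
        System.out.println(stripMisdecoded(loaded, "\u00e5"));
    }
}
```

Using StandardCharsets instead of charset names also avoids the checked UnsupportedEncodingException.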

The command below removes every character except upper- and lower-case letters and digits:
df.withColumn("ClientID", functions.regexp_replace(df.col("ClientID"), "[^a-zA-Z0-9]", ""));
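The effect of a "keep only these characters" pattern can be checked in plain Java before running it over the full dataset. The sketch below uses a hypothetical helper, keepOnly, and the sample values from the question. Note that a class without "_" also strips the underscores from PatientID values.

```java
class WhitelistDemo {
    // Removes every character that is NOT in the given regex character class body.
    static String keepOnly(String value, String allowedClass) {
        return value.replaceAll("[^" + allowedClass + "]", "");
    }

    public static void main(String[] args) {
        // "\u00c3\u00a5" is the mis-decoded form of "å" seen in df.show().
        System.out.println(keepOnly("AR0001\u00c3\u00a5", "a-zA-Z0-9"));
        // Without "_" in the class, the underscores are removed as well.
        System.out.println(keepOnly("DH_HL704221157198295_91", "a-zA-Z0-9"));
        // Adding "_" keeps the PatientID intact.
        System.out.println(keepOnly("DH_HL704221157198295_91", "A-Z0-9_"));
    }
}
```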

Related

read unique char: 'あ' from json file in java

I am reading a JSON file in Java using this code:
String data = Files.readFile(jsonFile)
.trim()
.replaceAll("[^\\x00-\\x7F]", "")
.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "")
.replaceAll("\\p{C}", "");
In my JSON file, there is a unique char: 'あ' (12354) that is interpreted to: "" (nothing) when reading the file.
How can I make this char show up in my variable "data"?
From the answers I've received, I understand that replaceAll("[^\\x00-\\x7F]", "") strips every non-ASCII character from the data. But what can I do if I want all non-ASCII characters removed except this one, 'あ'?
The character you want is the Unicode character HIRAGANA LETTER A, code point U+3042.
You can simply add it to the list of valid characters:
...
.replaceAll("[^\\x00-\\x7F\\u3042]", "")
...
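As a quick check of that pattern, here is a small sketch with a made-up class name and sample input. It keeps ASCII and U+3042 and drops other non-ASCII characters.

```java
class KeepHiraganaA {
    // Strips everything outside ASCII except U+3042 (HIRAGANA LETTER A).
    static String clean(String s) {
        return s.replaceAll("[^\\x00-\\x7F\\u3042]", "");
    }

    public static void main(String[] args) {
        // \u3042 is kept; \u3044 (another hiragana letter) and \u00e9 are removed.
        System.out.println(clean("a\u3042\u3044\u00e9b"));
    }
}
```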

Escape XML Characters for Attribute values Java

I have an XML document represented as a String. I need to replace all the special characters in the attribute values with their escape sequences.
For example, I want to convert the first string below into the second:
<r1 c1=\"01\" c168=\"<A_ATTR><Updates A_VALUE="959" /><Current A_VALUE="100" /></A_ATTR>\"/>
<r1 c1=\"01\" c168=\"&lt;A_ATTR&gt;&lt;Updates A_VALUE=&quot;959&quot; /&gt;&lt;Current A_VALUE=&quot;100&quot; /&gt;&lt;/A_ATTR&gt;\"/>
This question is similar to the one below, but I need to escape the attribute values. Please advise.
Escape xml characters within nodes of string xml in java
Use String.replace to replace each special character with its entity. For example, if your XML string is s:
s = s.replace("<", "&lt;");
s = s.replace(">", "&gt;");
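Applied to the whole document, those replacements would also escape the tag delimiters themselves, so they are only safe on the attribute value. A minimal sketch, assuming the raw attribute value is available on its own before the XML string is assembled (escapeAttr is a hypothetical helper name):

```java
class XmlAttrEscape {
    // Escapes the five XML special characters in an attribute value.
    // "&" must be replaced first so the entities added afterwards are not re-escaped.
    static String escapeAttr(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        String raw = "<A_ATTR><Updates A_VALUE=\"959\" /></A_ATTR>";
        // Only the attribute value is escaped; the surrounding element is built afterwards.
        String xml = "<r1 c1=\"01\" c168=\"" + escapeAttr(raw) + "\"/>";
        System.out.println(xml);
    }
}
```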

replace or remove new line "\n" character from Spark dataset column value

I have below code to read xml
Dataset<Row> dataset1 = SparkConfigXMLProcessor.sparkSession.read().format("com.databricks.spark.xml")
.option("rowTag", properties.get(EventHubConsumerConstants.IG_ORDER_TAG).toString())
.load(properties.get("C:\\inputOrders.xml").toString());
One of the column values contains a newline character.
I want to replace it with some other character, or just remove it.
Please help.
dataset1.withColumn("menuitemname_clean", regexp_replace(col("menuitemname"), "[\n\r]", " "))
The code above will work.
This is what I used. I usually add a tab (\t), too. Matching both \r and \n covers UNIX and macOS (\n), Windows (\r\n), and classic Mac OS (\r) newlines.
Dataset<Row> newDF = dataset1.withColumn("menuitemname", regexp_replace(col("menuitemname"), "\n|\r", ""));
The code below resolved my issue:
Dataset<Row> newDF = dataset1.withColumn("menuitemname", regexp_replace(col("menuitemname"), "[\\n]", ""));
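Since Spark's regexp_replace takes Java regular expressions, a line-break pattern can be tried out in plain Java first. The sketch below (hypothetical class and method names) puts \r\n ahead of the single-character alternatives, so a Windows line break becomes one replacement rather than two, which matters when replacing with a space.

```java
class NewlineStrip {
    // Replaces each line break (\r\n, \r or \n) with the given replacement.
    static String replaceLineBreaks(String s, String replacement) {
        return s.replaceAll("\\r\\n|\\r|\\n", replacement);
    }

    public static void main(String[] args) {
        System.out.println(replaceLineBreaks("a\r\nb\nc\rd", " "));
        System.out.println(replaceLineBreaks("x\ny", ""));
    }
}
```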

Java Format String for Table Display: Wrap at 80 chars

I need to display a list in table/grid format, so I'm using String.format() as in the following example,
how to print object list to file with formatting in table format using java
My issue is that I need to force-wrap the output at 80 chars. The table's maximum width is 80; any further output must continue on the next line.
Is this possible?
Current code, without wrapping implemented:
StringBuilder sbOutput = new StringBuilder();
sbOutput.append(String.format("%-14s%-200s%-13s%-24s%-12s", "F1", "F2", "F3", "F4", "F5"));
for (MyObject result : myObjects) {
    sbOutput.append(String.format("%-14s%-200s%-13s%-24s%-12s", result.getF1(),
            result.getF2(), result.getF3(), result.getF4(), result.getF5()));
}
You can inject a newline into a string every 80 chars like this:
str.replaceAll(".{80}(?=.)", "$0\n");
So your code would become:
sbOutput.append(String.format("%-14s%-200s%-13s%-24s%-12s", result.getF1(),
        result.getF2(), result.getF3(), result.getF4(), result.getF5())
        .replaceAll(".{80}(?=.)", "$0\n"));
The search regex means "80 chars that have a character following" and "$0" in the replacement means "everything matched by the search".
The (?=.) is a lookahead asserting that the match is followed by at least one more character, which prevents output that is an exact multiple of 80 chars from getting an unnecessary newline added after it.
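The behaviour is easier to see with a small width. Below is a minimal sketch with a hypothetical wrap helper that takes the width as a parameter. It assumes the input line contains no line breaks of its own, since "." does not match them by default.

```java
class WrapDemo {
    // Inserts a newline after every `width` characters, except at the very end.
    static String wrap(String line, int width) {
        return line.replaceAll(".{" + width + "}(?=.)", "$0\n");
    }

    public static void main(String[] args) {
        System.out.println(wrap("abcdefghij", 4));
        // The header row from the question is 263 characters wide,
        // so wrapping at 80 yields four lines.
        String row = String.format("%-14s%-200s%-13s%-24s%-12s", "F1", "F2", "F3", "F4", "F5");
        System.out.println(wrap(row, 80));
    }
}
```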

Insert " in correct form in a String

I want to put this link into a string.
String url=www.test.com;
String link=<a href=url>contact info</a>
How can I write this?
You will need to do:
String url = "www.test.com";
You can use \ character to indicate that we want to include a special character, and that the next character should be treated differently. \" indicates a double quote character and not the termination of the string.
String link = "<a href=\"" + url + "\">contact info</a>";
A character preceded by a backslash is an escape sequence and has special meaning to the compiler. The Java escape sequences are:
\t  tab
\b  backspace
\n  newline
\r  carriage return
\f  form feed
\'  single quote
\"  double quote
\\  backslash
First, let's assume you have:
String url = "www.test.com";
(Note the quotes around the string.)
To create your link string, you'd do this:
String link = "<a href=\"" + url + "\">contact info</a>";
// Note ---------------^^-----------^^
To put a " inside a string literal, you put a backslash in front of it. This is called "escaping" the quote.
First put the url value in quotes, then concatenate it into the link string.
String url="www.test.com";
String link="<a href=\""+url+"\">contact info</a>";
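Putting the pieces together, here is a small runnable sketch of building the anchor tag with escaped quotes. The class name and the link helper are made up for illustration.

```java
class LinkDemo {
    // Builds an HTML anchor; \" inside the literals produces the quotes around the href value.
    static String link(String url, String text) {
        return "<a href=\"" + url + "\">" + text + "</a>";
    }

    public static void main(String[] args) {
        String url = "www.test.com";
        System.out.println(link(url, "contact info"));
    }
}
```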
