How can I use Lithuanian special letters in Java?

I want to filter a table and check a string in Selenium. The string on the web page contains a Lithuanian special letter, so I get something like "M?nesis" instead of "Mėnesis":
ElementsCollection activePlans = $$(".view-content .tile__title").filterBy(text("Mėnesis"));
How can I do that?
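A common cause of this symptom is the compiler reading the source file with the wrong encoding, which corrupts the string literal before Selenium ever sees it. A minimal sketch, assuming the literal itself is the problem: write the special letter as a Unicode escape so it no longer depends on the source file's encoding.

```java
public class LithuanianLiteral {
    public static void main(String[] args) {
        // "ė" written as the Unicode escape \u0117, so the literal survives
        // regardless of which encoding javac assumes for the source file.
        String month = "M\u0117nesis";
        System.out.println(month);
    }
}
```

Alternatively, compiling with javac -encoding UTF-8 lets you keep the literal "Mėnesis" as-is.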

Related

Java Regex - Allow all regular Unicode characters for names but not obscure variants

In Java (v11) I would like to allow all characters in any language for choosing a username: ASCII, Latin, Greek, Chinese, and so on.
We tried the pattern \p{IsAlphabetic}.
But with this pattern, names like "𝕮𝖍𝖗𝖎𝖘" are allowed. I don't want to let people style their names with such Unicode characters; the user should enter "Chris", not "𝕮𝖍𝖗𝖎𝖘".
It should be allowed to name yourself "尤雨溪", "Linus" or "Gödel".
How to achieve a proper Regex not allowing strange styles in names?
Here is a regular expression that allows Latin, Han Chinese, Greek, and Russian Cyrillic. It can be extended with more Unicode scripts.
^(\p{sc=Han}+|\p{sc=Latin}+|\p{sc=Greek}+|\p{sc=Cyrillic}+)$
Demo here: https://regex101.com/r/yCt5xT/1
Here is the full list of Unicode Scripts that can be used: https://www.regular-expressions.info/unicode.html
\p{Common}
\p{Arabic}
\p{Armenian}
\p{Bengali}
\p{Bopomofo}
\p{Braille}
\p{Buhid}
\p{Canadian_Aboriginal}
\p{Cherokee}
\p{Cyrillic}
\p{Devanagari}
\p{Ethiopic}
\p{Georgian}
\p{Greek}
\p{Gujarati}
\p{Gurmukhi}
\p{Han}
\p{Hangul}
\p{Hanunoo}
\p{Hebrew}
\p{Hiragana}
\p{Inherited}
\p{Kannada}
\p{Katakana}
\p{Khmer}
\p{Lao}
\p{Latin}
\p{Limbu}
\p{Malayalam}
\p{Mongolian}
\p{Myanmar}
\p{Ogham}
\p{Oriya}
\p{Runic}
\p{Sinhala}
\p{Syriac}
\p{Tagalog}
\p{Tagbanwa}
\p{TaiLe}
\p{Tamil}
\p{Telugu}
\p{Thaana}
\p{Thai}
\p{Tibetan}
\p{Yi}
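In Java, the same script-based alternation works directly with String.matches, since java.util.regex supports script properties via the sc= keyword. A small sketch (class and method names are mine):

```java
public class NameCheck {
    // The alternation from the answer above; matches() anchors the whole
    // input, so the explicit ^...$ anchors are not needed here.
    private static final String NAME_REGEX =
            "\\p{sc=Han}+|\\p{sc=Latin}+|\\p{sc=Greek}+|\\p{sc=Cyrillic}+";

    public static boolean isPlainName(String name) {
        return name.matches(NAME_REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isPlainName("Linus"));      // true
        System.out.println(isPlainName("G\u00f6del")); // true ("Gödel")
    }
}
```

The styled "𝕮𝖍𝖗𝖎𝖘" is rejected because mathematical alphanumeric symbols belong to the Common script, not Latin.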
The challenge is that "𝕮𝖍𝖗𝖎𝖘" is composed of surrogate pairs, which the regex engine interprets as code points, not chars.
The solution is to match any letter using \p{L}, but exclude code points of high surrogates on up:
"[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"
Trying to exclude the Unicode characters with
"[\\p{L}&&[^\ud000-\uffff]]+" // doesn't work
doesn't work, because the surrogate pairs are merged into a single code point.
Test code:
String[] names = {"尤雨溪", "Linus", "Gödel", "\uD835\uDD6E\uD835\uDD8D\uD835\uDD97\uD835\uDD8E\uD835\uDD98"};
for (String name : names) {
System.out.println(name + ": " + name.matches("[\\p{L}&&[^\\x{0d000}-\\x{10ffff}]]+"));
}
Output:
尤雨溪: true
Linus: true
Gödel: true
𝕮𝖍𝖗𝖎𝖘: false

Escape XML Characters for Attribute values Java

I have an XML represented in String. I need to replace all the special characters in the Attribute values with the Escape Characters.
For Ex:
I want to convert the first one into the second one:
<r1 c1=\"01\" c168=\"<A_ATTR><Updates A_VALUE="959" /><Current A_VALUE="100" /></A_ATTR>\"/>
<r1 c1=\"01\" c168=\"&lt;A_ATTR&gt;&lt;Updates A_VALUE=&quot;959&quot; /&gt;&lt;Current A_VALUE=&quot;100&quot; /&gt;&lt;/A_ATTR&gt;\"/>
This question is similar to the one below, but I need to escape the attribute values. Please advise.
Escape xml characters within nodes of string xml in java
Use the String replace function to replace the required character with its XML entity. For example, if your XML string is s:
s = s.replace("<", "&lt;");
s = s.replace(">", "&gt;");
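The two calls above cover only < and >. A fuller sketch (the helper is mine, not from the answer) also escapes &, quotes, and apostrophes; & must be replaced first, otherwise the & characters introduced by the later replacements would themselves be escaped:

```java
public class XmlEscape {
    // Escape the five XML special characters for use in attribute values.
    public static String escapeAttr(String s) {
        return s.replace("&", "&amp;")   // must come first
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        System.out.println(escapeAttr("<A_ATTR a=\"1\">"));
    }
}
```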

Match string with normal characters and special characters in Spring

I'm trying to find a way to match user search queries with database records in a search engine using Spring, but I'm having trouble when the search query includes special characters such as vowels with accents.
Eg: search query = 'cafe'. Database record = 'café'
I'm using word stems to match the query against the database records.
What would be the most straightforward way of matching a query that includes a special character ('café') with a string that doesn't contain it ('cafe'), and vice versa?
UPDATE
All the information I need is already cached, so the approach of creating a new column in the DB is not appealing. I'm looking for a more Spring-based solution.
You could use java.text.Normalizer, like follow:
import java.text.Normalizer;
import java.text.Normalizer.Form;
public static String removeAccents(String text) {
return text == null ? null :
Normalizer.normalize(text, Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
The Normalizer splits each original character into a pair of characters (base letter and combining accent).
For example, the character á (U+00E1) will be split into a (U+0061) and the combining acute accent (U+0301).
The \p{InCombiningDiacriticalMarks}+ regular expression matches all such diacritic code points, and we replace them with an empty string.
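Applied to the example from the question, the helper strips the accent as expected; a self-contained sketch of the same method:

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class AccentDemo {
    // Decompose accented letters (NFD), then drop the combining marks.
    public static String removeAccents(String text) {
        return text == null ? null :
                Normalizer.normalize(text, Form.NFD)
                        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // "café" written with a Unicode escape to avoid source-encoding issues
        System.out.println(removeAccents("caf\u00e9")); // cafe
    }
}
```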
And your query could be like:
SQL SERVER
SELECT * FROM Table
WHERE Column Like '%stringwithoutaccents%' COLLATE Latin1_general_CI_AI
ORACLE (from 10g)
SELECT * FROM Table
WHERE NLSSORT(Column, 'NLS_SORT = Latin_AI')
Like NLSSORT('%stringwithoutaccents%', 'NLS_SORT = Latin_AI')
The CI stands for "Case Insensitive" and AI for "Accent Insensitive".
I hope it helps you.

Croatian character in Java standard output

I have a database with some Croatian characters in it, such as Đ. The character is stored correctly in the database, and a PrimeFaces datatable also shows it on the web page just fine.
The problem is that when I print the name with out.println(), the character Đ is missing.
for (People p : people) {
System.out.println(p.getName());
}
I tried using String name2 = new String(p.getName().getBytes("ISO-8859-2"), "ISO-8859-2"); but it still does not work.
I assume you are using UTF-8 as the default encoding in the database and for PrimeFaces.
Have a look at this as well:
Display special characters using System.out.println
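One thing worth checking: System.out encodes with the platform default charset, which on some systems cannot represent Đ. A sketch, assuming Java 10+ and a console that is itself set to UTF-8, wrapping standard output in a PrintStream with an explicit charset:

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class ConsoleUtf8 {
    public static void main(String[] args) {
        // Force UTF-8 output instead of the platform default encoding;
        // "Đ" (U+0110) then survives as long as the terminal expects UTF-8.
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        out.println("\u0110or\u0111e"); // "Đorđe"
    }
}
```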

Java: Search in a wrong encoded String without modifying it

I have to find a user-defined String in a Document (using Java), which is stored in a database in a BLOB. When I search for a String with special characters (umlauts: ä, ö, ü, etc.), it fails, meaning it does not return any positions at all. And I am not allowed to convert the document's content to UTF-8 (which would have fixed this problem but raised a new, even bigger one).
Some additional information:
The document's content is returned as a String in ISO-8859-1 (Latin-1).
Here is an example, what a String could look like:
Die Erkenntnis, daÃ der KÃ¼nstler Schutz braucht, ...
This is how it should look like:
Die Erkenntnis, daß der Künstler Schutz braucht, ...
If I search for Künstler it fails to find it, because it looks for ü but only finds Ã¼.
Is it possible to convert Künstler into KÃ¼nstler, so I can search for the wrongly encoded version instead?
Note:
We are using the Hibernate framework for database access. The original getter for the document's content returns a byte[]. The String is then obtained by calling
new String(getContent(), "ISO-8859-1")
The problem here is, that I cannot change this to UTF-8, because it would then mess up the rest of our application which is based on a third party application that delivers data this way.
Okay, it looks like I've found a way to mess up the encoding on purpose:
new String("Künstler".getBytes("UTF-8"), "ISO-8859-1")
By getting the bytes of the String Künstler in UTF-8 and then creating a new String from them, telling Java they are Latin-1, it converts to KÃ¼nstler. It's a hell of a hack, but it seems to work well.
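A small self-contained demo of the hack (the key is written with a Unicode escape here to keep the demo independent of the source file's encoding):

```java
import java.nio.charset.StandardCharsets;

public class MojibakeSearch {
    public static void main(String[] args) {
        String key = "K\u00fcnstler"; // "Künstler"
        // Encode the correct key as UTF-8 bytes, then decode those bytes as
        // Latin-1: this reproduces the mis-decoded form found in the document.
        String garbled = new String(key.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // KÃ¼nstler
    }
}
```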
Already answered by yourself.
An altogether different approach:
If you can search the blob, you could search using
"SELECT .. FROM ... WHERE"
+ " ... LIKE '%" + key.replaceAll("\\P{Ascii}+", "%") + "%'"
This replaces non-ASCII sequences by the % wildcard: UTF-8 multibyte sequences are non-ASCII by design.
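The key-to-pattern step on its own, as a sketch (the helper name is mine); note that concatenating user input into SQL invites injection, so in real code the resulting pattern should be bound as a query parameter instead:

```java
public class LikePattern {
    // Replace every run of non-ASCII characters with the SQL '%' wildcard,
    // so a correctly spelled key can match its mis-encoded form in the blob.
    public static String toLikePattern(String key) {
        return "%" + key.replaceAll("\\P{Ascii}+", "%") + "%";
    }

    public static void main(String[] args) {
        System.out.println(toLikePattern("K\u00fcnstler")); // %K%nstler%
    }
}
```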