Output and regex matching of Unicode Strings in Java

I have an ordinary String property inside an object that contains accented characters.
When I debug the software (with NetBeans), the variables panel shows the string correctly.
But when I print the variable with System.out.println, I see something strange:
As you can see, every "à" becomes "a`", and so on. This leads to a wrong character count, even when using a Matcher on the string.
How can I fix this? I need the accented characters, the right character count, and to be able to use a Matcher on the string.
I have tried many approaches, but nothing works; I'm surely missing something.
Thanks in advance.
EDIT
This is the code:
public class TextLine {
    public List<TextPosition> textPositions = null;
    public String text = "";
}

public class myStripper extends PDFTextStripper {

    public ArrayList<TextLine> lines = null;
    boolean startOfLine = true;

    public myStripper() throws IOException
    {
    }

    private void newLine() {
        startOfLine = true;
    }

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        newLine();
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        newLine();
        super.writeLineSeparator();
    }

    @Override
    public String getText(PDDocument doc) throws IOException
    {
        lines = new ArrayList<TextLine>();
        return super.getText(doc);
    }

    @Override
    protected void writeWordSeparator() throws IOException
    {
        TextLine tmpline = lines.get(lines.size() - 1);
        tmpline.text += getWordSeparator();
        tmpline.textPositions.add(null);
        super.writeWordSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        TextLine tmpline = null;
        if (startOfLine) {
            tmpline = new TextLine();
            tmpline.text = text;
            tmpline.textPositions = textPositions;
            lines.add(tmpline);
            startOfLine = false;
        } else {
            tmpline = lines.get(lines.size() - 1);
            tmpline.text += text;
            tmpline.textPositions.addAll(textPositions);
        }
        super.writeString(text, textPositions);
    }
}

It is about the representation of certain Unicode characters.
What is a character? That question is hard to answer. Is à one character, or two (an a and a ` stacked on top of each other)? It depends on what you consider a character to be.
The grave accents (`) you are seeing are actually combining diacritical marks. Combining diacritical marks are separate Unicode characters, but many text processors combine them with the preceding character. For instance, java.text.Normalizer.normalize(str, Normalizer.Form.NFC) does that job for you.
The library you are using (Apache PDFBox) possibly normalizes the text, so diacritics are combined with the preceding character. As a result, some TextPosition instances in your text contain two code points (more precisely, an e or an a followed by a combining grave accent). So the length of the list of TextPosition instances is 65.
However, your String, which is in fact a CharSequence, holds 67 chars, because each diacritic takes up one char of its own.
System.out.println() just prints each character of the string, and that is represented as "dere che Geova e` il Creatore e Colui che da` la vita. Probabilmen-"
Then why is the Netbeans debugger showing "dere che Geova è il Creatore e Colui che dà la vita. Probabilmen-" as value of the string?
That is simply because the Netbeans debugger displays the normalized text for you.
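A minimal sketch of this (the class name NormalizeDemo is my own): a string built from a plus U+0300 COMBINING GRAVE ACCENT is two chars long, and NFC normalization collapses it into the single precomposed character à:

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        // 'a' followed by U+0300 COMBINING GRAVE ACCENT: renders as "à" but holds 2 chars
        String decomposed = "a\u0300";
        System.out.println(decomposed.length()); // 2

        // NFC combines base character and diacritic into the precomposed U+00E0
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.length()); // 1
        System.out.println(composed.equals("\u00e0")); // true
    }
}
```

After NFC normalization, String.length() and Matcher positions line up with what the debugger shows.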


How can I use InCombiningDiacriticalMarks while ignoring one case

I'm writing code to remove all diacritics from a String.
For example: áÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑ
I'm using the Unicode property InCombiningDiacriticalMarks, but I want to skip the replacement for ñ and Ñ.
At the moment I save those two characters before the replacement with:
s = s.replace('ñ', '\001');
s = s.replace('Ñ', '\002');
Is it possible to use InCombiningDiacriticalMarks while ignoring the diacritics of ñ and Ñ?
This is my code:
public static String stripAccents(String s)
{
    /* Save ñ */
    s = s.replace('ñ', '\001');
    s = s.replace('Ñ', '\002');
    s = Normalizer.normalize(s, Normalizer.Form.NFD);
    s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    /* Restore ñ */
    s = s.replace('\001', 'ñ');
    s = s.replace('\002', 'Ñ');
    return s;
}
It works fine, but I want to know whether it's possible to optimize this code.
It depends what you mean by "optimize". It's tough to reduce the number of lines of code from what you have written, but since you are processing the string six times there's scope to improve performance by processing the input string only once, character by character:
public class App {

    // See SO answer https://stackoverflow.com/a/10831704/2985643 by virgo47
    private static final String tab00c0
            = "AAAAAAACEEEEIIII"
            + "DNOOOOO\u00d7\u00d8UUUUYI\u00df"
            + "aaaaaaaceeeeiiii"
            + "\u00f0nooooo\u00f7\u00f8uuuuy\u00fey"
            + "AaAaAaCcCcCcCcDd"
            + "DdEeEeEeEeEeGgGg"
            + "GgGgHhHhIiIiIiIi"
            + "IiJjJjKkkLlLlLlL"
            + "lLlNnNnNnnNnOoOo"
            + "OoOoRrRrRrSsSsSs"
            + "SsTtTtTtUuUuUuUu"
            + "UuUuWwYyYZzZzZzF";

    public static void main(String[] args) {
        var input = "AaBbCcáÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑçÇ";
        var output = removeDiacritic(input);
        System.out.println("input = " + input);
        System.out.println("output = " + output);
    }

    public static String removeDiacritic(String input) {
        var output = new StringBuilder(input.length());
        for (var c : input.toCharArray()) {
            if (isModifiable(c)) {
                c = tab00c0.charAt(c - '\u00c0');
            }
            output.append(c);
        }
        return output.toString();
    }

    // Returns true if the supplied char is a candidate for diacritic removal.
    static boolean isModifiable(char c) {
        boolean modifiable;
        if (c < '\u00c0' || c > '\u017f') {
            modifiable = false;
        } else {
            modifiable = switch (c) {
                case 'ñ', 'Ñ' -> false;
                default -> true;
            };
        }
        return modifiable;
    }
}
This is the output from running the code:
input = AaBbCcáÁéÉíÍóÓúÚäÄëËïÏöÖüÜñÑçÇ
output = AaBbCcaAeEiIoOuUaAeEiIoOuUñÑcC
Characters without diacritics in the input string are not modified. Otherwise the diacritic is removed (e.g. Ç to C), except in the cases of ñ and Ñ.
Notes:
The code does not use the Normalizer class or InCombiningDiacriticalMarks at all. Instead it processes each character in the input string only once, removing its accent if appropriate. The conventional approach for removing diacritics (as used in the OP) does not support selective removal as far as I know.
The code is based on an answer by user virgo47, but enhanced to support the selective removal of accents. See virgo47's answer for details of mapping an accented character to its unaccented counterpart.
This solution only works for Latin-1/Latin-2, but could be enhanced to support other mappings.
Your solution is very short and easy to understand, but it feels brittle, and for large input I suspect that it would be significantly slower than an approach that only processed each character once.
Ave Maria Purisima,
You can create a pattern excluding the tilde from the diacritical marks set:
private static final Pattern STRIP_ACCENTS_PATTERN =
        Pattern.compile("[\\p{InCombiningDiacriticalMarks}&&[^\u0303]]+");

public static String stripAccents(String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed =
            new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    return STRIP_ACCENTS_PATTERN.matcher(decomposed).replaceAll("");
}
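For illustration, here is a self-contained sketch of the same idea (the class name StripAccentsDemo is my own). Note one addition: the pattern keeps n + U+0303 in decomposed form, so I append a final NFC step to recompose it into a single ñ char:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class StripAccentsDemo {
    // All combining diacritical marks EXCEPT U+0303 (combining tilde), so ñ/Ñ survive
    private static final Pattern STRIP_ACCENTS_PATTERN =
            Pattern.compile("[\\p{InCombiningDiacriticalMarks}&&[^\u0303]]+");

    public static String stripAccents(String input) {
        if (input == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String stripped = STRIP_ACCENTS_PATTERN.matcher(decomposed).replaceAll("");
        // Recompose so the surviving n + U+0303 becomes the single char ñ again
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("áÁéÉñÑüÜ")); // aAeEñÑuU
    }
}
```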
Hope it helps

Detect Chinese characters in Java

Using Java, how can I detect whether a String contains Chinese characters?
String chineseStr = "已下架";
if (isChineseString(chineseStr)) {
    System.out.println("The string contains Chinese characters");
} else {
    System.out.println("The string does not contain Chinese characters");
}
Can you please help me to solve the problem?
Character.isIdeographic(int codePoint) tells you whether the code point is a CJKV (Chinese, Japanese, Korean and Vietnamese) ideograph.
A closer match is Character.UnicodeScript.HAN.
So:
System.out.println(containsHanScript("xxx已下架xxx"));

public static boolean containsHanScript(String s) {
    for (int i = 0; i < s.length(); ) {
        int codepoint = s.codePointAt(i);
        i += Character.charCount(codepoint);
        if (Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HAN) {
            return true;
        }
    }
    return false;
}
Or in Java 8:
public static boolean containsHanScript(String s) {
    return s.codePoints().anyMatch(
            codepoint -> Character.UnicodeScript.of(codepoint) == Character.UnicodeScript.HAN);
}
A more direct approach:
if ("粽子".matches("[\\u4E00-\\u9FA5]+")) {
    System.out.println("is Chinese");
}
If you also need to catch rarely used and exotic characters then you'll need to add all the ranges: What's the complete range for Chinese characters in Unicode?
You can try the Google API or a Language Detection API.
The Language Detection API includes a simple demo; you can try that first.

How to use MaskFormatter and DocumentFilter together

I need a JFormattedTextField that only allows input of the form ##-###**, where the hyphen is always present in the text field and the last two characters, represented by the *, can either be two letters of the alphabet (a-z/A-Z) or nothing at all.
I know how to solve parts of this but am not sure how to bring everything together. I know that a MaskFormatter of ##-###** will give me the always-present hyphen, but there is no way to enforce the rule that the last two characters are either letters or nothing at all. Furthermore, the MaskFormatter replaces any deletion with the last valid insert, which is undesirable.
I also know that I could use a DocumentFilter to allow only the format I want by using regexes, similar to this functionality but with a different regex:
public void insertString(FilterBypass fb, int offs, int length, String str, AttributeSet a)
        throws BadLocationException {
    String text = fb.getDocument().getText(0, fb.getDocument().getLength());
    text += str;
    if ((fb.getDocument().getLength() + str.length() - length) <= maxCharacters
            && text.matches("^[0-9]+[.]?[0-9]{0,1}$")) {
        super.replace(fb, offs, length, str, a);
    } else {
        Toolkit.getDefaultToolkit().beep();
    }
}
The problem I see with using this is that I would not be able to have the hyphen always present in the text field.
Can someone help me complete the bridge between these two desired functions?
"there is no way for me to enforce the rule of the last 2 characters being either letters or nothing at all."
Sorry, I didn't see that you were using a MaskFormatter. If you look at the API docs, you'll see a chart of the possible format characters:
# Any valid number, uses Character.isDigit.
' Escape character, used to escape any of the special formatting characters.
U Any character (Character.isLetter). All lowercase letters are mapped to upper case.
L Any character (Character.isLetter). All upper case letters are mapped to lower case.
A Any character or number (Character.isLetter or Character.isDigit)
? Any character (Character.isLetter).
* Anything.
H Any hex character (0-9, a-f or A-F).
So you could actually just use "##-####UU"
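As a quick sanity check (a sketch; the class name MaskDemo and the sample inputs are mine), MaskFormatter.stringToValue() can be used outside a GUI to validate a string against the mask; it throws a ParseException when a character doesn't fit its mask position:

```java
import java.text.ParseException;
import javax.swing.text.MaskFormatter;

public class MaskDemo {
    public static void main(String[] args) throws ParseException {
        MaskFormatter formatter = new MaskFormatter("##-####UU");

        // A string matching the mask parses fine
        System.out.println(formatter.stringToValue("12-3456AB"));

        // A letter where '#' demands a digit is rejected
        try {
            formatter.stringToValue("1x-3456AB");
            System.out.println("accepted");
        } catch (ParseException e) {
            System.out.println("rejected");
        }
    }
}
```

Note that the mask alone makes the two trailing letters mandatory, which is why the InputVerifier below is needed to cover the "or nothing at all" case.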
EDIT using InputVerifier
import javax.swing.InputVerifier;
import javax.swing.JComponent;
import javax.swing.JOptionPane;
import javax.swing.JPanel;
import javax.swing.JTextField;

public class TestMaskFormatter {

    private static final String REGEX = "^\\d{2}\\-\\d{4}([A-Z]{2})??";

    private static InputVerifier getInputVerifier() {
        InputVerifier verifier = new InputVerifier() {
            @Override
            public boolean verify(JComponent input) {
                JTextField field = (JTextField) input;
                String text = field.getText();
                return text.matches(REGEX) || text.isEmpty();
            }

            @Override
            public boolean shouldYieldFocus(JComponent input) {
                boolean valid = verify(input);
                if (!valid) {
                    JOptionPane.showMessageDialog(null, "Must match format: 00-0000AA");
                    JTextField field = (JTextField) input;
                    field.setText("");
                }
                return valid;
            }
        };
        return verifier;
    }

    public static void main(String[] args) {
        JTextField fieldWithVerifier = new JTextField(10);
        fieldWithVerifier.setInputVerifier(getInputVerifier());

        JTextField field1 = new JTextField(10);

        JPanel panel = new JPanel();
        panel.add(fieldWithVerifier);
        panel.add(field1);

        JOptionPane.showMessageDialog(null, panel);
    }
}

SAX parser does not read a line completely

I'm trying to parse an InkML-like document. Each content node holds multiple comma-separated tuples of 6 or 7 numbers (including negative and decimal values).
While testing, I see that the SAX characters method doesn't capture all the data.
The code:
public class PenParser extends DefaultHandler {

    // ...unrelated code omitted...

    public void characters(char ch[], int start, int length) throws SAXException {
        // begin my debug print
        StringBuilder buffer = new StringBuilder();
        for (int i = start; i < start + length; i++) {
            buffer.append(ch[i]);
        }
        System.out.println(">" + buffer);
        // end my debug print
    }
}
In the debugger, I see that buffer doesn't contain the full content of the tag: it holds only the first 107 characters or so, although my rows are never longer than 4610 characters. This truncation by StringBuffer and SAX parsing seems strange to me.
I tried StringBuilder too, but the problem remains.
Any suggestions?
Yes, that is expected behavior:
characters may be called several times while one node is parsed.
You'll have to keep the StringBuilder as a member field, append the content in characters, and deal with the full content in endElement.
EDIT
By the way, you do not need to build the buffer character by character.
This is my implementation of characters (which I always use):
@Override
public void characters(char[] ch, int start, int length) throws SAXException
{
    characters.append(new String(ch, start, length));
}
... and not to forget ....
@Override
public void endElement(String uri, String localName, String qName)
        throws SAXException
{
    final String content = characters.toString().trim();
    // .... deal with content

    // reset characters
    characters.setLength(0);
}

private final StringBuilder characters = new StringBuilder(64);
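Putting the two overrides together, here is a self-contained sketch (the class name SaxDemo and the XML snippet are made up for illustration) that accumulates text in a member StringBuilder and consumes it in endElement:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo extends DefaultHandler {

    private final StringBuilder characters = new StringBuilder(64);
    final StringBuilder collected = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node; just accumulate
        characters.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String content = characters.toString().trim();
        if (!content.isEmpty()) {
            collected.append(content).append('\n');
        }
        characters.setLength(0); // reset for the next element
    }

    public static void main(String[] args) throws Exception {
        SaxDemo handler = new SaxDemo();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(
                        "<ink><trace>1 2 3, 4 5 6</trace></ink>")),
                handler);
        System.out.print(handler.collected); // prints: 1 2 3, 4 5 6
    }
}
```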

regex for comma separated values

I am new to writing regular expressions, so please help.
I want to match this pattern (in Java):
"ABC",010,00,"123",0,"time","time",01,00, 10, 10,88,217," ",," "
The data I get will always be in the above format, with 16 values; the format will never change.
I am not looking for parsing, as that can be done with Java's split too.
I will have large chunks of this data, so I want to capture the first 16 data points and match them against this pattern to check whether I received them correctly, else ignore them.
so far I have only tried this regex:
^(\".\"),.,(\".\"),.,(\".\"),(\".\"),.,.,.,.,.,.,(\".\"),.,(\".\")$
I am still in the process of building it.
I just need to match the pattern from a given pool: I take the first 16 data points and see whether they match this pattern, else ignore them.
Thanks!!
This should do the trick. Keep in mind that it doesn't care what order the data points occur in (i.e. they could all be strings or all numbers).
(\s?("[\w\s]*"|\d*)\s?(,|$)){16}
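A quick check of that pattern against the sample row from the question (the class name CsvPatternDemo is mine):

```java
import java.util.regex.Pattern;

public class CsvPatternDemo {
    public static void main(String[] args) {
        // One group per field: optional space, quoted word/space text or digits, then comma or end
        Pattern p = Pattern.compile("(\\s?(\"[\\w\\s]*\"|\\d*)\\s?(,|$)){16}");
        String row = "\"ABC\",010,00,\"123\",0,\"time\",\"time\",01,00, 10, 10,88,217,\" \",,\" \"";
        System.out.println(p.matcher(row).matches()); // true
    }
}
```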
The code below validates comma-separated strings, decimals, and numbers.
public static void commaSeparatedStrings() {
    String value = "'It\\'s my world', 'Hello World', 'What\\'s up', 'It\\'s just what I expected.'";
    if (value.matches("'([^\'\\\\]*(?:\\\\.[^\'\\\\])*)[\\w\\s,\\.]+'(((,)|(,\\s))'([^\'\\\\]*(?:\\\\.[^\'\\\\])*)[\\w\\s,\\.]+')*")) {
        System.out.println("Valid...");
    } else {
        System.out.println("Invalid...");
    }
}

public static void commaSeparatedDecimals() {
    String value = "-111.00, 22111.00, -1.00";
    // "\\d+([,]|[,\\s]\\d+)*"
    if (value.matches("^([-]?)\\d+\\.\\d{1,10}?(((,)|(,\\s))([-]?)\\d+\\.\\d{1,10}?)*")) {
        System.out.println("Valid...");
    } else {
        System.out.println("Invalid...");
    }
}

public static void commaSeparatedNumbers() {
    String value = "-11, 22, -31";
    if (value.matches("^([-]?)\\d+(((,)|(,\\s))([-]?)\\d+)*")) {
        System.out.println("Valid...");
    } else {
        System.out.println("Invalid...");
    }
}
