Kotlin/Java - How to identify full width characters? - java

TL;DR:
Half-width: Regular width characters.
Eg. 'A' and 'ﾆ'
Full-width: Chars that take two monospaced English chars' space on the display
Eg. '中', 'に' and '１'
I need an implementation of this function:
/**
* @return Is this character a full-width character or not.
*/
fun Char.isFullWidth(): Boolean
{
// What is the most efficient implementation here?
}
No, this is not about data structures for those chars; it's only about the displayed width.
Long Story:
I'm refactoring HyLogger, a logging library focused on text-coloring with gradients. Here is the problem I ran into:
If you look at the first gradient text block printed in the screenshot, the full-width text in the middle messed up the gradient pattern after it, because when calling string.length, they are counted as one character even though they take up twice the size.
You might be asking why on earth anyone would print full-width characters. This is a real problem because almost all characters in languages like Chinese, Japanese, or Korean are full-width and therefore take twice the space, similar to the English full-width characters.
So I need a way to identify full-width characters so that I can calculate them as two gradient-pixels instead of one to solve the problem in the picture.
Known Info:
C++ check if unicode character is full width :
There is a list of East Asian Width characters on the Unicode website (and also the report), but it's probably not efficient to traverse this entire list for every single character when rendering a gradient text block.
Python has a Unicode database library. One possible solution is to call the Python API through Jython, but that would be heavy and probably not very efficient.
Analyzing full width or half width character in Java :
The ICU4J library has Unicode tools to achieve this function, but that library is 12.5 MB large, which isn't optimal for my 50 KB logger library.

The best solution seems to be converting EastAsianWidth.txt to a series of range conditions.
The below function is partially generated with FullWidthUtilGenerator.kt, and it still has some issues to resolve:
It does not account for characters outside the Basic Multilingual Plane (BMP) range (Eg. 𐀀 U+10000) because I haven't figured out how to effectively include them in Java/Kotlin.
(\u10000 gives compilation error)
Adjacent values that are listed separately in EastAsianWidth.txt are not automatically merged into ranges yet. (Eg. \u3010 and \u3011)
/**
* Half-width: Regular width characters.
* Eg. 'A' and 'ﾆ'
*
* Full-width: Chars that take two monospaced English chars' space on the display
* Eg. '中', 'に' and '１'
*
* See FullWidthUtilGenerator.kt
*
* @return Is this character a full-width character or not.
*/
fun Char.isFullWidth(): Boolean
{
return when (this)
{
'\u2329','\u232A','\u23F0','\u23F3','\u267F','\u2693','\u26A1','\u26CE','\u26D4','\u26EA','\u26F5',
'\u26FA','\u26FD','\u2705','\u2728','\u274C','\u274E','\u2757','\u27B0','\u27BF','\u2B50','\u2B55',
'\u3000','\u3004','\u3005','\u3006','\u3007','\u3008','\u3009','\u300A','\u300B','\u300C','\u300D',
'\u300E','\u300F','\u3010','\u3011','\u3014','\u3015','\u3016','\u3017','\u3018','\u3019','\u301A',
'\u301B','\u301C','\u301D','\u3020','\u3030','\u303B','\u303C','\u303D','\u303E','\u309F','\u30A0',
'\u30FB','\u30FF','\u3250','\uA015','\uFE17','\uFE18','\uFE19','\uFE30','\uFE35','\uFE36','\uFE37',
'\uFE38','\uFE39','\uFE3A','\uFE3B','\uFE3C','\uFE3D','\uFE3E','\uFE3F','\uFE40','\uFE41','\uFE42',
'\uFE43','\uFE44','\uFE47','\uFE48','\uFE58','\uFE59','\uFE5A','\uFE5B','\uFE5C','\uFE5D','\uFE5E',
'\uFE62','\uFE63','\uFE68','\uFE69','\uFF04','\uFF08','\uFF09','\uFF0A','\uFF0B','\uFF0C','\uFF0D',
'\uFF3B','\uFF3C','\uFF3D','\uFF3E','\uFF3F','\uFF40','\uFF5B','\uFF5C','\uFF5D','\uFF5E','\uFF5F',
'\uFF60','\uFFE2','\uFFE3','\uFFE4',
in '\u1100'..'\u115F',in '\u231A'..'\u231B',in '\u23E9'..'\u23EC',in '\u25FD'..'\u25FE',
in '\u2614'..'\u2615',in '\u2648'..'\u2653',in '\u26AA'..'\u26AB',in '\u26BD'..'\u26BE',
in '\u26C4'..'\u26C5',in '\u26F2'..'\u26F3',in '\u270A'..'\u270B',in '\u2753'..'\u2755',
in '\u2795'..'\u2797',in '\u2B1B'..'\u2B1C',in '\u2E80'..'\u2E99',in '\u2E9B'..'\u2EF3',
in '\u2F00'..'\u2FD5',in '\u2FF0'..'\u2FFB',in '\u3001'..'\u3003',in '\u3012'..'\u3013',
in '\u301E'..'\u301F',in '\u3021'..'\u3029',in '\u302A'..'\u302D',in '\u302E'..'\u302F',
in '\u3031'..'\u3035',in '\u3036'..'\u3037',in '\u3038'..'\u303A',in '\u3041'..'\u3096',
in '\u3099'..'\u309A',in '\u309B'..'\u309C',in '\u309D'..'\u309E',in '\u30A1'..'\u30FA',
in '\u30FC'..'\u30FE',in '\u3105'..'\u312F',in '\u3131'..'\u318E',in '\u3190'..'\u3191',
in '\u3192'..'\u3195',in '\u3196'..'\u319F',in '\u31A0'..'\u31BF',in '\u31C0'..'\u31E3',
in '\u31F0'..'\u31FF',in '\u3200'..'\u321E',in '\u3220'..'\u3229',in '\u322A'..'\u3247',
in '\u3251'..'\u325F',in '\u3260'..'\u327F',in '\u3280'..'\u3289',in '\u328A'..'\u32B0',
in '\u32B1'..'\u32BF',in '\u32C0'..'\u32FF',in '\u3300'..'\u33FF',in '\u3400'..'\u4DBF',
in '\u4E00'..'\u9FFC',in '\u9FFD'..'\u9FFF',in '\uA000'..'\uA014',in '\uA016'..'\uA48C',
in '\uA490'..'\uA4C6',in '\uA960'..'\uA97C',in '\uAC00'..'\uD7A3',in '\uF900'..'\uFA6D',
in '\uFA6E'..'\uFA6F',in '\uFA70'..'\uFAD9',in '\uFADA'..'\uFAFF',in '\uFE10'..'\uFE16',
in '\uFE31'..'\uFE32',in '\uFE33'..'\uFE34',in '\uFE45'..'\uFE46',in '\uFE49'..'\uFE4C',
in '\uFE4D'..'\uFE4F',in '\uFE50'..'\uFE52',in '\uFE54'..'\uFE57',in '\uFE5F'..'\uFE61',
in '\uFE64'..'\uFE66',in '\uFE6A'..'\uFE6B',in '\uFF01'..'\uFF03',in '\uFF05'..'\uFF07',
in '\uFF0E'..'\uFF0F',in '\uFF10'..'\uFF19',in '\uFF1A'..'\uFF1B',in '\uFF1C'..'\uFF1E',
in '\uFF1F'..'\uFF20',in '\uFF21'..'\uFF3A',in '\uFF41'..'\uFF5A',in '\uFFE0'..'\uFFE1',
in '\uFFE5'..'\uFFE6' -> true
else -> false
}
}
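One possible way to address both open issues is to work with Int code points instead of Char and to keep the ranges in sorted arrays that are searched with a binary search. A \u escape takes exactly four hex digits and a Char is a single 16-bit UTF-16 code unit, which is why '\u10000' does not compile; a code point such as 0x10000 has to be handled as an integer. The following Java sketch is only an illustration under stated assumptions: the class EastAsianWidthTable and its methods are hypothetical names, the input lines are assumed to follow the EastAsianWidth.txt layout (code point or range, semicolon, width class, optional # comment) and to be sorted by code point, and only the W and F classes are treated as full-width. Combining marks and control characters are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of HyLogger): builds a range table from
// EastAsianWidth.txt-style lines and answers width queries by code point.
class EastAsianWidthTable {
    private final int[] starts;
    private final int[] ends;

    // Assumption: lines are sorted by code point, as in the published file.
    EastAsianWidthTable(List<String> lines) {
        List<int[]> ranges = new ArrayList<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            String[] fields = data.split(";");
            if (fields.length < 2) continue;                 // blank or comment-only line
            String width = fields[1].trim();
            if (!width.equals("W") && !width.equals("F")) continue;
            String[] bounds = fields[0].trim().split("\\.\\.");
            int start = Integer.parseInt(bounds[0], 16);
            int end = Integer.parseInt(bounds[bounds.length - 1], 16);
            int[] last = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
            if (last != null && start <= last[1] + 1) {
                last[1] = Math.max(last[1], end);            // merge adjacent entries
            } else {
                ranges.add(new int[] {start, end});
            }
        }
        starts = new int[ranges.size()];
        ends = new int[ranges.size()];
        for (int i = 0; i < ranges.size(); i++) {
            starts[i] = ranges.get(i)[0];
            ends[i] = ranges.get(i)[1];
        }
    }

    int rangeCount() {
        return starts.length;
    }

    // Binary search over the sorted, non-overlapping ranges.
    boolean isFullWidth(int codePoint) {
        int lo = 0, hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < starts[mid]) hi = mid - 1;
            else if (codePoint > ends[mid]) lo = mid + 1;
            else return true;
        }
        return false;
    }

    // Counts full-width code points as two columns and everything else as one.
    int displayWidth(String text) {
        return text.codePoints().map(cp -> isFullWidth(cp) ? 2 : 1).sum();
    }
}
```

Because the lookup takes an int, characters outside the BMP need no special casing: String.codePoints() already combines surrogate pairs before the table is consulted. The table could be generated once at build time, so the text file does not have to ship with the library.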

Related

How can I determine the width of a Unicode character

A friend and I are writing our own console in Java, but we have trouble aligning the lines correctly, because the display width of Unicode characters cannot be determined exactly. As a result, not only the line containing the Unicode characters but also the following lines are shifted.
Is there a way to determine the display width of Unicode characters?
Screenshots of the problem can be found below.
This is how it should look: https://abload.de/img/richtigslkmg.jpeg
This is an example in Terminal: https://abload.de/img/terminal7dj5o.jpeg
This is an example in PowerShell: https://abload.de/img/powershelln7je0.jpeg
This is an example in Visual Studio Code: https://abload.de/img/visualstudiocode4xkuo.jpeg
This is an example in Putty: https://abload.de/img/putty0ujsk.png
EDIT:
I am sorry that the question was unclear.
It is about the display width: in the example I try to determine the display width so that every line has the same length.
The function real_length is supposed to calculate and return the display width.
Here is the example code:
public static void main(String[] args) {
String[] tests = {
"Peter",
"ＳＨＧＡＭＩ",
"Marcel №1",
"πŸ’",
"👨‍❤️‍👨",
"👩‍❤️‍💋‍👩",
"👨‍👩‍👦"
};
for(String test : tests) test(test);
}
public static void test(String text) {
int max = 20;
for(int i = 0; i < max;i++) System.out.print("#");
System.out.println();
System.out.print(text);
int length = real_length(text);
for(int i = 0; i < max - length;i++) System.out.print("#");
System.out.println();
}
public static int real_length(String text) {
return text.length();
}
Unfortunately there is no easy solution to your deceptively simple question, for several reasons:
The width of the characters being rendered on the console might (and probably will) vary, based on the font being used. So the code would need to determine, or assume, the target font in order to calculate widths.
System.out is just a PrintStream that does not know or care about fonts and character width, so any solution has to be independent of that.
Even if you could determine the font being used on the console, and you had a way to determine the width of each character you were trying to render in that specific font, how would that help you? Knowing the variation in widths might conceivably allow you to cleverly tweak the lines being rendered so that they were aligned, but it's just as likely that it wouldn't be practicable.
A potential solution is to leave your code as it stands, and use a monospaced font on the console that println() is writing to, but there are still some major problems with that approach. First, you need to identify a font that is monospaced, but will also support all of the characters you want to render. This can be problematic when including emojis. Second, even if you identify such a font, you may find that all the glyphs for that font are not monospaced! Such a font will ensure that (say) a lowercase i and an uppercase W have the same width, but you can't also make that assumption for emojis, and you can't even assume that the "monospaced" emojis will all have the same non-standard width! Third, the font you identify (if it exists at all) would have to be available in your target environments (your PowerShell, your friend's PuTTY shell, etc.). That is not a major obstacle, but it is one more thing to worry about.
You may find that the rendered text varies by operating system. Your output may look aligned in a Linux terminal window, but that same output, using the same font, might be misaligned in a PowerShell window.
Given all that, a better approach might be to use Swing or JavaFX, where you have finer control over the output being rendered. Even if you are unfamiliar with those technologies, it wouldn't take too long to get something working, just by tweaking some sample code obtained through a search. And even allowing for the learning curve, it would still take less time than coming up with a robust solution for aligning arbitrary characters written to an arbitrary console, because that is a hard problem to solve.
Notes:
Your real_length() method is merely returning the number of UTF-16 code units (char values) in the supplied Java String. That relates to its internal representation, and has no direct correlation with the width of the rendered characters, which is determined by the font being used.
See Emoji exceed monospace character width, breaking column alignment #100730 where Microsoft have declined to address the issue for VS Code.
For SO question Java: how to align UTF Miscellaneous Symbols in plain text, see this answer which solved a similar but simpler problem, but only for the Command Prompt window on Windows.
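To make the first note concrete, here is a small sketch (the class name LengthDemo is made up for illustration) comparing String.length() with String.codePointCount() for two of the emoji strings discussed here, written as Unicode escapes. Neither number is a display width; an emoji joined with zero-width joiners is typically drawn as a single glyph even though it spans several code points.

```java
// Hypothetical demo class: contrasts UTF-16 code units with code points.
class LengthDemo {
    // Number of char values (UTF-16 code units), which is what String.length() returns.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String mask = "\uD83D\uDE37"; // U+1F637, one emoji
        // man + ZWJ + woman + ZWJ + boy, usually rendered as one family glyph
        String family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC66";
        System.out.println(codeUnits(mask) + " code units, " + codePoints(mask) + " code point(s)");
        System.out.println(codeUnits(family) + " code units, " + codePoints(family) + " code point(s)");
    }
}
```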
tl;dr
Use code points rather than char. Avoid calling String#length.
input
+
"#".repeat( targetLength - input.codePoints().toArray().length )
Details
Your Question neglected to show any code. So I can only guess what you are doing and what might be the problem.
Avoid char
I am guessing that your goal is to append a certain number of NUMBER SIGN characters as needed to make a fixed-length row of text.
I am guessing the problem is that you are using the legacy char type, or its wrapper class Character. The char type has been essentially broken since Java 2. As a 16-bit value, char is physically incapable of representing most characters.
Use code point numbers
Instead, use code point integer numbers when working with individual characters. A code point is the number permanently assigned to each of the over 140,000 characters defined in Unicode.
A variety of code point related methods have been added to various classes in Java 5+: String, StringBuilder, Character, etc.
Here we use String#codePoints to get an IntStream of code points, one element for each character in the source. And we use StringBuilder#appendCodePoint to collect the code points for our final result string.
final int targetLength = 10;
final int fillerCodePoint = "#".codePointAt( 0 ); // Annoying zero-based index counting.
String input = "😷🤠🤡";
int[] codePoints = input.codePoints().toArray();
StringBuilder stringBuilder = new StringBuilder();
for ( int index = 0 ; index < targetLength ; index++ )
{
if ( index < codePoints.length )
{
stringBuilder.appendCodePoint( codePoints[ index ] );
} else
{
stringBuilder.appendCodePoint( fillerCodePoint );
}
}
Or, shorten that for loop with the use of a ternary operator.
for ( int index = 0 ; index < targetLength ; index++ )
{
int codePoint = ( index < codePoints.length ) ? codePoints[ index ] : fillerCodePoint;
stringBuilder.appendCodePoint( codePoint );
}
Report result.
System.out.println( Arrays.toString( codePoints ) );
String output = stringBuilder.toString();
System.out.println( "output = " + output );
[128567, 129312, 129313]
output = 😷🤠🤡#######
There is likely a clever way to write that code more briefly with streams and lambdas, but I cannot think of one at the moment.
And, one could cleverly use the String#repeat method in Java 11+.
String output = input + "#".repeat( targetLength - input.codePoints().toArray().length ) ;
Note: This answer is distinct and qualitatively different from my earlier one (which I still stand by).
There is a simple way for a Java application without a graphical user interface to obtain the width of a String being rendered in a given font with a given font size. It requires some AWT classes, which are supported even in a non-AWT environment. Here's a demo using the data provided in the question:
package fixedwidth;
import java.awt.Canvas;
import java.awt.Font;
import java.awt.FontMetrics;
public class FixedWidth {
static String[] tests = {
"Peter", "ＳＨＧＡＭＩ", "Marcel №1", "πŸ’", "👨‍❤️‍👨", "👩‍❤️‍💋‍👩", "👨‍👩‍👦"
};
static Font smallFont = new Font("Monospaced", Font.PLAIN, 10);
static Font bigFont = new Font("Monospaced", Font.BOLD, 24);
/**
* This code is based on an answer by SO user Lonzak.
* See SO Answer https://stackoverflow.com/a/18123024/2985643
*/
public static void main(String[] args) {
FontMetrics fm1 = new Canvas().getFontMetrics(FixedWidth.smallFont);
FixedWidth.demo(tests, fm1);
FontMetrics fm2 = new Canvas().getFontMetrics(FixedWidth.bigFont);
FixedWidth.demo(tests, fm2);
}
static void demo(String[] tests, FontMetrics fm) {
Font f = fm.getFont();
System.out.println("\nFont name:" + f.getName() + ", font size:" +
f.getSize() + ", font style:" + f.getStyle());
for (String test : tests) {
int width = fm.stringWidth(test);
System.out.println("width=" + width + ", data=" + test);
}
}
}
The code above is based on this old answer by user Lonzak to the question Java - FontMetrics without Graphics. Those AWT classes allow you to create a Font with defined characteristics (i.e. name, size, style), and then use a FontMetrics instance to obtain the width of an arbitrary String when using that font.
Here is the output from running the code shown above:
Font name:Monospaced, font size:10, font style:0
width=30, data=Peter
width=60, data=ＳＨＧＡＭＩ
width=59, data=Marcel №1
width=10, data=πŸ’
width=30, data=👨‍❤️‍👨
width=40, data=👩‍❤️‍💋‍👩
width=30, data=👨‍👩‍👦
Font name:Monospaced, font size:24, font style:1
width=70, data=Peter
width=149, data=ＳＨＧＡＭＩ
width=140, data=Marcel №1
width=25, data=πŸ’
width=73, data=👨‍❤️‍👨
width=98, data=👩‍❤️‍💋‍👩
width=74, data=👨‍👩‍👦
Notes:
The first set of results shows the widths of the sample data in the question when using plain Monospaced 10 point font. The second set of results shows the widths of those same strings when using bold Monospaced 24 point font.
The widths don't look correct for some of the emojis, but that is because when the source code and output results are pasted into SO some emoji representations are changed, presumably because of the different font being used in the browser. (I was using Monospaced for both the source and the output.) Here's a screen shot of the original output, showing that the widths at least look plausible:
Even though the widths are being calculated and rendered for a fixed width font (Monospaced), it's clear that the width of the emojis cannot be predicted from the widths of normal keyboard characters.
Sounds like you're looking for a Java implementation of the POSIX wcwidth and wcswidth functions, which implement the rules defined in Unicode Technical Report #11 (which exclusively focuses on display widths for Unicode codepoints when rendered to fixed width devices - terminals and the like). The only such Java implementation that I'm aware of is in the JLine3 library, which is a lot of code to bring in for just this one class, but that may be your best bet.
Note however that that code appears to be incomplete. Unicode codepoint 0x26AA (⚪️), for example, is reported as having a width of 1 by the JLine3 code, but on every platform I've tested on (including here in the StackOverflow editor, which is a fixed width "device") that codepoint is displayed over two columns.
Good luck - this stuff is a lot more complex than it looks. The JVM's unfortunate UCS-2 history (not Sun's fault - it was bad timing wrt the Unicode standard) only makes matters worse, and as others have said here, avoid the char and Character data types like the plague - they do not work the way you expect, and the instant code that uses those types encounters data including codepoints from the Unicode supplemental planes, it is almost certain to function incorrectly (unless the author has been especially careful - do you feel lucky? 😉).

How to center a print statement text?

So I was working on my Java project, and in one part of the program I'm printing out text.
The text is displayed on the left side.
However, I want it to be displayed in the middle.
How may I accomplish this?
Is this a newbie question?
Example:
public static void main(String[] args)
{
System.out.println("Hello");
}
VERY QUICK answer
You can use the JavaCurses library to do fun things on the console. Read below it's in there.
Before you do though let's answer your entire question in some context
It is a newbie question :) but it's a valid question. So some hints for you:
The first question is: how wide is the terminal? (It's counted in number of characters.) Old terminals had fixed dimensions of 80 characters by 25 lines.
So as a first step start with the assumption that it's 80 characters wide.
How would you center a string on an 80 character wide terminal screen?
Do you need to worry about the length of the string? How do you position something horizontally? Do you add spaces? Is there a format string you can come up with?
Once you've written a program such that you can give it any string that will display properly on those assumptions (that terminal is 80 characters wide) you can now start worrying about what happens if you are connected to a terminal which is more or less than 80 characters? Or whether or not you are even connected to a terminal. For example if you are not does it make sense to "prettify" your code? probably not.
So question is how do you get all this information?
What you are asking for is the ability to treat the console as a smart teletype (tty) terminal with character-based control capabilities. Teletype terminals of old could do a lot of fun things.
Some history
Teletype terminals were complicated things and come from a legacy in which there were lots of terminal manufacturers (IBM, DEC, etc.) ... These teletype terminals were developed to solve lots of problems like being able to display content remotely from mainframes and minicomputers.
There were a bunch of terminal standards vt100, vt200, vt220, ansi, that came about at various points in terminal development history and hundreds of proprietary ones along the way.
These terminals could do positioning of cursors, windowing, colors, highlighted text, underlining etc., but not every terminal could do everything. All of this was done using "control" characters. Ctrl-L is clear screen on ANSI and VT terminals, but it may be form feed on something else.
If you wrote a program specific to one terminal, it would make no sense elsewhere. So the need to make that simple caused a couple of abstraction libraries to be developed that would hide away the hideousness.
The first one is called termcap (terminal-capabilities) library, circa 1978, which provided a generic way to deal with terminals on UNIX systems. It could tell a running program of the available capabilities of the terminal (for example the ability to change text color) or to position cursor at a location, or to clear itself etc, and the program would then modify its behavior accordingly.
The second library is called curses, circa 1985 (??). It was developed as part of the BSD system and was used to write games ... One of the most popular versions of this library is the GNU implementation, ncurses.
On VMS I believe the library is called SMG$ (screen management library).
On with the answer
Anyhow, you can use one of these libraries in Java to determine whether or not you are working on a proper terminal. There is a library called JavaCurses on SourceForge that provides this capability to Java programs. This will be an exercise in learning how to integrate a new library into your programs and should be exciting.
JavaCurses provides terminal programming capability on both Unix and Windows environments. It will be a fun exercise for you to see if you can use it to play with.
advanced exercise
Another exercise would be to use that same library to see if you can create a program that displays nicely on a terminal and also writes out to a text file without the terminal codes.
If you have any issues, post away, I'll help as you go along.
If you have a definite line length, Apache Commons Lang's StringUtils.center will easily do the job. However, you have to add that library as a dependency. See its Javadoc.
Java print statements to the console can't be centered as there is no maximum width to a line.
If your console is limited to, for example, 80 chars, you could write a special logger that would pad the string with spaces.
If your string was longer than 80 chars, then you would have to cut the string and print the remainder on the next line. Also, if someone else was using your app with a console with a different width (especially a smaller one), it would look weird.
So basically, no, there is no easy way to center the output...
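The padding approach described above could look roughly like the following sketch. It assumes a known console width passed in by the caller (the console itself does not report one to System.out), uses a hypothetical helper named centerLines, relies on String.repeat from Java 11+, and measures with String.length(), so it is only accurate for text where every char occupies one column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: centers text for a console of an assumed fixed width.
class ConsoleCenter {
    // Cuts the text into chunks of at most `width` chars and left-pads each
    // chunk with spaces so that it appears centered.
    static List<String> centerLines(String text, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        List<String> lines = new ArrayList<>();
        for (int start = 0; start < text.length(); start += width) {
            String chunk = text.substring(start, Math.min(text.length(), start + width));
            int left = (width - chunk.length()) / 2;
            lines.add(" ".repeat(left) + chunk);
        }
        if (lines.isEmpty()) {
            lines.add(""); // empty input still yields one (empty) line
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : centerLines("Hello", 80)) {
            System.out.println(line);
        }
    }
}
```

Trailing spaces are omitted on purpose, since they are invisible on a console and only matter when something else is printed on the same line.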
You could do something like:
public static void main(String[] args) {
    String h = "Hello";
    System.out.println(String.format("%20s", h));
}
This approach right-aligns the string in a field of the given width. Here Hello is printed at the end of a 20-character field, so it is preceded by 15 spaces. If the integer between % and s were negative ("%-20s"), the string would be left-aligned and the spaces would trail it instead.
Just adjust the integer between % and s until you get the desired result.
As with a lot of programming questions, don't reinvent the wheel!
Apache has a nice library, "org.apache.commons", that comes with a StringUtils class:
https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html
The pad method is what you want:
int w = 20;
System.out.println(StringUtils.rightPad("+", w - 1, "-") + "+");
System.out.println(StringUtils.center(StringUtils.center("output", w - 2), w, "|"));
System.out.println(StringUtils.rightPad("+", w - 1, "-") + "+");
will give you:
+------------------+
|      output      |
+------------------+
You can't. You are writing to the console which does not have a width so the center is undefined.
If you know the size and don't want to use an external library you could do something like this:
static void printer(String str, int size) {
int left = (size - str.length()) / 2;
int right = size - left - str.length();
String repeatedChar = "-";
StringBuffer buff = new StringBuffer();
for (int i = 0; i < left; i++) {
buff.append(repeatedChar);
}
buff.append(str);
for (int i = 0; i < right; i++) {
buff.append(repeatedChar);
}
// to see the end (and debug) if using spaces as repeatedChar
//buff.append("$");
System.out.println(buff.toString());
}
// testing:
printer("string", 30);
// output:
// ------------string------------
If size minus the length of the string is odd, there will be one more - on the right than on the left. You can also change repeatedChar to a space.
Edit
If you want to print just one char and you know the size, you could do it with the default System.out.printf like so:
int size = 10;
int left = size/2;
int right = size - left;
String format = "%" + left + "c%-" + right + "c";
// would produce: "%5c%-5c"
System.out.printf(format,' ', '#');
// output: "     #    " (without the quotes)
The %-5c aligns the # character to the left of the 5-character field assigned to it.

How to accept strings approximately correct as correct, when comparing?

I have a prepopulated SQLite database imported into the assets folder, and I use it to set some text on my buttons and to compare the user's input with the correct answers in that database. But I have two problems which I don't know how to solve.
For example, I have an answer which is "Michael Jordan" or some other two words. If a user enters Michael Jordan I'm good to go, but if he enters Jordan Michael I'm in trouble. It will pop up a wrong answer alert. Is there a way to accept these word shuffles?
Also, if I have an answer "Balls" and the user types in "ball", this will be a wrong answer. How can I make sure that all singulars and plurals get accepted?
Fuzzy String Comparison Algorithm
The custom brute force method below provides word swapping and gives you complete control over the vowel/consonant score thresholds, but increases the total number of comparisons.
You will also want to check methods such as Apache Lucene described in this thread: Fuzzy string search library in Java
Custom Fuzzy Comparison Recipe:
1. Lower Case: All comparisons will be with lower-case text. Either make sure that all words in the reference database are in lower case, or use String.toLowerCase() on each item in the database before comparison. Obviously, preprocessing the list in the database will dramatically increase performance.
2. Remove Spaces and Punctuation: You must make a function that removes all spaces and other punctuation from any phrase. You should have a separate column in your reference with this information pre-calculated for an increase in performance.
3. Custom Compare Function: Your String comparison function will compare each character and assign a custom score based on closeness of letters, in which the lowest scores will indicate the best match. For example, identical characters will add zero to the score. Each mismatched consonant pair will add 2 to the score. Each mismatched vowel pair will add 1. Mixed mismatches will add 3. Normalize the score by the number of characters. Apply a simple threshold to determine acceptable matches. With this scoring, start with threshold=0.2, which will allow approximately one small mistake per 5 characters (this solves simple misspellings, but not missing characters; see Step 4 below).
4. Extra or Missing Characters: Loop through each comparison an extra time for each character position: once without the character in that position and once with an extra character in that position. Report the smallest score for all the loops. Compare that score against the threshold. Break out of the loop and stop comparing if the score is below the threshold, thus indicating a match. This will catch misspellings such as "colage" for "collage".
5. Swap Words: After the loop in Step 4, if the score is still above the threshold, loop through each word of the input phrase, swap it with its adjacent neighbor, and rerun the comparison suite. Obviously, you will have to look at the original raw user phrase to find the word boundaries, rather than the processed phrase without spaces and punctuation from Step 2. This will catch your requirement of allowing "Jordan Michael" to substitute for "Michael Jordan".
For long entries with more than 2 words, this method will incur 10's of comparisons per database entry or more, so there is a definite performance hit.
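A minimal Java sketch of this recipe is shown below. It is an interpretation under assumptions rather than a reference implementation: the class and method names are made up, only a, e, i, o and u count as vowels, a score at or below the threshold is accepted, the missing-character step only removes one character from the longer string, and the word swap only exchanges adjacent words.

```java
import java.util.Locale;

// Hypothetical sketch of the fuzzy comparison recipe; names are illustrative.
class FuzzyMatch {
    static final double THRESHOLD = 0.2;

    // Lower-case the phrase and strip everything except letters and digits.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]", "");
    }

    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    // Position-by-position score for equal-length strings, normalized by length.
    static double score(String a, String b) {
        if (a.length() != b.length()) return Double.MAX_VALUE;
        if (a.isEmpty()) return 0.0;
        int total = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = a.charAt(i), y = b.charAt(i);
            if (x == y) continue;
            boolean vx = isVowel(x), vy = isVowel(y);
            total += (vx && vy) ? 1 : (!vx && !vy) ? 2 : 3;
        }
        return (double) total / a.length();
    }

    // Also tries dropping one character from the longer string.
    static double bestScore(String a, String b) {
        double best = score(a, b);
        String longer = a.length() > b.length() ? a : b;
        String shorter = a.length() > b.length() ? b : a;
        if (longer.length() == shorter.length() + 1) {
            for (int i = 0; i < longer.length(); i++) {
                String trimmed = longer.substring(0, i) + longer.substring(i + 1);
                best = Math.min(best, score(trimmed, shorter));
            }
        }
        return best;
    }

    static boolean matches(String reference, String input) {
        String ref = normalize(reference);
        if (bestScore(ref, normalize(input)) <= THRESHOLD) return true;
        // Swap each pair of adjacent words in the raw input and retry.
        String[] words = input.trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            String[] swapped = words.clone();
            swapped[i] = words[i + 1];
            swapped[i + 1] = words[i];
            if (bestScore(ref, normalize(String.join(" ", swapped))) <= THRESHOLD) return true;
        }
        return false;
    }
}
```

With these assumptions, matches("Michael Jordan", "Jordan Michael") and matches("Balls", "ball") both succeed, while an unrelated answer such as "Larry Bird" is rejected because the lengths differ by more than one character.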
This is a great question. I think, realistically you need a dictionary of "valid" words. However a dictionary on its own will not solve your problems. You also need a set of heuristics based on your dictionary as to what constitutes a valid entry.
I would be tempted to try "tries" here, as you can encapsulate a rich text base better than with alternative methods. Tries, in this case, will offer performance comparable to, say, a word dictionary or the like. The additional benefit of using tries is that it is fairly trivial to add new words/phrases to your application. The downside: tries use a fair amount of memory. That said, there are techniques one can use to compact the data.

Parse BigDecimal from String containing a number in arbitrary format

We read data from XLS cells formatted as text.
The cell hopefully contains a number, output will be a BigDecimal (because of arbitrary precision).
Problem is, the cell format is also arbitrary, which means it may contain numbers like:
with currency symbols ($1000)
leading and trailing whitespaces, or whitespaces in between digits (eg. 1 000 )
digit grouping symbols (eg. 1,000.0)
of course, negative numbers
'o's and 'O's as zeros (eg. 1,ooo.oo)
others I can't think of
It's mostly because of this last point that I'm looking for a standard library that can do all this, and which is configurable, well tested etc.
I looked at Apache first, found nothing but I might be blind... perhaps it's a trivial answer for someone else...
UPDATE: the domain of the question is financial applications. Actually I'm expecting a library where the domain could be an input parameter - financial, scientific, etc. Maybe even more specific: financial with currency symbols? With stock symbols? With distances and other measurement units? I can't believe I'm the first person to think of something like this...
I don't know of any library, but you can try this:
Put your number in a string (e.g. $1,00o,oOO.00).
Remove all occurrences of $, whitespace, or any other strange symbols you can think of...
Replace occurrences of o and O with 0.
Try to parse the number =]
That should solve 99% of the entries...
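Those steps might be sketched in Java as follows. This is a rough illustration with a hypothetical class name; it assumes that '.' is the decimal separator and that ',' and whitespace are only used for digit grouping, so it would misread formats such as 1.000,50.

```java
import java.math.BigDecimal;

// Hypothetical helper implementing the clean-then-parse steps.
class LenientNumberParser {
    static BigDecimal parse(String raw) {
        // Treat the letters o and O as zeros.
        String s = raw.replace('o', '0').replace('O', '0');
        // Drop currency symbols, whitespace, grouping commas and anything
        // else that is not a digit, a decimal point or a minus sign.
        s = s.replaceAll("[^0-9.\\-]", "");
        // Throws NumberFormatException if nothing parseable is left.
        return new BigDecimal(s);
    }

    public static void main(String[] args) {
        System.out.println(parse("$1,00o,oOO.00"));
        System.out.println(parse(" 1 000 "));
    }
}
```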
Buy a bunch of photos or, even better, videos with legal adult content. Create a web site with these resources, but limit access with a captcha that displays unsolved number formats. Create a set of number decoders out of known number formats and create an algorithm which will add new ones based on user-solved captchas.
I think this is what I've been looking for:
http://site.icu-project.org/
Very powerful library, although at the moment it's not clear whether it can only format or all the formatted stuff can be parsed back as well.

Where can I find "reference barcodes" to verify barcode library output?

This question is not about 'best' barcode library recommendation, we use various products on different platforms, and need a simple way to verify if a given barcode is correct (according to its specification).
We have found cases where a barcode is rendered differently by different barcode libraries and free online barcode generators in the Internet. For example, a new release of a Delphi reporting library outputs non-numeric characters in Code128 as '0' or simply skips them in the text area. Before we do the migration, we want to check if these changes are caused by a broken implementation in the new library so we can report this as a bug to the author.
We mainly need Code128 and UCC/EAN-128 with A/B/C subcodes.
Online resources I checked so far are:
IDAutomation.com (displays ABC123 as 0123 with Code128-C)
Morovia.com
BarcodesInc (does not accept comma)
TEC-IT
They show different results too, for example in support for characters like comma or plus signs, at least in the human readable text.
For Code128 there isn't a single correct answer. If you use Code128-A you can get a different result than Code128-C. By result I mean how it looks. Take "803150" as an example. In Code128-A you'll need 6 characters (+ start, checksum, stop) to represent this number. Code128-C only consists of numbers, so you can compress two digits into one character. Hence you'll need only 3 characters (+ start, checksum, stop) to represent the same number. The barcodes will look different (A being longer in this case), but if you scan them both will give the correct number.
Further, Code128 doesn't need to be just A, B or C. You can actually combine the different subsets. This is common for cases like "US123457890", where Code128-A or B is used on "US" and Code128-C is used on the remaining digits. This is sometimes referred to as Code-128 Auto, or just Code-128. The result is a "compressed" barcode in terms of width. You could represent the same data with A/B, but again that would give you a longer barcode.
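To compare library output against something computed by hand, the symbol values and the check symbol for the "803150" example can be derived with a few lines of Java. The sketch below uses made-up class and method names and covers only subsets B and C; it relies on the standard Code 128 rules that subset B maps ASCII 32..127 to values 0..95, subset C maps each digit pair to its numeric value, Start B is 104, Start C is 105, and the check symbol is the weighted sum modulo 103. For digits, subset A uses the same values as B with start value 103.

```java
// Hypothetical helper for checking Code 128 symbol values by hand.
class Code128Check {
    static final int START_B = 104;
    static final int START_C = 105;

    // Code Set B: ASCII 32..127 maps to symbol values 0..95.
    static int[] codeBValues(String data) {
        int[] values = new int[data.length()];
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c < 32 || c > 127) {
                throw new IllegalArgumentException("not encodable in Code Set B: " + c);
            }
            values[i] = c - 32;
        }
        return values;
    }

    // Code Set C: every pair of digits becomes one symbol with value 00..99.
    static int[] codeCValues(String digits) {
        if (digits.length() % 2 != 0 || !digits.matches("\\d*")) {
            throw new IllegalArgumentException("Code Set C needs an even number of digits");
        }
        int[] values = new int[digits.length() / 2];
        for (int i = 0; i < values.length; i++) {
            values[i] = Integer.parseInt(digits.substring(2 * i, 2 * i + 2));
        }
        return values;
    }

    // Check symbol: (start + sum of position * value) mod 103, positions from 1.
    static int checkValue(int start, int[] values) {
        int sum = start;
        for (int i = 0; i < values.length; i++) {
            sum += (i + 1) * values[i];
        }
        return sum % 103;
    }

    public static void main(String[] args) {
        int[] b = codeBValues("803150");
        int[] c = codeCValues("803150");
        System.out.println("B: " + b.length + " data symbols, check " + checkValue(START_B, b));
        System.out.println("C: " + c.length + " data symbols, check " + checkValue(START_C, c));
    }
}
```

Two generators that pick different subsets will therefore produce different bar patterns and different check symbols for the same data, and both can still be valid.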
Take two online generators:
IDAutomation
BarcodesInc
I recommend the first one, where you can select between Auto/A/B/C. Here is an example image illustrating the differences:
On IDAutomation, Auto is the default, while A is the default on Barcodes-Inc. Both are correct; you just need to be careful about which subset you have selected when comparing output. I also recommend a barcode reader for use in development to test the output. Also, see this page for a comparison of the different subsets with ASCII values. I also find grandzebu.net useful, which has a free Code128 font you can use as well.
It sounds like your Delphi library always uses Code128-C, since only digits can be represented in this subset.
Why not just scan them and see what comes back?
