A friend and I are programming our own console in Java, but we have problems aligning the lines correctly, because the display width of Unicode characters cannot be determined exactly. As a result, not only the line containing the Unicode characters but also the following lines are shifted.
Is there a way to determine the display width of Unicode characters?
Screenshots of the problem can be found below.
This is how it should look: https://abload.de/img/richtigslkmg.jpeg
This is an example in Terminal: https://abload.de/img/terminal7dj5o.jpeg
This is an example in PowerShell: https://abload.de/img/powershelln7je0.jpeg
This is an example in Visual Studio Code: https://abload.de/img/visualstudiocode4xkuo.jpeg
This is an example in Putty: https://abload.de/img/putty0ujsk.png
EDIT:
I am sorry that the question was unclear.
It is about the display width. In the example I try to determine the display width of each string so that every line ends up the same length.
The function real_length is meant to calculate and return the display width.
Here is the example code:
public static void main(String[] args) {
String[] tests = {
"Peter",
"SHGAMI",
"Marcel №1",
"💏",
"👨❤️👨",
"👩❤️💋👩",
"👨👩👦"
};
for(String test : tests) test(test);
}
public static void test(String text) {
int max = 20;
for(int i = 0; i < max;i++) System.out.print("#");
System.out.println();
System.out.print(text);
int length = real_length(text);
for(int i = 0; i < max - length;i++) System.out.print("#");
System.out.println();
}
public static int real_length(String text) {
return text.length();
}
Unfortunately there is no easy solution to your deceptively simple question, for several reasons:
The width of the characters being rendered on the console might (and probably will) vary, based on the font being used. So the code would need to determine, or assume, the target font in order to calculate widths.
System.out is just a PrintStream that does not know or care about fonts and character width, so any solution has to be independent of that.
Even if you could determine the font being used on the console, and you had a way to determine the width of each character you were trying to render in that specific font, how would that help you? Knowing the variation in widths might conceivably allow you to cleverly tweak the lines being rendered so that they were aligned, but it's just as likely that it wouldn't be practicable.
A potential solution is to leave your code as it stands, and use a monospaced font on the console that println() is writing to, but there are still some major problems with that approach. First, you need to identify a font that is monospaced, but will also support all of the characters you want to render. This can be problematic when including emojis. Second, even if you identify such a font, you may find that all the glyphs for that font are not monospaced! Such a font will ensure that (say) a lowercase i and an uppercase W have the same width, but you can't also make that assumption for emojis, and you can't even assume that the "monospaced" emojis will all have the same non-standard width! Third, the font you identify (if it exists at all) would have to be available in your target environments (your PowerShell, your friend's PuTTY shell, etc.). That is not a major obstacle, but it is one more thing to worry about.
You may find that the rendered text varies by operating system. Your output may look aligned in a Linux terminal window, but that same output, using the same font, might be misaligned in a PowerShell window.
Given all that, a better approach might be to use Swing or JavaFX, where you have finer control over the output being rendered. Even if you are unfamiliar with those technologies, it wouldn't take too long to get something working, just by tweaking some sample code obtained through a search. And even allowing for the learning curve, it would still take less time than coming up with a robust solution for aligning arbitrary characters written to an arbitrary console, because that is a hard problem to solve.
Notes:
Your real_length() method merely returns String.length(), which is the number of UTF-16 code units (char values) in the supplied Java String, not even the number of code points. That relates to the string's internal representation, and has no direct correlation with the width of the rendered characters, which is determined by the font being used.
See Emoji exceed monospace character width, breaking column alignment #100730 where Microsoft have declined to address the issue for VS Code.
For SO question Java: how to align UTF Miscellaneous Symbols in plain text, see this answer which solved a similar but simpler problem, but only for the Command Prompt window on Windows.
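To make the first note concrete, here is a minimal sketch comparing String.length() with codePointCount(). The class name LengthDemo is made up for illustration, and the emoji are written as escapes so the counts do not depend on the source file encoding. Neither number is a display width.

```java
// Hypothetical demo class; not part of the question's code.
final class LengthDemo {

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F48F: one code point stored as a surrogate pair.
        String kiss = "\uD83D\uDC8F";
        // man + zero width joiner + heart + variation selector + zero width joiner + man
        String couple = "\uD83D\uDC68\u200D\u2764\uFE0F\u200D\uD83D\uDC68";

        System.out.println(codeUnits("Peter") + " / " + codePoints("Peter")); // 5 / 5
        System.out.println(codeUnits(kiss) + " / " + codePoints(kiss));       // 2 / 1
        System.out.println(codeUnits(couple) + " / " + codePoints(couple));   // 8 / 6
    }
}
```

The last string is typically drawn as a single emoji, yet it is 8 code units and 6 code points, so neither count can be used to align console output.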
tl;dr
Use code points rather than char. Avoid calling String#length.
input
+
"#".repeat( targetLength - input.codePoints().toArray().length )
Details
Your Question neglected to show any code. So I can only guess what you are doing and what might be the problem.
Avoid char
I am guessing that your goal is to append a certain number of NUMBER SIGN characters as needed to make a fixed-length row of text.
I am guessing the problem is that you are using the legacy char type, or its wrapper class Character. The char type has been essentially broken since Java 2. As a 16-bit value, char is physically incapable of representing most characters.
Use code point numbers
Instead, use code point integer numbers when working with individual characters. A code point is the number permanently assigned to each of the over 140,000 characters defined in Unicode.
A variety of code point related methods have been added to various classes in Java 5+: String, StringBuilder, Character, etc.
Here we use String#codePoints to get an IntStream of code points, one element for each character in the source. And we use StringBuilder#appendCodePoint to collect the code points for our final result string.
final int targetLength = 10;
final int fillerCodePoint = "#".codePointAt( 0 ); // Annoying zero-based index counting.
String input = "😷🤠🤡";
int[] codePoints = input.codePoints().toArray();
StringBuilder stringBuilder = new StringBuilder();
for ( int index = 0 ; index < targetLength ; index++ )
{
if ( index < codePoints.length )
{
stringBuilder.appendCodePoint( codePoints[ index ] );
} else
{
stringBuilder.appendCodePoint( fillerCodePoint );
}
}
Or, shorten that for loop with the use of a ternary operator.
for ( int index = 0 ; index < targetLength ; index++ )
{
int codePoint = ( index < codePoints.length ) ? codePoints[ index ] : fillerCodePoint;
stringBuilder.appendCodePoint( codePoint );
}
Report result.
System.out.println( Arrays.toString( codePoints ) );
String output = stringBuilder.toString();
System.out.println( "output = " + output );
[128567, 129312, 129313]
output = 😷🤠🤡#######
There is likely a clever way to write that code more briefly with streams and lambdas, but I cannot think of one at the moment.
And, one could cleverly use the String#repeat method in Java 11+.
String output = input + "#".repeat( targetLength - input.codePoints().toArray().length ) ;
Note: This answer is distinct and qualitatively different from my earlier one (which I still stand by).
There is a simple way for a console Java application (i.e. one not using a graphical user interface) to obtain the width of a String being rendered in a given font with a given font size. It requires the use of some AWT classes, which can be used even when the application has no GUI. Here's a demo using the data provided in the question:
package fixedwidth;
import java.awt.Canvas;
import java.awt.Font;
import java.awt.FontMetrics;
public class FixedWidth {
static String[] tests = {
"Peter", "SHGAMI", "Marcel №1", "💏", "👨❤️👨", "👩❤️💋👩", "👨👩👦"
};
static Font smallFont = new Font("Monospaced", Font.PLAIN, 10);
static Font bigFont = new Font("Monospaced", Font.BOLD, 24);
/**
* This code is based on an answer by SO user Lonzak.
* See SO Answer https://stackoverflow.com/a/18123024/2985643
*/
public static void main(String[] args) {
FontMetrics fm1 = new Canvas().getFontMetrics(FixedWidth.smallFont);
FixedWidth.demo(tests, fm1);
FontMetrics fm2 = new Canvas().getFontMetrics(FixedWidth.bigFont);
FixedWidth.demo(tests, fm2);
}
static void demo(String[] tests, FontMetrics fm) {
Font f = fm.getFont();
System.out.println("\nFont name:" + f.getName() + ", font size:" +
f.getSize() + ", font style:" + f.getStyle());
for (String test : tests) {
int width = fm.stringWidth(test);
System.out.println("width=" + width + ", data=" + test);
}
}
}
The code above is based on this old answer by user Lonzak to the question Java - FontMetrics without Graphics. Those AWT classes allow you to create a Font with defined characteristics (i.e. name, size, style), and then use a FontMetrics instance to obtain the width of an arbitrary String when using that font.
Here is the output from running the code shown above:
Font name:Monospaced, font size:10, font style:0
width=30, data=Peter
width=60, data=SHGAMI
width=59, data=Marcel №1
width=10, data=💏
width=30, data=👨❤️👨
width=40, data=👩❤️💋👩
width=30, data=👨👩👦
Font name:Monospaced, font size:24, font style:1
width=70, data=Peter
width=149, data=SHGAMI
width=140, data=Marcel №1
width=25, data=💏
width=73, data=👨❤️👨
width=98, data=👩❤️💋👩
width=74, data=👨👩👦
Notes:
The first set of results shows the widths of the sample data in the question when using plain Monospaced 10 point font. The second set of results shows the widths of those same strings when using bold Monospaced 24 point font.
The widths don't look correct for some of the emojis, but that is because when the source code and output results are pasted into SO some emoji representations are changed, presumably because of the different font being used in the browser. (I was using Monospaced for both the source and the output.) Here's a screen shot of the original output, showing that the widths at least look plausible:
Even though the widths are being calculated and rendered for a fixed width font (Monospaced), it's clear that the width of the emojis cannot be predicted from the widths of normal keyboard characters.
Sounds like you're looking for a Java implementation of the POSIX wcwidth and wcswidth functions, which implement the rules defined in Unicode Technical Report #11 (which exclusively focuses on display widths for Unicode codepoints when rendered to fixed width devices - terminals and the like). The only such Java implementation that I'm aware of is in the JLine3 library, which is a lot of code to bring in for just this one class, but that may be your best bet.
Note however that that code appears to be incomplete. Unicode codepoint 0x26AA (⚪️), for example, is reported as having a width of 1 by the JLine3 code, but on every platform I've tested on (including here in the StackOverflow editor, which is a fixed width "device") that codepoint is displayed over two columns.
Good luck - this stuff is a lot more complex than it looks. The JVM's unfortunate UCS-2 history (not Sun's fault - it was bad timing wrt the Unicode standard) only makes matters worse, and as others have said here, avoid the char and Character data types like the plague - they do not work the way you expect, and the instant code that uses those types encounters data including codepoints from the Unicode supplemental planes, it is almost certain to function incorrectly (unless the author has been especially careful - do you feel lucky? 😉).
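For a sense of what a wcwidth-style function looks like, here is a rough sketch in plain Java. It is not the JLine3 code: the class name DisplayWidth and the list of "wide" ranges are assumptions chosen for illustration, the ranges are deliberately incomplete, and emoji sequences joined with zero width joiners are not treated as a single glyph.

```java
// Rough, incomplete approximation of wcwidth/wcswidth. The range list is an
// assumption for illustration, not a copy of any library's table.
final class DisplayWidth {

    // Columns for one code point: 0 for combining, format and control
    // characters, 2 for the "wide" ranges below, otherwise 1.
    static int ofCodePoint(int cp) {
        int type = Character.getType(cp);
        if (type == Character.NON_SPACING_MARK
                || type == Character.ENCLOSING_MARK
                || type == Character.FORMAT
                || type == Character.CONTROL) {
            return 0;
        }
        return isWide(cp) ? 2 : 1;
    }

    // Hand-picked ranges: Hangul Jamo, CJK, Hangul syllables, compatibility
    // forms, fullwidth forms, two emoji blocks, supplementary ideographs.
    private static boolean isWide(int cp) {
        return (cp >= 0x1100 && cp <= 0x115F)
            || (cp >= 0x2E80 && cp <= 0xA4CF && cp != 0x303F)
            || (cp >= 0xAC00 && cp <= 0xD7A3)
            || (cp >= 0xF900 && cp <= 0xFAFF)
            || (cp >= 0xFE30 && cp <= 0xFE6F)
            || (cp >= 0xFF00 && cp <= 0xFF60)
            || (cp >= 0xFFE0 && cp <= 0xFFE6)
            || (cp >= 0x1F300 && cp <= 0x1F64F)
            || (cp >= 0x1F900 && cp <= 0x1F9FF)
            || (cp >= 0x20000 && cp <= 0x3FFFD);
    }

    // Sum of the per-code-point widths of the whole string.
    static int ofString(String s) {
        return s.codePoints().map(DisplayWidth::ofCodePoint).sum();
    }

    // Appends fill characters until the string occupies the requested columns.
    static String padRight(String s, int columns, char fill) {
        int missing = Math.max(0, columns - ofString(s));
        return s + String.valueOf(fill).repeat(missing);
    }

    public static void main(String[] args) {
        for (String test : new String[] {"Peter", "Marcel \u21161", "\u4E2D\u6587"}) {
            System.out.println(padRight(test, 20, '#'));
        }
    }
}
```

Whether the padded lines actually line up still depends on the terminal and font agreeing with the table, which is exactly the mismatch described above for U+26AA.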
TL;DR:
Half-width: Regular width characters.
Eg. 'A' and 'ﾆ'
Full-width: Chars that take two monospaced English chars' space on the display
Eg. '中', 'に' and 'Ａ'
I need an implementation of this function:
/**
* @return Is this character a full-width character or not.
*/
fun Char.isFullWidth(): Boolean
{
// What is the most efficient implementation here?
}
No this is not about data structures for those chars, it's only about the displayed width.
Long Story:
I'm refactoring HyLogger, a logging library focused on text-coloring with gradients. Here is the problem I ran into:
If you look at the first gradient text block printed in the screenshot, the full-width text in the middle messed up the gradient pattern after it, because when calling string.length, they are counted as one character even though they take up twice the size.
You might be asking, why on earth would anyone print full-width characters? This is a real problem because almost all characters in languages like Chinese, Japanese, or Korean are full-width and therefore take twice the space, similar to the English full-width characters.
So I need a way to identify full-width characters so that I can calculate them as two gradient-pixels instead of one to solve the problem in the picture.
Known Info:
C++ check if unicode character is full width :
There is a list of East Asian Width characters on the Unicode website (and also the report), but it's probably not efficient to traverse this entire list for every single character when rendering a gradient text block.
Python has this Unicode database library, one possible solution is to call python API using Jython, which would be heavy and the efficiency is probably not very good.
Analyzing full width or half width character in Java :
The ICU4J library has Unicode tools to achieve this function, but that library is 12.5 MB large, which isn't optimal for my 50 KB logger library.
The best solution seems to be converting EastAsianWidth.txt to a series of range conditions.
The below function is partially generated with FullWidthUtilGenerator.kt, and it still has some issues to resolve:
It does not account for characters outside the Basic Multilingual Plane (BMP) range (Eg. 𐀀 U+10000) because I haven't figured out how to effectively include them in Java/Kotlin.
(\u10000 gives compilation error)
Near values that are stated separately in EastAsianWidth.txt are not automatically combined yet. (Eg. \u3010 and \u3011)
/**
* Half-width: Regular width characters.
* Eg. 'A' and 'ﾆ'
*
* Full-width: Chars that take two monospaced English chars' space on the display
* Eg. '中', 'に' and 'Ａ'
*
* See FullWidthUtilGenerator.kt
*
* @return Is this character a full-width character or not.
*/
fun Char.isFullWidth(): Boolean
{
return when (this)
{
'\u2329','\u232A','\u23F0','\u23F3','\u267F','\u2693','\u26A1','\u26CE','\u26D4','\u26EA','\u26F5',
'\u26FA','\u26FD','\u2705','\u2728','\u274C','\u274E','\u2757','\u27B0','\u27BF','\u2B50','\u2B55',
'\u3000','\u3004','\u3005','\u3006','\u3007','\u3008','\u3009','\u300A','\u300B','\u300C','\u300D',
'\u300E','\u300F','\u3010','\u3011','\u3014','\u3015','\u3016','\u3017','\u3018','\u3019','\u301A',
'\u301B','\u301C','\u301D','\u3020','\u3030','\u303B','\u303C','\u303D','\u303E','\u309F','\u30A0',
'\u30FB','\u30FF','\u3250','\uA015','\uFE17','\uFE18','\uFE19','\uFE30','\uFE35','\uFE36','\uFE37',
'\uFE38','\uFE39','\uFE3A','\uFE3B','\uFE3C','\uFE3D','\uFE3E','\uFE3F','\uFE40','\uFE41','\uFE42',
'\uFE43','\uFE44','\uFE47','\uFE48','\uFE58','\uFE59','\uFE5A','\uFE5B','\uFE5C','\uFE5D','\uFE5E',
'\uFE62','\uFE63','\uFE68','\uFE69','\uFF04','\uFF08','\uFF09','\uFF0A','\uFF0B','\uFF0C','\uFF0D',
'\uFF3B','\uFF3C','\uFF3D','\uFF3E','\uFF3F','\uFF40','\uFF5B','\uFF5C','\uFF5D','\uFF5E','\uFF5F',
'\uFF60','\uFFE2','\uFFE3','\uFFE4',
in '\u1100'..'\u115F',in '\u231A'..'\u231B',in '\u23E9'..'\u23EC',in '\u25FD'..'\u25FE',
in '\u2614'..'\u2615',in '\u2648'..'\u2653',in '\u26AA'..'\u26AB',in '\u26BD'..'\u26BE',
in '\u26C4'..'\u26C5',in '\u26F2'..'\u26F3',in '\u270A'..'\u270B',in '\u2753'..'\u2755',
in '\u2795'..'\u2797',in '\u2B1B'..'\u2B1C',in '\u2E80'..'\u2E99',in '\u2E9B'..'\u2EF3',
in '\u2F00'..'\u2FD5',in '\u2FF0'..'\u2FFB',in '\u3001'..'\u3003',in '\u3012'..'\u3013',
in '\u301E'..'\u301F',in '\u3021'..'\u3029',in '\u302A'..'\u302D',in '\u302E'..'\u302F',
in '\u3031'..'\u3035',in '\u3036'..'\u3037',in '\u3038'..'\u303A',in '\u3041'..'\u3096',
in '\u3099'..'\u309A',in '\u309B'..'\u309C',in '\u309D'..'\u309E',in '\u30A1'..'\u30FA',
in '\u30FC'..'\u30FE',in '\u3105'..'\u312F',in '\u3131'..'\u318E',in '\u3190'..'\u3191',
in '\u3192'..'\u3195',in '\u3196'..'\u319F',in '\u31A0'..'\u31BF',in '\u31C0'..'\u31E3',
in '\u31F0'..'\u31FF',in '\u3200'..'\u321E',in '\u3220'..'\u3229',in '\u322A'..'\u3247',
in '\u3251'..'\u325F',in '\u3260'..'\u327F',in '\u3280'..'\u3289',in '\u328A'..'\u32B0',
in '\u32B1'..'\u32BF',in '\u32C0'..'\u32FF',in '\u3300'..'\u33FF',in '\u3400'..'\u4DBF',
in '\u4E00'..'\u9FFC',in '\u9FFD'..'\u9FFF',in '\uA000'..'\uA014',in '\uA016'..'\uA48C',
in '\uA490'..'\uA4C6',in '\uA960'..'\uA97C',in '\uAC00'..'\uD7A3',in '\uF900'..'\uFA6D',
in '\uFA6E'..'\uFA6F',in '\uFA70'..'\uFAD9',in '\uFADA'..'\uFAFF',in '\uFE10'..'\uFE16',
in '\uFE31'..'\uFE32',in '\uFE33'..'\uFE34',in '\uFE45'..'\uFE46',in '\uFE49'..'\uFE4C',
in '\uFE4D'..'\uFE4F',in '\uFE50'..'\uFE52',in '\uFE54'..'\uFE57',in '\uFE5F'..'\uFE61',
in '\uFE64'..'\uFE66',in '\uFE6A'..'\uFE6B',in '\uFF01'..'\uFF03',in '\uFF05'..'\uFF07',
in '\uFF0E'..'\uFF0F',in '\uFF10'..'\uFF19',in '\uFF1A'..'\uFF1B',in '\uFF1C'..'\uFF1E',
in '\uFF1F'..'\uFF20',in '\uFF21'..'\uFF3A',in '\uFF41'..'\uFF5A',in '\uFFE0'..'\uFFE1',
in '\uFFE5'..'\uFFE6' -> true
else -> false
}
}
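Both open issues (characters outside the BMP, and the cost of a long chain of conditions) can be approached by working with int code points and a sorted range table searched by binary search. The sketch below is in Java and uses only JDK methods, which Kotlin on the JVM can call as well. The class name FullWidthTable is hypothetical, and RANGES is a small hand-picked excerpt, not the full content of EastAsianWidth.txt; a generator would emit the complete, merged list.

```java
// Sketch of a range-table lookup keyed by code point rather than by char.
final class FullWidthTable {

    // Sorted, non-overlapping [start, end] pairs, flattened into one array.
    // Illustrative excerpt only; the emoji range in particular is approximate.
    private static final int[] RANGES = {
        0x1100, 0x115F,
        0x2E80, 0x2E99,
        0x3041, 0x3096,
        0x4E00, 0x9FFF,
        0xAC00, 0xD7A3,
        0xFF01, 0xFF60,
        0x1F300, 0x1F64F,   // above the BMP: an int literal works where a char escape cannot
        0x20000, 0x2FFFD,
    };

    // Binary search over the pairs: O(log n) comparisons per code point.
    static boolean isFullWidth(int codePoint) {
        int lo = 0;
        int hi = RANGES.length / 2 - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < RANGES[2 * mid]) {
                hi = mid - 1;
            } else if (codePoint > RANGES[2 * mid + 1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }

    // Number of "gradient pixels": two per full-width code point, one otherwise.
    static int columns(String text) {
        return text.codePoints().map(cp -> isFullWidth(cp) ? 2 : 1).sum();
    }

    public static void main(String[] args) {
        System.out.println(isFullWidth('A'));      // false
        System.out.println(isFullWidth(0x4E2D));   // true
        System.out.println(isFullWidth(0x1F600));  // true
        System.out.println(columns("A\u4E2D"));    // 3
    }
}
```

Iterating with codePoints() instead of over Char values is what lets supplementary characters reach the lookup as a single int instead of two surrogate halves.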
I need to change input mask dynamically. For example, if user inputs 13 digits then one mask, if 20 then another.
I am using redmadrobot:inputmask. Here is my code
ArrayList<String> affineFormats = new ArrayList<>();
affineFormats.add("[0000] [000] [000] [000]");
affineFormats.add("[0000] [0000] [0000] [0000] [0000]");
String format = "[0000] [000] [000] [000]";
MaskedTextChangedListener listener = new PolyMaskTextChangedListener(
format,
affineFormats,
true,
etCardNumber,
null,
new MaskedTextChangedListener.ValueListener() {
@Override
public void onTextChanged(boolean b, String s) {
//here some code
}
});
etCardNumber.addTextChangedListener(listener);
But when I enter the card number, the text is always formatted with the last mask added to affineFormats. Please help me fix this problem.
From your code it looks like you are using a slightly outdated version of our library.
In v.4 we already have PolyMaskTextChangedListener merged with the MaskedTextChangedListener. We also introduced a handy utility called AffinityCalculationStrategy, which might actually help with your problem.
From our Wiki:
Affinity calculation strategy
Affinity is an integer number, which represents the similarity between the input and the current mask. Thus, the mask with the highest affinity is picked to format the output.
Affinity calculation strategy is a text field listener property allowing to alter the math behind the affinity calculation.
...
AffinityCalculationStrategy.EXTRACTED_VALUE_CAPACITY— this strategy comes in handy when the mask format radically changes depending on the extracted value length.
(and your digits are the extracted value)
So I was working on my java project and in one part of the program I'm printing out text
The text is displayed on the left side
However I wanted it be displayed in the middle
How may I accomplish this?
Is this a newbie question?
Example:
public static void main(String[] args)
{
System.out.println("Hello");
}
VERY QUICK answer
You can use the JavaCurses library to do fun things on the console. Read on; the details are below.
Before you do though let's answer your entire question in some context
It is a newbie question :) but it's a valid question. So some hints for you:
The first question is: how wide is the terminal? (It's counted in number of characters.) Old terminals had fixed dimensions of 80 characters by 25 lines.
So as a first step start with the assumption that it's 80 characters wide.
How would you center a string on an 80 character wide terminal screen?
Do you need to worry about the length of the string? How do you position something horizontally? Do you add spaces? Is there a format string you can come up with?
Once you've written a program that displays any given string properly under those assumptions (that the terminal is 80 characters wide), you can start worrying about what happens if you are connected to a terminal that is wider or narrower than 80 characters, or whether you are even connected to a terminal at all. For example, if you are not, does it make sense to "prettify" your output? Probably not.
So question is how do you get all this information?
What you are asking for is the ability to treat the console as a smart teletype (tty) terminal with character-based control capabilities. Teletype terminals of old could do a lot of fun things.
Some history
Teletype terminals were complicated things, and they come from a legacy in which there were lots of terminal manufacturers (IBM, DEC, etc.) ... These teletype terminals were developed to solve lots of problems, like being able to display content remotely from mainframes and minicomputers.
There were a bunch of terminal standards (vt100, vt200, vt220, ansi) that came about at various points in terminal development history, and hundreds of proprietary ones along the way.
These terminals could do positioning of cursors, windowing and colors, highlight text, underline etc., but not every one could do everything. All of this was done using "control" characters. ctrl-l is clear screen on ansi and vt terminals, but it may be page feed on something else.
If you wrote a program specific to one terminal, it would make no sense elsewhere. The need to make that simple caused a couple of abstraction libraries to be developed that would hide away the hideousness.
The first one is called the termcap (terminal capabilities) library, circa 1978, which provided a generic way to deal with terminals on UNIX systems. It could tell a running program about the available capabilities of the terminal (for example the ability to change text color, to position the cursor at a location, or to clear itself), and the program would then modify its behavior accordingly.
The second library is called curses, circa 1985 (??). It was developed as part of the BSD system and was used to write games ... One of the most popular versions of this library is the GNU implementation, ncurses ("new curses").
On VMS I believe the library is called SMG$ (screen management library).
On with the answer
Anyhow, you can use one of these libraries in Java to determine whether or not you are working on a proper terminal. There is a library called JavaCurses on SourceForge that provides this capability to Java programs. This will be an exercise in learning how to bring a new library into your programs and should be exciting.
JavaCurses provides terminal programming capability on both Unix and Windows environments. It will be a fun exercise for you to see if you can use it to play with.
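Short of pulling in a curses binding, the two questions raised earlier (is there a terminal at all, and how wide is it) can be roughly approximated with the JDK alone. The sketch below is not JavaCurses code. It assumes that System.console() returns null when no interactive console is attached, which holds on many JDK versions but is less reliable on some newer ones, and that a COLUMNS environment variable may be present, which many shells do not export. The class name TerminalInfo is made up.

```java
// Hypothetical helper; a best-effort guess, not a replacement for termcap or curses.
final class TerminalInfo {

    static final int DEFAULT_WIDTH = 80;

    // Parses a COLUMNS-style value, falling back when it is missing,
    // not a number, or not positive.
    static int parseWidth(String columns, int fallback) {
        if (columns == null) {
            return fallback;
        }
        try {
            int n = Integer.parseInt(columns.trim());
            return n > 0 ? n : fallback;
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Assumption: null means output is redirected or no console is attached.
        boolean interactive = System.console() != null;
        int width = parseWidth(System.getenv("COLUMNS"), DEFAULT_WIDTH);
        System.out.println("interactive=" + interactive + ", width=" + width);
    }
}
```

A program could skip all prettifying when interactive is false and write plain lines instead.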
advanced exercise
Another exercise would be to use that same library to see if you can create a program that displays nicely on a terminal and also writes out to a text file without the terminal codes.
If you have any issues, post away, I'll help as you go along.
If you have a definite line length, apache commons StringUtils.center will easily do the job. However, you have to add that library. javadoc
Java print statements to the console can't be centered as there is no maximum width to a line.
If your console is limited to, for example, 80 chars, you could write a special logger that would pad the string with spaces.
If your string were longer than 80 chars, you would have to cut the string and print the remainder on the next line. Also, if someone else were using your app with a console of a different width (especially a smaller one), it would look weird.
So basically, no, there is no easy way to center the output...
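If a fixed width is acceptable, the padding and cutting described above might look like the following sketch. The class and method names are hypothetical, the width is passed in by the caller (80 in the example), and the code counts char values, so it only lines up for text where one char occupies one column.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: cut a string into chunks of at most `width` chars and centre each chunk.
final class CenteringPrinter {

    static List<String> centerWrapped(String text, int width) {
        if (width <= 0) {
            throw new IllegalArgumentException("width must be positive");
        }
        List<String> lines = new ArrayList<>();
        for (int start = 0; start < text.length(); start += width) {
            String chunk = text.substring(start, Math.min(text.length(), start + width));
            // Only left padding is needed for the text to appear centred.
            int pad = (width - chunk.length()) / 2;
            lines.add(" ".repeat(pad) + chunk);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : centerWrapped("Hello", 80)) {
            System.out.println(line);
        }
    }
}
```

On a console narrower than the assumed width the result will still wrap badly, which is the limitation this answer points out.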
You could do something like:
public static void main(String[] args) {
String h = "Hello";
System.out.println(String.format("%20s", h));
}
This approach pads the string with spaces to a total width of 20 characters. With "%20s" the string is right-justified, so Hello is preceded by 15 spaces. A negative width ("%-20s") left-justifies instead, which puts the spaces after Hello.
Just mess with the integer between % and s until you get the desired result.
As with a lot of programming questions: don't reinvent the wheel!
Apache has a nice library, Commons Lang (package org.apache.commons.lang3), that comes with a StringUtils class:
https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html
The center and rightPad methods are what you want:
int w = 20;
System.out.println(StringUtils.rightPad("+", w - 1, "-") + "+");
System.out.println(StringUtils.center(StringUtils.center("output", w - 2), w, "|"));
System.out.println(StringUtils.rightPad("+", w - 1, "-") + "+");
will give you:
+------------------+
|      output      |
+------------------+
You can't. You are writing to the console which does not have a width so the center is undefined.
If you know the size and don't want to use an external library you could do something like this:
static void printer(String str, int size) {
int left = (size - str.length()) / 2;
int right = size - left - str.length();
String repeatedChar = "-";
StringBuffer buff = new StringBuffer();
for (int i = 0; i < left; i++) {
buff.append(repeatedChar);
}
buff.append(str);
for (int i = 0; i < right; i++) {
buff.append(repeatedChar);
}
// to see the end (and debug) if using spaces as repeatedChar
//buff.append("$");
System.out.println(buff.toString());
}
// testing:
printer("string", 30);
// output:
// ------------string------------
If size minus the length of the string is odd, there will be one more - on the right than on the left. You can also change repeatedChar to a space.
Edit
If you want to print just one char and you know the size, you could do it with the default System.out.printf like so:
int size = 10;
int left = size/2;
int right = size - left;
String format = "%" + left + "c%-" + right + "c";
// would produce: "%5c%-5c"
System.out.printf(format,' ', '#');
// output: "     #    " (without the quotes)
The %-5c left-aligns the # character within the 5 columns assigned to it.
We read data from XLS cells formatted as text.
The cell hopefully contains a number, output will be a BigDecimal (because of arbitrary precision).
Problem is, the cell format is also arbitrary, which means it may contain numbers like:
with currency symbols ($1000)
leading and trailing whitespaces, or whitespaces in between digits (eg. 1 000 )
digit grouping symbols (eg. 1,000.0)
of course, negative numbers
'o's and 'O's as zeros (eg. 1,ooo.oo)
others I can't think of
It's mostly because of this last point that I'm looking for a standard library that can do all this, and which is configurable, well tested etc.
I looked at Apache first, found nothing but I might be blind... perhaps it's a trivial answer for someone else...
UPDATE: the domain of the question is financial applications. Actually I'm expecting a library where the domain could be an input parameter - financial, scientific, etc. Maybe even more specific: financial with currency symbols? With stock symbols? With distances and other measurement units? I can't believe I'm the first person to think of something like this...
I don't know any library, but you can try that:
Put your number on a string. (ex: $1,00o,oOO.00)
Remove all occurrences of $, whitespace, or any other strange symbols you can think of...
Replace occurrences of o and O with 0.
Try to parse the number =]
That should solve 99% of the entries...
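The steps above can be sketched as follows. The class name LenientNumbers is made up, and the sketch assumes US-style input where ',' is a grouping separator and '.' is the decimal point; only '$' is stripped, so other currency symbols would have to be added to the pattern.

```java
import java.math.BigDecimal;

// Sketch of the clean-up steps; assumes ',' groups digits and '.' is the decimal point.
final class LenientNumbers {

    static BigDecimal parse(String raw) {
        // Letters o and O typed in place of zero.
        String s = raw.replace('o', '0').replace('O', '0');
        // Whitespace, the dollar sign and grouping commas.
        s = s.replaceAll("[\\s$,]", "");
        // Throws NumberFormatException if anything unexpected is left over.
        return new BigDecimal(s);
    }

    public static void main(String[] args) {
        System.out.println(parse("$1,00o,oOO.00")); // 1000000.00
        System.out.println(parse(" 1 000 "));       // 1000
        System.out.println(parse("-1,ooo.oo"));     // -1000.00
    }
}
```

Letting BigDecimal reject whatever survives the clean-up is deliberate: in a financial setting a failed parse is easier to deal with than a silently wrong number.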
Buy a bunch of photos, or even better videos, with legal adult content. Create a web site with these resources, but limit access with a captcha that displays unsolved number formats. Create a set of number decoders from the known number formats, and write an algorithm that adds new ones based on the captchas users solve.
I think this is what I've been looking for:
http://site.icu-project.org/
Very powerful library, although at the moment it's not clear whether it can only format or all the formatted stuff can be parsed back as well.
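On the question of parsing formatted numbers back: the sketch below does not use ICU4J. It uses the JDK's own java.text.DecimalFormat, swapped in here because it needs no extra dependency, to show locale-aware parsing into a BigDecimal. The class name is hypothetical, and the cast assumes that NumberFormat.getNumberInstance returns a DecimalFormat, which is the usual case but is not guaranteed for every locale.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Sketch using the JDK's java.text classes, not ICU4J.
final class JdkNumberParsing {

    static BigDecimal parseUs(String text) {
        // Assumption: the US number format is backed by DecimalFormat.
        DecimalFormat format = (DecimalFormat) NumberFormat.getNumberInstance(Locale.US);
        // Ask for BigDecimal results instead of Long or Double.
        format.setParseBigDecimal(true);
        try {
            return (BigDecimal) format.parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a number: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseUs("1,000.5")); // 1000.5
    }
}
```

Note that parse(String) stops at the first character it cannot interpret and ignores the rest, so input such as "1,000.5 apples" is accepted; stricter checking needs the ParsePosition overload.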