Parse BigDecimal from String containing a number in arbitrary format

We read data from XLS cells formatted as text.
The cell hopefully contains a number, output will be a BigDecimal (because of arbitrary precision).
The problem is that the cell format is also arbitrary, which means it may contain numbers with:
currency symbols ($1000)
leading and trailing whitespace, or whitespace between digits (e.g. 1 000 )
digit grouping symbols (e.g. 1,000.0)
of course, negative signs
'o's and 'O's as zeros (e.g. 1,ooo.oo)
other quirks I can't think of
It's mostly because of this last point that I'm looking for a standard library that can do all this, and which is configurable, well tested etc.
I looked at Apache first, found nothing but I might be blind... perhaps it's a trivial answer for someone else...
UPDATE: the domain of the question is financial applications. Actually I'm expecting a library where the domain could be an input parameter - financial, scientific, etc. Maybe even more specific: financial with currency symbols? With stock symbols? With distances and other measurement units? I can't believe I'm the first person to think of something like this...

I don't know of any library, but you can try this:
Put your number in a string (e.g. $1,00o,oOO.00).
Remove all occurrences of $, whitespace, and any other strange symbols you can think of.
Replace occurrences of o and O with 0.
Try to parse the number =]
That should solve 99% of the entries...
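The steps above can be sketched roughly as follows. This is a minimal illustration, not a tested library: the class name LenientNumberParser is made up, it assumes '.' is the decimal separator and ',' is only a grouping symbol, and the handling of accounting-style negatives in parentheses is an extra assumption not mentioned above. Note that replacing every 'o' would also mangle currency names that contain that letter.

```java
import java.math.BigDecimal;

final class LenientNumberParser {

    /**
     * Best-effort cleanup of a text cell. Assumes '.' is the decimal
     * separator and ',' is only used for grouping, which does not hold
     * for every locale.
     */
    static BigDecimal parse(String raw) {
        String s = raw.trim();
        // Assumption: "(1,000.00)" is an accounting-style negative number.
        boolean negative = s.startsWith("(") && s.endsWith(")");
        // Replace the letters 'o' and 'O' with the digit zero.
        s = s.replace('o', '0').replace('O', '0');
        // Drop everything that is not a digit, a decimal point, or a minus sign:
        // currency symbols, whitespace, grouping commas, parentheses, and so on.
        s = s.replaceAll("[^0-9.\\-]", "");
        // Throws NumberFormatException if the remainder is still malformed.
        BigDecimal value = new BigDecimal(s);
        return negative ? value.negate() : value;
    }

    public static void main(String[] args) {
        System.out.println(parse("$1,00o,oOO.00")); // prints 1000000.00
        System.out.println(parse(" 1 000 "));       // prints 1000
        System.out.println(parse("-$12.50"));       // prints -12.50
    }
}
```

Anything the cleanup cannot make sense of still surfaces as a NumberFormatException, so the remaining 1% can be logged and reviewed by hand.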

Buy a bunch of photos, or even better videos, with legal adult content. Create a website with these resources, but limit access with a captcha that displays unsolved number formats. Create a set of number decoders from known number formats, and write an algorithm that adds new ones based on the captchas users solve.

I think this is what I've been looking for:
http://site.icu-project.org/
Very powerful library, although at the moment it's not clear whether it can only format or all the formatted stuff can be parsed back as well.
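For comparison, the JDK's own java.text.DecimalFormat can parse as well as format, and it can return a BigDecimal. The sketch below uses only JDK classes, not ICU; ICU4J ships similarly named classes, but whether its API matches this exactly is not verified here. The class name LocaleAwareParser is made up for the example.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

final class LocaleAwareParser {

    /** Parses text using the grouping and decimal symbols of the given locale. */
    static BigDecimal parse(String text, Locale locale) {
        // In practice getNumberInstance returns a DecimalFormat; the cast is an assumption.
        DecimalFormat format = (DecimalFormat) NumberFormat.getNumberInstance(locale);
        format.setParseBigDecimal(true); // return BigDecimal instead of Long/Double
        String trimmed = text.trim();
        ParsePosition pos = new ParsePosition(0);
        Number parsed = format.parse(trimmed, pos);
        // Reject partial matches such as "12abc", which parse() would otherwise accept,
        // and special values such as NaN, which are not returned as BigDecimal.
        if (!(parsed instanceof BigDecimal) || pos.getIndex() != trimmed.length()) {
            throw new NumberFormatException("Not a number for " + locale + ": " + text);
        }
        return (BigDecimal) parsed;
    }

    public static void main(String[] args) {
        System.out.println(parse("1,234.56", Locale.US));      // prints 1234.56
        System.out.println(parse("1.234,56", Locale.GERMANY)); // prints 1234.56
    }
}
```

This handles grouping and decimal symbols per locale, but not the 'o'-for-zero or stray-symbol cases, which would still need a cleanup pass first.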

Related

Randomly Generate Meaningful(Valid) English Words In Android Application

I am making a Dictionary Application. I am using Pearson Dictionary API for the same. I need to generate a word so that I could query that word for its definition.
PROBLEM
I know how to generate a random word but I don't know how to generate a meaningful English word.
I tried to solve this problem by requesting a JSON response and checking the results[] array in the response (results[] holds the definitions for the word). So, if results.length > 0, then the word is a valid English word.
But the solution above has its own serious problem: suppose I want to generate a 5-letter word. There are as many as 26^5 = 11,881,376 different combinations, whereas there are far fewer meaningful 5-letter English words. As the number of letters in the word increases, the number of combinations increases too. Thus, generating a meaningful word can take a very long time.
How can I check if the generated word is a meaningful English word or not? Isn't there any feasible programmatic way of doing this?
OR Is there any other way I could solve this Problem?

As far as I can see, you either generate random strings of letters and check to see if they're words (which, as you realise, is a very slow, hit-or-miss approach) or you store a list of "known good" words and select randomly from that list.
How big that list needs to be depends on what you're trying to achieve.
According to this page the OED has around 171,476 main entries, not including variants like plurals (cat, cats), standard variants (sit, sitting), nor words that have multiple classes (e.g. dog can be a noun [the animal] or a verb [to follow persistently] etc.). According to this page an average adult knows between 20,000 and 35,000 words, so a prudent selection of 50,000 should cover most general purpose uses.
The answers to this question (now closed) provide a number of sources for word-lists. Examining one of them (originally provided by infochimps.org but available as a simple text-list on github) shows that the average length of 350,000+ words is just under 10 characters. For Linux (and possibly other flavours) /usr/share/dict/words may be a useful place to start.
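A minimal sketch of the "known good list" approach, assuming the words have already been loaded into a List (for example, one word per line from one of the files mentioned above). The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Random;

final class RandomWordPicker {

    /** Picks a random word of exactly the requested length, if the list has one. */
    static Optional<String> pick(List<String> words, int length, Random random) {
        List<String> candidates = new ArrayList<>();
        for (String word : words) {
            if (word.length() == length) {
                candidates.add(word);
            }
        }
        if (candidates.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(candidates.get(random.nextInt(candidates.size())));
    }

    public static void main(String[] args) {
        // Stand-in for a real word list read from a file.
        List<String> words = Arrays.asList("apple", "cat", "house", "tiger", "sit");
        System.out.println(pick(words, 5, new Random()).orElse("no match"));
    }
}
```

Every word returned is valid by construction, so no round trip to the dictionary API is needed just to check it. If the same length is requested often, the filtered list could be built once and reused.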

There is this beautiful text file containing all English words:
https://github.com/AlexHakman/Java-challenge/blob/master/words.txt
You can then pick 5-letter words based on what's inside this text document :)
Either check the length of each line and keep the matching ones, or generate a candidate and compare it with the text file :)

Instead of generating words randomly, which forces you to spend time verifying them, just store a dictionary of the words you need and use it as a lookup table.
A relatively complete English dictionary is about 2 MB compressed, like the one here: http://wordlist.aspell.net/12dicts/
Even for an Android app, unless you're targeting really underpowered devices, that shouldn't be too big.
You can use SQLite to store the data. It may take up a bit more storage, but you get SQL as your query language rather than making up your own.
Since you also need a bit of randomness, each row can carry some sort of randomized key that you can query on.
If you really want to limit it to 5 characters, just use a subset of the dictionary. But the full dictionary allows an arbitrary length, or even length ranges (e.g. 2 to 10 characters).

Java: Simple format standard for various precision data

I'm trying to format output for user/report appeal, and there are two criteria I'm finding to be in a bit of conflict.
First, the decimal values should line up (format on "%12.10f", predicted integer value range 0-99)
Second, the decimal shouldn't trail an excessive series of zeroes.
For example, I have output that looks like
0.5252772000
0.2053628186
10.5234500000
But using a general formatting, I also end up with:
0.53260000000
0.52630000000
12.43540000000
In certain cases, and it looks kind of garbage.
Is there a simple way to solve this problem? The only solution I can come up with at the moment involves pre-interrogating the data before printing (instead of formatting it during print), which, while technically not expensive, just bugs me as redundant data handling (i.e. I have to go through all the data once to find the extent of the trailing zeroes, then set the format, then go through the data again to print it).

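You can use a DecimalFormat. Each # in the pattern is an optional digit, so trailing zeros are dropped. Note that "0.#" keeps at most one decimal place, so use one # per decimal place you need:
import java.text.DecimalFormat;

DecimalFormat format = new DecimalFormat("0.##########");
for (float f : yourFloats) {
    System.out.println(format.format(f));
}
This also works on doubles.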
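DecimalFormat on its own drops the trailing zeros but does not line up the decimal points. One possible way to get both in a single pass, relying on the known integer range (0-99) from the question, is to pad the integer part to a fixed width. The class name AlignedDecimals and the padding approach are an illustration, not part of the answer above.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

final class AlignedDecimals {

    // Up to 10 optional fraction digits; Locale.ROOT pins '.' as the decimal separator.
    // DecimalFormat is not thread-safe, so share this instance within one thread only.
    private static final DecimalFormat FORMAT =
            new DecimalFormat("0.##########", DecimalFormatSymbols.getInstance(Locale.ROOT));

    /** Left-pads the value so that its integer part occupies integerWidth columns. */
    static String format(double value, int integerWidth) {
        String text = FORMAT.format(value);
        int dot = text.indexOf('.');
        int integerDigits = dot >= 0 ? dot : text.length();
        StringBuilder sb = new StringBuilder();
        for (int i = integerDigits; i < integerWidth; i++) {
            sb.append(' ');
        }
        return sb.append(text).toString();
    }

    public static void main(String[] args) {
        double[] values = {0.5326, 0.2053628186, 12.4354};
        for (double v : values) {
            System.out.println(format(v, 2));
        }
    }
}
```

Because the padding depends only on the expected integer width, there is no need to scan the data first to find the longest fraction.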

Dealing with phone numbers formats

I think I'm facing a paradox here.
What I'm trying to do is this: when I receive or make a call, I have the number, and I need to know whether it's an international number, a local number, etc.
The problem is:
To know if a number is international, I need to parse it and check its length, but the length differs from country to country. Should I write a method that parses and recognizes numbers for each country? (Impractical, in my opinion.)
To know if it's a local number, I need the area code, so I have to do the same thing: parse the number, check the length, and take the first digits based on the area code length.
It's kind of hard to find a solution for this. The library libphonenumber offers a lot of useful classes, but the one I thought could help me led me to another paradox.
The method phoneUtil.parse(number, countryAcronym) returns the number with its country code. If I pass the number with the acronym "US", it returns the number with country code '1'; if I change the acronym to "BR", it returns the same number with '55', the country code for Brazil. So I need the country acronym based on the number I get.
EX:
numberReturned = phoneUtil.parse(phoneNumber, "US");
phoneUtil.format(numberReturned, PhoneNumberFormat.INTERNATIONAL);
The above code returns the number with the US country code, but if I change "US" to any other country acronym, it returns the same number with the country code of that country.
I know that this lib is not supposed to guess which country the number is from (THAT WOULD BE AWESOME!!), but that's what I need.
This is really driving me crazy. I need good advice from the wise mages of SO.
If you could please help me reach a good decision, I'd be so thankful.
Thanks.
PS: If you already use libphonenumber and have more experience with it, please guide me on which class to use, if there is one capable of solving this problem. =)

1) The second parameter to phoneUtil.parse must match the country you're currently in - it's used if the phone number received does not include an international prefix. That's why you get different results when you change the parameter: the phone number you pass it does not contain such a prefix, so it just uses what you've told it.
Any parsing solution set to determine if the phone number is international or not will need to be aware of this: depending on the source, even a national number may be represented with the international dialing prefix (usually abstracted as +, since it differs between countries, but this is not guaranteed).
2) For area code parsing, there is no universal standard; some countries don't use them, and even within a country, area codes may have differing lengths (e.g. Germany). I'm not aware of an international library for this - and a quick search doesn't find anything (though that doesn't mean one does not exist). You might need to roll your own here; if you only need to support a single country, this shouldn't be too hard.
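To illustrate point 1 without depending on any library, here is a rough sketch of checking whether a raw number carries an international prefix. The helper name hasInternationalPrefix and its localIddPrefix parameter are hypothetical. The dialing prefix varies by country (for example "011" from the US, "00" from much of Europe), so it has to come from the caller's own region.

```java
final class DialPrefix {

    /**
     * Returns true if the raw number starts with '+' or with the caller's
     * international dialing prefix. Hypothetical helper, not a libphonenumber API.
     */
    static boolean hasInternationalPrefix(String rawNumber, String localIddPrefix) {
        // Strip common visual separators: whitespace, parentheses, dots, hyphens.
        String compact = rawNumber.replaceAll("[\\s().-]", "");
        return compact.startsWith("+") || compact.startsWith(localIddPrefix);
    }

    public static void main(String[] args) {
        System.out.println(hasInternationalPrefix("+55 11 91234-5678", "00")); // prints true
        System.out.println(hasInternationalPrefix("(11) 91234-5678", "00"));   // prints false
    }
}
```

A number that passes this check still has to be compared against the caller's own country code, since a national number may also be written with the international prefix. A number that fails it can only be interpreted using the default region, which is why the second argument to phoneUtil.parse matters.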

Where can I find "reference barcodes" to verify barcode library output?

This question is not about 'best' barcode library recommendation, we use various products on different platforms, and need a simple way to verify if a given barcode is correct (according to its specification).
We have found cases where a barcode is rendered differently by different barcode libraries and free online barcode generators on the Internet. For example, a new release of a Delphi reporting library outputs non-numeric characters in Code128 as '0' or simply skips them in the text area. Before we do the migration, we want to check if these changes are caused by a broken implementation in the new library so we can report this as a bug to the author.
We mainly need Code128 and UCC/EAN-128 with A/B/C subcodes.
Online resources I checked so far are:
IDAutomation.com (displays ABC123 as 0123 with Code128-C)
Morovia.com
BarcodesInc (does not accept comma)
TEC-IT
They show different results too, for example in support for characters like comma or plus signs, at least in the human readable text.

For Code128 there isn't a single correct answer. If you use Code128-A you can get a different result than Code128-C. By result I mean how it looks. Take "803150" as an example. In Code128-A you'll need 6 characters (+ start, checksum, stop) to represent this number. Code128-C only consists of numbers, so you can compress two digits into one character. Hence you'll need only 3 characters (+ start, checksum, stop) to represent the same number. The barcodes will look different (A being longer in this case), but if you scan them both will give the correct number.
Further, Code128 doesn't need to be just A, B or C. You can actually combine the different subsets. This is common for cases like "US123457890", where Code128-A or B is used on "US" and Code128-C is used on the remaining digits. This is sometimes referred to as Code-128 Auto, or just Code-128. The result is a "compressed" barcode in terms of width. You could represent the same data with A/B only, but again that would give you a longer barcode.
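The "803150" example can be checked by hand. The sketch below computes the data symbol values and the modulo-103 check symbol for subsets A and C, using the standard Code 128 start values (103 for A, 105 for C). It does not render bars or handle subset switching, and the class name Code128Demo is made up for illustration.

```java
final class Code128Demo {

    static final int START_A = 103;
    static final int START_C = 105;

    /** Symbol values in subset A: ASCII 32..95 map to 0..63, ASCII 0..31 map to 64..95. */
    static int[] encodeA(String data) {
        int[] values = new int[data.length()];
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c > 95) {
                throw new IllegalArgumentException("Not in subset A: " + c);
            }
            values[i] = c >= 32 ? c - 32 : c + 64;
        }
        return values;
    }

    /** Symbol values in subset C: each pair of digits becomes one symbol (00..99). */
    static int[] encodeC(String digits) {
        if (digits.length() % 2 != 0 || !digits.matches("[0-9]*")) {
            throw new IllegalArgumentException("Subset C needs an even number of digits");
        }
        int[] values = new int[digits.length() / 2];
        for (int i = 0; i < values.length; i++) {
            values[i] = Integer.parseInt(digits.substring(2 * i, 2 * i + 2));
        }
        return values;
    }

    /** Check symbol: start value plus each data value weighted by its 1-based position, mod 103. */
    static int checksum(int startValue, int[] values) {
        int sum = startValue;
        for (int i = 0; i < values.length; i++) {
            sum += (i + 1) * values[i];
        }
        return sum % 103;
    }

    public static void main(String[] args) {
        int[] a = encodeA("803150");
        int[] c = encodeC("803150");
        System.out.println(a.length + " data symbols in A, check " + checksum(START_A, a));
        System.out.println(c.length + " data symbols in C, check " + checksum(START_C, c));
    }
}
```

The two encodings produce different symbol sequences and different check symbols for the same data, which is why two correct generators can draw visibly different barcodes.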
Take two online generators:
IDAutomation
BarcodesInc
I recommend the first one, where you can select between Auto/A/B/C. Here is an example image illustrating the differences:
On IDAutomation, Auto is the default, while A is the default on BarcodesInc. Both are correct; you just need to be careful which subset you have selected when comparing output. I also recommend a barcode reader for use in development to test the output. Also, see this page for a comparison of the different subsets with ASCII values. I also find grandzebu.net useful, which has a free Code128 font you can use as well.
It sounds like your Delphi library always uses Code128-C, since it's only possible to represent numbers in this subset.

Why not just scan them and see what comes back?

Text similarity algorithm

I have two subtitles files.
I need a function that tells whether they represent the same text, or similar text.
Sometimes there are comments like "The wind is blowing... the music is playing" in one file only.
But 80% of the contents will be the same. In that case, the function must return TRUE (the files represent the same text).
And sometimes there are misspellings like 1 instead of l (one - L ) as here:
She 1eft the baggage.
Of course, it means function must return TRUE.
My comments:
The function should return percentage of the similarity of texts - AGREE
"all the people were happy" and "all the people were not happy" - here that'd be considered as a misspelling, so that'd be considered the same text. To be exact, the percentage the function returns will be lower, but high enough to say the phrases are similar
Do consider whether you want to apply Levenshtein on a whole file or just a search string - not sure about Levenshtein, but the algorithm must be applied to the file as a whole. It'll be a very long string, though.

Levenshtein algorithm: http://en.wikipedia.org/wiki/Levenshtein_distance
Anything other than a result of zero means the texts are not "identical". "Similar" is a measure of how far apart or how near they are. The result is an integer.
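A compact sketch of the Levenshtein distance, using the usual two-row dynamic programming table, plus one possible way to turn it into a similarity ratio. The normalisation by the longer length is a common convention rather than part of the algorithm itself.

```java
final class Levenshtein {

    /** Minimum number of single-character insertions, deletions and substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** 1.0 for identical strings, approaching 0.0 as they diverge (assumed normalisation). */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("She left the baggage.", "She 1eft the baggage.")); // prints 1
        System.out.println(similarity("She left the baggage.", "She 1eft the baggage."));
    }
}
```

The running time grows with the product of the two lengths, so for whole subtitle files it may be more practical to compare line by line than to treat each file as one long string.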

For the problem you've described (i.e. comparing large strings), you can use Cosine Similarity, which returns a number between 0 (completely different) and 1 (identical), based on the term frequency vectors.
You might want to look at several implementations that are described here: Cosine Similarity
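As a rough illustration, here is a minimal term-frequency cosine similarity using only the JDK. The tokenisation (lower-casing and splitting on anything that is not a letter or digit) is an assumption; real implementations often add stemming or stop-word removal.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CosineSimilarity {

    /** Counts how often each lower-cased word occurs in the text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine of the angle between the two term-frequency vectors. */
    static double similarity(String a, String b) {
        Map<String, Integer> fa = termFrequencies(a);
        Map<String, Integer> fb = termFrequencies(b);
        double dot = 0;
        for (Map.Entry<String, Integer> e : fa.entrySet()) {
            dot += (double) e.getValue() * fb.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(fa);
        double normB = norm(fb);
        return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> frequencies) {
        double sum = 0;
        for (int v : frequencies.values()) {
            sum += (double) v * v;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        System.out.println(similarity("all the people were happy",
                                      "all the people were not happy"));
    }
}
```

Because only word counts matter, word order is ignored, and extra comment lines in one file lower the score only in proportion to their share of the text. A character-level misspelling such as "1eft" counts as a different word altogether.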

You're expecting too much here, it looks like you would have to write a function for your specific needs. I would recommend starting with an existing file comparison application (maybe diff already has everything you need) and improve it to provide good results for your input.

Have a look at approximate grep. It might give you pointers, though it's almost certain to perform abysmally on large chunks of text like you're talking about.
EDIT: The original version of agrep isn't open source, so you might get links to OSS versions from http://en.wikipedia.org/wiki/Agrep

There are many alternatives to the Levenshtein distance, for example the Jaro-Winkler distance.
The choice of algorithm depends on the language, the type of words, whether the words were entered by a human, and more.
Here you find a helpful implementation of several algorithms within one library

If you are still looking for a solution, go with S-Bert (Sentence-BERT), which is a lightweight approach that internally uses cosine similarity.
