As input I am getting an address as a String. It may say something like "123 Fake Street\nLos Angeles, CA 99988". How can I convert this into an object with fields like this:
Address1
Address2
City
State
Zip Code
Or something similar to this? If there is a java library that can do this, all the better.
Unfortunately, I don't have a choice about the String as input. It's part of a specification I'm trying to implement.
The input is not going to be very well structured so the code will need to be very fault tolerant. Also, the addresses could be from all over the world, but 99 out of 100 are probably in the US.
You can use JGeocoder
public static void main(String[] args) {
    Map<AddressComponent, String> parsedAddr = AddressParser.parseAddress("Google Inc, 1600 Amphitheatre Parkway, Mountain View, CA 94043");
    System.out.println(parsedAddr);
    Map<AddressComponent, String> normalizedAddr = AddressStandardizer.normalizeParsedAddress(parsedAddr);
    System.out.println(normalizedAddr);
}
Output will be:
{street=Amphitheatre, city=Mountain View, number=1600, zip=94043, state=CA, name=Google Inc, type=Parkway}
{street=AMPHITHEATRE, city=MOUNTAIN VIEW, number=1600, zip=94043, state=CA, name=GOOGLE INC, type=PKWY}
Another library is International Address Parser; you can check its trial version. It handles the country as well.
AddressParser addressParser = AddressParser.getInstance();
AddressStandardizer standardizer = AddressStandardizer.getInstance();//if enabled
AddressFormater formater = AddressFormater.getInstance();
String rawAddress = "101 Avenue des Champs-Elysées 75008 Paris";
//you can try to detect the country
CountryDetector detector = CountryDetector.getInstance();
String countryCode = detector.getCountryCode("7580 Commerce Center Dr ALABAMA");
System.out.println("detected country=" + countryCode);
Also, please check Implemented Countries in this library.
Cheers !!
I work at SmartyStreets where we develop address parsing and extraction algorithms.
It's hard.
If most of your addresses are in the US, you can use an address verification service to provide guaranteed accurate parse results (since the addresses are checked against a master list).
There are several providers out there, so take a look around and find one that suits you. Since you probably won't be able to install the database locally (not without a big fee, because address data is licensed by the USPS), look for one that offers a REST endpoint so you can just make an HTTP request. Since it sounds like you have a lot of addresses, make sure the API is high-performing and lets you do batch requests.
For example, with ours:
Input:
13001 Point Richmond Dr NW, Gig Harbor WA
Output:
Or the more specific breakdown of components, if needed:
If the input is even messier, there are a few address extraction services available that can handle a little bit of noise within an address and parse addresses out of text and turn them into their components. (SmartyStreets offers this also, as a beta API. I believe some other NLP services do similar things too.)
Granted, this only works for US addresses. I'm not as expert on UK or Canadian addresses, but I believe they may be slightly simpler in general.
(Beyond a small handful of well-developed countries, international data is really hit-and-miss. Reliable data sets are hard to obtain or don't exist. But if you're on a really tight budget you could write your own parser for all the address formats.)
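If you are sure of the format, you can use a regular expression to pull the parts out of the string. For the example you provided, something like this (using java.util.regex.Pattern and Matcher):
String address = "123 Fake Street\nLos Angeles, CA 99988";
Matcher m = Pattern.compile("(.*)\\n(.*), ([A-Z]{2}) ([0-9]{5})").matcher(address);
// if m.matches(): m.group(1) = street, m.group(2) = city, m.group(3) = state, m.group(4) = zip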
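The regex idea above can be made slightly more forgiving. The following is only a sketch for the single US layout shown in the question (one street line, then "City, ST ZIP"); the class name, the Address holder and the extra tolerance for Windows line breaks, a missing comma and ZIP+4 are my own assumptions, not part of the original answer.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UsAddressRegexSketch {

    /** Simple value holder; the field names are illustrative only. */
    static final class Address {
        final String street, city, state, zip;

        Address(String street, String city, String state, String zip) {
            this.street = street;
            this.city = city;
            this.state = state;
            this.zip = zip;
        }
    }

    // One street line, any line break (\R), then "City, ST 12345" or "City ST 12345-6789".
    private static final Pattern US_ADDRESS = Pattern.compile(
            "\\s*(.+?)\\s*\\R\\s*(.+?)\\s*,?\\s+([A-Za-z]{2})\\s+(\\d{5}(?:-\\d{4})?)\\s*");

    /** Returns the parsed parts, or empty if the input does not fit this one layout. */
    static Optional<Address> parse(String raw) {
        Matcher m = US_ADDRESS.matcher(raw);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Address(
                m.group(1), m.group(2), m.group(3).toUpperCase(), m.group(4)));
    }

    public static void main(String[] args) {
        parse("123 Fake Street\nLos Angeles, CA 99988").ifPresent(a ->
                System.out.println(a.street + " | " + a.city + " | " + a.state + " | " + a.zip));
        // prints: 123 Fake Street | Los Angeles | CA | 99988
    }
}
```

Anything that does not fit (a second address line, a non-US format) comes back as an empty Optional, so the caller can fall back to another strategy instead of getting wrong fields.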
I assume the sequence of information is always the same, i.e. the user will never enter the postal code before the state. If I understood your question correctly, you need logic to process an address that may be incomplete (missing a portion).
One way to do it is to look for portions of the string you know are correct. You can treat the known parts of the address as separators. You will need city and state names and address words (such as "Street", "Avenue", "Road", etc.) in an array.
1. Call indexOf with the cities, states and address words, and store the results.
2. Substring and cut out the first line of the address (from the start to the index of the address-signifying word plus its length).
3. Check the index of the city name (found in step 1). If it's -1, skip this step. If it's 0, take it out (0 also means address line 2 is not in the string). If it's more than 0, substring and cut out everything from the start of the string to the index of the city name as the second line of the address.
4. Check the index of the state name. Once again, if it's -1, skip this step. If it's 0, substring and cut it out as the state name.
5. Whatever remains is your postal code.
6. Check the strings you just extracted for leftover separators (commas, dots, new lines, etc.) and remove them.
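The steps above could be sketched roughly as follows. This is a loose rendering, not a tested parser: the three lookup arrays are tiny made-up samples, the class and method names are hypothetical, and plain indexOf will happily match inside other words (for instance "Road" in "Broadway"), so real data would need word-boundary checks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KnownTokenAddressSplitter {
    // Tiny illustrative lookup lists; a real system would load far larger ones.
    static final String[] STREET_WORDS = {"Street", "Avenue", "Road"};
    static final String[] CITIES = {"Los Angeles", "Mountain View"};
    static final String[] STATES = {"CA", "WA"};

    static Map<String, String> split(String raw) {
        Map<String, String> out = new LinkedHashMap<>();
        String rest = raw;

        // Address line 1 runs from the start through the first known street word.
        for (String word : STREET_WORDS) {
            int i = rest.indexOf(word);
            if (i >= 0) {
                out.put("address1", clean(rest.substring(0, i + word.length())));
                rest = rest.substring(i + word.length());
                break;
            }
        }
        // Anything left in front of a known city name is address line 2.
        for (String city : CITIES) {
            int i = rest.indexOf(city);
            if (i >= 0) {
                String before = clean(rest.substring(0, i));
                if (!before.isEmpty()) {
                    out.put("address2", before);
                }
                out.put("city", city);
                rest = rest.substring(i + city.length());
                break;
            }
        }
        // Cut a known state out of whatever is left.
        for (String state : STATES) {
            int i = rest.indexOf(state);
            if (i >= 0) {
                out.put("state", state);
                rest = rest.substring(0, i) + rest.substring(i + state.length());
                break;
            }
        }
        // The remainder, minus separators, is treated as the postal code.
        String zip = clean(rest);
        if (!zip.isEmpty()) {
            out.put("zip", zip);
        }
        return out;
    }

    /** Strips leading and trailing whitespace, commas, dots and semicolons. */
    static String clean(String s) {
        return s.replaceAll("^[\\s,.;]+|[\\s,.;]+$", "");
    }

    public static void main(String[] args) {
        System.out.println(split("123 Fake Street\nApt 4\nLos Angeles, CA 99988"));
        // prints: {address1=123 Fake Street, address2=Apt 4, city=Los Angeles, state=CA, zip=99988}
    }
}
```

Missing parts are simply absent from the returned map, which is how the "-1 means skip this step" rule shows up here.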
If the address is missing both state and city, you would actually need a list of zip codes too, so it's better to ensure the user enters at least one of them.
It's not impossible to implement what you need, but you probably don't want to spend all that time doing it. It's easier to just ensure the user enters everything correctly.
Maybe you can use a regular expression.
For example, I have this string:
String fullName = "Andre Santos Silva";
the name is Andre, and the surname I want is Santos,
so I'll return "Andre Santos".
But I have some issues, for example:
String fullName = "Andre di Santos Silva";
the name is "Andre" and the surname I want is "di Santos",
so my return must be "Andre di Santos".
Another example:
String fullName = "Andre e Santos Silva";
the name is "Andre" and the surname I want is "e Santos",
so my return must be "Andre e Santos".
How can I get this string with the name and the "first" surname?
Completely impossible.
The concept 'first name' and 'last name' are not global, and names generally are. Even if you decide to just toss a middle finger to a significant chunk of the global population and act like those people just don't matter, from the remaining places that do have a first/last name scheme that roughly fits your evident idea of how the entire world names things, it's not consistent enough to be able to just determine first and last name from an input string unless you throw some quite significant pattern matching Artificial Intelligence algorithms at it.
SOLUTION: Stop worrying about it. There is no such thing as first and last name, there's just name. If you have some bass ackwards old timey system that must know, tell the developers of it to get with the program. If you can't tell them, then ask whatever you're getting this input from to give it to you separated out in 'first' and 'last' name. If you can't do that either, you're completely hosed; tell whomever gave you the instruction to build this software that it is not possible, and that the next step isn't technical/development, it's political/organizational: Convince the suppliers to change the process so the input is provided in first/last name form, or convince the ones you are passing this data to, to stop wanting it in first/lastname form.
Some example names to show why the world doesn't work the way you think it does. Please be the computer algorithm and explain to me exactly what the first and last names are of each of these full names. These are official names that e.g. show in passports where relevant.
IN: Prince Harry, Duke of Sussex. (Correct OUT: Henry, Mountbatten-Windsor, which clearly cannot possibly be derived from that name!)
IN: Ivan Ivanovich (Correct OUT: Ivan, and there is no last name here. That's a patronymic, which is not the same thing. Russian-origin names usually do have an actual last name, in the sense that their parent or parents also had that name, a thing you can call a 'family name', but they don't commonly use that, and if they have to enter their full name in a form, you're likely getting first name + patronymic, and that's all.)
IN: Nanna Bryndís Hilmarsdóttir (Correct OUT: Nanna Bryndís, Hilmarsdóttir - probably. But if you expect her father, mother, or hypothetical children to also have that last name, no, they won't, and calling that their 'family name' is wrong. This too is a patronymic, but unlike in e.g. Russia, family names aren't a thing, as far as I know, in Iceland: their patronymic is for all intents and purposes their last name. It's just... not a family name.)
IN: Kim Jong-il (Correct OUT: Jong-il Kim or possibly Yuri Kim or maybe Yuri Irsenovich Kim. Note that the first substring in the input is the last name. This is common in many Asian cultures, including Korea (both of them), China, and many more.)
IN: José Antonio Gómez Iglesias (OUT: Well, if this is a Spanish person, which the name certainly suggests, then the right breakdown is José Antonio and Gómez Iglesias, but it is rare but possible that the correct breakdown is José Antonio Gómez and Iglesias. There is absolutely no way to be sure. The first is by far the most likely, but that's based on the fact that the name 'sounds Spanish'. This is where that whole 'you need a quite complicated AI ruleset to try to figure this out' comes in; it would need to behave like this: check the name against a giant neural net or other database to guesstimate that it is highly likely to be Spanish in origin, and that Gómez is a common surname.)
IN: Johannes Vennegoor of Hesselink. (Correct OUT: Johannes, Vennegoor of Hesselink. Sort under 'V' if sorting on last name).
IN: Jan Willem Vergeer (Correct OUT: Jan Willem, Vergeer. Contrast with the previous example. Completely impossible to separate out using basic string algorithms. The only way is to use an AI to determine that Jan Willem is a common Dutch first name, and that the official spelling is usually without a hyphen.)
IN: Andries de Witt (Correct OUT: Andries de Witt, but de is an interstitial. If sorting, you must sort on W, and not d. In systems that can't handle this, it is common to split this out as Andries and Witt, de instead, and e.g. Dutch phonebooks will take the latter approach.)
I have several lists of Strings already classified like
<string> <tag>
088 9102355 PHONE NUMBER
091 910255 PHONE NUMBER
...
Alfred St STREET
German St STREET
...
RE98754TO IDENTIFIER
AUX9654TO IDENTIFIER
...
service open all day long DESCRIPTION
service open from 8 to 22 DESCRIPTION
...
jhon.smith#email.com EMAIL
jhon.smith#anothermail.com EMAIL
...
www.serviceSite.com URL
...
Florence CITY
...
with a lot of strings per tag, and I have to make a Java program which, given
a new list of strings (all supposedly of the same tag), assigns a probability for each tag to the list.
The program has to be completely language independent, and all the knowledge has to come from the lists of tagged strings like the one described above.
I think that this problem can be solved with NER approaches (i.e machine learning algorithms like CRF) but those are usually for unstructured text like a chapter from a book, or a paragraph of a web page, and not for list of independent strings.
I thought of using a CRF (Conditional Random Field) because I found a similar approach used in the Karma data integration tool, as described in this article, paragraph 3.1,
where the "semantic types" are my tags.
To tackle the problem I downloaded the Stanford Named Entity Recognizer (NER) and played a bit
with its Java API through NERDemo.java, finding two problems:
1. The training file for the CRFClassifier has to have one word per row, so I haven't found a way to classify groups of words with a single tag.
2. I don't understand whether I have to make one classifier per tag or a single classifier for all, because a single string could be classified with n different tags and it is the user who chooses between them. So I'm more interested in the probabilities assigned by the classifiers than in an exact class match. Furthermore,
I don't have any "no tag" strings, so I don't know how the classifier behaves without them when assigning probabilities.
Is this the right approach to the problem? Is there a way to use the Stanford NER
or another Java API with CRF or another suitable machine learning algorithm to do it?
Update
I managed to train the CRF classifier, first with each word classified independently with the tag and each group of words separated by two commas (classified as "no Tag" (0)), then with each group of words as a single word with underscores replacing spaces, but I got very disappointing results in the little test I made. I haven't quite worked out which features I should include and which to exclude from the ones described in the NERFeatureFactory javadoc, considering they can't have anything to do with language.
Update 2
The test results are beginning to make sense. I've separated each string (tagging every token) from the others with two newlines, instead of the horrible "two commas labeled with 0", and I've used the Stanford PTBTokenizer instead of the one that I made. Moreover, I've tuned the features, turning on the usePrev and useNext features, using prefix/suffix n-grams up to 6 characters long, and other things.
The training file named training.tsv has this format:
rt05201201010to identifier
1442955884000 identifier
rt100005154602cv identifier
Alfred street
Street street
Robert street
Street street
and these are the flags in the properties file:
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useTags=false
useWordPairs=false
useDisjunctive=true
useSequences=false
usePrevSequences=true
useNextSequences=true
# the next flag can have these values: IO, IOB1, IOB2, IOE1, IOE2, SBIEO
entitySubclassification=IO
printClassifier=HighWeight
cacheNGrams=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
However, I found another problem: I managed to train only 39 labels with 100 strings each, though I have about 150 labels with more than 1000 strings each. Even so, it takes about 5 minutes to train, and if I raise these numbers a bit it throws a Java heap OutOfMemoryError.
Is there a way to scale up to those numbers with a single classifier? Is it better to train 150 (or fewer, maybe each with two or three labels) small classifiers and combine them later? Do I need to train with 1000+ strings per label, or can I stop at 100 (maybe choosing them to be quite different from one another)?
The first thing you should be aware of is that (linear-chain) CRF taggers are not designed for this purpose. They are a very nice solution for context-based prediction, i.e. when you have words before and after named entities and you look for clues in a limited window (e.g. 2 words before/after the current word). This is why you had to insert double line breaks: to delimit sentences. They also provide coherence between the tags assigned to words, which is indeed a good thing in your case.
A CRF tagger should work, but with an extra cost in the learning step which could be avoided by using simpler (maximum entropy, SVM) but still accurate machine learning methods. In Java, for your task, wouldn't Weka be a better solution? I would also consider BIO tagging as not relevant in your case.
Whatever software you use, it is not surprising that character-level n-grams give good improvements, but I believe you could add dedicated features. For instance, since morphological clues are important (presence of a "#", upper-case characters or digits), you may use codes (see ref [1]), which are a very convenient way to describe strings. You will also most probably obtain better results by using lists of names (lexicons) that can be triggered as additional features.
[1] Ranking algorithms for named-entity extraction: Boosting and the voted perceptron (Michael Collins, 2002)
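One common way to build such codes is a "word shape": map every upper-case letter to A, every lower-case letter to a, every digit to 0, keep everything else, and optionally collapse repeated symbols. The sketch below is my own minimal version (class and method names are hypothetical) and is not claimed to be exactly the encoding used in [1]; the resulting strings could be fed to any classifier as extra features.

```java
class WordShapeSketch {

    /** Maps each character to a class symbol: A = upper, a = lower, 0 = digit, others kept. */
    static String shape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isUpperCase(c)) {
                sb.append('A');
            } else if (Character.isLowerCase(c)) {
                sb.append('a');
            } else if (Character.isDigit(c)) {
                sb.append('0');
            } else {
                sb.append(c); // punctuation and spaces are informative as they are
            }
        }
        return sb.toString();
    }

    /** Same as shape(), but runs of the same symbol are collapsed to one. */
    static String compressedShape(String s) {
        String full = shape(s);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < full.length(); i++) {
            if (i == 0 || full.charAt(i) != full.charAt(i - 1)) {
                sb.append(full.charAt(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(compressedShape("RE98754TO"));            // A0A
        System.out.println(compressedShape("088 9102355"));          // 0 0
        System.out.println(compressedShape("jhon.smith#email.com")); // a.a#a.a
        System.out.println(compressedShape("Alfred St"));            // Aa Aa
    }
}
```

The compressed form is language independent and already separates the sample identifiers, phone numbers and e-mail-like strings from the question quite cleanly.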
I need to compare two phone numbers to determine if they're from the same sender/receiver. The user may send a message to a contact, and that contact may reply.
The reply usually comes in
+[country-code][area-code-if-any][and-then-the-actual-number] format. For example,
+94 71 2593276 for a Sri Lankan phone number.
And when the user sends a message, he will usually enter in the format (for the above example) 0712593276 (assume he's also in Sri Lanka).
So what I need is, I need to check if these two numbers are the same. This app will be used internationally. So I can't just replace the first 2 digits with a 0 (then it will be a problem for countries like the US). Is there any way to do this in Java or Android-specifically?
Thanks.
Android has a nice PhoneNumberUtils class, and I guess you're looking for:
public static boolean compare (String a, String b)
Look in:
http://developer.android.com/reference/android/telephony/PhoneNumberUtils.html
Using it should look like this:
String phone1 = "0712593276";
String phone2 = "+94712593276";
if (PhoneNumberUtils.compare(phone1, phone2)) {
    // they are the same, do whatever you want!
}
android.telephony.PhoneNumberUtils class provides almost all necessary functions to deal with phone numbers and standards.
For your case, the solution is PhoneNumberUtils.compare(Context context, String number1, String number2) or else PhoneNumberUtils.compare(String number1, String number2). The former checks a resource to determine whether to use a strict or loose comparison algorithm, and is thus the better choice in most cases.
PhoneNumberUtils.compare("0712593276", "+94712593276") // always true
PhoneNumberUtils.compare("0712593276", "+44712593276") // always true
// true or false depends on the context
PhoneNumberUtils.compare(context, "0712593276", "+94712593276")
Take a look at the official documentation. And the source code.
How about checking if the number is a substring of the receiver's number?
For instance, let's say my Brazilian number is 888-777-666 and yours is 111-222-333.
To call you, from here, I need to dial additional numbers to make international calls. Let's say I need to add 9999 + your_number, resulting in 9999111222333.
If rawNumber.contains(yourNumber) returns true (with the dashes stripped from both), I can say that I'm calling you.
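A plain-Java sketch of that substring idea, plus a variant that compares only the trailing digits. All names here are hypothetical, and the number of trailing digits to compare is an assumption you would have to pick yourself; either check can report a false match between numbers from different countries, so treat it as a heuristic only.

```java
class PhoneSuffixMatchSketch {

    /** Drops everything that is not a digit: spaces, dashes, parentheses, the leading +. */
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    /** True if one number's digits appear inside the other's. */
    static boolean containsMatch(String x, String y) {
        String a = digitsOnly(x);
        String b = digitsOnly(y);
        if (a.isEmpty() || b.isEmpty()) {
            return false;
        }
        return a.contains(b) || b.contains(a);
    }

    /** True if both numbers have at least n digits and their last n digits are equal. */
    static boolean sameTrailingDigits(String x, String y, int n) {
        String a = digitsOnly(x);
        String b = digitsOnly(y);
        if (a.length() < n || b.length() < n) {
            return false;
        }
        return a.substring(a.length() - n).equals(b.substring(b.length() - n));
    }

    public static void main(String[] args) {
        System.out.println(containsMatch("9999111222333", "111-222-333"));           // true
        // A national trunk prefix (the leading 0) defeats the plain substring check:
        System.out.println(containsMatch("+94 71 2593276", "0712593276"));           // false
        System.out.println(sameTrailingDigits("+94 71 2593276", "0712593276", 9));   // true
    }
}
```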
Just apply your own logic to remove the parentheses and hyphens,
and then use PhoneNumberUtils.
I think I'm facing a paradox here.
What I'm trying to do is this: when I receive or make a call, I have the number, and I need to know whether it's an international number, a local number, etc.
The problem is:
For me to know if a number is international, I need to parse it and check its length, but the length differs from country to country. So should I write a method that parses and recognizes numbers for each country? (Impractical, in my opinion.)
For me to know if it's a local number, I need the area code, so I have to do the same thing: parse the number, check the length, and take the first digits based on the area code length.
It's kind of hard to find a solution for this. The library libphonenumber offers a lot of useful classes, but the one that I thought could help me took me to another paradox.
The method phoneUtil.parse(number, countryAcronym) returns the number with its country code. But if I pass the number with the acronym "US", it returns the number with country code '1'; if I change the acronym to "BR", it changes the number and returns '55', which is the country code for Brazil. So I need the country acronym based on the number I get.
EX:
numberReturned = phoneUtil.parse(phoneNumber, "US");
phoneUtil.format(numberReturned, PhoneNumberFormat.INTERNATIONAL);
The above code returns the number with the US country code, but if I change "US" to any other country acronym it returns the same number with that country's code.
I know that this lib is not supposed to guess which country the number is from (THAT WOULD BE AWESOME!!), but that's what I need.
This is really driving me crazy. I need good advice from the wise mages of SO.
If you could help me with a good decision, I'd be so thankful.
Thanks.
PS: If you already use libphonenumber and have more experience with it, please guide me on which class to use, if there is one capable of solving this problem. =)
1) The second parameter to phoneUtil.parse must match the country you're currently in - it's used if the phone number received does not include an international prefix. That's why you get different results when you change the parameter: the phone number you pass it does not contain such a prefix, so it just uses what you've told it.
Any parsing solution set to determine if the phone number is international or not will need to be aware of this: depending on the source, even a national number may be represented with the international dialing prefix (usually abstracted as +, since it differs between countries, but this is not guaranteed).
2) For area code parsing, there is no universal standard; some countries don't use them, and even within a country, area codes may have differing lengths (e.g. Germany). I'm not aware of an international library for this - and a quick search doesn't find anything (though that doesn't mean one does not exist). You might need to roll your own here; if you only need to support a single country, this shouldn't be too hard.
I would like to know what is the right way to handle internationalization for statements with runtime data added to it. For example
1) Your input "xyz" is excellent!
2) You were "4 years" old when you switched from "Barney and Friends" to "Spongebob" shows.
The double-quoted values are user data obtained or calculated at run time. My platforms are primarily Java/Android. A good solution for Western languages is preferred over a weaker universal one.
Java
To retrieve localized text (messages), use the java.util.ResourceBundle API. To format messages, use java.text.MessageFormat API.
Basically, first create a properties file like so:
key1 = Your input {0} is excellent!
key2 = You were {0} old when you switched from {1} to {2} shows.
The {n} things are placeholders for the arguments that you pass in via MessageFormat#format().
Then load it like so:
ResourceBundle bundle = ResourceBundle.getBundle("filename", Locale.ENGLISH);
Then to get the messages by key, do:
String key1 = bundle.getString("key1");
String key2 = bundle.getString("key2");
Then to format it, do:
String formattedKey1 = MessageFormat.format(key1, "xyz");
String formattedKey2 = MessageFormat.format(key2, "4 years", "Barney and Friends", "Spongebob");
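Put together, the steps above might look like the sketch below. To keep it self-contained, the properties text is read from an in-memory string through PropertyResourceBundle instead of from a file on the classpath; that substitution, the class name and the render helper are my own assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class BundleFormatSketch {

    /** Stand-in for a real .properties file loaded via ResourceBundle.getBundle(...). */
    static ResourceBundle demoBundle() {
        String props =
                "key1 = Your input {0} is excellent!\n"
              + "key2 = You were {0} old when you switched from {1} to {2} shows.\n";
        try {
            return new PropertyResourceBundle(new StringReader(props));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks up the pattern for a key and fills in the {n} placeholders. */
    static String render(ResourceBundle bundle, Locale locale, String key, Object... args) {
        return new MessageFormat(bundle.getString(key), locale).format(args);
    }

    public static void main(String[] args) {
        ResourceBundle bundle = demoBundle();
        System.out.println(render(bundle, Locale.ENGLISH, "key1", "xyz"));
        // prints: Your input xyz is excellent!
        System.out.println(render(bundle, Locale.ENGLISH, "key2",
                "4 years", "Barney and Friends", "Spongebob"));
        // prints: You were 4 years old when you switched from Barney and Friends to Spongebob shows.
    }
}
```

Passing the locale to MessageFormat matters once numbers or dates are among the arguments, since they are formatted according to that locale.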
See also:
Trail: internationalization
Android
With regards to Android, the process is easier. You just have to put all those String messages in the res/values/strings.xml file. Then, you can create a version of that file in different languages, and place the file in a values folder that contains the language code. For instance, if you want to add Spanish support, you just have to create a folder called res/values-es/ and put the Spanish version of your strings.xml there. Android will automatically decide which file to use depending on the configuration of the handset.
See also:
Developer guide - Localization
One non-technical consideration. Embedding free data inside English phrases isn't going to look very smooth in many cultures (including Western ones), where you need grammatical agreement on e.g. number, gender or case. A more telegraphic style usually helps (e.g. Excellent input: "xyz") -- then at least everybody gets the same level of clunkiness!
I think one will probably have to define one's format string to include a "1-of-N" feature, preferably defined so as to make common cases easy (e.g. plurals). For example, define {0#string1/string2/string3} to output string1 if parameter 0 is zero or less, string2 if it's precisely 1, and string3 if it's greater than 1. Then one could say "You have {0} {0#knives/knife/knives} in the drawer."
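For what it's worth, java.text.MessageFormat already offers something close to this through its choice format type, so a custom syntax may not be needed for simple cases. A minimal sketch, where the class and method names are hypothetical and the pattern is the knives example rewritten in choice syntax:

```java
import java.text.MessageFormat;
import java.util.Locale;

class PluralChoiceSketch {

    // 0#... applies to values up to (but not including) 1, 1#... to exactly 1, 1<... to values above 1.
    static final String PATTERN =
            "You have {0} {0,choice,0#knives|1#knife|1<knives} in the drawer.";

    static String knives(int count) {
        return new MessageFormat(PATTERN, Locale.ENGLISH).format(new Object[] {count});
    }

    public static void main(String[] args) {
        System.out.println(knives(0)); // You have 0 knives in the drawer.
        System.out.println(knives(1)); // You have 1 knife in the drawer.
        System.out.println(knives(3)); // You have 3 knives in the drawer.
    }
}
```

This covers the singular/plural split of English and similar languages; languages with more plural categories would need more than this.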