How can I get the name and surname from string?

For example, I have this string:
String fullName = "Andre Santos Silva";
The name is Andre, and the surname I want is Santos,
so I'll return "Andre Santos".
But I have some issues, for example:
String fullName = "Andre di Santos Silva";
The name is "Andre" and I want my surname to be "di Santos",
so my return must be "Andre di Santos".
Another example:
String fullName = "Andre e Santos Silva";
The name is "Andre" and the surname will be "e Santos",
so my return must be "Andre e Santos".
How can I get this string with the name and "first" surname?

Completely impossible.
The concepts 'first name' and 'last name' are not global, while names generally are. Even if you decide to just toss a middle finger to a significant chunk of the global population and act like those people just don't matter, the remaining places that do have a first/last name scheme roughly fitting your evident idea of how the entire world names things are not consistent enough to let you determine first and last name from an input string, unless you throw some quite significant pattern matching Artificial Intelligence algorithms at it.
SOLUTION: Stop worrying about it. There is no such thing as first and last name, there's just name. If you have some bass ackwards old timey system that must know, tell the developers of it to get with the program. If you can't tell them, then ask whatever you're getting this input from to give it to you separated out into 'first' and 'last' name. If you can't do that either, you're completely hosed; tell whoever gave you the instruction to build this software that it is not possible, and that the next step isn't technical/development, it's political/organizational: convince the suppliers to change the process so the input is provided in first/last name form, or convince the ones you are passing this data to, to stop wanting it in first/last name form.
Some example names to show why the world doesn't work the way you think it does. Please be the computer algorithm and explain to me exactly what the first and last names are of each of these full names. These are official names that e.g. show in passports where relevant.
IN: Prince Harry, Duke of Sussex. (Correct OUT: Henry, Mountbatten-Windsor, which clearly cannot possibly be derived from that name!)
IN: Ivan Ivanovich (Correct OUT: Ivan, and there is no last name here. That's a patronymic, which is not the same thing. People with Russian-origin names usually do have an actual last name (in the sense that their parent or parents also had that name, a thing you can call a 'family name'), but they don't commonly use it, and if they have to enter their full name in a form, you're likely getting first name + patronymic, and that's all).
IN: Nanna Bryndís Hilmarsdóttir (Correct OUT: Nanna Bryndís, Hilmarsdóttir - probably. But if you expect her father, mother, or hypothetical children to also have that last name, no, they won't, and calling that their 'family name' is wrong. This too is a patronymic, but unlike in e.g. Russia, family names aren't a thing, as far as I know, in Iceland - their patronymic is for all intents and purposes their last name. It's just... not a family name).
IN: Kim Jong-il (Correct OUT: Jong-il Kim or possibly Yuri Kim or maybe Yuri Irsenovich Kim - note that the first substring in the input is the last name. This is common in many Asian cultures, including Korea (both of them), China, and many more).
IN: José Antonio Gómez Iglesias (OUT: Well, if this is a Spanish person, which the name certainly suggests, then the right breakdown is José Antonio and Gómez Iglesias, but it is rare but possible that the correct breakdown is José Antonio Gómez and Iglesias. There is absolutely no way to be sure. The first is by far the most likely, but that's based on the fact that the name 'sounds Spanish'. Which is where that whole 'you need a quite complicated AI ruleset to try to figure this out' comes in; it needs to match this behaviour: check the name against a giant neural net or other database to guesstimate that it is highly likely to be Spanish in origin, and that Gómez is a common surname).
IN: Johannes Vennegoor of Hesselink. (Correct OUT: Johannes, Vennegoor of Hesselink. Sort under 'V' if sorting on last name).
IN: Jan Willem Vergeer (Correct OUT: Jan Willem, Vergeer. Contrast this with the previous name. Completely impossible to separate out using basic string algorithms. The only way is to use an AI to determine that Jan Willem is a common Dutch first name, and that the official spelling is usually without a hyphen).
IN: Andries de Witt (Correct OUT: Andries de Witt, but de is an interstitial. If sorting, you must sort on W, and not d. In systems that can't handle this, it is common to split this out as Andries and Witt, de instead, and e.g. Dutch phonebooks will take the latter approach).

Related

How do you convert a java String to a mailing address object?

As input I am getting an address as a String. It may say something like "123 Fake Street\nLos Angeles, CA 99988". How can I convert this into an object with fields like this:
Address1
Address2
City
State
Zip Code
Or something similar to this? If there is a java library that can do this, all the better.
Unfortunately, I don't have a choice about the String as input. It's part of a specification I'm trying to implement.
The input is not going to be very well structured so the code will need to be very fault tolerant. Also, the addresses could be from all over the world, but 99 out of 100 are probably in the US.

You can use JGeocoder
public static void main(String[] args) {
Map<AddressComponent, String> parsedAddr = AddressParser.parseAddress("Google Inc, 1600 Amphitheatre Parkway, Mountain View, CA 94043");
System.out.println(parsedAddr);
Map<AddressComponent, String> normalizedAddr = AddressStandardizer.normalizeParsedAddress(parsedAddr);
System.out.println(normalizedAddr);
}
Output will be:
{street=Amphitheatre, city=Mountain View, number=1600, zip=94043, state=CA, name=Google Inc, type=Parkway}
{street=AMPHITHEATRE, city=MOUNTAIN VIEW, number=1600, zip=94043, state=CA, name=GOOGLE INC, type=PKWY}
There is another library, International Address Parser; you can check out its trial version. It handles the country as well.
AddressParser addressParser = AddressParser.getInstance();
AddressStandardizer standardizer = AddressStandardizer.getInstance();//if enabled
AddressFormater formater = AddressFormater.getInstance();
String rawAddress = "101 Avenue des Champs-Elysées 75008 Paris";
//you can try to detect the country
CountryDetector detector = CountryDetector.getInstance();
String countryCode = detector.getCountryCode("7580 Commerce Center Dr ALABAMA");
System.out.println("detected country=" + countryCode);
Also, please check Implemented Countries in this library.
Cheers !!

I work at SmartyStreets where we develop address parsing and extraction algorithms.
It's hard.
If most of your addresses are in the US, you can use an address verification service to provide guaranteed accurate parse results (since the addresses are checked against a master list).
There are several providers out there, so take a look around and find one that suits you. Since you probably won't be able to install the database locally (not without a big fee, because address data is licensed by the USPS), look for one that offers a REST endpoint so you can just make an HTTP request. Since it sounds like you have a lot of addresses, make sure the API is high-performing and lets you do batch requests.
For example, with ours:
Input:
13001 Point Richmond Dr NW, Gig Harbor WA
Output: [image]
Or the more specific breakdown of components, if needed: [image]
If the input is even messier, there are a few address extraction services available that can handle a little bit of noise within an address and parse addresses out of text and turn them into their components. (SmartyStreets offers this also, as a beta API. I believe some other NLP services do similar things too.)
Granted, this only works for US addresses. I'm not as expert on UK or Canadian addresses, but I believe they may be slightly simpler in general.
(Beyond a small handful of well-developed countries, international data is really hit-and-miss. Reliable data sets are hard to obtain or don't exist. But if you're on a really tight budget you could write your own parser for all the address formats.)

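If you are sure of the format, you can use a regular expression with capturing groups to get the parts of the address out of the string. For the example you provided, something like this:
String address = "123 Fake Street\nLos Angeles, CA 99988";
// needs java.util.regex.Pattern and java.util.regex.Matcher
Matcher m = Pattern.compile("(.*)\\n(.*), ([A-Z]{2}) ([0-9]{5})").matcher(address);
if (m.matches()) { /* m.group(1) = street line, m.group(2) = city, m.group(3) = state, m.group(4) = zip */ }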

I assume the sequence of information is always the same, as in the user will never enter the postal code before the state. If I got your question correctly, you need logic to process an address that may be incomplete (like missing a portion).
One way to do it is to look for portions of the string you know are correct. You can treat the known parts of the address as separators. You will need city and state names and address words (such as "Street", "Avenue", "Road", etc.) in an array.
1. Perform indexOf with the cities, states and address words (and store the results).
2. Substring and cut out the 1st line of the address (from the start to the index of the address-signifying word + its length).
3. Check the index of the city name (index found in step 1). If it's -1, skip this step. If it's 0, take it out (0 also means address line 2 is not in the string). If it's more than 0, substring and cut out anything from the start of the string to the index of the city name as the 2nd line of the address.
4. Check the index of the state name. Once again, if it's -1, skip this step. If it's 0, substring and cut it out as the state name.
5. Whatever remains is your postal code.
6. Check the strings you just extracted for leftover separators (commas, dots, new lines, etc.) and remove them.
If the address is missing both state and city, you would actually need a list of zip codes too, so better ensure the user enters at least one of them.
It's not impossible to implement what you need, but you probably don't want to waste all that time doing it. It's easier to just ensure user enters everything correctly.
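The steps above can be sketched roughly as follows. This is only a minimal illustration under strong assumptions: the class name `NaiveAddressSplitter`, the three tiny lookup lists and the output keys are all made up for the example, and the state is only recognized when it sits at the start of the remaining text.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NaiveAddressSplitter {
    // Tiny illustrative lookup tables (assumption); real data would be far larger.
    static final List<String> STREET_WORDS = List.of("Street", "Avenue", "Road");
    static final List<String> CITIES = List.of("Los Angeles", "Mountain View");
    static final List<String> STATES = List.of("CA", "NY");

    // Strip leftover separators (whitespace, commas, dots) from both ends.
    static String clean(String s) {
        return s.replaceAll("^[\\s,.]+|[\\s,.]+$", "");
    }

    static Map<String, String> split(String input) {
        Map<String, String> out = new LinkedHashMap<>();
        String rest = clean(input);

        // Address line 1: everything up to and including a known street word.
        for (String word : STREET_WORDS) {
            int i = rest.indexOf(word);
            if (i >= 0) {
                int end = i + word.length();
                out.put("address1", clean(rest.substring(0, end)));
                rest = clean(rest.substring(end));
                break;
            }
        }
        // City: anything in front of it is treated as address line 2.
        for (String city : CITIES) {
            int i = rest.indexOf(city);
            if (i >= 0) {
                if (i > 0) out.put("address2", clean(rest.substring(0, i)));
                out.put("city", city);
                rest = clean(rest.substring(i + city.length()));
                break;
            }
        }
        // State: only taken when it is at the start of what is left.
        for (String state : STATES) {
            if (rest.startsWith(state)) {
                out.put("state", state);
                rest = clean(rest.substring(state.length()));
                break;
            }
        }
        // Whatever remains is assumed to be the postal code.
        if (!rest.isEmpty()) out.put("zip", rest);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("123 Fake Street\nLos Angeles, CA 99988"));
    }
}
```

Plain substring matching like this will misfire as soon as a street word or state abbreviation appears inside another word, which is part of why the approach gets expensive quickly.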

Maybe you can use a regular expression.

RitaWordnet API has no support for retrieving sense number?

I'm comparing some senses with RitaWordNet and using SenseRelate::AllWords to do word sense disambiguation on them, but I'm in trouble. I can't figure out how to compare the output from RitaWordNet with the output of the AllWords script.
Rita gives me the sense id, name, description, and pos/bestPos (adjective, verb, noun, etc.), but not the sense number (#1, #2, #3, ...). The output I get is like this:
"user","n", "someone who use something..".
::AllWords can't retrieve description, but (wsd.pl) give me
name#pos#sensenumber ("User#n#1").
Which was what I was hoping for actually, but then I realized that Rita doesn't support sense numbers (strangely).
So now I'm a little stuck on how to compare them, how to determine if they are the same sense. Any ideas on how to solve this?

I realized that WordNet returns a sorted list grouped by part of speech (nouns, verbs, etc.).
Example: "story"
Noun: narrative, narration, story, tale #1
Noun: story #2
Noun: floor, level, storey, story #3
So now I can simply add sense numbers 1, 2 and 3 to the first three nouns (and the rest will continue, of course). Then I can compare this with user#n#1 from the word sense disambiguation. The creators of RitaWordNet responded that they would like to implement the sense number in their API, but they don't have an upcoming release planned, so this may take a while.
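A minimal sketch of that numbering idea, using plain Java collections rather than the RitaWordNet API. The class `SenseNumbering`, the method `numberSenses` and the `{pos, label}` array layout are hypothetical, and the input list is assumed to already be in the order WordNet returns the senses.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SenseNumbering {
    /**
     * Maps keys such as "story#n#1" to the sense's label. Each entry of
     * sensesInOrder is {pos, label}, assumed to be in WordNet's own order.
     */
    static Map<String, String> numberSenses(String word, List<String[]> sensesInOrder) {
        Map<String, Integer> perPos = new HashMap<>();
        Map<String, String> keyed = new LinkedHashMap<>();
        for (String[] sense : sensesInOrder) {
            // Next running number for this part of speech (1, 2, 3, ...).
            int n = perPos.merge(sense[0], 1, Integer::sum);
            keyed.put(word.toLowerCase() + "#" + sense[0] + "#" + n, sense[1]);
        }
        return keyed;
    }

    public static void main(String[] args) {
        Map<String, String> story = numberSenses("story", List.of(
            new String[] {"n", "narrative, narration, story, tale"},
            new String[] {"n", "story"},
            new String[] {"n", "floor, level, storey, story"}));
        // Look up a wsd.pl-style key, ignoring case in the word part.
        System.out.println(story.get("Story#n#3".toLowerCase()));
    }
}
```

Lower-casing both sides is a design choice made here so that a key like "User#n#1" from the disambiguation output matches the lookup result for "user".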

Tools for text simplification (Java) [closed]

What is the best tool that can do text simplification using Java?
Here is an example of text simplification:
John, who was the CEO of a company, played golf.
↓
John played golf. John was the CEO of a company.

I see your problem as a task of converting complex or compound sentences into simple sentences.
Based on the literature on sentence types, a simple sentence is built from one independent clause. A compound or complex sentence is built from at least two clauses. Also, a clause must have a subject and a verb.
So your task is to split the sentence into the clauses that form it.
Dependency parsing from Stanford CoreNLP is a good tool for splitting compound and complex sentences into simple sentences. You can try the demo online.
From your sample sentence, we will get parse result in Stanford typed dependency (SD) notation as shown below:
nsubj(CEO-6, John-1)
nsubj(played-11, John-1)
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
det(company-9, a-8)
prep_of(CEO-6, company-9)
root(ROOT-0, played-11)
dobj(played-11, golf-12)
A clause can be identified from a relation (in SD) whose category is subject, e.g. nsubj, nsubjpass. See the Stanford Dependency Manual.
A basic clause can be extracted by taking the head as the verb part and the dependent as the subject part. From the SD above, there are two basic clauses, i.e.
John CEO
John played
After you get the basic clauses, you can add the other parts to make each clause a complete and meaningful sentence. To do so, please consult the Stanford Dependency Manual.
By the way, your question might be related to Finding meaningful sub-sentences from a sentence
Answer to 3rd comment:
Once you have the pair of subject and verb, i.e. nsubj(CEO-6, John-1), get all dependencies that link to that dependency, except any dependency whose category is subject, then extract the unique words from these dependencies.
Based on the example, nsubj(CEO-6, John-1), if you start traversing from John-1, you'll get nsubj(played-11, John-1), but you should ignore it since its category is subject.
The next step is traversing from the CEO-6 part. You'll get
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
From the result above, you get new dependencies to traverse (i.e. find other dependencies that have was-4, the-5, or company-9 as either head or dependent).
Now your dependencies are
cop(CEO-6, was-4)
det(CEO-6, the-5)
rcmod(John-1, CEO-6)
prep_of(CEO-6, company-9)
det(company-9, a-8)
In this step, you've finished traversing all dependencies linked to nsubj(CEO-6, John-1). Next, extract the words from all heads and dependents, then arrange the words in ascending order based on the number appended to them. This number indicates the word's position in the original sentence.
John was the CEO a company
Our new sentence is missing one part, i.e. of. This part is hidden in prep_of(CEO-6, company-9). If you read the Stanford Dependency Manual, there are two kinds of SD, collapsed and non-collapsed. Please read about them to understand why this of is hidden and how to get the word order of this hidden part.
With the same approach, you'll get the second sentence
John played golf
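The traversal for the first clause can be sketched in plain Java as below. It does not call Stanford CoreNLP; the dependencies are typed in by hand from the listing above, and the names `ClauseExtractor`, `Dep`, `exampleDeps` and `clauseFor` are made up for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

class ClauseExtractor {
    /** One typed dependency, e.g. cop(CEO-6, was-4). */
    static final class Dep {
        final String rel, head, dependent;
        Dep(String rel, String head, String dependent) {
            this.rel = rel; this.head = head; this.dependent = dependent;
        }
        boolean isSubject() { return rel.startsWith("nsubj"); }
    }

    /** The dependencies of the sample sentence, copied from the parse result. */
    static List<Dep> exampleDeps() {
        return List.of(
            new Dep("nsubj", "CEO-6", "John-1"), new Dep("nsubj", "played-11", "John-1"),
            new Dep("cop", "CEO-6", "was-4"), new Dep("det", "CEO-6", "the-5"),
            new Dep("rcmod", "John-1", "CEO-6"), new Dep("det", "company-9", "a-8"),
            new Dep("prep_of", "CEO-6", "company-9"), new Dep("root", "ROOT-0", "played-11"),
            new Dep("dobj", "played-11", "golf-12"));
    }

    static int position(String token) {
        return Integer.parseInt(token.substring(token.lastIndexOf('-') + 1));
    }

    static String word(String token) {
        return token.substring(0, token.lastIndexOf('-'));
    }

    /** Collects every word reachable from the subject pair without crossing a subject relation. */
    static String clauseFor(Dep subject, List<Dep> all) {
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        for (String t : List.of(subject.head, subject.dependent)) {
            seen.add(t);
            queue.add(t);
        }
        while (!queue.isEmpty()) {
            String token = queue.poll();
            for (Dep d : all) {
                if (d.isSubject()) continue; // ignore dependencies whose category is subject
                String other = d.head.equals(token) ? d.dependent
                             : d.dependent.equals(token) ? d.head : null;
                if (other != null && seen.add(other)) queue.add(other);
            }
        }
        // Order the words by the number appended to each token.
        TreeMap<Integer, String> ordered = new TreeMap<>();
        for (String t : seen) ordered.put(position(t), word(t));
        return String.join(" ", ordered.values());
    }

    public static void main(String[] args) {
        List<Dep> deps = exampleDeps();
        System.out.println(clauseFor(deps.get(0), deps)); // John was the CEO a company
    }
}
```

Applied unchanged to nsubj(played-11, John-1), this naive traversal would also follow rcmod(John-1, CEO-6) and root(ROOT-0, played-11), so the second clause needs some extra filtering on top of this sketch.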

I think one can design a very simple algorithm for the basic cases of this situation, while real world cases may be so many that such an approach will become unruly :)
Still, I thought I should think aloud and write out my approach, and maybe add some python code. My basic idea is to derive a solution from first principles, mostly by explicitly exposing our model of what is really happening, and not to rely on other theories, models, or libraries BEFORE we do one by HAND and from SCRATCH.
Goal: given a sentence, extract subsentences from it.
Example: John, who was the ceo of the company, played Golf.
Expected output: John was the CEO of the company. John played Golf.
Here is my model of what is happening here written out in the form of model assumptions:
(axioms?)
MA1. Simple sentences can be expanded by inserting subsentences.
MA2. A subsentence is a qualification/modification(additional information) on one or more of the entities.
MA3. To insert a subsentence, we put a comma right next to the entity we want to expand on (provide more information on) and attach the subsentence, I am going to call it an extension - and place another comma when the extension ends.
Given this model, the algorithm can be straightforward at least to address the simple cases first.
1. DETECT: Given a sentence, detect if it has an extension clause by looking for a pair of commas in the sentence.
2. EXTRACT: If you find two commas, generate two sentences:
2.1 EXTRACT-BASE: base sentence:
delete everything between the two commas. You get the base sentence.
2.2 EXTRACT-EXTENSION: extension sentence:
take everything inside the extension, and replace 'who' with the word right before it.
That is your second sentence.
3. PRINT: In fact you should print the extension sentence first, because the base sentence depends on it.
Well, that is our algorithm. Yes it sounds like a hack. It is. But something I am learning now, is that, if you use a trick in one program it is a hack, if it can handle more stuff, it is a technique.
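For the simple two-comma case, the DETECT and EXTRACT steps might look like this in Java (a rough sketch in Java rather than Python, not the answerer's own implementation; the class and method names are invented, and it only handles a single extension that starts with 'who').

```java
class CommaClauseSplitter {
    /**
     * Returns {extension sentence, base sentence}, or just {sentence}
     * when the input does not contain a pair of commas.
     */
    static String[] split(String sentence) {
        // DETECT: look for a pair of commas.
        int first = sentence.indexOf(',');
        int second = first < 0 ? -1 : sentence.indexOf(',', first + 1);
        if (second < 0) return new String[] { sentence };

        String before = sentence.substring(0, first).trim();
        String inside = sentence.substring(first + 1, second).trim();
        String after = sentence.substring(second + 1).trim();

        // EXTRACT-BASE: drop everything between the two commas.
        String base = before + " " + after;

        // EXTRACT-EXTENSION: replace a leading 'who' with the word before the first comma.
        String entity = before.substring(before.lastIndexOf(' ') + 1);
        String extension = inside.startsWith("who ") ? entity + inside.substring(3) : inside;
        if (!extension.endsWith(".")) extension += ".";

        // PRINT order: extension first, then base.
        return new String[] { extension, base };
    }

    public static void main(String[] args) {
        for (String s : split("John, who was the CEO of the company, played Golf.")) {
            System.out.println(s);
        }
    }
}
```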
So let us expand and complicate the situation a bit.
Compounding cases:
Example 2. John, who was the CEO of the company, played Golf with Ram, the CFO.
As I am writing it, I noticed that I had omitted the 'who was' phrase for the CFO! That brings us to the complicating case where our algorithm will fail. Before going there, let me create a simpler version of 2 that WILL work.
Example 3. John, who was the CEO of the company, played Golf with Ram, who was the CFO.
Example 4. John, the CEO of the company, played Golf with Ram, the CFO.
Wait we are not done yet!
Example 5. John, who is the CEO and Ram, who was the CFO at that time, played Golf, which is an engaging game.
To allow for this I need to extend my model assumptions:
MA4. More than one entity may be expanded likewise, but this should not cause confusion because the extension clause occurs right next to the entity being informed about. (accounts for example 3)
MA5. The 'who was' phrase may be omitted since it can be inferred by the listener. (accounts for example 4)
MA6. Some entities are persons, they will be extended using a 'who' and some entities are things, extended using a 'which'. Either of these extension heads may be omitted.
Now how do we handle these complications in our algorithm?
Try this:
SPLIT-SENTENCE-INTO-BASE-AND-EXTENSIONS:
If sentence contains a comma, look for the following comma, and extract whatever is in between into extension sentence. Continue until you find no more closing comma or opening comma left.
At this point you should have list with base sentence and one or more extension sentences.
PROCESS_EXTENSIONS:
For each extension, if it has 'who is' or 'which is', replace the 'who' or 'which' with the name that comes right before the extension headword.
If the extension does not have a 'who is' or 'which is', prepend the leading word and an 'is'.
PRINT: all extension sentences first and then the base sentences.
Not scary.
When I get some time in the next few days, I will add a python implementation.
Thank you
Ravi Annaswamy

You are unlikely to solve this problem using any known algorithm in the general case - this is getting into strong AI territory. Even humans can't parse grammar very well!
Note that the problem is quite ambiguous regarding how far you simplify and what assumptions you are willing to make. You could take your example further and say:
John is assumed to be the name of a being. The race of John is unknown. John played
golf at some point in the past. Golf is assumed to refer to the ball
game called golf, but the variant of golf that John played is unknown.
At some point in the past John was the CEO of a company. CEO is assumed to
mean "Chief Executive Officer" in the context of a company but this is
not specified. The company is unknown.
In case the lesson is not obvious: the more you try to determine the exact meaning of words, the more cans of worms you start to open up...... it takes human-like levels of judgement and interpretation to know when to stop.
You may be able to solve some simpler cases using various Java-based NLP tools: see Is there a good natural language processing library

I believe AlchemyApi is your best option. Still, it will require a lot of work on your side to do exactly what you need, and as most commenters have already told you, you most probably won't get 100% quality results.

Localizing a string containing list of names

I have a string containing a list of names like below:
"John asked Kim, Kelly, Lee and Bob about the new year plans". The number of names in the list can vary.
How can I localize this in Java?
I am thinking about ResourceBundle and MessageFormat. How will I write the pattern for this in MessageFormat?
Is there any better approach?

Localizing an (inline) list is more than just translating the word “and.” CLDR deals with the issue of formatting lists; check out their page on lists. I’m afraid ICU doesn’t have support for this yet, so you might need to code it separately.
Another issue is that you cannot expect to be able to use names as such in sentences like this. Many languages require the object to be in an inflected form, for example. In Finnish, your sample sentence would read as “John kysyi Kimiltä, Kellyltä, Leeltä ja Bobilta uudenvuoden suunnitelmista.” So you may need to find out and include different inflected forms of the names. Moreover, if the language used does not use the Latin alphabet, you may need transliterated forms of the names (e.g., in Arabic, John is جون). There are other problems as well. In Russian, the verb corresponding to “asked” depends on the gender of the subject (e.g., спросила vs. спросил).
I know this sounds complex, but localization is often complex. If you target a limited set of languages only, things can be much easier, so it is important to define your goals, perhaps accepting some simplifications that may result in grammatically incorrect expressions. But for localization that is to cover a wide range of languages, you may need to make the generating function localized. That is, you would have, for each language, a function that accepts a list of names as arguments and returns a string representing the statement, possibly using resource files containing information (transliterated form, different inflected forms, gender) about proper names that may appear.
In some situations, you might even consider generating the sentence in English, then sending it to an online translator. For example, Google Translator can deal with some of the issues that I mentioned. It surely produces wrong translations a lot, but for sentences with grammatically very simple structure, it might be a pragmatic solution, if you can accept some amount of errors. If you consider trying this, make sure you test sufficiently how the automatic translator can handle the specific sentences you will use. Quite often you can improve the results by reformulating the sentences. Dividing a sentence with several clauses into separate sentences often helps. But even your simple sentence causes problems in automatic translation.
You might avoid some complications if you can reformulate the sentence structure, e.g. so that all the nouns appear in the subject position and you avoid “packed” expressions like “new year plans.” For example, “John asked what plans Kim, Kelly, Lee, and Bob have for the new year” would be simpler, both for automatic translation and for pattern-based localization.

You could do something like:
"{0} asked {1} about the new year plans"
where 0 is the first name and 1 is a comma-separated list of the other names.
Hope this helps.

I see an answer was already accepted, I'm just adding this here as an alternative. The code has hard coded values for the data, but is only meant to present an idea that can be refined:
// requires: import java.text.MessageFormat;
MessageFormat people = new MessageFormat("{0} asked {1,choice,0#no one|1#{2}|2#{2} and {3}|2<{2}, and {3}} about the new year plans");
String john = "John";
Object[][] parties = new Object[][] { {john, 0}, {john, 1, "Kim"}, {john, 2, "Kim", "Kelly"}, {john, 4, "Kim, Kelly, Lee", "Bob"} };
for (final Object[] strings : parties) {
    System.out.println(people.format(strings));
}
This outputs the following:
John asked no one about the new year plans
John asked Kim about the new year plans
John asked Kim and Kelly about the new year plans
John asked Kim, Kelly, Lee, and Bob about the new year plans
Determining the number of names that is used for the 2nd argument and creating the comma-delimited string for the 3rd argument isn't displayed in that sample, but can easily be done instead of using the hard coded values I used.
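One possible way to derive those arguments from an actual list of names is sketched below. The class `AskedSentence` and its `format` method are hypothetical helpers; the pattern is the same English-only one as above, so this does nothing for languages that need different list rules.

```java
import java.text.MessageFormat;
import java.util.List;

class AskedSentence {
    static final String PATTERN =
        "{0} asked {1,choice,0#no one|1#{2}|2#{2} and {3}|2<{2}, and {3}} about the new year plans";

    static String format(String asker, List<String> others) {
        int n = others.size();
        String head = ""; // argument {2}: the only name, or all names but the last
        String last = ""; // argument {3}: the last name when there are two or more
        if (n == 1) {
            head = others.get(0);
        } else if (n >= 2) {
            head = String.join(", ", others.subList(0, n - 1));
            last = others.get(n - 1);
        }
        // {1} is the count that drives the choice format.
        return new MessageFormat(PATTERN).format(new Object[] { asker, n, head, last });
    }

    public static void main(String[] args) {
        System.out.println(format("John", List.of("Kim", "Kelly", "Lee", "Bob")));
    }
}
```

In a localized application the pattern string would come from a ResourceBundle instead of a constant, which is the combination the question asks about.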

For localization, the normal approach is to use external language packs: files that contain the text you're going to display, with each text assigned a name/key. The program then loads each text by its key.

You could combine your ResourceBundle (for I18N) with a MessageFormat (to replace placeholders with the names) : "{0} asked {1} about the new year plans"
It would be up to you to prepare the names beforehand, though.

Dealing with phone numbers formats

I think I'm facing a paradox here.
What I'm trying to do is this: when I receive or make a call, I have the number, and I need to know whether it's an international number, a local number, etc.
The problem is:
For me to know if a number is international, I need to parse it and check its length, but the length differs from country to country. So should I write a method that parses and recognizes numbers for each country? (Impractical, in my opinion.)
For me to know if it's a local number, I need the area code, so I have to do the same thing: parse the number, check the length, and get the first digits based on the area code length.
It's kind of hard to find a solution for this. The libphonenumber library offers a lot of useful classes, but the one that I thought could help me took me to another paradox.
The method phoneUtil.parse(number, countryAcronym) returns the number with its country code. If I pass the number with the acronym "US", it returns the number with country code '1'; if I change the acronym to "BR", it returns '55', which is the country code for Brazil. So I need the country acronym based on the number I get.
EX:
numberReturned = phoneUtil.parse(phoneNumber, "US");
phoneUtil.format(numberReturned, PhoneNumberFormat.INTERNATIONAL);
The above code returns the number with the US country code, but if I change "US" to any other country acronym, it will return the same number with the country code of that country.
I know that this lib is not supposed to guess which country the number is from (THAT WOULD BE AWESOME!!), but that's what I need.
This is really driving me crazy. I need good advice from the wise mages of SO.
If you could please help me reach a good decision, I'd be so thankful.
Thanks.
PS: If you already use libphonenumber and have more experience with this, please guide me on which class to use, if there is one capable of solving this problem. =)

1) The second parameter to phoneUtil.parse must match the country you're currently in - it's used if the phone number received does not include an international prefix. That's why you get different results when you change the parameter: the phone number you pass it does not contain such a prefix, so it just uses what you've told it.
Any parsing solution set to determine if the phone number is international or not will need to be aware of this: depending on the source, even a national number may be represented with the international dialing prefix (usually abstracted as +, since it differs between countries, but this is not guaranteed).
2) For area code parsing, there is no universal standard; some countries don't use them, and even within a country, area codes may have differing lengths (e.g. Germany). I'm not aware of an international library for this - and a quick search doesn't find anything (though that doesn't mean one does not exist). You might need to roll your own here; if you only need to support a single country, this shouldn't be too hard.
