should i screen out odd characters from names - java

From "Personal names in a global application: What to store" and "How can I validate a name, middle name, and last name using regex in Java?", I have read that you can't really validate names because of the international possibilities: long names, multiple names, unusual names. The general verdict is to avoid validation and play it safe instead, which means allowing all possible characters and combinations and just printing the name as HTML-safe markup.
But what about special characters? The Shift + "one to nine" series and others: should I just allow them into the database and "play safe", or should I screen them out?
I would also like users of my program to enter names responsibly (though I can't guarantee that). At some point there should be enforced rules, but without totally locking out people who legitimately have a reason to use $ or # in their names.
I'm on PHP and JS, but the same goes for all languages that use input validation.
EDIT:
I should note that I don't mean just Shift 1-9; that's just what I call them. I also mean special characters outside the 1-9 keys. Sorry for the confusion.
Here's the thing: my application is like a library application. A book has a title, an author, and a year. While the title and year may go into one table, I want the author stored in another table. These inputs come from the users. Now I'm going to implement autocomplete for the authors, but the autocomplete data is based on user input, so its reliability depends on the author names that users enter.
How does Facebook implement this? I haven't seen any friend using special characters, unlike in the Friendster days, when every search brought up people with numeric or special-character names first, which is not great for an autocomplete.

Shift + "one to nine" doesn’t really specify a set of characters, because what such combinations produce depends on the keyboard. If you mean the characters in the Shift positions of keys 0 to 9 on standard US keyboards, then I have to admit that I have never seen a person’s real name (as opposed to a nickname) with such characters. But I would not bet on their absolute absence from names. Yesterday, I learned that some orthography of the Venetian language uses “£” (pound sign) as a letter. Moreover, people might use easily available characters as replacements for characters they cannot easily produce on a keyboard, e.g. using “!” instead of “ǃ” (U+01C3 Latin letter retroflex click) or “e^” instead of “ê”.
The question is what you expect to gain by excluding some characters. To catch typos?
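For the "play safe" route described in the question (store the name as entered, escape it only when printing it into HTML), a minimal sketch in Java might look like this. The class and method names are made up for illustration, and a real application would more likely rely on its template engine or an established encoder library than on a hand-rolled method.

```java
// Hypothetical helper: escapes the five characters that matter in HTML text
// and attribute values, and leaves everything else (including $, #, £) alone.
class HtmlSafe {
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The name is stored unchanged; only the rendered form is escaped.
        System.out.println(escape("<b>O'Neil & $on#</b>"));
    }
}
```

With this split, unusual characters in a name are harmless to the page, and any screening becomes a question of data quality rather than safety.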


Detecting phone numbers using sapi

This relates to the question "Detecting phone numbers using sapi". I used the grammar from the answer there.
But it returns the numbers with spaces. How can I create a grammar that returns numbers without spaces?
Short answer - it would be quite difficult to do that, as SAPI is designed to recognize words, which have spaces in between them.
However, it's pretty straightforward to annotate the grammar so that you can find the start & end position of the phone number within the reco result, at which point you can remove the spaces using string replacement.
Alternatively, it's also pretty straightforward to place the phone number in a tag, at which point you can extract the number from the rules structure. (This might be trickier from Java.)

Java & Regex: Matching a substring that is not preceded by specific characters

This is one of those questions that has been asked and answered hundreds of times over, but I'm having a hard time adapting other solutions to my needs.
In my Java-application I have a method for censoring bad words in chat messages. It works for most of my words, but there is one particular (and popular) curse word that I can't seem to get rid of. The word is "faen" (which is simply a modern slang for "satan", in the language in question).
Using the pattern "fa+e+n" for matching multiple A's and E's actually works; however, in this language, the word for "that couch" or "that sofa" is "sofaen". I've tried a lot of different approaches, using variations of [^so] and (?!=so), but so far I haven't been able to find a way to match one and not the other.
The real goal here, is to be able to match the bad words, regardless of the number of vowels, and regardless of any non-letters in between the components of the word.
Here's a few examples of what I'm trying to do:
"String containing faen" Should match
"String containing sofaen" Should not match
"Non-letter-censored string with f-a#a-e.n" Should match
"Non-letter-censored string with sof-a#a-e.n" Should not match
Any tips to set me off in the right direction on this?
You want something like \bf[^\s]*a[^\s]*e[^\s]*n\b. Note that this is the regular expression; in a Java string literal every backslash must be doubled, so you need \\bf[^\\s]*a[^\\s]*e[^\\s]*n\\b.
Note also that this isn't perfect, but does handle the situations that you have suggested.
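A somewhat tighter variant of the same idea, as a sketch only: instead of a word boundary it uses a lookbehind that rejects any preceding letter, allows repeated vowels, and permits only non-letter, non-space filler between the letters. The class name, method name, and the exact pattern are illustrative assumptions, not a vetted filter.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is one possible reading of the requirements.
class CensorDemo {
    // (?<!\p{L})      : the 'f' must not be preceded by a letter (rules out "sofaen")
    // [^\p{L}\s]*     : any run of non-letter, non-space filler such as '-', '#', '.'
    // (?:a...)+ etc.  : one or more a's, then one or more e's, each followed by filler
    // (?!\p{L})       : the final 'n' must not be followed by a letter
    private static final Pattern BAD_WORD = Pattern.compile(
        "(?<!\\p{L})f[^\\p{L}\\s]*(?:a[^\\p{L}\\s]*)+(?:e[^\\p{L}\\s]*)+n(?!\\p{L})",
        Pattern.CASE_INSENSITIVE);

    static boolean containsBadWord(String message) {
        return BAD_WORD.matcher(message).find();
    }

    public static void main(String[] args) {
        System.out.println(containsBadWord("String containing faen"));                        // true
        System.out.println(containsBadWord("String containing sofaen"));                      // false
        System.out.println(containsBadWord("Non-letter-censored string with f-a#a-e.n"));     // true
        System.out.println(containsBadWord("Non-letter-censored string with sof-a#a-e.n"));   // false
    }
}
```

It still misses variations such as doubled consonants ("ffaen"), so the caveat above about imperfection applies here too.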
It's a terrible idea to begin with. You think, your users would write something like "f-aeen" to avoid your filter but would not come up with "ffaen" or "-faen" or whatever variation that you did not prepare for? This is a race you cannot win and the real loser is usability.

Normalizing/unaccenting text in Java

How can I normalize/unaccent text in Java? I am currently using java.text.Normalizer:
Normalizer.normalize(str, Normalizer.Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
But it is far from perfect. For example, it leaves Norwegian characters æ and ø untouched. Does anyone know of an alternative? I am looking for something that would convert characters in all sorts of languages to just the a-z range. I realize there are different ways to do this (e.g. should æ be encoded as 'a', 'e' or even 'ae'?) and I'm open for any solution. I prefer to not write something myself since I think it's unlikely that I will be able to do this well for all languages. Performance is NOT critical.
The use case: I want to convert a user entered name to a plain a-z ranged name. The converted name will be displayed to the user, so I want it to match as close as possible what the user wrote in his original language.
EDIT:
Alright people, thanks for negging the post and not addressing my question, yay! :) Maybe I should have left out the use case. But please allow me to clarify. I need to convert the name in order to store it internally. I have no control over the choice of letters allowed here. The name will be visible to the user in, for example, the URL. The same way that your user name on this forum is normalized and shown to you in the URL if you click on your name. This forum converts a name like "Bășan" to "baan" and a name like "Øyvind" to "yvind". I believe it can be done better. I am looking for ideas and preferably a library function to do this for me. I know I can not get it right, I know that "o" and "ø" are different, etc, but if my name is "Øyvind" and I register on an online forum, I would likely prefer that my user name is "oyvind" and not "yvind". Hope that this makes any sense! Thanks!
(And NO, we will not allow the user to pick his own user name. I am really just looking for an alternative to java.text.Normalizer. Thanks!)
Assuming you have considered ALL of the implications of what you're doing, ALL the ways it can go wrong, what you'll do when you get Chinese pictograms and other things that have no equivalent in the Latin Alphabet...
There's not a library that I know of that does what you want. If you have a list of equivalencies (as you say, the 'æ' to 'ae' or whatever), you could store them in a file (or, if you're doing this a lot, in a sorted array in memory, for performance reason) and then do a lookup and replace by character. If you have the space in memory to store the (# of unicode characters) as a char array, being able to run through the unicode values of each character and do a straight lookup would be the most efficient.
i.e., \u1234 => lookupArray[0x1234] => 'q'
or whatever.
so you'll have a loop that looks like:
StringBuilder buf = new StringBuilder();
for (int i = 0; i < string.length(); i++) {
    // a char is already a UTF-16 code unit, so it can index the table directly
    buf.append(lookupArray[string.charAt(i)]);
}
I wrote that from scratch, so there are probably some bad method calls or something.
You'll have to do something to handle decomposed characters, probably with a lookahead buffer.
Good luck - I'm sure this is fraught with pitfalls.
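One way to combine the two ideas (decompose with java.text.Normalizer first, then fall back to a lookup table for letters that have no decomposition) is sketched below. The class name and the contents of the table are assumptions for illustration: the table is deliberately tiny, and characters with no mapping are simply dropped.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: reduces a name to the a-z range.
class NameSlug {
    // Letters that NFD does not decompose; this mapping is an illustrative guess.
    private static final Map<Character, String> EXTRA = Map.of(
        '\u00E6', "ae",   // ae ligature
        '\u00F8', "o",    // o with stroke
        '\u00DF', "ss",   // sharp s
        '\u0111', "d",    // d with stroke
        '\u0142', "l");   // l with stroke

    static String toAscii(String name) {
        // Step 1: lower-case, decompose, and strip combining marks (handles accents).
        String stripped = Normalizer
            .normalize(name.toLowerCase(Locale.ROOT), Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 2: keep a-z, map known extras, drop everything else.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            if (c >= 'a' && c <= 'z') {
                out.append(c);
            } else {
                String replacement = EXTRA.get(c);
                if (replacement != null) {
                    out.append(replacement);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u00D8yvind"));        // oyvind
        System.out.println(toAscii("B\u0103\u0219an"));    // basan
    }
}
```

Because decomposition runs before the lookup, the table only needs entries for characters that survive NFD unchanged; scripts with no Latin equivalent still come out empty, as warned above.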

Sentence Auto-Complete with Java

Let's say I have about 1000 sentences that I want to offer as suggestions when the user is typing into a field.
I was thinking about running lucene in memory search and then feeding the results into the suggestions set.
The trigger for running the searches would be space char and exit from the input field.
I intend to use this with GWT, so the client will just be getting the results from the server.
I don't want to do what Google is doing, where they complete each word and then make suggestions on each set of keywords. I just want to check the keywords and make suggestions based on that. Sort of like when I'm typing the title for a question here on Stack Overflow.
Has anyone done something like this before? Is there already a library I could use?
I was working on a similar solution. The paper titled "Effective Phrase Prediction" was quite helpful for me. You will have to prioritize the suggestions as well.
If you've only got 1000 sentences, you probably don't need a powerful indexer like lucene. I'm not sure whether you want to do "complete the sentence" suggestions or "suggest other queries that have the same keywords" suggestions. Here are solutions to both:
Assuming that you want to complete the sentence input by the user, then you could put all of your strings into a SortedSet, and use the tailSet method to get a list of strings that are "greater" than the input string (since the string comparator considers a longer string A that starts with string B to be "greater" than B). Then, iterate over the top few entries of the set returned by tailSet to create a set of strings where the first inputString.length() characters match the input string. You can stop iterating as soon as the first inputString.length() characters don't match the input string.
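The "complete the sentence" approach just described could be sketched roughly as follows. The class and method names are invented for the example, and everything is lower-cased on the assumption that case-insensitive matching is acceptable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical prefix suggester backed by a sorted set.
class PrefixSuggester {
    private final TreeSet<String> sentences = new TreeSet<>();

    void add(String sentence) {
        sentences.add(sentence.toLowerCase(Locale.ROOT));
    }

    List<String> suggest(String input, int limit) {
        String prefix = input.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        // tailSet(prefix) starts at the first string >= prefix; all strings that
        // begin with the prefix sit contiguously at the front of that view.
        for (String candidate : sentences.tailSet(prefix)) {
            if (result.size() >= limit || !candidate.startsWith(prefix)) {
                break; // past the block of matching strings, or enough results
            }
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        PrefixSuggester s = new PrefixSuggester();
        s.add("How do I parse JSON");
        s.add("How do I sort a list");
        s.add("What is a monad");
        System.out.println(s.suggest("how do", 5));
        // [how do i parse json, how do i sort a list]
    }
}
```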
If you want to do keyword suggestions instead of "complete the sentence" suggestions, then the overhead depends on how long your sentences are, and how many unique words there are in the sentences. If this set is small enough, you'll be able to get away with a HashMap<String,Set<String>>, where you mapped keywords to the sentences that contained them. Then you could handle multiword queries by intersecting the sets.
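A rough sketch of the keyword variant, under the assumption that words are split on non-word characters and compared in lower case; the class and method names are invented for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical keyword index: maps each word to the sentences containing it.
class KeywordSuggester {
    private final Map<String, Set<String>> index = new HashMap<>();

    void add(String sentence) {
        for (String word : sentence.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(sentence);
            }
        }
    }

    Set<String> suggest(String query) {
        Set<String> result = null;
        for (String word : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) {
                continue;
            }
            Set<String> hits = index.getOrDefault(word, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);   // copy, so the index is not modified
            } else {
                result.retainAll(hits);         // intersect for multiword queries
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        KeywordSuggester k = new KeywordSuggester();
        k.add("Sort a list in Java");
        k.add("Parse JSON in Java");
        k.add("Sort a map by value");
        System.out.println(k.suggest("java sort")); // [Sort a list in Java]
    }
}
```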
In both cases, I'd probably convert all strings to lower case first (assuming that's appropriate in your application). I don't think either solution would scale to hundreds of thousands of suggestions either. Do either of those do what you want? Happy to provide code if you'd like it.

Regex for checking > 1 upper, lower, digit, and special char

^.*(?=.{15,})(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[!##$%^&+=]).*$
This is the regex I am currently using, which checks for at least one of each: upper, lower, digit, and specials of my choosing. My question is how to make it check for two of each. I also ask because it is seemingly difficult to write a test case for this, as I do not know whether it is only evaluating the first criterion. This is for a password; however, the requirement is that it be in regex form because of the package we are utilizing.
EDIT
Well, as it stands, in my haste to validate the expression I forgot to validate my string length. Thanks to Ken and Gumbo for helping me with this.
I do apologize, as regex is not my area.
The password I am using is the string "$$QiouWER1245"; the behavior I am experiencing at the moment is that it seems to pass or fail at random. Any thoughts on this?
This is the code I am executing:
Pattern pattern = Pattern.compile(regEx);
Matcher match = pattern.matcher(password);
while(match.find()){
System.out.println(match.group());
}
From what I see if it evaluates to true it will throw the value in password back to me else it is an empty string.
Personally, I think a password policy that forces use of all three character classes is not very helpful. You can get the same degree of randomness by letting people make longer passwords. Users will tend to get frustrated and write passwords down if they have to abide by too many password rules (which make the passwords too difficult to remember). I recommend counting bits of entropy and making sure they're greater than 60 (usually requires a 10-14 character password). Entropy per character would depend roughly on the number of characters, the range of character sets they use, and maybe how often they switch between character sets (I would guess that passwords like HEYthere are more predictable than heYThEre).
Another note: do you plan not to count the symbols to the right of the keyboard (period, comma, angle brackets, etc.)?
If you still have to find groups of two characters, why not just repeat each pattern? For example, make (?=.*\d) into (?=.*\d.*\d).
For your test cases, if you are worried that it would only check the first criterion, then write a test case that makes sure each of the following passwords fails (because one and only one of the criteria is not met in each case). Just for fun, I reversed the order of expectation of each character set, though it probably won't make a difference unless someone removes or forgets the ?= at some future date.
!##TESTwithoutnumbers
TESTwithoutsymbols123
&*(testwithoutuppercase456
+_^TESTWITHOUTLOWERCASE3498
I should point out that technically none of these passwords should be acceptable because they use dictionary words, which have about 2 bits of entropy per character instead of something more like 6. However, I realize that it's difficult to write a (maintainable and efficient) regular expression to check for dictionary words.
Try this:
"^(?=(?:\\D*\\d){2})(?=(?:[^a-z]*[a-z]){2})(?=(?:[^A-Z]*[A-Z]){2})(?=(?:[^!##$%^&*+=]*[!##$%^&*+=]){2}).{15,}$"
Here non-capturing groups (?:…) are used to group the conditions and repeat them. I’ve also used the complement of each character class, instead of the universal dot (.), as an optimization.
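To see how that expression behaves, here is a small, hedged harness that runs it with matches() against the sample passwords that appear elsewhere in this thread. The class and method names are invented, and the special-character set is copied from the regex above with the duplicate # removed.

```java
import java.util.regex.Pattern;

// Hypothetical test harness for the "two of each" regex.
class TwoOfEachCheck {
    static final Pattern POLICY = Pattern.compile(
        "^(?=(?:\\D*\\d){2})"                          // at least two digits
        + "(?=(?:[^a-z]*[a-z]){2})"                    // at least two lower case letters
        + "(?=(?:[^A-Z]*[A-Z]){2})"                    // at least two upper case letters
        + "(?=(?:[^!#$%^&*+=]*[!#$%^&*+=]){2})"        // at least two listed specials
        + ".{15,}$");                                  // at least 15 characters overall

    static boolean accepts(String password) {
        return POLICY.matcher(password).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "$$QiouWER1245",              // only 13 characters, so the length check rejects it
            "$$QiouWER1245xy",            // 15 characters, two or more of every class
            "!##TESTwithoutnumbers",
            "TESTwithoutsymbols123",
            "&*(testwithoutuppercase456",
            "+_^TESTWITHOUTLOWERCASE3498"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Only the second sample is accepted; each of the others misses exactly one requirement.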
If I understand your question correctly, you want at least 15 characters, and to require at least 2 uppercase characters, at least 2 lowercase characters, at least 2 digits, and at least 2 special characters. In that case you could do it like this:
^.*(?=.{15,})(?=.*\d.*\d)(?=.*[a-z].*[a-z])(?=.*[A-Z].*[A-Z])(?=.*[!##$%^&*+=].*[!##$%^&*+=]).*$
BTW, your original regex had an extra backslash before the \d
I'm not sure that one big regex is the right way to go here. It already looks far too complicated and will be very difficult to change in the future.
My suggestion is to structure the code in the following way:
check that the string has 2 lower case characters
return failure if not found or continue
check that the string has 2 upper case characters
return failure if not found or continue
etc.
This will also allow you to pass out a return code or errors string specifying why the password was not accepted and the code will be much simpler.
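The step-by-step structure suggested above might look something like this in Java. It is a sketch under assumptions: the names are invented, the limits (15 characters, two per class) and the special-character set are taken from the regexes earlier in the thread, and only ASCII letters and digits are counted.

```java
import java.util.Optional;

// Hypothetical validator that reports the first rule a password breaks.
class PasswordPolicy {
    static final String SPECIALS = "!#$%^&*+=";
    static final int MIN_LENGTH = 15;
    static final int MIN_PER_CLASS = 2;

    /** Returns an error message for the first failed check, or empty if all pass. */
    static Optional<String> firstViolation(String password) {
        if (password.length() < MIN_LENGTH) {
            return Optional.of("must be at least " + MIN_LENGTH + " characters long");
        }
        int lower = 0, upper = 0, digits = 0, specials = 0;
        for (char c : password.toCharArray()) {
            if (c >= 'a' && c <= 'z') lower++;
            else if (c >= 'A' && c <= 'Z') upper++;
            else if (c >= '0' && c <= '9') digits++;
            else if (SPECIALS.indexOf(c) >= 0) specials++;
        }
        if (lower < MIN_PER_CLASS) return Optional.of("needs at least 2 lower case characters");
        if (upper < MIN_PER_CLASS) return Optional.of("needs at least 2 upper case characters");
        if (digits < MIN_PER_CLASS) return Optional.of("needs at least 2 digits");
        if (specials < MIN_PER_CLASS) return Optional.of("needs at least 2 special characters");
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstViolation("$$QiouWER1245").orElse("ok"));
        // must be at least 15 characters long
        System.out.println(firstViolation("$$QiouWER1245xy").orElse("ok"));
        // ok
    }
}
```

Returning the reason rather than a bare boolean is what makes it possible to tell the user why a password was rejected.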
