I have a string description of a company that was hand-typed by different users and is badly written. Here is an example (focus on the dots, spaces, first letters, etc.):
XXXX is a Global menagement consulting,Technology services and
outsourcing company, with 257000people serving clients in more than
120 countries.. combining unparalleled experience, comprehensive
capabilities across all industries and business functions,and
extensive research on the worlds most successfull companies, XXXX
collaborates with clients to help them become high-performance
businesses and governments., the company generated net revenues of
US$27.9 Billion for the fiscal year ended 31.07.2012..
Now what I want is to format the string into a somewhat nicer version, like this:
XXXX is a global management consulting, technology services and
outsourcing company, with 257,000 people serving clients in more than
120 countries. Combining unparalleled experience, comprehensive
capabilities across all industries and business functions, and
extensive research on the world’s most successful companies, XXXX
collaborates with clients to help them become high-performance
businesses and governments. The company generated net revenues of
US$27.9 billion for the fiscal year ended Aug. 31, 2012.
My question is: is there any library with predefined methods that could do all the spelling corrections, unneeded-space removal, etc.?
So far, I do it by replacing things like " ," with ", " and calling toUpperCase() if there is a "." in front, etc.:
desc = desc.replace("  ", " ");
desc = desc.replace("..", ".");
desc = desc.replace(" .", ".");
desc = desc.replace(" ,", ", ");
desc = desc.replace(".,", ".");
desc = desc.replace(",.", ".");
desc = desc.replace(", .", ".");
desc = desc.replace("*", "");
I'm sure there is a cleaner and better way to do this. Using regex, maybe?
Any solution would be appreciated.
If I were trying to solve your problem, I would probably read the text one character at a time and format it as I go. For example, in pseudocode:
while (has more chars) {
    char letter = readChar();
    if (letter == ',') {
        // checking for the ',.' combination
        letter = readChar();
        if (letter == '.') {
            // write out a '.' only
            out.print('.');
        }
        else {
            // it wasn't the ',.' combination, so you need to output both characters, whatever they are
            out.print(',');
            out.print(letter);
        }
    }
    else if (another letter you want to filter) {
        // etc.
    }
    else {
        // doesn't match any of the filters, so just output the letter
        out.print(letter);
    }
}
Basically if you read the text 1 char at a time, you can detect any of your chosen formatting problems as you go, and correct them immediately. This provides a performance improvement, as you're only reading over the text string once (not 8 times, like you are currently doing), and allows you to add as many different/complex formatting changes as you want. The downside, however, is that you need to write the logic yourself rather than relying on in-built functions.
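To make the single-pass idea concrete, here is a minimal Java sketch. The class name TextTidy and the exact rules are my own assumptions, not something from the question: a cluster of dots, commas and spaces collapses to one mark (a dot wins if the cluster contains one), runs of whitespace collapse to one space, the letter after a dot is capitalized, and a dot or comma between two digits (as in 27.9 or 31.07.2012) is left alone. It does not attempt spelling correction or number grouping.

```java
final class TextTidy {

    /** Single pass over the input; the rules are illustrative assumptions, not a complete formatter. */
    static String tidy(String in) {
        StringBuilder out = new StringBuilder();
        boolean capitalizeNext = false;
        int n = in.length();
        int i = 0;
        while (i < n) {
            char c = in.charAt(i);
            if (c == '.' || c == ',') {
                // Keep decimal and date separators such as "27.9" or "31.07.2012" untouched.
                boolean betweenDigits = i > 0 && i + 1 < n
                        && Character.isDigit(in.charAt(i - 1))
                        && Character.isDigit(in.charAt(i + 1));
                if (betweenDigits) {
                    out.append(c);
                    i++;
                    continue;
                }
                // Consume the whole cluster of dots, commas and whitespace.
                boolean sawDot = false;
                while (i < n && isClusterChar(in.charAt(i))) {
                    sawDot |= in.charAt(i) == '.';
                    i++;
                }
                // Drop a space that was written just before the punctuation.
                while (out.length() > 0 && out.charAt(out.length() - 1) == ' ') {
                    out.setLength(out.length() - 1);
                }
                out.append(sawDot ? '.' : ',');
                if (i < n) {
                    out.append(' ');
                }
                capitalizeNext = sawDot;
            } else if (Character.isWhitespace(c)) {
                // Collapse runs of whitespace into a single space.
                if (out.length() > 0 && out.charAt(out.length() - 1) != ' ') {
                    out.append(' ');
                }
                i++;
            } else {
                out.append(capitalizeNext ? Character.toUpperCase(c) : c);
                capitalizeNext = false;
                i++;
            }
        }
        return out.toString().trim();
    }

    private static boolean isClusterChar(char ch) {
        return ch == '.' || ch == ',' || Character.isWhitespace(ch);
    }
}
```

Calling TextTidy.tidy(desc) would replace the chain of replace() calls; any extra rule becomes another branch in the loop.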
I'm working on a Java Program which takes a question from a user, sends it to the Wolfram Alpha API and then cleans up the result and prints it.
If the user asks the question "Who is the President of the USA?" the result is as follows
Response: <section><title>Input interpretation</title> <sectioncontents>United States | President</sectioncontents></section><section><title>Result</title><sectioncontents>Barack Obama (from 20/01/2009 to present)</sectioncontents></section><section><title>Basic information</title><sectioncontents>official position | President (44th)..........etc
I would like to extract "Barack Obama (from 20/01/2009 to present)".
I have been able to trim everything before "Barack" using the code below:
String clean = response.substring(response.indexOf("Result") + 31, response.length());
System.out.println("Response: " + clean);
How would I trim the rest of the result?
Well, in case it helps, I came up with this regex:
Result.+?>([^<]+?)<
After finding "Result" it captures the first instance of > and < with at least one character between them.
UPDATE
Below is some sample code that might be helpful:
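import java.util.regex.Matcher;
import java.util.regex.Pattern;

String response = "Response: <section><title>...";
// After "Result", capture the first non-empty text that sits between a '>' and a '<'.
Pattern pattern = Pattern.compile("Result.+?>([^<]+?)<");
Matcher match = pattern.matcher(response);
String clean = "";
if (match.find()) {
    clean = match.group(1);
}
System.out.println(clean);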
The response is essentially XML.
As has been discussed endlessly in many programming fora, regular expressions are not suitable for parsing XML - you should use an XML parser.
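As a rough sketch of the parser route, the JDK's built-in DOM and XPath classes are enough. This assumes the complete response is well-formed (the sample in the question is truncated), that everything before the first '<' is just a label, and that wrapping the sibling section elements in a made-up root element is acceptable; the class name ResultExtractor is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ResultExtractor {

    /** Returns the text of the section titled "Result", or an empty string if there is none. */
    static String extractResult(String response) {
        try {
            // Assumption: anything before the first '<' (such as "Response: ") is not markup.
            String fragment = response.substring(response.indexOf('<'));
            // The sections are siblings without a common parent, so add an artificial root.
            String xml = "<root>" + fragment + "</root>";
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select the contents of the section whose title is exactly "Result".
            return xpath.evaluate("/root/section[title='Result']/sectioncontents", doc).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Response is not well-formed XML", e);
        }
    }
}
```

Unlike the substring offset of 31, this does not depend on the exact length of the tags between "Result" and the answer.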
I am displaying a table of User objects. The displayed fields are:
User.firstName
User.lastName
User.email
but they are displayed using user.toString(), which results in the following output:
Gordon, Tomas (gordon.tomas#company.com)
Hanks, Jessica (hanks.jessica#company.com)
I want to have a filter on this list to allow people to search for specific users. These are the requirements:
1) 1 search field only
2) generic text input
Currently I do the following to update the list, where user is the input:
def user // input as string from the search field
def potentialUsers = User.withCriteria {
    or {
        ilike("firstName", '%' + user + '%')
        ilike("lastName", '%' + user + '%')
        ilike("email", '%' + user + '%')
    }
}
This works very well when the input is a single word.
However, I expect that people will search like this:
'tom'
'gordon tomas'
'jessica#company hanks'
'tomas gordon'
... and so on
The best solution in my eyes would be to search directly in toString(), but I have not figured out how to do so.
Any ideas on how to filter that correctly?
Basically you have two options here: do it quick or do it right.
Quick) Add a field to your domain class that holds the concatenation of the field values you want to search, such as User.concatenated = 'Gordon Tomas gordon.tomas#company.com'. Then you can run your search like this:
def potentialUsers = User.withCriteria {
    user.split( /\s+/ ).each{
        ilike 'concatenated', '%' + it + '%'
    }
}
Right) Use Lucene or a proper Lucene-based full-text search framework, such as Hibernate Search, a Grails search plugin, or Elasticsearch, to index your fields so that you can run complex multi-word queries.
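To show what the quick option matches, here is a small in-memory Java sketch of the same rule: the input is split on whitespace and every token must occur, case-insensitively, somewhere in the concatenated value. The class and method names are made up for illustration, and this runs in plain Java rather than against the database.

```java
import java.util.Locale;

final class UserFilter {

    /** True if every whitespace-separated token of the query occurs in the concatenated value. */
    static boolean matches(String concatenated, String query) {
        String haystack = concatenated.toLowerCase(Locale.ROOT);
        for (String token : query.trim().split("\\s+")) {
            // An empty query yields one empty token, which matches everything.
            if (!token.isEmpty() && !haystack.contains(token.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }
}
```

Because each token is checked on its own, the order of the words in the search field does not matter, which covers inputs like 'tomas gordon' and 'jessica#company hanks'.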
Our project contains many statements in the method chaining fluent style:
int totalCount = ((Number) em
.createQuery("select count(up) from UserPermission up where " +
"up.database.id = :dbId and " +
"up.user.id <> :currentUserId ")
.setParameter("dbId", cmd.getDatabaseId())
.setParameter("currentUserId", currentUser.getId())
.getSingleResult())
.intValue();
I've got checkstyle mostly configured to match our existing code style, but now it's failing on these snippets, preferring instead:
int totalCount = ((Number) em
.createQuery("select count(up) from UserPermission up where " +
"up.database.id = :dbId and " +
"up.user.id <> :currentUserId ")
.setParameter("dbId", cmd.getDatabaseId())
.setParameter("currentUserId", currentUser.getId())
.getSingleResult())
.intValue();
Which is totally inappropriate. Is there any way to configure Checkstyle to accept the method-chaining style? Is there an alternative tool I can run from Maven to enforce this kind of indentation?
I never got this to work in Eclipse, so we barely use Format Source. In the end it is often best to extend. We tried hard about a year and a half ago and failed. Now we use Eclipse's formatter only on selected lines, or as a preformatting step before we format by hand.
Usually the formatting done by an engineer carries a certain meaning, so automatic formatting will never fully work, especially if you do something like this:
public static void myMethod(
int value, String value2, String value3)
If you autoformat this, it fails in much the same way as your example.
So feel free to join the club of people who use automatic formatting only as a first step before formatting the human way.
With IntelliJ, it can be done by selecting "align when multiline" under "method chain calls", so I guess this property is misconfigured in your settings.
I use Java and a regexp.
I've made a regexp for password validation:
String PASSWORD_PATTERN_ADVANCED = "^(?=.*\\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[\\\\##$¤£µ§%&<>,.!:?;~{-|`'_^¨éèçàù)=}()°\"\\]\\[²³*/+]).{8,20}$";
or without the extra Java string escaping:
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[\\##$¤£µ§%&<>,.!:?;~{-|`'_^¨éèçàù)=}()°"\]\[²³*/+]).{8,20}$
which means (I may be wrong): at least one digit, at least one lowercase letter, at least one uppercase letter, at least one of the special characters listed, and a total length between 8 and 20.
I made a test case that generates passwords for success and failure:
success -> OK, all passed
failure -> Almost OK ...
The only passwords that fail to fail :D are the ones with a space in them, like:
iF\ !h6 2A3|Gm
¨I O7 gZ2%L£k vd~39
2< A Uw a7kEw6,6S^
cC2c5N#
6L kIw~ Béj7]5
ynRZ #44ç
9A `sè53Laj A
s²R[µ3 9UrR q8n
I am puzzled.
Any thoughts on how to make it work?
Thanks
A regex may not be the right tool for the job here.
Regexes are best suited for matching patterns; what you're describing isn't really a pattern, per se; it's more of a rule set. Sure, you may be able to create some regex that helps, but it's a really complex and opaque piece of code, which makes maintenance a challenge.
A method like this might be a better fit:
public boolean isValidPassword(String password) {
    // Locals must be initialized before use; Java has no ||= operator, so |= is used instead.
    boolean containsLowerCase = false;
    boolean containsUpperCase = false;
    boolean containsInvalid = false;
    boolean containsSpecialChar = false;
    boolean containsDigit = false;
    for (char c : password.toCharArray()) {
        containsLowerCase |= Character.isLowerCase(c);
        containsUpperCase |= Character.isUpperCase(c);
        containsDigit |= Character.isDigit(c);
        containsSpecialChar |= someMethodForDetectingIfItIsSpecial(c);
        // Assumption: whitespace is the invalid case from the question; extend as needed.
        containsInvalid |= Character.isWhitespace(c);
    }
    return containsLowerCase &&
        containsUpperCase &&
        containsSpecialChar &&
        containsDigit &&
        !containsInvalid &&
        password.length() >= 8 && password.length() <= 20;
}
You'd need to decide the best way to detect a special character (specialCharArray.contains(c), regular expression, etc).
However, this approach would make adding new rules a lot simpler.
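For a version that compiles on its own, here is one possible sketch. The SPECIAL_CHARS constant is a reduced ASCII set I picked purely for illustration (substitute the full list from the question), and treating every character outside the four classes as invalid is my assumption about the intended rule.

```java
final class PasswordRules {

    // Assumption: a shortened special-character set for illustration only.
    private static final String SPECIAL_CHARS = "!@#$%*?";

    static boolean isValidPassword(String password) {
        if (password == null || password.length() < 8 || password.length() > 20) {
            return false;
        }
        boolean lower = false;
        boolean upper = false;
        boolean digit = false;
        boolean special = false;
        for (char c : password.toCharArray()) {
            if (Character.isLowerCase(c)) {
                lower = true;
            } else if (Character.isUpperCase(c)) {
                upper = true;
            } else if (Character.isDigit(c)) {
                digit = true;
            } else if (SPECIAL_CHARS.indexOf(c) >= 0) {
                special = true;
            } else {
                // Anything else, including a space, is rejected outright.
                return false;
            }
        }
        return lower && upper && digit && special;
    }
}
```

A new rule is then one more flag or one more branch, and each rule can be unit-tested separately.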
I may be wrong, but if you simply don't want spaces, use [^\\s] instead of . in your lookaheads, and in the final quantifier as well; otherwise a space that comes after the required characters still matches.
String PASSWORD_PATTERN_ADVANCED =
    "^(?=[^\\s]*\\d)"
    + "(?=[^\\s]*[a-z])"
    + "(?=[^\\s]*[A-Z])"
    + "(?=[^\\s]*[\\\\##$¤£µ§%&<>,.!:?;~{-|`'_^¨éèçàù)=}()°\"\\]\\[²³*/+])"
    + "[^\\s]{8,20}$";
None of your conditions are stating what can't be in the password, only what must. You need one more condition that combines all the possible valid characters and makes sure all characters in the password are in that list (i.e., (\d|[a-z]|[A-Z]|##$...){8,20} as the final condition). Either that or a list of rejected characters.
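A sketch of that whitelist idea, using a deliberately shortened special-character set (my assumption, to keep the pattern readable): the lookaheads still state what must be present, and the final character class states what may be present at all.

```java
import java.util.regex.Pattern;

final class PasswordPattern {

    // Assumption: "!@#$%*?" stands in for the full special-character list from the question.
    private static final Pattern WHITELISTED = Pattern.compile(
            "^(?=.*\\d)"                    // at least one digit
            + "(?=.*[a-z])"                 // at least one lowercase letter
            + "(?=.*[A-Z])"                 // at least one uppercase letter
            + "(?=.*[!@#$%*?])"             // at least one special character
            + "[A-Za-z\\d!@#$%*?]{8,20}$"); // only whitelisted characters, 8 to 20 of them

    static boolean isValid(String password) {
        return WHITELISTED.matcher(password).matches();
    }
}
```

Since a space is not in the final character class, a password such as "aB3! aB3!" no longer matches.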
I have data in a database in the format below:
a:19:{s:9:"raceclass";a:5:{i:0;a:1:{i:0;s:7:"250cc B";}i:1;a:1:{i:1;s:6:"OPEN B";}i:2;a:1:{i:2;s:9:"Plus 25 B";}i:3;a:1:{i:3;s:8:"Vet 30 B";}i:4;a:1:{i:4;s:7:"Vintage";}}s:9:"firstname";a:1:{i:0;a:1:{i:0;s:5:"James";}}s:12:"middle_FIELD";a:1:{i:0;a:1:{i:0;s:1:"R";}}s:8:"lastname";a:1:{i:0;a:1:{i:0;s:9:"Slaughter";}}s:5:"email";a:1:{i:0;a:1:{i:0;s:29:"jslaughter#xtrememxseries.com";}}s:8:"address1";a:1:{i:0;a:1:{i:0;s:18:"21 DiMartino Court";}}s:4:"city";a:1:{i:0;a:1:{i:0;s:6:"Walden";}}s:5:"state";a:1:{i:0;a:1:{i:0;s:8:"New York";}}s:3:"zip";a:1:{i:0;a:1:{i:0;s:5:"12586";}}s:7:"country";a:1:{i:0;a:1:{i:0;s:13:"United States";}}s:6:"gender";a:1:{i:0;a:1:{i:0;s:4:"Male";}}s:3:"dob";a:1:{i:0;a:1:{i:0;s:10:"06/04/1974";}}s:5:"phone";a:1:{i:0;a:1:{i:0;s:12:"845-713-4421";}}s:5:"skill";a:1:{i:0;a:1:{i:0;s:12:" AMATEUR (B)";}}s:11:"ridernumber";a:1:{i:0;a:1:{i:0;s:2:"69";}}s:8:"bikemake";a:1:{i:0;a:1:{i:0;s:3:"HON";}}s:8:"enginecc";a:1:{i:0;a:1:{i:0;s:3:"450";}}s:9:"amanumber";a:1:{i:0;a:1:{i:0;s:7:"1094649";}}s:10:"amaexpdate";a:1:{i:0;a:1:{i:0;s:5:"03/12";}}}
How can I write a regular expression to manipulate the above string to get data in the following format?:
raceclass - 250cc B, OPEN B, Plus 25 B, Vet30, Vintage
firstname - James
middle_FIELD - R
address1 = 21 DiMartino Court
city - walden
state - New york
zip - 12586
country - United States
gender - Male
dob - 06/04/1974
phone - 845-713-4421
skill - AMATEUR (B)
ridernumber - 69
bikemake - HON
enginecc - 450
amanumber - 1094649
amaexpdate - 03/12
This data isn't suitable for a regular expression. You should use a proper parser with a proper grammar for handling this string. There are several good options for that in Java, such as ANTLR.
Alternatively, if that is not an option, it looks like you only want the text between double quotes. Take a look at the Java class Scanner; you should be able to get something working with it. Scan through the string looking for a ". When you find one, gather text into a buffer until the closing ", then skip everything until the next opening " or the end of the input.
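The data looks like the output of PHP's serialize(), and for the three value types that appear in it (a: arrays, s: strings, i: integers) a hand-written recursive reader is fairly short. The sketch below is only that: the class name is made up, it handles no other value types, and it assumes the declared string lengths equal character counts, which holds for ASCII data like the sample.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PhpSerializedReader {
    private final String src;
    private int pos = 0;

    PhpSerializedReader(String src) {
        this.src = src;
    }

    /** Reads one value at the current position: i:N; or s:LEN:"..."; or a:COUNT:{key value ...}. */
    Object readValue() {
        char type = src.charAt(pos);
        pos += 2; // skip the type letter and the ':' after it
        switch (type) {
            case 'i': {
                int end = src.indexOf(';', pos);
                long number = Long.parseLong(src.substring(pos, end));
                pos = end + 1;
                return number;
            }
            case 's': {
                int colon = src.indexOf(':', pos);
                int length = Integer.parseInt(src.substring(pos, colon));
                int start = colon + 2; // skip ':' and the opening quote
                String text = src.substring(start, start + length);
                pos = start + length + 2; // skip the closing quote and ';'
                return text;
            }
            case 'a': {
                int colon = src.indexOf(':', pos);
                int count = Integer.parseInt(src.substring(pos, colon));
                pos = colon + 2; // skip ':' and '{'
                Map<Object, Object> map = new LinkedHashMap<>();
                for (int k = 0; k < count; k++) {
                    Object key = readValue();
                    map.put(key, readValue());
                }
                pos++; // skip '}'
                return map;
            }
            default:
                throw new IllegalArgumentException("Unsupported type '" + type + "'");
        }
    }

    /** Collects every non-array value below the given node, in document order. */
    static void collectLeaves(Object node, List<String> out) {
        if (node instanceof Map) {
            for (Object child : ((Map<?, ?>) node).values()) {
                collectLeaves(child, out);
            }
        } else {
            out.add(String.valueOf(node));
        }
    }

    /** Produces one "key - value, value" line per top-level entry. */
    static String format(String serialized) {
        Map<?, ?> top = (Map<?, ?>) new PhpSerializedReader(serialized).readValue();
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<?, ?> entry : top.entrySet()) {
            List<String> leaves = new ArrayList<>();
            collectLeaves(entry.getValue(), leaves);
            sb.append(entry.getKey()).append(" - ").append(String.join(", ", leaves)).append('\n');
        }
        return sb.toString();
    }
}
```

PhpSerializedReader.format(data) yields lines of the form "key - value, value"; values are emitted as stored, so case and leading spaces are not normalized.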