I want to obtain the context of an acronym. Can you please help me with the regular expression?
I am looping over the text (a String) and looking for dots. After each match I try to get the context of the acronym that was found, so that I can do some further processing, but I can't get the context. I need at least 5 words before and 5 words after the acronym.
//Pattern to match each word ending with dot
Pattern pattern = Pattern.compile("(\\w+)\\b([.])");
Matcher matchDot = pattern.matcher(textToCorrect);
while (matchDot.find()) {
System.out.println("zkratka ---"+matchDot.group()+" ---");
//5 words before and after the match = context
// Matcher matchContext = Pattern.compile("(.{25})("+matchDot.group()+")(.{25})").matcher(textToCorrect);
Pattern patternContext = Pattern.compile("(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,10}"+matchDot.group()+"(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,10}");
Matcher matchContext = patternContext.matcher(textToCorrect);
if (matchContext.find()) {
System.out.println("context: "+matchContext.group()+" :");
// System.out.println("context: "+matchContext.group(1)+" :");
// System.out.println("context: "+matchContext.group(2)+" :");
}
}
Example:
input:
Some 84% of Paris residents see fighting pol. as a priority and 54% supported a diesel ban in the city by 2020, according a poll carried out for the Journal du Dimanche.
output:
The first regex will find pol.
The second regex will find "of Paris residents see fighting pol. as a priority and 54%"
Another example with more text:
I need to loop through this once and, every time I match an acronym, get the context of that particular acronym. After that I do some data mining on it. Here's the original text:
neklidná nemocná, vyš. je možné provést pouze nativně
Na mozku je patrna hyperdenzita v počátečním úseku a. cerebri media
vlevo, vlevo se objevuje již smazání hranic mezi bazálními ganglii a
okolní bílou hmotou a mírná difuzní hypointenzita v periventrikulární
bílé hmotě. Kromě těchto čerstvých změn jsou patrné staré
postmalatické změny temporálně a parietookcipitálně vlevo. Oboustranně
jsou patrné vícečetné vaskulární mikroléze v centrum semiovale bilat.
Nejsou známky nitrolebního krvácení. skelet kalvy orientačně nihil tr.
Z á v ě r: Známky hyperakutní ischemie v povodí ACM vlevo, staré
postmalatickéé změny T,P a O vlevo, vaskulární mikroléze v centrum
semiovale bilat.
CT AG: vyš. po bolu k.l..
Po zklidnění nemocné se podařilo provést CT AG. Na krku je naznačený
kinkink na ACC vlevo a ACI vlevo pod bazí. Kalcifikace v karotických
sifonech nepůsobí hemodynamicky významné stenozy. Intrakraniálně je
patrný konický uzávěr operkulárního úseku a. cerebri media vlevo pro
parietální lalok. Ostatní nález na intrakraniálním tepenném řečišti je
v mezích normy.
Z á v ě r: uzávěr operkulárního úseku a. cerebri media vlevo.
Of course, if it also matches the end of a sentence, that is OK for me :-) The point is to find all the acronyms, even those right before a newline (\n).
I would try this out:
(?:\w+\W+){5}((?:\w.?)+)(?:\w+\W+){5}
Bear in mind, though, that natural language processing with regular expressions can never be fully accurate.
((?:[\w!@#$%&*]+\s+){5}([\w!@#$%&*]+\.)(?:\s+[\w!@#$%&*]+){5})
Try this. See the demo:
https://regex101.com/r/aQ3zJ3/9
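As an alternative to one large regex, here is a minimal sketch that finds each dotted word with a small pattern and then collects the surrounding words by splitting on whitespace. This is a different technique from the patterns above, not a reconstruction of them. The class name AcronymContextSketch and the method contexts are made up for illustration, and treating every "word followed by a dot" as an acronym is an assumption taken from the question, so sentence ends are reported too. It re-splits the text for every match, which is fine for short reports but not tuned for large inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AcronymContextSketch {
    // A word followed by a dot. UNICODE_CHARACTER_CLASS lets \w match
    // accented letters such as those in the Czech sample text.
    private static final Pattern DOTTED_WORD =
            Pattern.compile("\\w+\\.", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns, for every dotted word, up to `words` whitespace-separated
    // tokens before it, the dotted word itself, and up to `words` tokens after it.
    static List<String> contexts(String text, int words) {
        List<String> result = new ArrayList<>();
        Matcher m = DOTTED_WORD.matcher(text);
        while (m.find()) {
            String[] before = tokens(text.substring(0, m.start()));
            String[] after = tokens(text.substring(m.end()));
            List<String> parts = new ArrayList<>();
            parts.addAll(Arrays.asList(before)
                    .subList(Math.max(0, before.length - words), before.length));
            parts.add(m.group());
            parts.addAll(Arrays.asList(after)
                    .subList(0, Math.min(words, after.length)));
            result.add(String.join(" ", parts));
        }
        return result;
    }

    // Splits on any whitespace (including newlines); empty input gives no tokens.
    private static String[] tokens(String s) {
        String t = s.trim();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "Some 84% of Paris residents see fighting pol. as a priority and 54% supported a diesel ban.";
        for (String context : contexts(text, 5)) {
            System.out.println("context: " + context);
        }
    }
}
```

Because the context is built from tokens rather than from a fixed {5} quantifier, matches near the start or end of the text still produce a (shorter) context, and newlines between words do not matter.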
I want to print Armenian month names, but it doesn't work. This is my code:
Locale loc = new Locale("hy");
Calendar cal = Calendar.getInstance(loc);
System.out.println(cal.getDisplayName(Calendar.MONTH, Calendar.LONG_STANDALONE, loc));
I have tried many other abbreviations such as "hye" or "arm", but nothing works. Other languages, such as Russian ("ru"), work fine. I have no idea what I'm doing wrong.
JDK 8 incorporated the CLDR's XML-based locale data into the release, but it is disabled by default.
So if you run your code with the argument -Djava.locale.providers=CLDR, or set the java.locale.providers system property in your code, hy: Armenian and hy_AM: Armenian will be supported.
With the JDK 9 enhancements, CLDR locale data is enabled by default, so the code will run without setting any system property.
Hope this helps.
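A minimal sketch for checking this, assuming the program is started with -Djava.locale.providers=CLDR on JDK 8 or is run on JDK 9 or later. The class name MonthNameSketch and the helper monthName are hypothetical, and the fallback to Calendar.LONG is my own addition for locale data that has no standalone form.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

public class MonthNameSketch {
    // Returns the standalone long month name for the given locale,
    // falling back to the regular long form if no standalone form is available.
    static String monthName(Locale locale, int month) {
        // Day 1 avoids any roll-over into the following month.
        Calendar cal = new GregorianCalendar(2020, month, 1);
        String name = cal.getDisplayName(Calendar.MONTH, Calendar.LONG_STANDALONE, locale);
        if (name == null || name.isEmpty()) {
            name = cal.getDisplayName(Calendar.MONTH, Calendar.LONG, locale);
        }
        return name;
    }

    public static void main(String[] args) {
        Locale armenian = Locale.forLanguageTag("hy");
        // Prints Armenian names only when the CLDR locale data is active.
        for (int month = Calendar.JANUARY; month <= Calendar.DECEMBER; month++) {
            System.out.println(monthName(armenian, month));
        }
    }
}
```

On JDK 8 the run command would look like java -Djava.locale.providers=CLDR MonthNameSketch. Passing the flag on the command line is the safer choice, since a property set in code only takes effect if it is set before the locale providers are first used.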
After browsing Oracle's website I found a list of supported languages and locale IDs. It seems the language you want is not supported by the JDK 7 Locale.
http://www.oracle.com/technetwork/java/javase/javase7locales-334809.html
This language is not supported, but you can create your own locale by following this guide.
This is the javadoc of Locale.Builder
https://docs.oracle.com/javase/8/docs/api/java/util/Locale.Builder.html
The answer of @Pallavi is correct for Java 8 and Java 9.
However, if you are on Java 7, you could set up your own DateFormatSymbolsProvider specialized for the Armenian language via the service loader mechanism.
You will need a file in the META-INF/services subdirectory with exactly this name:
META-INF/services/java.text.spi.DateFormatSymbolsProvider
The file should contain a line like this (please adjust the name to the real implementation class of the service provider mentioned above):
mypackage.MyImplementationOfDateFormatSymbolsProvider
As soon as you have created a jar library that includes this META-INF substructure, the new service provider for Armenian will be queried, too.
As for the required text resources, I have imported the CLDR v30 resources into my own library Time4J. Maybe you can benefit from its resource file for Armenian (which also contains standalone forms for month names) and use part of the content for your own service provider.
With the following code you can print out all supported Calendar locales (sorted by languageTag):
Locale[] locales = Calendar.getAvailableLocales();
Arrays.sort(locales, Comparator.comparing(Locale::toLanguageTag));
for (Locale locale : locales)
System.out.print(" " + locale.toLanguageTag());
Unfortunately, in my Oracle Java 8, there is no Armenian locale (beginning with "hy") in this list.
ar ar-AE ar-BH ar-DZ ar-EG ar-IQ ar-JO ar-KW ar-LB ar-LY ar-MA ar-OM ar-QA ar-SA ar-SD ar-SY ar-TN ar-YE be be-BY bg bg-BG ca ca-ES cs cs-CZ da da-DK de de-AT de-CH de-DE de-GR de-LU el el-CY el-GR en en-AU en-CA en-GB en-IE en-IN en-MT en-NZ en-PH en-SG en-US en-ZA es es-AR es-BO es-CL es-CO es-CR es-CU es-DO es-EC es-ES es-GT es-HN es-MX es-NI es-PA es-PE es-PR es-PY es-SV es-US es-UY es-VE et et-EE fi fi-FI fr fr-BE fr-CA fr-CH fr-FR fr-LU ga ga-IE he he-IL hi hi-IN hr hr-HR hu hu-HU id id-ID is is-IS it it-CH it-IT ja ja-JP ja-JP-u-ca-japanese-x-lvariant-JP ko ko-KR lt lt-LT lv lv-LV mk mk-MK ms ms-MY mt mt-MT nl nl-BE nl-NL nn-NO no no-NO pl pl-PL pt pt-BR pt-PT ro ro-RO ru ru-RU sk sk-SK sl sl-SI sq sq-AL sr sr-BA sr-CS sr-Latn sr-Latn-BA sr-Latn-ME sr-Latn-RS sr-ME sr-RS sv sv-SE th th-TH th-TH-u-nu-thai-x-lvariant-TH tr tr-TR uk uk-UA und vi vi-VN zh zh-CN zh-HK zh-SG zh-TW
Edit:
With Oracle Java 8 and the additional option -Djava.locale.providers=CLDR, as suggested in Pallavi's answer, the resulting list contains the Armenian locale ("hy"):
aa af af-NA agq ak am ar ar-AE ar-BH ar-DZ ar-EG ar-IQ ar-JO ar-KW ar-LB ar-LY ar-MA ar-OM ar-QA ar-SA ar-SD ar-SY ar-TN ar-YE as asa az az-Cyrl bas be be-BY bem bez bg bg-BG bm bn bn-IN bo br brx bs byn ca ca-ES cgg chr cs cs-CZ cy da da-DK dav de de-AT de-CH de-DE de-GR de-LI de-LU dje dua dyo dz ebu ee el el-CY el-GR en en-AU en-BE en-BW en-BZ en-CA en-Dsrt en-GB en-HK en-IE en-IN en-JM en-MT en-NA en-NZ en-PH en-PK en-SG en-TT en-US en-US-POSIX en-ZA en-ZW eo es es-419 es-AR es-BO es-CL es-CO es-CR es-CU es-DO es-EC es-ES es-GQ es-GT es-HN es-MX es-NI es-PA es-PE es-PR es-PY es-SV es-US es-UY es-VE et et-EE eu ewo fa fa-AF ff fi fi-FI fil fo fr fr-BE fr-CA fr-CH fr-FR fr-LU fur ga ga-IE gd gl gsw gu guz gv ha haw he he-IL hi hi-IN hr hr-HR hu hu-HU hy ia id id-ID ig ii is is-IS it it-CH it-IT ja ja-JP ja-JP-u-ca-japanese-x-lvariant-JP jmc ka kab kam kde kea khq ki kk kl kln km kn ko ko-KR kok ksb ksf ksh kw lag lg ln lo lt lt-LT lu luo luy lv lv-LV mas mer mfe mg mgh mk mk-MK ml mr ms ms-BN ms-MY mt mt-MT mua my naq nb nd ne ne-IN nl nl-BE nl-NL nmg nn nn-NO no no-NO nr nso nus nyn om or pa pa-Arab pl pl-PL ps pt pt-BR pt-PT rm rn ro ro-RO rof ru ru-RU ru-UA rw rwk saq sbp se seh ses sg shi shi-Tfng si sk sk-SK sl sl-SI sn so sq sq-AL sr sr-BA sr-CS sr-Cyrl-BA sr-Latn sr-Latn-BA sr-Latn-ME sr-Latn-RS sr-ME sr-RS ss ssy st sv sv-FI sv-SE sw sw-KE swc ta te teo th th-TH th-TH-u-nu-thai-x-lvariant-TH ti ti-ER tig tn to tr tr-TR ts twq tzm uk uk-UA und ur ur-IN uz uz-Arab uz-Latn vai vai-Latn ve vi vi-VN vun wae wal xh xog yav yo zh zh-CN zh-HK zh-Hans-HK zh-Hans-MO zh-Hans-SG zh-Hant zh-Hant-HK zh-Hant-MO zh-SG zh-TW zu
I'm attempting to parse a TSV from IMDB:
$hutter Battle of the Sexes (2017) (as $hutter Boy) [Bobby Riggs Fan] <10>
NVTION: The Star Nation Rapumentary (2016) (as $hutter Boy) [Himself] <1>
Secret in Their Eyes (2015) (uncredited) [2002 Dodger Fan]
Steve Jobs (2015) (uncredited) [1988 Opera House Patron]
Straight Outta Compton (2015) (uncredited) [Club Patron/Dopeman]
$lim, Bee Moe Fatherhood 101 (2013) (as Brandon Moore) [Himself - President, Passages]
For Thy Love 2 (2009) [Thug 1]
Night of the Jackals (2009) (V) [Trooth]
"Idle Talk" (2013) (as Brandon Moore) [Himself]
"Idle Times" (2012) {(#1.1)} (as Brandon Moore) [Detective Ryan Turner]
As you can see, some lines start with a tab and some do not. I want a map with the actor's name as the key and a list of movies as the value. The actor's name is followed by one or more tabs before the movie listing.
My code:
while ((line = reader.readLine()) != null) {
Matcher matcher = headerPattern.matcher(line);
boolean headerMatchFound = matcher.matches();
if (headerMatchFound) {
Logger.getLogger(ActorListParser.class.getName()).log(Level.INFO, "Header for actor list found");
String newline;
reader.readLine();
while ((newline = reader.readLine()) != null) {
String[] fullLine = null;
String actor;
String title;
Pattern startsWithTab = Pattern.compile("^\t.*");
Matcher tab = startsWithTab.matcher(newline);
boolean tabStartMatcher = tab.matches();
if (!tabStartMatcher) {
fullLine = newline.split("\t.*");
System.out.println("Actor: " + fullLine[0] +
"Movie: " + fullLine[1]);
}//this line will have code to match lines that start with tabs.
}
}
}
The way I've done this only works for a few lines before I get an ArrayIndexOutOfBoundsException. How can I parse the lines and split them into at most 2 strings if they contain one or more tabs?
There are subtleties in parsing tab/comma-delimited data files having to do with quoting and escaping.
To save yourself a lot of work, frustration and headaches, you really should consider using one of the existing CSV parsing libraries such as OpenCSV or Apache Commons CSV.
Posted as an answer instead of a comment because the OP has not stated a reason for reinventing the wheel and there are some tasks that really have been "solved" once and for all.
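For the specific layout shown in the question, a library-free approach is also possible. The following is only a minimal sketch under stated assumptions: a line that does not start with a tab begins a new actor, the name is separated from the first title by one or more tabs, and lines that start with a tab belong to the most recent actor. The class name ActorListSketch and the method parse are hypothetical. The key point is the limit argument of String.split, which caps the result at two parts, and the length check that avoids the ArrayIndexOutOfBoundsException when a line contains no tab at all.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ActorListSketch {
    // Builds a map from actor name to the list of that actor's titles.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> moviesByActor = new LinkedHashMap<>();
        String currentActor = null;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank separator lines
            }
            if (line.startsWith("\t")) {
                // Continuation line: another title for the current actor.
                if (currentActor != null) {
                    moviesByActor.get(currentActor).add(line.trim());
                }
                continue;
            }
            // Split on the first run of tabs only, yielding at most 2 parts.
            String[] parts = line.split("\t+", 2);
            currentActor = parts[0];
            List<String> movies =
                    moviesByActor.computeIfAbsent(currentActor, k -> new ArrayList<>());
            if (parts.length == 2) {
                movies.add(parts[1].trim());
            }
        }
        return moviesByActor;
    }
}
```

Inside the existing loop, each newline read from the reader would be collected into a list (or handled one at a time with the same branching) instead of being split with "\t.*", which discards everything after the first tab.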
This is in my properties file:
message=You are scheduled {0} at {1} {2} for your {3} at {4}. We'll see you then! Any questions please call {5}.
Java code for setting values:
String[] msgParams = new String[6];
msgParams[0] = "Tire Rotation";
msgParams[1] = "2016-06-03";
msgParams[2] = "12:00";
msgParams[3] = "vehicle";
msgParams[4] = "dehli";
msgParams[5] = "9876543210";
String message = messageSource.getMessage("message", msgParams , Locale.getDefault());
System.out.println(message);
Output is:
You are scheduled for a Tire Rotation at 2016-06-03 12:00 for your vehicle at dehli. Well see you then! Any questions please call {5}.
Value of {5} is not set.
Maybe it is because you are missing a \ before the ':
message=You are scheduled {0} at {1} {2} for your {3} at {4}. We\'ll see you then! Any questions please call {5}.
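Finally found it. Use '' (two single quotes) instead of a single '. In MessageFormat a single quote starts a quoted section, so everything after the apostrophe in "We'll", including {5}, is treated as literal text.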
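The effect can be reproduced without the properties file. This is a minimal sketch that calls java.text.MessageFormat directly, assuming that messageSource formats the pattern with MessageFormat, which the observed output suggests. The class name ApostropheSketch and the shortened patterns are made up for illustration.

```java
import java.text.MessageFormat;

public class ApostropheSketch {
    // Thin wrapper so both patterns are formatted the same way.
    static String format(String pattern, Object... args) {
        return MessageFormat.format(pattern, args);
    }

    public static void main(String[] args) {
        // A single quote opens a quoted section: {1} is left as literal text
        // and the quote itself is dropped.
        System.out.println(format("at {0}. We'll see you then! Call {1}.", "dehli", "9876543210"));
        // A doubled quote produces one literal apostrophe and {1} is substituted.
        System.out.println(format("at {0}. We''ll see you then! Call {1}.", "dehli", "9876543210"));
    }
}
```

The first call prints "at dehli. Well see you then! Call {1}." and the second prints "at dehli. We'll see you then! Call 9876543210.", which mirrors the behaviour described in the question.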
I have the names of 1100 hospitals in the NY region and need to find their addresses via Google. I am looking for a script to which I can supply all the hospital names and which returns an address for each. The script could simply return a Google search result.
Input format:
Hospital Name
Center for Ambulatory Surgery
Genetic Diagnostic Labs Inc
Desired output format:
Hospital Name Hospital Address
Center for Ambulatory Surgery 3112 Sheridan Dr, Amherst, NY 14226
Genetic Diagnostic Labs Inc 490 Delaware Ave, Buffalo, NY 14202
A solution with the Google Places API, though the results may not be very accurate:
http://codepen.io/anon/pen/JogeyV?editors=101
var NY_latlng = new google.maps.LatLng(40.828624, -73.898605);
map = new google.maps.Map(document.getElementById('map-canvas'), {
center: NY_latlng,
zoom: 15
});
var hospitals = [];
var hospitals_names = ["Center for Ambulatory Surgery","Genetic Diagnostic Labs Inc"];//insert your full list here
var service = new google.maps.places.PlacesService(map);
hospitals_names.forEach( function(name ){
service.textSearch(
{
query: name,
location: NY_latlng,
radius: 50000, //in meter
},function(results,status){
if (status == google.maps.places.PlacesServiceStatus.OK){
var hospital= { name: name, addresses: []};
$('#address-list').append("<h2>"+name+"</h2><ul></ul>");
for (var i = 0; i < results.length; i++) {
hospital.addresses.push( results[i].formatted_address );
$('#address-list > ul').append("<li>"+results[i].formatted_address);
}
hospitals.push( hospital );
}
}); // end of textSearch callback
}); // end of forEach
You can do this in R with the ggmap package, though perhaps not reliably enough to produce the results you want. For instance this attempt to geocode fails:
geocode("Genetic Diagnostic Labs Inc")
Warning message:
geocode failed with status ZERO_RESULTS, location = "Genetic Diagnostic Labs Inc"
So to illustrate a solution, I appended " NY" to the Google searches:
library(ggmap)
hospital_names <- c("Center for Ambulatory Surgery", "Genetic Diagnostic Labs Inc")
address_vec <- lapply(hospital_names, function(x) revgeocode(as.numeric(geocode(paste(x,", NY")))))
result <- data.frame(name = hospital_names, address = unlist(address_vec))
Result:
result
name address
1 Center for Ambulatory Surgery 426 Union Road, West Seneca, NY 14224, USA
2 Genetic Diagnostic Labs Inc City Hall Park Path, New York, NY 10007, USA
But these are not the addresses you specified - you may need to refine your inputs.
I have no idea how to create regular expressions for extracting different fields from a text file. I am working on a text file containing the message details of a WhatsApp chat.
Consider the following data from a WhatsApp chat text file:
25/12/2012 9:15 am: User1: Faith makes all things possible,
Hope makes all things work,
Love makes all things beautiful,
May you have all the three for this Christmas.
MERRY CHRISTMAS
01/01/2013 12:03 am: User1: <message>.
04/08/2013 10:54 am: User2: Happy Friendship day
13/10/2013 11:57 am: User1:<message>
<message continues>
<message continues>
30/12/2013 10:07 pm: User3:<message>
30/12/2013 11:12 pm: User4: Same to you
This is a sample chat text from which I need to extract the date, time, username and message. I am working in Java.
The Java code I have worked out so far is as follows, but I haven't found a REGEX that meets my requirements.
BufferedReader br = new BufferedReader(new FileReader("text filepath"));
String sCurrentLine;
Pattern r = Pattern.compile(REGEX); //REGEX required for extracting data
while ((sCurrentLine = br.readLine()) != null) {
System.out.println(sCurrentLine);
Matcher m = r.matcher(sCurrentLine);
if (m.find()) {
System.out.println("Date: " + m.group(1) );
System.out.println("Time: " + m.group(2) );
System.out.println("User: " + m.group(3) );
System.out.println("Message: " + m.group(4) );
} else {
System.out.println("NO MATCH");
}
} // end of while loop
Thanks in advance for any help!
I think you're looking for this regex,
(\d{2}\/\d{2}\/\d{4})\s(\d(?:\d)?:\d{2} [ap]m):\s([^:]*):(.*?)(?=\s*\d{2}\/|$)
Java regex would be,
"(?s)(\\d{2}/\\d{2}/\\d{4})\\s(\\d(?:\\d)?:\\d{2} [ap]m):\\s([^:]*):(.*?)(?=\\s*\\d{2}/|$)"
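Note that this pattern relies on (?s) and a lookahead for the next date, so it has to run over the whole chat text rather than over one line at a time as in the question's readLine loop. A minimal sketch of that usage follows. The class name ChatParserSketch and the method parse are hypothetical, and it assumes the file content has already been read into a single String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChatParserSketch {
    // The Java regex from above: date, time, user, then the message up to
    // the next date or the end of the input.
    private static final Pattern MESSAGE = Pattern.compile(
            "(?s)(\\d{2}/\\d{2}/\\d{4})\\s(\\d(?:\\d)?:\\d{2} [ap]m):\\s([^:]*):(.*?)(?=\\s*\\d{2}/|$)");

    // Returns one {date, time, user, message} array per chat message.
    static List<String[]> parse(String chat) {
        List<String[]> messages = new ArrayList<>();
        Matcher m = MESSAGE.matcher(chat);
        while (m.find()) {
            messages.add(new String[] {
                    m.group(1), m.group(2), m.group(3).trim(), m.group(4).trim()});
        }
        return messages;
    }

    public static void main(String[] args) {
        String chat = "25/12/2012 9:15 am: User1: Faith makes all things possible,\n"
                + "MERRY CHRISTMAS\n"
                + "04/08/2013 10:54 am: User2: Happy Friendship day";
        for (String[] msg : parse(chat)) {
            System.out.println("Date: " + msg[0] + ", Time: " + msg[1]
                    + ", User: " + msg[2] + ", Message: " + msg[3]);
        }
    }
}
```

One limitation worth knowing: because the lookahead stops at any two digits followed by a slash, a message that itself contains something like " 12/" after whitespace would be cut short at that point.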