I'm writing some code that is being used to parse dates out of a very large data set. I have the following regex to match different variations of dates:
"(((0?[1-9]|1[012])(/|-)(0?[1-9]|[12][0-9]|3[01])(/|-))|"
+"((january|february|march|april|may|june|july|august|september|october|november|december)"
+ "\\s*(0?[1-9]|[12][0-9]|3[01])(th|rd|nd|st)?,*\\s*))((19|20)\\d\\d)"
which matches dates of format 'Month dd, yyyy', 'mm/dd/yyyy', and 'mm-dd-yyyy'. This works fine for those formats, but I'm now encountering dates in the European 'dd Month, yyyy' format. I tried adding (\\d{1,2})? at the beginning of the regex and adding a ? quantifier after the current day-matching section of the regex, like so:
"((\\d{1,2})?((0?[1-9]|1[012])(/|-)(0?[1-9]|[12][0-9]|3[01])(/|-))|"
+"((january|february|march|april|may|june|july|august|september|october|november|december)"
+ "\\s*(0?[1-9]|[12][0-9]|3[01])?(th|rd|nd|st)?,*\\s*))((19|20)\\d\\d)"
but this is not entirely viable, as it sometimes captures numeric characters both before and after the month (e.g. '00 January 15, 2013') and sometimes neither ('January 2013'). Is there a way to ensure that exactly one of the two is captured?
Here is one Java implementation for your requirements (searching for dates in input text):
String input = "which matches dates of format 'january 31, 1976', '9/18/2013', "
+ "and '11-20-1988'. This works fine for those formats, but I'm now encountering dates "
+ "in the European '26th May, 2020' format. I tried adding (\\d{1,2})? at the "
+ "beginning of the regex and adding a ? quantifier after the current day matching section of the regex as such";
String months_t = "(january|february|march|april|may|june|july|august|september|october|november|december)";
String months_d = "(1[012]|0?[1-9])";
String days_d = "(3[01]|[12][0-9]|0?[1-9])"; //"\\d{1,2}";
String year_d = "((19|20)\\d\\d)";
String days_d_a = "(" + days_d + "(th|rd|nd|st)?)";
// 'mm/dd/yyyy', and 'mm-dd-yyyy'
String regexp1 = "(" + months_d + "[/-]" + days_d + "[/-]"
+ year_d + ")";
// 'Month dd, yyyy', and 'dd Month, yyyy'
String regexp2 = "(((" + months_t + "\\s*" + days_d_a + ")|("
+ days_d_a + "\\s*" + months_t + "))[,\\s]+" + year_d + ")";
String regexp = "(?i)" + regexp1 + "|" + regexp2;
Pattern pMod = Pattern.compile(regexp);
Matcher mMod = pMod.matcher(input);
while (mMod.find()) {
System.out.println(mMod.group(0));
}
The output is:
january 31, 1976
9/18/2013
11-20-1988
26th May, 2020
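To check the two problem inputs from the question against this regex, here is a minimal sketch that wraps the same building blocks in a helper. The names DateFinder and findDates are made up for illustration. Because the day sits inside an alternation (either before or after the month name), '00 January 15, 2013' yields only the 'January 15, 2013' part, and 'January 2013' is not matched at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// DateFinder and findDates are illustrative names, not part of the original answer.
class DateFinder {
    static final String MONTHS_T =
            "(january|february|march|april|may|june|july|august|september|october|november|december)";
    static final String MONTHS_D = "(1[012]|0?[1-9])";
    static final String DAYS_D = "(3[01]|[12][0-9]|0?[1-9])";
    static final String YEAR_D = "((19|20)\\d\\d)";
    static final String DAYS_D_A = "(" + DAYS_D + "(th|rd|nd|st)?)";

    // 'mm/dd/yyyy' or 'mm-dd-yyyy'
    static final String NUMERIC = "(" + MONTHS_D + "[/-]" + DAYS_D + "[/-]" + YEAR_D + ")";
    // 'Month dd, yyyy' or 'dd Month, yyyy': the inner alternation puts the day
    // on exactly one side of the month name.
    static final String TEXTUAL = "(((" + MONTHS_T + "\\s*" + DAYS_D_A + ")|("
            + DAYS_D_A + "\\s*" + MONTHS_T + "))[,\\s]+" + YEAR_D + ")";

    static final Pattern DATE = Pattern.compile("(?i)" + NUMERIC + "|" + TEXTUAL);

    // Returns every date-like substring found in the text, in order.
    static List<String> findDates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findDates("00 January 15, 2013")); // [January 15, 2013]
        System.out.println(findDates("January 2013"));        // []
        System.out.println(findDates("15 January 2013"));     // [15 January 2013]
    }
}
```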
I have a class that parses ZonedDateTime objects using .split() to get rid of all the extra information I don't want.
My Question: Is there a way to use square brackets as delimiters that I am missing, OR how do I get the time zone ([US/Mountain]) by itself without using square brackets as delimiters?
I want the String timeZone to look like "US/Mountain" or "[US/Mountain]".
What I've Tried:
I've tried wholeThing.split("[[-T:.]]") and wholeThing.split("[%[-T:.%]]") but those both give me 00[US/Mountain]
I've also tried wholeThing.split("[\\[-T:.\\]]") and wholeThing.split("[\[-T:.\]") but those just give me errors.
(part of) My Code:
//We start out with something like 2016-09-28T17:38:38.990-06:00[US/Mountain]
String[] whatTimeIsIt = wholeThing.split("[[-T:.]]"); //wholeThing is a ZonedDateTime object converted to a String
String year = whatTimeIsIt[0];
String month = setMonth(whatTimeIsIt[1]);
String day = whatTimeIsIt[2];
String hour = setHour(whatTimeIsIt[3]);
String minute = whatTimeIsIt[4];
String second = setAmPm(whatTimeIsIt[5],whatTimeIsIt[3]);
String timeZone = whatTimeIsIt[8];
Using split() is the right idea. Split on the opening bracket, then strip the trailing ] from the second piece:
String[] timeZoneTemp = wholeThing.split("\\[");
String timeZone = timeZoneTemp[1].substring(0, timeZoneTemp[1].length() - 1);
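If you would rather keep a single split() call, the brackets can be members of the character class once they are escaped. The sketch below assumes the input always has the shape shown in the question; ZoneSplit and splitFields are made-up names. The two attempts in the question appear to fail for different reasons: Java reads "[[-T:.]]" as a nested character class, so the brackets act as syntax and are never matched, and in "[\\[-T:.\\]]" the sequence \[-T is treated as a range running from [ down to T, which is rejected as an illegal range.

```java
import java.util.Arrays;

// ZoneSplit and splitFields are illustrative names.
class ZoneSplit {
    // Splits on any single one of: [ ] T : . -
    // The hyphen comes last so it is a literal rather than a range operator.
    static String[] splitFields(String wholeThing) {
        return wholeThing.split("[\\[\\]T:.-]");
    }

    public static void main(String[] args) {
        String[] parts = splitFields("2016-09-28T17:38:38.990-06:00[US/Mountain]");
        System.out.println(Arrays.toString(parts));
        // The zone is the last element; the empty string after ] is dropped by split().
        System.out.println(parts[parts.length - 1]); // US/Mountain
    }
}
```

Note that this is fragile: a zone ID containing a capital T or a hyphen (America/Toronto, for instance) would itself be cut into pieces, which is one reason to prefer the other approaches in this thread.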
If you want to parse the string yourself, use a regular expression to extract the values.
Don't use a regex to find characters to split on, which is what split() does.
Instead, use a regex with capture groups, compile it using Pattern.compile(), obtain a Matcher on your input text using matcher(), and check it using matches().
If it matches you can get the captured groups using group().
Example regex:
(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})\.(\d+)[-+]\d{2}:\d{2}\[([^\]]+)\]
In a Java string, you have to escape the \, so here is code showing how it works:
String input = "2016-09-28T17:38:38.990-06:00[US/Mountain]";
String regex = "(\\d{4})-(\\d{2})-(\\d{2})T(\\d{2}):(\\d{2}):(\\d{2})\\.(\\d+)[-+]\\d{2}:\\d{2}\\[([^\\]]+)\\]";
Matcher m = Pattern.compile(regex).matcher(input);
if (m.matches()) {
System.out.println("Year : " + m.group(1));
System.out.println("Month : " + m.group(2));
System.out.println("Day : " + m.group(3));
System.out.println("Hour : " + m.group(4));
System.out.println("Minute : " + m.group(5));
System.out.println("Second : " + m.group(6));
System.out.println("Fraction: " + m.group(7));
System.out.println("TimeZone: " + m.group(8));
} else {
System.out.println("** BAD INPUT **");
}
Output
Year : 2016
Month : 09
Day : 28
Hour : 17
Minute : 38
Second : 38
Fraction: 990
TimeZone: US/Mountain
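With eight groups, numeric indexes are easy to get wrong. As a variation, here is a sketch using named capture groups (available since Java 7), with the dot before the fraction escaped so it only matches a literal dot. The class name, method name, and group names are my own choices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// NamedGroupsDemo, zoneOf and the group names are illustrative.
class NamedGroupsDemo {
    static final Pattern ZDT = Pattern.compile(
            "(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})T"
            + "(?<hour>\\d{2}):(?<minute>\\d{2}):(?<second>\\d{2})\\.(?<fraction>\\d+)"
            + "[-+]\\d{2}:\\d{2}\\[(?<zone>[^\\]]+)\\]");

    // Returns the zone ID between the brackets, or null if the input has a different shape.
    static String zoneOf(String input) {
        Matcher m = ZDT.matcher(input);
        return m.matches() ? m.group("zone") : null;
    }

    public static void main(String[] args) {
        Matcher m = ZDT.matcher("2016-09-28T17:38:38.990-06:00[US/Mountain]");
        if (m.matches()) {
            System.out.println("Year    : " + m.group("year"));
            System.out.println("Fraction: " + m.group("fraction"));
            System.out.println("TimeZone: " + m.group("zone"));
        }
    }
}
```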
UPDATED
You can of course get all the same values using ZonedDateTime.parse(), which will also ensure that the date is valid, something none of the other solutions will do.
String input = "2016-09-28T17:38:38.990-06:00[US/Mountain]";
ZonedDateTime zdt = ZonedDateTime.parse(input);
System.out.println("Year : " + zdt.getYear());
System.out.println("Month : " + zdt.getMonthValue());
System.out.println("Day : " + zdt.getDayOfMonth());
System.out.println("Hour : " + zdt.getHour());
System.out.println("Minute : " + zdt.getMinute());
System.out.println("Second : " + zdt.getSecond());
System.out.println("Milli : " + zdt.getNano() / 1000000);
System.out.println("TimeZone: " + zdt.getZone());
Output
Year : 2016
Month : 9
Day : 28
Hour : 17
Minute : 38
Second : 38
Milli : 990
TimeZone: US/Mountain
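The question's code also calls helpers named setMonth, setHour and setAmPm, whose bodies are not shown. Assuming they turn the month number into a name and the 24-hour value into a 12-hour value plus AM/PM, java.time can supply those directly. A rough sketch (ZonedParts and its method names are hypothetical, and English month names are an assumption):

```java
import java.time.ZonedDateTime;
import java.time.format.TextStyle;
import java.time.temporal.ChronoField;
import java.util.Locale;

// ZonedParts and its methods are hypothetical stand-ins for the question's helpers.
class ZonedParts {
    // Full English month name, e.g. "September".
    static String monthName(ZonedDateTime zdt) {
        return zdt.getMonth().getDisplayName(TextStyle.FULL, Locale.ENGLISH);
    }

    // Hour on a 12-hour clock, in the range 1..12.
    static int hour12(ZonedDateTime zdt) {
        return zdt.get(ChronoField.CLOCK_HOUR_OF_AMPM);
    }

    // True for times from noon onwards.
    static boolean isPm(ZonedDateTime zdt) {
        return zdt.get(ChronoField.AMPM_OF_DAY) == 1;
    }

    public static void main(String[] args) {
        ZonedDateTime zdt = ZonedDateTime.parse("2016-09-28T17:38:38.990-06:00[US/Mountain]");
        System.out.println(monthName(zdt));                          // September
        System.out.println(hour12(zdt) + (isPm(zdt) ? " PM" : " AM")); // 5 PM
        System.out.println(zdt.getZone().getId());                   // US/Mountain
    }
}
```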
I have to parse a multi line string and retrieve the email addresses in a specific location.
And I have done it using the below code:
String input = "Content-Type: application/ms-tnef; name=\"winmail.dat\"\r\n"
+ "Content-Transfer-Encoding: binary\r\n" + "From: ABC aa DDD <aaaa.b#abc.com>\r\n"
+ "To: DDDDD dd <sssss.r#abc.com>\r\n" + "CC: Rrrrr rrede <sssss.rv#abc.com>, Dsssssf V R\r\n"
+ " <dsdsdsds.vr#abc.com>, Psssss A <pssss.a#abc.com>, Logistics\r\n"
+ " <LOGISTICS#abc.com>, Gssss Bsss P <gdfddd.p#abc.com>\r\n"
+ "Subject: RE: [MyApps] (PRO-34604) PR for Additional Monitor allocation [CITS\r\n"
+ " Ticket:258849]\r\n" + "Thread-Topic: [MyApps] (PRO-34604) PR for Additional Monitor allocation\r\n"
+ " [CITS Ticket:258849]\r\n" + "Thread-Index: AQHRXMJHE6KqCFxKBEieNqGhdNy7Pp8XHc0A\r\n"
+ "Date: Mon, 1 Feb 2016 17:56:17 +0530\r\n"
+ "Message-ID: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "References: <JA.101.1453963700000#myapps.abc.com>\r\n"
+ " <JA.101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "In-Reply-To: <JIRA.450101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "Accept-Language: en-US\r\n" + "Content-Language: en-US\r\n" + "X-MS-Has-Attach:\r\n"
+ "X-MS-Exchange-Organization-SCL: -1\r\n"
+ "X-MS-TNEF-Correlator: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "MIME-Version: 1.0\r\n" + "X-MS-Exchange-Organization-AuthSource: TURWINSRVRPS01.abc.com\r\n"
+ "X-MS-Exchange-Organization-AuthAs: Internal\r\n" + "X-MS-Exchange-Organization-AuthMechanism: 04\r\n"
+ "X-Originating-IP: [1.1.1.7]";
Pattern pattern = Pattern.compile("To:(.*<([^>]*)>).*Message-ID", Pattern.DOTALL);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
Pattern innerPattern = Pattern.compile("<([^>]*)>");
Matcher innerMatcher = innerPattern.matcher(matcher.group(1));
while (innerMatcher.find()) {
System.out.println("-->:" + innerMatcher.group(1));
}
}
This works fine. I first capture the part from To: up to Message-ID, which is the section I need, and then use a second pattern to extract the email IDs from it.
Is there a better way to do this? Can it be done with a single Pattern/Matcher pair?
Update:
This is the expected output:
-->:sssss.r#abc.com
-->:sssss.rv#abc.com
-->:dsdsdsds.vr#abc.com
-->:pssss.a#abc.com
-->:LOGISTICS#abc.com
-->:gdfddd.p#abc.com
Ideally, you could have used lookarounds:
(?<=To:.*)<([^>]+)>(?=.*Message-ID)
Unfortunately, Java doesn't support unbounded quantifiers such as .* in lookbehinds. A workaround is to use a bounded quantifier instead:
(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)
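A minimal sketch of how that workaround might be used from Java. Pattern.DOTALL is needed so that . can cross the line breaks in folded headers, and the bound of 1000 characters is the arbitrary one from the regex above, so it must be large enough for the real header block. HeaderEmails, recipients and the sample addresses are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// HeaderEmails and recipients are illustrative names; the sample headers are invented.
class HeaderEmails {
    private static final Pattern ADDRESS = Pattern.compile(
            "(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)", Pattern.DOTALL);

    // Collects every <...> value that has "To:" at most 1000 characters behind it
    // and "Message-ID" somewhere ahead of it.
    static List<String> recipients(String headers) {
        List<String> result = new ArrayList<>();
        Matcher m = ADDRESS.matcher(headers);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String headers = "From: A <a@example.com>\r\n"
                + "To: B <b@example.com>\r\n"
                + "CC: C <c@example.com>,\r\n"
                + " D <d@example.com>\r\n"
                + "Message-ID: <id@example.com>\r\n";
        System.out.println(recipients(headers)); // [b@example.com, c@example.com, d@example.com]
    }
}
```

The From: address is skipped because no To: precedes it, and the Message-ID value is skipped because no Message-ID follows it.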
I think you are looking for all the emails inside <...> that come after To: and before Message-ID. So, you may use a \G based regex for one pass:
Pattern pt = Pattern.compile("(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL);
Matcher m = pt.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
The regex matches:
(?:\\bTo:|(?!^)\\G) - a leading boundary, either To: as a whole word or the location after the previous successful match
.*? - any characters (including line breaks, because of DOTALL), as few as possible, up to the first <
<([^>]*)> - a substring starting with <, followed by zero or more characters other than > (Group 1), and followed by a closing >
(?=.*Message-ID) - a positive lookahead that makes sure there is Message-ID somewhere ahead of the current match.
I'm not too sure why my word doesn't get replaced in Android Studio.
private Calendar dateTime = Calendar.getInstance();
private SimpleDateFormat dateFormatter = new SimpleDateFormat("dd/MM/yyyy, EEE");
String dateFormat = dateFormatter.format(dateTime.getTime());
Button beginDate = (Button) findViewById(R.id.startDate);
beginDate.setText("From " + dateFormat);
// Other codes removed for simplicity
String beginD = beginDate.getText().toString().replace("From: ", "");
Log.d("Test", beginD);
Log result as follows:
06-16 14:14:01.957 23893-23893/packagename D/Test: From 16/06/2015, Tue
You're trying to replace "From: " but you only added "From " (without the :).
I don't see a : in the input you entered, so From: won't be matched. I recommend using a more generic pattern:
replaceAll("From:?\\s+", "");
Since replaceAll takes a regex, you can ask for optional :, followed by one or more space(s).
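A small sketch of the difference, using the text from the question (FromPrefix and stripFrom are made-up names). replace() looks for the literal text "From: " and leaves the string untouched when the colon is missing, while the regex version handles both spellings.

```java
// FromPrefix and stripFrom are illustrative names.
class FromPrefix {
    // Removes "From" with an optional colon and the whitespace that follows it.
    static String stripFrom(String text) {
        return text.replaceAll("From:?\\s+", "");
    }

    public static void main(String[] args) {
        String label = "From 16/06/2015, Tue";
        System.out.println(label.replace("From: ", ""));         // From 16/06/2015, Tue (unchanged)
        System.out.println(stripFrom(label));                    // 16/06/2015, Tue
        System.out.println(stripFrom("From: 16/06/2015, Tue"));  // 16/06/2015, Tue
    }
}
```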
There is no : in your button text. So you either have to change the text of the button to:
beginDate.setText("From: " + dateFormat);
or change the text passed to replace() to:
String beginD = beginDate.getText().toString().replace("From ", "");
I am trying to create a pattern in Java that matches the following string;
String message ="%%140911,A,140929100526,S0117.6262E03647.8107,000,067,F100,4F000100,108";
The pattern I have formed is not matching the string. What am I missing? This is the pattern I have tried so far:
private static final Pattern pattern = Pattern.compile(
"(\\%\\%)"+"(\\d)," + // Id
"([AL])," + // Validity: A for valid, L for invalid
"(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})," + // Date (YYMMDD)Time (HHMMSS)
"([NS])" + "(\\d{2})(\\d{2}\\.\\d+)" + "([EW])" + "(\\d{3})(\\d{2}\\.\\d+)," + //loc
"(\\d+)," + // Speed
"(\\d+)," + // Direction
"([FC])" + "(\\d{3})," + // temperature in Fahrenheit/celsius
"(\\w{8})," + // status
"(\\d+)"); // event
You're missing a + quantifier in the first line: (\\d) matches exactly one digit, but the id in your message (140911) has six. Try changing
"(\\%\\%)"+"(\\d),"
to
"(\\%\\%)"+"(\\d+),"
I am trying to split on commas and spaces. My input is 21, May, 2012 and my output should be 2012-May-21.
String s = args[0];
String[] s1 = s.split(",\\s+");
System.out.print(s1[2] + "-" + s1[1] + "-" + s1[0]);
It works if I split on , only, but I get an ArrayIndexOutOfBoundsException when I also try to use a space as a delimiter.
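Since both the comma and the space are optional, as mentioned in the comments, split on any run of commas and whitespace (splitting on ",|\\s+" would produce empty strings between a comma and the space after it):
String[] s1 = s.split("[,\\s]+");
Though I wouldn't use a regex to parse a date: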
input = input.replaceAll("\\s+", ""); // remove any whitespace
java.util.Date date= (new SimpleDateFormat("dd,MMM,yyyy")).parse(input);
String output=(new SimpleDateFormat("yyyy-MMM-dd")).format(date);
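On Java 8 or later the same idea can be written with java.time. This is only a sketch under two assumptions: the input is comma-separated as in the question, and the month is an English abbreviation such as May. DateReformat and reformat are made-up names.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// DateReformat and reformat are illustrative names.
class DateReformat {
    private static final DateTimeFormatter IN =
            DateTimeFormatter.ofPattern("d,MMM,uuuu", Locale.ENGLISH);
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("uuuu-MMM-dd", Locale.ENGLISH);

    // Strips whitespace, parses day,month,year and prints it as year-month-day.
    // Throws DateTimeParseException if the text does not fit the input pattern.
    static String reformat(String input) {
        String compact = input.replaceAll("\\s+", "");
        return LocalDate.parse(compact, IN).format(OUT);
    }

    public static void main(String[] args) {
        System.out.println(reformat("21, May, 2012")); // 2012-May-21
    }
}
```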
Try this:
String date = " 21 , May, 2012";
String[] s1 = date.split(",\\s*");
System.out.println(s1[2].trim() + "-" + s1[1].trim() + "-" + s1[0].trim());
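You can also replace the delimiters using String#replaceAll, although this keeps the fields in their original order (21-May-2012), so it only helps if you don't need to reorder them: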
s.replaceAll(",\\s*", "-");
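To get the reordered 2012-May-21 in one step, capture groups can pick out the three fields regardless of whether they are separated by commas, spaces, or both. A minimal sketch (DateReorder and reorder are made-up names, and a one- or two-digit day, alphabetic month and four-digit year are assumptions about the input):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// DateReorder and reorder are illustrative names.
class DateReorder {
    // day, month name, year, separated by any run of commas and whitespace
    private static final Pattern DMY = Pattern.compile(
            "\\s*(\\d{1,2})[,\\s]+(\\p{Alpha}+)[,\\s]+(\\d{4})\\s*");

    static String reorder(String s) {
        Matcher m = DMY.matcher(s);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized date: " + s);
        }
        return m.group(3) + "-" + m.group(2) + "-" + m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(reorder("21, May, 2012"));   // 2012-May-21
        System.out.println(reorder(" 21 , May, 2012")); // 2012-May-21
        System.out.println(reorder("21 May 2012"));     // 2012-May-21
    }
}
```

Unlike the SimpleDateFormat approach, this does not check that the date is real; it only rearranges the text.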