Finding Timestamp pattern from a String - java

I want to convert a String into a Timestamp, but its pattern is unknown. Is there any Java API that takes the string and returns the possible patterns that could match it? I understand that one could keep a set of pre-defined patterns and try to parse the String against each of them. But since this is a timestamp, the number of possible date and time combinations is very large, so I am looking for an efficient way to figure out the pattern of the timestamp String.

I have never heard of a ready-to-use library for this. As @Eugene noted, the number of possible pattern combinations is huge, so there is probably no such library.
I would recommend rethinking your architecture.
If you just want to experiment with something like this, you can create your own implementation.
Let's say you parse the input and extract the array of integers (09, 21, 12, 0, 44, 33). You can assume that the array contains year, day, month, minute, hour and seconds (not sure whether you really can assume that, it's just an example).
Once you have that array, you can generate all possible permutations of it.
Then you can create a date object for each permutation:
DateTime dt = new DateTime(9, 12, 21, 0, 44, 33, 0); // year, month, day, hour, minute, second, millis
(the example above uses Joda-Time; note that a literal 09 would not compile in Java, because a leading zero marks an octal literal)
If you know, for example, that the year will always be sent with 4 digits, the number of possible combinations decreases. Further, you can assume that '26' cannot be a month, and so on. You probably get the idea.
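The permutation idea can be sketched roughly as follows. This is only an illustration, not the answerer's code: it swaps Joda-Time for the standard java.time API so that it is self-contained, and the class and method names (TimestampPermutations, candidates) are made up for the example.

```java
import java.time.DateTimeException;
import java.time.LocalDateTime;
import java.util.LinkedHashSet;
import java.util.Set;

class TimestampPermutations {

    // Tries every ordering of six values as (year, month, day, hour, minute, second)
    // and keeps the orderings that form a valid date-time.
    static Set<LocalDateTime> candidates(int[] fields) {
        if (fields.length != 6) {
            throw new IllegalArgumentException("expected exactly six values");
        }
        Set<LocalDateTime> valid = new LinkedHashSet<>();
        permute(fields.clone(), 0, valid);
        return valid;
    }

    // Classic swap-based permutation: fix position k, recurse on the rest.
    private static void permute(int[] a, int k, Set<LocalDateTime> out) {
        if (k == a.length) {
            try {
                out.add(LocalDateTime.of(a[0], a[1], a[2], a[3], a[4], a[5]));
            } catch (DateTimeException e) {
                // this ordering is not a real date-time (e.g. month 21); skip it
            }
            return;
        }
        for (int i = k; i < a.length; i++) {
            swap(a, k, i);
            permute(a, k + 1, out);
            swap(a, k, i);
        }
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(String[] args) {
        // the example values from the answer above
        for (LocalDateTime dt : candidates(new int[] {9, 21, 12, 0, 44, 33})) {
            System.out.println(dt);
        }
    }
}
```

Several orderings usually survive the validity check, which is exactly the ambiguity described above; extra knowledge such as "the year always has 4 digits" is what narrows the set down.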

In this case, I would first switch to Joda-Time: http://joda-time.sourceforge.net/
Then I would generate a set of candidate patterns (try to narrow down the possibilities as much as you can, as there are far too many) and try to parse the date with each of them. If a pattern does not throw an error (i.e. it fits), put it in an array, then return the array. This is probably a very unoptimized solution, but it is where I would start.
I really do not think there are libraries for that. Also, you might want to explain why you want to do this; maybe the solution is a bit simpler.
Cheers,
Eugene.
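A minimal sketch of the "try each candidate pattern and keep the ones that fit" approach. It uses java.time.format.DateTimeFormatter from the standard library rather than Joda-Time, and the candidate list as well as the names PatternGuesser and matchingPatterns are assumptions made for the illustration.

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PatternGuesser {

    // Hypothetical candidate list; extend or narrow it for the data at hand.
    static final List<String> CANDIDATES = Arrays.asList(
            "yyyy-MM-dd HH:mm:ss",
            "yyyy-MM-dd",
            "dd/MM/yyyy",
            "MM/dd/yyyy",
            "dd-MM-yyyy");

    // Returns every candidate pattern that parses the whole input without error.
    static List<String> matchingPatterns(String input) {
        List<String> matches = new ArrayList<>();
        for (String pattern : CANDIDATES) {
            try {
                DateTimeFormatter.ofPattern(pattern).parse(input);
                matches.add(pattern);
            } catch (DateTimeParseException e) {
                // the pattern does not fit this input; try the next one
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(matchingPatterns("31/12/2099"));
        System.out.println(matchingPatterns("01/11/2003"));
    }
}
```

An input such as 01/11/2003 fits both dd/MM/yyyy and MM/dd/yyyy, so the method can only report the set of possible patterns; choosing between them needs outside knowledge.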

Related

How to identify date from a string in Java

Recently I have been challenged by quite an "easy" problem. Suppose that there are sentences (saved in a String), and I need to find out whether there is any date in this String. The challenge is that the date can be in a lot of different formats. Some examples are shown in the list:
June 12, 1956
London, 21st October 2014
13 October 1999
01/11/2003
It is worth mentioning that each of these is contained in a longer string. So as an example it can be like:
String s = "This event took place on 13 October 1999.";
My question is how I can detect that there is a date in this string. My first approach was to search for the word "event" and then try to locate the date. But with more and more possible date formats, this solution is not very elegant. The second solution I tried was to create a list of month names and search for them. This gave good results but still misses the cases where the date is expressed entirely in digits.
One solution I have not tried yet is to design regular expressions and look for a match in the string. I am not sure how much this would hurt performance.
What would be a good solution to consider? Has anybody faced a similar problem before, and what solutions did you find?
One thing is certain: there is no time component, so the only interesting part is the date.
Using the natty.joestelmach.com library
Natty is a natural language date parser written in Java. Given a date expression, natty will apply standard language recognition and translation techniques to produce a list of corresponding dates with optional parse and syntax information.
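import com.joestelmach.natty.*;
import java.util.Date;
import java.util.List;
// parse() returns a list of DateGroup objects; take the dates of the first group
List<Date> dates = new Parser().parse("Start date 11/30/2013 , end date Friday, Sept. 7, 2013").get(0).getDates();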
System.out.println(dates.get(0));
System.out.println(dates.get(1));
//output:
//Sat Nov 30 11:14:30 BDT 2013
//Sat Sep 07 11:14:30 BDT 2013
You are after Named Entity Recognition. I'd start with Stanford NLP. The 7 class model includes date, but the online demo struggles and misses the "13". :(
Natty mentioned above gives a better answer.
If it's only one String, you could use regular expressions as you mentioned, although you would have to write an expression for each date format. Here are some examples:
Regular Expressions - dates
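As a rough sketch of the regular-expression route, the following covers only the four example formats from the question. The class name DateFinder and the exact expressions are assumptions for illustration; real input would need more alternatives (abbreviated months, other separators and so on).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateFinder {

    private static final String MONTH =
            "(?:January|February|March|April|May|June|July|August"
            + "|September|October|November|December)";

    // Three alternatives: 01/11/2003, "13 October 1999" / "21st October 2014",
    // and "June 12, 1956".
    private static final Pattern DATE = Pattern.compile(
            "\\b(?:"
            + "\\d{1,2}/\\d{1,2}/\\d{4}"
            + "|\\d{1,2}(?:st|nd|rd|th)?\\s+" + MONTH + "\\s+\\d{4}"
            + "|" + MONTH + "\\s+\\d{1,2},\\s+\\d{4}"
            + ")\\b");

    // Returns every substring of the text that looks like a date.
    static List<String> findDates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findDates("This event took place on 13 October 1999."));
    }
}
```

The expressions only check the shape of the text, so something like 99/99/2003 would also be reported; a parsing step afterwards can weed out impossible dates. Compiling the pattern once, as done here with a static field, keeps the per-string cost low.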
In case it's a document or a big text, you will need a parser. You could use a Lexical analysis approach.
Depending on the project using an external library as mentioned in some answers might be a good idea. Sometimes it's not an option.
I've done this before with good precision and recall. You'll need GATE and its ANNIE plugin.
1. Use the GATE UI tool to create a .GAPP file that will contain your processing resources.
2. Use the .GAPP file to use the extracted Date annotation set.
Step 2 can be done as follows:
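import gate.*;
import gate.util.persistence.PersistenceManager;
import java.io.File;
Gate.init(); // GATE has to be initialised once before it is used
Corpus corpus = Factory.newCorpus("Gate Corpus");
Document gateDoc = Factory.newDocument("This event took place on 13 October 1999.");
corpus.add(gateDoc);
// load the saved application (Test.gapp) from the ANNIE plugin directory
File pluginsHome = Gate.getPluginsHome();
File anniePlugin = new File(pluginsHome, "ANNIE");
File annieGapp = new File(anniePlugin, "Test.gapp");
CorpusController annieController = (CorpusController) PersistenceManager.loadObjectFromFile(annieGapp);
annieController.setCorpus(corpus);
annieController.execute();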
Later you can see the extracted annotations like this:
AnnotationSet annotations = gateDoc.getAnnotations(); // the default annotation set
System.out.println("Found annotations of the following types: " + annotations.getAllTypes());
I'm sure you can do it easily with the built-in Date annotation set. It is also very extensible.
To extend the Date annotation set, create a lenient annotation rule in JAPE, say 'DateEnhanced', based on the built-in ANNIE Date annotation, so that it also covers dates like "9/11". Then chain Java regexes on the R.H.S. of the 'DateEnhanced' JAPE rule to filter out unwanted outputs (if any).

How to retrieve all kinds of dates and temporal values from text

I want to retrieve dates and other temporal entities from a set of Strings. Can this be done in Java without relying on a date parser, given that most parsers handle only a limited range of input patterns? The input here is entered manually and is therefore ambiguous.
Inputs can be like:
12th Sep |mid-March |12.September.2013
Sep 12th |12th September| 2013
Sept 13 |12th, September |12th,Feb,2013
I've gone through many answers on finding date in Java but most of them don't deal with such a huge scope of input patterns.
I've tried using the SimpleDateFormat class and calling parse() to see whether it fails, which would mean the input is not a date. I've tried using regex, but I'm not sure it fits this scenario. I've also used ClearNLP to annotate the dates, but it doesn't give a reliable annotation set.
The closest approach to getting these values could be using a Chain of Responsibility as mentioned below. Is there a library with a set of date patterns that I could use?
A clean and modular approach to this problem would be to use a chain.
Every element of the chain tries to match the input string against a regex.
If the regex matches, you convert the input string into something that can be fed to a SimpleDateFormat, which turns it into the data structure you prefer (a Date, or a different temporal representation that better suits your needs), and return it. If the regex doesn't match, the chain element simply delegates to the next element in the chain.
The responsibility of each element is just to test its regex against the string, and then either give a result or ask the next element to give it a try.
The chain can be created and composed easily without changing the implementation of any element.
In the end the result is the same as in @KirkoR's response, with a 'bit' (:D) more code but a modular approach. (I prefer the regex approach to the try/catch one.)
Some reference: https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern
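A small sketch of such a chain, under a few assumptions: it parses with java.time (LocalDate and DateTimeFormatter) instead of SimpleDateFormat, month names are taken to be English, and the names DateLink and defaultChain as well as the three regex/pattern pairs are invented for the example.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

class DateLink {

    private final Pattern shape;               // regex this link is responsible for
    private final DateTimeFormatter formatter; // how to parse input of that shape
    private final DateLink next;               // null marks the end of the chain

    DateLink(String regex, String datePattern, DateLink next) {
        this.shape = Pattern.compile(regex);
        this.formatter = DateTimeFormatter.ofPattern(datePattern, Locale.ENGLISH);
        this.next = next;
    }

    // Parses the input if it has this link's shape, otherwise delegates.
    Optional<LocalDate> handle(String input) {
        if (shape.matcher(input).matches()) {
            try {
                return Optional.of(LocalDate.parse(input, formatter));
            } catch (DateTimeParseException e) {
                // right shape but impossible values (e.g. day 99): fall through
            }
        }
        return next == null ? Optional.empty() : next.handle(input);
    }

    // Composes a chain; adding a format means adding one more link here.
    static DateLink defaultChain() {
        DateLink slashes = new DateLink("\\d{2}/\\d{2}/\\d{4}", "dd/MM/uuuu", null);
        DateLink iso = new DateLink("\\d{4}-\\d{2}-\\d{2}", "uuuu-MM-dd", slashes);
        return new DateLink("\\d{2}\\.[A-Za-z]+\\.\\d{4}", "dd.MMMM.uuuu", iso);
    }

    public static void main(String[] args) {
        System.out.println(defaultChain().handle("12.September.2013"));
        System.out.println(defaultChain().handle("mid-March"));
    }
}
```

Inputs that no link recognises, such as mid-March, come back as an empty Optional, so the caller can decide whether to log them and add another link later.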
You could just implement support for all the pattern possibilities you can think of, then document that ... OK, these are all patterns my module supports. You could then throw some RuntimeException for all the other possibilities.
Then ... in an iterative way you can keep running your module over the input data, and keep adding support for more date formats until it stops raising any RuntimeException.
I think that's the best you can do here if you want to keep it reasonably simple.
Yes! I've finally extracted all sorts of dates/temporal values that can be as generic as :
mid-March | Last Month | 9/11
To as specific as:
11/11/11 11:11:11
This finally happened thanks to the awesome GATE library and its JAPE rules.
I've created a more lenient annotation rule in JAPE, say 'DateEnhanced', to include certain kinds of dates like "9/11" or "11TH, February- 2001", and chained Java regexes on the R.H.S. of the 'DateEnhanced' JAPE rule to filter out some unwanted outputs.
I can recommend a very nice implementation for your problem, unfortunately in Polish: http://koziolekweb.pl/2015/04/15/throw-to-taki-inny-return/
You can use Google Translate:
https://translate.google.pl/translate?sl=pl&tl=en&js=y&prev=_t&hl=en&ie=UTF-8&u=http%3A%2F%2Fkoziolekweb.pl%2F2015%2F04%2F15%2Fthrow-to-taki-inny-return&edit-text=
The code there looks really nice:
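private static Date convertStringToDate(String s) {
    if (s == null || s.trim().isEmpty()) return null;
    // candidate patterns, most specific first (Lists is Guava's com.google.common.collect.Lists)
    ArrayList<String> patterns = Lists.newArrayList(YYYY_MM_DD_T_HH_MM_SS_SSS,
            YYYY_MM_DD_T_HH_MM_SS,
            YYYY_MM_DD_T_HH_MM,
            YYYY_MM_DD);
    for (String pattern : patterns) {
        try {
            return new SimpleDateFormat(pattern).parse(s);
        } catch (ParseException e) {
            // this pattern did not fit; try the next one
        }
    }
    // no pattern matched: treat the input as epoch milliseconds
    return new Date(Long.valueOf(s));
}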
mark.util.DateParser dp = new DateParser();
ParsePositionEx parsePosition = new ParsePositionEx(0);
Date startDate = dp.parse("12.September.2013", parsePosition);
System.out.println(startDate);
output: Thu Sep 12 17:18:18 IST 2013
mark.util.DateParser is part of a library that is used by the DateNormalizer PR, so in the JAPE file we just have to import it.

Parse any string to Sql date

I wonder if it's possible to parse any string (or at least to try) to an SQL Date without specifying the string format. In other words, I want to write a generic method that takes a string as input and returns an SQL Date.
For instance I have:
String date1="31/12/2099";
String date2="31-12-2099";
and call parseToSqlDate(date1) and parseToSqlDate(date2), which will return SQL dates.
Short answer: No
Why: Parsing any string to a valid date is a task you as an intelligent being could not do (there is no "logical" way to determine the correct date), so you cannot "tell" a computer(program) to do that for you (see JGrice's comment, and there we still have 4-digit years).
Long answer: Maybe, if you are willing to take risks or do not need a high rate of success.
How:
Define the minimal (format) requirements of a date. E.g. "a minimal date contains 1-8 digits": 01/01/2001, 01-01-01, 01.01 (+ current year), 1.1 (+ current year), 1 (+ current month + current year); and/or "it contains 1-6 digits and the letters of a month name": 01-Jan-2001; and so on.
Split the input along any characters that cannot belong to a number or a month name, with a regex like [^0-9a-zA-Z] (quick thought, may hold some pitfalls).
You now have 1 to 3 separate numbers (actually more if e.g. the time is included), possibly plus 1 month name, which can be assigned to year/month/day any way you like.
For this "alignment", there are several possibilities:
Try a fixed format first; if it "fits", take it, else try another (or fail).
(Only if you get more than one entry at a time) guess the format by assuming all entries share the same one (e.g. any number block containing values > 12 is not a month, and one containing values > 31 is not a day).
BUT, and this is a big one, you can expect any such method to become a major PITA at some point, because you can never fully "trust" it to guess correctly (you can never be sure that you have not missed some special format or introduced some ambiguous interpretation). I outlined some cases/formats, but definitely not all of them, so you will have to refine that method very often if you actually use it.
Appendix to your comment: "May be to add another parameter and in this way to know where goes day , month and so on?" So you are willing to add a "pseudo-format-string" parameter specifying the order of day, month and year; that would make it a lot easier (as "simply" filtering out the delimiters can then be achieved).
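A minimal sketch of the variant with the extra "pseudo-format-string" parameter, assuming purely numeric fields and an order hint such as "DMY". The class name LooseSqlDateParser and the two-argument signature are assumptions for the example (the question's method takes a single argument), and month names are not handled.

```java
import java.sql.Date;
import java.time.LocalDate;

class LooseSqlDateParser {

    // order is a three-letter hint made of D, M and Y, e.g. "DMY" for 31/12/2099.
    static Date parseToSqlDate(String text, String order) {
        // split along anything that is not a digit or a letter
        String[] parts = text.trim().split("[^0-9a-zA-Z]+");
        if (parts.length != 3 || order.length() != 3) {
            throw new IllegalArgumentException("expected three fields and a three-letter order");
        }
        int day = 0, month = 0, year = 0;
        for (int i = 0; i < 3; i++) {
            int value = Integer.parseInt(parts[i]); // numeric fields only in this sketch
            switch (order.charAt(i)) {
                case 'D': day = value; break;
                case 'M': month = value; break;
                case 'Y': year = value; break;
                default: throw new IllegalArgumentException("unknown order letter");
            }
        }
        // LocalDate.of rejects impossible dates such as 31 February
        return Date.valueOf(LocalDate.of(year, month, day));
    }

    public static void main(String[] args) {
        System.out.println(parseToSqlDate("31/12/2099", "DMY"));
        System.out.println(parseToSqlDate("31-12-2099", "DMY"));
    }
}
```

With the order given by the caller, the delimiter no longer matters, which is why both example strings from the question end up as the same date.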

Performance of HashMap

I have to process 450 unique strings about 500 million times. Each string has a unique integer identifier. There are two options I can use:
1. I can append the identifier to the string, and on arrival of the string I can split it to get the identifier and use it.
2. I can store the 450 strings in a HashMap<String, Integer>, and on arrival of the string I can query the HashMap to get the identifier.
Can someone suggest which option will be more efficient in terms of processing?
It all depends on the sizes of the strings, etc.
You can do all sorts of things.
You can use a binary search to get the index in a list, and at that index is the identifier.
You can hash just the first 2 characters, rather than the entire string, that would likely be faster than the binary search, assuming the strings have an OK distribution.
You can use the first character, or the first two characters, if they're unique, as a "perfect index" into a 256-entry or 64K-entry array that points to the identifier.
Also, if your identifier is numeric, it's better to pre-calculate that, rather than convert it on the fly all the time. Text -> Binary is actually rather expensive (Binary -> Text is worse). So it's probably nice to avoid that if possible.
But it behooves you to work the problem. 1 million anything at 1 ms each is about 17 minutes of processing. At 500m, every microsecond wasted adds up to 8+ minutes of extra processing (and every nanosecond to half a second). You may well not care; I'm just demonstrating that at these scales "every little bit helps".
So, don't take our word for it: test different things to find what gives you the best result for your work set, and then go with that. Also consider avoiding excessive object creation. Normally, I don't give it a second thought. Object creation is fast, but a nanosecond is a nanosecond.
If you're working in Java, and you don't REALLY need Unicode (i.e. you're working with single characters in the 0-255 range), I wouldn't use strings at all. I'd work with raw bytes. Strings are based on Java characters, which are UTF-16. Java Readers convert UTF-8 into UTF-16 every. single. time. 500 million times. Yup! Another few nanoseconds. 8 microseconds per string adds over an hour to your processing.
So, again, look in all the corners.
Or, don't, write it easy, fire it up, run it over the weekend and be done with it.
If each String has a unique identifier, then retrieval from a HashMap is O(1) on average.
I wouldn't suggest the first method, because you would be splitting every string on each of its arrivals (450 * 500m splits), unless your order is one string 500m times and then on to the next. As Will said, appending a number to the strings and then retrieving it might seem straightforward, but it is not recommended.
So if your data is static (just the 450 strings), put them in a HashMap and experiment with it. Good luck.
Use HashMap<String, Integer>. Splitting a string to get the identifier is an expensive operation because it involves creating new Strings.
I don't think anyone is going to be able to give you a convincing "right" answer, especially since you haven't provided all of the background / properties of the computation. (For example, the average length of the strings could make a lot of difference.)
So I think your best bet would be to write a benchmark ... using the actual strings that you are going to be processing.
I'd also look for a way to extract and test the "unique integer identifier" that doesn't entail splitting the string.
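To make the comparison concrete, here is a rough sketch of both options: reading an appended identifier in place, without split() or substring(), and looking the string up in a HashMap<String, Integer>. The '#' separator, the generated names and the class name IdLookupSketch are assumptions for the example, and the timing loop in main is deliberately crude.

```java
import java.util.HashMap;
import java.util.Map;

class IdLookupSketch {

    // Option 1: the id is appended after '#'; read the digits in place,
    // without creating any new String objects.
    static int idFromSuffix(String tagged) {
        int id = 0;
        for (int i = tagged.lastIndexOf('#') + 1; i < tagged.length(); i++) {
            id = id * 10 + (tagged.charAt(i) - '0');
        }
        return id;
    }

    // Option 2: a lookup table from string to id (here simply the array index).
    static Map<String, Integer> buildTable(String[] names) {
        Map<String, Integer> table = new HashMap<>();
        for (int id = 0; id < names.length; id++) {
            table.put(names[id], id);
        }
        return table;
    }

    public static void main(String[] args) {
        String[] names = new String[450];
        String[] tagged = new String[450];
        for (int i = 0; i < names.length; i++) {
            names[i] = "name-" + i;           // made-up strings; use the real ones
            tagged[i] = names[i] + "#" + i;
        }
        Map<String, Integer> table = buildTable(names);
        int rounds = 2_000;                   // 2,000 * 450 lookups per option
        long sum = 0;                         // keeps the loops from being optimised away

        long t0 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String s : tagged) sum += idFromSuffix(s);
        }
        long t1 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String s : names) sum += table.get(s);
        }
        long t2 = System.nanoTime();

        System.out.println("suffix parse: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("HashMap get:  " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("checksum: " + sum);
    }
}
```

The numbers from a loop like this are only a first impression, since JIT warm-up and the real string lengths matter a lot; a harness such as JMH run on the actual 450 strings would give results worth acting on.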
Splitting the string should work faster if you write your code well enough. In fact if you already have the int-id, I see no reason to send only the string and maintain a mapping.
Putting into HashMap would need hashing the incoming string every time. So you are basically comparing the performance of the hashing function vs the code you write to append (prepending might be a bit more tricky) on sending end and to parse on receiving end.
OTOH, only 450 strings aren't a big deal, and if you're into it, writing your own hashing algo/function would actually be the most elegant and performant.

How can I assign an int to a String variable?

I am currently making an assignment for Java but I am stuck. I have to make a birthdate from the three parameters: day, month and year, which are numbers = int. With this I have to put in some checks for valid dates. That part I think is done, but I get stuck at the following:
I want an if statement to check the day, and if the day is correct, this block of code should be run through:
if (dag >= 1 && dag <= 31)
{
    datum = dag;
}
datum is a String, because I want to get the date like this: DD-MM-YYYY
And dag is an int. So whenever I try to compile this, BlueJ gives an error at this part saying "incompatible types". I assume this is because I am trying to put an int into a String. Is this possible in any way? I can't find out how.
Use the String.valueOf method to convert an int to a String:
int i = 32;
String str = String.valueOf(i);
And of course follow the advice in @Brian's answer as to what you should rather do in your case.
Don't make it a string. It's not one. I think you should:
create a Date object to represent your date (day/month/year combined)
use SimpleDateFormat to print that date out in the appropriate format
That's the proper OO way to do it. Otherwise you end up with a bunch of disparate disconnected variables representing in their combination some object type, but you can't manipulate them atomically, invoke methods on them etc. Holding everything as strings is known as stringly-typing (as opposed to strongly-typing) and is a particularly bad code smell!
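A minimal sketch of that suggestion, assuming the three parameters are called dag, maand and jaar (only dag appears in the question, the other two names and the class name Birthdate are made up). A non-lenient Calendar does the validity check, so no hand-written range tests are needed.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class Birthdate {

    // Builds a Date from day, month and year and formats it as DD-MM-YYYY.
    // Throws IllegalArgumentException for impossible dates such as 31-02-2000.
    static String format(int dag, int maand, int jaar) {
        Calendar calendar = new GregorianCalendar();
        calendar.clear();
        calendar.setLenient(false);                       // reject out-of-range fields
        calendar.set(jaar, maand - 1, dag, 12, 0, 0);     // Calendar months start at 0
        Date date = calendar.getTime();                   // validation happens here
        return new SimpleDateFormat("dd-MM-yyyy", Locale.ROOT).format(date);
    }

    public static void main(String[] args) {
        System.out.println(format(13, 10, 1999));
    }
}
```

Keeping the value as a Date and formatting it only when it is displayed means checks such as "no 31st in February" come for free, which the plain 1 to 31 range test in the question cannot catch.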
At some stage check out Joda-Time for a better date/time API than those suggested above. However for the moment I suspect you've got enough on your plate without downloading extra jars.
