Strip date from String with junk data - Java

I need to know if there is any way to strip just the date from text like the examples below using Java. I am trying to find something generic because the input varies, but I couldn't find any help.
Some example inputs:
This time is Apr.19,2021 Cheers
or
19-04-2021 Cheers
or
This time is 19-APR-2021
I have seen code that works for trailing junk characters, but couldn't find anything that works when the date sits in the middle of the string or when its format varies.

We could use String#replaceAll here for a regex one-liner approach:
String[] inputs = new String[] {
"This time is Apr.19,2021 Cheers",
"19-04-2021 Cheers",
"This time is 19-APR-2021",
"Hello 2021-19-Apr World"
};
for (String input : inputs) {
String date = input.replaceAll(".*(?<!\\S)(\\S*\\b\\d{4}\\b\\S*).*", "$1");
System.out.println(date);
}
This prints:
Apr.19,2021
19-04-2021
19-APR-2021
2021-19-Apr

If you assume a "date" is any series of letter/number/dot/comma/dash chars that ends with a 4-digit "word", match that and replace it with an empty string to delete it:
str = str.replaceAll("\\b[A-Za-z0-9.,-]+\\b\\d{4}\\b", "");
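If the goal is to pull the date token out rather than delete it, the same assumption (a date is the whitespace-delimited token that contains a standalone 4-digit year) can be used with Matcher.find(). This is only a sketch: the class and method names are made up for illustration, and the heuristic will also pick up any non-date token that happens to contain a 4-digit number.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class DateTokenFinder {
    // A token (run of non-space chars) that contains a standalone 4-digit number.
    private static final Pattern DATE_TOKEN =
            Pattern.compile("(?<!\\S)\\S*\\b\\d{4}\\b\\S*");

    // Returns the first matching token, or an empty Optional if none is found.
    static Optional<String> findDateToken(String input) {
        Matcher m = DATE_TOKEN.matcher(input);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "This time is Apr.19,2021 Cheers",
            "19-04-2021 Cheers",
            "This time is 19-APR-2021",
            "No date in here"
        };
        for (String input : inputs) {
            System.out.println(findDateToken(input).orElse("<none>"));
        }
    }
}
```

Returning an Optional makes the "no date found" case explicit, which the replaceAll one-liner cannot signal: it simply returns the input unchanged.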

Related

Regular expression to find everything except a pattern

I'm pretty new to regular expressions and am looking for one that matches anything except all that matches a given regex. I've found ways to find anything except a specific string, but I need it to not match a regex. Also it has to work in Java.
Background: I am working with Ansi-colored strings. I want to take a string that has some text that may be formatted with Ansi color codes and remove anything except those color codes. This should give me the current color formatting for any character appended onto the string.
A formatted string may look like this:
Hello \u001b[31;44mWorld\u001b[0m!
which would display as Hello World! where the World would be colored red on a blue background.
My regex to find the codes is
\u001b\[\d+(;\d+)*m
Now I want a regex that matches everything but the color codes, so it matches
Hello \u001b[31;44m World \u001b[0m !

Your regex in context:
public static void main(String[] args) {
String input = "Hello \u001b[31;44mWorld\u001b[0m!";
String result = Pattern.compile("\u001b\\[\\d+(;\\d+)*m").matcher(input).replaceAll("");
System.out.println("Output: '" + result + "'");
}
Output:
Output: 'Hello World!'

Regex isn't really meant to give 'everything but' the regex match. The easiest general way to get there is to match the pattern you already have (the color codes in your case) and then remove those matches from the string. What is left is 'everything but' the match.
Quick sample (untested):
String everythingBut = "string that has regex matches".replaceAll("r[eg]+x ", "");
This should result in "string that has matches", i.e. the inverse of your regex.

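String text = "Hello \u001b[31;44mWorld\u001b[0m!";
String codes = text;
// Split on the codes, then delete every non-code piece from the original text.
// The split regex does not include the escape character, so it is deleted too.
for (String s : text.split("\\[([;0-9]+)m")) {
    codes = codes.replace(s, "");
}
System.out.println(codes);
OUTPUT:
[31;44m[0m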

You can do it like this. It simply finds all the matches and puts them in an array which can be joined to a String if desired.
String pat = "\u001b\\[\\d+(;\\d+)*m";
String html = "Hello \u001b[31;44mWorld\u001b[0m!";
Matcher m = Pattern.compile(pat).matcher(html);
String[] s = m.results().map(mr->mr.group()).toArray(String[]::new);
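Matcher.results() was added in Java 9. On older versions the same idea can be sketched with a find() loop that appends each match to a StringBuilder, which also covers the "joined to a String" step. The class and method names here are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class AnsiCodes {
    // Same pattern as in the question: ESC [ digits (; digits)* m
    private static final Pattern CODE = Pattern.compile("\u001b\\[\\d+(;\\d+)*m");

    // Keeps only the ANSI color codes and drops all other text.
    static String keepOnlyCodes(String text) {
        StringBuilder sb = new StringBuilder();
        Matcher m = CODE.matcher(text);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String codes = keepOnlyCodes("Hello \u001b[31;44mWorld\u001b[0m!");
        // Show the escape character visibly instead of sending it to the terminal.
        System.out.println(codes.replace("\u001b", "\\u001b"));
    }
}
```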

Java: get input string and change the format

I am getting an input string containing digits with comma (,) separated like these formats
1) X,X
2) X,XX
3) XX,X
My desired format is XX,XX.
If I get an input string in any of the formats 1, 2, or 3 above, I want it converted to my desired format XX,XX.
For example,
1) If I get a string in this format 1,12. I want to put a zero before 1 like this 01,12.
2) If I get a string in this format 1,1. I want to put a zero before and after 1 like this 01,10.
3) If I get a string in this format 11,1. I want to put a zero after the last 1 like this 11,10.
Any help will be highly appreciated, thanks in advance.

You can use a regex with lookarounds to insert the zeros:
(?=^\d,\d\d$)|(?<=^\d\d,\d$)|(?<=^\d,\d$)|(?=^\d,\d$)
The pattern covers the three input shapes you gave. Each alternative matches the empty position where a digit is missing, and that empty match is replaced by a zero.
Sample code:
String regexPattern="(?=^\\d,\\d\\d$)|(?<=^\\d\\d,\\d$)|(?<=^\\d,\\d$)|(?=^\\d,\\d$)";
System.out.println("1,12".replaceAll(regexPattern, "0"));
System.out.println("1,1".replaceAll(regexPattern, "0"));
System.out.println("11,1".replaceAll(regexPattern, "0"));
output:
01,12
01,10
11,10

Feed in your number to the function, and get the desired String result.
public static String convert(String s){
String arr[] = s.split(",");
if(arr[0].length()!=2){
arr[0] = "0"+arr[0];
}
if(arr[1].length()!=2){
arr[1] = arr[1]+"0";
}
return arr[0]+","+arr[1];
}
But it only works in the format described above.
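A variant that checks the input shape first could look like the sketch below, so that unexpected input fails loudly instead of producing a wrong result. The class name, method name, and the choice to throw IllegalArgumentException are assumptions made for illustration.

```java
// Hypothetical helper, not part of any library.
class CommaPad {
    // Pads "X,X", "X,XX" and "XX,X" to "XX,XX"; "XX,XX" is returned unchanged.
    static String pad(String s) {
        if (!s.matches("\\d{1,2},\\d{1,2}")) {
            throw new IllegalArgumentException("Unexpected format: " + s);
        }
        String[] parts = s.split(",");
        // A single digit on the left gets a leading zero.
        String left = parts[0].length() == 1 ? "0" + parts[0] : parts[0];
        // A single digit on the right gets a trailing zero.
        String right = parts[1].length() == 1 ? parts[1] + "0" : parts[1];
        return left + "," + right;
    }

    public static void main(String[] args) {
        System.out.println(pad("1,12")); // 01,12
        System.out.println(pad("1,1"));  // 01,10
        System.out.println(pad("11,1")); // 11,10
    }
}
```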

If your goal is to print these strings, you could use the format method, and leading and trailing zeroes.
https://docs.oracle.com/javase/tutorial/java/data/numberformat.html

Object[] splitted = input.split(",");
System.out.println(String.format("%2s,%-2s", splitted).replace(' ','0'));

How to split a string in Java using "%*%" as separator, including the separator in the result list of strings?

I'm looking for the simplest way of tokenizing strings such as
INPUT OUTPUT
"hello %my% world" -> "hello ", "%my%", " world"
in Java. Is it possible to accomplish this with regex? I am basically looking for a String.split() that takes as separator something of the form "%*%" but that won't ignore it, as it seems to generally do.
Thanks

No, you can't do this the way you explained it. The reason is--it's ambiguous!
You give the example:
"hello %my% world" -> "hello ", "%my%", " world"
Should the % be attached to the string before it or after it?
Should the output be
"hello ", "%my", "% world"
Or, perhaps the output should be
"hello %", "my%", " world"
In your example you don't follow either of these rules. You come up with %my% which attaches the delimiter first to the string after it appears and then to the string before it appears.
Do you see the ambiguity?
So, you first need to come up with a clear set of rules about where the delimiter should be attached. Once you do this, one simple (although not particularly efficient, since Strings are immutable) way of achieving what you want is to:
1) Use String.split() to split the string in the normal way.
2) Follow your rule set to re-add the delimiter where it belongs in each piece.

A simpler solution would be to just split the string by %s. That way, every other subsequence would have been between %s. All you have to do afterwards is iterate over the results, toggling a flag to know if the result is a regular string or one between %s.
Pay special attention to how the split implementation handles empty subsequences. Some implementations discard empty subsequences at the beginning/end of the input, others discard all empty subsequences, and others discard none of them.
This would not result in the exact output that you want, since the %s would be gone. However you can easily add those back if there is an actual need for them (and I presume there isn't).
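The flag-toggling idea can be sketched in Java as follows. It assumes the % signs in the input are balanced (an even count); the class and method names are made up for illustration. In Java, String.split(regex) drops trailing empty strings, so the sketch passes a limit of -1 to keep them and keep the inside/outside toggle in sync.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class PercentTokens {
    // Splits on '%' and re-wraps every second piece in '%' signs.
    // Assumes the '%' signs in the input are balanced.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean insidePercents = false;
        // Limit -1 keeps trailing empty strings.
        for (String part : input.split("%", -1)) {
            if (insidePercents) {
                tokens.add("%" + part + "%");
            } else if (!part.isEmpty()) {
                tokens.add(part); // plain text; empty plain pieces are skipped
            }
            insidePercents = !insidePercents;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [hello , %my%,  world]
        System.out.println(tokenize("hello %my% world"));
    }
}
```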

Why not split on the spaces between your words? In that case you will get "hello", "%my%", "world".

If possible, use a simpler delimiter. And I'm okay with jury-rigging "%" as your delimiter, just so you can get String.split() instead of regexps. But if that's not possible...
Regexps! You can parse this using a Matcher. If you know there's one delimiter per line, you specify a pattern that eats the whole line:
String singleDelimRegexp = "(.*)(%[^%]*%)(.*)";
Pattern singleDelimPattern = Pattern.compile(singleDelimRegexp);
Matcher singleDelimMatcher = singleDelimPattern.matcher(input);
if (singleDelimMatcher.matches()) {
String before = singleDelimMatcher.group(1);
String delim = singleDelimMatcher.group(2);
String after = singleDelimMatcher.group(3);
System.out.println(before + "//" + delim + "//" + after);
}
If the input is long and you need a chain of results, you use Matcher in a loop:
String multiDelimRegexp = "%[^%]*%";
Pattern multiDelimPattern = Pattern.compile(multiDelimRegexp);
Matcher multiDelimMatcher = multiDelimPattern.matcher(input);
int lastEnd = 0;
while (multiDelimMatcher.find()) {
String data = input.substring(lastEnd, multiDelimMatcher.start());
String delim = multiDelimMatcher.group();
lastEnd = multiDelimMatcher.end();
System.out.println(data);
System.out.println(delim);
}
String lastData = input.substring(lastEnd);
System.out.println(lastData);
Add those to a data structure as you go, and you'll build the whole parsed input.
Running on input: http://ideone.com/s8FzeW

Extracting dates from string

I have a list with file names that look roughly like this: Gadget1-010912000000-020912235959.csv, i.e. they contain two dates indicating the timespan of their data.
The user enters a date format and a file format:
File Format in this case: *GADGET*-*DATE_FROM*-*DATE_TO*.csv
Date format in this case: ddMMyyHHmmss
What I want to do is extracting the three values out of the file name with the given file and date format.
My problem is: since the date format can differ heavily (hours, minutes and seconds can be separated by a colon, dates by a dot, ...), I don't quite know how to create a fitting regular expression.

You can use a regular expression to remove non-digit characters, and then parse the value.
DateFormat dateFormat = new SimpleDateFormat("ddMMyyHHmmss");
String[] fileNameDetails = ("Gadget1-010912000000-020912235959").split("-");
/* Remove all non-digit characters; a string with digits only is left unchanged. */
String date = fileNameDetails[1].replaceAll("[^0-9]", "");
try {
    // Parse the cleaned value, not the raw token.
    Date parsed = dateFormat.parse(date);
    System.out.println(parsed);
} catch (ParseException e) {
    // The cleaned value did not match the date format.
    e.printStackTrace();
}
Hope it helps.

SimpleDateFormat solves your issue. You can define the format with commas, spaces and whatever and simply parse according to the format:
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
So you map your format (e.g ddMMyyHHmmss) to a corresponding SimpleDateFormat.
SimpleDateFormat format = new SimpleDateFormat("ddMMyyHHmmss");
Date x = format.parse("010912000000");
If the format changes, you simply change the SimpleDateFormat
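To get the three values out of the file name in the first place, one option is to turn the user's file format into a regex with one named group per placeholder and then hand the date groups to SimpleDateFormat. The sketch below assumes the placeholders are exactly *GADGET*, *DATE_FROM* and *DATE_TO* as in the question and that all three are present; the class and method names are invented for illustration. The groups match lazily, so a date format that itself contains the literal separator (here "-") could be split in the wrong place.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class FileNameParser {
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\*(GADGET|DATE_FROM|DATE_TO)\\*");

    // Quotes the literal parts of the template and turns placeholders into named groups.
    static Pattern toPattern(String template) {
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            regex.append(Pattern.quote(template.substring(last, m.start())));
            // Java group names may not contain '_', so DATE_FROM becomes DATEFROM.
            String groupName = m.group(1).replace("_", "");
            regex.append("(?<").append(groupName).append(">.+?)");
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));
        return Pattern.compile(regex.toString());
    }

    // Returns { gadget, dateFrom, dateTo } as raw strings.
    static String[] extract(String template, String fileName) {
        Matcher m = toPattern(template).matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("File name does not fit the format: " + fileName);
        }
        return new String[] { m.group("GADGET"), m.group("DATEFROM"), m.group("DATETO") };
    }

    public static void main(String[] args) throws ParseException {
        String[] parts = extract("*GADGET*-*DATE_FROM*-*DATE_TO*.csv",
                "Gadget1-010912000000-020912235959.csv");
        SimpleDateFormat format = new SimpleDateFormat("ddMMyyHHmmss");
        Date from = format.parse(parts[1]);
        Date to = format.parse(parts[2]);
        System.out.println(parts[0] + " from " + from + " to " + to);
    }
}
```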

You can use a series of date-time formats, trying each until one works.
You may need to order the formats to prioritize matches.
For example, with Joda time, you can use DateTimeFormat.forPattern() and DateTimeFormatter.getParser() for each of a series of patterns. Try DateTimeParser.parseInto() until one succeeds.
One nice thing about this approach is that it is easy to add and remove patterns.
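The try-each-format idea can be sketched without Joda as well. The version below swaps in java.time (Java 8+) rather than the Joda calls named above, and the two patterns in the list are only example formats, not a complete set. The class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of any library.
class MultiFormatParser {
    // Order matters: the first format that parses the whole text wins.
    private static final List<DateTimeFormatter> FORMATS = Arrays.asList(
            DateTimeFormatter.ofPattern("ddMMuuHHmmss"),
            DateTimeFormatter.ofPattern("dd.MM.uu HH:mm:ss"));

    // Tries each format in turn; empty if none of them fits.
    static Optional<LocalDateTime> parse(String text) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDateTime.parse(text, format));
            } catch (DateTimeParseException e) {
                // Not this format; fall through to the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("010912000000"));
        System.out.println(parse("02.09.12 23:59:59"));
        System.out.println(parse("not a date"));
    }
}
```

Adding or removing a supported format is then a one-line change to the list.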

Use Pattern and Matcher class.
Look at the example:
String inputDate = "01.09.12.00:00:00";
Pattern pattern = Pattern.compile(
"([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})");
Matcher matcher = pattern.matcher(inputDate);
matcher.find();
StringBuilder cleanStr = new StringBuilder();
for(int i = 1; i <= matcher.groupCount(); i++) {
cleanStr.append(matcher.group(i));
}
SimpleDateFormat format = new SimpleDateFormat("ddMMyyHHmmss");
Date x = format.parse(cleanStr.toString());
System.out.println(x.toString());
The most important part is line
Pattern pattern = Pattern.compile(
"([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})[:]{0,1}([0-9]
Here you define the regexp and mark groups with parentheses, so ([0-9]{2}) marks a group. After each group comes the expression for an optional delimiter, [\\.]{0,1}, in this case 0 or 1 dot. You can allow more delimiters by adding them to the character class, for example [\\.:-]{0,1}.
Then you run matcher.find(), which returns true if the pattern is found in the input. After that, matcher.group(int) lets you read the groups one by one. Note that the index of the first group is 1.
Then I build a clean date String using StringBuilder and parse it.
Cheers,
Michal

Best way to get Tokens in java

I have files with some naming conventions -
Ex 1 - filename1.en.html.xslt
Ex 2 - filename2.de.text.xslt
where en/de - language, html/text - output
I need to read individual files and populate the java object accordingly.
Also, en should be converted to en-US etc, while populating the language field.
Format.java
private String language;
private String output;
What is the best way to do this? I know it can be done with plain indexOf, with a StringTokenizer, or by parsing with a regex.
If regex is better, could you share a code sample?

It really doesn't matter how you parse the filename as long as it works for you. If you want to take the regex route, a Pattern like this will work:
Pattern p = Pattern.compile("([^.]+)\\.([^.]+)\\.([^.]+)\\.xslt");
The first capture group is the filename, the second is the language, and the third is the output.
That said, a regex does seem like overkill, so what's wrong with using String#split()?
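Put together, the regex route might look like the sketch below. Only the en to en-US mapping comes from the question; the de to de-DE entry, the fallback of returning the code unchanged, and the method names are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the Format class from the question, filled from a file name.
class Format {
    private static final Pattern NAME =
            Pattern.compile("([^.]+)\\.([^.]+)\\.([^.]+)\\.xslt");

    private final String language;
    private final String output;

    private Format(String language, String output) {
        this.language = language;
        this.output = output;
    }

    // Parses names such as "filename1.en.html.xslt".
    static Format fromFileName(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        // Group 1 is the base name, group 2 the language, group 3 the output.
        return new Format(toLanguageTag(m.group(2)), m.group(3));
    }

    // Assumed mapping: only "en" -> "en-US" is given in the question.
    static String toLanguageTag(String code) {
        switch (code) {
            case "en": return "en-US";
            case "de": return "de-DE"; // assumption
            default:   return code;    // unknown codes are passed through
        }
    }

    String getLanguage() { return language; }
    String getOutput() { return output; }

    public static void main(String[] args) {
        Format f = fromFileName("filename1.en.html.xslt");
        System.out.println(f.getLanguage() + " / " + f.getOutput()); // en-US / html
    }
}
```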

You could do it with StringTokenizer, but String.split() should mostly do the trick.
String foo = "filename1.en.html.xslt";
String[] parts = foo.split("\\."); // regex: need to escape dot
System.out.println(parts[1]); // outputs "en"
With StringTokenizer you could do:
String foo = "filename1.en.html.xslt";
StringTokenizer tokenizer = new StringTokenizer(foo, ".");
List<String> parts = new ArrayList<String>();
while(tokenizer.hasMoreTokens()) {
String part = tokenizer.nextToken();
parts.add(part);
}
System.out.println(parts.get(1)); // "en"
