My database contains URLs stored as text fields and each URL contains a representation of the date of a report, which is missing from the report itself.
So I need to parse the date from the URL field to a String representation such as:
2010-10-12
2007-01-03
2008-02-07
What's the best way to extract the dates?
Some are in this format:
http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1st-2010.html
http://e.com/data/invoices/2010/09/invoices-report-thursday-september-2-2010.html
http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-15-2010.html
http://e.com/data/invoices/2010/09/invoices-report-monday-september-13th-2010.html
http://e.com/data/invoices/2010/08/invoices-report-monday-august-30th-2010.html
http://e.com/data/invoices/2009/05/invoices-report-friday-may-8th-2009.html
http://e.com/data/invoices/2010/10/invoices-report-wednesday-october-6th-2010.html
http://e.com/data/invoices/2010/09/invoices-report-tuesday-september-21-2010.html
Note the inconsistent use of th following the day of the month in cases such as these two:
http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-15-2010.html
http://e.com/data/invoices/2010/09/invoices-report-monday-september-13th-2010.html
Others are in this format (with three hyphens before the date starts, no year at the end and an optional use of invoices- before report):
http://e.com/data/invoices/2010/09/invoices-report---wednesday-september-1.html
http://e.com/data/invoices/2010/09/invoices-report---thursday-september-2.html
http://e.com/data/invoices/2010/09/invoices-report---wednesday-september-15.html
http://e.com/data/invoices/2010/09/invoices-report---monday-september-13.html
http://e.com/data/invoices/2010/08/report---monday-august-30.html
http://e.com/data/invoices/2009/05/report---friday-may-8.html
http://e.com/data/invoices/2010/10/report---wednesday-october-6.html
http://e.com/data/invoices/2010/09/report---tuesday-september-21.html
You want a regex like this:
"^http://e.com/data/invoices/(\\d{4})/(\\d{2})/\\D+(\\d{1,2})"
This exploits the fact that everything up through the /year/month/ part of the URL is always the same, and that no digit appears again until the day of the month. Once you have that, you don't care about anything else.
The first capture group is the year, the second the month, and the third the day. The day might not have a leading zero; convert it from string to integer and format as needed, or just check the string length and, if it is not two, prepend "0".
As an example:
import java.util.regex.*;

class URLDate {
    public static void main(String[] args) {
        String text = "http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1st-2010.html";
        String regex = "http://e.com/data/invoices/(\\d{4})/(\\d{2})/\\D+(\\d{1,2})";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(text);
        if (m.find()) {
            int count = m.groupCount();
            System.out.println("matched with groups:");
            // Group 0 is the whole match; groups 1..count are the captures.
            for (int i = 0; i <= count; ++i) {
                String group = m.group(i);
                System.out.format("\t%d: %s\n", i, group);
            }
        } else {
            System.out.println("failed to match!");
        }
    }
}
gives the output:
matched with groups:
0: http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1
1: 2010
2: 09
3: 1
(Note that to use Matcher.matches() instead of Matcher.find(), you would have to make the pattern eat the entire input string by appending .*$ to the pattern.)
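To get from the three capture groups to the yyyy-MM-dd strings the question asks for, one option is to hand the parsed integers to java.time.LocalDate, whose toString() already produces that format. The following is only a sketch under that assumption; the class name UrlDateExtractor and the method extractDate are made up for illustration, and the dots in the host name are escaped here, which the pattern above does not do.

```java
import java.time.LocalDate;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class UrlDateExtractor {
    // Same idea as the pattern above: year and month come from the path,
    // the day is the first run of digits in the file name.
    private static final Pattern URL_DATE = Pattern.compile(
            "^http://e\\.com/data/invoices/(\\d{4})/(\\d{2})/\\D+(\\d{1,2})");

    /** Returns the date as yyyy-MM-dd, or empty if the URL does not match. */
    static Optional<String> extractDate(String url) {
        Matcher m = URL_DATE.matcher(url);
        if (!m.find()) {
            return Optional.empty();
        }
        int year = Integer.parseInt(m.group(1));
        int month = Integer.parseInt(m.group(2));
        int day = Integer.parseInt(m.group(3));
        // LocalDate.of throws DateTimeException for impossible dates (e.g. month 13).
        return Optional.of(LocalDate.of(year, month, day).toString());
    }

    public static void main(String[] args) {
        System.out.println(extractDate(
                "http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1st-2010.html")
                .orElse("no match"));
        System.out.println(extractDate(
                "http://e.com/data/invoices/2009/05/report---friday-may-8.html")
                .orElse("no match"));
    }
}
```

Using LocalDate also takes care of the leading zero, so no manual padding is needed.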
Related
I have a String replacedtext which is:
Replaced text:OPTIONS (ERRORS=5000)
LOAD DATA
INFILE *
APPEND INTO TABLE REPO.test
Fields terminated by "," optionally enclosed BY '"'
trailing nullcols
(
CODE ,
user,
date DATE "MM/DD/YYYY"
)
I want to count the number of occurrences of REPO. in this whole string, so I tried it this way, but it is not working.
String[] words = replacedtext.split("\\s+");
int count=0;
for(String w:words){
if(w.equals("\\bREPO.\\b")){
count++;
}
}
System.out.println ("count is :"+count);
The output is:
count is :0
Since REPO. appears once in the string, the output should be count is :1.
w.equals("\\bREPO.\\b") compares the content of w with the text \\bREPO.\\b literally; equals does not interpret its argument as a regex, which is why the count stays at 0.
You can count the occurrences of REPO using the regex API.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String replacedText = """
Replaced text:OPTIONS (ERRORS=5000)
LOAD DATA
INFILE *
APPEND INTO TABLE REPO.test
Fields terminated by "," optionally enclosed BY '"'
trailing nullcols
(
CODE ,
user,
date DATE "MM/DD/YYYY"
)
""";
Matcher matcher = Pattern.compile("\\bREPO\\b").matcher(replacedText);
int count = 0;
while (matcher.find()) {
count++;
}
System.out.println(count);
}
}
Output:
1
Note that \b is a word-boundary matcher.
From Java SE 9 onwards, you can use Matcher#results to get a Stream on which you can call count, as shown below:
long count = matcher.results().count();
Note: I've used the Text Block feature to make the multi-line string more readable. Text blocks are a standard feature from Java 15; on earlier versions you can build the string with ordinary concatenation.
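There are probably two issues here. First, in a regex . is a metacharacter that matches any single character, so it has to be escaped (\\.) to match a literal dot.
The second issue is the equals method: it does not evaluate regular expressions at all, so the token REPO.test never equals the pattern text. You should use a Matcher, or filter by regex and then count.
I would do it like this: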
// Stream.count() returns a long, so the variable must be a long as well.
long count = Pattern.compile("\\s").splitAsStream(replacedtext)
        .filter(Pattern.compile("\\bREPO\\b").asPredicate())
        .count();
System.out.println(count);
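If the goal is to count the literal text REPO. including the dot, as the question states, one way to avoid the metacharacter problem altogether is to wrap the search text in Pattern.quote. This is a minimal sketch of that idea; the class LiteralCounter and the method countLiteral are hypothetical names chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answers.
final class LiteralCounter {
    /** Counts non-overlapping occurrences of literal in text. */
    static int countLiteral(String text, String literal) {
        // Pattern.quote makes every character of the literal match itself,
        // so the dot in "REPO." no longer means "any character".
        Matcher m = Pattern.compile(Pattern.quote(literal)).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "LOAD DATA\nINFILE *\nAPPEND INTO TABLE REPO.test\n";
        System.out.println("count is :" + countLiteral(text, "REPO."));
    }
}
```

Without the quoting, a token such as REPOx would also be counted, because the unescaped dot would match the x.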
I have a String like the ones shown below, from which I need to extract the number 123. The number can be at any position, but there will be only one number in a string and it will always be in the same _number_ format:
text_data_123
text_data_123_abc_count
text_data_123_abc_pqr_count
text_tery_qwer_data_123
text_tery_qwer_data_123_count
text_tery_qwer_data_123_abc_pqr_count
Below is the code:
String value = "text_data_123_abc_count";
// this below code will not work as index 2 is not a number in some of the above example
int textId = Integer.parseInt(value.split("_")[2]);
What is the best way to do this?
With a little Guava magic:
String value = "text_data_123_abc_count";
Integer id = Ints.tryParse(CharMatcher.inRange('0', '9').retainFrom(value)); // null if no digits are left
see also CharMatcher doc
\\d+
this regex with find should do it for you.
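Use lookbehind and lookahead assertions, so that only digits preceded by an underscore and followed by an underscore (or by the end of the string) match:
Matcher m = Pattern.compile("(?<=_)\\d+(?=_|$)").matcher(value);
while (m.find()) {
    System.out.println(m.group());
}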
You can use replaceAll to remove all non-digits to leave only one number (since you say there will be only 1 number in the input string):
String s = "text_data_123_abc_count".replaceAll("[^0-9]", "");
Instead of [^0-9] you can use \D (which also means non-digit):
String s = "text_data_123_abc_count".replaceAll("\\D", "");
Given current requirements and restrictions, the replaceAll solution seems the most convenient (no need to use Matcher directly).
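The replaceAll approach relies on the number being the only digits in the string. If another part might contain a digit (say v2_data_123), a Matcher that only accepts a whole underscore-delimited token is a little more robust. The sketch below assumes exactly that token rule; UnderscoreNumber and extract are hypothetical names used for illustration.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answers.
final class UnderscoreNumber {
    // A run of digits that starts at the beginning or after '_' and
    // ends at the end of the string or before '_'.
    private static final Pattern NUMBER_TOKEN = Pattern.compile("(?:^|_)(\\d+)(?=_|$)");

    /** Returns the first all-digit token, or empty if there is none. */
    static OptionalInt extract(String value) {
        Matcher m = NUMBER_TOKEN.matcher(value);
        // Integer.parseInt would throw for numbers beyond the int range.
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("text_data_123_abc_count"));
        System.out.println(extract("text_tery_qwer_data_123"));
    }
}
```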
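You can split the string into its parts and compare each part with its upper-case form. A part without letters, such as the number, is unchanged by toUpperCase(), so you can parse it and save the result. This assumes every other part contains at least one lower-case letter; an all-upper-case part would make Integer.parseInt throw.
public class Main {
    public static void main(String[] args) {
        String txt = "text_tery_qwer_data_123_abc_pqr_count";
        String[] words = txt.split("_");
        int num = 0;
        for (String t : words) {
            // Compare contents with equals(); == only compares references.
            if (t.equals(t.toUpperCase()))
                num = Integer.parseInt(t);
        }
        System.out.println(num);
    }
}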
I'm trying to parse an HTML tag. So far I have extracted the text, which can look like any of the following:
"Guide Price £50,000"
or
"£50,000"
or even
"£50,000 - £55,000"
In the third case to make things simpler all I need is the first price listed.
My question is: how can I convert these prices into an int or a double? I would prefer an int, as the numbers are quite large. Would a number formatter do this, or would I need a regular expression, especially if some text trails the tag block?
This is what I have so far:
String priceNumber = url.select("span.price").text(); // using the Jsoup library
priceNumber = priceNumber.replaceAll("[^\\d.]", "");
I think this removes everything that is not a digit or a dot.
If the text has two numbers in it, how do I get the first one?
Use a regex with Matcher.find to search for occurrences, then remove the commas and try to parse. Here's the decimal case:
String input = "£50,000 - £55,000";
Pattern regex = Pattern.compile("\\d[\\d,\\.]+");
Matcher finder = regex.matcher(input);
if( finder.find() ) { // or while() if you want to process each
try {
double value = Double.parseDouble(finder.group(0).replaceAll(",", ""));
// do something with value
} catch (NumberFormatException e ) {
// handle unparseable
}
}
You can convert a numeric String to an int or a double with Integer.parseInt(stringToConvert) or Double.parseDouble(stringToConvert) respectively.
In your first and second cases this would get you 50000, once the non-digit characters have been removed.
In the third case you need to split the string in two first and then repeat the trick.
Your title is a bit misleading, as you are not asking how to convert from pounds to, let's say, euros.
Use a regex to remove the unimportant characters and then parse the result as a double. You can then truncate to int if you only care about whole-pound values.
NumberFormat format = NumberFormat.getInstance();
double value = format.parse(priceNumber.replaceAll("[^\\d]*([\\d,]*).*", "$1")).doubleValue(); // parse() throws ParseException
The first part of the replace pattern, [^\\d]*, matches and throws away the leading non-digit characters; the second part, ([\\d,]*), captures the next run of digits and commas; the third part, .*, throws away the rest of the input.
The whole input is then replaced with the contents of the first capture group (the second part of the replace pattern).
Finally, the NumberFormat class parses the number (you could use Double.parseDouble() if it weren't for the comma).
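Since the question prefers an int and only needs the first price, the same idea can be sketched with Matcher.find and Integer.parseInt. This assumes whole-pound amounts with commas as thousands separators, as in the examples; FirstPrice and firstWholePounds are hypothetical names. The pound sign is written as the escape \u00a3 only to keep the source file ASCII.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answers.
final class FirstPrice {
    // A digit followed by any mix of digits and thousands separators.
    private static final Pattern AMOUNT = Pattern.compile("\\d[\\d,]*");

    /** Returns the first amount in the text as whole pounds, or empty if none is found. */
    static OptionalInt firstWholePounds(String text) {
        Matcher m = AMOUNT.matcher(text);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        // Drop the commas before parsing; find() stops at the first match,
        // so a second price in a range is ignored.
        return OptionalInt.of(Integer.parseInt(m.group().replace(",", "")));
    }

    public static void main(String[] args) {
        System.out.println(firstWholePounds("Guide Price \u00a350,000"));
        System.out.println(firstWholePounds("\u00a350,000 - \u00a355,000"));
    }
}
```

An OptionalInt is returned so that text without any price does not have to be signalled with a magic value.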
This will work I think!
String string = "This is £50,000 pounds, this is £5.00 pounds.";
String newString = string;
while (string.contains("£")) {
if (string.indexOf("£") != -1) {
// it contains £
string = string.substring(string.indexOf("£"));
newString = string.substring(0, string.indexOf(" "));
string = string.replaceFirst(newString, "");
newString = newString.replaceAll("£", "");
newString = newString.replaceAll(",", "");
double money = Double.parseDouble(newString);
System.out.println(money);
}
}
you can try this out (for all the cases),
String priceNumber = "£500001 wcjnwknv122333- £55,000";
String regex = "£(\\d+,?\\d+)\\D?";
Pattern p =Pattern.compile(regex);
Matcher m = p.matcher(priceNumber);
if(m.find()){
System.out.println(m.group(1));
}
Try below regex :
((\$|£)\d+\s|(\$|£)\d+-(\$|£)\d+\s)
I am trying to figure out how to write a regex that will match a time range. The time can look like this: 11:15-12:15 or 11-12:15 or 11-12 and so on. What I currently have is this:
\\d{2}:?\\d{0,2}-{1}\\d{2}:?\\d{0,2}
This works until a date comes along: the regex also matches part of a string such as 2013-11-05. I don't want it to find dates. I know I should use a lookbehind, but I can't get it to work.
I am using the Jsoup Element getElementsMatchingOwnText method, if that information is of any interest.
The time string is part of an HTML source like this (but with more text above and below):
<td class="text">2013-11-04</td>
Try this. Start with the base regex:
\d{1,2}(:\d\d)?-\d{1,2}(:\d\d)?
That is:
one-to-two digits, optionally followed by : and two more digits
followed by a hyphen
followed by one-to-two digits, optionally followed by : and two more digits
This matches all your core cases:
11-12
1-2
1:15-2
10-3:45
2:15-11:30
etc. Now mix in negative lookbehind and negative lookahead to invalidate matches that appear within undesired contexts. Let's invalidate the match when a digit or dash or colon appears directly to the left or right of the match:
The negative lookbehind: (?<!\d|-|:)
The negative lookahead: (?!\d|-|:)
Slap the neg-lookbehind at the beginning, and the neg-lookahead at the end, you get:
(?<!\d|-|:)(\d{1,2}(:\d\d)?-\d{1,2}(:\d\d)?)(?!\d|-|:)
or as a Java String (by request)
Pattern p = Pattern.compile("(?<!\\d|-|:)(\\d{1,2}(:\\d\\d)?-\\d{1,2}(:\\d\\d)?)(?!\\d|-|:)");
Now while the lookaround has eliminated matches within dates, you're still matching some silly things like 99:99-88:88 because \d matches any digit 0-9. You can mix more restrictive character classes into this regex to address that issue. For example, with a 12-hour clock:
For the hour part, use
(1[0-2]|0?[1-9])
instead of
\d{1,2}
For the minute part use
(0[0-9]|[1-5][0-9])
instead of
\d\d
Mixing the more restrictive character classes into the regex yields this nearly impossible to grok and maintain beast:
(?<!\d|-|:)(((1[0-2]|0?[1-9]))(:((0[0-9]|[1-5][0-9])))?-(1[0-2]|0?[1-9])(:((0[0-9]|[1-5][0-9])))?)(?!\d|-|:)
As Java code:
Pattern p = Pattern.compile("(?<!\\d|-|:)(((1[0-2]|0?[1-9]))(:((0[0-9]|[1-5][0-9])))?-(1[0-2]|0?[1-9])(:((0[0-9]|[1-5][0-9])))?)(?!\\d|-|:)");
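To see the effect of the lookarounds, the simpler pattern (without the 12-hour restrictions) can be run against both a date cell and a sentence with time ranges. This is only a small sketch; the class TimeRangeFinder and the method findTimeRanges are hypothetical names, and the input strings are taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class TimeRangeFinder {
    // Base pattern wrapped in a negative lookbehind and a negative lookahead,
    // so a match may not touch a digit, '-' or ':' on either side.
    private static final Pattern TIME_RANGE = Pattern.compile(
            "(?<!\\d|-|:)(\\d{1,2}(:\\d\\d)?-\\d{1,2}(:\\d\\d)?)(?!\\d|-|:)");

    /** Returns every time range found in the text, in order of appearance. */
    static List<String> findTimeRanges(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TIME_RANGE.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findTimeRanges("<td class=\"text\">2013-11-04</td>"));
        System.out.println(findTimeRanges("11:15-12:15 or 11-12:15 or 11-12"));
    }
}
```

The date yields an empty list because every candidate position inside it has a digit or a hyphen directly next to it.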
Simple method:
((\d{2}(:\d{2})?)-?){2}
A safer, more verbose regular expression:
([0-1]?[0-9]|[2][0-3])(:([0-5][0-9]))?-([0-1]?[0-9]|[2][0-3])(:([0-5][0-9]))?
Example in action:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class App {
private static final String TIME_FORMAT = "%02d:%02d";
private static final String TIME_RANGE = "([0-1]?[0-9]|[2][0-3])(:([0-5][0-9]))?-([0-1]?[0-9]|[2][0-3])(:([0-5][0-9]))?";
public static void main(String[] args) {
String passage = "The time can look like this: 11:15-12:15 or 11-12:15 or 11-12 and so on.";
Pattern pattern = Pattern.compile(TIME_RANGE);
Matcher matcher = pattern.matcher(passage);
int count = 0;
while (matcher.find()) {
String time1 = formattedTime(matcher.group(1), matcher.group(3));
String time2 = formattedTime(matcher.group(4), matcher.group(6));
System.out.printf("Time #%d: %s - %s\n", count, time1, time2);
count++;
}
}
private static String formattedTime(String strHour, String strMinute) {
int intHour = parseInt(strHour);
int intMinute = parseInt(strMinute);
return String.format(TIME_FORMAT, intHour, intMinute);
}
private static int parseInt(String str) {
return str != null ? Integer.parseInt(str) : 0;
}
}
Output:
Time #0: 11:15 - 12:15
Time #1: 11:00 - 12:15
Time #2: 11:00 - 12:00
I have a list with file names that look roughly like this: Gadget1-010912000000-020912235959.csv, i.e. they contain two dates indicating the timespan of their data.
The user enters a date format and a file format:
File Format in this case: *GADGET*-*DATE_FROM*-*DATE_TO*.csv
Date format in this case: ddMMyyHHmmss
What I want to do is extract the three values from the file name using the given file and date formats.
My problem is that the date format can differ heavily (hours, minutes and seconds can be separated by a colon, dates by a dot, ...), so I don't quite know how to create a fitting regular expression.
You can use a regular expression to remove the non-digit characters, and then parse the value.
DateFormat dateFormat = new SimpleDateFormat("ddMMyyHHmmss");
String[] fileNameDetails = ("Gadget1-010912000000-020912235959").split("-");
/* Remove all non-digit characters; if there are none, the original string is kept. */
String date = fileNameDetails[1].replaceAll("[^0-9]", "");
try {
    // Parse the cleaned string, not the raw file-name part.
    Date parsed = dateFormat.parse(date);
} catch (ParseException e) {
    // The value does not match the expected date format.
}
Hope it helps.
SimpleDateFormat solves your issue. You can define the format with commas, spaces and whatever and simply parse according to the format:
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
So you map your format (e.g ddMMyyHHmmss) to a corresponding SimpleDateFormat.
SimpleDateFormat format = new SimpleDateFormat("ddMMyyHHmmss");
Date x = format.parse("010912000000");
If the format changes, you simply change the SimpleDateFormat
You can use a series of date-time formats, trying each until one works.
You may need to order the formats to prioritize matches.
For example, with Joda time, you can use DateTimeFormat.forPattern() and DateTimeFormatter.getParser() for each of a series of patterns. Try DateTimeParser.parseInto() until one succeeds.
One nice thing about this approach is that it is easy to add and remove patterns.
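The try-each-format idea can be sketched as follows. Note that this sketch swaps Joda-Time for the JDK's own java.time API so that it runs without extra dependencies; the two patterns in the list are only examples, and MultiFormatParser is a hypothetical name.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper using java.time instead of Joda-Time.
final class MultiFormatParser {
    // Order matters: the first formatter that accepts the text wins.
    private static final List<DateTimeFormatter> FORMATS = Arrays.asList(
            DateTimeFormatter.ofPattern("ddMMyyHHmmss"),
            DateTimeFormatter.ofPattern("dd.MM.yy HH:mm:ss"));

    /** Tries each known format in turn; empty if none of them fits. */
    static Optional<LocalDateTime> parse(String text) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDateTime.parse(text, format));
            } catch (DateTimeParseException e) {
                // This format does not fit; fall through to the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("010912000000"));
        System.out.println(parse("02.09.12 23:59:59"));
        System.out.println(parse("not a date"));
    }
}
```

Adding or removing a supported format is then a one-line change to the list.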
Use Pattern and Matcher class.
Look at the example:
String inputDate = "01.09.12.00:00:00";
Pattern pattern = Pattern.compile(
"([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})");
Matcher matcher = pattern.matcher(inputDate);
matcher.find();
StringBuilder cleanStr = new StringBuilder();
for(int i = 1; i <= matcher.groupCount(); i++) {
cleanStr.append(matcher.group(i));
}
SimpleDateFormat format = new SimpleDateFormat("ddMMyyHHmmss");
Date x = format.parse(cleanStr.toString());
System.out.println(x.toString());
The most important part is the line
Pattern pattern = Pattern.compile(
"([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})");
Here you define the regexp and mark groups with parentheses, so ([0-9]{2}) marks a group. Each group is followed by an expression for a possible delimiter: [\\.]{0,1} means zero or one dot. You can allow more delimiters by adding them to the character class, for example [\\./-]{0,1}.
Then you run matcher.find(), which returns true if the pattern matches. Using matcher.group(int) you can then read the groups one by one. Note that the index of the first group is 1.
Then I construct the clean date String using a StringBuilder, and then parse the date.
Cheers,
Michal
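For the file-format half of the question, one possible approach is to translate the user's template into a regex with one named group per placeholder and quote everything else. The sketch below assumes the template contains all three placeholders (*GADGET*, *DATE_FROM*, *DATE_TO*) and that the literal separators do not occur inside the values; FileNameTemplate, toRegex and extract are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answers.
final class FileNameTemplate {
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\*(GADGET|DATE_FROM|DATE_TO)\\*");

    /** Turns a template such as *GADGET*-*DATE_FROM*-*DATE_TO*.csv into a regex. */
    static Pattern toRegex(String template) {
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders must match exactly.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            // Java group names may not contain '_', so DATE_FROM becomes DATEFROM.
            regex.append("(?<").append(m.group(1).replace("_", "")).append(">.+?)");
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));
        return Pattern.compile(regex.toString());
    }

    /** Returns the three raw values, or an empty map if the name does not fit the template. */
    static Map<String, String> extract(String template, String fileName) {
        Map<String, String> parts = new LinkedHashMap<>();
        Matcher m = toRegex(template).matcher(fileName);
        if (m.matches()) {
            parts.put("GADGET", m.group("GADGET"));
            parts.put("DATE_FROM", m.group("DATEFROM"));
            parts.put("DATE_TO", m.group("DATETO"));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(extract("*GADGET*-*DATE_FROM*-*DATE_TO*.csv",
                "Gadget1-010912000000-020912235959.csv"));
    }
}
```

The two captured date strings can then be handed to a SimpleDateFormat built from the user's date format, so the regex never needs to know how the date itself is laid out.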