Regular expression for java.util.regex.Pattern - java

I am trying to create a suitable regular expression to use with java.util.regex.Pattern.
I am using the regular expression shown below to match Strings like so: feed_user_at_gmail_dot_com_testfile
final static Pattern PATTERN1 = Pattern.compile("feed_(.*)_([^_]*)");
This works as expected. But, I need to create another Pattern to match Strings like so: feed_user_at_gmail_dot_com_testfile_ts_20120413_dot_175531_dot_463
The difference is that the second String is a time-stamped version of the first String. These two Strings are examples of file names in my database, and I need to identify each of them separately. The time-stamped version has _ts_ appended, followed by the DATE as shown above. All dots in the DATE are changed to _dot_.
Thanks,
Sony

How about this:
"feed_(.*)_([^_]*)_ts_[0-9]+(_dot_[0-9]+)*"
Or better yet,
"feed_(.*)_([^_]*)_ts_[0-9]+(_dot_[0-9]+){2}"
if dates always have exactly two dots. (Use [0-9] rather than [1-9], since a date such as 20120413 contains zeros.)
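To tell the two kinds of names apart, something like the following might work. It is only a sketch: the class and method names are made up for illustration, and it assumes the time stamp always has exactly two _dot_ parts. Note that the original pattern also matches the time-stamped names, so the time-stamped check has to run first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
final class FeedNameClassifier {
    // Plain name: feed_<user>_<file>
    static final Pattern PLAIN = Pattern.compile("feed_(.*)_([^_]*)");
    // Time-stamped name: feed_<user>_<file>_ts_<date>_dot_<time>_dot_<millis>
    static final Pattern STAMPED =
            Pattern.compile("feed_(.*)_([^_]*)_ts_(\\d+(?:_dot_\\d+){2})");

    // Returns "stamped", "plain" or "none". STAMPED is tested first because
    // PLAIN on its own would also accept a time-stamped name.
    static String classify(String name) {
        if (STAMPED.matcher(name).matches()) return "stamped";
        if (PLAIN.matcher(name).matches()) return "plain";
        return "none";
    }

    // Returns the time stamp with _dot_ turned back into dots, or null.
    static String timestampOf(String name) {
        Matcher m = STAMPED.matcher(name);
        return m.matches() ? m.group(3).replace("_dot_", ".") : null;
    }

    public static void main(String[] args) {
        String stamped =
                "feed_user_at_gmail_dot_com_testfile_ts_20120413_dot_175531_dot_463";
        System.out.println(classify(stamped));
        System.out.println(timestampOf(stamped));
        System.out.println(classify("feed_user_at_gmail_dot_com_testfile"));
    }
}
```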

Related

Regex for matching multiple date formats?

Sorry if this is a noob question but I'm not very comfortable with regex and (as of now) this is a little beyond my understanding.
My dilemma is that we have a variety of ID badges that get scanned into an Android application and I'm trying to parse out some dates.
For example, some dates are represented like so:
"ISS20141231" format = yyyyMMdd desired output = "20141231"
"ISS12312014" format = MMddyyyy desired output = "12312014"
"ISS12-31-2014" format = MM-dd-yyyy desired output = "12312014"
Currently I have a regex pattern:
Pattern p = Pattern.compile("ISS(\\d{8})");
Matcher m = p.matcher(scanData);
which worked fine for the first two examples but recently I have realized that we also occasionally have dates which use dashes (or slashes) as separators.
Is there an efficient means for extracting these dates without having to write multiple patterns and loop through each one checking for a match?
possibly similar to: "ISS([\d{8} (\d{2}\w\d{2}\w\d{4}) (\d{4}\w\d{2}\w\d{2})])"
Thanks!!
[EDIT]
Just to make things a little bit more clear. The substring ("ISSMMddyyyy") is from a much larger string and could be located anywhere within it. So regex must search the original (200+ byte) string for a match.
If that date string is actually a substring of a larger string, and so you need the regex in order to also search for that pattern, you could modify your regex to be:
ISS([\\d\\-/]{8,10})
And then when retrieving the capture group, strip the hyphens and slashes.
String dateStr = m.group(1).replaceAll("[/\\-]", "");
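Put together, the search-then-strip approach could look roughly like this. The class and method names are invented for the example. One caveat: the character class is greedy, so an unseparated date that is immediately followed by more digits, dashes or slashes would capture up to ten characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
final class BadgeDates {
    // "ISS" followed by 8 to 10 digits, dashes or slashes.
    static final Pattern ISS_DATE = Pattern.compile("ISS([\\d\\-/]{8,10})");

    // Finds the first ISS date anywhere in the scan data and returns it
    // with the separators removed, or null if there is no match.
    static String extract(String scanData) {
        Matcher m = ISS_DATE.matcher(scanData);
        if (!m.find()) {
            return null;
        }
        return m.group(1).replaceAll("[/\\-]", "");
    }

    public static void main(String[] args) {
        System.out.println(extract("...ISS20141231..."));
        System.out.println(extract("...ISS12-31-2014 ..."));
        System.out.println(extract("...ISS12/31/2014 ..."));
    }
}
```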
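Alternatively, you can do two replacements: strip the ISS prefix first, then remove the / or - separators:
str = str.replaceFirst("^ISS", "").replaceAll("[/-]", "");
Or, to use a single regex for the separated formats:
Search: ISS([0-9]+)([-./])([0-9]+)([-./])([0-9]+)
Replace: $1$3$5
(In Java the replacement must be written $1$3$5; the ${...} form is reserved for named groups.)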

Informatica - replace characters in target

I'm working on a project in which we create hundreds of XML files in Informatica every day, and all the data in the XML should be filtered, for example by removing special characters such as * and +. You get the idea.
Adding regular expressions to every port is too complicated and not feasible given the large number of mappings we have.
I've added a custom property to the session, XMLAnyTypeToString=Yes;, and now I get some of the characters in their usual presentation (" + , ..) instead of as &abcd.
I'm hoping for some custom property or change in the XML target to remove these characters completely.
Any ideas?
You can make use of the String.replaceAll method:
String replaceAll(String regex, String replacement)
Replaces each substring of this string that matches the given regular expression with the given replacement.
You can create a set of symbols you want to remove using regex, for example:
[*+"#$%&\(",.\)]
and then apply it to your string:
String myString = "this contains **symbols** like these $#%#$%";
String cleanedString = myString.replaceAll("[*+\"#$%&]", "");
Now cleanedString is free of the symbols you've chosen. (The double quote inside the character class has to be escaped in the Java string literal.)
By the way, you can test your regular expression on this excellent site:
http://www.regexplanet.com/advanced/java/index.html
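If the same clean-up runs on many values, compiling the pattern once may be a little cleaner. This is just a sketch in plain Java; the class name and the exact set of symbols are assumptions, and it says nothing about how to hook it into Informatica.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
final class SymbolStripper {
    // Characters to drop. Inside a character class these are all literal;
    // only the double quote needs escaping, and only for the Java string.
    static final Pattern SYMBOLS = Pattern.compile("[*+\"#$%&(),.]");

    // Removes every listed symbol and leaves everything else untouched.
    static String strip(String input) {
        return SYMBOLS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("this contains **symbols** like these $#%#$%"));
    }
}
```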

Undoing automatic linkification using Java and Regex

I am working with a database whose entries contain automatically generated html links: each URL was converted to
<a href="URL">URL</a>
I want to undo these links: the new software will generate the links on the fly. Is there a way in Java to use .replaceAll or a Regex method that will replace the fragments with just the URL (only for those cases where the URLs match)?
To clarify, based on the questions below: the existing entries will contain one or more instances of linkified URLs. Showing an example of just one:
I visited <a href="http://www.amazon.com/">http://www.amazon.com/</a> to buy a book.
should be replaced with
I visited http://www.amazon.com/ to buy a book.
If the URL in the href differs in any way from the link text, the replacement should not occur.
You can use this pattern with replaceAll method:
<a (?>[^h>]++|\Bh|h(?!ref\b))*href\s*=\s*["']?(http://)?([^\s"']++)["']?[^>]*>\s*+(?:http://)?\2\s*+<\/a\s*+>
replacement: $1$2
I wrote the pattern in raw form, so don't forget to escape the double quotes and double the backslashes before using it in a Java string.
The main interest of this pattern is that urls are compared without the substring http:// to obtain more results.
First, a reminder that regular expressions are not great for parsing XML/HTML: this HTML should parse out the same as what you've got, but it's really hard to write a regex for it:
<
a
foo="bar"
href="URL">
<nothing/>URL
</a
>
That's why we say "don't use regular expressions to parse XML!"
But it's often a great shortcut. What you're looking for is a back-reference:
\1
This will match when the quoted string and the contents of the a-element are the same. The \1 matches whatever was captured in group 1. You can also use named capturing groups if you like a little more documentation in your regular expressions. See Pattern for more options.
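As a rough illustration of the back-reference idea, assuming the stored markup has the simple form <a href="URL">URL</a> with a double-quoted href and no other attributes (the class and method names are made up):

```java
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
final class Unlinkifier {
    // Group 1 captures the href value; \1 then requires the link text to be
    // exactly the same string, so links with different text are left alone.
    static final Pattern SELF_LINK =
            Pattern.compile("<a\\s+href=\"([^\"]*)\"\\s*>\\1</a>");

    // Replaces each self-referencing link with the bare URL.
    static String unlink(String html) {
        return SELF_LINK.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(unlink(
                "I visited <a href=\"http://www.amazon.com/\">http://www.amazon.com/</a> to buy a book."));
    }
}
```

Anything looser than this (single quotes, extra attributes, whitespace inside the tag) would need a more permissive pattern or a real HTML parser.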

Java Regex to find a string which starts with SDPCDR_

What regular expression should I use to match a string which starts with SDPCDR,
contains a date in the format 20120826, and ends with .asn?
an example string is SDPCDR_delsdp3a_6091_20120826-042451.asn
This would work:
^SDPCDR\w+(\d{8})-\w+\.asn$
"^SDPCDR.*\\d{8}.*\\.asn$"
Pretty generous on the date part, but the string is probably specific enough already to avoid false matches. If you're looking for a substring rather than trying to match the entire string, instead use
"SDPCDR.*?\\d{8}.*?\\.asn"
SDPCDR_[a-z0-9_]*[0-9]{8}-[0-9]*\\.asn
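If you also want the date back out, a capturing group does it. A minimal sketch, assuming the name always looks like the example (date, dash, digits, .asn); the class and method names are invented:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
final class CdrFileNames {
    // SDPCDR_<anything>_<yyyyMMdd>-<digits>.asn, with the dot escaped so
    // that it only matches a literal '.'.
    static final Pattern CDR =
            Pattern.compile("^SDPCDR_.*_(\\d{8})-\\d+\\.asn$");

    // Returns the eight-digit date, or null if the name does not match.
    static String dateOf(String fileName) {
        Matcher m = CDR.matcher(fileName);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(dateOf("SDPCDR_delsdp3a_6091_20120826-042451.asn"));
    }
}
```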

Pattern match numbers/operators

Hey, I've been trying to figure out why this regular expression isn't matching correctly.
List<String> l_operators = Arrays.asList(Pattern.compile("(\\d+)").split(rtString.trim()));
The input string is "12+22+3"
The output I get is -- [,+,+]
There's a match at the beginning of the list which shouldn't be there? I really can't see it and I could use some insight. Thanks.
Well, technically, there is an empty string in front of the first delimiter (first sequence of digits). If you had, say a line of CSV, such as abc,def,ghi and another one ,jkl,mno you would clearly want to know that the first value in the second string was the empty string. Thus the behaviour is desirable in most cases.
For your particular case, you need to deal with it manually, or refine your regular expression somehow. Like this for instance:
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(rtString);
if (m.find()) {
List l_operators = Arrays.asList(p.split(rtString.substring(m.end()).trim()));
// ...
}
Ideally, however, you should be using a parser for this type of string. You can't, for instance, deal with parentheses in expressions using just regular expressions.
That's the behavior of split in Java. You just have to take it (and deal with it) or use another library to split the string. I personally try to avoid Java's split.
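Another way to sidestep the leading empty string is to not split at all and instead collect the operators with find(). This is only a sketch with made-up names, and it treats any run of characters that are neither digits nor whitespace as one operator:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post.
final class OperatorScanner {
    // One or more characters that are neither digits nor whitespace.
    static final Pattern OPERATOR = Pattern.compile("[^\\d\\s]+");

    // Collects every operator in order of appearance. Nothing is split,
    // so no empty strings can show up in the result.
    static List<String> operators(String expression) {
        List<String> found = new ArrayList<>();
        Matcher m = OPERATOR.matcher(expression);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(operators("12+22+3"));
    }
}
```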
An example of one alternative is to look at Splitter from Google Guava.
Try Guava's Splitter.
Splitter.onPattern("\\d+").omitEmptyStrings().split(rtString)
