Regex to capture groups - java

My group could either be of the form x/y, x.y or x_y.z. Each group is separated by an underscore. The groups are unordered.
Example:
ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno
I would like to capture the following:
ABC/DEF
abc.def
PQR/STU
ghi_jkl.mno
I have done this using a fairly verbose string iteration and parsing method (shown below), but am wondering if a simple regex can accomplish this.
private static ArrayList<String> go(String s){
ArrayList<String> list = new ArrayList<String>();
boolean inSlash = false;
int pos = 0 ;
boolean inDot = false;
for(int i = 0 ; i < s.length(); i++){
char c = s.charAt(i);
switch (c) {
case '/':
inSlash = true;
break;
case '_':
if(inSlash){
list.add(s.substring(pos,i));
inSlash = false;
pos = i+1 ;
}
else if (inDot){
list.add(s.substring(pos,i));
inDot = false;
pos = i+1;
}
break;
case '.':
inDot = true;
break;
default:
break;
}
}
list.add(s.substring(pos));
System.out.println(list);
return list;
}

Have a try with:
((?:[^_./]+/[^_./]+)|(?:[^_./]+\.[^_./]+)|(?:[^_./]+(?:_[^_./]+)+\.[^_./]+))
I don't know java syntax but in Perl:
#!/usr/bin/perl
use 5.10.1;
use strict;
use warnings;
my $str = q!ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno_a_b_c.z_a_b_c_d.z_a_b_c_d_e.z!;
my $re = qr!((?:[^_./]+/[^_./]+)|(?:[^_./]+\.[^_./]+)|(?:[^_./]+(?:_[^_./]+)+\.[^_./]+))!;
while($str=~/$re/g) {
say $1;
}
will produce:
ABC/DEF
abc.def
PQR/STU
ghi_jkl.mno
a_b_c.z
a_b_c_d.z
a_b_c_d_e.z
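For the Java side of the question, the same alternation can be used with java.util.regex. The following is only a sketch: the class name GroupExtractor and the method extract are made up for illustration, and the outer capturing group is dropped because the whole match is the group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupExtractor {
    // Same three alternatives as the Perl pattern above, written as a Java string.
    private static final Pattern GROUP = Pattern.compile(
            "[^_./]+/[^_./]+"                      // x/y
          + "|[^_./]+\\.[^_./]+"                   // x.y
          + "|[^_./]+(?:_[^_./]+)+\\.[^_./]+");    // x_y.z (one or more underscores before the dot)

    // Collects every match in order of appearance.
    static List<String> extract(String s) {
        List<String> groups = new ArrayList<>();
        Matcher m = GROUP.matcher(s);
        while (m.find()) {
            groups.add(m.group());
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(extract("ABC/DEF_abc.def_PQR/STU_ghi_jkl.mno"));
    }
}
```

The separating underscores are simply skipped, because no alternative can start on an underscore.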

There might be a problem with the underscore since it's not always a separator.
Maybe: ((?<=_)\w+_)?\w+[./]\w+

This regex would probably do (tested with .Net regular expressions):
[a-zA-Z]+[./][a-zA-Z]+|[a-zA-Z]+_[a-zA-Z]+\.[a-zA-Z]+
(If you know your input is well formed there is no need to explicitly match the separator)

This one goes with positive lookahead instead of alternations
[A-Za-z]+(_(?=[A-Za-z]+\.[A-Za-z]+))?[A-Za-z]+[/.][A-Za-z]+

Related

parser with integer literals

I am looking for an easy and efficient way to handle integer literals in a lexical parser in Java. For example, my input code is as follows:
"6+9" ,
the output would have to be a little like this :
Number : 6
Sign : +
Number: 9
The issue is that I have no way to recognize a number other than to implement it as follows:
static char INTVALUE = ('0') ;
which means I would have to manually enter each digit from 0 to 9, and I don't know if such a method would even allow a number such as 85 in my input.
This is for a homework assignment, by the way.
Thanks.
For the simplest grammars you can indeed use regular expressions:
import java.util.regex.*;
// ...
String expression = "(10+9)*2";
Pattern pattern = Pattern.compile("\\s*(\\d+|\\D)\\s*");
Matcher matcher = pattern.matcher(expression);
while (matcher.find()) {
String token = matcher.group(1);
System.out.printf("%s: '%s'%n",
token.matches("\\d+") ? "Number" : "Symbol",
token);
}
In a compiler construction course you will probably be expected to construct an NFA and then transform that into a minimal DFA by implementing an algorithm like this one. In real life you would normally use a tool like ANTLR or JLex.
Why not use regular expressions for this? They sound like the best fit for what you are attempting to do.
They are fairly simple to learn. Look at character classes (\d) and quantifiers (+ ?) in this cheatsheet.
To check for integers and doubles, use the following:
aStr.matches("-?\\d+(\\.\\d+)?");
For just integers:
aStr.matches("-?\\d+");
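As a quick illustration of how those two patterns behave, here is a small sketch. The class and method names (NumberCheck, isInteger, isNumber) are invented for this example; only String.matches and the two patterns come from the answer.

```java
final class NumberCheck {
    // Optional minus sign followed by one or more digits.
    static boolean isInteger(String s) {
        return s.matches("-?\\d+");
    }

    // Same as above, optionally followed by a dot and at least one digit.
    static boolean isNumber(String s) {
        return s.matches("-?\\d+(\\.\\d+)?");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"85", "-8.5", "8.", "abc"}) {
            System.out.println(s + " integer=" + isInteger(s) + " number=" + isNumber(s));
        }
    }
}
```

Note that String.matches requires the whole string to match, so "85+9" is rejected by both methods; for tokenizing an expression, Matcher.find or a hand-written loop is still needed.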
You can also do something simple like this:
public List<Token> lex(String s) {
List<Token> result = new ArrayList<Token>();
int pos = 0;
int len = s.length();
while (pos < len) {
switch (s.charAt(pos)) {
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
{
int end = pos;
do {
++end;
} while (end < len && s.charAt(end) >= '0' && s.charAt(end) <= '9');
result.add(new Number(s.substring(pos, end)));
pos = end;
break;
}
case '+':
{
result.add(new Operator("+"));
++pos;
break;
}
// ...
}
}
return result;
}
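The snippet above leaves Token, Number and Operator undefined. A self-contained variant might look like the sketch below. Everything in it is an assumption made for illustration: a single hypothetical Token class with a kind and a text, the labels "Number" and "Sign" taken from the question's desired output, and a fixed set of one-character operators.

```java
import java.util.ArrayList;
import java.util.List;

final class TinyLexer {
    // Hypothetical token type: a kind ("Number" or "Sign") plus the matched text.
    static final class Token {
        final String kind;
        final String text;
        Token(String kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + " : " + text; }
    }

    static List<Token> lex(String s) {
        List<Token> result = new ArrayList<>();
        int pos = 0;
        while (pos < s.length()) {
            char c = s.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++;                                   // skip blanks between tokens
            } else if (c >= '0' && c <= '9') {
                int end = pos;                           // consume the whole run of digits
                while (end < s.length() && s.charAt(end) >= '0' && s.charAt(end) <= '9') {
                    end++;
                }
                result.add(new Token("Number", s.substring(pos, end)));
                pos = end;
            } else if ("+-*/()".indexOf(c) >= 0) {
                result.add(new Token("Sign", String.valueOf(c)));
                pos++;
            } else {
                // Fail loudly instead of looping forever on unknown input.
                throw new IllegalArgumentException("Unexpected character '" + c + "' at " + pos);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Token t : lex("6+9")) {
            System.out.println(t);
        }
    }
}
```

Because digits are consumed as a run, multi-digit numbers such as 85 come out as a single token.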

Convert a html string into a java "out.println"-statement

How can I convert an HTML string into a Java "out.println" statement (with Java)?
e.g.
<h1>Hello world</h1><p style="background-color:red">hello</p>
into
out.println("<h1>Hello world</h1>");
out.println("<p style=\"background-color:red\">hello</p>");
If you are trying to ease editing, there is an option for this in Eclipse. Look in: Window -> Preferences -> Java -> Editor -> Typing -> In string literals -> Escape text when pasting into a string literal.
If you are trying to do this programmatically, this will suffice:
public static String escapeForQuotes(String s) {
return escapeForQuotes(s, '\uFFFF');
}
public static String escapeForQuotes(String s, char ignore) {
int len = s.length();
StringBuilder sb = new StringBuilder(len * 6 / 5);
for (int i = 0; i < len; i++) {
char c = s.charAt(i);
if (c == ignore) { sb.append(c); continue; }
switch (c) {
case '\\': case '\"': case '\'': break;
case '\n': c = 'n'; break;
case '\r': c = 'r'; break;
case '\0': c = '0'; break;
default: sb.append(c); continue;
}
sb.append('\\').append(c);
}
return sb.toString();
}
The function returns its input with backslashes inserted before quotes, backslashes, line breaks and nulls. The optional 'ignore' parameter allows you to specify one character that need not be escaped. E.g., ' could be but need not be escaped in a "-quoted string, and vice-versa.
E.g.,
System.out.println("System.out.println(\"" + escapeForQuotes(html, '\'') + "\");");
Try below mentioned links
JSoup
Html Parser Tutorial
Regex replace. I take it you want to do the replace in your IDE. (Doing it in Java would need backslashes escaped, typically like "\w" becoming "\\w".)
Replace
<(\w+)[^>]*>([\s\S]*?)</\1>
by
out.println("$2");\r\n
The \1 matching the first group (the tag name).
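If the conversion has to happen in Java rather than in the IDE, a regex of the same shape can be combined with quote escaping. This is a rough sketch under strong assumptions: the input is a flat sequence of elements, no element contains a nested element with the same tag name, and the names HtmlToPrintln and toStatements are invented here. A regex is not an HTML parser, so anything more complex calls for a real parser such as JSoup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlToPrintln {
    // One element: opening tag, lazily anything, then the closing tag with the same name.
    private static final Pattern ELEMENT =
            Pattern.compile("<(\\w+)[^>]*>.*?</\\1>", Pattern.DOTALL);

    static List<String> toStatements(String html) {
        List<String> statements = new ArrayList<>();
        Matcher m = ELEMENT.matcher(html);
        while (m.find()) {
            // Escape backslashes first, then quotes and line breaks.
            String escaped = m.group()
                    .replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\r", "\\r")
                    .replace("\n", "\\n");
            statements.add("out.println(\"" + escaped + "\");");
        }
        return statements;
    }

    public static void main(String[] args) {
        String html = "<h1>Hello world</h1><p style=\"background-color:red\">hello</p>";
        for (String line : toStatements(html)) {
            System.out.println(line);
        }
    }
}
```

Unlike the IDE replacement above, this keeps the whole element (tags included) in each statement, which is what the question's expected output shows.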

How to create a Pattern matching given set of chars?

I get a set of chars, e.g. as a String containing all of them and need a charclass Pattern matching any of them. For example
for "abcde" I want "[a-e]"
for "[]^-" I want "[-^\\[\\]]"
How can I create a compact solution and how to handle border cases like empty set and set of all chars?
What chars need to be escaped?
Clarification
I want to create a charclass Pattern, i.e. something like "[...]", no repetitions and no such stuff. It must work for any input, that's why I'm interested in the corner cases, too.
Here's a start:
import java.util.*;
public class RegexUtils {
private static String encode(char c) {
switch (c) {
case '[':
case ']':
case '\\':
case '-':
case '^':
return "\\" + c;
default:
return String.valueOf(c);
}
}
public static String createCharClass(char[] chars) {
if (chars.length == 0) {
return "[^\\u0000-\\uFFFF]";
}
StringBuilder builder = new StringBuilder();
boolean includeCaret = false;
boolean includeMinus = false;
List<Character> set = new ArrayList<Character>(new TreeSet<Character>(toCharList(chars)));
if (set.size() == 1<<16) {
return "[\\w\\W]";
}
for (int i = 0; i < set.size(); i++) {
int rangeLength = discoverRange(i, set);
if (rangeLength > 2) {
builder.append(encode(set.get(i))).append('-').append(encode(set.get(i + rangeLength)));
i += rangeLength;
} else {
switch (set.get(i)) {
case '[':
case ']':
case '\\':
builder.append('\\').append(set.get(i));
break;
case '-':
includeMinus = true;
break;
case '^':
includeCaret = true;
break;
default:
builder.append(set.get(i));
break;
}
}
}
builder.append(includeCaret ? "^" : "");
builder.insert(0, includeMinus ? "-" : "");
return "[" + builder + "]";
}
private static List<Character> toCharList(char[] chars) {
List<Character> list = new ArrayList<Character>();
for (char c : chars) {
list.add(c);
}
return list;
}
private static int discoverRange(int index, List<Character> chars) {
int range = 0;
for (int i = index + 1; i < chars.size(); i++) {
if (chars.get(i) - chars.get(i - 1) != 1) break;
range++;
}
return range;
}
public static void main(String[] args) {
System.out.println(createCharClass("daecb".toCharArray()));
System.out.println(createCharClass("[]^-".toCharArray()));
System.out.println(createCharClass("".toCharArray()));
System.out.println(createCharClass("d1a3e5c55543b2000".toCharArray()));
System.out.println(createCharClass("!-./0".toCharArray()));
}
}
As you can see, the input:
"daecb".toCharArray()
"[]^-".toCharArray()
"".toCharArray()
"d1a3e5c55543b2000".toCharArray()
"!-./0".toCharArray()
prints:
[a-e]
[-\[\]^]
[^\u0000-\uFFFF]
[0-5a-e]
[!\--0]
The corner cases in a character class are:
\
[
]
which will need a \ to be escaped. The character ^ doesn't need an escape if it's not placed at the start of a character class, and the - does not need to be escaped when it's placed at the start, or end of the character class (hence the boolean flags in my code).
The empty set is [^\u0000-\uFFFF], and the set of all the characters is [\u0000-\uFFFF]. Not sure what you need the former for as it won't match anything. I'd throw an IllegalArgumentException() on an empty string instead.
What chars need to be escaped?
- ^ \ [ ] - that's all of them, I've actually tested it. And unlike some other regex implementations [ is considered a meta character inside a character class, possibly due to the possibility of using inner character classes with operators.
The rest of the task sounds easy, but rather tedious. First you need to select unique characters. Then loop through them, appending to a StringBuilder, possibly escaping. If you want character ranges, you need to sort the characters first and select contiguous ranges while looping. If you want the - to be at the beginning of the range with no escaping, then set a flag, but don't append it. After the loop, if the flag is set, prepend - to the result before wrapping it in [].
Match all characters: ".*" (zero or more repetitions, *, of any character, .).
Match a blank line: "^$" (start of a line, ^, immediately followed by end of a line, $; note that there is nothing to match in between).
I'm not sure the last pattern is exactly what you want, as there are different interpretations of "match nothing".
A quick, dirty, and almost-not-pseudo-code answer:
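StringBuilder sb = new StringBuilder("[");
// the characters that need escaping inside a character class
Set<Character> metaChars = new HashSet<Character>(Arrays.asList('\\', '[', ']', '^', '-'));
while (sourceString.length() != 0) {
    char c = sourceString.charAt(0);
    sb.append(metaChars.contains(c) ? "\\" + c : String.valueOf(c));
    // Strings are immutable, so reassign; this drops every occurrence of c
    sourceString = sourceString.replace(String.valueOf(c), "");
}
sb.append("]");
//...check here for the appropriate sb.length cases before compiling
// e.g, 2 = empty ("[]" is not a valid pattern), all chars equals the count of whatever set qualifies as all chars, etc
Pattern p = Pattern.compile(sb.toString());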
This gives you a string of the unique chars you want to match, with meta-characters escaped. It will not convert things into ranges (which I think is fine - doing so smells like premature optimization to me). You can do some post tests for simple set cases - like matching sb against digits, non-digits, etc, but unless you know that's going to buy you a lot of performance (or the simplification is the point of this program), I wouldn't bother.
If you really want to do ranges, you could instead sourceString.toCharArray(), sort that, iterate deleting repetitions and doing some sort of range check and replacing meta characters as you add the contents to StringBuilder.
EDIT: I actually kind of liked the toCharArray version, so pseudo-coded it out as well:
//...check for empty here, if not...
char[] sourceC = sourceString.toCharArray();
Arrays.sort(sourceC);
lastC = sourceC[0];
StringBuilder sb = new StringBuilder("[");
StringBuilder range = new StringBuilder();
for (int i=1; i<sourceC.length; i++) {
if (lastC == sourceC[i]) continue;
if (//.. next char in sequence..//) //..add to range
else {
// check range size, append accordingly to sb as a single item, range, etc
}
lastC = sourceC[i];
}
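The sort-and-scan idea in the pseudo-code above could be fleshed out roughly like this. It is a sketch, not the answerer's code: the names CharClassSketch, charClass and escape are invented, the five metacharacters are always backslash-escaped instead of being moved to a safe position, and empty input is rejected with an exception (one of the options suggested in another answer).

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class CharClassSketch {
    // Backslash-escape the characters that are special inside a character class.
    static String escape(char c) {
        return "\\[]^-".indexOf(c) >= 0 ? "\\" + c : String.valueOf(c);
    }

    static String charClass(String source) {
        if (source.isEmpty()) {
            throw new IllegalArgumentException("empty character set");
        }
        char[] cs = source.toCharArray();
        Arrays.sort(cs);
        StringBuilder sb = new StringBuilder("[");
        int i = 0;
        while (i < cs.length) {
            char start = cs[i];
            char end = start;
            int j = i + 1;
            // Extend the run over duplicates and direct successors.
            while (j < cs.length && (cs[j] == end || cs[j] == end + 1)) {
                end = cs[j];
                j++;
            }
            sb.append(escape(start));
            if (end - start >= 2) {
                sb.append('-').append(escape(end));   // three or more: emit a range
            } else if (end != start) {
                sb.append(escape(end));               // exactly two: list both
            }
            i = j;
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        String cls = charClass("d1a3e5c55543b2000");
        System.out.println(cls);
        System.out.println(Pattern.matches(cls + "+", "abc123"));
    }
}
```

Sorting first makes both duplicate removal and range detection a single left-to-right pass.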

Java String Special character replacement

I have a string which contains alphanumeric and special characters.
I need to replace each special character with some string.
For example,
Input string = "ja*va st&ri%n#&"
Expected o/p = "jaasteriskvaspacestandripercentagenatand"
* = "asterisk"
& = "and"
% = "percentage"
# = "at"
thanks,
Unless you're absolutely desperate for performance, I'd use a very simple approach:
String result = input.replace("*", "asterisk")
.replace("%", "percentage")
.replace("#", "at"); // Add more to taste :)
(Note that there's a big difference between replace and replaceAll - the latter takes a regular expression. It's easy to get the wrong one and see radically different effects!)
An alternative would be something like:
public static String replaceSpecial(String input)
{
// Output will be at least as long as input
StringBuilder builder = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++)
{
char c = input.charAt(i);
switch (c)
{
case '*': builder.append("asterisk"); break;
case '%': builder.append("percentage"); break;
case '#': builder.append("at"); break;
default: builder.append(c); break;
}
}
return builder.toString();
}
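The same single-pass idea can be driven by a lookup map instead of a switch, which makes it easier to add replacements. This is a sketch with invented names (SpecialCharReplacer, NAMES); the mappings for space and & are filled in from the question's expected output.

```java
import java.util.HashMap;
import java.util.Map;

final class SpecialCharReplacer {
    // Replacement text for each special character, taken from the question.
    private static final Map<Character, String> NAMES = new HashMap<>();
    static {
        NAMES.put('*', "asterisk");
        NAMES.put(' ', "space");
        NAMES.put('&', "and");
        NAMES.put('%', "percentage");
        NAMES.put('#', "at");
    }

    static String replaceSpecial(String input) {
        StringBuilder builder = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String name = NAMES.get(c);
            if (name != null) {
                builder.append(name);   // special character: append its name
            } else {
                builder.append(c);      // anything else is copied unchanged
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceSpecial("ja*va st&ri%n#&"));
    }
}
```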
Take a look at the following java.lang.String methods:
replace()
replaceAll()
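The difference between the two methods matters for this question, because characters such as * and . are regex metacharacters. A small sketch (class and method names are made up for the demonstration):

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class ReplaceVsReplaceAll {
    // replace() treats its first argument as literal text.
    static String literal(String s) {
        return s.replace(".", "dot");
    }

    // replaceAll() treats it as a regex, where "." matches any character.
    static String regex(String s) {
        return s.replaceAll(".", "dot");
    }

    // A lone "*" is not even a valid regex.
    static boolean starAloneIsRejected() {
        try {
            "a*b".replaceAll("*", "asterisk");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(literal("a.b"));
        System.out.println(regex("a.b"));
        System.out.println(starAloneIsRejected());
        // Pattern.quote makes replaceAll treat the text literally.
        System.out.println("a*b".replaceAll(Pattern.quote("*"), "asterisk"));
    }
}
```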

How to rewrite this block of code using a StringBuilder in Java?

Given a word, I have to replace some specific letters with specific characters, such as 1 for a, 5 for b, etc. I'm using regex for this. I understand that StringBuilder is the best way to deal with this problem as I'm doing a lot of string manipulations. Here is what I'm doing:
String word = "foobooandfoo";
String converted = "";
converted = word.replaceAll("[ao]", "1");
converted = converted.replaceAll("[df]", "2");
converted = converted.replaceAll("[n]", "3");
My problem is how to rewrite this program using StringBuilder. I tried everything but I can't succeed. Or using String is just fine for this?
I think this is a case where clarity and performance happily coincide. I would use a lookup table to do the "translation".
public static void translate(StringBuilder str, char[] table)
{
for (int idx = 0; idx < str.length(); ++idx) {
char ch = str.charAt(idx);
if (ch < table.length) {
ch = table[ch];
str.setCharAt(idx, ch);
}
}
}
If you have a large alphabet for the str input, or your mappings are sparse, you could use a real map, like this:
public static void translate(StringBuilder str, Map<Character, Character> table)
{
for (int idx = 0; idx < str.length(); ++idx) {
char ch = str.charAt(idx);
Character conversion = table.get(ch);
if (conversion != null)
str.setCharAt(idx, conversion);
}
}
While these implementations work in-place, you could create a new StringBuilder instance (or append to one that's passed in).
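To show how the array-based version might be called, here is a sketch that builds the table for the question's mapping. The class name TranslateDemo, the helper buildTable and the table size of 128 (ASCII only) are assumptions for illustration; translate is copied from the first snippet above.

```java
final class TranslateDemo {
    // Identity table for ASCII, with the question's replacements overridden.
    static char[] buildTable() {
        char[] table = new char[128];
        for (char c = 0; c < table.length; c++) {
            table[c] = c;
        }
        table['a'] = '1';
        table['o'] = '1';
        table['d'] = '2';
        table['f'] = '2';
        table['n'] = '3';
        return table;
    }

    // In-place translation; characters outside the table are left alone.
    static void translate(StringBuilder str, char[] table) {
        for (int idx = 0; idx < str.length(); ++idx) {
            char ch = str.charAt(idx);
            if (ch < table.length) {
                str.setCharAt(idx, table[ch]);
            }
        }
    }

    public static void main(String[] args) {
        StringBuilder word = new StringBuilder("foobooandfoo");
        translate(word, buildTable());
        System.out.println(word);
    }
}
```

The table has to start out as an identity mapping; otherwise every unmapped character would be replaced by '\0'.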
I'd actually say that the code is pretty OK in most applications although it's theoretically inferior to other methods. If you don't want to use the Matcher, try it like this:
StringBuilder result = new StringBuilder(word.length());
for (char c : word.toCharArray()) {
switch (c) {
case 'a': case 'o': result.append('1'); break;
case 'd': case 'f': result.append('2'); break;
case 'n': result.append('3'); break;
default: result.append(c); break;
}
}
I don't know if StringBuilder is the tool for you here. I'd consider looking at Matcher which is part of the java regex package and might be faster than your example above in case you really need the performance.
I don't believe you can. All the regex replace APIs use String instead of StringBuilder.
If you're basically converting each char into a different char, you could just do something like:
public String convert(String text)
{
char[] chars = new char[text.length()];
for (int i=0; i < text.length(); i++)
{
char c = text.charAt(i);
char converted;
switch (c)
{
case 'a': converted = '1'; break;
case 'o': converted = '1'; break;
case 'd': converted = '2'; break;
case 'f': converted = '2'; break;
case 'n': converted = '3'; break;
default : converted = c; break;
}
chars[i] = converted;
}
return new String(chars);
}
However, if you do any complex regular expressions, that obviously won't help much.
StringBuilder and StringBuffer can have a big performance difference in some programs. See: http://www.thectoblog.com/2011/01/stringbuilder-vs-stringbuffer-vs.html
Which would be a strong reason to want to hold onto it.
The original post asked for multiple characters to be replaced with a single character. This has a resize impact, which in turn could affect performance.
That said, the simplest way to do this is with a String. But take care where it is done, so as to minimize GC and other effects if performance is a concern.
I like P Arrayah's approach, but for a more generic answer it should use a LinkedHashMap or something that preserves order in case the replacements have a dependency.
// instead of: Map replaceRules = new HashMap();
Map replaceRules = new LinkedHashMap();
I had a look at Matcher.replaceAll() and noticed that it returns a String. Therefore, I think that what you've got is going to be plenty fast. Regexes are easy to read and quick.
Remember the first rule of optimization: don't do it!
I understand that StringBuilder is the best way to deal with this problem as I'm doing a lot of string manipulations.
Who told you that? The best way is the one that is clearest to read, not necessarily the one that uses StringBuilder. StringBuilder helps in some circumstances, but in many it does not provide a perceptible speed-up.
You shouldn't initialize "converted" if the value is always replaced.
You can remove some of the boiler plate to improve your code:
String word = "foobooandfoo";
String converted = word.replaceAll("[ao]", "1")
.replaceAll("[df]", "2")
.replaceAll("[n]", "3");
If you want to use StringBuilder, you could use this method:
java.util.regex.Pattern#matcher(java.lang.CharSequence)
which accepts a CharSequence (implemented by StringBuilder).
See http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#matcher(java.lang.CharSequence).
StringBuilder vs. regex is a false dichotomy. The reason String#replaceAll() is the wrong tool is because, each time you call it, you're compiling the regex and processing the whole string. You can avoid all that excess work by combining all the regexes into one and using the lower-level methods in Matcher instead of replaceAll(), like so:
String text = "foobooandfoo";
Pattern p = Pattern.compile("([ao])|([df])|n");
Matcher m = p.matcher(text);
StringBuffer sb = new StringBuffer();
while (m.find())
{
m.appendReplacement(sb, "");
sb.append(m.start(1) != -1 ? '1' :
m.start(2) != -1 ? '2' :
'3');
}
m.appendTail(sb);
System.out.println(sb.toString());
Of course, this is still overkill; for a job as simple as this one, I recommend erickson's approach.
I would NOT recommend using any regex for this, those are actually all painfully slow when you're doing simple operations. Instead I'd recommend you start with something like this
// usage:
Map<String, String> replaceRules = new HashMap<String, String>();
// keys are passed to replaceAll, so they must be character classes
replaceRules.put("[ao]", "1");
replaceRules.put("[df]", "2");
replaceRules.put("n", "3");
String s = replacePartsOf("foobooandfoo", replaceRules);
// actual method
public String replacePartsOf(String thisString, Map<String, String> withThese) {
for(Entry<String, String> rule : withThese.entrySet()) {
thisString = thisString.replaceAll(rule.getKey(), rule.getValue());
}
return thisString;
}
and after you've got that working, refactor it to use character arrays instead. While I think what you want to do can be done with StringBuilder it most likely won't be worth the effort.
