I want to be able to output both "==" and "=" as tokens.
For example, the input text file is:
biscuit==cookie apple=fruit+-()
The output:
biscuit
=
=
cookie
apple
=
fruit
+
-
(
)
What I want the output to be:
biscuit
==
cookie
apple
=
fruit
+
-
(
)
Here is my code:
Scanner s = null;
try {
s = new Scanner(new BufferedReader(new FileReader("input.txt")));
s.useDelimiter("\\s|(?<=\\p{Punct})|(?=\\p{Punct})");
while (s.hasNext()) {
String next = s.next();
System.out.println(next);
}
} finally {
if (s != null) {
s.close();
}
}
Thank you.
Edit: I want to be able to keep the current regex.
Just split the input string using the regex below.
String s = "biscuit==cookie apple=fruit";
String[] tok = s.split("\\s+|\\b(?==+)|(?<==)(?!=)");
System.out.println(Arrays.toString(tok));
Output:
[biscuit, ==, cookie, apple, =, fruit]
Explanation:
\\s+ Matches one or more space characters.
| OR
\\b(?==+) Matches a word boundary only if it's followed by a = symbol.
| OR
(?<==) Lookbehind: the position is preceded by a = symbol.
(?!=) And match the boundary only if it's not followed by a = symbol.
Update:
String s = "biscuit==cookie apple=fruit+-()";
String[] tok = s.split("\\s+|(?<!=)(?==+)|(?<==)(?!=)|(?=[+()-])");
System.out.println(Arrays.toString(tok));
Output:
[biscuit, ==, cookie, apple, =, fruit, +, -, (, )]
You might be able to qualify the punctuation alternatives with some additional assertions.
# "\\s|(?<===)|(?<=\\p{Punct})(?!(?<==)(?==))|(?=\\p{Punct})(?!(?<==)(?==))"
\s
| (?<= == )
| (?<= \p{Punct} )
(?!
(?<= = )
(?= = )
)
| (?= \p{Punct} )
(?!
(?<= = )
(?= = )
)
Info update
If some characters aren't covered in \p{Punct} just add them as a separate class within
the punctuation subexpressions.
For engines that don't do certain properties well inside classes, use this ->
# Raw: \s|(?<===)|(?<=\p{Punct}|[=+])(?!(?<==)(?==))|(?=\p{Punct}|[=+])(?!(?<==)(?==))
\s
| (?<= == )
| (?<= \p{Punct} | [=+] )
(?!
(?<= = )
(?= = )
)
| (?= \p{Punct} | [=+] )
(?!
(?<= = )
(?= = )
)
For engines that handle properties well inside classes, this is a better one ->
# Raw: \s|(?<===)|(?<=[\p{Punct}=+])(?!(?<==)(?==))|(?=[\p{Punct}=+])(?!(?<==)(?==))
\s
| (?<= == )
| (?<= [\p{Punct}=+] )
(?!
(?<= = )
(?= = )
)
| (?= [\p{Punct}=+] )
(?!
(?<= = )
(?= = )
)
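As a quick sanity check, the expanded layout can be compiled as-is in Java with Pattern.COMMENTS, which makes the engine ignore whitespace inside the pattern. The sketch below uses the first variant from this answer; the class name SplitCheck and the choice of Pattern.split on an in-memory string (rather than a Scanner on a file) are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitCheck {
    // First variant from this answer, kept in its expanded layout.
    // Pattern.COMMENTS tells the engine to skip the whitespace in the pattern.
    static final Pattern DELIM = Pattern.compile(
          " \\s "
        + " | (?<= == ) "
        + " | (?<= \\p{Punct} ) (?! (?<= = ) (?= = ) ) "
        + " | (?=  \\p{Punct} ) (?! (?<= = ) (?= = ) ) ",
        Pattern.COMMENTS);

    // Splits the input at every delimiter position; trailing empty strings are dropped.
    static String[] tokens(String input) {
        return DELIM.split(input);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("biscuit==cookie apple=fruit+-()")));
    }
}
```

The negative lookahead is what keeps the position between the two = characters from being a split point, so == survives as one token.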
In other words, you want to split on:
one or more whitespace characters
a place which has = after it and a non-= before it (like foo|= where | represents this place)
a place which has = before it and a non-= after it (like =|foo where | represents this place)
That is:
s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
// ^^^^^ ^^^^^^^^^^^ ^^^^^^^^^^^
//cases: 1) 2) 3)
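A minimal, self-contained sketch of that delimiter in use might look like the following. It reads from a String instead of input.txt so it can be run directly; the class name EqualsDelimiterDemo and the tokens helper are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class EqualsDelimiterDemo {
    // Collects the tokens produced by the three-case delimiter.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner s = new Scanner(input)) {
            // 1) whitespace  2) before a run of '='  3) after a run of '='
            s.useDelimiter("\\s+|(?<!=)(?==)|(?<==)(?!=)");
            while (s.hasNext()) {
                out.add(s.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        tokens("biscuit==cookie apple=fruit").forEach(System.out::println);
    }
}
```

Note that this delimiter only handles = and whitespace; other punctuation such as +-() would still need the \p{Punct} alternatives from the original regex.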
Since it looks like you are building a parser, I would suggest using a tool that lets you define a proper grammar, such as http://www.antlr.org/. But if you must stick with regex, another improvement that makes the regex easier to write is to use Matcher#find instead of a Scanner delimiter. That way your regex and code could look like this:
String data = "biscuit==cookie apple=fruit+-()";
String regex = "<=|==|>=|[\\Q<>+-=()\\E]|[^\\Q<>+-=()\\E]+";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(data);
while (m.find())
System.out.println(m.group());
Output:
biscuit
==
cookie apple
=
fruit
+
-
(
)
You can make this regex more general by using
String regex = "<=|==|>=|\\p{Punct}|\\P{Punct}+";
// ^^^^^^^^^^ ^^^^^^^^^^^-- standard cases
// ^^ ^^ ^^------------------------- special cases
This approach also requires reading the data from the file first and storing it in a single String, which you then parse. There are many ways to read text from a file; see, for instance, this question:
Reading a plain text file in Java
so you can use something like
String data = new String(Files.readAllBytes(Paths.get("input.txt")));
You can specify the encoding used to decode the bytes with the constructor String(bytes, encoding). So you can write new String(bytes, "UTF-8") or, to avoid typos in the encoding name, use one of the constants in the StandardCharsets class, like new String(bytes, StandardCharsets.UTF_8).
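Putting the pieces together, a sketch of the read-then-find approach could look like this. The class and method names are hypothetical, and one deliberate change is an assumption on my part: the last alternative is written as [^\p{Punct}\s]+ instead of \P{Punct}+, so that whitespace is skipped rather than ending up inside a token such as "cookie apple".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileTokenizer {
    // Multi-character operators come first so "==" is not reported as "=" "=".
    // Assumption: whitespace is excluded from word tokens and simply skipped.
    private static final Pattern TOKEN =
        Pattern.compile("<=|==|>=|\\p{Punct}|[^\\p{Punct}\\s]+");

    // Returns every token found in the text, in order.
    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Reads the whole file as UTF-8 and tokenizes its content.
    static List<String> tokenizeFile(Path file) throws IOException {
        String data = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return tokenize(data);
    }

    public static void main(String[] args) throws IOException {
        tokenizeFile(Paths.get("input.txt")).forEach(System.out::println);
    }
}
```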
(?===)|(?<===)|\s|(?<!=)(?==)|(?<==)(?!=)|(?=\p{P})|(?<=\p{P})|(?=\+)
You can try this. See the demo.
http://regex101.com/r/wQ1oW3/18
String s = "My cake should have ( sixteen | sixten | six teen ) candles, I love and ( should be | would be ) puff them."
Final changed string
My cake should have <div><p id="1">sixteen</p><p id="2">sixten</p><p id="3">six teen</p></div> candles, I love and <div><p id="1">should be</p><p id="2"> would be</p> puff them
I have tried using this:
Pattern pattern = Pattern.compile("\\|\\s*(.*?)(?=\\s*\\|)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(1));
}
Use
import java.util.regex.*;
class Program
{
public static void main (String[] args) throws java.lang.Exception
{
String s = "My cake should have ( sixteen | sixten | six teen ) candles, I love and ( should be | would be ) puff them.";
Pattern pattern = Pattern.compile("\\(([^()|]*\\|[^()]*)\\)");
Matcher matcher = pattern.matcher(s);
StringBuffer changed = new StringBuffer();
while (matcher.find()){
String temp = "<div>";
String[] items = matcher.group(1).trim().split("\\s*\\|\\s*");
for (int i = 1; i<=items.length; i++) {
temp += "<p id=\"" + i + "\">" + items[i-1] + "</p>";
}
matcher.appendReplacement(changed, temp+"</div>");
}
matcher.appendTail(changed);
System.out.println(changed.toString());
}
}
See Java proof.
Results: My cake should have <div><p id="1">sixteen</p><p id="2">sixten</p><p id="3">six teen</p></div> candles, I love and <div><p id="1">should be</p><p id="2">would be</p></div> puff them.
Regex used
\(([^()|]*\|[^()]*)\)
EXPLANATION
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^()|]* any character except: '(', ')', '|' (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\| '|'
--------------------------------------------------------------------------------
[^()]* any character except: '(', ')' (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\) ')'
In short: Match parens with pipes, take the content with no parens and split with pipes, trim, combine into one string with loop. StringBuffer and matcher.appendReplacement do the magic with string manipulation during replacing.
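On Java 9 or later, the same idea can be written without the explicit StringBuffer loop by passing a function to Matcher#replaceAll. The sketch below is one possible variant, not the code from the answer: the class name AlternativesToHtml is invented, and it adds Matcher.quoteReplacement so that a $ or \ inside an item is not treated as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternativesToHtml {
    // Parenthesized group that contains at least one pipe and no nested parens.
    private static final Pattern GROUP = Pattern.compile("\\(([^()|]*\\|[^()]*)\\)");

    static String convert(String s) {
        return GROUP.matcher(s).replaceAll(mr -> {
            StringBuilder html = new StringBuilder("<div>");
            String[] items = mr.group(1).trim().split("\\s*\\|\\s*");
            for (int i = 0; i < items.length; i++) {
                html.append("<p id=\"").append(i + 1).append("\">")
                    .append(items[i]).append("</p>");
            }
            html.append("</div>");
            // Escape '$' and '\' so the replacement text is taken literally.
            return Matcher.quoteReplacement(html.toString());
        });
    }

    public static void main(String[] args) {
        System.out.println(convert(
            "My cake should have ( sixteen | sixten | six teen ) candles, "
            + "I love and ( should be | would be ) puff them."));
    }
}
```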
I want to replace all Java-style comments (/* */) with the number of new lines for that comment. So far, I can only come up with something that replaces comments with an empty string
String.replaceAll("/\\*[\\s\\S]*?\\*/", "")
Is it possible to replace the matching regexes instead with the number of new lines it contains? If this is not possible with just regex matching, what's the best way for it to be done?
For example,
/* This comment
has 2 new lines
contained within */
will be replaced with a string of just 2 new lines.
Since Java supports the \G construct, just do it all in one go.
Use a global regex replace function.
Find
"/(?:\\/\\*(?=[\\S\\s]*?\\*\\/)|(?<!\\*\\/)(?!^)\\G)(?:(?!\\r?\\n|\\*\\/).)*((?:\\r?\\n)?)(?:\\*\\/)?/"
Replace
"$1"
https://regex101.com/r/l1VraO/1
Expanded
(?:
/ \*
(?= [\S\s]*? \* / )
|
(?<! \* / )
(?! ^ )
\G
)
(?:
(?! \r? \n | \* / )
.
)*
( # (1 start)
(?: \r? \n )?
) # (1 end)
(?: \* / )?
==================================================
==================================================
If you ever need to care about comment block delimiters that start inside
quoted strings, like this
String comment = "/* this is a comment*/"
Here is a regex (addition) that parses the quoted string as well as the comment.
Still done in a single regex all at once in a global find / replace.
Find
"/(\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\")|(?:\\/\\*(?=[\\S\\s]*?\\*\\/)|(?<!\")(?<!\\*\\/)(?!^)\\G)(?:(?!\\r?\\n|\\*\\/).)*((?:\\r?\\n)?)(?:\\*\\/)?/"
Replace
"$1$2"
https://regex101.com/r/tUwuAI/1
Expanded
( # (1 start)
"
[^"\\]*
(?:
\\ [\S\s]
[^"\\]*
)*
"
) # (1 end)
|
(?:
/ \*
(?= [\S\s]*? \* / )
|
(?<! " )
(?<! \* / )
(?! ^ )
\G
)
(?:
(?! \r? \n | \* / )
.
)*
( # (2 start)
(?: \r? \n )?
) # (2 end)
(?: \* / )?
You can do it with a regex "replacement loop".
Most easily done in Java 9+:
String result = Pattern.compile("/\\*(?:[^*]++|\\*(?!/))*+\\*/").matcher(input)
.replaceAll(r -> r.group().replaceAll(".*", ""));
The main regex has been optimized for performance. The lambda has not been optimized.
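If the inner replaceAll(".*", "") is a concern, one possible alternative is to copy only the line-break characters of each comment by hand. This is a sketch under the assumption that only \n and \r count as line breaks (the .* version also preserves the rarer Unicode line terminators); the class name CommentBlanker is made up, and Java 9+ is still required.

```java
import java.util.regex.Pattern;

class CommentBlanker {
    private static final Pattern COMMENT =
        Pattern.compile("/\\*(?:[^*]++|\\*(?!/))*+\\*/");

    // Replaces each /* ... */ comment with just the line breaks it contained.
    static String blank(String input) {
        return COMMENT.matcher(input).replaceAll(r -> {
            String comment = r.group();
            StringBuilder kept = new StringBuilder();
            for (int i = 0; i < comment.length(); i++) {
                char c = comment.charAt(i);
                // Assumption: only '\n' and '\r' are treated as line breaks.
                if (c == '\n' || c == '\r') {
                    kept.append(c);
                }
            }
            // The result holds no '$' or '\', so it is safe as a replacement string.
            return kept.toString();
        });
    }

    public static void main(String[] args) {
        System.out.println(blank("Line 1\n/* This\n   comment */\nLine 4"));
    }
}
```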
For all Java versions:
Matcher m = Pattern.compile("/\\*(?:[^*]++|\\*(?!/))*+\\*/").matcher(input);
StringBuffer buf = new StringBuffer();
while (m.find())
m.appendReplacement(buf, m.group().replaceAll(".*", ""));
String result = m.appendTail(buf).toString();
Test
final String input = "Line 1\n"
+ "/* Inline comment */\n"
+ "Line 3\n"
+ "/* One-line\n"
+ " comment */\n"
+ "Line 6\n"
+ "/* This\n"
+ " comment\n"
+ " has\n"
+ " 4\n"
+ " lines */\n"
+ "Line 12";
Matcher m = Pattern.compile("(?s)/\\*(?:[^*]++|\\*(?!/))*+\\*/").matcher(input);
String result = m.replaceAll(r -> r.group().replaceAll(".*", ""));
// Show input/result side-by-side
String[] inLines = input.split("\n", -1);
String[] resLines = result.split("\n", -1);
int lineCount = Math.max(inLines.length, resLines.length);
System.out.println("input |result");
System.out.println("-------------------------+-------------------------");
for (int i = 0; i < lineCount; i++) {
System.out.printf("%-25s|%s%n", (i < inLines.length ? inLines[i] : ""),
(i < resLines.length ? resLines[i] : ""));
}
Output
input |result
-------------------------+-------------------------
Line 1 |Line 1
/* Inline comment */ |
Line 3 |Line 3
/* One-line |
comment */ |
Line 6 |Line 6
/* This |
comment |
has |
4 |
lines */ |
Line 12 |Line 12
Maybe, this expression,
\/\*.*?\*\/
on s mode might be close to what you have in mind.
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class re{
public static void main(String[] args){
final String regex = "\\/\\*.*?\\*\\/";
final String string = "/* This comment\n"
+ "has 2 new lines\n"
+ "contained within */\n\n"
+ "Some codes here 1\n\n"
+ "/* This comment\n"
+ "has 2 new lines\n"
+ "contained within \n"
+ "*/\n\n\n"
+ "Some codes here 2";
final String subst = "\n\n";
final Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
final Matcher matcher = pattern.matcher(string);
final String result = matcher.replaceAll(subst);
System.out.println(result);
}
}
Output
Some codes here 1
Some codes here 2
If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch in this link how it would match against some sample inputs.
I am trying to validate a JSON string using regex. I found a valid regex in another post: https://stackoverflow.com/a/3845829/7493427
It uses the DEFINE feature of regex, but I think the JRegex library does not support that feature. Is there a workaround for this?
I used java.util.regex first, then found out about the JRegex library, but that doesn't work either.
String regex = "(?(DEFINE)" +
"(?<number> -? (?= [1-9]|0(?!\\d) ) \\d+ (\\.\\d+)? ([eE] [+-]? \\d+)? )" +
"(?<boolean> true | false | null )" +
"(?<string> \" ([^\"\\n\\r\\t\\\\\\\\]* | \\\\\\\\ [\"\\\\\\\\bfnrt\\/] | \\\\\\\\ u [0-9a-f]{4} )* \" )" +
"(?<array> \\[ (?: (?&json) (?: , (?&json) )* )? \\s* \\] )" +
"(?<pair> \\s* (?&string) \\s* : (?&json) )" +
"(?<object> \\{ (?: (?&pair) (?: , (?&pair) )* )? \\s* \\} )" +
"(?<json> \\s* (?: (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) ) \\s* )" +
")" +
"\\A (?&json) \\Z";
String test = "{\"asd\" : \"asdasdasdasdasdasd\"}";
jregex.Pattern pattern = new jregex.Pattern(regex);
jregex.Matcher matcher = pattern.matcher(test);
if(matcher.find()) {
System.out.println(matcher.groups());
}
I expected a match as the test json is valid, but I get an exception.
Exception in thread "main" jregex.PatternSyntaxException: unknown group name in conditional expr.: DEFINE
at jregex.Term.makeTree(Term.java:360)
at jregex.Term.makeTree(Term.java:219)
at jregex.Term.makeTree(Term.java:206)
at jregex.Pattern.compile(Pattern.java:164)
at jregex.Pattern.<init>(Pattern.java:150)
at jregex.Pattern.<init>(Pattern.java:108)
at com.cloak.utilities.regex.VariableValidationHelper.main(VariableValidationHelper.java:305)
You can use this rather simple Jackson setup:
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
private static final ObjectMapper MAPPER = new ObjectMapper();
public static boolean isValidJson(String json) {
try {
// Object.class accepts any top-level JSON value; Map.class would reject valid arrays and scalars.
MAPPER.readValue(json, Object.class);
return true;
} catch (IOException e) {
return false;
}
}
ObjectMapper#readValue() throws a JsonProcessingException (a subclass of IOException) when the input is invalid.
I have an instruction like:
db.insert( {
_id:3,
cost:{_0:11},
description:"This is a description.\nCool, isn\'t it?"
});
The Eclipse plugin I am using, called MonjaDB, splits the instruction by newline, so I get each line as a separate instruction, which is bad. I fixed it using ;(\r|\n)+ which now captures the entire instruction. However, when sanitizing the newlines between the parts of the JSON, it also sanitizes the \n and \r within strings in the JSON itself.
How do I avoid removing \t, \r, \n from within JSON strings, which are, of course, delimited by "" or ''?
You need to arrange to ignore whitespace when it appears within quotes. So, as suggested by one of the commenters:
\s+ | ( " (?: [^"\\] | \\ . ) * " ) // White-space inserted for readability
Match Java whitespace, or a double-quoted string, where a string consists of " followed by any run of non-escape, non-quote characters or an escape plus any character, then a final ". This way, whitespace inside a string is only ever matched as part of the whole string, never on its own.
Then replace each match with $1 if $1 is not null, and with nothing otherwise.
Pattern clean = Pattern.compile(" \\s+ | ( \" (?: [^\"\\\\] | \\\\ . ) * \" ) ", Pattern.COMMENTS | Pattern.DOTALL);
StringBuffer sb = new StringBuffer();
Matcher m = clean.matcher( json );
while (m.find()) {
m.appendReplacement(sb, "" );
// Don't put m.group(1) in the appendReplacement because if it happens to contain $1 or $2 you'll get an error.
if ( m.group(1) != null )
sb.append( m.group(1) );
}
m.appendTail(sb);
String cleanJson = sb.toString();
This is totally off the top of my head but I'm pretty sure it's close to what you want.
Edit: I've just got access to a Java IDE and tried out my solution. I had made a couple of mistakes with my code including using \. instead of . in the Pattern. So I have fixed that up and run it on a variation of your sample:
db.insert( {
_id:3,
cost:{_0:11},
description:"This is a \"description\" with an embedded newline: \"\n\".\nCool, isn\'t it?"
});
The code:
String json = "db.insert( {\n" +
" _id:3,\n" +
" cost:{_0:11},\n" +
" description:\"This is a \\\"description\\\" with an embedded newline: \\\"\\n\\\".\\nCool, isn\\'t it?\"\n" +
"});";
// insert above code
System.out.println(cleanJson);
This produces:
db.insert({_id:3,cost:{_0:11},description:"This is a \"description\" with an embedded newline: \"\n\".\nCool, isn\'t it?"});
which is the same json expression with all whitespace removed outside quoted strings and whitespace and newlines retained inside quoted strings.
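The question also mentions strings delimited by '', which the pattern above does not cover. A possible extension, sketched here under the assumption that single-quoted strings use the same backslash escaping as double-quoted ones, adds a second quoted-string alternative to the capture group. The class name JsonWhitespace is hypothetical, and the function form of Matcher#replaceAll needs Java 9+.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsonWhitespace {
    // Either a run of whitespace, or (group 1) a complete "..." or '...' string.
    // Assumption: both string kinds use backslash as the escape character.
    private static final Pattern CLEAN = Pattern.compile(
        "\\s+|(\"(?:[^\"\\\\]|\\\\.)*\"|'(?:[^'\\\\]|\\\\.)*')",
        Pattern.DOTALL);

    // Drops whitespace outside strings and copies quoted strings unchanged.
    static String strip(String src) {
        return CLEAN.matcher(src).replaceAll(m ->
            m.group(1) == null ? "" : Matcher.quoteReplacement(m.group(1)));
    }

    public static void main(String[] args) {
        System.out.println(strip("db.insert( {\n  _id:3,\n  note:'two  spaces'\n});"));
    }
}
```

Because a string is always matched from its opening quote, an apostrophe inside a double-quoted string (as in isn\'t) does not start a single-quoted string.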
I'm trying to use regex to split a string into fields, but unfortunately it's not working 100% and is skipping some parts which should be split. Here is the part of the program that processes the string:
void parser(String s) {
String REG1 = "(',\\d)|(',')|(\\d,')|(\\d,\\d)";
Pattern p1 = Pattern.compile(REG1);
Matcher m1 = p1.matcher(s);
while (m1.find() ) {
System.out.println(counter + ": "+s.substring(end, m1.end()-1)+" "+end+ " "+m1.end());
end =m1.end();
counter++;
}
}
The string is:
s= 3101,'12HQ18U0109','11YX27X0041','XX21','SHV7-P Hig, Hig','','GW1','MON','E','A','ASEXPORT-1',1,101,0,'0','1500','V','','',0,'mb-master1'
and the problem is that it doesn't split ,1, or ,0,
Rules for parsing are: String is enclosed by ,' ', for example ,'ASEXPORT-1',
int is enclosed only by , ,
expected output =
3101 | 12HQ18U0109 | 11YX27X0041 | XX21 | SHV7-P Hig, Hig| |GW1 |MON |E | A| ASEXPORT-1| 1 |101 |0 | 0 |1500 | V| | | 0 |mb-master1
Altogether 21 elements.
You can split it with this regex
,(?=([^']*'[^']*')*[^']*$)
It splits at a , only if there is an even number of ' characters ahead of it.
So for
3101,'12HQ18,U0109','11YX27X0041'
output would be
3101
'12HQ18,U0109'
'11YX27X0041'
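In Java this could be used with String#split roughly as follows. The sketch is illustrative only: the class name QuoteAwareSplit is made up, and the inner group is written as non-capturing, which does not change what is matched.

```java
import java.util.Arrays;

class QuoteAwareSplit {
    // Splits at commas that are followed by an even number of single quotes,
    // i.e. commas that are not inside a quoted field.
    static String[] splitFields(String line) {
        return line.split(",(?=(?:[^']*'[^']*')*[^']*$)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
            splitFields("3101,'12HQ18,U0109','11YX27X0041'")));
    }
}
```

The lookahead rescans the rest of the line for every comma, so this is best suited to short records rather than very long lines.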
Note: this won't work for nested strings like 'hello 'h,i'world'. If there are any such cases, you should use the following regex instead:
(?<='),(?=')|(?<=\d),(?=\d|')|(?<=\d|'),(?=\d)
If you also (for some bizarre reason) need to know each match's start and end index in the original string (like you have in your sample output), you can use the following pattern:
String regex = "('[^']*'|\\d+)";
which would match an unquoted integer or a single-quoted string.
You can optionally remove the leading and trailing ' using a "second-pass" on the matching substring:
match = match.replaceAll("\\A'|'\\Z", "");
which replaces a leading and trailing ' with nothing.
The code could look like this:
Pattern pat = Pattern.compile("('[^']*'|\\d+)");
Matcher m = pat.matcher(str);
int counter = 0, start = 0;
while (m.find()) {
String match = m.group(1);
int end = start + match.length();
match = match.replaceAll("\\A'|'\\Z", ""); // <-- comment out for NOT replacing
// leading and trailing quotes
System.out.format("%d: %s [%d - %d]%n", ++counter, match, start, end);
start = end + 1; // <-- the "+1" is to account for the ',' separator
}
See, also, this short demo.
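As a variation, the indices can also be taken straight from the Matcher via start() and end() instead of being accumulated from the match lengths, which avoids relying on exactly one separator character between fields. The following is only a sketch; the class name FieldScanner and the output format are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldScanner {
    // A single-quoted string or an unquoted integer.
    private static final Pattern FIELD = Pattern.compile("'[^']*'|\\d+");

    // Returns each field (quotes stripped) with its start/end index in the input.
    static List<String> fields(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(s);
        while (m.find()) {
            String match = m.group();
            if (match.startsWith("'")) {
                // Drop the surrounding quotes of a quoted field.
                match = match.substring(1, match.length() - 1);
            }
            out.add(match + " [" + m.start() + " - " + m.end() + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        fields("3101,'12HQ18,U0109','11YX27X0041'").forEach(System.out::println);
    }
}
```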