Java regex doesn't work if the text contains a \n character


I have a project that needs to detect whether an editor has written HTML entities, but when the text contains \n the regex doesn't work. Why?
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {
    public static void main(String[] args) {
        String text = "asdasdas <h1>Test</h1></div>";
        String regex = ".*<[^&lt]+>.*";
        Pattern pattern = Pattern.compile(regex);
        Matcher m = pattern.matcher(text);
        System.out.println(m.matches());
    }
}

If you want to take \n into consideration, you can do this:
Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
With Pattern.DOTALL, the . metacharacter also matches line terminators such as \n. By default it does not, which is why .* stops at the line break and matches() returns false.
Pattern.MULTILINE is a different flag: it makes ^ and $ match at the start and end of each line, rather than only at the start and end of the whole input. It does not change what . matches.
The Oracle documentation for java.util.regex.Pattern describes both flags and may help you understand them better, rather than just applying the code. The More You Know... :)
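To see the difference between the flags, here is a minimal sketch. The input string (the question's text with a \n added) and the simplified pattern are my own assumptions, not the asker's exact data.

```java
import java.util.regex.Pattern;

class DotallDemo {
    // Assumed input: the question's text with a line break inserted.
    static final String TEXT = "asdasdas\n<h1>Test</h1></div>";
    // Simplified stand-in for the question's pattern: any tag anywhere in the text.
    static final String REGEX = ".*<[^<>]+>.*";

    // Compiles REGEX with the given flags and tests it against the whole input.
    static boolean matchesWith(int flags) {
        return Pattern.compile(REGEX, flags).matcher(TEXT).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWith(0));                 // false: . stops at \n
        System.out.println(matchesWith(Pattern.DOTALL));    // true: . also matches \n
        System.out.println(matchesWith(Pattern.MULTILINE)); // false: only ^ and $ change
    }
}
```

So for this kind of check DOTALL is the flag that matters. MULTILINE only helps if the pattern uses ^ or $.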

Related

Regex ignore tokens that do not start with letter

How can I write a regex that ignores any token that does not start with a letter? It will be used in Java.
Example: it 's super cool --> the regex should match [it, super, cool] and ignore ['s].

Alternative regex:
"(?:^|\\s)([A-Za-z]+)"
Regex in context:
public static void main(String[] args) {
    String input = "it 's super cool";
    Matcher matcher = Pattern.compile("(?:^|\\s)([A-Za-z]+)").matcher(input);
    while (matcher.find()) {
        String result = matcher.group(1);
        System.out.println(result);
    }
}
Output:
it
super
cool
Note: To match alphabetic characters, letters, in any language (e.g. Hindi, German, Chinese, English etc.), use the following regex instead:
"(?:^|\\s)(\\p{L}+)"
More about the Pattern class and its classes for Unicode scripts, blocks, categories and binary properties can be found in the java.util.regex.Pattern documentation.

You can use (?<!\\p{Punct})(\\p{L}+) which means letters not preceded by a punctuation mark. Note that (?<! is used to specify a negative look behind. Check the documentation of Pattern to learn more about it.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "it 's super cool";
        Pattern pattern = Pattern.compile("(?<!\\p{Punct})(\\p{L}+)");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}
Output:
it
super
cool
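One caveat with the lookbehind version: it only looks at the single character before the start of the match. For a longer token such as 'tis, the engine rejects the t but then matches "is", because the i is preceded by a letter rather than by punctuation. A possible fix, sketched below, is to also forbid a preceding letter. The input string and the helper name findAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterTokens {
    // Hypothetical helper: returns every match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "it 'tis super cool"; // assumed input, not from the question

        // Lookbehind from the answer above: leaks the tail of 'tis.
        System.out.println(findAll("(?<!\\p{Punct})(\\p{L}+)", input));
        // prints [it, is, super, cool]

        // Also reject a preceding letter, so a match can only start a token.
        System.out.println(findAll("(?<![\\p{Punct}\\p{L}])\\p{L}+", input));
        // prints [it, super, cool]
    }
}
```

For the original input "it 's super cool" both patterns give the same result, since 's has only one letter.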

Java replaceAll but the specified regex

Can't get my head around this for quite some time already. I have this piece of code:
getStringFromDom(doc).replaceAll("contract=\"\\d*\"|name=\"\\p{L}*\"", "");
Basically I need it to work literally the opposite way - to replace everything BUT the specified regex. I've been trying to do it with the negative lookahead to no avail.

For your particular task, I think
getStringFromDom(doc).replaceAll(".*?(contract=\"\\d*\"|name=\"\\p{L}*\").*", "$1");
should do what you need.

You want to remove everything that does not match the pattern. This is the same as simply keeping the pattern matches. Use the regex to find the matches for that pattern, then collect them in a StringBuilder.
Matcher m = Pattern.compile(your pattern).matcher(your input);
StringBuilder sb = new StringBuilder();
while (m.find()) sb.append (m.group()).append('\n');
String result = sb.toString();
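Filled in with the pattern from the question, that snippet could look roughly like this. The sample XML string and the method name keepMatches are assumptions for the sake of a runnable example; in the real code the input would come from getStringFromDom(doc).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepOnlyMatches {
    // Hypothetical helper: returns every match of regex in input, one per line.
    static String keepMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample input standing in for getStringFromDom(doc).
        String xml = "<doc contract=\"123\" id=\"7\" name=\"foo\"/>";
        String regex = "contract=\"\\d*\"|name=\"\\p{L}*\"";
        System.out.print(keepMatches(regex, xml));
        // prints:
        // contract="123"
        // name="foo"
    }
}
```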

I also think that removing what you are not looking for is a double negative. Concentrate on what you are looking for and use pattern matching for that. This example searches your document for any name attributes:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String input = "<AnotherDoc accNum=\"1111\" docDate=\"2017-09-26\" docNum=\"2222\" name=\"foo\"> <anotherTag>some date</anotherTag>";
        Pattern pattern = Pattern.compile("name=\"[^\\\"]*\""); // the value is any run of characters other than "
        Matcher matcher = pattern.matcher(input);
        while (matcher.find())
            System.out.println(matcher.group());
    }
}
This prints:
name="foo"

Regex in Java to match a string beginning with HTML followed by something

Can someone help me out with a regex to match a string that starts as follows?
The string can begin with any HTML tag, e.g.
< span > or < p > etc., so basically I want a regex to check whether a string begins with any opening HTML tag <> followed by [apple videoID=
Eg:
<span>[apple videoID=
Here's what I've tried :
static String pattern = "^<[^>]+>[apple videoID=";
static Pattern pattern1 = Pattern.compile(pattern);
What is wrong in the above?

You have a typo in the following line.
static String pattern = "^<[^>]+>[apple videoID=";
This string is not a valid regular expression because you have an unclosed [ right before the word apple, hence the "Unclosed character class" PatternSyntaxException. You either meant to type
static String pattern = "^<[^>]+><apple videoID=";
assuming that apple is an html tag, or
static String pattern = "^<[^>]+>\\[apple videoID=";
if you really did want the [ in front of apple. This is because [ is a special character in regular expressions and must be escaped with a \ which is a special character in Java strings and must be escaped with a \. Therefore \\[.
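If you would rather not escape by hand, Pattern.quote can wrap the literal part for you. Here is a small sketch comparing both spellings; the sample strings and the helper name startsWithTagThenApple are assumptions for illustration.

```java
import java.util.regex.Pattern;

class LiteralBracket {
    // Hand-escaped version: \\[ in the Java string becomes \[ in the regex.
    static final Pattern ESCAPED =
            Pattern.compile("^<[^>]+>\\[apple videoID=");

    // Same pattern, with the literal text quoted by Pattern.quote.
    static final Pattern QUOTED =
            Pattern.compile("^<[^>]+>" + Pattern.quote("[apple videoID="));

    // Hypothetical helper: true if s begins with an opening tag followed by [apple videoID=
    static boolean startsWithTagThenApple(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String good = "<span>[apple videoID=42]"; // assumed sample input
        String bad = "[apple videoID=42]";        // no leading tag

        System.out.println(startsWithTagThenApple(ESCAPED, good)); // true
        System.out.println(startsWithTagThenApple(QUOTED, good));  // true
        System.out.println(startsWithTagThenApple(ESCAPED, bad));  // false
    }
}
```

find() is used here because the pattern already anchors itself with ^; matches() would additionally require the pattern to cover the whole string.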

Simple as this (note that . must not be wrapped in [ ], because [.] matches only a literal dot):
<.+><apple videoID=.*

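Try this pattern:
"^<[A-Za-z]+>\\[apple videoID=$"
This pattern matches a string such as <span>[apple videoID= and, because of the trailing $, only if nothing follows videoID=. Drop the $ if more text can follow.
Hope this helps!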

Here is a solution.
Pattern.CASE_INSENSITIVE makes the match case-insensitive, so the pattern is found in either upper or lower case.
Tested and executed.
package sireesh.yarlagadda;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Named PatternTest so the class does not clash with java.util.regex.Pattern.
public class PatternTest {
    public static void main(String[] args) {
        String text = "<span><apple videoID=";
        String patternString = "<[a-zA-Z]*>\\<apple videoID=";
        Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        System.out.println("lookingAt = " + matcher.lookingAt());
        System.out.println("matches = " + matcher.matches());
    }
}

How to write a regular expressions that extracts tabbed pieces of text?

I have been trying to create a program to replace tabs with spaces (assuming a tab is equivalent to 8 spaces, one or more of which are taken by non-whitespace characters such as letters).
I start by reading the text of a file with a Scanner:
try {
    reader = new FileReader(file);
} catch (IOException io) {
    println("File not found");
}
Scanner scanner = new Scanner(reader);
scanner.useDelimiter("\\Z");
String text = scanner.next();
And then I try parsing through pieces of text that end with a tab with ptrn1 below, and extract the length of the last word of each piece with ptrn2:
Pattern ptrn1 = Pattern.compile(".*\\t", Pattern.DOTALL);
Matcher matcher1 = ptrn1.matcher(text);
String nextPiece = matcher1.group();
println(matcher1.group()); /* gives me the first substring ending with tab*/
however:
Pattern ptrn2 = Pattern.compile("\\s.*\\t"); /*supposed to capture the last word in the string*/
Matcher matcher2 = ptrn2.matcher(nextPiece);
String lastword = matcher2.group();
The last line gives me an error, since apparently nothing matches the pattern ("\\s.*\\t"). There is something wrong with this last regular expression, which is intended to say "any number of spaces, followed by any number of characters, followed by a tab". I have not been able to find out what is wrong with it, though. I have tried ("\\s*.+\\t"), ("\\s*.*\\t"), and ("\\s+.+\\t"); still no luck.
Later on, per recommendations below, I simplified the code and included the sample string in it. As follows:
import acm.program.*;
import acm.util.*;
import java.util.*;
import java.io.*;
import java.util.regex.*;

public class Untabify extends ConsoleProgram {
    public void run() {
        String s = "Be plain,\tgood son,\tand homely\tin thy drift.\tRiddling\tconfession\tfinds but riddling\tshrift. ";
        Pattern ptrn1 = Pattern.compile(".*?\t", Pattern.DOTALL);
        Pattern ptrn2 = Pattern.compile("[^\\s+]\t", Pattern.DOTALL);
        String nextPiece;
        Matcher matcher1 = ptrn1.matcher(s);
        while (matcher1.find()) {
            nextPiece = matcher1.group();
            println(nextPiece);
            Matcher matcher2 = ptrn2.matcher(nextPiece);
            println(matcher2.group());
        }
    }
}
The program crashes in varying places: first at "println(matcher2.group())", and on the next run at "public void run()" with the message "Debug Current Instruction Pointer" (what does that mean?).

It would be useful to see a sample string. If you just want the last word before the tab, then you can use this:
([^\s]+)\t
Note the () are to put the last word in a group. [^\s]+ means 1 or more non-space.

You do not need to double-escape the tab character (i.e. \\t); \t will do fine. \t is interpreted as a tab character by the Java string parser, and that tab character is passed to the regex parser, which treats it as a literal tab.
Also, you should use Pattern.DOTALL, not Pattern.Dotall.

The pattern "\\s.*\\t" must match a single whitespace character (\s) followed by 0 or more characters (.*) followed by a single tab (\t). If you want to capture the last word and a trailing tab you should use the word boundary escape \b
Pattern.compile("\\b.*\\b\t");
You could replace the . above to use \w or whatever your definition of a word character is if you don't want to match any character.
Here's the code you'd use to match any word immediately before a tab:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx {
    public static void main(String args[]) {
        String text = "ab cd\t ef gh\t ij";
        Pattern pattern = Pattern.compile("\\b(\\w+)\\b\t", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}
The above will output
cd
gh
See the Regular Expression Tutorial, especially the sections on Predefined Character Classes and Boundary Matchers for more information.
You can get more detail and experiment with this regular expression on Regex101.
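One more thing worth checking in the question's code: Matcher.group() can only be called after a successful find(), matches() or lookingAt(). Calling it on a fresh Matcher throws IllegalStateException, whatever the pattern is, which would explain the failure at matcher2.group(). Below is a minimal sketch of the "last word before each tab" step with that fixed. The class and method names are made up, and the pattern (\S+)\t is one possible choice, not the only one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWordBeforeTab {
    // Hypothetical helper: collects the run of non-whitespace right before each tab.
    static List<String> lastWords(String s) {
        List<String> words = new ArrayList<>();
        Matcher m = Pattern.compile("(\\S+)\t").matcher(s);
        while (m.find()) {           // find() must succeed before group() is used
            words.add(m.group(1));
        }
        return words;
    }

    // Shows that group() on a matcher that has not matched yet throws.
    static boolean groupWithoutFindThrows() {
        Matcher m = Pattern.compile("\\S+\t").matcher("a\tb");
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Shortened version of the sample string from the question.
        String s = "Be plain,\tgood son,\tand homely\tin thy drift.\t";
        for (String word : lastWords(s)) {
            System.out.println(word + " (length " + word.length() + ")");
        }
        System.out.println(groupWithoutFindThrows()); // true
    }
}
```

With the length of the last word in hand, the number of spaces needed to reach the next tab stop can be worked out separately.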

Java RegExp can't get the result after evaluating pattern

Hi, I have been trying to learn regular expressions using Java. I am still at the beginning, and I wanted to start with a little program that is given a string and outputs it split into syllables. This is what I have so far:
String mama = "mama";
Pattern vcv = Pattern.compile("([aeiou][bcdfghjklmnpqrstvwxyz][aeiou])");
Matcher matcher = vcv.matcher(mama);
if(matcher){
// the result of this should be ma - ma
}
What I am trying to do is create a pattern that checks the letters of the given word and, if it finds a vowel/consonant/vowel sequence, adds a "-" like this: v-cv. How can I achieve this?

In the following example I matched the first vowel and used a positive lookahead for the next consonant-vowel group. The lookahead does not consume the second vowel, so the pattern can split again if there is a vcvcv group.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        new Test().run();
    }

    private void run() {
        String mama = "mama";
        Pattern vcv =
            Pattern.compile("([aeiou])(?=[bcdfghjklmnpqrstvwxyz][aeiou])");
        Matcher matcher = vcv.matcher(mama);
        System.out.println(matcher.replaceAll("$1-"));
        String mamama = "mamama";
        matcher = vcv.matcher(mamama);
        System.out.println(matcher.replaceAll("$1-"));
    }
}
Output:
ma-ma
ma-ma-ma

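Try
mama.replaceAll("([aeiou])([....][aeiou])", "$1-$2");
Here [....] stands for the consonant class from your pattern. String.replaceAll takes a regular expression, and in Java the replacement string refers to groups as $1 and $2, not \1 and \2.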

If you test it with matcher.matches() or matcher.lookingAt(), your pattern only matches if the String starts with a vowel. If you want to find a substring, ignoring the beginning, use
Pattern vcv = Pattern.compile (".*([aeiou][bcdfghjklmnpqrstvwxyz][aeiou])");
If you like to ignore the end too:
Pattern vcv = Pattern.compile (".*([aeiou][bcdfghjklmnpqrstvwxyz][aeiou]).*");
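The difference between the three Matcher methods is easy to check on "mama". This is just a sketch, and the helper names wholeInput, atStart and firstFound are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatches {
    // The vowel/consonant/vowel pattern from the question.
    static final String VCV = "[aeiou][bcdfghjklmnpqrstvwxyz][aeiou]";

    // matches(): the pattern must cover the entire input.
    static boolean wholeInput(String regex, String s) {
        return Pattern.compile(regex).matcher(s).matches();
    }

    // lookingAt(): the pattern must match at the start, but may stop early.
    static boolean atStart(String regex, String s) {
        return Pattern.compile(regex).matcher(s).lookingAt();
    }

    // find(): the pattern may match anywhere; returns the first match or null.
    static String firstFound(String regex, String s) {
        Matcher m = Pattern.compile(regex).matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(wholeInput(VCV, "mama"));               // false
        System.out.println(atStart(VCV, "mama"));                  // false
        System.out.println(firstFound(VCV, "mama"));               // ama
        System.out.println(wholeInput(".*" + VCV + ".*", "mama")); // true
    }
}
```

So with find() the original pattern already locates the substring, and the leading and trailing .* are only needed for matches().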
